TL;DR: We’re still working to add API access and MCP integration to QuantJourney, so agents, IDEs, and services will soon be able to fetch canonical market data with first-class semantics. At the core are contracts-first Pydantic v2 models for all inputs and outputs, backed by deterministic per-vendor adapters. This enables new ways to run research, execute backtests, and validate investment ideas. The API is a near-term milestone - but the strategic goal is bigger: an MDM-grade canonical gateway and data lake serving everything from research to trading-desk support, with strict data contracts, point-in-time truth, and operational reliability.
This post explains what we’re building, why it matters to funds, analysts, and serious investors, and how the pieces fit together. Stay tuned - we expect to share more code soon.
The Problem We’re Solving
Modern teams consume data from multiple custodians, brokers, banks, and market sources. Each feed uses its own schema, timestamp conventions, currencies, units, and revision logic. Without a canonical layer you get:
Fragile mapping code scattered across services.
Slow vendor swaps and costly maintenance.
Silent data drift (units, scaling, FX) that infects analytics.
Inability to reproduce what was known at time T.
This is a common issue with LLM-driven workflows today, which are prone to look-ahead bias and deeper data inconsistencies. Even for simple strategies, we often need to pull data from multiple sources to support our investment thesis.
The QuantJourney Approach
1) Canonical Schemas (Contracts First)
We define versioned canonical schemas (semver + $id) for core domains - positions, transactions, balances, prices, valuations, security master, fundamentals, calendars, and reference data. These schemas are the contract every internal app and external consumer can rely on.
Pydantic v2 is our contract engine - on both input params and output returns:
Input: We validate raw payloads into strongly typed internal models (with aliases for vendor field names) before normalization.
Output: We emit canonical Pydantic models - typed, documented, and versioned - so client code gets stable structures with clear upgrade paths.
We use Annotated types and validators to enforce semantics:
Money[CUR], Pct, bps, Date, TZ, enums (asset class, instrument type, period).
Hard guarantees on required fields and units/scale invariants.
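A minimal sketch of how such annotated types can be expressed in Pydantic v2 - the definitions below are illustrative, not our production types, and we approximate Money[CUR] with an amount-plus-currency model:

```python
from datetime import date
from enum import Enum
from typing import Annotated
from pydantic import BaseModel, Field

# Illustrative canonical type aliases; production definitions may differ.
Pct = Annotated[float, Field(ge=-1.0, le=1.0)]  # fraction, not percentage points
Bps = Annotated[float, Field(description="basis points; 1 bps = 0.0001")]

class AssetClass(str, Enum):
    EQUITY = "equity"
    FIXED_INCOME = "fixed_income"
    FX = "fx"

class Money(BaseModel):
    """Amount plus explicit currency - no bare floats for money."""
    amount: float
    currency: Annotated[str, Field(pattern=r"^[A-Z]{3}$")]  # ISO 4217

class PositionSlice(BaseModel):
    instrument_id: str
    asset_class: AssetClass
    as_of: date
    market_value: Money  # required, unit-safe
    weight: Pct          # hard range invariant enforced at validation time
```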
2) Adapters & Transforms (Deterministic Normalization)
Each feed is onboarded with a small adapter spec (our DSL) that expresses:
rename, cast, default, scale, derive, fx_normalize, pit_select, map_enum.
Field-level precedence and merge rules when multiple sources provide the same attribute.
Adapters compile to pure Python callables at service startup (no per-request YAML parsing). They run fast, are unit-tested with golden fixtures, and produce identical canonical outputs irrespective of source.
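A simplified sketch of the compile-once idea - the real DSL supports more operations, and the spec and field names below are invented for illustration:

```python
from typing import Any, Callable

# A tiny adapter spec: ops applied in order to each raw record.
SPEC = [
    ("rename", {"px_last": "close"}),
    ("cast", {"close": float}),
    ("scale", {"close": 0.01}),     # hypothetical vendor quotes in cents
    ("default", {"currency": "USD"}),
]

def compile_adapter(spec: list[tuple[str, dict]]) -> Callable[[dict], dict]:
    """Build a plain callable once at startup - no per-request YAML parsing."""
    def run(raw: dict[str, Any]) -> dict[str, Any]:
        rec = dict(raw)
        for op, args in spec:
            if op == "rename":
                for src, dst in args.items():
                    if src in rec:
                        rec[dst] = rec.pop(src)
            elif op == "cast":
                for field, typ in args.items():
                    rec[field] = typ(rec[field])
            elif op == "scale":
                for field, factor in args.items():
                    rec[field] = rec[field] * factor
            elif op == "default":
                for field, value in args.items():
                    rec.setdefault(field, value)
        return rec
    return run

adapter = compile_adapter(SPEC)
print(adapter({"px_last": "12345"}))  # {'close': 123.45, 'currency': 'USD'}
```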
3) Point-in-Time (PIT) Correctness
We support time-travel queries and restatements:
Keys include as_of, effective_from/to, revision_id, source_timestamp.
You can reproduce quarter-end views, audits, and backtests exactly as they were known.
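A toy sketch of point-in-time selection using the keys above - the restatement example is invented:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Revision:
    value: float
    effective_from: date
    effective_to: date | None  # None = still current
    revision_id: int

def as_of_view(revisions: list[Revision], as_of: date) -> Revision | None:
    """Return the revision that was known at `as_of`, i.e.
    effective_from <= as_of < effective_to (latest revision_id wins)."""
    candidates = [
        r for r in revisions
        if r.effective_from <= as_of
        and (r.effective_to is None or as_of < r.effective_to)
    ]
    return max(candidates, key=lambda r: r.revision_id, default=None)

# Q4 revenue restated in March; a 2024-12-31 query still sees the original figure.
history = [
    Revision(1_000_000.0, date(2024, 12, 31), date(2025, 3, 15), 1),
    Revision(1_050_000.0, date(2025, 3, 15), None, 2),
]
print(as_of_view(history, date(2024, 12, 31)).value)  # 1000000.0
```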
4) FX & Units/Scale Normalization
Preserve native currency and expose normalized value_{USD/EUR/...} using dated FX at a declared policy (e.g., EOD, custody FX, WM/Ref style).
Normalize units and scales (e.g., revenue in units, not thousands; or carry scale explicitly).
No more guessing whether a column is $000s or $MM.
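A small sketch of the normalization step - the FX table, rate, and policy name here are illustrative:

```python
from datetime import date

# Hypothetical dated FX table under an "EOD" policy (base -> quote rates).
EOD_FX = {(date(2024, 12, 31), "EUR", "USD"): 1.0389}

def normalize_value(amount: float, currency: str, scale: int,
                    target_ccy: str, fx_date: date) -> float:
    """Undo the vendor's scale (scale=3 means the feed reports thousands),
    then convert to the target currency with dated FX."""
    units = amount * (10 ** scale)
    if currency == target_ccy:
        return units
    rate = EOD_FX[(fx_date, currency, target_ccy)]
    return units * rate

# A feed reporting EUR revenue in thousands:
print(normalize_value(2_500, "EUR", 3, "USD", date(2024, 12, 31)))  # 2597250.0
```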
5) Entity & Identifier Resolution
Stable instrument_id backed by global identifiers where available (ISIN/CUSIP/FIGI equivalents), plus legal-entity and account hierarchies. Ticker string changes and symbol churn don’t break downstream apps.
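A sketch of precedence-based resolution - the identifier values and master table below are made up:

```python
# Strong global identifiers first, ticker strings last (they churn).
ID_PRECEDENCE = ("figi", "isin", "cusip", "ticker")

# Hypothetical security-master index keyed by (id_type, id_value).
SECURITY_MASTER = {
    ("isin", "US0378331005"): "inst_000042",
    ("ticker", "AAPL"): "inst_000042",
}

def resolve_instrument(vendor_ids: dict[str, str]) -> str | None:
    for id_type in ID_PRECEDENCE:
        value = vendor_ids.get(id_type)
        if value and (id_type, value) in SECURITY_MASTER:
            return SECURITY_MASTER[(id_type, value)]
    return None  # unresolved: route to a manual-mapping queue

print(resolve_instrument({"ticker": "AAPL", "isin": "US0378331005"}))  # inst_000042
```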
6) API + MCP
API: Fast, documented REST endpoints (versioned /v1/...), orjson serialization, pagination, and batch routes.
MCP: We expose QuantJourney as MCP “tools/resources,” so agents, IDEs, and orchestrators can request canonical data and metadata with first-class semantics (great for research copilots and automation).
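On the MCP side, a minimal sketch of how a canonical tool could be exposed, assuming the official MCP Python SDK (the mcp package) - the tool name and payload are illustrative:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("quantjourney")

@mcp.tool()
def get_price(instrument_id: str, as_of: str) -> dict:
    """Canonical EOD price for an instrument as known at `as_of` (YYYY-MM-DD)."""
    # In production this would call the same canonical service the REST API
    # uses; the static payload below is a placeholder.
    return {"instrument_id": instrument_id, "as_of": as_of,
            "close": 123.45, "currency": "USD", "schema_version": "1.0.0"}

if __name__ == "__main__":
    mcp.run()  # serve over stdio so agents/IDEs can attach
```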
Why Pydantic v2 Matters
Pydantic v2 is a high-performance data validation and parsing library for Python that lets us define structured, type-safe models for every API input and output. It ensures that all incoming data conforms to strict schemas, automatically handling type conversions, validation rules, and default values. Version 2 introduces a new parsing engine (pydantic-core) that’s 5–50× faster than v1, with lower memory usage and better error reporting. This speed matters in a trading context - we can validate thousands of messages per second without becoming the bottleneck. By basing our canonical contracts on Pydantic v2, we guarantee both semantic correctness and performance at scale.
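A tiny example of the behavior described above - coercion, defaults, and precise field-level errors:

```python
from pydantic import BaseModel, ValidationError

class Tick(BaseModel):
    symbol: str
    price: float   # "101.5" from a feed is coerced to 101.5
    size: int = 0  # default applied when the field is missing

print(Tick.model_validate({"symbol": "AAPL", "price": "101.5"}))
# symbol='AAPL' price=101.5 size=0

try:
    Tick.model_validate({"symbol": "AAPL", "price": "n/a"})
except ValidationError as e:
    print(e.errors()[0]["type"])  # 'float_parsing' - precise, field-level errors
```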
Data Lake & Analytics Alignment
We align to a medallion flow - Bronze (raw) → Silver (normalized per-vendor) → Gold (canonical):
Storage in columnar formats (Arrow/Parquet) with schemas kept in lockstep with the Pydantic models.
Polars/Arrow for fast local analytics; a time-series DB for query patterns that need it.
Event-sourced deltas to support PIT replays and restatements.
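A minimal sketch of the Gold step - dumping canonical Pydantic models into Parquet via Polars (the model and path are illustrative):

```python
from datetime import date
import polars as pl
from pydantic import BaseModel

class GoldPrice(BaseModel):  # illustrative canonical (Gold) record
    instrument_id: str
    as_of: date
    close: float
    currency: str

records = [GoldPrice(instrument_id="inst_000042", as_of=date(2024, 12, 31),
                     close=123.45, currency="USD")]

# model_dump() keeps the Parquet schema in lockstep with the contract;
# as_of is inferred as a Date column, close as Float64.
df = pl.DataFrame([r.model_dump() for r in records])
df.write_parquet("prices_gold.parquet")
```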
Example of YAML→Pydantic→FastAPI
Adapter snippet (YAML-ish):
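A sketch of what such a spec could look like - the vendor, field names, and policies are hypothetical:

```yaml
# Illustrative adapter spec for a hypothetical vendor feed ("acme").
adapter: acme_prices_v1
source: acme
domain: prices
transforms:
  - rename: {PX_LAST: close, CRNCY: currency}
  - cast: {close: float}
  - scale: {close: 0.01}              # vendor quotes prices in cents
  - default: {currency: USD}
  - derive: {mid: "(bid + ask) / 2"}  # simple derived field
  - map_enum: {asset_class: {EQ: equity, FI: fixed_income}}
  - fx_normalize: {field: close, policy: eod, target: USD}
  - pit_select: {keys: [instrument_id, as_of], prefer: latest_revision}
precedence:
  close: [acme, backup_vendor]        # field-level source preference
```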
Pydantic v2 (generated) - Canonical Output:
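A hand-written approximation of what a generated canonical model might look like (field set and defaults are illustrative):

```python
from datetime import date
from pydantic import BaseModel, ConfigDict, Field

class CanonicalPrice(BaseModel):
    """Canonical EOD price record - illustrative stand-in for a generated, versioned contract."""
    model_config = ConfigDict(frozen=True)

    schema_version: str = Field(default="1.0.0", description="semver of this contract")
    instrument_id: str                            # stable id from entity resolution
    as_of: date
    close: float = Field(ge=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")  # native currency, ISO 4217
    close_usd: float | None = None                # dated-FX normalized value
    scale: int = 0                                # explicit scale; 0 = raw units
    revision_id: int = 1                          # supports PIT restatements
```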
FastAPI (excerpt):
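A minimal endpoint sketch reusing the CanonicalPrice model above, with an in-memory stand-in for the canonical store:

```python
from datetime import date
from fastapi import FastAPI, HTTPException
from fastapi.responses import ORJSONResponse

app = FastAPI(default_response_class=ORJSONResponse)  # orjson serialization

# Tiny in-memory stand-in for the canonical store (illustrative only);
# CanonicalPrice is the model from the previous snippet.
STORE = {
    ("inst_000042", date(2024, 12, 31)): CanonicalPrice(
        instrument_id="inst_000042", as_of=date(2024, 12, 31),
        close=123.45, currency="USD",
    ),
}

@app.get("/v1/prices/{instrument_id}", response_model=CanonicalPrice)
def get_price(instrument_id: str, as_of: date) -> CanonicalPrice:
    record = STORE.get((instrument_id, as_of))
    if record is None:
        raise HTTPException(status_code=404, detail="unknown instrument or as_of")
    return record
```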
Where This Runs in Your Stack
As a gateway beneath your reporting, risk, OMS/EMS, or analytics - feeding them clean, stable contracts.
As a research plane for quants/analysts - consistent data for factor research, backtests, and dashboards.
As a trading-desk support service - exposures, greeks/risks, and scenario inputs pull from the same canonical truth.
Commercial Focus: B2B / B2B2B
This is infrastructure. Teams need SLAs, change management, support, and predictable contracts. Our focus is B2B / B2B2B - powering internal teams and platforms that, in turn, serve end clients. The economics and reliability requirements make this the only sensible path.
Therefore we spend significant time adding B2B features such as:
Source Router & Failover Policies
Per domain we define a preference order, health checks, and fallback chains.
Field-level quorum/precedence: if two sources disagree, we apply a rule (priority, recency, confidence score).
Circuit breakers, exponential backoff, and jittered retries to ride out partial outages.
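A condensed sketch of the failover loop - circuit breakers and field-level quorum rules are omitted for brevity, and all names are illustrative:

```python
import random
import time
from typing import Callable

def fetch_with_failover(sources: list[tuple[str, Callable[[], dict]]],
                        retries: int = 3, base_delay: float = 0.2) -> dict:
    """Try sources in preference order; retry each with exponential backoff + jitter."""
    for name, fetch in sources:
        for attempt in range(retries):
            try:
                return {"source": name, **fetch()}
            except Exception:
                # jittered exponential backoff before the next attempt
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        # this source exhausted its retry budget - fall through to the next one
    raise RuntimeError("all sources failed")
```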
Guaranteed Performance
Pre-compiled adapters; vectorized transforms where appropriate.
Pydantic v2 validation + orjson responses.
Caching with invalidation keyed by as_of, FX policy, and entity identifiers.
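The key idea in one line: anything that changes the answer belongs in the cache key, so invalidation is exact (illustrative sketch):

```python
def cache_key(domain: str, instrument_id: str, as_of: str, fx_policy: str) -> str:
    """as_of, FX policy, and entity id all participate in the key."""
    return f"{domain}:{instrument_id}:{as_of}:fx={fx_policy}"

print(cache_key("prices", "inst_000042", "2024-12-31", "eod"))
# prices:inst_000042:2024-12-31:fx=eod
```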
Observability with multiple KPIs
Coverage metrics (presence, null-rates), type-conflict counters.
Ingest/transform/serve latencies and error budgets.
Structured logs and tracing for lineage and auditability.
QuantJourney MDM Roadmap (abridged)
More canonical domains and deeper semantics (corporate actions engine; look-through hierarchies).
Adapter Studio to author & test transforms, policies, and PIT selection with instant validation.
Streaming (webhooks/Kafka) for low-latency consumers.
SDKs (Python/TypeScript) code-generated from OpenAPI + Pydantic models.
MCP expansion: richer tools/resources for agentic research and automation.
FAQ
Q: We already have internal systems - will this fit?
A: Yes. Treat QuantJourney as a drop-in canonical layer. Upstream sources plug into adapters; downstream, your reporting, accounting, risk, OMS/EMS, and analytics consume one stable contract via API or MCP. You can integrate gradually - domain by domain - without big-bang replacements.
Q: What happens when a source changes its schema or has an outage?
A: Adapters isolate change. We update the mapping; your canonical models (and their versions) remain stable. For outages, failover policies route to alternates, and caches mitigate transient incidents.
Q: Can we reproduce historical states?
A: Yes. Point-in-time queries are first-class. You can ask “as of 2024-12-31” and get the exact view, even if later restatements exist.
Q: How do you ensure data quality?
A: Contract-level validation (Pydantic), adapter-level checks, coverage metrics, and rule-based assertions (ranges, non-negativity, monotonic series where applicable). Bad data fails early - with clear diagnostics.
Happy trading!
Jakub