TL;DR: We’re still working to add API access and MCP integration to QuantJourney, so agents, IDEs, and services will soon be able to fetch canonical market data with first-class semantics. At the core are contracts-first Pydantic v2 models for all inputs and outputs, backed by deterministic per-vendor adapters. This enables new ways to run research, execute backtests, and validate investment ideas. The API is a near-term milestone - but the strategic goal is bigger: an MDM-grade canonical gateway and data lake serving everything from research to trading-desk support, with strict data contracts, point-in-time truth, and operational reliability.
This post explains what we’re building, why it matters to funds, analysts, and serious investors, and how the pieces fit together. Stay tuned - we expect to share more code soon.
The Problem We’re Solving
Modern teams consume data from multiple custodians, brokers, banks, and market sources. Each feed uses its own schema, timestamp conventions, currencies, units, and revision logic. Without a canonical layer you get:
Fragile mapping code scattered across services.
Slow vendor swaps and costly maintenance.
Silent data drift (units, scaling, FX) that infects analytics.
Inability to reproduce what was known at time T.
This is a common issue with LLM-driven workflows today, which are prone to look-ahead bias and deeper data inconsistencies. Even for simple strategies, we often need to pull data from multiple sources to support our investment thesis.
The QuantJourney Approach
1) Canonical Schemas (Contracts First)
We define versioned canonical schemas (semver + $id) for core domains - positions, transactions, balances, prices, valuations, security master, fundamentals, calendars, and reference data. These schemas are the contract every internal app and external consumer can rely on.
Pydantic v2 is our contract engine - on both input params and output returns:
Input: We validate raw payloads into strongly typed internal models (with aliases for vendor field names) before normalization.
Output: We emit canonical Pydantic models - typed, documented, and versioned - so client code gets stable structures with clear upgrade paths.
We use Annotated types and validators to enforce semantics:
Money[CUR], Pct, bps, Date, TZ, enums (asset class, instrument type, period).
Hard guarantees on required fields and units/scale invariants.
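A minimal sketch of how such annotated types can be expressed in Pydantic v2 - the definitions below are illustrative, not our production types, and we approximate Money[CUR] with an amount-plus-currency model:

```python
from datetime import date
from enum import Enum
from typing import Annotated
from pydantic import BaseModel, Field

# Illustrative canonical type aliases; production definitions may differ.
Pct = Annotated[float, Field(ge=-1.0, le=1.0)]  # fraction, not percentage points
Bps = Annotated[float, Field(description="basis points; 1 bps = 0.0001")]

class AssetClass(str, Enum):
    EQUITY = "equity"
    FIXED_INCOME = "fixed_income"
    FX = "fx"

class Money(BaseModel):
    """Amount plus explicit currency - no bare floats for money."""
    amount: float
    currency: Annotated[str, Field(pattern=r"^[A-Z]{3}$")]  # ISO 4217

class PositionSlice(BaseModel):
    instrument_id: str
    asset_class: AssetClass
    as_of: date
    market_value: Money  # required, unit-safe
    weight: Pct          # hard range invariant enforced at validation time
```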
2) Adapters & Transforms (Deterministic Normalization)
Each feed is onboarded with a small adapter spec (our DSL) that expresses:
rename, cast, default, scale, derive, fx_normalize, pit_select, map_enum.
Field-level precedence and merge rules when multiple sources provide the same attribute.
Adapters compile to pure Python callables at service startup (no per-request YAML parsing). They run fast, are unit-tested with golden fixtures, and produce identical canonical outputs irrespective of source.
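A simplified sketch of the compile-once idea - the real DSL supports more operations, and the spec and field names below are invented for illustration:

```python
from typing import Any, Callable

# A tiny adapter spec: ops applied in order to each raw record.
SPEC = [
    ("rename", {"px_last": "close"}),
    ("cast", {"close": float}),
    ("scale", {"close": 0.01}),     # hypothetical vendor quotes in cents
    ("default", {"currency": "USD"}),
]

def compile_adapter(spec: list[tuple[str, dict]]) -> Callable[[dict], dict]:
    """Build a plain callable once at startup - no per-request YAML parsing."""
    def run(raw: dict[str, Any]) -> dict[str, Any]:
        rec = dict(raw)
        for op, args in spec:
            if op == "rename":
                for src, dst in args.items():
                    if src in rec:
                        rec[dst] = rec.pop(src)
            elif op == "cast":
                for field, typ in args.items():
                    rec[field] = typ(rec[field])
            elif op == "scale":
                for field, factor in args.items():
                    rec[field] = rec[field] * factor
            elif op == "default":
                for field, value in args.items():
                    rec.setdefault(field, value)
        return rec
    return run

adapter = compile_adapter(SPEC)
print(adapter({"px_last": "12345"}))  # {'close': 123.45, 'currency': 'USD'}
```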
3) Point-in-Time (PIT) Correctness
We support time-travel queries and restatements:
Keys include as_of, effective_from/to, revision_id, source_timestamp.
You can reproduce quarter-end views, audits, and backtests exactly as they were known.
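A toy sketch of point-in-time selection using the keys above - the restatement example is invented:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Revision:
    value: float
    effective_from: date
    effective_to: date | None  # None = still current
    revision_id: int

def as_of_view(revisions: list[Revision], as_of: date) -> Revision | None:
    """Return the revision that was known at `as_of`, i.e.
    effective_from <= as_of < effective_to (latest revision_id wins)."""
    candidates = [
        r for r in revisions
        if r.effective_from <= as_of
        and (r.effective_to is None or as_of < r.effective_to)
    ]
    return max(candidates, key=lambda r: r.revision_id, default=None)

# Q4 revenue restated in March; a 2024-12-31 query still sees the original figure.
history = [
    Revision(1_000_000.0, date(2024, 12, 31), date(2025, 3, 15), 1),
    Revision(1_050_000.0, date(2025, 3, 15), None, 2),
]
print(as_of_view(history, date(2024, 12, 31)).value)  # 1000000.0
```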
4) FX & Units/Scale Normalization
Preserve native currency and expose normalized value_{USD/EUR/...} using dated FX at a declared policy (e.g., EOD, custody FX, WM/Ref style).
Normalize units and scales (e.g., revenue in units, not thousands; or carry scale explicitly).
No more guessing whether a column is $000s or $MM.
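A small sketch of the normalization step - the FX table, rate, and policy name here are illustrative:

```python
from datetime import date

# Hypothetical dated FX table under an "EOD" policy (base -> quote rates).
EOD_FX = {(date(2024, 12, 31), "EUR", "USD"): 1.0389}

def normalize_value(amount: float, currency: str, scale: int,
                    target_ccy: str, fx_date: date) -> float:
    """Undo the vendor's scale (scale=3 means the feed reports thousands),
    then convert to the target currency with dated FX."""
    units = amount * (10 ** scale)
    if currency == target_ccy:
        return units
    rate = EOD_FX[(fx_date, currency, target_ccy)]
    return units * rate

# A feed reporting EUR revenue in thousands:
print(normalize_value(2_500, "EUR", 3, "USD", date(2024, 12, 31)))  # 2597250.0
```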
5) Entity & Identifier Resolution
Stable instrument_id backed by global identifiers where available (ISIN/CUSIP/FIGI equivalents), plus legal-entity and account hierarchies. Ticker string changes and symbol churn don’t break downstream apps.
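A sketch of precedence-based resolution - the identifier values and master table below are made up:

```python
# Strong global identifiers first, ticker strings last (they churn).
ID_PRECEDENCE = ("figi", "isin", "cusip", "ticker")

# Hypothetical security-master index keyed by (id_type, id_value).
SECURITY_MASTER = {
    ("isin", "US0378331005"): "inst_000042",
    ("ticker", "AAPL"): "inst_000042",
}

def resolve_instrument(vendor_ids: dict[str, str]) -> str | None:
    for id_type in ID_PRECEDENCE:
        value = vendor_ids.get(id_type)
        if value and (id_type, value) in SECURITY_MASTER:
            return SECURITY_MASTER[(id_type, value)]
    return None  # unresolved: route to a manual-mapping queue

print(resolve_instrument({"ticker": "AAPL", "isin": "US0378331005"}))  # inst_000042
```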
6) API + MCP
API: Fast, documented REST endpoints (versioned /v1/...), orjson serialization, pagination, and batch routes.
MCP: We expose QuantJourney as MCP “tools/resources,” so agents, IDEs, and orchestrators can request canonical data and metadata with first-class semantics (great for research copilots and automation).
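On the MCP side, a minimal sketch of how a canonical tool could be exposed, assuming the official MCP Python SDK (the mcp package) - the tool name and payload are illustrative:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("quantjourney")

@mcp.tool()
def get_price(instrument_id: str, as_of: str) -> dict:
    """Canonical EOD price for an instrument as known at `as_of` (YYYY-MM-DD)."""
    # In production this would call the same canonical service the REST API
    # uses; the static payload below is a placeholder.
    return {"instrument_id": instrument_id, "as_of": as_of,
            "close": 123.45, "currency": "USD", "schema_version": "1.0.0"}

if __name__ == "__main__":
    mcp.run()  # serve over stdio so agents/IDEs can attach
```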
Why Pydantic v2 Matters
Pydantic v2 is a high-performance data validation and parsing library for Python that lets us define structured, type-safe models for every API input and output. It ensures that all incoming data conforms to strict schemas, automatically handling type conversions, validation rules, and default values. Version 2 introduces a new parsing engine (pydantic-core) that’s 5–50× faster than v1, with lower memory usage and better error reporting. This speed matters in a trading context - we can validate thousands of messages per second without becoming the bottleneck. By basing our canonical contracts on Pydantic v2, we guarantee both semantic correctness and performance at scale.
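A tiny example of the behavior described above - coercion, defaults, and precise field-level errors:

```python
from pydantic import BaseModel, ValidationError

class Tick(BaseModel):
    symbol: str
    price: float   # "101.5" from a feed is coerced to 101.5
    size: int = 0  # default applied when the field is missing

print(Tick.model_validate({"symbol": "AAPL", "price": "101.5"}))
# symbol='AAPL' price=101.5 size=0

try:
    Tick.model_validate({"symbol": "AAPL", "price": "n/a"})
except ValidationError as e:
    print(e.errors()[0]["type"])  # 'float_parsing' - precise, field-level errors
```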
Data Lake & Analytics Alignment
We align to a medallion flow - Bronze (raw) → Silver (normalized per-vendor) → Gold (canonical):
Storage in columnar formats (Arrow/Parquet) with schemas kept in lockstep with the Pydantic models.
Polars/Arrow for fast local analytics; a time-series DB for query patterns that need it.
Event-sourced deltas to support PIT replays and restatements.
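A minimal sketch of the Gold step - dumping canonical Pydantic models into Parquet via Polars (the model and path are illustrative):

```python
from datetime import date
import polars as pl
from pydantic import BaseModel

class GoldPrice(BaseModel):  # illustrative canonical (Gold) record
    instrument_id: str
    as_of: date
    close: float
    currency: str

records = [GoldPrice(instrument_id="inst_000042", as_of=date(2024, 12, 31),
                     close=123.45, currency="USD")]

# model_dump() keeps the Parquet schema in lockstep with the contract;
# as_of is inferred as a Date column, close as Float64.
df = pl.DataFrame([r.model_dump() for r in records])
df.write_parquet("prices_gold.parquet")
```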
Example of YAML→Pydantic→FastAPI
Adapter snippet (YAML-ish):
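A sketch of what such a spec could look like - the vendor, field names, and policies are hypothetical:

```yaml
# Illustrative adapter spec for a hypothetical vendor feed ("acme").
adapter: acme_prices_v1
source: acme
domain: prices
transforms:
  - rename: {PX_LAST: close, CRNCY: currency}
  - cast: {close: float}
  - scale: {close: 0.01}              # vendor quotes prices in cents
  - default: {currency: USD}
  - derive: {mid: "(bid + ask) / 2"}  # simple derived field
  - map_enum: {asset_class: {EQ: equity, FI: fixed_income}}
  - fx_normalize: {field: close, policy: eod, target: USD}
  - pit_select: {keys: [instrument_id, as_of], prefer: latest_revision}
precedence:
  close: [acme, backup_vendor]        # field-level source preference
```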
Pydantic v2 (generated) - Canonical Output:
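A hand-written approximation of what a generated canonical model might look like (field set and defaults are illustrative):

```python
from datetime import date
from pydantic import BaseModel, ConfigDict, Field

class CanonicalPrice(BaseModel):
    """Canonical EOD price record - illustrative stand-in for a generated, versioned contract."""
    model_config = ConfigDict(frozen=True)

    schema_version: str = Field(default="1.0.0", description="semver of this contract")
    instrument_id: str                            # stable id from entity resolution
    as_of: date
    close: float = Field(ge=0)
    currency: str = Field(pattern=r"^[A-Z]{3}$")  # native currency, ISO 4217
    close_usd: float | None = None                # dated-FX normalized value
    scale: int = 0                                # explicit scale; 0 = raw units
    revision_id: int = 1                          # supports PIT restatements
```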
FastAPI (excerpt):
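A minimal endpoint sketch reusing the CanonicalPrice model above, with an in-memory stand-in for the canonical store:

```python
from datetime import date
from fastapi import FastAPI, HTTPException
from fastapi.responses import ORJSONResponse

app = FastAPI(default_response_class=ORJSONResponse)  # orjson serialization

# Tiny in-memory stand-in for the canonical store (illustrative only);
# CanonicalPrice is the model from the previous snippet.
STORE = {
    ("inst_000042", date(2024, 12, 31)): CanonicalPrice(
        instrument_id="inst_000042", as_of=date(2024, 12, 31),
        close=123.45, currency="USD",
    ),
}

@app.get("/v1/prices/{instrument_id}", response_model=CanonicalPrice)
def get_price(instrument_id: str, as_of: date) -> CanonicalPrice:
    record = STORE.get((instrument_id, as_of))
    if record is None:
        raise HTTPException(status_code=404, detail="unknown instrument or as_of")
    return record
```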
Where This Runs in Your Stack
As a gateway beneath your reporting, risk, OMS/EMS, or analytics - feeding them clean, stable contracts.
As a research plane for quants/analysts - consistent data for factor research, backtests, and dashboards.
As a trading-desk support service - exposures, greeks/risks, and scenario inputs pull from the same canonical truth.
Commercial Focus: B2B / B2B2B
This is infrastructure. Teams need SLAs, change management, support, and predictable contracts. Our focus is B2B / B2B2B - powering internal teams and platforms that, in turn, serve end clients. The economics and reliability requirements make this the only sensible path.
Therefore we spend significant time adding B2B features such as:
Source Router & Failover Policies
Per domain we define a preference order, health checks, and fallback chains.
Field-level quorum/precedence: if two sources disagree, we apply a rule (priority, recency, confidence score).
Circuit breakers, exponential backoff, and jittered retries to ride out partial outages.
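A condensed sketch of the failover loop - circuit breakers and field-level quorum rules are omitted for brevity, and all names are illustrative:

```python
import random
import time
from typing import Callable

def fetch_with_failover(sources: list[tuple[str, Callable[[], dict]]],
                        retries: int = 3, base_delay: float = 0.2) -> dict:
    """Try sources in preference order; retry each with exponential backoff + jitter."""
    for name, fetch in sources:
        for attempt in range(retries):
            try:
                return {"source": name, **fetch()}
            except Exception:
                # jittered exponential backoff before the next attempt
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        # this source exhausted its retry budget - fall through to the next one
    raise RuntimeError("all sources failed")
```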
Guaranteed Performance
Pre-compiled adapters; vectorized transforms where appropriate.
Pydantic v2 validation + orjson responses.
Caching with invalidation keyed by as_of, FX policy, and entity identifiers.
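The key idea in one line: anything that changes the answer belongs in the cache key, so invalidation is exact (illustrative sketch):

```python
def cache_key(domain: str, instrument_id: str, as_of: str, fx_policy: str) -> str:
    """as_of, FX policy, and entity id all participate in the key."""
    return f"{domain}:{instrument_id}:{as_of}:fx={fx_policy}"

print(cache_key("prices", "inst_000042", "2024-12-31", "eod"))
# prices:inst_000042:2024-12-31:fx=eod
```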
Observability with multiple KPIs
Coverage metrics (presence, null-rates), type-conflict counters.
Ingest/transform/serve latencies and error budgets.
Structured logs and tracing for lineage and auditability.
QuantJourney MDM Roadmap (abridged)
More canonical domains and deeper semantics (corporate actions engine; look-through hierarchies).
Adapter Studio to author & test transforms, policies, and PIT selection with instant validation.
Streaming (webhooks/Kafka) for low-latency consumers.
SDKs (Python/TypeScript) code-generated from OpenAPI + Pydantic models.
MCP expansion: richer tools/resources for agentic research and automation.
FAQ
Q: We already have internal systems - will this fit?
A: Yes. Treat QuantJourney as a drop-in canonical layer. Upstream sources plug into adapters; downstream, your reporting, accounting, risk, OMS/EMS, and analytics consume one stable contract via API or MCP. You can integrate gradually - domain by domain - without big-bang replacements.
Q: What happens when a source changes its schema or has an outage?
A: Adapters isolate change. We update the mapping; your canonical models (and their versions) remain stable. For outages, failover policies route to alternates, and caches mitigate transient incidents.
Q: Can we reproduce historical states?
A: Yes. Point-in-time queries are first-class. You can ask “as of 2024-12-31” and get the exact view, even if later restatements exist.
Q: How do you ensure data quality?
A: Contract-level validation (Pydantic), adapter-level checks, coverage metrics, and rule-based assertions (ranges, non-negativity, monotonic series where applicable). Bad data fails early - with clear diagnostics.
Happy trading!
Jakub