This post is adapted from the project README and focuses on architectural decisions and operational tradeoffs.

Financial Market Data Pipeline

Why I built this

I’ve worked on enough production systems to know that “just ingest the data” is usually where correctness goes to die.

Market data looks simple on the surface — prices go up, prices go down — but real-world feeds are late, corrected, inconsistent across vendors, and full of edge cases. Corporate actions like splits and dividends add another layer of complexity that can quietly corrupt historical data if they’re handled casually.

This project is my way of demonstrating how I think about systems like this. I’d rather build something deterministic, auditable, and operationally sane than something flashy that breaks the moment reality shows up.


The problem in one paragraph

The goal is to ingest end-of-day (EOD) OHLCV price data and corporate actions from multiple third-party vendors, normalize and validate that data, select a canonical “golden” price per symbol per day, apply deterministic corporate action adjustments, and publish final datasets for downstream consumers.

The system must support reruns, backfills, and vendor corrections without silent overwrites or loss of lineage. Correctness and reproducibility matter more than latency.


Constraints first, architecture second

Before touching architecture, I locked in a few non-negotiable constraints:

  • Correctness over latency — EOD batch processing is acceptable
  • Immutable raw data — no in-place mutation of vendor data
  • Idempotent ingestion — duplicate deliveries must be safe
  • Deterministic recomputation — history must be rebuildable on demand
  • Full lineage — every derived record must be traceable to its source

These constraints shape the system far more than any specific technology choice.


High-level architecture

flowchart LR
    VendorA[Stooq CSV] --> Ingest[Ingestion]
    VendorB[FMP API] --> Ingest
    Ingest --> RawData["Raw (Immutable)"]
    RawData --> Normalize["Normalization & Quality Checks"]
    Normalize --> Golden["Golden Price Selection"]
    Golden --> Adjust["Corporate Action Adjustments"]
    Adjust --> Publish["Published EOD Datasets"]

The pipeline is intentionally layered. Each stage produces new derived data rather than mutating existing records, which keeps recomputation explicit and auditable.


Core design principles

1. Immutable raw data

Vendor data is stored exactly as received, in an append-only raw layer. This provides:

  • a permanent audit trail
  • safe replay and historical backfills
  • insulation from upstream vendor corrections

Raw data is never updated or deleted. Every downstream artifact is derived from it.


2. Batch identity as a first-class concept

Every vendor delivery is treated as a batch, uniquely identified by:

(vendor, dataset, as_of_date, source_checksum)

This single decision unlocks several important guarantees:

  • duplicate deliveries with the same checksum are ignored safely
  • corrections create new batches instead of overwriting history
  • recomputation becomes explicit and traceable

Batch identity is one of those unglamorous details that quietly holds the entire system together.
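
To make this concrete, here's a minimal Python sketch (illustrative, not the repository's actual code) of a deterministic batch identifier and an idempotent ingest check built on that tuple:

    import hashlib
    import uuid

    # Fixed namespace so the same identity tuple always maps to the same UUID.
    # The namespace value itself is arbitrary but must never change.
    BATCH_NAMESPACE = uuid.UUID("2d5d7f10-9c4a-4b6e-8f3a-1a2b3c4d5e6f")

    def source_checksum(payload: bytes) -> str:
        """File-level checksum of the vendor delivery, exactly as received."""
        return hashlib.sha256(payload).hexdigest()

    def batch_id(vendor: str, dataset: str, as_of_date: str, checksum: str) -> uuid.UUID:
        """Deterministic batch identifier derived from the identity tuple."""
        return uuid.uuid5(BATCH_NAMESPACE, f"{vendor}|{dataset}|{as_of_date}|{checksum}")

    seen_batches: set[uuid.UUID] = set()  # stands in for a lookup against the raw layer

    def ingest(vendor: str, dataset: str, as_of_date: str, payload: bytes) -> bool:
        """Append a delivery to the raw layer unless this exact batch already exists."""
        bid = batch_id(vendor, dataset, as_of_date, source_checksum(payload))
        if bid in seen_batches:
            return False  # duplicate delivery with the same checksum: safely ignored
        seen_batches.add(bid)
        # ...append the payload to the append-only raw layer, keyed by bid...
        return True

A vendor correction changes the file contents, so its checksum and batch_id change with it: the correction lands as a new batch instead of overwriting the original delivery.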


3. Derived data only (no silent overwrites)

Normalized prices, golden prices, and adjusted prices are all derived tables.

If something changes upstream, affected date ranges are recomputed from raw data. Historical results are never silently patched in place. This avoids the most dangerous failure mode in data systems: quietly changing the past.
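
As a sketch of that contract (hypothetical helpers, assuming raw rows carry symbol, as_of_date, and close fields):

    from datetime import date

    def affected_window(correction_rows: list[dict]) -> tuple[date, date]:
        """Date range touched by a correction batch."""
        dates = [row["as_of_date"] for row in correction_rows]
        return min(dates), max(dates)

    def recompute_window(raw_rows: list[dict], start: date, end: date) -> list[dict]:
        """Rebuild derived rows for [start, end] purely from raw inputs.

        The result is written as a new derived version; existing rows are
        never updated in place.
        """
        in_window = [row for row in raw_rows if start <= row["as_of_date"] <= end]
        # Placeholder derivation; the real pipeline runs normalization,
        # golden selection, and adjustment over this window.
        return [
            {"symbol": row["symbol"], "as_of_date": row["as_of_date"], "close": row["close"]}
            for row in in_window
        ]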


Example schema: raw_prices

To keep the design concrete, here’s a representative raw table:

Column            Type           Description
batch_id          UUID           Ingestion batch identifier
vendor            TEXT           Data vendor
symbol            TEXT           Vendor symbol
as_of_date        DATE           Trading date
payload           JSON / TEXT    Raw vendor payload
source_checksum   TEXT           File-level checksum
ingested_at       TIMESTAMP      Ingestion timestamp

Everything else in the pipeline can be rebuilt from this foundation.
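
For illustration, here is the same table expressed as SQLite DDL (type names mapped to SQLite's affinities; the real store could be any append-only table or file layout):

    import sqlite3

    RAW_PRICES_DDL = """
    CREATE TABLE IF NOT EXISTS raw_prices (
        batch_id        TEXT NOT NULL,  -- ingestion batch identifier (UUID)
        vendor          TEXT NOT NULL,  -- data vendor
        symbol          TEXT NOT NULL,  -- vendor symbol
        as_of_date      TEXT NOT NULL,  -- trading date (ISO-8601)
        payload         TEXT NOT NULL,  -- raw vendor payload (JSON or CSV row)
        source_checksum TEXT NOT NULL,  -- file-level checksum
        ingested_at     TEXT NOT NULL   -- ingestion timestamp
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(RAW_PRICES_DDL)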


Normalization and quality checks

Vendor-specific schemas are normalized into a canonical model, and basic quality checks are applied to catch:

  • missing or null fields
  • invalid prices
  • obvious outliers

Invalid records are quarantined, not discarded. The pipeline continues processing valid data even when partial failures occur, making data quality issues visible instead of destructive.
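
Here's a minimal sketch of the quarantine pattern, assuming a canonical record with open/high/low/close/volume fields (the exact rule set below is illustrative):

    REQUIRED_FIELDS = ("symbol", "as_of_date", "open", "high", "low", "close", "volume")

    def validate(row: dict) -> str | None:
        """Return a rejection reason, or None if the row passes basic checks."""
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                return f"missing field: {field}"
        if any(row[f] <= 0 for f in ("open", "high", "low", "close")):
            return "non-positive price"
        if not (row["low"] <= min(row["open"], row["close"])
                and max(row["open"], row["close"]) <= row["high"]):
            return "inconsistent OHLC bounds"
        return None

    def split_valid(rows: list[dict]) -> tuple[list[dict], list[dict]]:
        """Partition rows into (valid, quarantined); nothing is discarded."""
        valid, quarantined = [], []
        for row in rows:
            reason = validate(row)
            if reason is None:
                valid.append(row)
            else:
                quarantined.append({**row, "reject_reason": reason})
        return valid, quarantined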


Golden price selection

Once data is normalized, the pipeline selects a single canonical EOD price per symbol per day.

Golden price selection is:

  • explicit — driven by rules, not assumptions
  • deterministic — identical inputs produce identical outputs
  • traceable — provenance metadata is stored alongside results

Golden prices represent the trusted truth for downstream consumers.
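
One way to express such a rule set is a deterministic reduce over the candidate rows for a given symbol and day; the vendor priority order below is purely illustrative:

    VENDOR_PRIORITY = {"fmp": 0, "stooq": 1}  # illustrative preference order, lower wins

    def select_golden(candidates: list[dict]) -> dict:
        """Pick one canonical row per (symbol, as_of_date), deterministically."""
        best = min(
            candidates,
            # Tie-break on the vendor name so unknown vendors still sort stably.
            key=lambda row: (VENDOR_PRIORITY.get(row["vendor"], 99), row["vendor"]),
        )
        return {
            **best,
            # Provenance metadata stored alongside the result.
            "golden_source_vendor": best["vendor"],
            "golden_source_batch_id": best["batch_id"],
            "golden_rule": "vendor_priority_v1",
        }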


Corporate actions: where pipelines usually break

Corporate actions are where many market data pipelines quietly fail.

In this system:

  • split events are normalized into a clean representation
  • historical prices are back-adjusted deterministically
  • adjusted outputs are versioned and fully recomputable from raw data

Back-adjustment ensures downstream consumers see a continuous price series without hiding the fact that adjustments occurred.
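
A minimal sketch of split back-adjustment, assuming splits are represented as (ex_date, ratio) with ratio meaning new shares per old share (2.0 for a 2-for-1 split):

    from datetime import date

    def back_adjust(prices: list[dict], splits: list[tuple[date, float]]) -> list[dict]:
        """Divide pre-split closes by the cumulative ratio of all later splits."""
        adjusted = []
        for row in prices:
            factor = 1.0
            for ex_date, ratio in splits:
                if row["as_of_date"] < ex_date:
                    factor *= ratio
            adjusted.append({
                **row,
                "adj_close": row["close"] / factor,
                "adj_volume": row["volume"] * factor,
                "adjustment_factor": factor,  # kept so the adjustment stays visible
            })
        return adjusted

    # Example: a 2-for-1 split on 2024-06-10 halves the earlier close.
    history = [
        {"as_of_date": date(2024, 6, 7), "close": 100.0, "volume": 1_000},
        {"as_of_date": date(2024, 6, 10), "close": 51.0, "volume": 2_100},
    ]
    print(back_adjust(history, [(date(2024, 6, 10), 2.0)]))

Because the factor is a pure function of raw prices and normalized split events, rerunning the adjustment over any window reproduces the same output.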


Modular monolith vs microservices

This pipeline is intentionally implemented as a modular monolith.

That choice optimizes for:

  • correctness
  • development velocity
  • operational simplicity

Microservices become justified only when:

  • independent scaling is required
  • release coupling becomes limiting
  • team size and operational maturity justify the added complexity

Distributed systems are not a default — they’re a tradeoff.


What I intentionally didn’t build

This project deliberately excludes:

  • real-time streaming ingestion
  • dashboards or UIs
  • authentication and authorization
  • production-grade infrastructure

Not because they’re difficult, but because they’re orthogonal to the core design problem: correctness, lineage, and deterministic recomputation.


What’s next

Possible extensions include:

  • dividend adjustments
  • versioned historical snapshots
  • backtesting-ready datasets
  • exports to columnar formats for downstream analytics

Each extension builds on the same foundational principles rather than changing them.


Closing thoughts

This project mirrors how real-world financial data systems are designed and operated when correctness and trust matter more than speed.

The most important parts aren’t the libraries or the code — they’re the decisions around immutability, batch identity, and recomputation under imperfect data.


Source code: https://github.com/tkalp/fin-eod-pipeline