This post is adapted from the project README and focuses on architectural decisions and operational tradeoffs.
Financial Market Data Pipeline
Why I built this
I’ve worked on enough production systems to know that “just ingest the data” is usually where correctness goes to die.
Market data looks simple on the surface — prices go up, prices go down — but real-world feeds are late, corrected, inconsistent across vendors, and full of edge cases. Corporate actions like splits and dividends add another layer of complexity that can quietly corrupt historical data if they’re handled casually.
This project is my way of demonstrating how I think about systems like this. I’d rather build something deterministic, auditable, and operationally sane than something flashy that breaks the moment reality shows up.
The problem in one paragraph
The goal is to ingest end-of-day (EOD) OHLCV price data and corporate actions from multiple third-party vendors, normalize and validate that data, select a canonical “golden” price per symbol per day, apply deterministic corporate action adjustments, and publish final datasets for downstream consumers.
The system must support reruns, backfills, and vendor corrections without silent overwrites or loss of lineage. Correctness and reproducibility matter more than latency.
Constraints first, architecture second
Before touching architecture, I locked in a few non-negotiable constraints:
- Correctness over latency — EOD batch processing is acceptable
- Immutable raw data — no in-place mutation of vendor data
- Idempotent ingestion — duplicate deliveries must be safe
- Deterministic recomputation — history must be rebuildable on demand
- Full lineage — every derived record must be traceable to its source
These constraints shape the system far more than any specific technology choice.
High-level architecture
```mermaid
flowchart LR
    VendorA[Stooq CSV] --> Ingest[Ingestion]
    VendorB[FMP API] --> Ingest
    Ingest --> RawData["Raw (Immutable)"]
    RawData --> Normalize["Normalization & Quality Checks"]
    Normalize --> Golden["Golden Price Selection"]
    Golden --> Adjust["Corporate Action Adjustments"]
    Adjust --> Publish["Published EOD Datasets"]
```
The pipeline is intentionally layered. Each stage produces new derived data rather than mutating existing records, which keeps recomputation explicit and auditable.
Core design principles
1. Immutable raw data
Vendor data is stored exactly as received, in an append-only raw layer. This provides:
- a permanent audit trail
- safe replay and historical backfills
- insulation from upstream vendor corrections
Raw data is never updated or deleted. Every downstream artifact is derived from it.
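As a sketch of what append-only means in practice, here is a minimal filesystem-backed raw writer. The layout and names (raw_root, write_raw_batch) are illustrative assumptions, not the project's actual API:

```python
from pathlib import Path

def write_raw_batch(raw_root: Path, vendor: str, dataset: str,
                    as_of_date: str, batch_id: str, payload: bytes) -> Path:
    """Append one vendor delivery to the raw layer; never overwrite."""
    target = raw_root / vendor / dataset / as_of_date / f"{batch_id}.raw"
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        # Raw data is immutable: the same batch is never rewritten, and a
        # vendor correction arrives as a new batch with a new batch_id.
        raise FileExistsError(f"raw batch already written: {target}")
    target.write_bytes(payload)   # write once; no updates, no deletes
    return target
```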
2. Batch identity as a first-class concept
Every vendor delivery is treated as a batch, uniquely identified by:
(vendor, dataset, as_of_date, source_checksum)
This single decision unlocks several important guarantees:
- duplicate deliveries with the same checksum are ignored safely
- corrections create new batches instead of overwriting history
- recomputation becomes explicit and traceable
Batch identity is one of those unglamorous details that quietly holds the entire system together.
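To show how that identity can be made deterministic, here is a minimal sketch using a content checksum plus a name-based UUID; the namespace constant and function names are illustrative, not taken from the repository:

```python
import hashlib
import uuid

# Deterministic namespace for batch IDs (illustrative; derived from a fixed label).
BATCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "fin-eod-pipeline.batches")

def source_checksum(raw_bytes: bytes) -> str:
    """File-level checksum of the vendor delivery, exactly as received."""
    return hashlib.sha256(raw_bytes).hexdigest()

def batch_id(vendor: str, dataset: str, as_of_date: str, checksum: str) -> uuid.UUID:
    """Same (vendor, dataset, as_of_date, source_checksum) -> same batch_id, always."""
    return uuid.uuid5(BATCH_NAMESPACE, f"{vendor}|{dataset}|{as_of_date}|{checksum}")

def ingest_batch(seen: set, vendor: str, dataset: str, as_of_date: str, raw_bytes: bytes):
    """Return a new batch_id, or None if this exact delivery was already ingested."""
    bid = batch_id(vendor, dataset, as_of_date, source_checksum(raw_bytes))
    if bid in seen:
        return None       # duplicate delivery: safe no-op
    seen.add(bid)         # a corrected file has a different checksum -> a new batch
    return bid
```

Because the ID is derived from content rather than generated at ingest time, re-running ingestion over the same files is naturally idempotent.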
3. Derived data only (no silent overwrites)
Normalized prices, golden prices, and adjusted prices are all derived tables.
If something changes upstream, affected date ranges are recomputed from raw data. Historical results are never silently patched in place. This avoids the most dangerous failure mode in data systems: quietly changing the past.
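A minimal sketch of that recompute-and-append pattern, using a toy in-memory version list in place of a real versioned table (names and fields are illustrative):

```python
from datetime import date, datetime, timezone

def publish_recomputed_window(derived_versions: list, symbol: str,
                              start: date, end: date,
                              rows: list, source_batch_ids: list) -> dict:
    """Append a new derived version for an affected window instead of patching history.

    `derived_versions` is a toy append-only list standing in for a versioned table.
    """
    version = {
        "symbol": symbol,
        "window": (start, end),
        "rows": rows,                          # rebuilt from raw, never edited in place
        "source_batch_ids": source_batch_ids,  # lineage back to the raw batches used
        "computed_at": datetime.now(timezone.utc),
    }
    derived_versions.append(version)           # older versions stay queryable for audit
    return version
```

Consumers read the most recent version for a window; anything older remains available when someone asks why a number changed.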
Example schema: raw_prices
To keep the design concrete, here’s a representative raw table:
| Column | Type | Description |
|---|---|---|
| batch_id | UUID | Ingestion batch identifier |
| vendor | TEXT | Data vendor |
| symbol | TEXT | Vendor symbol |
| as_of_date | DATE | Trading date |
| payload | JSON / TEXT | Raw vendor payload |
| source_checksum | TEXT | File-level checksum |
| ingested_at | TIMESTAMP | Ingestion timestamp |
Everything else in the pipeline can be rebuilt from this foundation.
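For readers who prefer code to tables, a minimal record type mirroring the schema above might look like this (illustrative, not the project's actual model):

```python
from dataclasses import dataclass
from datetime import date, datetime
from uuid import UUID

@dataclass(frozen=True)     # frozen mirrors the append-only, never-mutated raw layer
class RawPrice:
    batch_id: UUID          # ingestion batch identifier
    vendor: str             # data vendor (e.g. "stooq", "fmp")
    symbol: str             # vendor symbol, not yet normalized
    as_of_date: date        # trading date
    payload: str            # raw vendor payload, stored verbatim (JSON or CSV row)
    source_checksum: str    # file-level checksum of the delivery
    ingested_at: datetime   # ingestion timestamp
```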
Normalization and quality checks
Vendor-specific schemas are normalized into a canonical model, and basic quality checks flag:
- missing or null fields
- invalid prices
- obvious outliers
Invalid records are quarantined, not discarded. The pipeline continues processing valid data even when partial failures occur, making data quality issues visible instead of destructive.
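A minimal sketch of that quarantine-instead-of-discard approach; the specific rules and field names are illustrative assumptions, and a real outlier check would need thresholds the post doesn't specify:

```python
def quality_issues(record: dict) -> list:
    """Return a list of issue codes; an empty list means the record passes."""
    issues = []
    for field in ("open", "high", "low", "close", "volume"):
        if record.get(field) is None:
            issues.append(f"missing:{field}")
    prices = [record.get(f) for f in ("open", "high", "low", "close")]
    if any(p is not None and p <= 0 for p in prices):
        issues.append("invalid:non_positive_price")
    if None not in prices and not (record["low"] <= min(record["open"], record["close"])
                                   <= max(record["open"], record["close"]) <= record["high"]):
        issues.append("invalid:ohlc_inconsistent")
    # An outlier rule (e.g. a day-over-day jump threshold) would slot in here.
    return issues

def partition_records(records: list) -> tuple:
    """Split records into valid and quarantined; nothing is discarded."""
    valid, quarantined = [], []
    for rec in records:
        issues = quality_issues(rec)
        (quarantined if issues else valid).append({**rec, "issues": issues})
    return valid, quarantined
```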
Golden price selection
Once data is normalized, the pipeline selects a single canonical EOD price per symbol per day.
Golden price selection is:
- explicit — driven by rules, not assumptions
- deterministic — identical inputs produce identical outputs
- traceable — provenance metadata is stored alongside results
Golden prices represent the trusted truth for downstream consumers.
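As an illustration of rule-driven, deterministic selection, here is a sketch that prefers vendors in a fixed priority order and breaks any remaining ties deterministically; the priority table and field names are assumptions, not the project's actual policy:

```python
# Illustrative policy: prefer vendors in a fixed priority order, then break any
# remaining ties deterministically so identical inputs always give identical outputs.
VENDOR_PRIORITY = {"stooq": 0, "fmp": 1}   # assumed ordering, not the project's actual rule

def select_golden(candidates: list) -> dict:
    """Pick one canonical EOD row per (symbol, as_of_date) from normalized candidates."""
    chosen = min(
        candidates,
        key=lambda r: (VENDOR_PRIORITY.get(r["vendor"], 99), r["vendor"], str(r["batch_id"])),
    )
    # Provenance travels with the result so the selection can always be explained later.
    return {**chosen,
            "golden_source_vendor": chosen["vendor"],
            "golden_source_batch_id": chosen["batch_id"]}
```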
Corporate actions: where pipelines usually break
Corporate actions are where many market data pipelines quietly fail.
In this system:
- split events are normalized into a clean representation
- historical prices are back-adjusted deterministically
- adjusted outputs are versioned and fully recomputable from raw data
Back-adjustment ensures downstream consumers see a continuous price series without hiding the fact that adjustments occurred.
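To make the back-adjustment deterministic and easy to audit, the core computation can be as small as the sketch below; the field names and the ratio convention (new shares per old share) are assumptions for illustration, and only the close is adjusted here for brevity:

```python
def back_adjust_for_splits(prices: list, splits: list) -> list:
    """Back-adjust historical closes for splits, deterministically.

    `prices` rows look like {"as_of_date": date, "close": float}; `splits` rows look
    like {"ex_date": date, "ratio": float}, where ratio is new_shares / old_shares
    (e.g. 2.0 for a 2-for-1 split).
    """
    adjusted = []
    for row in prices:
        # Every split whose ex-date falls after this trading day scales it down,
        # so the series is continuous across the split boundary.
        factor = 1.0
        for split in splits:
            if row["as_of_date"] < split["ex_date"]:
                factor /= split["ratio"]
        adjusted.append({**row,
                         "adj_close": row["close"] * factor,
                         "adjustment_factor": factor})
    return adjusted
```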
Modular monolith vs microservices
This pipeline is intentionally implemented as a modular monolith.
That choice optimizes for:
- correctness
- development velocity
- operational simplicity
Microservices become justified only when:
- independent scaling is required
- release coupling becomes limiting
- team size and operational maturity justify the added complexity
Distributed systems are not a default — they’re a tradeoff.
What I intentionally didn’t build
This project deliberately excludes:
- real-time streaming ingestion
- dashboards or UIs
- authentication and authorization
- production-grade infrastructure
Not because they’re difficult, but because they’re orthogonal to the core design problem: correctness, lineage, and deterministic recomputation.
What’s next
Possible extensions include:
- dividend adjustments
- versioned historical snapshots
- backtesting-ready datasets
- exports to columnar formats for downstream analytics
Each extension builds on the same foundational principles rather than changing them.
Closing thoughts
This project mirrors how real-world financial data systems are designed and operated when correctness and trust matter more than speed.
The most important parts aren’t the libraries or the code — they’re the decisions around immutability, batch identity, and recomputation under imperfect data.
Source code: https://github.com/tkalp/fin-eod-pipeline