From Legacy to Real-Time Lakehouse at Global Scale

The Challenge

Multi-gateway payment data was siloed across legacy systems, creating reconciliation delays, compliance risk, and limited visibility into transaction anomalies. Processing pipelines were batch-oriented, slow to adapt, and could not meet tightening GDPR and PCI-DSS audit requirements.

The Engagement

Modern Cloud Lakehouse Architecture

Designed and built a unified payment data Lakehouse on AWS using Databricks and Delta Lake. Ingested data from multiple payment gateways via Kafka-based streaming pipelines, replacing fragile nightly batch jobs with near-real-time data availability.

The medallion architecture (Bronze / Silver / Gold) enforced clear data quality contracts at each layer, with full lineage tracked end-to-end.
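As a minimal sketch of what a per-layer quality contract can look like, the plain-Python example below promotes records from Bronze to Silver only when they satisfy a small rule set, then aggregates Silver into a Gold view. The field names (txn_id, gateway, amount, currency), the rules, and the function names are hypothetical illustrations rather than the project's actual schema; on the platform itself, equivalent logic would run as Spark jobs over Delta tables.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation

# Hypothetical Silver contract: fields every promoted record must carry.
REQUIRED_FIELDS = ("txn_id", "gateway", "amount", "currency")


def to_silver(bronze_records):
    """Split raw Bronze records into contract-passing rows and rejects."""
    silver, rejected = [], []
    seen_ids = set()
    for rec in bronze_records:
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        if missing:
            rejected.append((rec, f"missing fields: {missing}"))
            continue
        try:
            amount = Decimal(str(rec["amount"]))
        except InvalidOperation:
            rejected.append((rec, "amount is not numeric"))
            continue
        if rec["txn_id"] in seen_ids:
            rejected.append((rec, "duplicate txn_id"))
            continue
        seen_ids.add(rec["txn_id"])
        # Normalise types and casing so Gold can aggregate safely.
        silver.append({**rec, "amount": amount, "currency": rec["currency"].upper()})
    return silver, rejected


def to_gold(silver_records):
    """Aggregate Silver rows into per-gateway, per-currency totals."""
    totals = defaultdict(Decimal)
    for rec in silver_records:
        totals[(rec["gateway"], rec["currency"])] += rec["amount"]
    return dict(totals)
```

Keeping rejects alongside the reason they failed, rather than dropping them, is what lets the Bronze layer stay a complete record of what each gateway actually sent.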

CI/CD for Data Pipelines

Introduced infrastructure-as-code and CI/CD practices to the data engineering workflow:

Pipeline definitions versioned in Git with automated testing on every merge
Databricks Asset Bundles used for environment promotion (dev → staging → prod)
Data quality assertions integrated into the deployment gate: no model or pipeline was promoted without passing its validation suites
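The deployment gate in the last item can be sketched as a function that runs every registered assertion against a sample of pipeline output and blocks promotion on any failure. This is an illustration under assumptions: the two checks, the GateResult type, and evaluate_gate are hypothetical names, not the project's validation suite.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


@dataclass
class GateResult:
    passed: bool
    failures: List[str] = field(default_factory=list)


def no_null_ids(rows: List[Row]) -> bool:
    """Every row must carry a transaction id."""
    return all(r.get("txn_id") is not None for r in rows)


def amounts_non_negative(rows: List[Row]) -> bool:
    """Every amount must be numeric and zero or greater."""
    return all(
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
        for r in rows
    )


def evaluate_gate(rows: List[Row], checks: Dict[str, Check]) -> GateResult:
    """Run all checks; a check that raises is treated as a failure."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check(rows)
            label = name
        except Exception as exc:
            ok = False
            label = f"{name} (raised {type(exc).__name__})"
        if not ok:
            failures.append(label)
    return GateResult(passed=not failures, failures=failures)
```

A CI step could call evaluate_gate on a staging sample and exit with a non-zero status when passed is false, so that the promotion to the next environment never starts.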

ML-Based Anomaly Correction

Built a statistical anomaly detection layer that identified and flagged erroneous records for correction at ingestion time. Key details and outcomes:

Trained on historical reconciliation exceptions across gateway formats
Reduced manual data correction effort by surfacing root-cause patterns
Doubled effective data accuracy for downstream analytics and reporting
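The write-up does not say which statistical method the detection layer used. As one plausible illustration, the sketch below fits a per-gateway baseline from historical amounts and flags new records with a modified z-score based on the median and the median absolute deviation (MAD). The function names, the record fields, and the 3.5 threshold are assumptions made for the example.

```python
import statistics
from typing import Dict, List, Tuple

Baseline = Dict[str, Tuple[float, float]]


def fit_baseline(history: Dict[str, List[float]]) -> Baseline:
    """Learn a (median, MAD) pair per gateway from historical amounts."""
    baseline = {}
    for gateway, amounts in history.items():
        med = statistics.median(amounts)
        mad = statistics.median([abs(a - med) for a in amounts])
        baseline[gateway] = (med, mad)
    return baseline


def is_anomalous(record: dict, baseline: Baseline, threshold: float = 3.5) -> bool:
    """Flag a record whose amount is far from its gateway's usual range."""
    if record["gateway"] not in baseline:
        # No history for this gateway: flag for review rather than guess.
        return True
    med, mad = baseline[record["gateway"]]
    if mad == 0:
        # All historical values were identical; any deviation is unusual.
        return record["amount"] != med
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    modified_z = 0.6745 * abs(record["amount"] - med) / mad
    return modified_z > threshold
```

Median and MAD are used here instead of mean and standard deviation because a handful of extreme historical exceptions would otherwise inflate the baseline and hide the very records the layer is meant to catch.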

Compliance by Design

Worked with the compliance and legal teams to embed GDPR and PCI-DSS controls directly into the platform architecture:

Data residency enforced at the storage layer via AWS region tagging
PII fields masked at Bronze ingestion, with role-based access control on the Silver and Gold layers
Full audit log of data access, transformations, and model inference maintained in a tamper-evident store
Outcome: zero critical findings across two consecutive external audits
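Two of the controls above, field masking and a tamper-evident audit log, can be illustrated in a few lines. The sketch below masks PII with a keyed hash and chains audit entries so that editing any past entry invalidates the chain. The field list, key handling, and class names are hypothetical, and whether keyed hashing alone satisfies a particular GDPR or PCI-DSS requirement is a question for the assessor, not something this example settles.

```python
import hashlib
import hmac
import json

# Hypothetical PII columns; the real schema is not given in the write-up.
PII_FIELDS = ("card_number", "email")


def mask_record(record: dict, key: bytes) -> dict:
    """Replace PII values with keyed hashes: stable for joins, unreadable raw."""
    masked = dict(record)
    for name in PII_FIELDS:
        if masked.get(name) is not None:
            digest = hmac.new(key, str(masked[name]).encode("utf-8"), hashlib.sha256)
            masked[name] = digest.hexdigest()
    return masked


class AuditLog:
    """Append-only log where each entry's hash covers the previous hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash: str, event: dict) -> str:
        payload = json.dumps(event, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, event: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry_hash = self._digest(prev_hash, event)
        self.entries.append({"event": event, "prev_hash": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            expected = self._digest(prev_hash, entry["event"])
            if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True
```

In a production setting the latest chain hash would also be anchored somewhere the log writer cannot modify, since an attacker able to rewrite the whole list could otherwise recompute every hash.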

Results

Metric	Outcome
Processing time	40% faster
Infrastructure cost	15% OpEx reduction
Data accuracy	2x improvement
Compliance audits	0 critical findings (GDPR + PCI-DSS)

Key Lessons

Compliance is an architectural constraint, not a retrofit: Embedding data residency, masking, and audit logging from day one meant the platform passed audits without emergency remediation sprints.

Streaming unlocks more than low latency: Moving from batch to Kafka-driven ingestion eliminated entire classes of reconciliation problems that stemmed from data arriving out of order across gateways.
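The write-up does not describe how ordering was restored. One common approach in streaming systems is to order by event time and hold records back until a watermark passes them; the sketch below shows that idea in plain Python. The event tuple layout, the function name, and the allowed_lateness parameter are assumptions made for illustration.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical event shape: (event_time_seconds, gateway, txn_id)
Event = Tuple[int, str, str]


def reorder_by_event_time(events: Iterable[Event], allowed_lateness: int) -> Iterator[Event]:
    """Yield events in event-time order, releasing each once the watermark passes it."""
    buffer: list = []
    max_seen = None
    for event in events:
        heapq.heappush(buffer, event)
        max_seen = event[0] if max_seen is None else max(max_seen, event[0])
        # Watermark: we assume nothing older than this will still arrive.
        watermark = max_seen - allowed_lateness
        # An event that arrives later than allowed_lateness is released
        # immediately here; a real pipeline would route it to a late-data path.
        while buffer and buffer[0][0] <= watermark:
            yield heapq.heappop(buffer)
    # End of input: flush whatever is still buffered, in order.
    while buffer:
        yield heapq.heappop(buffer)
```

The trade-off sits in allowed_lateness: a larger value tolerates slower gateways but delays every record by that much, while a smaller value lowers latency at the cost of more late arrivals.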

ML for data quality before ML for analytics: The highest-ROI machine learning investment was anomaly correction in the pipeline, not a customer-facing model. Clean data multiplied the value of every downstream use case.