Taobao Instant Commerce: Real-Time Decisions at Scale with Apache Fluss

Every autumn in China, social media floods with posts about "The First Cup of Milk Tea in Autumn." With a tap on their phone, consumers expect their order delivered within 30 minutes. That effortless experience is no accident: it is the result of Taobao Instant Commerce making thousands of data-driven decisions every second.
Taobao Instant Commerce has scaled from a single-category food delivery service into a high-frequency platform spanning fresh produce, consumer electronics (3C), and beauty products. It operates under two very different modes: steady high-frequency daily transactions, and explosive traffic surges during promotional events where order volumes can multiply within minutes. Both demand the same thing: real-time responsiveness across hundreds of millions of SKUs.
Real-time is not a nice-to-have here; it is the lifeline for three critical functions:
- Operations: Conversion rates and funnels must refresh within 30 seconds.
- Algorithms: Order prediction models must iterate at minute-level granularity.
- Quality Assurance: Canary release anomalies must be detected within seconds and trigger instant alerts.
The existing pipeline (built on Kafka, Flink, Paimon, and StarRocks) handled these requirements at a smaller scale.
Note: In Alibaba's internal infrastructure, TT (TimeTunnel) is the internal equivalent of Apache Kafka, a high-throughput distributed message queue. Throughout this post, "Kafka" refers to TT in the Taobao Instant Commerce context.
As the business grew, however, three fundamental bottlenecks emerged: unbounded state growth from stream joins, mounting complexity in building multi-stream denormalized tables, and excessive resource consumption from lakehouse synchronization. Together they formed an impossible triangle: no matter how the team tuned the system, latency, consistency, and cost could not all be optimized at once.
Fluss broke this impasse. It replaces the fragmented stream-batch architecture with a unified storage layer, and its features (Delta Join, Partial Update, Streaming-Lakehouse Unification, Column Pruning, and Auto-Increment Columns) systematically eliminated all three bottlenecks and fundamentally reshaped how Taobao Instant Commerce handles real-time decision-making at scale.
The "Impossible Triangle" Dilemma: Business Pain Points & Technical Challenges
The real-time data system of Taobao Instant Commerce must reliably process ultra-large-scale, high-concurrency data streams. It must also build real-time denormalized tables (wide tables) that aggregate metrics across multiple business domains, and support core pipelines such as associating page view streams with order streams, as well as canary release monitoring. This imposes extreme requirements on latency, consistency, cost-efficiency, and system scalability.
Under the traditional technology stack, challenges across these four dimensions were deeply intertwined, ultimately forming the trilemma of latency, consistency, and cost described above.
The root cause lies in the stream-batch separation at the storage layer: Kafka, serving as the message queue, represents the "stream" abstraction, while Paimon, as the lakehouse format, represents the "batch" abstraction. Bridging the two relies heavily on numerous Flink ETL jobs acting as a glue layer. Each additional layer of glue compounds latency, increases costs, and heightens the risk to data consistency.
Four Core Business Issues
