Skip to main content

Primary Key Tables: Unifying Log and Cache for 🚀 Streaming

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Modern data platforms have traditionally relied on two foundational components: a log for durable, ordered event storage and a cache for low-latency access. Common architectures include combinations such as Kafka with Redis, or Debezium feeding changes into a key-value store. While these patterns underpin a significant portion of production infrastructure, they also introduce complexity, fragility, and operational overhead.

Apache Fluss (Incubating) addresses this challenge with an elegant solution: Primary Key Tables (PK Tables). These persistent state tables provide the same semantics as running both a log and a cache, without needing two separate systems. Every write produces a durable log entry and an immediately consistent key-value update. Snapshots and log replay guarantee deterministic recovery, while clients benefit from the simplicity of interacting with one system for reads, writes, and queries.

In this post, we will explore how Fluss PK Tables work, why unifying log and cache into a persistent design is a critical advancement, and how this model resolves long-standing challenges of maintaining consistency across multiple systems.

How Taobao uses Apache Fluss (Incubating) for Real-Time Processing in Search and RecSys

Xinyu Zhang
Xinyu Zhang
Senior Data Development Engineer of Taotian Group
Lilei Wang
Lilei Wang
Data Development Engineer of Taotian Group

Streaming Storage More Suitable for Real-Time OLAP

Introduction

The Data Development Team of Taobao has built a new generation of real-time data warehouse based on Apache Fluss. Fluss solves the problems of redundant data transfer, difficulties in data profiling, and challenges in large scale stateful workload operations and maintenance. By combining columnar storage with real-time update capabilities, Fluss supports column pruning, key-value point lookups, Delta Join, and seamless lake–stream integration, thereby cutting I/O and compute overhead while enhancing job stability and profiling efficiency.

Already deployed on Taobao’s A/B-testing platform for critical services such as search and recommendation, the system proved its resilience during the 618 Grand Promotion: it handled tens of millions of requests with sub-second latency, lowered resource usage by 30%, and removed more than 100 TB from state storage. Looking ahead, the team will continue to extend Fluss within a Lakehouse architecture and broaden its use across AI-driven workloads.

From Stream to Lake: Hands-On with Fluss Tiering into Paimon on Minio

Yang Guo
Contributor of Apache Fluss (Incubating)

Fluss stores historical data in a lakehouse storage layer while keeping real-time data in the Fluss server. Its built-in tiering service continuously moves fresh events into the lakehouse, allowing various query engines to analyze both hot and cold data. The real magic happens with Fluss's union-read capability, which lets Flink jobs seamlessly query both the Fluss cluster and the lakehouse for truly integrated real-time processing.

In this hands-on tutorial, we'll walk you through setting up a local Fluss lakehouse environment, running some practical data operations, and getting first-hand experience with the complete Fluss lakehouse architecture. By the end, you'll have a working environment for experimenting with Fluss's powerful data processing capabilities.

Fluss Joins the Apache Incubator

Jark Wu
PPMC member of Apache Fluss (Incubating)

On June 5th, Fluss, the next-generation streaming storage project open-sourced and donated by Alibaba, successfully passed the vote and officially became an incubator project of the Apache Software Foundation (ASF). This marks a significant milestone in the development of the Fluss community, symbolizing that the project has entered a new phase that is more open, neutral, and standardized. Moving forward, Fluss will leverage the ASF ecosystem to accelerate the building of a global developer community, continuously driving innovation and adoption of next-generation real-time data infrastructure.

ASF

Apache Fluss Java Client: A Deep Dive

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Banner

Introduction

Apache Fluss is a streaming data storage system built for real-time analytics, serving as a low-latency data layer in modern data Lakehouses. It supports sub-second streaming reads and writes, storing data in a columnar format for efficiency, and offers two flexible table types: append-only Log Tables and updatable Primary Key Tables. In practice, this means Fluss can ingest high-throughput event streams (using log tables) while also maintaining up-to-date reference data or state (using primary key tables), a combination ideal for scenarios like IoT, where you might stream sensor readings and look up information for those sensors in real-time, without the need for external K/V stores.

Tiering Service Deep Dive

Yang Guo
Contributor of Apache Fluss (Incubating)

Background

At the core of Fluss’s Lakehouse architecture sits the Tiering Service: a smart, policy-driven data pipeline that seamlessly bridges your real-time Fluss cluster and your cost-efficient lakehouse storage. It continuously ingests fresh events from the fluss cluster, automatically migrating older or less-frequently accessed data into colder storage tiers without interrupting ongoing queries. By balancing hot, warm, and cold storage according to configurable rules, the Tiering Service ensures that recent data remains instantly queryable while historical records are archived economically.

In this blog post we will take a deep dive and explore how Fluss’s Tiering Service orchestrates data movement, preserves consistency, and empowers scalable, high-performance analytics at optimized costs.

Announcing Fluss 0.7

Jark Wu
PPMC member of Apache Fluss (Incubating)

Banner

🌊 We are excited to announce the official release of Fluss 0.7!

This version has undergone extensive improvements in stability, architecture, performance optimization, and security, further enhancing its readiness for production environments. Over the past three months, we have completed more than 250 commits, making this release a significant milestone toward becoming a mature, production-grade streaming storage platform.

Understanding Partial Updates

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)

Banner

Traditional streaming data pipelines often need to join many tables or streams on a primary key to create a wide view. For example, imagine you’re building a real-time recommendation engine for an e-commerce platform. To serve highly personalized recommendations, your system needs a complete 360° view of each user, including: user preferences, past purchases, clickstream behavior, cart activity, product reviews, support tickets, ad impressions, and loyalty status.

That’s at least 8 different data sources, each producing updates independently.

The Story of Fluss Logo

Jark Wu
PPMC member of Apache Fluss (Incubating)

Introducing the Little Otter

Today is World Otter Day, and we are thrilled to introduce the little otter to the Fluss community! 🎉

Since open-sourced half a year ago, many community members and friends have asked us: "When will Fluss get a logo?" After more than a month of careful design work and over 30 iterations, we’re excited to finally unveil the official Fluss logo — a surfing otter! 🦦🌊

Announcing Fluss 0.6

Jark Wu
PPMC member of Apache Fluss (Incubating)

The Fluss community is pleased to announce the official release of Fluss 0.6.0. This version has undergone over three months of intensive development, bringing together the expertise and efforts of 45 contributors worldwide, with more than 200 code commits completed. Our heartfelt thanks go out to every contributor for their invaluable support!

Release Announcement