
From Stream to Lake: Hands-On with Fluss Tiering into Paimon on Minio

Yang Guo
Fluss Contributor

Fluss stores historical data in a lakehouse storage layer while keeping real-time data in the Fluss server. Its built-in tiering service continuously moves fresh events into the lakehouse, allowing various query engines to analyze both hot and cold data. The real magic happens with Fluss's union-read capability, which lets Flink jobs seamlessly query both the Fluss cluster and the lakehouse for truly integrated real-time processing.

In this hands-on tutorial, we'll walk you through setting up a local Fluss lakehouse environment, running some practical data operations, and getting first-hand experience with the complete Fluss lakehouse architecture. By the end, you'll have a working environment for experimenting with Fluss's powerful data processing capabilities.

Integrating with a Paimon Lakehouse on Minio

For this tutorial, we'll use Fluss 0.7 and Flink 1.20 to run the tiering service on a local cluster. We'll configure Paimon as our lake format, with Minio as the storage backend. Let's get started:

Minio Setup

  1. Install Minio object storage locally.

    Check out the official guide for detailed instructions.
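
    For example, on Linux you can grab the standalone server binary directly (a minimal sketch; pick the download that matches your platform from the MinIO docs):

    # download the MinIO server binary and put it on the PATH (Linux amd64 shown)
    wget https://dl.min.io/server/minio/release/linux-amd64/minio
    chmod +x minio
    sudo mv minio /usr/local/bin/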

  2. Start the Minio server

    Run this command, specifying a local path to store your Minio data:

    minio server /tmp/minio-data
  3. Verify the Minio WebUI.

    When your Minio server is up and running, you'll see endpoint information and login credentials:

    API: http://192.168.2.236:9000  http://127.0.0.1:9000
    RootUser: minioadmin
    RootPass: minioadmin

    WebUI: http://192.168.2.236:61832 http://127.0.0.1:61832
    RootUser: minioadmin
    RootPass: minioadmin

    Open the WebUI link and log in with these credentials.

  4. Create a fluss bucket through the WebUI.
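
    If you prefer the command line, the bucket can also be created with the MinIO client (mc) — a quick sketch, assuming mc is installed and the default credentials shown above:

    # register the local MinIO server and create the "fluss" bucket
    mc alias set local http://localhost:9000 minioadmin minioadmin
    mc mb local/fluss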

Fluss Cluster Setup

  1. Download Fluss

    Grab the Fluss 0.7 binary release from the Fluss official site.
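
    For example (a minimal sketch; substitute the actual link from the downloads page for <fluss-binary-url>):

    # download and unpack the Fluss 0.7.0 binary release
    wget <fluss-binary-url> -O fluss-0.7.0-bin.tgz
    tar -xzf fluss-0.7.0-bin.tgz
    # the extracted directory is referred to as <FLUSS_HOME> below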

  2. Add Dependencies

    Download the fluss-fs-s3-0.7.0.jar from the Fluss official site and place it in your <FLUSS_HOME>/lib directory.

    Next, download the paimon-s3-1.0.1.jar from the Paimon official site and add it to <FLUSS_HOME>/plugins/paimon.
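
    For example (a minimal sketch, assuming both jars have already been downloaded to the current directory):

    # S3 filesystem support for Fluss
    cp fluss-fs-s3-0.7.0.jar <FLUSS_HOME>/lib/
    # Paimon S3 bundle used by the lake integration
    mkdir -p <FLUSS_HOME>/plugins/paimon
    cp paimon-s3-1.0.1.jar <FLUSS_HOME>/plugins/paimon/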

  3. Configure the Data Lake

    Edit your <FLUSS_HOME>/conf/server.yaml file and add these settings:

    data.dir: /tmp/fluss-data
    remote.data.dir: /tmp/fluss-remote-data

    datalake.format: paimon
    datalake.paimon.metastore: filesystem
    datalake.paimon.warehouse: s3://fluss/data
    datalake.paimon.s3.endpoint: http://localhost:9000
    datalake.paimon.s3.access-key: minioadmin
    datalake.paimon.s3.secret-key: minioadmin
    datalake.paimon.s3.path.style.access: true

    This configures Paimon as the datalake format, with Minio serving as the warehouse storage.

  4. Start Fluss

    <FLUSS_HOME>/bin/local-cluster.sh start
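
    To sanity-check that the local cluster came up, you can list the running JVMs (a rough check; in 0.7 the local cluster starts a coordinator server and a tablet server process):

    # you should see the Fluss coordinator and tablet server JVMs in the output
    jps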

Flink Setup

  1. Download Flink

    Download the Flink 1.20 binary package from the Flink downloads page.

  2. Add the Fluss Connector

    Download fluss-flink-1.20-0.7.0.jar from the Fluss official site and copy it to:

    <FLINK_HOME>/lib
  3. Add Paimon Dependencies

    • Download paimon-flink-1.20-1.0.1.jar and paimon-s3-1.0.1.jar from the Paimon official site and place them in <FLINK_HOME>/lib.
    • Copy these Paimon plugin jars from Fluss into <FLINK_HOME>/lib:
    <FLUSS_HOME>/plugins/paimon/fluss-lake-paimon-0.7.0.jar
    <FLUSS_HOME>/plugins/paimon/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
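
    For example (a minimal sketch, assuming the Paimon jars were downloaded to the current directory):

    # Paimon runtime and S3 support for Flink
    cp paimon-flink-1.20-1.0.1.jar paimon-s3-1.0.1.jar <FLINK_HOME>/lib/
    # Paimon plugin jars shipped with the Fluss distribution
    cp <FLUSS_HOME>/plugins/paimon/fluss-lake-paimon-0.7.0.jar <FLINK_HOME>/lib/
    cp <FLUSS_HOME>/plugins/paimon/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar <FLINK_HOME>/lib/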
  4. Increase Task Slots

    Edit <FLINK_HOME>/conf/config.yaml to increase the available task slots per TaskManager:

    taskmanager:
      numberOfTaskSlots: 5
  5. Start Flink

    <FLINK_HOME>/bin/start-cluster.sh
  6. Verify

    Open your browser to http://localhost:8081/ and make sure the cluster is running.

Launching the Tiering Service

  1. Get the Tiering Job Jar

    Download the fluss-flink-tiering-0.7.0.jar.

  2. Submit the Job

    <FLINK_HOME>/bin/flink run \
    <path_to_jar>/fluss-flink-tiering-0.7.0.jar \
    --fluss.bootstrap.servers localhost:9123 \
    --datalake.format paimon \
    --datalake.paimon.metastore filesystem \
    --datalake.paimon.warehouse s3://fluss/data \
    --datalake.paimon.s3.endpoint http://localhost:9000 \
    --datalake.paimon.s3.access-key minioadmin \
    --datalake.paimon.s3.secret-key minioadmin \
    --datalake.paimon.s3.path.style.access true
  3. Confirm Deployment

    Check the Flink UI for the Fluss Lake Tiering Service job. Once it's running, your local tiering pipeline is good to go.
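
    You can also confirm it from the command line; flink list prints the jobs currently running on the cluster:

    # the tiering job should be listed as RUNNING
    <FLINK_HOME>/bin/flink list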

Data Processing

Now let's dive into some actual data processing. We'll use the Flink SQL Client to interact with our Fluss lakehouse and run both batch and streaming queries.

  1. Launch the SQL Client

    <FLINK_HOME>/bin/sql-client.sh
  2. Create the Catalog and Table

    CREATE CATALOG fluss_catalog WITH (
      'type' = 'fluss',
      'bootstrap.servers' = 'localhost:9123'
    );

    USE CATALOG fluss_catalog;

    CREATE TABLE t_user (
      `id` BIGINT,
      `name` STRING NOT NULL,
      `age` INT,
      `birth` DATE,
      PRIMARY KEY (`id`) NOT ENFORCED
    ) WITH (
      'table.datalake.enabled' = 'true',
      'table.datalake.freshness' = '30s'
    );
  3. Write Some Data

    Let's insert a couple of records:

    SET 'execution.runtime-mode' = 'batch';
    SET 'sql-client.execution.result-mode' = 'tableau';

    INSERT INTO t_user(id,name,age,birth) VALUES
    (1,'Alice',18,DATE '2000-06-10'),
    (2,'Bob',20,DATE '2001-06-20');
  4. Union Read

    Now run a simple query to retrieve data from the table. By default, Flink will automatically combine data from both the Fluss cluster and the lakehouse:

    Flink SQL> select * from t_user;
    +----+-------+-----+------------+
    | id |  name | age |      birth |
    +----+-------+-----+------------+
    |  1 | Alice |  18 | 2000-06-10 |
    |  2 |   Bob |  20 | 2001-06-20 |
    +----+-------+-----+------------+

    If you want to read data only from the lake table, simply append $lake after the table name:

    Flink SQL> select * from t_user$lake;
    +----+-------+-----+------------+----------+----------+----------------------------+
    | id |  name | age |      birth | __bucket | __offset |                __timestamp |
    +----+-------+-----+------------+----------+----------+----------------------------+
    |  1 | Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
    |  2 |   Bob |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
    +----+-------+-----+------------+----------+----------+----------------------------+

    Great! Our records have been successfully synced to the data lake by the tiering service.

    Notice the three system columns in the Paimon lake table: __bucket, __offset, and __timestamp. The __bucket column shows which bucket contains this row. The __offset and __timestamp columns are used for streaming data processing.

  5. Streaming Inserts

    Let's switch to streaming mode and add two more records:

    Flink SQL> SET 'execution.runtime-mode' = 'streaming';

    Flink SQL> INSERT INTO t_user(id,name,age,birth) VALUES
    (3,'Catlin',25,DATE '2002-06-10'),
    (4,'Dylan',28,DATE '2003-06-20');

    Now query the lake again:

    Flink SQL> select * from t_user$lake;
    +----+----+--------+-----+------------+----------+----------+----------------------------+
    | op | id |   name | age |      birth | __bucket | __offset |                __timestamp |
    +----+----+--------+-----+------------+----------+----------+----------------------------+
    | +I |  1 |  Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
    | +I |  2 |    Bob |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |


    Flink SQL> select * from t_user$lake;
    +----+----+--------+-----+------------+----------+----------+----------------------------+
    | op | id |   name | age |      birth | __bucket | __offset |                __timestamp |
    +----+----+--------+-----+------------+----------+----------+----------------------------+
    | +I |  1 |  Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
    | +I |  2 |    Bob |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
    | +I |  3 | Catlin |  25 | 2002-06-10 |        0 |        2 | 2025-07-19 19:03:54.150000 |
    | +I |  4 |  Dylan |  28 | 2003-06-20 |        0 |        3 | 2025-07-19 19:03:54.150000 |

    The first time we queried, our new records hadn't been synced to the lake table yet. After waiting a moment, they appeared.

    Notice that the __offset and __timestamp values for these new records are no longer the default values. They now show the actual offset and timestamp when the records were added to the table.

  6. Inspect the Paimon Files

    Open the Minio WebUI, and you'll see the Paimon files in your fluss bucket.

    You can also check the Parquet files and manifest in your local filesystem under /tmp/minio-data:

    /tmp/minio-data ❯ tree .
    .
    └── fluss
        └── data
            ├── default.db__XLDIR__
            │   └── xl.meta
            └── fluss.db
                └── t_user
                    ├── bucket-0
                    │   ├── changelog-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-0.parquet
                    │   │   └── xl.meta
                    │   ├── changelog-f1853f1c-2588-4035-8233-e4804b1d8344-0.parquet
                    │   │   └── xl.meta
                    │   ├── data-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-1.parquet
                    │   │   └── xl.meta
                    │   └── data-f1853f1c-2588-4035-8233-e4804b1d8344-1.parquet
                    │       └── xl.meta
                    ├── manifest
                    │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-0
                    │   │   └── xl.meta
                    │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-1
                    │   │   └── xl.meta
                    │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-0
                    │   │   └── xl.meta
                    │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-1
                    │   │   └── xl.meta
                    │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-0
                    │   │   └── xl.meta
                    │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-1
                    │   │   └── xl.meta
                    │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-2
                    │   │   └── xl.meta
                    │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-0
                    │   │   └── xl.meta
                    │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-1
                    │   │   └── xl.meta
                    │   └── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-2
                    │       └── xl.meta
                    ├── schema
                    │   └── schema-0
                    │       └── xl.meta
                    └── snapshot
                        ├── LATEST
                        │   └── xl.meta
                        ├── snapshot-1
                        │   └── xl.meta
                        └── snapshot-2
                            └── xl.meta

    28 directories, 19 files
  7. View Snapshots

    You can also check the snapshots from the system table by appending $lake$snapshots after the Fluss table name:

    Flink SQL> select * from t_user$lake$snapshots;

    +-------------+-----------+----------------------+-------------------------+-------------+----------+
    | snapshot_id | schema_id |          commit_user |             commit_time | commit_kind |      ... |
    +-------------+-----------+----------------------+-------------------------+-------------+----------+
    |           1 |         0 | __fluss_lake_tiering | 2025-07-19 19:00:41.286 |      APPEND |      ... |
    |           2 |         0 | __fluss_lake_tiering | 2025-07-19 19:04:38.964 |      APPEND |      ... |
    +-------------+-----------+----------------------+-------------------------+-------------+----------+
    2 rows in set (0.33 seconds)

Summary

In this guide, we've explored the Fluss lakehouse architecture and set up a complete local environment with Fluss, Flink, Paimon, and Minio. We've walked through practical examples of data processing that showcase how Fluss seamlessly integrates real-time and historical data. With this setup, you now have a solid foundation for experimenting with Fluss's powerful lakehouse capabilities on your own machine.

Fluss Joins the Apache Incubator

Jark Wu
PPMC member of Apache Fluss (Incubating)

On June 5th, Fluss, the next-generation streaming storage project open-sourced and donated by Alibaba, successfully passed the vote and officially became an incubator project of the Apache Software Foundation (ASF). This marks a significant milestone in the development of the Fluss community, symbolizing that the project has entered a new phase that is more open, neutral, and standardized. Moving forward, Fluss will leverage the ASF ecosystem to accelerate the building of a global developer community, continuously driving innovation and adoption of next-generation real-time data infrastructure.


Apache Fluss Java Client: A Deep Dive

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)


Introduction

Apache Fluss is a streaming data storage system built for real-time analytics, serving as a low-latency data layer in modern data Lakehouses. It supports sub-second streaming reads and writes, stores data in a columnar format for efficiency, and offers two flexible table types: append-only Log Tables and updatable Primary Key Tables. In practice, this means Fluss can ingest high-throughput event streams (using log tables) while also maintaining up-to-date reference data or state (using primary key tables), a combination ideal for scenarios like IoT, where you might stream sensor readings and look up information for those sensors in real-time, without the need for external K/V stores.

Tiering Service Deep Dive

Yang Guo
Fluss Contributor

Background

At the core of Fluss’s Lakehouse architecture sits the Tiering Service: a smart, policy-driven data pipeline that seamlessly bridges your real-time Fluss cluster and your cost-efficient lakehouse storage. It continuously ingests fresh events from the Fluss cluster, automatically migrating older or less-frequently accessed data into colder storage tiers without interrupting ongoing queries. By balancing hot, warm, and cold storage according to configurable rules, the Tiering Service ensures that recent data remains instantly queryable while historical records are archived economically.

In this blog post we will take a deep dive and explore how Fluss’s Tiering Service orchestrates data movement, preserves consistency, and empowers scalable, high-performance analytics at optimized costs.

Announcing Fluss 0.7

Jark Wu
PPMC member of Apache Fluss (Incubating)


🌊 We are excited to announce the official release of Fluss 0.7!

This version has undergone extensive improvements in stability, architecture, performance optimization, and security, further enhancing its readiness for production environments. Over the past three months, we have completed more than 250 commits, making this release a significant milestone toward becoming a mature, production-grade streaming storage platform.

Understanding Partial Updates

Giannis Polyzos
PPMC member of Apache Fluss (Incubating)


Traditional streaming data pipelines often need to join many tables or streams on a primary key to create a wide view. For example, imagine you’re building a real-time recommendation engine for an e-commerce platform. To serve highly personalized recommendations, your system needs a complete 360° view of each user, including: user preferences, past purchases, clickstream behavior, cart activity, product reviews, support tickets, ad impressions, and loyalty status.

That’s at least 8 different data sources, each producing updates independently.

The Story of Fluss Logo

Jark Wu
PPMC member of Apache Fluss (Incubating)

Introducing the Little Otter

Today is World Otter Day, and we are thrilled to introduce the little otter to the Fluss community! 🎉

Since it was open-sourced half a year ago, many community members and friends have asked us: "When will Fluss get a logo?" After more than a month of careful design work and over 30 iterations, we’re excited to finally unveil the official Fluss logo — a surfing otter! 🦦🌊

Announcing Fluss 0.6

Jark Wu
PPMC member of Apache Fluss (Incubating)

The Fluss community is pleased to announce the official release of Fluss 0.6.0. This version has undergone over three months of intensive development, bringing together the expertise and efforts of 45 contributors worldwide, with more than 200 code commits completed. Our heartfelt thanks go out to every contributor for their invaluable support!


Towards A Unified Streaming & Lakehouse Architecture

Luo Yuxia
PPMC member of Apache Fluss (Incubating)

The unification of Lakehouse and streaming storage represents a major trend in the future development of modern data lakes and streaming storage systems. Designed specifically for real-time analytics, Fluss has embraced a unified Streaming and Lakehouse architecture from its inception, enabling seamless integration into existing Lakehouse architectures.

Fluss is designed to address the demands of real-time analytics with the following key capabilities:

  • Real-Time Stream Reading and Writing: Supports millisecond-level end-to-end latency.
  • Columnar Stream: Optimizes storage and query efficiency.
  • Streaming Updates: Enables low-latency updates to data streams.
  • Changelog Generation: Supports changelog generation and consumption.
  • Real-Time Lookup Queries: Facilitates instant lookup queries on primary keys.
  • Streaming & Lakehouse Unification: Seamlessly integrates streaming and lakehouse storage for unified data processing.

Introducing Fluss: Streaming Storage for Real-Time Analytics

Jark Wu
PPMC member of Apache Fluss (Incubating)

We have discussed the challenges of using Kafka for real-time analytics in our previous blog post. Today, we are excited to introduce Fluss, a cutting-edge streaming storage system designed to power real-time analytics. We are going to explore Fluss's architecture, design principles, key features, and how it addresses the challenges of using Kafka for real-time analytics.