Version: Next

Lance

Introduction

Lance is a modern table format optimized for machine learning and AI applications. To integrate Fluss with Lance, you must enable lakehouse storage and configure Lance as the lakehouse storage. For more details, see Enable Lakehouse Storage.

Configure Lance as Lakehouse Storage

Configure Lance in Cluster Configurations

To configure Lance as the lakehouse storage, set the following options in server.yaml:

# Lance configuration
datalake.format: lance

# Currently, Lance supports only the local file system and object stores such as AWS S3 (and S3-compatible stores) as storage backends.
# To use S3 as the Lance storage backend, specify the following properties:
datalake.lance.warehouse: s3://<bucket>
datalake.lance.endpoint: <endpoint>
datalake.lance.allow_http: true
datalake.lance.access_key_id: <access_key_id>
datalake.lance.secret_access_key: <secret_access_key>

# To use the local file system as the Lance storage backend, specify only the following property:
# datalake.lance.warehouse: /tmp/lance

When a table is created or altered with the option 'table.datalake.enabled' = 'true', Fluss automatically creates a corresponding Lance table at the path <warehouse_path>/<database_name>/<table_name>.lance. The schema of the Lance table matches that of the Fluss table.

Flink SQL
USE CATALOG fluss_catalog;

CREATE TABLE fluss_order_with_lake (
`order_id` BIGINT,
`item_id` BIGINT,
`amount` INT,
`address` STRING
) WITH (
'table.datalake.enabled' = 'true',
'table.datalake.freshness' = '30s'
);
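
The path layout described above, <warehouse_path>/<database_name>/<table_name>.lance, can be sketched as a small helper. This is only an illustration of the naming rule: the function name lance_table_path is hypothetical, and the database name "fluss" in the usage line is an assumed example value, not something Fluss requires.

```python
def lance_table_path(warehouse: str, database: str, table: str) -> str:
    """Build <warehouse_path>/<database_name>/<table_name>.lance."""
    # Strip a trailing slash so the parts join cleanly for both
    # local paths (/tmp/lance) and object-store URIs (s3://<bucket>).
    return f"{warehouse.rstrip('/')}/{database}/{table}.lance"


# "fluss" is an assumed database name used only for illustration.
print(lance_table_path("/tmp/lance", "fluss", "fluss_order_with_lake"))
# /tmp/lance/fluss/fluss_order_with_lake.lance
```

The resulting string is the location you would hand to a Lance reader once the tiering service has written data.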

Start Tiering Service to Lance

Next, start the datalake tiering service to tier data from Fluss to Lance. For guidance, refer to Start The Datalake Tiering Service. Although the example there uses Paimon, the same process applies to Lance.

In the Prepare required jars step, however, follow this guidance:

Additionally, when following the Start Datalake Tiering Service guide, pass the Lance-specific configurations as parameters when starting the Flink tiering job:

<FLINK_HOME>/bin/flink run /path/to/fluss-flink-tiering-0.8-SNAPSHOT.jar \
--fluss.bootstrap.servers localhost:9123 \
--datalake.format lance \
--datalake.lance.warehouse s3://<bucket> \
--datalake.lance.endpoint <endpoint> \
--datalake.lance.allow_http true \
--datalake.lance.secret_access_key <secret_access_key> \
--datalake.lance.access_key_id <access_key_id>
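
The command above passes each datalake.* option as a --key value pair, using the same keys as in server.yaml. As a rough sketch of that pattern, the hypothetical helper tiering_args below (not part of Fluss or Flink) turns a dictionary of options into the argument list. Values are kept as strings so that booleans are emitted as true rather than Python's True.

```python
def tiering_args(bootstrap_servers: str, lake_conf: dict) -> list:
    """Turn lakehouse options into `--key value` arguments for the tiering job."""
    args = ["--fluss.bootstrap.servers", bootstrap_servers]
    for key, value in lake_conf.items():
        # Each option is passed as a flag named after its configuration key.
        args += [f"--{key}", str(value)]
    return args


# Example values for a local warehouse; adjust to your environment.
lake_conf = {
    "datalake.format": "lance",
    "datalake.lance.warehouse": "/tmp/lance",
}
cmd = [
    "<FLINK_HOME>/bin/flink", "run",
    "/path/to/fluss-flink-tiering-0.8-SNAPSHOT.jar",
    *tiering_args("localhost:9123", lake_conf),
]
print(" ".join(cmd))
```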

NOTE: Fluss v0.8 only supports tiering log tables to Lance.

Once started, the datalake tiering service continuously tiers data from Fluss to Lance. The table option table.datalake.freshness controls how often Fluss writes data to the Lance table. The default freshness is 3 minutes.

You can also specify Lance table properties when creating a datalake-enabled Fluss table by using the lance. prefix within the Fluss table properties clause.

Flink SQL
CREATE TABLE fluss_order_with_lake (
`order_id` BIGINT,
`item_id` BIGINT,
`amount` INT,
`address` STRING
) WITH (
'table.datalake.enabled' = 'true',
'table.datalake.freshness' = '30s',
'lance.max_row_per_file' = '512'
);

For example, you can specify the property max_row_per_file to control the writing behavior when Fluss tiers data to Lance.

Reading with Lance ecosystem tools

Since the data tiered to Lance from Fluss is stored as a standard Lance table, you can use any tool that supports Lance to read it. Below is an example using pylance:

Lance Python
import lance
ds = lance.dataset("<warehouse_path>/<database_name>/<table_name>.lance")
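
A slightly fuller sketch of reading the tiered table with pylance follows. It assumes the table has the order_id and amount columns from the example DDL. The function name read_orders and its min_amount parameter are hypothetical. The storage_options keys mentioned in the note mirror the Fluss configuration names and should be checked against the Lance object store documentation for your version.

```python
def read_orders(uri, storage_options=None, min_amount=0):
    """Read selected columns of a tiered Lance table into a pyarrow Table."""
    import lance  # pylance; imported here so the sketch loads without it installed

    # For S3 or S3-compatible stores, pass credentials via storage_options.
    ds = lance.dataset(uri, storage_options=storage_options)

    # Inspect the Arrow schema and the number of rows tiered so far.
    print(ds.schema)
    print(ds.count_rows())

    # Project two columns and push a SQL-style filter down to the scan.
    return ds.to_table(
        columns=["order_id", "amount"],
        filter=f"amount >= {int(min_amount)}",
    )
```

For a local warehouse, a call might look like read_orders("<warehouse_path>/<database_name>/<table_name>.lance"). For S3, a storage_options dictionary with entries such as access_key_id, secret_access_key, endpoint, and allow_http would typically be needed.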

Data Type Mapping

Lance internally stores data in Arrow format. When integrating with Lance, Fluss automatically converts between Fluss data types and Lance data types.
The following table shows the mapping between Fluss data types and Lance data types:

| Fluss Data Type | Lance Data Type |
|---|---|
| BOOLEAN | Bool |
| TINYINT | Int8 |
| SMALLINT | Int16 |
| INT | Int32 |
| BIGINT | Int64 |
| FLOAT | Float32 |
| DOUBLE | Float64 |
| DECIMAL | Decimal128 |
| STRING | Utf8 |
| CHAR | Utf8 |
| DATE | Date |
| TIME | Time |
| TIMESTAMP | Timestamp |
| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp |
| BINARY | FixedSizeBinary |
| BYTES | Binary |
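
For scripts that need to check a schema against this mapping, the table can be expressed as a lookup. This is a minimal sketch that only transcribes the table above; the names FLUSS_TO_LANCE and lance_type_for are hypothetical, and the parameter handling (dropping precision or length such as DECIMAL(10, 2)) is an assumption made for convenience, not Fluss behavior.

```python
# Transcription of the Fluss -> Lance data type table.
FLUSS_TO_LANCE = {
    "BOOLEAN": "Bool",
    "TINYINT": "Int8",
    "SMALLINT": "Int16",
    "INT": "Int32",
    "BIGINT": "Int64",
    "FLOAT": "Float32",
    "DOUBLE": "Float64",
    "DECIMAL": "Decimal128",
    "STRING": "Utf8",
    "CHAR": "Utf8",
    "DATE": "Date",
    "TIME": "Time",
    "TIMESTAMP": "Timestamp",
    "TIMESTAMP WITH LOCAL TIMEZONE": "Timestamp",
    "BINARY": "FixedSizeBinary",
    "BYTES": "Binary",
}


def lance_type_for(fluss_type: str) -> str:
    """Look up the Lance type name for a Fluss type name."""
    # Drop any parameters, e.g. DECIMAL(10, 2) -> DECIMAL, then normalize
    # case and whitespace before the lookup.
    base = fluss_type.split("(", 1)[0]
    base = " ".join(base.upper().split())
    return FLUSS_TO_LANCE[base]


print(lance_type_for("DECIMAL(10, 2)"))  # Decimal128
print(lance_type_for("bigint"))          # Int64
```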