Version: 1.2.0

Batch Reads

Spark DataSource API

The hudi-spark module offers the DataSource API to read a Hudi table into a Spark DataFrame.

For example, a time-travel query reads the table as it existed at a given instant:

// Read the table as of the given instant, then filter on the fare column
val tripsDF = spark.read.
    option("as.of.instant", "2021-07-28 14:11:08.000").
    format("hudi").
    load(basePath)
tripsDF.where(tripsDF("fare") > 20.0).show()
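The same read can be issued from PySpark. The following is a minimal sketch rather than a canonical Hudi API: the helper names `hudi_read_options` and `read_hudi` are hypothetical, `spark` is assumed to be an existing SparkSession with the Hudi Spark bundle on its classpath, and `base_path` is assumed to point at an existing Hudi table. When `as.of.instant` is omitted, the read is a plain snapshot query of the latest state.

```python
from typing import Dict, Optional


def hudi_read_options(as_of_instant: Optional[str] = None) -> Dict[str, str]:
    """Build DataSource read options (hypothetical helper, not part of Hudi)."""
    options: Dict[str, str] = {}
    if as_of_instant is not None:
        # "as.of.instant" is the time-travel option from the Scala example.
        options["as.of.instant"] = as_of_instant
    return options


def read_hudi(spark, base_path: str, as_of_instant: Optional[str] = None):
    """Load a Hudi table into a DataFrame.

    `spark` is assumed to be an existing SparkSession; it is passed in so
    this sketch does not create or configure a session itself.
    """
    return (
        spark.read.format("hudi")
        .options(**hudi_read_options(as_of_instant))
        .load(base_path)
    )
```

Under these assumptions, `read_hudi(spark, base_path)` would return the latest snapshot, and `read_hudi(spark, base_path, "2021-07-28 14:11:08.000")` would return the table as of that instant; the result can then be filtered with, for example, `df.where(df["fare"] > 20.0)`.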

Flink Batch (Snapshot) Read

Flink can read a Hudi table as a snapshot (batch) query by leaving read.streaming.enabled at its default value of false.

CREATE TABLE hudi_table (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = '${path}',
  'table.type' = 'MERGE_ON_READ'
  -- read.streaming.enabled defaults to false → batch/snapshot read
);

-- Snapshot query
SELECT * FROM hudi_table WHERE age > 25;
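The DDL and query above can also be submitted programmatically. The sketch below is illustrative only: `hudi_snapshot_ddl` and `run_snapshot_query` are hypothetical helper names, and `t_env` is assumed to be a PyFlink `TableEnvironment` created in batch mode with the Hudi Flink bundle available. Here `read.streaming.enabled` is set to `false` explicitly, which matches the default and simply makes the batch intent visible.

```python
def hudi_snapshot_ddl(table_name: str, path: str,
                      table_type: str = "MERGE_ON_READ") -> str:
    """Render the CREATE TABLE statement for a batch (snapshot) read.

    Hypothetical helper: the schema mirrors the SQL example above.
    """
    return f"""
CREATE TABLE {table_name} (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
)
PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = '{path}',
  'table.type' = '{table_type}',
  'read.streaming.enabled' = 'false'
)
""".strip()


def run_snapshot_query(t_env, ddl: str, query: str):
    """Register the table, then run the query and return its result.

    `t_env` is assumed to be a PyFlink TableEnvironment, for example one
    created with TableEnvironment.create(EnvironmentSettings.in_batch_mode()).
    """
    t_env.execute_sql(ddl)
    return t_env.execute_sql(query)
```

With such an environment, `run_snapshot_query(t_env, hudi_snapshot_ddl("hudi_table", path), "SELECT * FROM hudi_table WHERE age > 25")` would issue the same snapshot query as the SQL above, where `path` stands for the table's base path.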

For more Flink read options, see Using Flink.

Daft

Daft supports reading Hudi tables using the daft.read_hudi() function.

# Read Apache Hudi table into a Daft DataFrame.
import daft

df = daft.read_hudi("some-table-uri")
df = df.where(df["foo"] > 5)
df.show()

See the Daft docs for details on its Hudi integration.
