Release 0.15.0
The Apache Hudi 0.15.0 release brings enhanced engine integration, new features, and improvements in several areas, including Spark 3.5 and Scala 2.13 support, Flink 1.18 support, and better Trino Hudi native connector support enabled by the newly introduced Hadoop-agnostic storage and I/O abstractions. We encourage users to review the Release Highlights and Migration Guide below for the relevant module, API, and behavior changes before using the 0.15.0 release.
Migration Guide
This release keeps the same table version (6) as the 0.14.0 release, so there is no need for a table version upgrade if you are upgrading from 0.14.0. There are a few module and API changes and behavior changes as described below, and users are expected to take action accordingly before using the 0.15.0 release.
If migrating from an older release (pre-0.14.0), please also check the upgrade instructions from each older release in sequence.
Bundle Updates
New Spark Bundles
We have expanded Hudi support to Spark 3.5 with two new bundles:
- Spark 3.5 and Scala 2.12: hudi-spark3.5-bundle_2.12
- Spark 3.5 and Scala 2.13: hudi-spark3.5-bundle_2.13
New Utilities Bundles for Scala 2.13
Besides adding a bundle for Spark 3.5 and Scala 2.13, we have added new utilities bundles for Scala 2.13: hudi-utilities-bundle_2.13 and hudi-utilities-slim-bundle_2.13.
New and Deprecated Flink Bundles
We have expanded Hudi support to Flink 1.18 with a new bundle, hudi-flink1.18-bundle. This release removes Hudi support for Flink 1.13.
Module and API Changes
Hudi Storage and I/O Abstractions
This release introduces new storage and I/O abstractions that are Hadoop-agnostic to improve integration with query engines, including Trino, which uses its own native File System APIs. Core Hudi classes, including HoodieTableMetaClient, HoodieBaseFile, HoodieLogFile, HoodieEngineContext, etc., now depend on the new storage and I/O classes. If you are using these classes directly in your applications, you need to change your integration code and usage. For more details, check out this section.
Module Changes
As part of introducing new storage and I/O abstractions and making core reader logic Hadoop-agnostic, this release restructures the Hudi modules to clearly reflect the layering. Specifically,
- hudi-io module is added for I/O-related functionality, and the Hudi-native HFile reader implementation sits inside this new module;
- hudi-common module contains the core implementation of the Apache Hudi Technical Specification and is now Hadoop-independent;
- hudi-hadoop-common module contains the implementation based on Hadoop file system APIs to be used with the hudi-common module on engines including Spark, Flink, Hive, and Presto.
If you previously used the hudi-common module as a dependency together with Hadoop file system APIs and implementations, you should now include all three modules, hudi-io, hudi-common, and hudi-hadoop-common, as dependencies. Note that Presto and Trino will be released based on the Hudi 0.15.0 release with these changes.
Lock Provider API Change
The LockProvider instantiation now expects a StorageConfiguration instance as the second argument of the constructor. If you previously extended LockProvider to implement a custom lock provider, you need to change the constructor to match this signature. Here's an example:
public class XYZLockProvider implements LockProvider<String>, Serializable {
...
public XYZLockProvider(final LockConfiguration lockConfiguration, final StorageConfiguration<?> conf) {
...
}
...
}
Behavior Changes
Improving Cleaning Table Service
We have improved the default cleaner behavior to only schedule a new cleaner plan if there is no inflight plan, by flipping the default of hoodie.clean.allow.multiple from true to false. This simplifies the cleaning table service when the metadata table is enabled. The config is now deprecated and will be removed after the next release.
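If a pipeline still relies on the previous behavior of allowing multiple clean plans to be scheduled, the config can be set back explicitly. Below is a minimal sketch using the Spark datasource writer in Java; the table name and base path are illustrative and not part of this release.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class CleanPlanConfigExample {
  // Re-enables scheduling multiple clean plans (the pre-0.15.0 default) for writes to an existing table.
  public static void writeAllowingMultipleCleanPlans(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "trips")            // illustrative table name
        .option("hoodie.clean.allow.multiple", "true")   // flip the new default (false) back
        .mode(SaveMode.Append)
        .save(basePath);
  }
}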
Allowing Duplicates on Inserts
We now allow duplicate keys on the INSERT operation by default, even if inserts are routed to merge with an existing file (to ensure file sizing), by flipping the default of hoodie.merge.allow.duplicate.on.inserts from false to true. This is only relevant to the INSERT operation, since the UPSERT and DELETE operations always enforce the unique key constraint.
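To retain the previous de-duplicating behavior for INSERT operations, the config can be flipped back on the write. A minimal sketch with the Spark datasource writer in Java; the base path is illustrative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class InsertDedupConfigExample {
  // Restores the pre-0.15.0 behavior of de-duplicating keys when inserts are merged into existing files.
  public static void insertWithoutDuplicates(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.write.operation", "insert")        // INSERT operation
        .option("hoodie.merge.allow.duplicate.on.inserts", "false")   // flip the new default (true) back
        .mode(SaveMode.Append)
        .save(basePath);
  }
}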
Sync MOR Snapshot to The Metastore
To better support snapshot queries on MOR tables on OLAP engines, the MOR snapshot (RT) is now synced to the metastore with the table name by default, by flipping the default of hoodie.meta.sync.sync_snapshot_with_table_name from false to true.
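If you need to keep the previous metastore naming for the MOR snapshot, the config can be set back to false wherever meta sync is configured. A minimal sketch via Spark datasource write options in Java; the hive sync option shown is illustrative and your metastore setup may differ.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class MetaSyncConfigExample {
  // Keeps the pre-0.15.0 behavior of not syncing the MOR snapshot under the table name.
  public static void writeWithLegacySnapshotSync(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.hive_sync.enable", "true")               // enable meta sync (illustrative)
        .option("hoodie.meta.sync.sync_snapshot_with_table_name", "false")  // restore the previous default
        .mode(SaveMode.Append)
        .save(basePath);
  }
}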
Flink Option Default Flips
Before this release, the default value of read.streaming.skip_clustering was false, which could cause Flink streaming reads to read the replaced file slices of clustering and hence duplicate data (the same concern applies to read.streaming.skip_compaction). The 0.15.0 release makes Flink streaming reads skip clustering and compaction instants in all cases to avoid reading the relevant file slices, by flipping the defaults of read.streaming.skip_clustering and read.streaming.skip_compaction from false to true.
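If a streaming job intentionally needs the old behavior of also consuming the file slices written by clustering or compaction (keeping in mind this can surface duplicate data), the options can be flipped back in the table definition. A minimal sketch using the Flink Table API in Java; the schema and table path are illustrative.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkStreamingReadExample {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // Flips the new defaults (true) back to false so that clustering and compaction
    // instants are not skipped during streaming read.
    tEnv.executeSql(
        "CREATE TABLE hudi_source (\n"
            + "  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,\n"
            + "  name VARCHAR(10),\n"
            + "  ts TIMESTAMP(3)\n"
            + ") WITH (\n"
            + "  'connector' = 'hudi',\n"
            + "  'path' = 'file:///tmp/hudi_table',\n"
            + "  'read.streaming.enabled' = 'true',\n"
            + "  'read.streaming.skip_clustering' = 'false',\n"
            + "  'read.streaming.skip_compaction' = 'false'\n"
            + ")");
    tEnv.executeSql("SELECT * FROM hudi_source").print();
  }
}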
Release Highlights
Hudi Storage and I/O Abstractions
To provide a better integration experience with query engines, including Trino, which uses its own native File System APIs, this release introduces new storage and I/O abstractions that are Hadoop-agnostic.
Specifically, the release introduces the Hudi storage abstraction HoodieStorage, which provides all I/O APIs to read and write files and directories on storage, such as open, read, etc. This class can be extended to implement storage layer optimizations like caching, federated storage layout, hot/cold storage separation, etc. The class needs to be implemented based on a particular file system, such as Hadoop's FileSystem or Trino's TrinoFileSystem.
Core classes are introduced for accessing file systems:
- StoragePath: represents the path of a file or directory on storage, replacing Hadoop's Path.
- StoragePathInfo: keeps the path, length, isDirectory, modification time, and other information used by Hudi, replacing Hadoop's FileStatus.
- StorageConfiguration: provides the storage configuration by wrapping the particular configuration class object used by the corresponding file system.
The HoodieIOFactory abstraction is introduced to provide APIs to create readers and writers for I/O without depending on Hadoop classes.
By using the new storage and I/O abstractions, we make the hudi-common module and the core reader logic in Hudi Hadoop-independent in this release. We have introduced a new hudi-hadoop-common module which contains the implementation of HoodieStorage and HoodieIOFactory based on Hadoop's file system APIs and implementation, as well as the existing reader and writer logic that depends on Hadoop APIs. The hudi-hadoop-common module is used by the Spark, Flink, Hive, and Presto integrations, where the logic remains unchanged.
For an engine that is independent of Hadoop, the integration should use the hudi-common module and plug in its own implementations of HoodieStorage and HoodieIOFactory by setting the new configs hoodie.storage.class and hoodie.io.factory.class through the storage configuration.
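For illustration, such an integration would point the two new configs at its own classes. The sketch below only shows the config values as plain properties; the class names com.example.MyHoodieStorage and com.example.MyHoodieIOFactory are hypothetical stand-ins for an engine's Hadoop-independent implementations, and how these values reach the storage configuration depends on the specific engine integration.

import java.util.Properties;

public class CustomStorageConfigExample {
  // Hypothetical implementations of the new abstractions; not part of the Hudi release.
  public static Properties customStorageProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.storage.class", "com.example.MyHoodieStorage");
    props.setProperty("hoodie.io.factory.class", "com.example.MyHoodieIOFactory");
    return props;
  }
}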
Engine Support
Spark 3.5 and Scala 2.13 Support
This release adds Spark 3.5 and Scala 2.13 support; users on Spark 3.5 can use the new Spark bundle matching their Scala version: hudi-spark3.5-bundle_2.12 or hudi-spark3.5-bundle_2.13. Spark 3.4, 3.3, 3.2, 3.1, 3.0, and 2.4 continue to be supported in this release. To quickly get started with Hudi and Spark 3.5, you can explore our quick start guide.
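For example, once the Spark 3.5 bundle matching your Scala version is on the classpath, an existing Hudi table can be queried with the regular datasource API. A minimal sketch in Java; the table path is illustrative, and the job is assumed to be launched via spark-submit with the bundle jar.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSnapshotReadExample {
  public static void main(String[] args) {
    // Requires hudi-spark3.5-bundle_2.12 (or _2.13) on the classpath.
    SparkSession spark = SparkSession.builder()
        .appName("hudi-snapshot-read")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .getOrCreate();
    // Snapshot query on an existing Hudi table.
    Dataset<Row> df = spark.read().format("hudi").load("/tmp/hudi_table");
    df.show();
  }
}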
Flink 1.18 Support
This release has added Flink 1.18 support with a new compile Maven profile flink1.18 and a new Flink bundle, hudi-flink1.18-bundle.
Hudi-Native HFile Reader
Hudi uses the HFile format as the base file format for storing various metadata, e.g., file listings, column stats, and bloom filters, in the metadata table (MDT), as the HFile format is optimized for range scans and point lookups. The HFile format was originally designed and implemented by HBase. To avoid HBase dependency conflicts and make engine integration easy with a Hadoop-independent implementation, we have implemented a new HFile reader in Java that is independent of HBase or Hadoop dependencies. This HFile reader is backwards compatible with existing Hudi releases and the storage format.
We have also written an HFile Format Specification that defines the HFile format required by Hudi. This makes it possible to implement HFile readers and writers in any language, for example C++ or Rust, by following this spec.
New Features in Hudi Utilities
StreamContext and SourceProfile Interfaces
For the Hudi streamer, we have introduced the new StreamContext and SourceProfile interfaces. These are designed to contain details about how the data should be consumed from the source and written (e.g., parallelism) in the next sync round in StreamSync. This allows users to control the behavior and performance of source reading and data writing to the target Hudi table.
Enhanced Proto Kafka Source Support
We have added support for deserializing with the Confluent proto deserializer, through a new config, hoodie.streamer.source.kafka.proto.value.deserializer.class, to specify the Kafka Proto payload deserializer class.
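For example, to have the Kafka Proto source use Confluent's Protobuf deserializer for record values, the new config can be supplied alongside the usual Kafka consumer properties. A minimal sketch building the properties in Java; the bootstrap servers and schema registry URL are illustrative.

import java.util.Properties;

public class ProtoKafkaSourceConfigExample {
  public static Properties protoKafkaProps() {
    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092");           // illustrative broker
    props.setProperty("schema.registry.url", "http://localhost:8081");  // illustrative registry
    // Use Confluent's Protobuf deserializer for the Kafka record values.
    props.setProperty("hoodie.streamer.source.kafka.proto.value.deserializer.class",
        "io.confluent.kafka.serializers.protobuf.KafkaProtobufDeserializer");
    return props;
  }
}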
Ignoring Checkpoint in Hudi Streamer
Hudi streamer has a new option, --ignore-checkpoint, to ignore the last committed checkpoint for the source. This option should be set with a unique value, a timestamp or UUID as recommended. Setting this config indicates that the subsequent sync should ignore the last committed checkpoint for the source. The config value is stored in the commit history, so setting the config with the same value again has no effect. This config can be used in scenarios like a Kafka topic change, where we want to start ingesting from the latest or earliest offset after switching the topic (in this case we want to ignore the previously committed checkpoint and rely on other configs to pick the starting offset).
Meta Sync Improvements
Parallelized Listing in Glue Catalog Sync
AWS Glue Catalog sync now supports parallelized listing of partitions to improve the listing performance and reduce meta sync latency. Three new configs are added to control the listing parallelism:
- hoodie.datasource.meta.sync.glue.all_partitions_read_parallelism: parallelism for listing all partitions (first-time sync).
- hoodie.datasource.meta.sync.glue.changed_partitions_read_parallelism: parallelism for listing changed partitions (second and subsequent syncs).
- hoodie.datasource.meta.sync.glue.partition_change_parallelism: parallelism for change operations such as create, update, and delete.
BigQuery Sync Optimization with Metadata Table
The BigQuery Sync now loads all partitions from the metadata table once if the metadata table is enabled, to improve the file listing performance.
Metrics Reporting to M3
A new MetricsReporter implementation, M3MetricsReporter, is added to support emitting metrics to M3. Users can now enable reporting metrics to M3 by setting hoodie.metrics.reporter.type to M3 and the corresponding host address and port in hoodie.metrics.m3.host and hoodie.metrics.m3.port.
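For example, M3 reporting can be turned on together with the general metrics switch. A minimal sketch of the relevant configs in Java; the host and port values are illustrative.

import java.util.Properties;

public class M3MetricsConfigExample {
  public static Properties m3MetricsProps() {
    Properties props = new Properties();
    props.setProperty("hoodie.metrics.on", "true");                  // enable metrics reporting
    props.setProperty("hoodie.metrics.reporter.type", "M3");         // use the new M3 reporter
    props.setProperty("hoodie.metrics.m3.host", "m3.example.com");   // illustrative host
    props.setProperty("hoodie.metrics.m3.port", "9052");             // illustrative port
    return props;
  }
}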
Other Features and Improvements
Schema Exception Classification
This release introduces the classification of schema-related exceptions (HUDI-7486) to make it easier for users to understand the root cause, including errors when converting records from Avro to Spark Row due to an illegal schema, or when the records are incompatible with the provided schema.
Record Size Estimation Improvement
The record size estimation in Hudi is improved by additionally considering replace commits and delta commits (HUDI-7429).
Using s3 Scheme for Athena
Recent Athena versions silently drop Hudi data when the partition location has an s3a scheme; recreating the table with the s3 scheme for partitions fixes the issue. We have added a fix to use the s3 scheme for the Hudi table partitions in AWS Glue Catalog sync (HUDI-7362).
Known Regressions
The Hudi 0.15.0 release introduces a regression related to complex key generation when the record key consists of a single field. This issue was also present in version 0.14.1. When upgrading a table from previous versions, it may silently ingest duplicate records.
Avoid upgrading any existing table to 0.14.1 or 0.15.0 from any prior version if you are using ComplexKeyGenerator and the record key consists of a single field.
Raw Release Notes
The raw release notes are available here.