Release 1.1.0
Apache Hudi 1.1.0 is a major release that brings significant performance improvements, new features, and important changes to the platform. This release focuses on enhanced table format support, improved indexing capabilities, expanded engine support, and modernized record merging APIs.
Highlights
- Pluggable Table Format Framework - Native integration of multiple table formats with unified metadata management
- Spark 4.0 and Flink 2.0 Support - Full support for latest major compute engine versions
- Enhanced Indexing - Non-global Record Index, partition-level bucket index, native HFile writer, and Column Stats V2
- Table Services Optimization - Parquet file stitching and incremental scheduling for compaction/clustering
- Storage-based Lock Provider - Multi-writer concurrency control without external dependencies
- Record Merging Evolution - Deprecation of payload classes in favor of merge modes and merger APIs
New Features
Table Format
Pluggable Table Format Support
Hudi 1.1.0 introduces a new Pluggable Table Format framework that enables native integration of multiple table formats within the system. This foundation includes a base interface for pluggable table formats, designed to simplify extension and allow seamless interoperability across different storage backends. The Metadata Table (MDT) integration has been enhanced to support pluggability, ensuring modularity and unified metadata management across all supported table formats.
This release brings native Hudi integration through the new framework, allowing users to leverage Hudi's advanced capabilities directly while maintaining consistent semantics and performance. The configuration hoodie.table.format defaults to native, which is the Hudi table format, so no configuration changes are required for existing or new Hudi tables. As additional table formats are supported in future releases, users will be able to set this configuration to work natively with other formats.
Table Version 9 with Index Versioning
Hudi 1.1.0 introduces table version 9 with support for index versioning. Indexes in the Metadata Table (column stats, secondary index, expression index, etc.) now have version tracking. In 1.1.0, these indexes use V2 layouts with enhanced capabilities including comprehensive logical data type support. Tables migrated from older versions will retain V1 index layouts, while new tables created with 1.1.0 use V2. Both versions remain backward compatible, and no action is required when upgrading to 1.1.0.
Indexing
Non-Global Record Index
In addition to the global Record Index introduced in 0.14.0, Hudi 1.1.0 adds a non-global variant that guarantees uniqueness for partition path and record key pairs. This index speeds up lookups in very large partitioned datasets.
Prior to 1.1.0, only global Record Index was available, configured as:
hoodie.metadata.record.index.enable=true
hoodie.index.type=RECORD_INDEX
From 1.1.0 onwards, both global and non-global variants are available:
For non-global Record Index:
- Metadata table: hoodie.metadata.record.level.index.enable=true
- Write index: hoodie.index.type=RECORD_LEVEL_INDEX
For global Record Index:
- Metadata table: hoodie.metadata.global.record.level.index.enable=true
- Write index: hoodie.index.type=GLOBAL_RECORD_LEVEL_INDEX
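For example, the non-global Record Index can be enabled on a Spark write as in the following sketch; the table name, key/partition fields, and paths are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class NonGlobalRecordIndexExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("record-index-example").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/source");              // placeholder input data

    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")                        // placeholder table name
        .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder key field
        .option("hoodie.datasource.write.partitionpath.field", "dt")    // placeholder partition field
        // Enable the record-level index partition in the metadata table
        .option("hoodie.metadata.record.level.index.enable", "true")
        // Use the non-global Record Index for write-path lookups
        .option("hoodie.index.type", "RECORD_LEVEL_INDEX")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/my_table");                                    // placeholder base path
  }
}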
Partition-Level Bucket Index
A new bucket index type that addresses bucket rescaling challenges. Users can assign different bucket numbers to different partitions through a rule engine (regex pattern matching). Existing Bucket Index tables can be upgraded seamlessly.
Key Configurations:
- hoodie.bucket.index.partition.rule.type - Rule parser for expressions (default: regex)
- hoodie.bucket.index.partition.expressions - Expression and bucket number pairs
- hoodie.bucket.index.num.buckets - Default bucket count for partitions
For more details, see RFC-89
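As a rough sketch, these configurations could be passed as Spark write options as shown below; the table, field, and path names are placeholders, and the expression value is illustrative only — refer to RFC-89 for the exact expression syntax.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionLevelBucketIndexExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("bucket-index-example").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/source");              // placeholder input data

    df.write().format("hudi")
        .option("hoodie.table.name", "events")                          // placeholder table name
        .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder key field
        .option("hoodie.datasource.write.partitionpath.field", "dt")    // placeholder partition field
        .option("hoodie.index.type", "BUCKET")
        // Rule parser for the per-partition expressions (regex is the default)
        .option("hoodie.bucket.index.partition.rule.type", "regex")
        // Illustrative expression/bucket-number pair; see RFC-89 for the exact syntax
        .option("hoodie.bucket.index.partition.expressions", "dt=2024-.*,64")
        // Default bucket count for partitions that match no expression
        .option("hoodie.bucket.index.num.buckets", "16")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/events");                                      // placeholder base path
  }
}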
Native HFile Writer
Hudi now includes a native HFile writer, eliminating dependencies on HBase while ensuring compatibility with both HBase's HFile reader and Hudi's native reader. This significantly reduces the size of Hudi's binary bundle and enhances Hudi's ability to optimize HFile performance.
HFile Performance Enhancements
Multiple enhancements to speed up metadata table reads:
- HFile Block Cache (enabled by default): Caches HFile data blocks on repeated reads within the same JVM, showing ~4x speedup in benchmarks. Configure with hoodie.hfile.block.cache.enabled
- HFile Prefetching: For files under 50MB (configurable via hoodie.metadata.file.cache.max.size.mb), downloads the entire HFile upfront rather than issuing multiple RPC calls
- Bloom Filter Support: Speeds up HFile lookups by avoiding unnecessary block downloads. Configure with hoodie.metadata.bloom.filter.enable
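If tuning is needed, one possible way to pass these settings is as Hudi datasource read options, as in the sketch below; the base path is a placeholder, and the same keys can instead be set through your global Hudi configuration.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MetadataReadTuningExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("hfile-tuning-example").getOrCreate();

    Dataset<Row> df = spark.read().format("hudi")
        // Cache HFile data blocks within the JVM (enabled by default)
        .option("hoodie.hfile.block.cache.enabled", "true")
        // Prefetch whole HFiles below this size (MB) instead of issuing many range reads
        .option("hoodie.metadata.file.cache.max.size.mb", "50")
        // Use bloom filters to skip unnecessary block downloads during lookups
        .option("hoodie.metadata.bloom.filter.enable", "true")
        .load("/tmp/hudi/my_table");                                    // placeholder base path

    df.show();
  }
}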
Column Stats V2 with Enhanced Data Type Support
Column Stats V2 significantly improves support for logical data types during writes and statistics collection. Logical types like decimals (with precision/scale) and timestamps are now preserved with proper metadata, improving accuracy for query planning and predicate pushdown.
Table Services
Parquet File Stitching
An optimization that enables direct copying of RowGroup-level data from Parquet files during operations like clustering, bypassing expensive compression/decompression, encoding/decoding, and column-to-row conversions. This optimization supports proper schema evolution and ensures Hudi metadata is collected and aggregated correctly. Experimental results show a 95% reduction in computational workload for clustering operations.
Incremental Table Service Scheduling
Significantly improves performance of compaction and clustering operations on tables with large numbers of partitions. Enabled by default via hoodie.table.services.incremental.enabled.
Concurrency Control
Storage-based Lock Provider
A new storage-based lock provider enables Hudi to manage multi-writer concurrency directly using the .hoodie directory in the underlying storage, eliminating the need for external lock providers like DynamoDB or ZooKeeper. Currently supports S3 and GCS, with lock information maintained under .hoodie/.lock.
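A minimal multi-writer sketch is shown below; the lock provider class name used here is an assumption for illustration — consult the 1.1.0 configuration reference for the exact value — and the table, field, and path names are placeholders.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class StorageLockExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("storage-lock-example").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/source");              // placeholder input data

    df.write().format("hudi")
        .option("hoodie.table.name", "orders")                          // placeholder table name
        .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder key field
        // Enable multi-writer concurrency control
        .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
        // Assumed class name for the storage-based lock provider; verify against the
        // 1.1.0 config reference. Lock state is kept under <base path>/.hoodie/.lock
        .option("hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.StorageBasedLockProvider")
        .mode(SaveMode.Append)
        .save("s3a://bucket/hudi/orders");                              // placeholder S3 base path
  }
}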
Writers & Readers
Multiple Ordering Fields
Support for multiple ordering fields (formerly known as precombine fields, now a deprecated name), specified as a comma-separated list. When records share the same key, Hudi compares the ordering fields in order and keeps the record with the latest values.
Configuration: hoodie.table.ordering.fields = field1,field2,field3
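For example, on a Spark write (table, field, and path names are placeholders), two ordering fields can be supplied so that ts is compared first and seq breaks ties:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MultipleOrderingFieldsExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("ordering-fields-example").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/source");              // placeholder input data

    df.write().format("hudi")
        .option("hoodie.table.name", "events")                          // placeholder table name
        .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder key field
        // Records with the same key are merged by comparing "ts" first, then "seq"
        .option("hoodie.table.ordering.fields", "ts,seq")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/events");                                      // placeholder base path
  }
}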
Efficient Streaming Reads for Data Blocks
Support for efficient streaming reads of HoodieDataBlocks (currently for AvroDataBlock) reduces memory usage, improves read stability on HDFS, and lowers the risk of timeouts and OOM errors when reading log files.
ORC Support in FileGroupReader
Enhanced support for multiple base file formats (ORC and Parquet) in HoodieFileGroupReader. The 1.1.0 release introduced SparkColumnarFileReader trait and MultipleColumnarFileFormatReader to uniformly handle ORC and Parquet records for both Merge-on-Read (MOR) and Copy-on-Write (COW) tables.
Hive Schema Evolution Support
Hive readers can now handle schema evolution when schema-on-write is used.
Spark
Spark 4.0 Support
Spark 4.0 is now supported with necessary compatibility and dependency changes. Available through the new hudi-spark4.0-bundle_2.13 release artifact.
Metadata Table Streaming Writes
Streaming writes to the metadata table enable more efficient metadata record generation by processing data table and metadata table writes in the same execution chain, avoiding on-demand lookups. Enabled by default for Spark via hoodie.metadata.streaming.write.enabled.
SQL Procedures Enhancements
New CLEAN Procedures:
- show_cleans - Displays completed cleaning operations with metadata
- show_clean_plans - Shows clean operations in all states (REQUESTED, INFLIGHT, COMPLETED)
- show_cleans_metadata - Provides partition-level cleaning details
Enhanced Capabilities:
- Regex pattern support in run_clustering via the partition_regex_pattern parameter
- Base path and filter parameters for all non-action SHOW procedures with advanced predicate expressions
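For example, assuming the Hudi Spark SQL extensions are enabled and a table named hudi_table exists, these procedures can be invoked as in the sketch below (parameters other than those called out above follow the usual table => ... convention):
import org.apache.spark.sql.SparkSession;

public class CleanProceduresExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("procedures-example").getOrCreate();

    // Completed cleaning operations with their metadata
    spark.sql("CALL show_cleans(table => 'hudi_table')").show();
    // Clean operations in all states (REQUESTED, INFLIGHT, COMPLETED)
    spark.sql("CALL show_clean_plans(table => 'hudi_table')").show();
    // Partition-level cleaning details
    spark.sql("CALL show_cleans_metadata(table => 'hudi_table')").show();
    // Cluster only the partitions matching a regex pattern
    spark.sql("CALL run_clustering(table => 'hudi_table', partition_regex_pattern => '2024.*')").show();
  }
}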
Flink
Flink 2.0 Support
Full support for Flink 2.0 including sink, read, catalog, and new bundle artifact hudi-flink2.0-bundle. Includes compatibility fixes for legacy APIs and supports sinkV2 API by default.
Removal: Support for Flink 1.14, 1.15, and 1.16 has been dropped
Performance Improvements
- Engine-native Record Support: Eliminates Avro transformations, utilizing RowData directly for more efficient serialization/deserialization. Write/read performance improved by 2-3x on average
- Async Instant Time Generation: Significantly improves stability for high throughput workloads by avoiding blocking on instant time generation
- Meta Fields Control: Support for hoodie.populate.meta.fields in append mode, showing 14% faster writes when disabled
New Capabilities
- In-memory Buffer Sort: For primary-key-less tables, enables a better compression ratio for columnar formats (write.buffer.sort.enabled)
- Split-level Rate Limiting: Configure the maximum number of splits per instant check for streaming reads (read.splits.limit)
Catalogs
Polaris Integration
Integration with Polaris catalog by delegating table creation to the Polaris Spark client, allowing Hudi tables to be registered in the Polaris Catalog.
Configuration: hoodie.datasource.polaris.catalog.class (default: org.apache.polaris.spark.SparkCatalog)
AWS Glue & DataHub Sync Enhancements
- CatalogId support for cross-catalog scenarios
- Explicit database and table name configuration
- Resource tagging for Glue databases and tables
- TLS/HTTPS support for DataHub with custom CA certificates and mutual-TLS
Platform Components
Enhanced JSON-to-Avro Conversion for HoodieStreamer
Improved JSON-to-Avro conversion layer for better reliability of the Kafka JSON source.
Prometheus Multi-table Support
Improved PrometheusReporter with reference-counting mechanism to prevent shared HTTP server shutdown when stopping metrics for individual tables.
API Changes & Deprecations
Deprecation of HoodieRecordPayload
Payload classes are now deprecated in favor of merge modes and the merger API. The payload-based approach was closely tied to Avro-formatted records, making it less compatible with native query engine formats like Spark InternalRow.
Migration Path:
- Standard Use Cases: Use merge mode configurations (COMMIT_TIME_ORDERING or EVENT_TIME_ORDERING)
- Custom Logic: Implement the HoodieRecordMerger interface instead of custom payloads
- Automatic Migration: When upgrading to the latest table version, known payloads are automatically migrated to appropriate merge modes
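As a sketch of the standard case, a merge mode can be set at write time in place of a payload class; hoodie.record.merge.mode is used here as the merge mode setting, the table, field, and path names are placeholders, and the exact keys should be verified against the configuration reference.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MergeModeExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("merge-mode-example").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/source");              // placeholder input data

    df.write().format("hudi")
        .option("hoodie.table.name", "users")                           // placeholder table name
        .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder key field
        // Event-time ordering replaces DefaultHoodieRecordPayload-style payloads;
        // use COMMIT_TIME_ORDERING instead for last-write-wins semantics
        .option("hoodie.record.merge.mode", "EVENT_TIME_ORDERING")
        .option("hoodie.table.ordering.fields", "ts")                   // placeholder ordering field
        .mode(SaveMode.Append)
        .save("/tmp/hudi/users");                                       // placeholder base path
  }
}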
The merge mode approach supports:
- Commit time and event time ordering
- Partial update strategies (replaces OverwriteNonDefaultsWithLatestAvroPayload and PostgresDebeziumAvroPayload toasted value handling)
- Consistent merging semantics across engines
- Performance optimizations
BufferedRecordMerger API: The HoodieRecordMerger interface has been updated to leverage the new BufferedRecord class used during record merging.
For details, see RFC-97
Breaking Changes
Complex Key Generator Regression Fix
A regression affecting Complex Key Generator with a single record key field has been fixed. The regression was introduced in the 0.14.1, 0.15.0, and 1.0.x releases, causing record key encoding to change from field_name:field_value to field_value, which could lead to duplicate records during upserts.
Who Is Affected: Tables using Complex Key Generator (ComplexAvroKeyGenerator or ComplexKeyGenerator) with a single record key field.
Default Behavior in 1.1.0: Reverts to the correct encoding format (field_name:field_value) matching 0.14.0 and earlier.
Migration Path:
- Upgrading from ≤0.14.0: No action needed, default behavior is correct
- New tables created in 0.15.0/1.0.x: Set hoodie.write.complex.keygen.new.encoding=true to maintain the current encoding
- Upgraded from 0.14.0 → 0.15.0/1.0.x: If you experienced duplicates, data repair may be needed. A repair tool is expected in 1.1.1
Validation: By default, writes will validate table configuration and alert you if affected. Disable with hoodie.write.complex.keygen.validation.enable=false if needed (readers are unaffected).
Note on Table Version 9: For tables upgraded to version 9, key encoding format is locked to prevent future regressions and ensure correct behavior for Record Index and Secondary Index.
For detailed technical information, see the PR documentation.
Removal of Auto-Commit Support in WriteClient
For developers using HoodieWriteClient directly, auto-commit support has been removed. The hoodie.auto.commit configuration is no longer honored.
Migration Required:
// Previous (with auto-commit)
HoodieWriteConfig config = ...; // instantiate config
SparkRDDWriteClient writeClient = new SparkRDDWriteClient(engineContext, config);
String instantTime = writeClient.startCommit();
JavaRDD<HoodieRecord> records = ...; // records to write
writeClient.upsert(records, instantTime);

// Now Required (explicit commit)
HoodieWriteConfig config = ...; // instantiate config
SparkRDDWriteClient writeClient = new SparkRDDWriteClient(engineContext, config);
String instantTime = writeClient.startCommit();
JavaRDD<HoodieRecord> records = ...; // records to write
JavaRDD<WriteStatus> writeStatus = writeClient.upsert(records, instantTime);
writeClient.commit(instantTime, writeStatus); // Explicit commit now mandatory
This change applies to table services as well.
INSERT INTO Behavior Change
The default behavior of Spark SQL's INSERT INTO command has changed. Previously, it used "upsert" operation for tables with ordering field(s), which caused deduplication. From 1.1.0, INSERT INTO now performs "insert" operation by default, ingesting records as-is without deduplication.
Example:
Commit1:
Partition1, recordKey1, val1, orderingValue1
Partition1, recordKey1, val2, orderingValue2
Pre-1.1.0: Returns one record (based on ordering field)
From 1.1.0: Returns both records
To Restore Previous Behavior: Set hoodie.spark.sql.insert.into.operation = upsert
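For SQL pipelines, this can be set at the session level before running INSERT INTO, as in the sketch below (table name and values are placeholders):
import org.apache.spark.sql.SparkSession;

public class InsertIntoUpsertExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("insert-into-example").getOrCreate();

    // Restore the pre-1.1.0 deduplicating behavior for INSERT INTO
    spark.sql("SET hoodie.spark.sql.insert.into.operation = upsert");

    // Subsequent INSERT INTO statements on Hudi tables now upsert on the record key
    spark.sql("INSERT INTO hudi_table VALUES ('recordKey1', 'val2', 2)");  // placeholder values
  }
}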
Restriction on Multiple Bulk Inserts for COW with Simple Bucket
Multiple bulk_insert operations are now restricted for Copy-On-Write tables when all of the following conditions are met:
- Table type is COW
- Index type is BUCKET with engine set to SIMPLE
- Spark native row writer is disabled (hoodie.datasource.write.row.writer.enable = false)
Resolution: Use upsert operation after the initial bulk_insert, or enable Spark native row writer.
Flink Bucket Index Restriction
Bucket index now supports only UPSERT operations and cannot be used with append write mode in Flink.
Behavior Changes
Default Merge Mode Changed
The default record merging behavior has changed based on whether an ordering field is set:
- No Ordering Field: Defaults to COMMIT_TIME_ORDERING using OverwriteWithLatestAvroPayload (last-write-wins)
- Ordering Field Set: Defaults to EVENT_TIME_ORDERING using DefaultHoodieRecordPayload (classic event-time based comparison)
Users should use hoodie.table.ordering.fields instead of hoodie.datasource.write.precombine.field (deprecated).
The default value for hoodie.datasource.write.precombine.field (previously ts) has been removed.
Why This Change: The previous default required users to have a ts field in their schema even for simple upsert cases where last-write-wins was sufficient. The new behavior makes Hudi work out-of-the-box for simple upserts while still supporting event-time ordering when explicitly configured.
Impact on HoodieStreamer: The --source-ordering-field parameter no longer defaults to ts.
- Before: Automatically enabled EVENT_TIME_ORDERING
- Now: Defaults to COMMIT_TIME_ORDERING if not specified
- Action Required: For new HoodieStreamer jobs needing event-time merging, explicitly provide --source-ordering-field ts (or your ordering field)
Validation Changes: Configuration mismatches now log warnings instead of failing jobs:
- Using COMMIT_TIME_ORDERING with an ordering field logs a warning that the field is ignored
- Using EVENT_TIME_ORDERING without an ordering field logs a warning instead of failing
Incremental Query Fallback Behavior
The default value for hoodie.datasource.read.incr.fallback.fulltablescan.enable changed from false to true. Incremental queries that cannot find necessary commit/data files will now automatically fall back to a full table scan instead of failing.
Performance Impact: Queries that previously failed fast will now succeed but may run significantly slower due to full table scans.
To Restore Fail-fast Behavior: Set hoodie.datasource.read.incr.fallback.fulltablescan.enable = false
Incremental Query Start Time Semantics
The hoodie.datasource.read.begin.instanttime configuration behavior has been reverted to be exclusive (matching 0.x behavior), correcting the inclusive behavior introduced in 1.0.0.
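For example, an incremental read with fail-fast behavior restored might look like the sketch below; the begin instant and base path are placeholders, and under 1.1.0 semantics the begin instant is exclusive.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalQueryExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("incremental-query-example").getOrCreate();

    Dataset<Row> incr = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        // Exclusive start instant (0.x semantics restored in 1.1.0); placeholder timestamp
        .option("hoodie.datasource.read.begin.instanttime", "20250101000000000")
        // Fail fast instead of silently falling back to a full table scan
        .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "false")
        .load("/tmp/hudi/my_table");                                    // placeholder base path

    incr.show();
  }
}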
Timestamp Logical Type Handling
With Column Stats V2, timestamp fields are now correctly handled as timestamp_millis instead of timestamp_micros (as in 0.15.0 and earlier 1.x releases). The release maintains backward compatibility when reading tables written with older versions.
Upgrade Notes
- Review Complex Key Generator Usage: Users upgrading from 0.14.0 or earlier should review their configurations
- Update WriteClient Usage: If using HoodieWriteClient directly, add explicit commit calls
- Check INSERT INTO Behavior: Verify whether your pipelines depend on the old upsert behavior
- Review Merge Mode Configurations: Especially for new tables or when using HoodieStreamer
- Test Incremental Queries: Be aware of new fallback behavior and potential performance impacts
- Flink Version: Ensure you're using Flink 1.17+ or 2.0 (1.14-1.16 no longer supported)
Hudi Version Support
End of Life for Hudi Versions Prior to 0.14.0
As of this release, Hudi versions prior to 0.14.0 have reached end of life. Users on these older versions should plan to upgrade to 1.1.0 or later to receive ongoing support, bug fixes, and new features. The Hudi community will focus support efforts on versions 0.14.0 and later.
For more details, see the community discussion.
Contributors
Hudi 1.1.0 is the result of contributions from the entire Hudi community. We thank all contributors who made this release possible.