
21 Unique Reasons Why Apache Hudi Should Be Your Next Data Lakehouse

Vinoth Chandar
9 min read

Apache Hudi is continuously redefining the data lakehouse, pushing the technical boundaries and offering cutting-edge features to handle data quickly and efficiently. If you have ever wondered how Apache Hudi has sustained its position over the years as the most comprehensive, open, high-performance data lakehouse project, this blog aims to give you some concise answers. Below, we shine a light on some unique capabilities in Hudi that go beyond the lowest common denominator across the different projects in the space.

1. Well-Balanced Storage Format

Hudi’s storage format strikes a careful balance between write speed (record-level changes) and query performance (scan+lookup optimized), at the cost of additional storage space to track indexes. In contrast, the Apache Iceberg/Delta Lake formats produce storage layouts aimed at vanilla scans, and focus more on metadata to help scale/prune those scans. Recent efforts that adopt LSM-tree structures to improve write performance inevitably sacrifice query performance. See the RUM conjecture.

2. Database-like Secondary Indexes

Continuing a long line of unique technical contributions to lakehouse technology, Hudi recently added secondary indexes (record-level, bloom filters, …), with support for even creating indexes on expressions over columns. These features, heavily inspired by relational databases like Postgres, can unlock completely new use cases on the data lakehouse like HTAP or index joins.
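
For a feel of how this works, here is a rough sketch of index creation through Hudi's Spark SQL DDL (as of Hudi 1.x), assuming an existing Hudi table named hudi_table; the index, table, and column names are illustrative and the exact DDL/options can vary across releases:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-index-sketch").getOrCreate()

# Secondary index on a non-key column, much like a database secondary index.
spark.sql("CREATE INDEX idx_city ON hudi_table (city)")

# Expression index: index an expression over a column (here, a date derived from
# a unix timestamp), so filters on that expression can prune files.
spark.sql(
    "CREATE INDEX idx_datestr ON hudi_table "
    "USING column_stats(ts) OPTIONS (expr='from_unixtime', format='yyyy-MM-dd')"
)
```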

3. Efficient Merge-on-Read (MoR) Design

Hudi’s optimized MoR design minimizes read/write amplification through a range of techniques like file grouping and partial updates. Grouping helps cut down the number of update blocks/deletion blocks/vectors that must be scanned to serve snapshot queries. It also helps preserve temporal locality of data, which dramatically improves time-based access, e.g. building dashboards over the last hour, last day, last week, … - table stakes for warehouse/lakehouse users.
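
As a minimal PySpark sketch of writing record-level upserts into a MoR table (table name, path, and columns are illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-mor-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.createDataFrame(
    [("id-1", "rider-a", "sf", 1718000000), ("id-2", "rider-b", "nyc", 1718000100)],
    ["uuid", "rider", "city", "ts"])

hudi_options = {
    "hoodie.table.name": "trips_mor",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # updates land in log files, compacted later
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/hudi/trips_mor"))
```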

4. Scalable Metadata for Large-Scale Datasets

Hudi’s metadata table handles millions of files efficiently, by storing them in an indexed, SSTable-based file format. Similarly, Hudi also indexes other metadata like column statistics, such that query planning scales with the number of columns in the query, i.e. O(number_of_columns_in_query), as opposed to flat-file storage like Avro, which scales poorly with table size, large numbers of files or wide columns.
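
As a rough sketch of the knobs involved (config names per the Hudi docs; defaults can vary by release), a writer keeps the metadata table and column-stats index maintained with each write, and a reader can use them to skip files, continuing the PySpark sketch above:

```python
# Writer side: maintain file listings and per-column min/max stats in the metadata table.
# (Merge these into the write options from the earlier sketch.)
metadata_opts = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

# Reader side: let queries prune files using the column-stats index.
pruned_df = (spark.read.format("hudi")
             .option("hoodie.metadata.enable", "true")
             .option("hoodie.enable.data.skipping", "true")
             .load("/tmp/hudi/trips_mor")
             .filter("ts > 1718000000"))
```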

5. Built-In Table Services

Hudi comes loaded with automated table services like compaction, clustering, indexing, de-duplication, archival, TTL enforcement and cleaning, that are scheduled, executed and retried automatically with every write, without requiring any external orchestration or manual SQL commands for table maintenance. Hudi’s marker mechanism efficiently cleans up uncommitted/orphaned files during writes without requiring a full listing of cloud storage to identify such files (which can take hours or even time out indefinitely).
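
For illustration, here is a hedged sketch of turning a few of these services on inline with each write (config names per the Hudi docs; the thresholds are illustrative):

```python
table_service_opts = {
    "hoodie.compact.inline": "true",                 # compact MoR log files as part of the write
    "hoodie.compact.inline.max.delta.commits": "5",  # ...after every 5 delta commits
    "hoodie.clustering.inline": "true",              # periodically re-cluster/sort data
    "hoodie.clustering.inline.max.commits": "4",
    "hoodie.clean.automatic": "true",                # clean up older file versions
    "hoodie.cleaner.commits.retained": "10",         # retain the last 10 commits for readers/time travel
}

# Merge into the write options from the earlier sketch, e.g.:
# df.write.format("hudi").options(**hudi_options, **table_service_opts).mode("append").save(path)
```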

6. Data Management Smarts

Stepping one level deeper, Hudi fully manages everything around storage - file sizes, partitions and metadata maintenance - automatically on each write, to provide consistent, dependable read/write performance. Furthermore, Hudi provides advanced sorting/clustering capabilities that can be run incrementally with new writes, to keep tables optimized.
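
As a hedged sketch of what these knobs look like (config names per the Hudi docs; the sizes and sort columns are illustrative), file sizing and the sort order applied by clustering can be steered with a handful of write options:

```python
layout_opts = {
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # target ~120 MB base files
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # bin-pack inserts into files below ~100 MB
    "hoodie.clustering.plan.strategy.sort.columns": "rider,ts", # sort order used when clustering runs
}
```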

7. Concurrency Control Purpose-built For the Lake

Hudi’s concurrency control is carefully designed to deliver high throughput for data lakehouse workloads, without blindly rehashing approaches that work for OLTP databases. Hudi brings novel MVCC-based approaches and non-blocking concurrency control: data pipelines/SQL ETLs and table services won’t fail or livelock each other, eliminating wasted compute cycles, improving data freshness and reducing cloud bills. Even under the optimistic concurrency control model (the lowest common denominator across projects), Hudi provides early conflict detection to pre-emptively abort writes that will eventually fail due to conflicts, saving countless compute hours.
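
For multi-writer setups, a hedged sketch of the relevant configuration looks like the following (config names per the Hudi docs; the lock provider and mode to use depend on your deployment, and the early-conflict-detection key shown is my best recollection of the documented name):

```python
concurrency_opts = {
    # Optimistic concurrency control across multiple writers, coordinated by a lock provider.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
    # Abort writes early when a conflict is already certain, instead of at commit time.
    "hoodie.write.concurrency.early.conflict.detection.enable": "true",
}
# Hudi 1.x additionally offers a non-blocking concurrency mode for MoR tables,
# so concurrent streaming writers do not need to fail each other.
```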

8. Performance at Scale

Hudi stands out on the toughest workloads you should be testing first before deciding on your lakehouse stack: CDC ingest, expensive SQL merges, or TB-PB scale streaming data. Hudi provides about half a dozen writer-side indexes, including advanced record-level indexes, range indexes built on interval trees and consistent-hashed bucket indexes, to scale writes for such workloads. Hudi is the only lakehouse project that can rapidly ingest/write and handle small-file compaction without blocking those writes.
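
As a rough sketch (config names per the Hudi docs; which index fits best is workload-dependent), two of the writer-side indexes mentioned above can be selected like so:

```python
# Consistent-hashing bucket index: scales upserts without per-record index lookups.
bucket_index_opts = {
    "hoodie.index.type": "BUCKET",
    "hoodie.index.bucket.engine": "CONSISTENT_HASHING",
    "hoodie.bucket.index.num.buckets": "256",
}

# Record-level index maintained in the metadata table: fast key-based upserts and lookups.
record_index_opts = {
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.record.index.enable": "true",
}
```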

9. Out-of-the-Box CDC/Streaming Ingestion

Hudi provides powerful, fully production-ready ingestion tools for Spark, Flink and Kafka users, that help them build data lakehouses from their data with a single command. In fact, many Hudi users blissfully use these tools, unaware of all the underlying machinery balancing write/read performance or handling table maintenance. This way, Hudi provides a self-managing runtime environment for your data lakehouse pipelines, without having to pay for closed services from vendors. Hudi's ingest tools natively support popular CDC formats like Debezium/AWS DMS/Mongo and sources like S3, GCS, Kafka, Pulsar and the like.

10. First-Class Support for Keys

Hudi treats record keys as first-class citizens, using them everywhere from indexing, de-duplication, clustering and compaction to consistently track/control the movement of records within a table, across files. Additionally, Hudi tracks the necessary record-level metadata that, in conjunction with keys, helps implement powerful features like incremental queries. Ingest tools seamlessly map source primary keys to Hudi primary keys or auto-generate highly compressible keys to aid these capabilities.

11. Streaming-First Design

Hudi was born out of a need to bridge the gap between batch processing and stream processing models. Thus, naturally, Hudi offers best-in-class and unique capabilities for handling streaming data. Hudi supports event-time ordering and late-data handling natively in storage, where MoR is employed heavily. RecordPayload/RecordMerger APIs let you merge updates in database LSN order, unlike other approaches, avoiding cases like tables going backwards in (event) time when the input is out-of-order/late-arriving (which is the norm rather than the exception).
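
A hedged sketch of what this looks like with the Spark writer (table/column names are illustrative, and the payload class shown is one documented way to merge by event time rather than arrival time):

```python
event_time_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",   # event-time ordering field
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    # Non-partitioned table for brevity.
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
}

# A late-arriving batch carries an *older* ts for event_id "e1"; merging on ts
# keeps the newer record already in the table instead of going back in event time.
late_df = spark.createDataFrame([("e1", "stale-update", 1717990000)],
                                ["event_id", "status", "ts"])
late_df.write.format("hudi").options(**event_time_opts).mode("append").save("/tmp/hudi/events")
```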

12. Efficient Incremental Processing

All roads in Hudi lead to efficiency in storage and compute: storage, by reducing the amount of data stored/accessed; compute, by reducing the time needed to write/read it. Hudi supports unique incremental queries, along with CDC queries, to let downstream data consumers quickly obtain the changes to a table between two points in time. Owing to its scalable metadata design, an LSM-tree backed timeline history and record-level change tracking, Hudi is able to support near-infinite retention for such streams, which proves very useful when dealing with transactional data/logs.
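
As a short sketch of the consumer side (the instant time and path are illustrative, continuing the earlier PySpark sketch), an incremental query pulls just the records that changed after a given commit:

```python
incr_df = (spark.read.format("hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", "20240601000000")
           .load("/tmp/hudi/trips_mor"))

# Each row carries commit metadata, so downstream jobs can checkpoint where they left off.
incr_df.select("uuid", "rider", "_hoodie_commit_time").show()
```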

13. Powerful Apache Spark Implementation

Hudi comes with a very feature-rich, advanced integration with Apache Spark - across SQL, DataSource, RDD APIs, Structured Streaming and Spark Streaming. Combined, Hudi + Spark almost gives users a database - with built-in data management, ingestion, streaming/batch APIs, ANSI SQL and programmatic access from Python/JVM. Much like a database, the write/read implementation paths automatically pick the right storage layout to optimize storage at rest, or perform the necessary index pruning to speed up queries.

14. Next-Gen Flink Writer for Streaming Pipelines

Hudi and Flink have the best impedance match when it comes to handling streaming data. The Hudi Flink sink is built on a deep integration between the two projects' capabilities, leveraging Flink’s state backends as a writer-side index in Hudi. With the combination of non-blocking concurrency control and partial updates, Hudi is the only lakehouse storage sink for Flink that allows multiple streaming writers to concurrently write to a table (without having to fail one). Just like the Spark integration, the Flink writer comes with built-in table services, akin to a “streaming database” for the lakehouse.
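
For a flavor of the Flink side, here is a hedged PyFlink sketch of a Hudi MoR sink table (it assumes the hudi-flink bundle is on the classpath; connector options follow the Hudi Flink quickstart, and the path/columns are illustrative):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

t_env.execute_sql("""
    CREATE TABLE hudi_trips (
      uuid  STRING PRIMARY KEY NOT ENFORCED,
      rider STRING,
      ts    TIMESTAMP(3)
    ) WITH (
      'connector'  = 'hudi',
      'path'       = 'file:///tmp/hudi/trips_flink',
      'table.type' = 'MERGE_ON_READ'
    )
""")

# Any upstream streaming source (Kafka, CDC, datagen, ...) can then INSERT INTO this table.
t_env.execute_sql(
    "INSERT INTO hudi_trips VALUES ('id-1', 'rider-a', TIMESTAMP '2024-06-01 00:00:00')"
).wait()
```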

15. Avoid Compute Lock-In

Don’t let the noise fool you. Hudi is widely supported across cloud warehouses (Redshift, BigQuery), open-source query/processing engines (Spark, Presto, Trino, Flink, Hive, ClickHouse, StarRocks, Doris) and hosted offerings of those open-source engines (AWS Athena, EMR, DataProc, Databricks). This means you have the power to fully control not just the open format you store data in, but also the end-to-end ingestion, transformation and optimization of your tables, avoiding any “compute lock-in” with these engines.

16. Seamless Interop Iceberg/Delta Lake and Catalog Syncs

To make the point above concrete, Hudi also ships with a catalog sync mechanism that supports about 6 different data catalogs, to keep your table definitions in sync over time. Hudi tables can be readily queried as external tables on cloud data warehouses. And, with Apache XTable (Incubating), Hudi enables interoperability with the Iceberg and Delta Lake table formats, without the need to duplicate data storage or processing. Thus, Hudi offers the most open way to manage your data on the cloud.

17. Truly Open and Community-Driven

Apache Hudi is an open-source project, actively developed by a diverse global community. In fact, the grass-roots nature of the project and its community has been a crucial reason for the lasting success Hudi has had in the industry, in spite of 100-1000x bigger vendor teams marketing/selling users in other directions. The project has an established track record of a truly collaborative way of building software, the Apache way.

18. Massive Adoption Across Industries

For systems/infrastructure software like Hudi, it’s very important to gain/prove maturity by clocking massive amounts of server hours. Hudi is used at massive scale across much of the Fortune 100 and at large organizations like Uber, AWS, ByteDance, Peloton, Huawei, Alibaba, and more, adding immense value in the form of a steady stream of high-quality bug reports and feature asks shaping the project's roadmap. This way, Hudi users get highly capable lakehouse software that can address a diverse range of use cases.

19. Proven Reliability in High-Pressure Workloads

Hudi has been pressure-tested on some of the most demanding workloads there are on the data lakehouse: from minute-level latency on petabytes, to ingesting >100 GB/s, to very tough random-write workloads that test even the best OLTP databases out there. Hudi has been deployed industry-wide for very critical data processing needs like financial clearing jobs, ride-sharing payments and transactional reconciliation.

20. Cloud-Native and Lakehouse-Ready

Don’t let the Hadoop origins mislead you either. Hudi has long evolved past HDFS and works seamlessly with S3, GCS, Azure, Alibaba, Huawei and many other cloud storage systems. Together with cloud-native integrations, or simply via easy integrations outside of cloud-native services, Hudi provides a very portable (cross-engine, cross-format, cross-cloud) way of building cloud data lakehouses.

21. Future-Proof and Actively Evolving

Hudi’s community boasts about 40-50 monthly active developers, and is growing even more with efforts like hudi-rs. Hudi’s rapid development ensures constant improvements and cutting-edge features on one hand, while the openness of the community to truly work across the entire cloud data ecosystem, on the other, ensures your data stays as open as possible.

In summary, there is no secret sauce. The answer to the original question is simply how these design and implementation differences have compounded over time into unmatched technical capabilities that data engineers across the industry widely recognize. These have resulted from 6+ years of evolution, hardening and iteration from an OSS community. And, it's always a moving target, given the amount of innovation that is still ahead of us, in the data lakehouse space. By the time some of these differences make it to other projects, the community might have innovated 21 more reasons.

Apache Hudi is the best-in-class open-source data lakehouse platform: powerful, efficient, and future-proof. Start exploring it today! 🚀