Apache Hudi 2024: A Year In Review
As we wrap up another remarkable year for Apache Hudi, I am thrilled to reflect on the tremendous achievements and milestones that have defined 2024. This year has been particularly special as we achieved several significant milestones, including the landmark release of Hudi 1.0, the publication of comprehensive books, and the introduction of new tools that expand Hudi's ecosystem.
Community Growth and Engagement
The Apache Hudi community continued its impressive growth trajectory in 2024. The number of new PRs remained stable, indicating a consistent level of development activity.
Our community presence expanded significantly across various platforms:
- The community grew to over 10,500 followers on LinkedIn
- Added 8,755 new followers in the last 365 days
- Generated 441,402 content impressions
- Received 6,555 reactions and 493 comments across platforms
- Our Slack community remained vibrant with rich technical discussions and knowledge sharing
Major Milestones
Apache Hudi 1.0 Release
2024 marked a historic moment with the release of Apache Hudi 1.0, representing a major evolution in data lakehouse technology. This release brought several groundbreaking features:
- Secondary Indexing: First of its kind in lakehouses, enabling database-like query acceleration with a demonstrated 95% latency reduction on 10TB TPC-DS for low-to-moderate selectivity queries
- Logical Partitioning via Expression Indexes: Introducing PostgreSQL-style expression indexes for more efficient partition management
- Partial Updates: Achieving 2.6x performance improvement and 85% reduction in bytes written for update-heavy workloads
- Non-blocking Concurrency Control (NBCC): An industry-first feature allowing simultaneous writing from multiple writers
- Merge Modes: First-class support for both commit_time_ordering and event_time_ordering
- LSM Timeline: Revamped timeline storage as a scalable LSM tree for extended table history retention
- TrueTime: Strengthened time semantics ensuring forward-moving clocks in distributed processes
Please check out the announcement blog.
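To make the two merge modes listed above more concrete, here is a minimal, self-contained Python sketch of the idea. It does not use any Hudi API: the `Record` class, the `merge` function, and the sample values are hypothetical names invented for illustration, and the tie-breaking rule (a newer write wins when event times are equal) is an assumption of this sketch. It only models the intuition that commit-time ordering keeps the most recently committed write per key, while event-time ordering keeps the write with the highest ordering value, so late-arriving data does not overwrite newer events.

```python
from dataclasses import dataclass

# Hypothetical names for illustration only; this is not Hudi's API.
MODES = ("commit_time_ordering", "event_time_ordering")


@dataclass(frozen=True)
class Record:
    key: str
    value: str
    event_time: int   # ordering value supplied by the producer
    commit_time: int  # order in which the write was committed


def merge(records, mode):
    """Return the winning record per key under the given merge mode."""
    if mode not in MODES:
        raise ValueError(f"unknown merge mode: {mode}")
    winners = {}
    # Replay the writes in commit order, as a table would observe them.
    for rec in sorted(records, key=lambda r: r.commit_time):
        current = winners.get(rec.key)
        if current is None or mode == "commit_time_ordering":
            # First write for the key, or the later commit always wins.
            winners[rec.key] = rec
        elif rec.event_time >= current.event_time:
            # Event-time ordering: replace only if the event is not older.
            # Assumption of this sketch: ties go to the newer write.
            winners[rec.key] = rec
    return winners


writes = [
    Record("trip-1", "fare=10", event_time=100, commit_time=1),
    Record("trip-1", "fare=12", event_time=300, commit_time=2),
    # Late-arriving record: committed last, but with an older event time.
    Record("trip-1", "fare=11", event_time=200, commit_time=3),
]

by_commit = merge(writes, "commit_time_ordering")
by_event = merge(writes, "event_time_ordering")
print(by_commit["trip-1"].value)
print(by_event["trip-1"].value)
```

In this sketch the late-arriving record wins under commit-time ordering and is ignored under event-time ordering, which is the practical difference to keep in mind when choosing a mode for out-of-order data.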
Launch of Hudi-rs
A significant expansion of the Hudi ecosystem occurred with the release of Hudi-rs, the native Rust implementation for Apache Hudi with Python API bindings. This new project enables:
- Reading Hudi Tables without Spark or JVM dependencies
- Integration with Apache Arrow for enhanced compatibility
- Support for Copy-on-Write (CoW) table snapshots and time-travel reads
- Cloud storage support across AWS, Azure, and GCP
- Native integration with Apache DataFusion, Ray, Daft, and others
Published Books and Educational Content
2024 saw the release of two comprehensive guides to Apache Hudi:
- "Apache Hudi: The Definitive Guide" (O'Reilly) - Released in early access, free copy available, providing comprehensive coverage of:
  - Distributed query engines
  - Snapshot and time travel queries
  - Incremental queries
  - Change-data-capture modes
  - End-to-end ingestion with Hudi Streamer
- "Apache Hudi: From Zero to One" - A 10-part blog series turned into an ebook, offering deep technical insights into Hudi's architecture and capabilities, covering:
  - Storage format and operations
  - Read and write flows
  - Table services and indexing
  - Incremental processing
  - Hudi 1.0 features
Community Events and Sharing
The Apache Hudi community maintained a strong presence at major industry events throughout 2024:
- Databricks' Data+AI Summit - Presenting Apache Hudi's role in the lakehouse ecosystem and its interoperability with other table formats through XTable, an open-source project enabling seamless conversion between Hudi, Delta Lake, and Iceberg
- Confluent's Current 2024 - Demonstrating Hudi's powerful CDC capabilities with Apache Flink, showcasing real-time data pipelines and the innovative Non-Blocking Concurrency Control (NBCC) for high-volume streaming workloads
- Trino Fest 2024 - Showcasing Hudi connector's evolution and innovations in Trino, including multi-modal indexing capabilities and the roadmap for enhanced query performance through Alluxio-powered caching and expanded DDL/DML support
- Bangalore Lakehouse Days - Deep dive into Apache Hudi 1.0's groundbreaking features including LSM-based timeline, functional indexes, and non-blocking concurrency control, demonstrating Hudi's continued innovation in the lakehouse space
Additionally, the community launched several new initiatives to foster learning and knowledge sharing:
Lakehouse Chronicles with Apache Hudi
A new community series with 4 episodes released.
Hudi Newsletter
9 editions published, keeping the community informed about the latest developments.
Community Syncs
Featured 8 user stories from major organizations including Amazon, Peloton, Shopee and Uber.
- Powering Amazon Unit Economics with Configurations and Hudi
- Modernizing Data Infrastructure at Peloton using Apache Hudi
- Innovative Solution for Real-time Analytics at Scale using Apache Hudi (Shopee)
- Scaling Complex Data Workflows using Apache Hudi (Uber)
Notable User Stories and Technical Content
Throughout 2024, several organizations shared their Hudi implementation experiences:
- Notion's transition from Snowflake to Hudi
- Grab's implementation of near-realtime data analytics
- AWS's data sharing capabilities with AWS Data Exchange
- Yuno's data lake transformation
- Halodoc's cost optimization strategies
- Upstox's data platform evolution
Looking Ahead to 2025
As we look forward to 2025, Apache Hudi's roadmap includes several exciting developments:
- Enhanced core engine with modernized write paths and advanced indexing (bitmap, vector search)
- Multi-modal data support with improved storage engine APIs and cross-format interoperability
- Enterprise-grade features including multi-table transactions and advanced caching
- Robust platform services with Data Lakehouse Management System (DLMS) components
- Broader adoption of Hudi-rs across the ecosystem
- Continued focus on stability and a seamless migration path for the community
These initiatives reflect our commitment to advancing data lakehouse technology while ensuring reliability and a good user experience.
Get Involved
Join our thriving community:
- Contribute to the project on GitHub: Hudi & Hudi-rs
- Join our Slack community
- Follow us on LinkedIn and X (Twitter)
- Subscribe to our YouTube channel
- Participate in our community syncs and office hours
- Subscribe to the dev mailing list by sending an empty email to dev-subscribe@hudi.apache.org
The success of Apache Hudi in 2024 wouldn't have been possible without our dedicated community of contributors, users, and supporters. As we celebrate these achievements, we look forward to another year of innovation and growth in 2025.