Apache Hudi: User-Facing Analytics

Scaling Autonomous Vehicle Data Infrastructure with Apache Hudi at Applied Intuition

Thu, 22 Jan 2026 00:00:00 GMT

This post summarizes Applied Intuition's talk from the Apache Hudi community sync. Watch the recording on YouTube.

Applied Intuition is the foremost enabler of autonomous vehicle (AV) systems, providing a suite of tools that help AV companies improve their entire stack—from simulation to data exploration. To support their mission, Applied Intuition built a unique data infrastructure that is flexible, scalable, and secure. After migrating to an Apache Hudi-powered data lakehouse, they transformed their data capabilities: query times dropped from 10 minutes to under 25 seconds, and they can now query 3-4 orders of magnitude more data than before.

Building a Unique Data Infrastructure

Applied Intuition's data infrastructure is designed to meet the specific needs of its diverse customer base, including 17 of the top 20 OEMs. Their infrastructure is built around four core principles.

First, schemas must be flexible. Each customer determines their own data schema, so the infrastructure must handle a wide variety of data points without requiring rigid upfront definitions.

Second, compute needs to be tunable. Some customers are more cost-sensitive while others have larger-scale needs, so the infrastructure can adjust compute resources on a per-customer basis.

Third, everything must be cloud agnostic. Because customers operate on different cloud providers, the infrastructure—built on Kubernetes—works seamlessly across all of them without relying on a single vendor.

Finally, security and privacy are paramount. All data and infrastructure live within the customer's own cloud accounts. This ensures that customers fully own and control their data, enabling strict security, privacy, and retention policies.

The Challenges Before Apache Hudi

Before adopting Apache Hudi, Applied Intuition's data infrastructure directly queried a raw data lake on S3/ABFS using SQL engines. While this approach worked initially, significant issues emerged as scale increased.

The system struggled to provide ACID transaction guarantees critical for data integrity. Storage costs kept climbing because storing all data in raw format was expensive. As small files accumulated, query performance degraded dramatically due to the I/O overhead of opening and closing countless files.

To address these challenges, Applied Intuition adopted Apache Hudi, which introduced a transactional layer and metadata management to their data lake—transforming file system storage into a modern data lakehouse.

Applied Intuition primarily uses Copy-on-Write (COW) tables. While Hudi also offers Merge-on-Read (MOR) tables for faster ingestion, their main priority is query performance. With COW tables, they achieve fast query execution while accepting slightly higher write latency.

Leveraging Hudi Features to Shape Data Architecture

Applied Intuition leverages three core Hudi services to optimize its data infrastructure: file sizing, clustering, and metadata indexing.

File Sizing: Solving the Small File Problem

File sizing was the very first reason they started using Hudi. The company runs thousands of simulations daily, each generating numerous small files. This led to the classic "small file problem"—Spark queries would spend significant time just opening and closing files to read metadata. Spark SQL performs best with files around 512MB, but simulation files are often just kilobytes.

Hudi's file sizing service efficiently packs small files into optimally-sized files by analyzing previous commits and estimating the number of records per file. By combining countless kilobyte-sized files, Hudi drastically reduced I/O overhead and improved query performance. Their data now takes up 20x less space than with raw Parquet files, resulting in substantial S3 cost savings.
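The packing idea can be illustrated with a small sketch. This is not Hudi's implementation, which per the description above relies on estimates drawn from previous commits. It is a plain first-fit packer, and the 64 KB input size, the class name, and the function name are assumptions made for illustration. The 512MB target is the figure quoted earlier.

```python
from dataclasses import dataclass, field

TARGET_FILE_BYTES = 512 * 1024 * 1024  # target size, taken from the 512MB figure in the text


@dataclass
class PackedFile:
    """One output file assembled from many small inputs (hypothetical type)."""
    inputs: list = field(default_factory=list)
    size: int = 0


def pack_small_files(sizes, target=TARGET_FILE_BYTES):
    """Greedy first-fit packing of input sizes (bytes) into files of at most `target` bytes."""
    packed = []
    for size in sizes:
        for candidate in packed:
            if candidate.size + size <= target:
                candidate.inputs.append(size)
                candidate.size += size
                break
        else:
            # No existing output file has room, so start a new one.
            packed.append(PackedFile(inputs=[size], size=size))
    return packed


# 10,000 simulation outputs of 64 KB each (assumed sizes), about 625 MB in total.
small_files = [64 * 1024] * 10_000
result = pack_small_files(small_files)
print(len(small_files), "->", len(result), "files")  # 10000 -> 2 files
```

A query engine then opens two files instead of ten thousand, which is where the I/O savings come from.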

Rohit recalls this as the first "aha moment" with Hudi: "We had less than a gigabyte of data, but query performance was really slow. When we first tried file sizing, performance improved dramatically—and we saw all our data fit within megabytes. It was really cool to see that level of compression and query performance just out of the box."

Clustering: Optimizing for Query Patterns

Many of Applied Intuition's queries focus on specific chunks of data, or batches. Hudi's clustering feature improves data co-location by arranging related records together, minimizing the number of files touched per query.

For example, by clustering all data from a single "simulation run ID" into just one or two files, Hudi allows queries to avoid scanning thousands of files. This has led to massive improvements in query performance. Applied Intuition runs clustering jobs asynchronously to maintain low write latency while keeping query performance high.
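The effect of co-locating records can be seen in a toy model. The sketch below is an assumption-laden illustration rather than Hudi's clustering service: records are reduced to their run ID, files hold a fixed number of records, and "clustering" is modeled as sorting by run ID before rewriting the files.

```python
import random

RECORDS_PER_FILE = 100  # hypothetical file capacity for this illustration


def write_files(records, per_file=RECORDS_PER_FILE):
    """Split records into fixed-size files, preserving the given order."""
    return [records[i:i + per_file] for i in range(0, len(records), per_file)]


def files_touched(files, run_id):
    """Count files holding at least one record for run_id."""
    return sum(1 for f in files if run_id in f)


rng = random.Random(0)
# 50 simulation runs with 100 records each, arriving interleaved.
records = [run_id for run_id in range(50) for _ in range(100)]
rng.shuffle(records)

unclustered = write_files(records)        # layout in arrival order
clustered = write_files(sorted(records))  # layout after sorting by run ID

print("unclustered:", files_touched(unclustered, 7))
print("clustered:", files_touched(clustered, 7))
```

In the sorted layout every record of run 7 sits in a single file, while the arrival-order layout spreads the same records across many files.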

Metadata Indexing: From Minutes to Seconds

Before Hudi, loading data from raw cloud storage could take minutes, especially when listing millions of files. Hudi's metadata indexing creates a file index that allows the dataframe to load in under two seconds—a huge UX improvement for their customers.

Additionally, they use column stats indices, which store min/max values for key columns. When a query runs, Hudi uses these stats to skip irrelevant files that don't match the query criteria, enabling much faster lookups.
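A minimal sketch of min/max file skipping follows. The file names, value ranges, and the FileStats type are hypothetical, and the code only illustrates the pruning rule, not how Hudi stores or reads its column stats index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max of one column for one data file (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def prune(files, lo, hi):
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f for f in files if f.max_value >= lo and f.min_value <= hi]


index = [
    FileStats("file-a.parquet", 0, 99),
    FileStats("file-b.parquet", 100, 199),
    FileStats("file-c.parquet", 200, 299),
]

# A query filtering on values between 120 and 150 only needs one file.
to_scan = prune(index, 120, 150)
print([f.path for f in to_scan])  # ['file-b.parquet']
```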

Extending Hudi for Schema Flexibility

Given their wide customer base and evolving schema needs, Applied Intuition extended Hudi with two customizations: one to evict the cached file schema provider so mid-day schema updates are picked up during writes, and another to allow Parquet batching even when schemas differ across commits—common in simulation data where batches may have different columns.

Impact: Performance, Cost, and Scale

The improvements with Hudi have been transformative. Applied Intuition can now query 3-4 orders of magnitude more data than before. Storage costs dropped significantly thanks to file packing that achieves 20x compression compared to raw Parquet files. Query times that once took 10 minutes now complete in under 25 seconds, and dataframe initialization that used to take minutes now happens in seconds.

Despite running on tight compute resources—just 1-2 machines running DeltaStreamer—their ingestion latency sits around 15 minutes. They can easily scale this by adding more compute when needed.

Next Steps: Moving Beyond PostgreSQL

After successfully implementing Hudi on a few key tables, Applied Intuition is scaling Hudi to support their entire data lake architecture.

A proof of concept integrates PostgreSQL CDC via Debezium into Kafka, which feeds into Hudi DeltaStreamer, replicating transactional data into a Hudi-powered lakehouse. This setup enables non-critical queries to shift away from PostgreSQL, reducing database load and improving overall product performance. It also opens up deeper analytical insights directly from the data lake for both internal teams and customers.

The team worked through some initial setup challenges, resolving tombstone record handling through PostgreSQL and Debezium configuration updates.

Acknowledgments

Applied Intuition is grateful for the incredible support from the Apache Hudi community, which has significantly improved their data infrastructure. The Onehouse team—Sivabalan, Ethan, and Nadine—has been particularly helpful, staying up on long night calls to help debug issues and ensure the team understood the product deeply. Nadine also provided ongoing support by answering questions on Slack.

Conclusion

Applied Intuition's journey with Apache Hudi demonstrates how a modern data lakehouse platform can solve complex data infrastructure challenges while unlocking new levels of performance and insight.

Apache Hudi™ at Uber: Engineering for Trillion-Record-Scale Data Lake Operations

Fri, 16 Jan 2026 00:00:00 GMT


From Legacy to Leading: Funding Circle's Journey with Apache Hudi

Thu, 15 Jan 2026 00:00:00 GMT

This post summarizes Funding Circle's presentation at the Apache Hudi Community Sync. Watch the recording on YouTube.

Funding Circle is a lending company focused on helping small and medium-sized businesses access the funding they need to grow. Their instant decisioning engine allows customers to complete loan applications in minutes and receive decisions in seconds. To date, the company has supported more than 135,000 businesses with over £15 billion in loans.

In this community sync, Daniel Ford, Data Platform Engineer at Funding Circle, shared how his team built a modern data ingestion framework using Apache Hudi and the significant improvements they achieved.

Funding Circle's lending capability relies on a broad and complex data platform. The goal of the platform is to make data easy to work with, enabling users to drive business decisions without requiring specialized data engineering knowledge. Developers should be able to build quickly, while data consumers should be able to find and use information without constantly seeking support.

However, their existing legacy Kafka ingestion solution, developed back in 2018-2019, was actively undermining these goals.

Why the Legacy Ingestion System Needed to Change

The legacy ingestion tool suffered from multiple critical problems that made maintaining and scaling the platform difficult:

Schema Evolution Instability: Handling changes in data schemas was often unreliable or completely unviable, leading to frequent pipeline breaks and manual intervention.
Code Complexity and Age: The codebase was massively complex and dated, predating almost every current engineer on the Data Platform team, making maintenance and updates a slow, painful process.
Centralized Control vs. Data Mesh: The pipelines featured centralized deployment and management, which fundamentally opposed the company's strategic aim of achieving a data mesh architecture.
Lack of Ownership and Observability: There was no effective way to establish true end-to-end domain ownership for a data pipeline, and a general lack of observability made it hard to quickly diagnose and fix issues.
Prohibitive Backfilling Times: Syncing large Kafka topics took prohibitively long in the legacy solution, severely limiting the ability to correct or reload historical data.
Inability to Support Real-Time Data: The system was not designed to support near-real-time ingestion, trapping users in slow batch-processing cycles.

To move forward, the team defined clear goals: the new system needed to deliver data within ten minutes, support stable schema evolution, integrate with metadata systems like DataHub, and offer strong monitoring, scalability, and built-in PII masking. It also needed to shift ownership to individual teams by allowing decentralized pipeline definitions.

Introducing Project Kirby: The New Ingestion Framework

After six to seven months of development, the team created Kirby—short for "Kafka Ingestion Real-Time or Batch through YAML"—which provides a simple, configuration-driven interface built around Apache Hudi Streamer running on AWS EMR.

Architecture and User Experience

Kirby organizes ingestion into three major areas: declaration, compute, and access.

Pipeline Declaration and Deployment

Users define their pipeline in a simple YAML file stored in their own repository. This file specifies metadata such as region, domain ownership, and pipeline type. Users deploy the pipeline through GitHub Actions or Drone, depending on their existing CI/CD setup. Kirby converts the YAML definition into an Airflow DAG, letting users track the history and progress of their ingestion tasks. To safely manage secrets, the team extended the EMR step operator to load credentials from AWS Secrets Manager at runtime. Metadata embedded in the DAG also gives the central team visibility into how many pipelines exist, who owns them, and which versions are running.

Compute Layer

The actual ingestion jobs run via Hudi Streamer 0.11.1 on EMR 6.8, utilizing both batch and continuous streaming configurations. Hudi Streamer directly ingests data from Kafka on Funding Circle's centrally maintained EMR cluster.

Access Layer

Data is stored in S3 in the Hudi format and automatically registered with the Glue Data Catalog, making it instantly queryable via Athena.

The team currently serves approximately 80 topics with a range of batch and streaming workloads running simultaneously—some completing in minutes, others running continuously for hours.

Challenges and Lessons Learned

The team discovered several challenges in building the system:

EMR Learning Curve: Managing EMR required a deep understanding of Hadoop, YARN, HDFS, and Spark—knowledge the team initially lacked. For a young team of engineers with no prior domain knowledge, this was a significant hurdle to overcome. Issues still crop up occasionally, but the team now has a firm grasp on maintaining the cluster in a sustainable, scalable manner.

Debugging Complexity: Debugging proved difficult due to EMR cluster complexity. Without a deep understanding of Spark and Hadoop, investigations led to many red herrings and doubling back. Building proficiency with these underlying technologies was essential for efficient troubleshooting.

Scope Management: Initially excited by Hudi's extensive capabilities—such as time-travel querying and advanced table management—the team promised features that, while technically impressive, weren't core business requirements. This diluted the project scope and distracted from the primary goal: building a fast, simple Kafka ingestion pipeline. Through thorough testing and leveraging the Hudi community's expertise—which Daniel described as "the best I've ever encountered"—the team learned to align their scope more precisely with actual business needs.

The key lesson: properly assess business cases before committing to features. The team is now far more mature and capable of building future iterations, having gained this wealth of knowledge.

Concrete Achievements and Business Value

Despite the challenges, the transition to Kirby powered by Hudi delivered multiple benefits:

Real-Time Data Enablement: Pipeline refresh times dropped dramatically—from 30 to 60 minutes down to just 3 to 4 minutes. This enabled minute-level refreshes for analytics teams, providing fresher data and establishing a new paradigm for self-service analytics at the company.
Elimination of Backfilling Problems: Massive-scale backfills that previously took weeks now complete in hours. Instead of relying on hacky workarounds to backfill data into the lake, teams can now use the same tool they use for regular ingestion.
Automatic Schema Management: Schema evolution no longer requires manual work. What used to be an expensive full-load activity is now just an afterthought. Data producers have significantly increased confidence in their ability to evolve schemas without fear of breaking downstream pipelines.
Engineering Efficiency: The shift to decentralized ownership allows teams to build and maintain their own ingestion pipelines without relying on central platform engineers. Integration with DataHub enables automatic column-level lineage between Kafka topics and Athena tables. Engineering teams have drastically simplified their ingestion workflows, freeing up valuable time to focus on building quality data products.
Platform Modernization: Hudi served as the catalyst for the team to learn about and apply table formats to their platform for the first time. This knowledge has led them to identify a wealth of future use cases, such as migrating legacy pipelines to modern Change Data Capture (CDC) patterns using Hudi's upsert capabilities.

The Future with Hudi

Kirby currently serves approximately 80 topics across the UK and US, with a range of batch and streaming workloads. That volume is expected to grow to over 200 topics in the coming years. Future development focuses on stability and expansion:

Reduce EMR Reliance: The team hopes to move to lightweight compute services such as AWS Glue for less time-critical batch workloads.
S3 as a Source: To achieve 100% coverage, the team plans to leverage Hudi Streamer's S3 source capabilities, extending their unified ingestion interface to handle additional topic types beyond direct Kafka ingestion.
Extended Source Support: The team also wants to extend the familiar Kirby interface to support new source types, including SFTP, API-based sources, and other internal systems.

Project Kirby and Apache Hudi are poised to become the foundation of Funding Circle's future ingestion platform, transforming a complex, legacy system into a scalable, high-performance, and decentralized engine for data-driven decisions.

Conclusion

Funding Circle's presentation offered a clear and practical look at how adopting a modern data platform can solve significant real-world engineering challenges. By moving away from their legacy architecture and building Project Kirby around Apache Hudi, the team delivered a solution that is faster, more reliable, and far easier for teams across the organization to use. The core of their success was Hudi's ability to provide a unified data lake storage format, enabling crucial features like automatic schema evolution and near-real-time ingestion. Ultimately, Hudi catalyzed the maturation of the team's data platform and decentralized model, setting the stage for advanced use cases like Change Data Capture (CDC).

Using Amazon EMR DeltaStreamer to stream data to multiple Apache Hudi tables

Thu, 15 Jan 2026 00:00:00 GMT


ExternalSpillableMap: Handle Maps Too Big for Memory

Tue, 13 Jan 2026 00:00:00 GMT


Apache Hudi 1.1 Deep Dive: Async Instant Time Generation for Flink Writers

Fri, 09 Jan 2026 00:00:00 GMT

This blog was translated from the original blog in Chinese.

Background

Before the Hudi 1.1 release, in order to guarantee the exactly-once semantics of the Hudi Flink sink, a new instant could only be generated after the previous instant was successfully committed to Hudi. During this period, Flink writers had to block and wait. Starting from Hudi 1.1, we introduce a new asynchronous instant generation mechanism for Flink writers. This approach allows writers to request the next instant even before the previous one has been committed successfully. At the same time, it still ensures the ordering and consistency of multi-transaction commits. In the following sections, we will first briefly introduce some of Hudi's basic concepts, and then dive into the details of asynchronous instant time generation.

Instant Time

Timeline is a core component of Hudi's architecture. It serves as the single source of truth for the table's state, recording all operations performed on a table. Each operation is identified by a commit with a monotonically increasing instant time, which indicates the start time of each transaction.

Hudi provides the following capabilities based on instant time:

More efficient write rollbacks: Each Hudi commit corresponds to an instant time. The instant timestamp can be used to quickly locate files affected by failed writes.
File name-based file slicing: Since instant time is encoded into file names, Hudi can efficiently perform file slicing across different versions of files within a table.
Incremental queries: Each row in a Hudi table carries a _hoodie_commit_time metadata field. This allows incremental queries at any point in the timeline, even when full compaction or cleaning services are running asynchronously.

Completion Time

File Slicing Based on Instant Time

Before release 1.0, Hudi organized data files in units called FileGroup. Each file group contains multiple FileSlices. Each file slice contains one base file and multiple log files. Every compaction on a file group generates a new file slice. The timestamp in the base file name corresponds to the instant time of the compaction operation that wrote the file. The timestamp in the log file name is the same as the base instant time of the current file slice. Data files with the same instant time belong to the same file slice.

In concurrent write scenarios, the instant time naming convention of log files introduces certain limitations: as asynchronous compaction progresses, the base instant time can change. To ensure that writers can correctly determine the base instant time, the ordering between write commits and compaction scheduling must be enforced—meaning that compaction can only be scheduled when there are no ongoing write operations on the table. Otherwise, a log file might be written with an incorrect base instant time, potentially leading to data loss. As a result, compaction scheduling may block all writers in concurrency mode.

File Slicing Based on Completion Time

To address these issues, starting from version 1.0, Hudi introduced a new file slicing model based on a time interval defined by two timestamps that every commit now carries: requested time and completion time. All generated timestamps are globally monotonically increasing. The timestamp in the log file name is no longer the base instant time, but rather the requested instant time of the write operation. During the file slicing process, Hudi looks up the completion time for each log file using its requested instant time and applies a new file slicing rule:

A log file belongs to the file slice with the maximum base requested time that is less than or equal to the log file's completion time. [5]

The new file slicing mechanism is more flexible, allowing compaction scheduling to be completely decoupled from the writer's ingestion process. Based on this mechanism, Hudi has also implemented powerful non-blocking concurrency control. For details, refer to RFC-66 [5].
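The slicing rule quoted above can be written as a short lookup. This is a sketch under simplifying assumptions: timestamps are plain integers, the function name is made up for illustration, and a log file that completes before every base file is reported as unassigned.

```python
from bisect import bisect_right


def assign_log_file(base_requested_times, log_completion_time):
    """Return the base requested time of the owning file slice, or None if none qualifies.

    The owner is the slice with the maximum base requested time that is
    less than or equal to the log file's completion time.
    """
    ordered = sorted(base_requested_times)
    idx = bisect_right(ordered, log_completion_time)
    return ordered[idx - 1] if idx else None


bases = [100, 200, 300]  # requested times of three base files in one file group
print(assign_log_file(bases, 250))  # 200
print(assign_log_file(bases, 300))  # 300
print(assign_log_file(bases, 50))   # None
```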

LSM Timeline

The new file slicing mechanism requires efficient querying of completion time based on instant time. Starting from version 1.0, Hudi re-implemented the archived timeline. The new archived timeline organizes its data files in an LSM-tree structure, enabling fast range-filtering queries based on instant time and supporting efficient data skipping. For more details, refer to [6].

TrueTime

A critical premise of Hudi's timeline is that the timestamp generated for instants must be globally monotonically increasing without any conflicts. However, maintaining time monotonicity in distributed transactions has long been a thorny problem due to:

Clock skew: Physical clocks drift apart across machines.
Network latency: Communication delays between data centers.
Concurrent transactions: Difficulty in determining event ordering across regions.

To solve this problem, Google Spanner's TrueTime [1] uses a globally synchronized clock with bounded uncertainty, allowing it to assign timestamps monotonically without conflicts. Similarly, Hudi introduced the TrueTime API [2] starting from the 1.0 release. There are generally two approaches to realize TrueTime semantics:

A single shared time generator process or service, like Google Spanner's time service.
Each process generates its own time and then waits for at least the maximum expected clock drift across all processes before releasing it, with generation coordinated through a distributed lock.

Hudi's TrueTime API adopts the second approach: using a distributed lock to ensure only one process generates time at any given moment. The waiting mechanism ensures sufficient time passes so that the instant time is monotonically increasing globally.
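The lock-then-wait approach can be sketched in a few lines. This is an illustration only, not Hudi's code: a threading.Lock stands in for the distributed lock, and the drift bound and function name are assumed values chosen for the example.

```python
import threading
import time

MAX_CLOCK_DRIFT_SECONDS = 0.05  # hypothetical bound on clock skew between processes

_lock = threading.Lock()  # stand-in for a distributed lock


def generate_instant_time_ms(max_drift=MAX_CLOCK_DRIFT_SECONDS):
    """Read the local clock under the lock, then wait out the drift bound before releasing."""
    with _lock:
        now_ms = time.time_ns() // 1_000_000
        # Holding the lock while sleeping means the next generator cannot read its
        # clock until real time has advanced by at least the drift bound, so a
        # lagging clock elsewhere still yields a larger timestamp.
        time.sleep(max_drift)
        return now_ms


first = generate_instant_time_ms()
second = generate_instant_time_ms()
print(first < second)
```

The cost of this design is that every timestamp takes at least the drift bound to produce, which is acceptable when instants are generated far less often than records are written.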

Blocking Instant Time Generation for Flink Writers

For Flink streaming ingestion, each incremental transaction write can be roughly divided into the following stages:

Writers write records into the in-memory buffer.
When the buffer is full or a checkpoint is triggered, writers send a request to the coordinator for an instant time.
Writers create data files based on the instant time and flush the data to storage.
The coordinator commits the flushed files and metadata after receiving the ACK event for the successful checkpoint.

From Hudi's file slicing mechanism, we can see that the instant time is required before writers flush records into storage. Before 1.1, although the committing operation was performed asynchronously in the coordinator, the writer's request for a new instant would be blocked, because only after the previous instant is successfully committed can the coordinator create a new instant. Let's say a batch of records finished flushing with checkpoint ckp_1 in the writer at T1, and ckp_1 completed at time T2—the writer will be blocked in the time interval [T1, T2] (a new instant time is generated only after the ckp_1 completion event is handled and committed to the Hudi timeline). This blocking behavior ensures strict transaction ordering across multiple instants; however, it can lead to significant throughput fluctuations under large-scale workloads.

Async Instant Time Generation

To address throughput fluctuation caused by blocking instant generation, Hudi 1.1 introduces asynchronous instant generation for Flink writers. Using the aforementioned example, "async" means that before the previous instant is successfully committed to the timeline (during the time range [T1, T2]), the coordinator can create a new instant and return it to the writer for flushing a new batch of data. Thus, writers are no longer blocked during checkpoints, and the ingestion throughput will be more stable and smoother.

Overall write workflow:

Writer: Requests instant time from the coordinator before flushing data:
- Data flushing may be triggered either by a checkpoint or by the buffer being full, so the request carries the last completed checkpoint ID, rather than the current checkpoint ID.
- For a request made at the initial startup of a task, if it's recovered from state, the checkpoint ID is fetched from the state; otherwise, it's set to -1.
Coordinator: Responsible for instant generation. Each checkpoint ID corresponds to an instant. The coordinator maintains the mapping in the WriteMeta buffer: CheckpointID → { Instant → { Writer TaskID → Writer MetaEvent }}, where {Writer TaskID → Writer MetaEvent} is the mapping between each writer's parallel task ID and the file metadata written during the current checkpoint interval. Upon receiving an instant request from a writer:
- If an instant corresponding to the checkpoint ID already exists in the WriteMeta buffer, the cached instant is returned directly.
- Otherwise, a new instant is generated, added to the WriteMeta buffer, and then returned to the writer.
Writer: Completes data writing, generates data files, and sends the file metadata to the coordinator.
Coordinator: Upon receiving the file metadata from a writer, updates the WriteMeta buffer:
- If there is no existing metadata for the current writer task in the buffer, the metadata is written directly into the cache.
- If metadata for the current writer task already exists (i.e., the writer has flushed multiple times), the metadata is merged and then the cache is updated.
Coordinator: When the ACK event for a checkpoint (ID = n) is received, it commits the instants recorded in the WriteMeta buffer serially and in order. The commit scope includes all instants corresponding to checkpoint IDs < n. Once committed successfully, the instants are removed from the buffer.
- Since Flink's checkpoint ACK mechanism does not guarantee that the coordinator will receive an ACK for every checkpoint, the commit logic follows Flink's Checkpoint Subsume Contract [4]: if the ACK for checkpoint ckp_i is not received, its written metadata is subsumed into the pending metadata for the next checkpoint ckp_(i+1), and will be committed once the ACK for ckp_(i+1) is received.
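The coordinator side of this workflow can be modeled in a few dozen lines. The sketch below follows the steps as described here and is not the Flink or Hudi implementation: the class and method names are hypothetical, instants are plain counters, and file metadata is a list of file names.

```python
from itertools import count


class Coordinator:
    """Toy model of the WriteMeta buffer: checkpoint ID -> (instant, {task ID -> files})."""

    def __init__(self):
        self._clock = count(1)  # stand-in for monotonic instant time generation
        self.buffer = {}        # checkpoint_id -> (instant, {task_id: [file, ...]})
        self.committed = []     # (instant, metadata) pairs in commit order

    def request_instant(self, checkpoint_id):
        """Return the cached instant for this checkpoint ID, creating it if needed."""
        if checkpoint_id not in self.buffer:
            self.buffer[checkpoint_id] = (next(self._clock), {})
        return self.buffer[checkpoint_id][0]

    def handle_write_meta(self, checkpoint_id, task_id, files):
        """Store a writer's metadata, merging with earlier flushes from the same task."""
        _, per_task = self.buffer[checkpoint_id]
        per_task.setdefault(task_id, []).extend(files)

    def on_checkpoint_ack(self, n):
        """Commit, in order, every buffered instant whose checkpoint ID is below n."""
        for ckp in sorted(c for c in self.buffer if c < n):
            self.committed.append(self.buffer.pop(ckp))


coordinator = Coordinator()
first = coordinator.request_instant(-1)              # fresh start: checkpoint ID is -1
assert coordinator.request_instant(-1) == first      # a repeated request hits the cache
coordinator.handle_write_meta(-1, 0, ["f1"])
coordinator.handle_write_meta(-1, 0, ["f2"])         # second flush from task 0 is merged
second = coordinator.request_instant(0)              # next instant, nothing committed yet
coordinator.handle_write_meta(0, 1, ["f3"])

# The ACK for checkpoint 0 never arrives; the ACK for checkpoint 1 commits both instants.
coordinator.on_checkpoint_ack(1)
print(coordinator.committed)
```

Because commits are driven by "all checkpoint IDs below n", a missed ACK only delays a commit until the next ACK arrives.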

WriteMeta Failover

Considering that the coordinator cannot guarantee receiving every checkpoint ACK event, the in-memory WriteMeta buffer needs to be persisted to Flink state to prevent data loss after a task failover.

When designing the snapshot for the WriteMeta buffer, the following points need to be considered:

WriteMeta is initially stored in the writer's buffer. Only when a checkpoint is triggered or an eager flush occurs does the writer send the current WriteMeta to the coordinator.
During checkpointing, the coordinator takes a snapshot first, followed by the writers' snapshot. At this point, the coordinator's cache does not yet include the WriteMeta for the current checkpoint interval.

Based on the above considerations, the WriteMeta snapshot process for checkpoint i is as follows:

The coordinator first persists the WriteMeta for all checkpoint IDs < i to Flink state.
The writer sends the WriteMeta generated during checkpoint i to the coordinator, awaiting commit.
The writer cleans up the historical WriteMeta in state and persists the WriteMeta for checkpoint i to Flink state.

This approach of persisting the WriteMeta buffer ensures that metadata is not lost, while also preventing state bloat in the WriteMeta state.
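The three snapshot steps can be traced with plain dictionaries. This is a sketch under stated assumptions: Flink state is modeled as a dict, WriteMeta as a list of file names, and all names (including the function itself) are hypothetical.

```python
def snapshot_write_meta(i, coordinator_buffer, coordinator_state, writer_pending, writer_state):
    """Run the three snapshot steps for checkpoint i on in-memory stand-ins."""
    # Step 1: the coordinator persists WriteMeta for all checkpoint IDs < i.
    coordinator_state.clear()
    coordinator_state.update(
        {ckp: list(meta) for ckp, meta in coordinator_buffer.items() if ckp < i}
    )
    # Step 2: the writer sends the WriteMeta of checkpoint i to the coordinator.
    coordinator_buffer[i] = list(writer_pending)
    # Step 3: the writer drops older entries from its state and keeps only checkpoint i.
    writer_state.clear()
    writer_state[i] = list(writer_pending)
    writer_pending.clear()


coordinator_buffer = {1: ["f1"], 2: ["f2"]}  # uncommitted WriteMeta from earlier checkpoints
coordinator_state = {}
writer_state = {2: ["f2"]}                   # what the writer persisted at checkpoint 2
writer_pending = ["f3"]                      # written during the interval of checkpoint 3

snapshot_write_meta(3, coordinator_buffer, coordinator_state, writer_pending, writer_state)

# After a failover, both states together cover every uncommitted checkpoint.
recoverable = {**coordinator_state, **writer_state}
print(recoverable)  # {1: ['f1'], 2: ['f2'], 3: ['f3']}
```

The writer state holds a single checkpoint at any time, which is what keeps it from growing.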

Conclusion

The async instant time generation mechanism introduced in Hudi 1.1 is an important feature for improving the stability of Flink streaming ingestion. By eliminating the blocking dependency on the completion of the previous instant's commit, this mechanism solves the throughput fluctuation and backpressure problems under large-scale workloads. At the same time, it still maintains strong transactional guarantees and seamless integration with Flink's checkpoint semantics. This enhancement is fully backward-compatible and transparent to end users, enabling existing Flink streaming ingestion jobs to immediately benefit from smoother and more scalable ingestion with minimal changes. As the demands on real-time data lakes continue to grow, such innovations are key to building robust, high-performance lakehouse architectures.

References

[1] https://research.google/pubs/spanner-truetime-and-the-cap-theorem/

[2] https://hudi.apache.org/docs/next/timeline/#timeline-components

[3] https://github.com/apache/hudi/blob/master/rfc/rfc-66/rfc-66.md

[4] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/state/CheckpointListener.java

[5] https://github.com/apache/hudi/blob/master/rfc/rfc-66/rfc-66.md

[6] https://hudi.apache.org/docs/next/timeline/#lsm-timeline-history

Apache Hudi 2025: A Year In Review

Mon, 29 Dec 2025 00:00:00 GMT

As we close out 2025, it's clear that this has been a transformative year for Apache Hudi. The community continued to grow, delivered major advances in interoperability and performance, published "The Definitive Guide" book, and expanded our global reach through meetups, conferences, and adoption worldwide.

Community and Growth

2025 brought steady momentum in contributions and community engagement.

GitHub pull request activity remained steady throughout 2025, reflecting consistent development momentum. GitHub contributors reached over 500. Community engagement also grew: followers across social media platforms (LinkedIn, X, YouTube, and WeChat) increased to about 35,000, while Slack community users grew to close to 5,000.

The project celebrated new milestones in contributor recognition. Yue Zhang was elected to the Project Management Committee (PMC) for driving several major RFCs and features, evangelizing Hudi at meetups, and leading the effort to productionize hundreds of petabytes at JD.COM. Tim Brown was nominated as a committer for modernizing read/write operations behind Hudi 1.1's performance gains. Shawn Chang was nominated as a committer for strengthening Spark integration and expanding Hudi adoption on AWS EMR.

Development Highlights

Hudi 1.1 landed with over 800 commits from 50+ contributors. The headline feature is the pluggable table format framework — Hudi's storage engine is now pluggable, allowing its battle-tested transaction management, indexing, and concurrency control to work while storing data in Hudi's native format or other table formats like Apache Iceberg via Apache XTable (Incubating).

Performance saw major gains across the board. Parquet binary copy for clustering delivered 10-15x faster execution with 95% compute reduction. Apache Flink writer achieved 2-3x improved throughput with Avro conversion eliminated in the write path. Apache Spark metadata-table streaming ran ~18% faster for update-heavy workloads. Indexing enhancements — partitioned record index, partition-level bucket index, HFile caching, and Bloom filters — delivered up to 4x speedup for lookups on massive tables.

Spark 4.0 and Flink 2.0 support was added. Apache Polaris (Incubating) catalog integration enabled multi-engine queries with unified governance. Operational simplicity improved with storage-based locking that eliminated external dependencies. New merge modes replaced legacy payload classes with declarative options, and SQL procedures enhanced table management directly in Spark SQL. See more details in the release blog.

Hudi-rs expanded its feature support — release 0.3.0 introduced Merge-on-Read and incremental queries, while 0.4.0 added C++ bindings and Avro log file support. The native Rust implementation now powers Ray Data and Daft integrations for ML and multi-cloud analytics.

New Book Published

Apache Hudi: The Definitive Guide, published by O'Reilly, distills Hudi's 8+ years of innovation from 500+ contributors into a comprehensive resource for data engineers and architects.

Across 10 chapters, the guide covers writing and reading patterns, indexing strategies, table maintenance, concurrency handling, streaming pipelines with Hudi Streamer, and production deployment. With practical examples spanning batch, interactive, and streaming analytics, it takes you from getting started to building end-to-end lakehouse solutions at enterprise scale.

Meetups and Conferences

Community events brought together thousands of attendees across meetups and conferences worldwide.

Bangalore Hudi Community Meetup

A full-house gathering took place at Onehouse's India office in January, featuring deep dives into Hudi 1.0 and lakehouse architecture. The event connected developers, contributors, and enthusiasts from India's growing data engineering community.

1st Hudi Asia Meetup by Kuaishou

Our first-ever Asia meetup in March, organized by Kuaishou, drew 231 in-person attendees and 16,673 total platform views. The event featured discussions on Hudi's roadmap, lakehouse architecture patterns, and adoption stories from leading Chinese tech companies.

2nd Hudi Asia Meetup by JD.com

JD.com hosted the second Asia meetup in October, bringing together contributors and adopters for talks on real-world lakehouse implementations, streaming ingestion at scale, and roadmap discussions.

CMU Database Seminar

Vinoth Chandar, Hudi PMC Chair, presented "Apache Hudi: A Database Layer Over Cloud Storage for Fast Mutations & Queries" at Carnegie Mellon University's Future Data Systems Seminar. The talk covered how Hudi brings database-like abstractions to data lakes, enabling efficient mutations, transaction management, and fast incremental reads through design decisions in storage layout, indexing, and concurrency control. It is a must-watch that covers both the breadth and depth of Hudi's design concepts and the problems they solve.

OpenXData

Amazon engineers presented "Powering Amazon Unit Economics at Scale Using Hudi" at OpenXData, sharing how they built Nexus — a scalable, configuration-driven platform with Hudi as the cornerstone of its data lake architecture. The talk highlighted how Hudi enables Amazon to tackle the massive challenge of understanding and improving unit-level profitability across their ever-growing businesses.

VeloxCon

Shiyan Xu, Hudi PMC member, presented on Hudi integration with Velox and Gluten for accelerating query performance, and on how Hudi-rs, the native Rust implementation with multi-language bindings such as Python and C++, can enable high-performance analytics across the open lakehouse ecosystem.

Data Streaming Summit

Two talks showcased Hudi's streaming capabilities. Zhenqiu Huang shared how Uber runs 5,000+ Flink-Hudi pipelines, ingesting 600TB daily with P90 freshness under 15 minutes. Shiyan Xu presented on Hudi's high-throughput streaming capabilities such as record-level indexing, async table services, and the non-blocking concurrency control mechanism introduced in Hudi 1.0.

Open Source Data Summit

Shiyan Xu presented on streaming-first lakehouse designs, covering Merge-on-Read table type, record-level indexing, auto-file sizing, async compaction strategies, and non-blocking concurrency control that enable streaming lakehouses with optimized mutable data handling.

Content Highlights

Throughout 2025, organizations across industries showcased their production adoption and implementation journeys through community syncs and blogs. These stories highlight Hudi's versatility as a lakehouse platform.

Featured adoption stories:

Uber: From Batch to Streaming: Accelerating Data Freshness in Uber's Data Lake
Amazon: Powering Amazon Unit Economics at Scale Using Apache Hudi
Kuaishou: Hudi Lakehouse: The Evolution of Data Infrastructure for AI workloads
Southwest Airlines: Modernizing Data Infrastructure using Apache Hudi
Halodoc: Optimizing Apache Hudi Workflows: Automation for Clustering, Resizing & Concurrency
Uptycs: From Transactional Bottlenecks to Lightning-Fast Analytics

The community also published noteworthy content:

21 Unique Reasons Why Apache Hudi Should Be Your Next Data Lakehouse
The Future of Data Lakehouses: A Fireside Chat with Vinoth Chandar
Deep Dive into Hudi's Indexing Subsystem: part 1 and part 2
Hudi 1.1 Deep Dive: Optimizing Streaming Ingestion with Flink
Maximizing Throughput with Apache Hudi NBCC: Stop Retrying, Start Scaling
Exploring Apache Hudi’s New Log-Structured Merge (LSM) Timeline
Introducing Secondary Index in Apache Hudi
Lakehouse Chronicles YouTube episodes: ep. 5, ep. 6, and ep. 7
Interactive Jupyter Notebooks for hands-on learning

Looking Ahead

The upcoming releases in 2026 will bring AI/ML-focused capabilities — including unstructured data types, column groups for embeddings, vector search, and Lance/Vortex format support. We'll also continue expanding multi-format interoperability through the pluggable table format framework and advancing streaming-first optimizations for real-time ingestion and sub-minute freshness. See the full roadmap for details.

Whether you're contributing code, sharing feedback, or spreading the word — we'd love to have you involved:

Contribute on GitHub: Hudi & Hudi-rs
Join our Slack community
Follow us on LinkedIn and X (Twitter)
Subscribe to our YouTube channel
Follow WeChat account "ApacheHudi" for news and content in Chinese
Participate in the developer syncs, community syncs, and office hours
Subscribe to the dev mailing list (by sending an empty email): dev-subscribe@hudi.apache.org

A huge thank you to everyone who contributed to Hudi's growth this year. With such a strong foundation, 2026 promises to be our most exciting year yet.

How Zupee Cut S3 Costs by 60% with Apache Hudi

Mon, 22 Dec 2025 00:00:00 GMT

This post summarizes Zupee's talk from the Apache Hudi community sync. Watch the recording on YouTube.

Zupee is India's largest skill-based Ludo platform (Ludo is a classic Indian board game), founded in 2018 with a vision of bringing moments of joy to users through meaningful entertainment. The company was the first to introduce a skill element to culturally relevant games like Ludo, reviving the joy of traditional Indian gaming.

At Zupee, data plays a crucial role in everything they do—from understanding user behavior to optimizing services. It sits at the core of their decision-making process. In this community sync, Amarjeet Singh, Senior Data Engineer at Zupee, shared how his team built a scalable data platform using Apache Hudi and the significant performance gains they achieved.

Data Platform Architecture

Zupee's data platform architecture is designed to handle complex data needs efficiently. It consists of several layers working together:

Three-Tiered Data Lake

The data lake is structured into three zones:

Landing Zone: Where raw data first arrives—like a receiving dock where all incoming data is stored in its original format.
Exploration Zone: Data is cleaned and prepared for analysts and data scientists to explore and derive insights.
Analytical Zone: Processed data optimized for analytical queries. This layer stores OLAP tables, facts and dimensions, or denormalized wide tables.

Metastore and API Layer

This layer acts as the brain of the data platform, managing metadata and providing APIs for data access and integration.

Orchestration and Framework Layer

The orchestration layer includes tools like Apache Airflow for scheduling and managing workflows, along with in-house tools and frameworks for ML operations, data ingestion, and data computation—including Hudi Streamer for data ingestion.

Compute Layer

The compute layer includes Apache Spark for large-scale data processing, and Amazon Athena and Trino for querying.

Real-Time Serving Layer

For real-time data needs, Zupee uses Apache Flink as a powerful streaming framework to power their feature store and enable real-time model predictions.

The serving layer includes real-time dashboards powered by Flink, analytical dashboards powered by Athena and Trino, and Jupyter notebooks for ad-hoc analysis by data scientists and machine learning teams.

Workflow-Based Data Ingestion

Zupee designed a data ingestion pipeline that provides smooth integration and scalability. At the heart of their approach is a centralized configuration system that gives them granular control over every aspect of their jobs.

Centralized Configuration

The team uses YAML files to centrally manage job details and tenant-level settings. This promotes consistency and makes updates easy. The centralized system is hosted on Amazon EMR, ensuring uniformity across all ingestion jobs.

Multi-Tenant Pipeline

Zupee runs a multi-tenant setup of generic pipelines. They can switch between different versions of Spark, Hudi Streamer, and Scala—all controlled at the tenant level.

Automated Spark Command Generation

The system automatically generates Spark commands based on YAML configurations. This significantly reduces manual intervention, minimizes errors, and accelerates the development process.

The Ingestion Flow

Here's how the workflow operates:

A generic job reads a YAML file containing specific job information and tenant-level configurations.
A single trigger starts each pipeline.
The generic job creates Spark configurations and generates a spark-submit command with Hudi Streamer settings.
The command is submitted using Livy or EMR steps.
Throughout the process, the team tracks data lineage for monitoring and debugging.

This workflow-based approach streamlines data ingestion, making it both scalable and reliable for handling large volumes of data.
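The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Zupee's actual framework: the job dictionary stands in for a parsed YAML file, and its keys, the bucket paths, and the build_spark_submit helper are all hypothetical. The HoodieStreamer class name and the --table-type, --source-class, --target-base-path, --target-table, --props, and --hoodie-conf flags are existing Hudi Streamer options.

```python
import shlex

# Stand-in for a parsed YAML job file; every key and path here is hypothetical.
job = {
    "tenant": "payments",
    "spark_conf": {"spark.executor.memory": "4g"},
    "hudi_utilities_jar": "s3://example-bucket/jars/hudi-utilities-bundle.jar",
    "table_type": "MERGE_ON_READ",
    "source_class": "org.apache.hudi.utilities.sources.JsonKafkaSource",
    "target_base_path": "s3://example-bucket/lake/payments/events",
    "target_table": "events",
    "props_file": "s3://example-bucket/conf/events.properties",
    "hoodie_conf": {"hoodie.metadata.enable": "true"},
}


def build_spark_submit(job: dict) -> list[str]:
    """Turn one job definition into a spark-submit argument list."""
    cmd = ["spark-submit", "--class",
           "org.apache.hudi.utilities.streamer.HoodieStreamer"]
    # Spark-level settings come first, before the application JAR.
    for key, value in sorted(job.get("spark_conf", {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    # The JAR is chosen per job, so different jobs can pin different Hudi versions.
    cmd.append(job["hudi_utilities_jar"])
    # Hudi Streamer arguments follow the JAR.
    cmd += [
        "--table-type", job["table_type"],
        "--source-class", job["source_class"],
        "--target-base-path", job["target_base_path"],
        "--target-table", job["target_table"],
        "--props", job["props_file"],
    ]
    for key, value in sorted(job.get("hoodie_conf", {}).items()):
        cmd += ["--hoodie-conf", f"{key}={value}"]
    return cmd


command = build_spark_submit(job)
print(shlex.join(command))
```

Keeping the command as a list until the last moment avoids quoting mistakes; the same list could be handed to Livy or an EMR step definition instead of being printed.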

Real-Time Ingestion with Hudi Streamer

Zupee uses Hudi Streamer, a utility built on checkpoint-based ingestion. It supports various sources, including distributed file systems (S3, GCS), Kafka, and database sources such as MySQL and MongoDB.

Key Benefits

Checkpoint-based consistency: Ensures the ability to resume from the last checkpoint in case of interruption.
Easy backfills: Reprocess historical data efficiently based on checkpoints, without re-ingesting the entire dataset—saving both time and resources.
Data catalog syncing: Automatically syncs metadata to catalogs such as AWS Glue and Hive Metastore.
Built-in transformations: Supports SQL transformations and flat transformations, allowing complex logic to be applied on the fly.
Custom transformations: Developers can build custom transformation classes for specific requirements.

Deep Dive: How Hudi Streamer Works

Here's how Hudi Streamer processes data internally:

Job Submission: A Spark job is submitted with Hudi properties specifying primary keys, checkpointing details, and other configurations.
StreamSync Wrapper: The job creates a StreamSync wrapper (formerly DeltaSync) based on the provided properties.
Continuous or Single Run: StreamSync initiates a synchronized process that can run continuously or as a single batch, depending on the parameters.
Checkpoint Retrieval: Hudi Streamer checks if a checkpoint exists and retrieves the last position from the Hudi commit metadata. For S3-based ingestion, it checks the last modified date; for Kafka sources, it retrieves the last committed offset.
Transformation: If transformations are configured (single or chained), they are applied to the source data based on the format (JSON, Avro, etc.).
Write and Compaction: Data is written to the Hudi table. For Merge-On-Read tables, compaction runs inline or asynchronously to merge log files with base files.
Metadata Sync: Finally, if enabled, metadata is synced to ensure catalog consistency.
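The checkpoint-driven part of these steps can be illustrated with a toy model. The sketch below is not Hudi Streamer's code: ToyTable, sync_once, and the list-based source are hypothetical stand-ins that only show the idea of storing the checkpoint in the same commit as the data, so a restarted job resumes where the last successful commit ended.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """In-memory stand-in for a table: written rows plus commit metadata."""
    rows: list = field(default_factory=list)
    commits: list = field(default_factory=list)  # each commit stores a checkpoint

    def last_checkpoint(self) -> int:
        # No commits yet means ingestion starts from the beginning of the source.
        return self.commits[-1]["checkpoint"] if self.commits else 0


def sync_once(source: list, table: ToyTable, transform=None, batch_size: int = 3) -> int:
    """One sync round: resume from the checkpoint, transform, write, commit."""
    start = table.last_checkpoint()
    batch = source[start:start + batch_size]
    if not batch:
        return 0  # nothing new in the source
    if transform is not None:
        batch = [transform(record) for record in batch]
    table.rows.extend(batch)
    # The new checkpoint is recorded together with the written data.
    table.commits.append({"checkpoint": start + len(batch)})
    return len(batch)


source = [{"id": i, "amount": i * 10} for i in range(5)]
table = ToyTable()
add_flag = lambda r: {**r, "is_large": r["amount"] >= 30}

# "Continuous" mode in miniature: keep syncing until the source is drained.
while sync_once(source, table, transform=add_flag):
    pass
```

A single call to sync_once corresponds to a one-off batch run; the loop corresponds to continuous mode.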

Custom Solutions

The Zupee team developed several custom solutions to enhance their Hudi Streamer pipeline:

Dynamic Schema Generator: A custom class that dynamically generates schemas for JSON sources, enabling automatic schema creation based on incoming data.
Post-Process Transformations: Custom transformations to handle schema evolution at the source level.
Centralized Configurations: Hudi configurations managed centrally via YAML files or hoodie-config.xml, simplifying maintenance and updates.
Raw Data Handling: For raw data ingestion, they discovered that Hudi Streamer can infer schemas automatically without requiring a schema provider.

Results: Cost Savings and Performance Gains

The migration to a Hudi-powered platform delivered significant outcomes:

60% Reduction in S3 Network Costs

After migrating from Hudi 0.10.x to 0.12.3 and enabling the Metadata Table, Zupee reduced S3 network costs by over 60%. The metadata table eliminates expensive S3 file listing operations by maintaining an internal index of all files. The key settings are hoodie.metadata.enable=true on the Spark writer and the hudi.metadata-listing-enabled=true table property for Athena.

15-Minute Ingestion SLA

The team achieved a 15-minute SLA for ingesting 2-5 million records using Merge-On-Read (MOR) tables with Hudi's indexing for efficient record lookups during upserts.

30% Storage Reduction

Switching from Snappy to ZSTD compression resulted in a 30% decrease in data size. While write times increased slightly, query performance improved significantly, and both storage and Athena costs decreased.

Small File Management

Async compaction handles the small file problem by consolidating smaller files into larger ones, configured via hoodie.parquet.small.file.limit and hoodie.parquet.max.file.size. The team also explored Parquet page-level indexing to reduce query costs by allowing engines to read only relevant pages from files.
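As a rough sketch, the writer-side settings mentioned in this section could be collected in one place as shown below. The option keys are existing Hudi configs, but the two size values are illustrative rather than Zupee's actual settings, and the to_properties helper is hypothetical.

```python
# Writer-side options discussed in this section. Size values are illustrative.
hudi_table_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.metadata.enable": "true",                           # metadata table
    "hoodie.parquet.compression.codec": "zstd",                 # instead of snappy
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # bytes
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # bytes
}


def to_properties(options: dict) -> list[str]:
    """Render options as key=value lines, for example for a properties file."""
    return [f"{key}={value}" for key, value in sorted(options.items())]


for line in to_properties(hudi_table_options):
    print(line)
```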

Why Hudi Over Other Table Formats?

During the Q&A, Amarjeet explained why Zupee chose Hudi over other table formats. The team ran POCs with Delta Lake but found that for near-real-time ingestion, Hudi performs much better. Since Zupee primarily works with real-time data rather than batch workloads, this was the deciding factor.

The native Hudi Streamer utility also adds significant value with its built-in checkpoint management, compaction, and catalog syncing—features that would require additional work with other ingestion approaches.

Best Practices for EMR Upgrades

When asked about upgrading Hudi on EMR, Amarjeet shared their approach: instead of using the EMR-provided JARs, they use open-source JARs. This way, they can upgrade JARs using their multi-tenant framework, which controls which JAR goes to which job. They can also use different Spark versions for different jobs. For example, if they are currently running Hudi 0.12.3 and want to test version 0.14.1, they simply specify the JAR in their YAML file for that particular job.

Conclusion

Zupee's journey with Apache Hudi demonstrates how a modern data platform can solve real-world engineering challenges at scale. By moving from legacy architecture to a Hudi-powered lakehouse, they built a platform that is robust, scalable, and cost-effective.

The keys to their success were:

Enabling Hudi's Metadata Table to eliminate file listing overhead
Using Merge-On-Read tables with indexing for efficient upserts
Building custom transformations for flexible schema evolution
Centralizing configuration management for operational simplicity

Watch the full presentation on YouTube. Ready to optimize your own data platform? Get started with Hudi and explore Hudi Streamer for your ingestion needs.

Maximizing Throughput with Apache Hudi NBCC: Stop Retrying, Start Scaling

Tue, 16 Dec 2025 00:00:00 GMT

Data lakehouses often run multiple concurrent writers—streaming ingestion, batch ETL, maintenance jobs. The default approach, Optimistic Concurrency Control (OCC), assumes conflicts are rare and handles them through retries. That assumption breaks down in increasingly common scenarios, such as running maintenance batch jobs on tables receiving streaming writes. When conflicts become the norm, retries pile up with OCC, and the write throughput tanks.

Hudi introduced Non-Blocking Concurrency Control (NBCC) in release 1.0, solving this problem by allowing writers to append data files in parallel and using the write completion time to determine the serialization order for reads or merges. We'll explore why OCC struggles under real-world concurrency, how NBCC works under the hood, and how to configure NBCC in your pipelines.

The Problem with Retries

Picture this scenario: your streaming pipeline ingests clickstream data every minute from multiple Kafka topics. A nightly GDPR deletion job kicks off at midnight, scanning across thousands of partitions to purge user records—also touching data files the ingestion pipeline is actively writing to. By 3 AM, you get paged—the deletion job has failed repeatedly, burning compute resources while the ingestion writer keeps winning the race to commit.

OCC assumes conflicts are rare, an assumption that held in traditional batch-oriented data lakes where jobs were scheduled sequentially. The reasoning is that most transactions will not overlap, so OCC lets them proceed optimistically and checks for conflicts only at commit time. High-frequency streaming breaks this assumption: with minute-level ingestion plus long-running maintenance jobs, overlapping writes are not the exception but the norm.

This is a classic concurrency anti-pattern: under OCC, conflict probability grows with transaction duration. Long-running jobs competing against frequent short writes lose nearly every commit race and retry indefinitely. When both concurrent writers are running ingestion, without careful coordination between the writers (e.g., segregating writers by partitions), the consequences become more severe: conflicts occur more often, overall throughput is reduced, and compute costs increase. The key insight is that retries are the throughput killer—we need a fundamentally different approach.
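A back-of-the-envelope model makes the duration effect concrete. Assume, purely for illustration, that each short commit landing while the long job runs conflicts with it independently with probability p. A job that overlaps n such commits then hits at least one conflict with probability 1 - (1 - p)^n. The independence assumption and the numbers below are hypothetical, not measurements.

```python
def conflict_probability(per_commit_conflict: float, overlapping_commits: int) -> float:
    """Chance that a long job conflicts with at least one overlapping commit."""
    return 1.0 - (1.0 - per_commit_conflict) ** overlapping_commits


# One ingestion commit per minute, so overlapping commits ~ job duration in minutes.
for minutes in (1, 10, 60):
    print(minutes, round(conflict_probability(0.05, minutes), 3))
```

Under this model, even a 5% chance per overlapping commit pushes a one-hour job above a 95% chance of failing validation, which is why retrying the same long job rarely helps.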

Hudi NBCC: Write in Parallel, Serialize by Completion Time

NBCC avoids conflicts by design: let every writer append updates to Hudi’s log files in the Merge-on-Read (MOR) table, then let readers or mergers follow the serialization order based on write completion time. Let's say there are two writers, both updating a record concurrently. Under NBCC, each writer produces its own log file containing the update. Since there's no file contention, there's nothing to conflict on. At read time or during compaction, Hudi follows the write completion time and processes the associated log files in the proper order.

Both OCC and NBCC require locking: OCC during commit validation, NBCC during timestamp generation. The key difference is how long the lock is held and what happens afterward. OCC holds the lock while validating: for concurrent commits, it compares the sets of written files to detect conflicts, so validation time grows with transaction size. If validation detects a conflict, the losing writers discard their completed work and retry. NBCC's lock duration is a negligible constant (a configurable clock skew duration, 200ms by default) regardless of transaction size: acquire lock, generate timestamp, sleep for clock skew, release. No file-level validation, no conflict detection, no retries.

	OCC	NBCC
On conflict	Abort and retry	No conflicts—each writer appends separately
Lock duration	Scales with the number of written files to validate	Constant (brief clock skew duration)
Resource waste	High	Nearly none

Hudi supports both OCC and NBCC for multi-writer scenarios. Hudi also offers early conflict detection for OCC, which can reduce wasted work by failing faster. However, OCC's validation lock duration still exceeds NBCC's timestamp generation time, and retries still occur after conflicts are detected—both impacting overall write throughput.

How NBCC Works Under the Hood

Hudi NBCC relies on several design foundations to enable conflict-free concurrent writes and maximize throughput.

Record Keys and File Groups

Hudi organizes data into file groups, where records with the same record key always route to the same file group. Hudi uses indexes to efficiently route records to file groups. For MOR tables, updates don't rewrite base files—instead, writers append updates to log files within the file group.

This record colocation is a key foundation for making NBCC possible. Records and their updates will always be routed to the same file group based on record keys, either in base files or log files—all associated with the same file ID that identifies the file group. The record key to file group mapping and the file ID association support read and merge operations by efficiently locating files to process. Also, concurrent Hudi writers use timestamps and write tokens to generate non-conflicting file names within each file group, and thus data writing can be non-blocking.
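The routing idea can be sketched with a simple hash-based mapping, similar in spirit to a bucket index. This is an illustration only: CRC32 is a stand-in hash rather than the function Hudi uses, and the log file name format is hypothetical. The point is that a key always lands in the same file group, while two writers targeting that file group still produce distinct file names.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical number of file groups per partition


def file_group_for(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a file group id. CRC32 is a stand-in hash."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def log_file_name(file_id: int, instant_time: str, write_token: str) -> str:
    """Toy log file name built from file id, timestamp, and write token."""
    return f"fg-{file_id}_{instant_time}_{write_token}.log"


file_id = file_group_for("user-42")
# Two writers updating the same key write to the same file group, in different files.
name_a = log_file_name(file_id, "20251216000001", "writer-a-0")
name_b = log_file_name(file_id, "20251216000002", "writer-b-0")
print(name_a, name_b)
```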

Completion Time: Serializing Concurrent Writes

With NBCC, concurrent writers produce log files whose write transactions overlap in time. To process these files correctly, we need a proper serialization order. Consider: Writer A starts deltacommit 1 at T1 and completes at T5; Writer B starts deltacommit 2 at T2 and completes at T4. If we order by start timestamp, deltacommit 1 would be processed first—but it actually finished later than deltacommit 2. Completion time reflects the correct order for processing the files.

Tracking the completion time is critical for NBCC. Concurrent writers flush records to files in parallel without any guarantee of completion order based on the start time. Hudi timeline tracks when each write is actually completed, enabling the correct serialization order for the writes.
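The two-writer example above can be written out directly. The DeltaCommit class is a hypothetical stand-in for a timeline entry, with T1 through T5 represented as integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeltaCommit:
    name: str
    start_time: int
    completion_time: int


commits = [
    DeltaCommit("deltacommit 1", start_time=1, completion_time=5),  # Writer A
    DeltaCommit("deltacommit 2", start_time=2, completion_time=4),  # Writer B
]

# Ordering by start time puts deltacommit 1 first, although it finished last.
by_start = [c.name for c in sorted(commits, key=lambda c: c.start_time)]
# Ordering by completion time reflects when each write actually became complete.
by_completion = [c.name for c in sorted(commits, key=lambda c: c.completion_time)]

print(by_start)
print(by_completion)
```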

TrueTime-like Timestamp Generation

Distributed writers running on different machines face clock skew—their local clocks may differ by tens or even hundreds of milliseconds. Without coordination, two writers could generate the same timestamp or produce incorrect ordering.

Hudi solves this with a TrueTime-like mechanism inspired by Google Spanner.

The process works as follows:

Acquire a distributed lock
Generate timestamp using local clock
Sleep for X milliseconds
Release lock

The sleep accounts for the worst-case clock skew between writers. By waiting longer than the maximum expected skew before releasing the lock, Hudi guarantees that any subsequent writer will generate a strictly greater timestamp.

These monotonically increasing timestamps ensure that concurrent writers never produce conflicting or out-of-order commits—achieving transaction serializability and completing the foundation for NBCC.
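The four steps can be sketched in a single process as follows. This is a simplified illustration rather than Hudi's implementation: a threading.Lock stands in for the distributed lock, and the TimestampGenerator class and its parameter names are hypothetical. The 200 ms default mirrors the default clock skew duration mentioned earlier.

```python
import threading
import time


class TimestampGenerator:
    """Toy TrueTime-like generator: lock, read clock, wait out skew, unlock."""

    def __init__(self, max_clock_skew_ms: int = 200):
        self._lock = threading.Lock()  # stand-in for a distributed lock
        self._skew_seconds = max_clock_skew_ms / 1000.0

    def next_timestamp_ms(self) -> int:
        with self._lock:                       # 1. acquire the lock
            ts = int(time.time() * 1000)       # 2. read the local clock
            time.sleep(self._skew_seconds)     # 3. sleep for the maximum skew
            return ts                          # 4. lock is released on exit


generator = TimestampGenerator(max_clock_skew_ms=20)
first = generator.next_timestamp_ms()
second = generator.next_timestamp_ms()
print(first, second)
```

Because the lock is held for the whole sleep, the next caller cannot read its clock until the skew window has passed, so its timestamp is later even if its clock runs slightly behind.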

Supporting Designs

In scenarios suited for NBCC, we may encounter long commit histories and need to properly merge records. Hudi 1.0 introduced two supporting designs that complement NBCC.

LSM Timeline: High-frequency streaming can produce millions of commits over time. Listing and parsing individual timeline files would be prohibitively slow, and the storage overhead can become a headache. Hudi timeline uses an LSM Tree structure—archiving older commits into sorted and compacted Parquet files for efficient lookups and reduced storage footprint.

Merge Modes: When log files from concurrent writers need to be merged during reads or compaction, we need proper record-merging logic. Hudi supports flexible merge modes that control how records with the same key are resolved—whether to keep the latest by commit time, respect user-defined ordering fields, or apply custom merge functions.
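To make the difference between merge modes concrete, here is a toy resolver for one record key. The mode names follow the values of Hudi's hoodie.record.merge.mode option, but the merge function and the record layout are hypothetical and only illustrate the ordering rule each mode applies.

```python
def merge(versions: list[dict], mode: str) -> dict:
    """Pick the surviving version among updates to the same record key."""
    if mode == "COMMIT_TIME_ORDERING":
        # The most recently completed write wins.
        return max(versions, key=lambda v: v["completion_time"])
    if mode == "EVENT_TIME_ORDERING":
        # The highest ordering field wins; completion time breaks ties.
        return max(versions, key=lambda v: (v["event_ts"], v["completion_time"]))
    raise ValueError(f"unsupported merge mode: {mode}")


versions = [
    {"value": "new event", "event_ts": 100, "completion_time": 5},
    # A late-arriving older event that was written afterwards.
    {"value": "late old event", "event_ts": 90, "completion_time": 7},
]

by_commit = merge(versions, "COMMIT_TIME_ORDERING")
by_event = merge(versions, "EVENT_TIME_ORDERING")
print(by_commit["value"], "|", by_event["value"])
```

With late-arriving data, commit-time ordering keeps whatever was written last, while event-time ordering keeps the record with the newest ordering value.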

Using NBCC

With the design foundations covered, let's see how NBCC works in practice.

NBCC in Action

NBCC allows concurrent writers to append data to log files in a non-conflicting way. Subsequent read or merge operations then follow the write completion time to process the files. Take compaction as an example.

Consider two writers: Writer A creates deltacommit 1 (starts at T1, completes at T5); Writer B creates deltacommit 2 (starts at T2, completes at T3). When compaction is scheduled at T4, the planner only includes files from deltacommit 2, since its completion time (T3) is earlier than the compaction schedule time (T4). Deltacommit 1, though started earlier, is excluded because it hadn't completed yet when compaction was planned—its files will be included in a later compaction.

Snapshot reads follow the same rules. A query at T4 includes data from deltacommit 2 but excludes deltacommit 1, which hasn't finished yet. A query after T5 includes both deltacommits, with log files read in completion time order and records merged according to the configured merge mode.
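The visibility rule for both compaction planning and snapshot reads can be sketched as a filter on completion time. The visible_commits helper and the dictionary layout are hypothetical, with T1 through T5 represented as integers.

```python
def visible_commits(commits: list[dict], as_of: int) -> list[dict]:
    """Commits completed before `as_of`, in completion-time order."""
    done = [c for c in commits if c["completed"] < as_of]
    return sorted(done, key=lambda c: c["completed"])


commits = [
    {"name": "deltacommit 1", "started": 1, "completed": 5},  # Writer A
    {"name": "deltacommit 2", "started": 2, "completed": 3},  # Writer B
]

# Compaction planned at T4 (or a query at T4) sees only deltacommit 2.
plan_at_t4 = [c["name"] for c in visible_commits(commits, as_of=4)]
# A query after T5 sees both, ordered by completion time.
read_after_t5 = [c["name"] for c in visible_commits(commits, as_of=6)]

print(plan_at_t4)
print(read_after_t5)
```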

Configuration

NBCC requires Hudi 1.0+ and a lock provider for TrueTime-like timestamp generation. Common options for lock providers include ZooKeeper-based and DynamoDB-based, which integrate with existing infrastructure many organizations already run. Hudi 1.1 introduced the storage-based lock provider, which uses cloud storage conditional writes (S3, GCS) and requires no external server—an option with less operational overhead for cloud-native deployments.

hudi_writer_options = {
  'hoodie.write.concurrency.mode': 'NON_BLOCKING_CONCURRENCY_CONTROL',
  'hoodie.write.lock.provider': 'org.apache.hudi.client.transaction.lock.StorageBasedLockProvider',
}

To enable NBCC for your concurrent writers, configure the concurrency mode and lock provider options for each writer, as shown in the example above.

When to Use NBCC

If you are running multiple concurrent streaming writers, or running streaming ingestion with batch maintenance jobs like GDPR deletion, NBCC is more suitable than OCC. The table below summarizes some common examples:

Use Case	Recommendation	Why
Batch ETL with single writer or multiple coordinated writers	OCC is fine	No concurrency conflicts
Multiple concurrent streaming writers	NBCC	Avoid retry storms
Mixed streaming + batch maintenance	NBCC	Long-running jobs will not starve
Copy-on-Write (COW) tables with infrequent updates	OCC is fine	COW rewrites base files anyway
MOR tables with frequent updates	NBCC	Maximum benefit from log file separation

Hudi NBCC is designed specifically for MOR tables. COW tables rewrite entire base files on updates, so file-level conflicts are unavoidable regardless of concurrency control mode. As of now, NBCC is restricted to working with tables using the simple bucket index or partition-level bucket index. Learn more from the concurrency control docs page.
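The constraints above lend themselves to a quick pre-flight check on writer options. The helper below is a hypothetical sketch, not part of Hudi; it assumes Spark-style option keys (hoodie.datasource.write.table.type and hoodie.index.type) and only checks the requirements named in this post.

```python
def nbcc_config_problems(options: dict) -> list[str]:
    """Return human-readable problems that would prevent using NBCC."""
    problems = []
    if options.get("hoodie.write.concurrency.mode") != "NON_BLOCKING_CONCURRENCY_CONTROL":
        problems.append("concurrency mode is not NON_BLOCKING_CONCURRENCY_CONTROL")
    if options.get("hoodie.datasource.write.table.type") != "MERGE_ON_READ":
        problems.append("NBCC is designed for MERGE_ON_READ tables")
    if options.get("hoodie.index.type") != "BUCKET":
        problems.append("NBCC currently requires a bucket index")
    if not options.get("hoodie.write.lock.provider"):
        problems.append("a lock provider is needed for timestamp generation")
    return problems


writer_options = {
    "hoodie.write.concurrency.mode": "NON_BLOCKING_CONCURRENCY_CONTROL",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.StorageBasedLockProvider",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.index.type": "BUCKET",
}

print(nbcc_config_problems(writer_options))
print(nbcc_config_problems({**writer_options,
                            "hoodie.datasource.write.table.type": "COPY_ON_WRITE"}))
```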

Summary

OCC assumes conflicts are rare. When you mix high-frequency streaming with long-running maintenance jobs, OCC's retry-on-conflict model breaks down—causing wasted compute, reduced throughput, and job starvation. Retries are the throughput killer.

NBCC takes a different approach: let every writer succeed by appending to separate log files, then follow the write completion time for reads and compaction. Three design foundations make this possible—record keys and file groups that colocate records and their updates, completion time tracking that properly orders overlapping write transactions, and TrueTime-like timestamp generation that guarantees monotonically increasing timestamps across distributed writers.

The result: maximum throughput for concurrent writes in Hudi pipelines. Long-running jobs complete without being starved, multiple ingestion pipelines coexist without contention, and your data platform scales without coordination overhead. Stop retrying, start scaling—see the docs to get started.

From Batch to Streaming: Accelerating Data Freshness in Uber's Data Lake

Fri, 12 Dec 2025 00:00:00 GMT


Apache Hudi 1.1 Deep Dive: Optimizing Streaming Ingestion with Apache Flink

Wed, 10 Dec 2025 00:00:00 GMT

This blog was translated from the original blog in Chinese.

Background

With the rise of real-time data processing, streaming ingestion has become a critical use case for Apache Hudi. Apache Flink, a robust stream processing framework, is integrated with Hudi to support near-real-time data ingestion. The Flink integration already provides powerful and comprehensive capabilities, such as exactly-once guarantees backed by Flink's checkpointing mechanism, flexible write modes, and rich index management. However, as data volumes scale into petabytes, achieving optimal streaming ingestion performance becomes a challenge, and jobs can suffer from backpressure and high resource costs.

There are multiple factors that impact streaming ingestion performance, such as the network shuffle overhead between Flink operators, SerDe costs within Hudi writers, and GC issues caused by memory management of in-memory buffers. With Hudi 1.1, meticulous refactoring and optimization work has been conducted to solve these problems, significantly enhancing the performance and stability of streaming ingestion with Flink.

In the subsequent sections, several key performance optimizations are introduced, including:

Optimized SerDe between Flink operators
New performant Flink-native writers
Eliminated bytes copy for MOR log file writing

Following that, performance benchmarks for streaming ingestion in Hudi 1.1 are presented to demonstrate the concrete improvements achieved.

Optimized SerDe Between Flink Operators

Before Hudi 1.1, Avro was the default internal record format in the Flink write and read paths. The first step in the write pipeline was to convert Flink RowData into an Avro record payload, which was then used to create Hudi's internal HoodieRecord for subsequent processing.

Almost every Flink job has to exchange data between its operators. When operators are not chained together (chained operators run in the same JVM process), records must be serialized to bytes before being sent over the network to the downstream operator. This shuffle serialization alone can be quite costly if not done efficiently, so serialization often shows up among the top consumers of CPU cycles in a job's profiler output. Flink has an out-of-the-box serialization framework with performant internal serializers for basic types, such as primitive types and the row type. However, for generic types such as HoodieRecord, Flink falls back to its default Kryo-based serializer, which exhibits poor serialization performance.

RFC-84 proposed an improvement on SerDe between operators in the Hudi write pipeline based on the extensible type system of Flink:

HoodieFlinkInternalRow: an object to replace HoodieRecord during data shuffle, which contains RowData as the data field and some necessary metadata fields, e.g., record key, partition path, etc.
HoodieFlinkInternalRowTypeInfo: a customized Flink type information for HoodieFlinkInternalRow.
HoodieFlinkInternalRowSerializer: a customized and efficient Flink TypeSerializer for the SerDe of HoodieFlinkInternalRow.

With the customized Flink-native data structure and the companion serializer, the average streaming ingestion throughput can be boosted by about 25%.

New Performant Flink-Native Writers

Historically, the Hudi Flink writer used HoodieRecord with data serialized using Avro to represent incoming records. While this unified format worked across engines, it came at a performance cost:

Redundant SerDe: Take MOR table ingestion as an example. Flink reads records as RowData, which are converted to Avro GenericRecord and then serialized to Avro bytes for the internal HoodieAvroRecord. During the later log writing, the Avro bytes in HoodieAvroRecord are deserialized back into Avro IndexedRecord for further processing before being appended to log files. Using Avro as the intermediate record representation therefore introduces significant SerDe overhead.
Excess memory usage: The buffer in the writers is a Java List with intermediate Avro objects which will be released after being flushed to disk. These objects can increase heap usage and GC pressure, particularly under high-throughput streaming workloads.

RFC-87: A Flink-Native Write Path

RFC-87 proposes and implements a shift in data representation: instead of transforming RowData to GenericRecord, the Flink writer now directly wraps RowData inside a specialized HoodieRecord. The entire write path now operates around the RowData structure, eliminating all redundant conversion overhead. This change is Flink engine-specific and transparent to the overall Hudi writer lifecycle.

Key Changes:

Customized HoodieRecord: Introduced HoodieFlinkRecord to encapsulate Flink's RowData directly—no extra conversion to Avro record in the writing path anymore.
Self-managed binary buffer: Flink's internal BinaryRowData is in binary format. We've implemented a binary buffer to cache the bytes of the RowData. The memory is managed by the writer and can be reused after the previously cached RowData bytes are flushed to disk. With this self-managed binary buffer, GC pressure can be effectively reduced even under high-throughput workloads.
Flexible log formats: For MOR tables, records are written as data blocks in log files. We currently support two kinds of block types:
- Avro data block: Optimized for row-level writes, making it ideal for streaming ingestion or workloads with frequent updates and inserts. It's the default block type for log files.
- Parquet data block: Columnar and better suited for data with a high compression ratio, e.g., records with primitive type fields.
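The self-managed binary buffer idea can be illustrated with a small Python sketch. This is not Hudi's implementation: `BinaryRowBuffer`, `BufferedWriter`, the 4-byte length prefix, and the flush-when-full policy are all assumptions made for illustration. The point it shows is that one preallocated byte array is filled, flushed, and then reused, so no per-record objects accumulate in the buffer.

```python
import struct


class BinaryRowBuffer:
    """Fixed-capacity, reusable byte buffer for serialized rows (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self._buf = bytearray(capacity)  # allocated once, reused across flushes
        self._pos = 0

    def try_append(self, row: bytes) -> bool:
        needed = 4 + len(row)  # 4-byte length prefix + payload
        if self._pos + needed > len(self._buf):
            return False  # caller must flush first
        struct.pack_into(">I", self._buf, self._pos, len(row))
        self._buf[self._pos + 4:self._pos + needed] = row
        self._pos += needed
        return True

    def drain(self):
        """Return all cached rows and make the whole buffer reusable."""
        rows, offset = [], 0
        while offset < self._pos:
            (length,) = struct.unpack_from(">I", self._buf, offset)
            rows.append(bytes(self._buf[offset + 4:offset + 4 + length]))
            offset += 4 + length
        self._pos = 0
        return rows


class BufferedWriter:
    """Appends rows to the buffer and flushes when it is full (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self.buffer = BinaryRowBuffer(capacity)
        self.flushed = []  # stands in for data blocks written to disk

    def write(self, row: bytes) -> None:
        if not self.buffer.try_append(row):
            self.flush()
            if not self.buffer.try_append(row):
                raise ValueError("row larger than buffer capacity")

    def flush(self) -> None:
        rows = self.buffer.drain()
        if rows:
            self.flushed.append(rows)
```

In this sketch the backing `bytearray` is never reallocated: a flush only resets the write position, which is the property that keeps GC pressure low in the design described above.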

By leveraging the Flink-native data model, the redundant Avro conversions along with SerDe overhead are eliminated in the streaming ingestion pipeline, which brings significant improvements in both write latency and resource consumption.

Eliminate Bytes Copy for MOR Log File Writing

The log file of a MOR table is composed of data blocks. When writing a log file, the basic unit is the data block. The main contents of each block are shown in the figure below, where the Content is binary bytes generated by serializing the records in the buffer into the specified block format.

As mentioned in the previous section, the default data block format is Avro. In Hudi 1.1, the serialization of this block type has been carefully optimized by eliminating record-level bytes copy. Specifically, before Hudi 1.1, each record was serialized by the Avro Writer into a ByteArrayOutputStream, then the toByteArray method was invoked to obtain the data bytes for writing into the outer output stream for the data block. However, under the hood, toByteArray involves the creation of a new byte array and bytes copy. Since the bytes copy happens at the record level, it generates a large number of temporary objects in high-throughput scenarios, which further increases GC pressure.

Improvement in Hudi 1.1

In Hudi 1.1, the writeTo method of ByteArrayOutputStream is utilized to directly write the underlying data bytes to the outer output stream for the block, thereby avoiding additional record-level bytes copy and effectively reducing GC pressure.
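The difference between the two approaches can be modeled in a few lines of Python. This is only an illustrative model, not Hudi code: `RecordBuffer` is a hypothetical stand-in for Java's `ByteArrayOutputStream`, where `to_byte_array` mirrors `toByteArray()` (allocate a new array and copy) and `write_to` mirrors `writeTo(OutputStream)` (hand the valid part of the internal array to the outer stream).

```python
import io


class RecordBuffer:
    """Toy Python model of Java's ByteArrayOutputStream (hypothetical helper)."""

    def __init__(self, capacity: int = 64) -> None:
        self._buf = bytearray(capacity)  # reusable backing array
        self._count = 0                  # number of valid bytes
        self.copies = 0                  # temporary arrays created so far

    def write(self, data: bytes) -> None:
        end = self._count + len(data)
        if end > len(self._buf):
            # Grow the backing array so that at least `end` bytes fit.
            self._buf.extend(bytes(max(len(self._buf), end - len(self._buf))))
        self._buf[self._count:end] = data
        self._count = end

    def reset(self) -> None:
        self._count = 0

    def to_byte_array(self) -> bytes:
        # Mirrors toByteArray(): a new array is allocated and filled per call.
        self.copies += 1
        return bytes(self._buf[:self._count])

    def write_to(self, out) -> None:
        # Mirrors writeTo(out): the valid slice of the backing array is passed
        # straight to the outer stream, with no temporary array in between.
        with memoryview(self._buf) as whole, whole[:self._count] as valid:
            out.write(valid)


def write_block(records, out, direct: bool) -> int:
    """Serialize records one by one into `out`; return temporary arrays created."""
    buf = RecordBuffer()
    for record in records:
        buf.reset()
        buf.write(record)  # stands in for the Avro writer
        if direct:
            buf.write_to(out)
        else:
            out.write(buf.to_byte_array())
    return buf.copies


records = [b"key-%d|payload" % i for i in range(1000)]
old_out, new_out = io.BytesIO(), io.BytesIO()
old_copies = write_block(records, old_out, direct=False)
new_copies = write_block(records, new_out, direct=True)
print(old_copies, new_copies, old_out.getvalue() == new_out.getvalue())
```

Both paths produce identical block content; the copying path creates one temporary array per record, while the direct path creates none.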

Benchmark

To demonstrate performance improvements in streaming ingestion with Flink, we performed comprehensive benchmark testing of Apache Hudi 1.1 vs. Apache Hudi 1.0 vs. Apache Paimon 1.0.1. Currently there is no standard test dataset for streaming read/write. Since it's important to run transparent and reproducible benchmarking, we decided to use the existing Cluster benchmark program of Apache Paimon inspired by Nexmark, as Paimon has previously claimed multi-fold better ingestion performance than Hudi in this benchmark, which is one of their key selling points.

Cluster Environment

The benchmarks were run on an Alibaba Cloud EMR cluster, with the following settings:

EMR on ECS: version 5.18.1
- Master (x1): 8 vCPU, 32 GiB, 5 Gbps
- Worker (x4): 24 vCPU, 96 GiB, 12 Gbps
Apache Hudi version: 1.1 and 1.0.1
Apache Paimon version: 1.0.1
Apache Flink version: 1.17.2
HDFS version: 3.2.1

Streaming Ingestion Scenario

The testing scenario is the most common streaming ingestion case in production: MOR table with UPSERT operation and Bucket index. We used the Paimon Benchmark program to simulate the workloads, where the data source was generated by the Flink DataGen connector, which produces records with primary keys ranging from 0 to 100,000,000, and the total record number is 500 million. For more detailed settings, refer to this repository.

Note that we disabled compaction in the test ingestion jobs, since it can significantly impact the performance of the ingestion job for both Hudi and Paimon and interfere with a fair comparison of ingestion performance. In fact, this is also common practice in production.

Additionally, the schema of a table also has a significant impact on write performance. There is a noticeable difference in the processing overhead between numeric primitive type fields and string type fields. Therefore, besides the default table schema (Schema1) used in the Paimon Benchmark program, we also added 3 different schemas containing mostly STRING-type fields, which are much more common in production scenarios.

Schema1	Schema2	Schema3	Schema4
1 String + 10 BIGINT fields	20 String fields	50 String fields	100 String fields

Benchmark Results

For Schema1 with almost all fields being numeric type, the streaming ingestion performance of Hudi 1.1 is about 3.5 times that of Hudi 1.0. The performance gain mainly comes from the optimizations introduced in RFC-84 and RFC-87, which reduce the shuffle SerDe overhead in the ingestion pipeline and internal Avro SerDe costs in the writer.
Streaming ingestion throughput of Paimon is slightly higher than that of Hudi 1.1. Through detailed profiling and analysis, we found that this performance gap mainly stems from the fact that each HoodieRecord contains 5 extra String-type metadata fields by default, and in simple schema scenarios like Schema1, these record-level additional fields have a significant performance impact.

For schemas where most fields are of STRING type, Hudi 1.1 achieves the best streaming ingestion performance. Based on the profiling analysis, we found that when the data fields are all strings, Paimon's approach of directly writing to Parquet files incurs noticeable compression overhead, which negatively impacts throughput—even though the benchmark tests used SNAPPY, the fastest available compression codec. Hudi, on the other hand, writes incremental data in row-based Avro format to log files. While its compression ratio is lower, this approach is more favorable for ingestion throughput across a variety of workloads.

Summary

The optimizations in Hudi 1.1 around writer performance for Flink have brought significant, multi-fold improvements to streaming ingestion throughput. These enhancements are transparent and backward-compatible, allowing users to seamlessly upgrade their jobs from earlier Hudi versions to the latest version and enjoy the substantial performance gains without any additional operational overhead.

Mastering Schema Evolution with Apache Hudi

Wed, 03 Dec 2025 00:00:00 GMT

Next Generation Lakehouse: New Engine for the Intelligent Future | Apache Hudi Meetup Asia Recap

Mon, 01 Dec 2025 00:00:00 GMT

This blog was translated from the original blog in Chinese.

Recently, the Apache Hudi Meetup Asia, hosted by JD.com, was successfully held at JD.com Group headquarters. Four technical experts from Onehouse, JD.com, Kuaishou, and Huawei gathered together, not only bringing a preview of Apache Hudi release 1.1, but also sharing their unique approaches to building data lakehouses. From AI scenario support to real-time data processing and cost optimization, each topic directly addressed the pain points that data engineers care about most.

Hudi Community Leader Joined Remotely

First, Vinoth Chandar, CEO & Founder of Onehouse and Apache Hudi PMC Chair, delivered the opening remarks via video. He stated that after eight years of development, Hudi has become an important cornerstone in the data lake domain, and its vision has transformed into widely recognized achievements in the industry. The 1.0 version released last year marked the project's entry into a mature stage, bringing many database-like capabilities to the lakehouse.

Currently, the community is steadily advancing the 1.x series of versions, focusing on improving Flink performance, launching a new Trino connector, and enhancing interoperability through a pluggable table format layer. Facing the rapid development in the data lake field, Vinoth emphasized that excellent technology and robust design are the keys to long-term success. Hudi has now achieved many capabilities that commercial engines have not been able to deliver, thanks to its intelligent and creative community. Looking ahead, the community will be committed to building Hudi into a storage engine that supports all scenarios from BI to AI, exploring trending areas including unstructured data management and vector search.

Vinoth specially thanked JD.com for its significant contributions to Apache Hudi. Among the top 100 contributors, 6 were from JD.com. Finally, he also invited more developers to join this vibrant community to jointly promote innovation and development in data infrastructure.

JD Retail: Data Lake Technical Challenges and Outlook

As the co-host of the event, Zhang Ke, Head of AI Infra & Big Data Computing at JD Retail, welcomed guests and attendees who participated in this Meetup. He also pointed out two core challenges facing the data domain:

At the BI level, the long-standing problem of "unified stream and batch processing" has not yet been perfectly solved, forcing data R&D personnel to duplicate work across multiple systems. This requires fundamentally reconstructing the data architecture and finding a new paradigm for unified stream and batch processing.

At the AI level, with the arrival of the multimodal era, traditional solutions that only handle structured data can no longer meet the needs. Whether it is data supply efficiency for model training, real-time feature computation for recommendation systems, or knowledge base construction required for large models, there is an urgent need for an underlying support system that can unify storage of multimodal data while balancing cost and performance.

The industry is looking forward to building a storage foundation through open-source technologies like Apache Hudi that can uniformly carry batch processing, stream computing, data analysis, and AI workloads.

Apache Hudi 1.1 Preview and AI-Native Lakehouse Evolution

In the session "Apache Hudi 1.1 Preview and AI-Native Lakehouse Evolution," Ethan Guo (Yihua Guo), Data Architecture Engineer at Onehouse and Apache Hudi PMC member, shared Hudi's technical evolution path and future outlook. As the top contributor to the Hudi codebase, he systematically elaborated on the project positioning, version planning, and AI-native architecture.

Ethan pointed out that Apache Hudi's positioning goes far beyond being an open table format—it is an embedded, headless, distributed database system built on top of cloud storage. Hudi is moving from "a transactional database on the lakehouse" toward "an AI-native Lakehouse platform."

In the then-upcoming 1.1 release (now released), Hudi has achieved several important breakthroughs. Among them, the pluggable table format architecture effectively solves the pain point of format fragmentation in the current data lake ecosystem, enabling users to "write once, read in multiple formats." At the same time, Hudi has deeply optimized Flink integration, solving the throughput bottleneck in streaming writes through an asynchronous generation mechanism, and building a brand-new native writer that works on Flink RowData end to end instead of converting records to Avro, significantly reducing serialization overhead and GC pressure. Real-world tests showed that Hudi 1.1's throughput in streaming lake ingestion scenarios was 3.5 times that of version 1.0.

Facing new challenges brought by the AI era, Hudi is actively building a native AI data foundation. By supporting unstructured data storage, optimizing column group structures for multimodal data, providing built-in vector indexing capabilities, and building a unified storage layer that supports transactions and version control, Hudi is committed to providing highly real-time, traceable, and easily extensible data support for AI workflows. This series of evolutions will propel Apache Hudi from an excellent data lake framework to a core data infrastructure supporting the AI era.

Latest Architecture Evolution of Apache Hudi at JD.com

In the session "Latest Architecture Evolution of Apache Hudi at JD.com," Han Fei, Head of JD Real-time Data Platform, systematically introduced the latest architectural evolution and implementation results of Hudi in JD's production environment.

Addressing the performance bottleneck of native MOR tables in high-throughput scenarios, JD's Data Lake team reconstructed the data organization protocol of Hudi MOR tables based on LSM-Tree architecture. By replacing the original "Avro + Append" update mode with "Parquet + Create" mode, lock-free concurrent write capability was achieved. Combined with a series of optimization methods such as Engine-Native data format, Remote Partitioner strategy, and streaming incremental Compaction scheduling mechanism, read and write performance were significantly improved. Benchmark test results showed that the MOR-LSM solution's read and write performance was 2-10 times that of the native MOR-Avro solution, demonstrating significant technical advantages.

Facing the growing near-real-time requirements of BI scenarios, streaming dimension widening has gradually become a common challenge for multi-subject domain data processing. Traditional Flink streaming Join suffers from problems such as state bloat and high maintenance complexity. Drawing on Hudi's partial-update multi-stream splicing approach, JD's Data Lake team built an indexing mechanism that supports primary-foreign key mapping. This mechanism efficiently completes streaming dimension association and real-time updates through the coordinated operation of forward and reverse indexes. At the same time, pluggable HBase was introduced as index storage, ensuring high-performance access in point query scenarios.

In exploring AI scenarios, the team designed and implemented the Hudi NativeIO SDK. This SDK builds four core modules: data invocation layer, cross-language Transformation layer, Hudi view management layer, and high-performance query layer, creating an end-to-end process for sample training engines to complete training directly based on data lake tables.

JD has deeply integrated these capabilities with business scenarios, applying them to the near-real-time transformation of the traffic data warehouse ADM layer. After a series of optimizations, the write throughput of the traffic browsing link increased from 45 million per minute to 80 million, Compaction execution efficiency doubled, and real-time consistency maintenance of SKU dimension information was achieved, completing a comprehensive transformation from T+1 offline repair mode to real-time processing mode.

While promoting self-developed technology, JD also actively gave back to the open-source community, with a total of 109 contributed and merged PRs. In the future, the team will continue to deepen Hudi's application in the real-time data lake domain, providing stronger data support capabilities for business innovation.

How Kuaishou's Real-time Lake Ingestion Empowers BI & AI Scenario Architecture Upgrade

In the session "How Kuaishou's Real-time Lake Ingestion Empowers BI & AI Scenario Architecture Upgrade," Wang Zeyu, Data Architecture R&D Engineer at Kuaishou, introduced Kuaishou's complete evolution path and practical experience in building a real-time data lake based on Apache Hudi.

For traditional BI data warehouse scenarios, Kuaishou achieved an architecture upgrade from Mysql2Hive to Mysql2Hudi2.0. By introducing Hudi hourly partition tables, supporting multiple query modes such as full, incremental, and snapshot, and innovatively designing Full Compact and Minor Compact mechanisms to optimize data layout, Kuaishou improved the overall architecture. The introduction of bucket heterogeneity allowed full partitions and incremental partitions to support different bucket numbers, significantly reducing lake ingestion resource consumption. Compared with the original architecture, the new solution naturally supported long lifecycles and richer query behaviors. While reducing storage costs, it achieved a leap in data readiness time from day-level to minute-level.

At the AI storage architecture level, Kuaishou built a unified stream-batch data lake architecture, solving the core pain point of inconsistent offline and real-time training data. Through unified storage media, support for unified stream-batch consumption, logical wide table column splicing, and other capabilities, unified management and efficient reuse of training data were achieved. The metadata management mechanism based on Event-time timeline not only ensured data orderliness but also guaranteed real-time write performance through lock-free design.

In the future, Kuaishou will continue to improve the data lake's service capabilities in training, retrieval, analysis, and other multi-scenarios, promoting the evolution of the data lake toward a more intelligent and unified direction. Kuaishou's practice fully proves that the real-time data lake architecture based on Hudi can effectively support the modernization and upgrade needs of large-scale BI and AI scenarios.

Deep Optimization and AI Exploration of Apache Hudi on Huawei Cloud

In the session "Deep Optimization and AI Exploration of Apache Hudi on Huawei Cloud," Yang Xuan, Big Data Lakehouse Kernel R&D Engineer at Huawei, shared Huawei Cloud's technical practices and innovative breakthroughs in building a new generation Lakehouse architecture based on Apache Hudi. Facing challenges in real-time performance, intelligence, and management efficiency for enterprise-level data platforms, Huawei conducted in-depth exploration in three dimensions: platform architecture, kernel optimization, and ecosystem integration.

At the platform architecture level, Huawei developed the LDMS unified lakehouse management service platform, achieving fully managed operation and maintenance of table services. Through core capabilities such as intelligent data layout optimization and CBO statistics collection, this platform significantly reduced the operational complexity of the lakehouse platform, allowing users to focus more on business logic rather than underlying maintenance.

In terms of kernel optimization, Huawei made multiple deep modifications to Apache Hudi. Through de-Avro serialization optimization implemented via RFC-84/87, Flink write performance improved up to 10 times while significantly reducing GC pressure; the innovative LogIndex mechanism effectively solved the streaming read performance bottleneck in object storage scenarios; dynamic Schema change support made CDC lake ingestion processes more flexible; and the introduction of the column clustering mechanism provided a feasible solution for real-time processing of thousand-column sparse wide tables.

Hudi Native built a high-performance IO acceleration layer by rewriting Parquet read/write logic using Rust and adopting Arrow memory format to replace Avro. By providing a unified high-performance Java read/write interface through JNI, it achieved seamless integration with compute engines such as Spark and Flink, laying a solid foundation for future performance breakthroughs.

In ecosystem integration and AI exploration, Huawei built a management architecture supporting multimodal data. By using lake table formats to manage metadata of unstructured data, with actual files stored in object storage, it ensured ACID properties while avoiding data redundancy. At the same time, it integrated LanceDB to provide efficient vector retrieval capabilities, providing comprehensive data infrastructure support for AI application scenarios such as document retrieval and intelligent Q&A.

Conclusion

This meetup convinced us that the vast ocean of data lakehouses cannot be separated from the "collective effort" of the open-source community and enterprises. Technologies tempered on the business battlefield ultimately flow back as nutrients that nourish the entire ecosystem. This may be the purest romance of technology: making complex things simple and making the impossible possible. The road ahead is full of imagination, and together, we are shaping a more elegant and powerful future for data processing.

Apache Hudi Dynamic Bloom Filter

Fri, 28 Nov 2025 00:00:00 GMT

Apache Hudi 1.1 is Here—Building the Foundation for the Next Generation of Lakehouse

Tue, 25 Nov 2025 00:00:00 GMT

The Hudi community is excited to announce the release of Hudi 1.1, a major milestone that sets the stage for the next generation of data lakehouse capabilities. This release represents months of focused engineering on foundational improvements, engine-specific optimizations, and key architectural enhancements, laying the foundation for ambitious features coming in future releases.

Hudi continues to evolve rapidly, with contributions from a vibrant community of developers and users. The 1.1 release brings over 800 commits addressing performance bottlenecks, expanding engine support, and introducing new capabilities that make Hudi tables more reliable, faster, and easier to operate. Let’s dive into the highlights.

Pluggable Table Format—The Foundation for Multi-Format Support

Hudi 1.1 introduces a pluggable table format framework that opens up the powerful storage engine capabilities beyond Hudi’s native storage format to other table formats like Apache Iceberg and Delta Lake. This framework represents a fundamental shift in how Hudi approaches table format support, enabling native integration of multiple formats and giving you a unified system with total read-write compatibility across formats.

Vision and Design

The table format landscape in the modern lakehouse ecosystem is diverse and evolving. Like a game of rock-paper-scissors, different formats—Hudi, Iceberg, Delta Lake—each have unique strengths for specific use cases. Rather than forcing a one-size-fits-all approach, Hudi 1.1 introduces a pluggable table format framework that embraces the open lakehouse ecosystem and prevents vendor lock-in.

The framework is built on a clean abstraction layer that decouples Hudi’s core capabilities—transaction management, indexing, concurrency control, and table services—from the specific storage format used for data files. At the heart of this design is the HoodieTableFormat interface, which different format implementations can extend.

Key Architectural Components

Storage engine: Hudi’s storage engine capabilities, such as timeline management, concurrency control mechanisms, indexes, and table services, can work across multiple table formats
Pluggable adapters: Format-specific implementations handle the generation of conforming metadata upon writes

Hudi’s artifact provides support for the native Hudi format, while Apache XTable (incubating) supplies pluggable format adapters. For example, this XTable PR implements the Iceberg adapter to allow you to add dependencies to your running pipelines as needed. This architecture enables organizations to choose the right format for each use case while maintaining a unified operational experience and leveraging Hudi’s sophisticated storage engine across all of them.

In the 1.1 release, the framework comes with native Hudi format support (configured via hoodie.table.format=native by default). Existing users don't need to change anything—tables continue to work exactly as before. The real excitement lies ahead: the framework paves the way for supporting additional formats like Iceberg and Delta Lake. Imagine writing high-frequency updates to a Hudi table efficiently with Hudi's record-level indexing capability while maintaining Iceberg metadata through the Iceberg adapter, which supports a wide range of catalogs for reads. The pluggable table format framework in 1.1 makes such usage patterns possible—a game-changer for organizations that need flexibility and openness in their data architecture.

Indexing Improvements—Faster and Smarter Lookups

Hudi’s indexing subsystem is one of its most powerful features, enabling fast record lookups during writes and efficient data skipping during reads.

Partitioned Record Index

Since version 0.14.0, Hudi has supported a global record index in the indexing subsystem, a breakthrough that enables blazing-fast lookups on large datasets. While this is ideal for globally unique identifiers like order IDs or SSNs, many scenarios only require uniqueness within a partition, for example user events partitioned by date. Hudi 1.1 introduces the partitioned record index, a non-global variant of the record index that works with the combination of partition path and record key. It uses partition information to prune irrelevant partitions during lookups, which dramatically reduces the search space and keeps lookups efficient even on very large datasets.

-- Spark SQL: Create table with partitioned record index
CREATE TABLE user_activity (
  user_id STRING,
  activity_type STRING,
  timestamp BIGINT,
  event_date DATE
) USING hudi
TBLPROPERTIES (
  'primaryKey' = 'user_id',
  'preCombineField' = 'timestamp',
  -- Enable partitioned record index
  'hoodie.metadata.record.level.index.enable' = 'true',
  'hoodie.index.type' = 'RECORD_LEVEL_INDEX'
)
PARTITIONED BY (event_date);

The partitioned record index enables index lookups that scale proportionally with partition size—file group accesses correlate directly to the data partition size, optimizing performance across heterogeneous data distributions. The design also supports future clustering operations that can dynamically expand file groups within partitions as they grow.
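As a rough mental model only, the partitioned record index can be pictured as a mapping keyed by partition path first and record key second. The Python sketch below is an assumption-laden toy (the class name, methods, and file group labels are hypothetical and unrelated to Hudi's metadata table layout); it shows how a lookup consults only the entries of one partition, and how the same record key can live in different partitions.

```python
from collections import defaultdict
from typing import Dict, Optional


class PartitionedRecordIndex:
    """Toy index keyed by (partition path, record key); names are hypothetical."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, str]] = defaultdict(dict)

    def put(self, partition: str, record_key: str, file_group: str) -> None:
        self._entries[partition][record_key] = file_group

    def lookup(self, partition: str, record_key: str) -> Optional[str]:
        # Only the mapping of the requested partition is consulted, so all
        # other partitions are pruned from the search space.
        return self._entries.get(partition, {}).get(record_key)

    def search_space(self, partition: str) -> int:
        """Number of index entries a lookup in this partition can touch."""
        return len(self._entries.get(partition, {}))


index = PartitionedRecordIndex()
index.put("2025-01-01", "user-1", "fg-a")
index.put("2025-01-02", "user-1", "fg-b")  # same key, different partition
index.put("2025-01-02", "user-2", "fg-b")

print(index.lookup("2025-01-01", "user-1"))
print(index.lookup("2025-01-02", "user-1"))
print(index.search_space("2025-01-01"), index.search_space("2025-01-02"))
```

A global record index would instead need a single mapping from record key to location, which is why it requires keys to be unique across the whole table.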

Partition-level Bucket Index

The bucket index is a popular choice for high-throughput write workloads because it eliminates expensive record lookups by deterministically mapping keys to file groups. However, the existing bucket index has a key limitation: once you set the number of buckets, changing it requires rewriting the entire table.

The 1.1 release introduces partition-level bucket index, which enables different bucket counts for different partitions using regex-based rules. This design allows tables to adapt as data volumes change over time—for example, older, smaller partitions can use fewer buckets while newer, larger partitions can have more.

-- Spark SQL: Create table with partition-level bucket index
CREATE TABLE sales_transactions (
  transaction_id BIGINT,
  user_id BIGINT,
  amount DOUBLE,
  transaction_date DATE
) USING hudi
TBLPROPERTIES (
  'primaryKey' = 'transaction_id',
  -- Partition-level bucket index
  'hoodie.index.type' = 'BUCKET',
  'hoodie.bucket.index.hash.field' = 'transaction_id',
  'hoodie.bucket.index.partition.rule.type' = 'regex',
  'hoodie.bucket.index.partition.expressions' = '2023-.*,16;2024-.*,32;2025-.*,64',
  'hoodie.bucket.index.num.buckets' = '8'
)
PARTITIONED BY (transaction_date);

The partition-level bucket index is ideal for time-series data where partition sizes vary significantly over time. The adaptive bucket sizing helps you maintain optimal write performance as your data volume changes. See the docs and RFC 89 for more information.
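To make the rule string in the table properties above concrete, here is a minimal Python sketch of how such rules could be resolved. It rests on several assumptions that are not taken from Hudi: the expression is read as `pattern,count` pairs separated by `;`, the first matching rule wins, a pattern must match the whole partition path, and the CRC32-based hash is a stand-in for whatever hash function Hudi actually uses.

```python
import re
import zlib
from typing import List, Pattern, Tuple

Rule = Tuple[Pattern[str], int]


def parse_rules(expressions: str) -> List[Rule]:
    """Parse 'regex,count;regex,count;...' into compiled (pattern, count) pairs."""
    rules = []
    for item in expressions.split(";"):
        pattern, count = item.rsplit(",", 1)
        rules.append((re.compile(pattern), int(count)))
    return rules


def buckets_for_partition(partition: str, rules: List[Rule], default: int) -> int:
    """Return the bucket count of the first matching rule, else the default."""
    for pattern, count in rules:
        if pattern.fullmatch(partition):
            return count
    return default


def bucket_id(record_key: str, num_buckets: int) -> int:
    # Hypothetical stable hash; Hudi's actual hash function may differ.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


rules = parse_rules("2023-.*,16;2024-.*,32;2025-.*,64")
for partition in ("2022-12-31", "2023-06-01", "2025-11-25"):
    n = buckets_for_partition(partition, rules, default=8)
    print(partition, n, bucket_id("txn-42", n))
```

Because the key-to-bucket mapping is deterministic within a partition, a writer can route a record to its file group without any index lookup, while each partition remains free to use its own bucket count.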

Indexing Performance Optimizations

Beyond new indexes, Hudi 1.1 delivers substantial performance improvements for metadata table operations:

HFile block cache and prefetching: The new block cache stores recently accessed data blocks in memory, avoiding repeated reads from storage. For smaller HFiles, Hudi prefetches the entire file upfront rather than making multiple read requests. These optimizations are enabled by default, and benchmarks show an approximately 4x speedup for repeated lookups.

HFile Bloom filter: Adding Bloom filters to HFiles enables Hudi to quickly determine whether a key might exist in a file before fetching data blocks, avoiding unnecessary I/O and dramatically speeding up point lookups. You can enable it with hoodie.metadata.bloom.filter.enable=true.

These optimizations compound to make the metadata table significantly faster, directly improving both write and read performance across your Hudi tables. Additionally, Hudi 1.1 adds its own native HFile writer implementation, eliminating the dependency on HBase libraries. This refactoring significantly reduces the Hudi bundle size and provides the foundation for future HFile performance optimizations.
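The Bloom filter check described above can be sketched generically in Python. This is a textbook Bloom filter, not Hudi's HFile implementation: the bit-array size, the number of hash functions, the SHA-256-based double hashing, and the `lookup` helper are all assumptions chosen for illustration.

```python
import hashlib
from typing import Callable, List, Optional


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> List[int]:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def lookup(key: str, bloom: BloomFilter,
           read_blocks: Callable[[str], Optional[str]]) -> Optional[str]:
    """Consult the filter first; only read data blocks if the key may exist."""
    if not bloom.might_contain(key):
        return None  # definitely absent: skip the I/O entirely
    return read_blocks(key)


stored = {f"key-{i}": f"value-{i}" for i in range(10)}
bloom = BloomFilter()
for k in stored:
    bloom.add(k)

block_reads = []


def read_blocks(key: str) -> Optional[str]:
    block_reads.append(key)  # stands in for fetching data blocks from storage
    return stored.get(key)


present = [lookup(k, bloom, read_blocks) for k in stored]
absent = [lookup(f"missing-{i}", bloom, read_blocks) for i in range(100)]
print(len(block_reads))
```

Every stored key still triggers a block read, while lookups for absent keys are answered from the in-memory filter except for rare false positives.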

Faster Clustering with Parquet File Binary Copy

Clustering reorganizes data to improve query performance, but traditional approaches are expensive—decompressing, decoding, transforming, re-encoding, and re-compressing data even when no transformation is needed.

Hudi 1.1 implements Parquet file binary copy for clustering operations. Instead of processing records, this optimization directly copies Parquet RowGroups from source to destination files when schema-compatible, eliminating redundant transformations entirely.

On 100GB test data, using Parquet file binary copy achieved 15x faster execution (18 minutes → 1.2 minutes) and 95% reduction in compute (28.7 task-hours → 1.3 task-hours) compared to the normal rewriting of Parquet files. Real-world validation with 1.7TB datasets (300 columns) showed approximately 5x performance improvement (35 min → 7.7 min) with CPU usage dropping from 90% to 60%.

The optimization is currently supported for Copy-on-Write tables and enabled automatically when safe, with Hudi intelligently falling back to traditional clustering when schema reconciliation is required. You may refer to this PR for more detail.

Storage-Based Lock Provider—Eliminating External Dependencies for Concurrent Writers

Multi-writer concurrency is critical for production data lakehouses, where multiple jobs need to write to the same table simultaneously. Historically, enabling multi-writer support in Hudi required setting up external lock providers like AWS DynamoDB, Apache ZooKeeper, or Hive Metastore. While these work well, they add operational complexity: you need to provision, maintain, and monitor additional infrastructure.

Hudi 1.1 introduces a storage-based lock provider that eliminates this dependency entirely by managing concurrency directly using the .hoodie/ directory in your table's storage layer.

The implementation uses conditional writes on a single lock file under .hoodie/.locks/ to ensure only one writer holds the lock at a time, with heartbeat-based renewal and automatic expiration for fault tolerance. To use the storage-based lock provider, you need to add the corresponding Hudi cloud bundle (hudi-aws-bundle for S3 and hudi-gcp-bundle for GCS) and set the following configuration:

hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider

This approach eliminates the need for DynamoDB, ZooKeeper, or Hive Metastore dependencies, reducing operational costs and infrastructure complexity. The cloud-native design works directly with S3 or GCS storage features, with support for additional storage systems planned, making Hudi easier to operate at scale in cloud-native environments. Check out the docs and RFC 91 for more detail.
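The conditional-write idea described above can be illustrated with a toy, in-memory model. This is a minimal sketch and not Hudi's implementation: `FakeObjectStore`, `put_if_match`, `try_acquire`, `heartbeat`, the lock file name, and the 30-second lease are all hypothetical names and values chosen for illustration. A real object store would supply the conditional write itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LockRecord:
    owner: str
    expires_at: float  # time after which the lock is considered stale


class FakeObjectStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self) -> None:
        self._objects: Dict[str, LockRecord] = {}

    def get(self, key: str) -> Optional[LockRecord]:
        return self._objects.get(key)

    def put_if_match(self, key: str, expected: Optional[LockRecord],
                     new: LockRecord) -> bool:
        # Succeeds only if the stored value is still what the caller observed.
        if self._objects.get(key) != expected:
            return False
        self._objects[key] = new
        return True


LOCK_KEY = ".hoodie/.locks/table_lock"  # hypothetical file name
LEASE_SECONDS = 30.0                    # hypothetical lease length


def try_acquire(store: FakeObjectStore, owner: str, now: float) -> bool:
    current = store.get(LOCK_KEY)
    if current is not None and current.expires_at > now and current.owner != owner:
        return False  # another writer holds a live lock
    return store.put_if_match(LOCK_KEY, current, LockRecord(owner, now + LEASE_SECONDS))


def heartbeat(store: FakeObjectStore, owner: str, now: float) -> bool:
    current = store.get(LOCK_KEY)
    if current is None or current.owner != owner or current.expires_at <= now:
        return False  # lock was lost or has already expired
    return store.put_if_match(LOCK_KEY, current, LockRecord(owner, now + LEASE_SECONDS))


store = FakeObjectStore()
a_acquired = try_acquire(store, "writer-a", now=0.0)    # lease until t=30
b_blocked = try_acquire(store, "writer-b", now=10.0)    # writer-a still live
a_renewed = heartbeat(store, "writer-a", now=20.0)      # lease until t=50
b_waiting = try_acquire(store, "writer-b", now=40.0)    # still before t=50
b_takeover = try_acquire(store, "writer-b", now=60.0)   # lease expired
```

The expiry check is what provides fault tolerance in this sketch: if writer-a crashes and stops sending heartbeats, writer-b can take over once the lease runs out.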

Use Merge Modes and Custom Mergers—Say Goodbye to Payload Classes

A core design principle of Hudi is enabling the storage layer to understand how to merge updates to the same record key, even when changes arrive out of order—a common scenario with mobile apps, IoT devices, and distributed systems. Prior to Hudi 1.1, record merging logic was primarily implemented through payload classes, which were fragmented and lacked standardized semantics.

Hudi 1.1 deprecates payload classes and encourages users to adopt the new APIs introduced since 1.0 for record merging: merge modes and the HoodieRecordMerger interface.

Merge Modes—Declarative Record Merging

For common use cases, the COMMIT_TIME_ORDERING and EVENT_TIME_ORDERING merge modes provide a declarative way to specify merge behavior:

Merge mode	What does it do?
`COMMIT_TIME_ORDERING`	Picks the record with the highest completion time/instant as the final merge result (standard relational semantics or arrival time processing)
`EVENT_TIME_ORDERING`	Picks the record with the highest value on a user-specified ordering field as the final merge result. Enables event time processing semantics for handling late-arriving data without corrupting record state.

The default behavior is adaptive: if no ordering field (hoodie.table.ordering.fields) is configured, Hudi defaults to COMMIT_TIME_ORDERING; if one or more ordering fields are set, it uses EVENT_TIME_ORDERING. This makes Hudi work out-of-the-box for simple use cases while still supporting event-time ordering when needed.
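The two merge modes and the adaptive default can be sketched in a few lines. This is an illustration of the semantics described above, not Hudi code: `resolve_merge_mode` and `merge` are hypothetical helpers, records are plain dicts, and letting the later-committed record win on ties in the ordering field is an assumption of this sketch.

```python
from typing import Optional, Sequence


def resolve_merge_mode(ordering_fields: Sequence[str]) -> str:
    """Adaptive default: event-time ordering only if an ordering field is set."""
    return "EVENT_TIME_ORDERING" if ordering_fields else "COMMIT_TIME_ORDERING"


def merge(old: dict, new: dict, mode: str,
          ordering_field: Optional[str] = None) -> dict:
    """Merge two versions of one record key; `new` was committed later."""
    if mode == "COMMIT_TIME_ORDERING":
        return new  # the latest commit always wins
    if mode == "EVENT_TIME_ORDERING":
        # The higher ordering value wins (ties go to `new`, by assumption).
        return new if new[ordering_field] >= old[ordering_field] else old
    raise ValueError(f"unsupported merge mode: {mode}")


stored = {"id": "001", "ts": 200, "status": "delivered"}
late = {"id": "001", "ts": 150, "status": "shipped"}  # arrives out of order

by_commit = merge(stored, late, resolve_merge_mode([]))
by_event = merge(stored, late, resolve_merge_mode(["ts"]), ordering_field="ts")
```

With commit-time ordering the late record overwrites the stored one, while event-time ordering keeps the record with the higher `ts`, so the late arrival does not roll the state back.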

Custom Mergers—The Flexible Approach

For complex merging logic, such as field-level reconciliation, aggregating counters, or preserving audit fields, the HoodieRecordMerger interface provides a modern, engine-native alternative to payload classes. Set the merge mode to CUSTOM and provide your own implementation of HoodieRecordMerger. The new API gives you consistent merging across all code paths: precombine, updating writes, compaction, and snapshot reads. You are strongly encouraged to migrate to the new APIs. See the docs for more details, and the release notes and RFC-97 for migration guidance.
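To make the kind of logic a custom merger carries more concrete, here is a small sketch of field-level reconciliation. It only mirrors the merge logic in Python: HoodieRecordMerger is a Java interface whose actual method signatures are not shown here, and the function name `merge_custom` and the fields `clicks` and `created_at` are hypothetical.

```python
def merge_custom(old: dict, new: dict) -> dict:
    """Combine two versions of the same record key field by field."""
    merged = dict(new)  # default: take every field from the newer record
    # Aggregate a counter instead of overwriting it.
    merged["clicks"] = old.get("clicks", 0) + new.get("clicks", 0)
    # Preserve an audit field from the first version that was written.
    merged["created_at"] = old.get("created_at", new.get("created_at"))
    return merged


old_version = {"id": "001", "clicks": 3, "created_at": "2025-01-01", "city": "sf"}
new_version = {"id": "001", "clicks": 2, "created_at": "2025-02-01", "city": "nyc"}
merged_record = merge_custom(old_version, new_version)
```

Because the same function would run on every code path, the merged result does not depend on whether the two versions meet during a write, a compaction, or a snapshot read.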

Apache Spark Integration Improvements

Spark remains one of the most popular engines for working with Hudi tables, and the 1.1 release brings several important enhancements.

Spark 4.0 Support

Spark 4.0 brought significant performance gains for ML/AI workloads, smarter query optimization with automatic join strategy switching, dynamic partition skew mitigation, and enhanced streaming capabilities. Hudi 1.1 adds Spark 4.0 support to unlock these improvements for working with Hudi tables. To get started, use the new hudi-spark4.0-bundle_2.13:1.1.1 artifact in your dependency list.

Metadata Table Streaming Writes

Hudi 1.1 introduces streaming writes to the metadata table, unifying data and metadata writes into a single RDD execution chain. The key design generates metadata records directly during data writes in parallel across executors, eliminating redundant file lookups that previously created bottlenecks and enhancing reliability when performing stage retries in Spark.

A benchmark with update-intensive workloads showed that this 1.1 feature delivered about 18% faster write times for tables with record index, compared to Hudi 1.0. The feature is enabled by default for Spark writers.

New and Enhanced SQL Procedures

Hudi 1.1 expands the SQL procedure library with useful additions and enhanced capabilities for table management and observability, bringing operational capabilities directly into Spark SQL.

The new procedures, show_cleans, show_clean_plans, and show_cleans_metadata, provide visibility into cleaning operations:

CALL show_cleans(table => 'hudi_table', limit => 10);
CALL show_clean_plans(table => 'hudi_table', limit => 10);
CALL show_cleans_metadata(table => 'hudi_table', limit => 10);

The enhanced run_clustering procedure supports partition filtering with regex patterns:

-- Cluster all 2025 partitions matching a pattern
CALL run_clustering(
  table => 'hudi_table',
  partition_regex_pattern => '2025-.*'
);

All show procedures, where applicable, were enhanced with path and filter parameters. Use path when a table name alone cannot identify the table, and filter to apply advanced predicate expressions. For example:

-- Find large files in recent partitions
CALL show_file_status(
  path => '/data/warehouse/transactions',
  filter => "partition LIKE '2025-11%' AND file_size > 524288000"
);

The new and enhanced SQL procedures bring table management directly into Spark SQL, streamlining operations for SQL-focused workflows.

Apache Flink Integration Improvements

Flink is a popular choice for real-time data pipelines, and Hudi 1.1 brings substantial improvements to the Flink integration.

Flink 2.0 Support

Hudi 1.1 brings support for Flink 2.0, the first major Flink release in nine years. Flink 2.0 introduced disaggregated state storage (ForSt) that decouples state from compute for unlimited scalability, asynchronous state execution for improved resource utilization, adaptive broadcast join for efficient query processing, and materialized tables for simplified stream-batch unification. Use the new hudi-flink2.0-bundle:1.1.1 artifact to get started.

Engine-Native Record Support

Hudi 1.1 eliminates expensive Avro conversions by processing Flink's native RowData format directly, enabling zero-copy operations throughout the pipeline. This automatic change (no configuration required) delivers 2-3x improvement in write and read performance on average compared to Hudi 1.0.

The benchmark shown above inserted 500 million records with a schema of 1 STRING and 10 BIGINT fields: Hudi 1.1 achieved 235.3k records per second versus 67k records per second for Hudi 1.0, roughly 3.5 times the throughput.

Buffer Sort

For append-only tables, Hudi 1.1 introduces in-memory buffer sorting that pre-sorts records before flushing to Parquet. This delivers 15-30% better compression and faster queries through better min/max filtering. You can enable this feature with write.buffer.sort.enabled=true and specify sort keys via write.buffer.sort.keys (e.g., "timestamp,event_type"). You may also adjust the buffer size for sorting via write.buffer.size (default 1000 records).
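Why pre-sorting helps min/max filtering can be shown with a toy model. The sketch below assumes a buffer of 1,000 records that is flushed into chunks of 100 (standing in for Parquet row groups), each carrying min/max statistics. The chunk size, the synthetic values, and the helper names are illustrative assumptions, not Hudi internals.

```python
from typing import List, Tuple


def chunk_ranges(values: List[int], chunk_size: int) -> List[Tuple[int, int]]:
    """Return the (min, max) statistics of each consecutive chunk."""
    return [
        (min(values[i:i + chunk_size]), max(values[i:i + chunk_size]))
        for i in range(0, len(values), chunk_size)
    ]


def chunks_to_scan(ranges: List[Tuple[int, int]], lo: int, hi: int) -> int:
    """Count chunks whose [min, max] range overlaps the predicate [lo, hi]."""
    return sum(1 for mn, mx in ranges if mx >= lo and mn <= hi)


# A deterministic "shuffled" buffer: a permutation of 0..999.
buffer = [(i * 7919) % 1000 for i in range(1000)]

unsorted_hits = chunks_to_scan(chunk_ranges(buffer, 100), 400, 450)
sorted_hits = chunks_to_scan(chunk_ranges(sorted(buffer), 100), 400, 450)
```

In this example every unsorted chunk spans almost the whole value range, so all 10 must be scanned for the predicate `400 <= value <= 450`, while the sorted layout leaves only one candidate chunk.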

New Integration: Apache Polaris (Incubating)

Polaris (incubating) is an open-source catalog for lakehouse platforms that provides multi-engine interoperability and unified governance across diverse table formats and query engines. Its key feature is enabling data teams to use multiple engines—Spark, Trino, Dremio, Flink, Presto—on a single copy of data with consistent metadata, governed openly by a diverse committee including Snowflake, AWS, Google Cloud, Azure, and others to prevent vendor lock-in.

Hudi 1.1 introduces native integration with Polaris (pending a Polaris release that includes this PR), allowing users to register Hudi tables in the Polaris catalog and query them from any Polaris-compatible engine, simplifying multi-engine workflows and providing centralized role-based access control that works uniformly across S3, Azure Blob Storage, and Google Cloud Storage.

What’s Next—Join Us in Building the Future

The future of Hudi is incredibly exciting, and we're building it together with a vibrant, global community of contributors. Building on the strong foundation of 1.1, we're actively developing transformative AI/ML-focused capabilities for Hudi 1.2 and beyond—unstructured data types and column groups for efficient storage of embeddings and documents, Lance, Vortex, blob-optimized Parquet support, and vector search capabilities for lakehouse tables. This is just the beginning—we're reimagining what's possible in the lakehouse, from multi-format interoperability to next-generation AI/ML workloads, and we need your ideas, code, and creativity to make it happen.

Join us in building the future. Check out the 1.1 release notes to get started, join our Slack space, follow us on LinkedIn and X (Twitter), and subscribe (send an empty email) to the mailing list. Let's build the next generation of Hudi together.

Deep Dive Into Hudi's Indexing Subsystem (Part 2 of 2)

Wed, 12 Nov 2025 00:00:00 GMT

In part 1, we explored how Hudi's metadata table functions as a self-managed, multimodal indexing subsystem. We covered its internal architecture—a partitioned Hudi Merge-on-Read (MOR) table using HFile format for efficient key lookups—and how the files, column stats, and partition stats indexes work together to implement powerful data skipping. These indexes dramatically reduce I/O by pruning partitions and files that don't contain the data your query needs.

Now in part 2, we'll dive into more specialized indexes that handle different query patterns. We'll look at the record and secondary indexes, which provide exact file locations for equality-matching predicates rather than just skipping irrelevant files. We'll explore expression indexes that optimize queries with inline transformations like from_unixtime() or substring(). Finally, we'll cover async indexing, which lets you build resource-intensive indexes in the background without blocking your active read and write operations.

Equality Matching with Record and Secondary Indexes

Queries may contain equality-matching predicates like A = X or B IN (X, Y, Z). While data skipping indexes such as column stats and partition stats help here too, record-level indexing goes further by pinpointing the exact data files containing those values.

Hudi’s multimodal indexing subsystem implements the record index and secondary index to meet this need:

Record index: Stores mappings between record keys and the file locations that contain them.
Secondary index: Stores mappings between non-record-key column values and their corresponding record keys to support mapping to file locations.

Note that the record index is located at the record_index/ partition of the metadata table. You can create multiple secondary indexes, each for a chosen column, stored under a dedicated partition (prefixed with secondary_index_) in the metadata table.

The record index is a high-performance, general-purpose index that works on both the writer and reader sides. As described in this blog, its direct record-location lookup allows Hudi writers to efficiently route updates and deletes to their corresponding file groups in a Hudi table. The secondary index leverages the record index to look up non-record-key columns efficiently. The remainder of this section focuses on the reader side to show how these two indexes optimize equality-matching predicates.

The lookup process

Similar to the data skipping process, the query engine parses equality-matching predicates and pushes them down to the Hudi integration component. This component then performs the index lookup and returns the file locations to scan.

First, let's consider the record index. When a query with an equality filter like id = '001' runs against a Hudi table where id is the record key, the engine uses the record index to find the exact file locations for that key. The index returns these locations to the query engine, which then plans the read execution.

This direct lookup dramatically optimizes the query by ensuring only the relevant file locations are scanned. For example, on a 400 GB synthetic Hudi table with 20,000 file groups, a query filtering on a single record key saw its execution time drop from 977 seconds to just 12 seconds, a reduction of nearly 99%, when using the record index.

Now, let's consider the case when the equality filter is name = 'foo' where name is not a record key field. A secondary index built for the column name will be used for the lookup process. Entries in the secondary index contain mappings of all name values and their corresponding record keys. Because multiple distinct records can have the same name value, the lookup may return multiple record keys. The next step is to look up these returned record keys in the record index to find the enclosing file locations for scanning. As you can tell, the record index must be enabled for using the secondary index.
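The two-step lookup can be sketched with plain dictionaries. The index contents, file group names, and helper functions below are hypothetical. The sketch only shows how a secondary-index hit is resolved through the record index, not how Hudi stores or reads these indexes.

```python
from typing import Dict, Iterable, List, Set

# Record index: record key -> file group containing that record.
record_index: Dict[str, str] = {
    "001": "fg-1",
    "002": "fg-2",
    "003": "fg-2",
    "004": "fg-3",
}

# Secondary index on `name`: column value -> record keys with that value.
secondary_index_name: Dict[str, List[str]] = {
    "foo": ["001", "003"],
    "bar": ["002"],
    "baz": ["004"],
}


def files_for_record_keys(keys: Iterable[str]) -> Set[str]:
    """Resolve record keys to the file groups that must be scanned."""
    return {record_index[k] for k in keys if k in record_index}


def files_for_name(name: str) -> Set[str]:
    """Secondary index lookup first, then record index lookup."""
    return files_for_record_keys(secondary_index_name.get(name, []))


by_key = files_for_record_keys(["001"])   # predicate: id = '001'
by_name = files_for_name("foo")           # predicate: name = 'foo'
no_match = files_for_name("missing")      # nothing to scan
```

The second function cannot work without the first, which is why the record index has to be enabled before a secondary index can be used.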

A recent TPC-DS benchmark shows that, by using the secondary index, query performance improved by about 45% on average, and the amount of data scanned was reduced by 90%.

SQL examples

You can specify hoodie.metadata.record.index.enable during table creation to enable the record index for the table:

CREATE TABLE trips (
    ts BIGINT,
    id STRING,
    rider STRING,
    driver STRING,
    fare DOUBLE,
    city STRING,
    state STRING
) USING hudi
 OPTIONS(
    primaryKey = 'id',
    hoodie.metadata.record.index.enable = 'true' -- enable record index
)
PARTITIONED BY (city, state);

To create a secondary index on a specific column, you can use CREATE INDEX like this:

CREATE INDEX driver_idx ON trips (driver); -- enable secondary index on column `driver`

When you write data to the example table, index data gets written to the record index and secondary index partitions in the metadata table, which then accelerates query execution during reads. Check out the SQL DDL page for more examples.

Expression Index

Query predicates often contain expressions that perform inline transformations on columns, such as from_unixtime() or substring(). These expressions prevent a direct match with standard column indexes like column stats or partition stats. To optimize such queries, Hudi provides the expression index that operates on transformed column values. A full list of supported expressions is available in the documentation.

Hudi currently supports two types of expression indexes:

Column stats type: Stores file-level statistics (min, max, null count, value count) for the transformed values after applying the expression.
Bloom filter type: Stores a file-level bloom filter built from the transformed values after applying the expression.

Each expression index—defined by its type, the expression used, and the target column—occupies a dedicated partition within the metadata table, identified by an expr_index_ prefix in its partition path.

The column stats expression index functions similarly to a standard column stats index and is effective for data skipping. As the diagram below illustrates, a predicate containing a from_unixtime() expression is processed for lookup, and the corresponding expression index prunes the file list for the query engine.

The bloom filter expression index is designed for equality-matching predicates. Unlike the record and secondary indexes, which provide exact file locations, this index uses a bloom filter—a space-efficient data structure for quick presence checks—to prune files. The query planner can skip a file if the bloom filter indicates a target value is definitively not present.

The bloom filter expression index is most effective for high-cardinality columns, where the probability of a "not present" result is higher, allowing more files to be skipped. For low-cardinality columns, the proposed bitmap index would be more efficient and represents a valuable future extension to Hudi's indexing subsystem.
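File pruning with a bloom filter can be sketched as follows. This is a deliberately small, self-contained filter built on `hashlib`. The bit-array size, the number of hash functions, the file names, and the rider values are illustrative assumptions and do not reflect Hudi's bloom filter implementation.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """A tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, value: str) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical rider values per data file.
riders_by_file: Dict[str, List[str]] = {
    "file-a": ["Alice", "Bob"],
    "file-b": ["Carol"],
}

# Build one filter per file from the transformed values, here lower(rider).
filters: Dict[str, BloomFilter] = {}
for file_name, riders in riders_by_file.items():
    bf = BloomFilter()
    for rider in riders:
        bf.add(rider.lower())
    filters[file_name] = bf

# Predicate: lower(rider) = 'carol'. Keep only files that might match.
candidates = [f for f, bf in filters.items() if bf.might_contain("carol")]
```

A file that really contains the value is always kept. A file that does not is skipped unless the filter reports a false positive, in which case it is scanned unnecessarily but the query result stays correct.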

SQL examples

Similar to creating a secondary index, you can create an expression index (column stats type) like this:

CREATE INDEX ts_date ON trips
  USING column_stats(ts) 
  OPTIONS(expr='from_unixtime', format='yyyy-MM-dd');

This example creates a column stats expression index on the column ts with the expression from_unixtime that transforms an epoch timestamp into a date string, allowing effective data skipping based on dates.
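The data skipping that such an index enables can be sketched with per-file min/max statistics computed on the transformed values. The file names and timestamps are made up, `to_date` is a hypothetical stand-in for from_unixtime with the yyyy-MM-dd format, and the conversion assumes UTC (a SQL engine would typically apply its session time zone).

```python
from datetime import datetime, timezone
from typing import Dict, List, Tuple


def to_date(epoch_seconds: int) -> str:
    """Stand-in for from_unixtime(ts, 'yyyy-MM-dd'), assuming UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")


# Hypothetical `ts` values stored in each data file.
ts_by_file: Dict[str, List[int]] = {
    "file-1": [1704067200, 1704153600],
    "file-2": [1706745600, 1706832000],
}

# Expression index: per-file (min, max) of the transformed values.
expr_stats: Dict[str, Tuple[str, str]] = {
    name: (min(map(to_date, values)), max(map(to_date, values)))
    for name, values in ts_by_file.items()
}


def files_for_date(target: str) -> List[str]:
    """Keep files whose [min, max] date range could contain the target."""
    return [f for f, (lo, hi) in expr_stats.items() if lo <= target <= hi]


matching = files_for_date("2024-02-01")
```

Because yyyy-MM-dd strings sort in date order, comparing them lexicographically is enough to prune `file-1` for this predicate.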

You can create a bloom filter expression index similarly:

CREATE INDEX bloom_idx_rider ON trips
  USING bloom_filters(rider)
  OPTIONS(expr='lower');

This example builds a bloom filter expression index using the lowercase values of column rider, optimizing for predicates that match lowercase rider names. Check out the SQL DDL page for more examples.

Building Indexes Efficiently with the Async Indexer

Hudi provides flexible mechanisms for managing indexes. You can use SQL DDL commands—such as CREATE INDEX, DROP INDEX, and SHOW INDEXES—or programmatically set writer configurations via the Spark DataSource and Flink DataStream APIs. For example, setting hoodie.metadata.index.partition.stats.enable=false during a write operation drops the partition stats index. This action deletes the corresponding partition from the metadata table and skips indexing computations for subsequent writes until the configuration is re-enabled.

Creating a new index can be a resource-intensive operation, particularly for large tables and for indexes with high space complexity. For instance, the space complexity of the column stats index is O(columns × files), while the record index requires O(records) space. When adding such an index to a large table via DDL or a writer configuration, the time-consuming index initialization process must not block ongoing read and write operations.

To address this challenge, Hudi's index management is designed with two key goals: index creation should not block concurrent reads and writes, and once built, an index must serve consistent data up to the latest table commit. Hudi meets these requirements with its async indexing (illustrated below), which builds indexes in the background without interrupting active writers and readers.

The async indexing process consists of two phases: scheduling and execution. First, the scheduler creates an indexing plan that covers data up to the latest data table commit. Next, the executor reads the required file groups from the data table and writes the corresponding index data to the metadata table. While this process runs, concurrent writers can continue ingesting data. The async indexing executor writes index data to base files in the target index partitions in the metadata table, while ongoing writers append new index data to log files in those partitions. Hudi uses a conflict resolution mechanism to determine if an indexing operation needs to be retried due to concurrent write conflicts.

To manage this concurrency, a lock provider must be configured for both the indexer and the data writers. Upon successful completion, the operation is marked by a completed indexing commit in the Hudi table’s timeline. For future improvements, the metadata table will employ non-blocking concurrency control to gracefully absorb conflicting updates from both indexing and write operations, thus avoiding wasteful retries. You can find configuration examples in the documentation.
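The division of work between the indexing executor and concurrent writers can be modeled roughly as follows. This is a simplified sketch under stated assumptions: commits are integers, the index maps file names to the commit that wrote them, and the function names are hypothetical. Locking, conflict resolution, and retries are left out.

```python
from typing import Dict, List, Tuple


def build_index_base(files_by_commit: Dict[int, List[str]],
                     up_to_instant: int) -> Dict[str, int]:
    """Executor: index every file written at or before the planned instant."""
    return {
        f: commit
        for commit, files in files_by_commit.items()
        if commit <= up_to_instant
        for f in files
    }


def append_index_log(log: List[Tuple[str, int]], commit: int,
                     files: List[str]) -> None:
    """Concurrent writer: append index entries for a newer commit."""
    log.extend((f, commit) for f in files)


def read_index(base: Dict[str, int], log: List[Tuple[str, int]]) -> Dict[str, int]:
    """Reader: merge base entries with log entries, newer entries winning."""
    merged = dict(base)
    merged.update(log)
    return merged


data_table = {1: ["f1"], 2: ["f2"]}
planned_instant = 2                      # scheduling phase: plan covers commits <= 2

index_log: List[Tuple[str, int]] = []
append_index_log(index_log, 3, ["f3"])   # a writer commits while indexing runs

index_base = build_index_base(data_table, planned_instant)  # execution phase
index_view = read_index(index_base, index_log)
```

Once both parts are in place, the merged view covers every commit, which is the consistency goal stated earlier for a completed index.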

Summary

Throughout this two-part series, we've explored how Hudi's indexing subsystem brings database-grade performance to the data lakehouse. In part 1, we examined the metadata table's architecture and how files, column stats, and partition stats indexes work together to skip irrelevant data. In part 2, we covered specialized indexes—record, secondary, and expression indexes—that provide exact file locations for equality matching and handle transformed predicates. We also looked at async indexing, which lets you add resource-intensive indexes without blocking ongoing operations.

Here's a quick guide for choosing the right indexes for your workload:

Files: Always enabled in the metadata table—provides partition and file lists in the table to facilitate common indexing processes
Column stats and partition stats: Enabled by default. Configure hoodie.metadata.index.column.stats.column.list to include only the columns you frequently filter on. These indexes are essential for range predicates and data skipping
Record index: Enable when you have frequent point lookups on record keys or when you need secondary indexes. The record index also optimizes Hudi's write path by efficiently routing updates and deletes
Secondary index: Create secondary indexes for non-record-key columns that appear in equality predicates. Each secondary index adds maintenance overhead, so focus on high-value columns
Expression index: Use expression indexes when queries contain predicates with inline transformations. Choose column stats type for range queries on transformed values, or bloom filter type for equality matching on high-cardinality columns
Async indexing: Use async indexing when adding indexes to large tables. The async indexer builds indexes in the background, keeping your writers and readers unblocked

All indexes are maintained transactionally alongside data writes, ensuring consistency without sacrificing performance. The metadata table uses HFile format for fast point lookups and periodic compaction to keep reads efficient. This design makes Hudi's indexing subsystem both powerful and practical—ready to handle lakehouse-scale data while remaining simple to configure and operate.

As Hudi continues to evolve, the indexing subsystem is designed for extensibility. Upcoming features like the bitmap index for low-cardinality columns and vector search index for AI workloads will further expand its capabilities. By understanding these indexing patterns and following the configuration guidelines in this series, you can build lakehouse tables that deliver the query performance your analytics and data pipelines demand.

How FreeWheel Uses Apache Hudi to Power Its Data Lakehouse

Fri, 07 Nov 2025 00:00:00 GMT

This post summarizes a FreeWheel talk from the Apache Hudi Community Sync. Watch the recording on YouTube.

FreeWheel, a division of Comcast, provides advanced video advertising solutions across TV and digital platforms. As the business scaled, FreeWheel faced growing challenges maintaining consistency, freshness, and operational efficiency in its data systems. To address these challenges, the team began transitioning from a legacy Lambda architecture to a modern, Apache Hudi-powered lakehouse approach.

Their original stack, shown below, used multiple systems like Presto, ClickHouse, and Druid to serve analytical and real-time use cases. However, the architecture had some limitations:

Data freshness issues

Presto tables had a 3–4 hour delay, which was too slow for operational use cases.
Only ClickHouse and Druid offered near‑real‑time access (~5 minutes) but added complexity.

Complex ingestion

Data came from logs, CDC streams, files, and databases.
Each system had its own ingestion pipeline and refresh logic.

Query performance bottlenecks

With ~15 PB of data and 20M+ queries/day, scaling across three engines was costly and hard to operate.

Use Case 1: Lambda Architecture and Its Drawbacks

FreeWheel initially followed a traditional Lambda architecture, which separated the processing of batch and real‑time data. This approach created several problems: it required duplicate pipelines for batch and real‑time processing (leading to inefficient engineering workflows), and it struggled to scale ClickHouse for large aggregates.

By consolidating on Hudi as the table format for both streaming and historical data, FreeWheel unified the storage layer and eliminated duplicate pipelines. Hudi’s upserts by key and incremental processing make it possible to serve near–real‑time analytics. The result is simpler operations, consistent logic, and a platform that scales with data volume and query complexity.

Use Case 2: Real-Time Inventory Management

Historically, ad inventory was updated only once a day, which was a significant challenge. This daily cadence led to low forecasting accuracy and frequent delivery-performance mismatches.

By modernizing the platform with Hudi, FreeWheel updates inventory within minutes. Order changes are applied as upserts to Hudi tables and become queryable shortly thereafter, dramatically improving forecast accuracy and reducing delivery‑vs‑forecast mismatches.

Use Case 3: Scalable Audience Data Processing

FreeWheel uses Aerospike to ingest audience segments for its online services, which involves handling high‑frequency, real‑time data. However, this setup brought a few key challenges—chiefly, the need for analytical insights on top of real‑time data and the need to efficiently manage bulk loads alongside frequent updates.

To address these challenges, FreeWheel introduced Hudi into the data pipeline. Hudi maintains a snapshot table for all audience data, enabling more flexible and efficient data management. It supports bulk inserts, upserts, and change data capture (CDC), enabling smoother handling of updates and large‑scale data loads. Using CDC, large batches of audience updates are applied incrementally to the snapshot. With Hudi in place, the back‑end analytics system became much stronger, while the responsiveness of the online systems was preserved. This setup also improved the stability of the online targeting system, as heavy analytical workloads were moved off the key‑value store, reducing pressure on Aerospike and enhancing overall performance.

Hudi in practice 1: Billion‑scale updates for audience‑segment ingestion

Use case overview

This implementation showcases how a large‑scale platform ingests and updates audience‑segmentation data at the billion‑record scale using Hudi tables. The architecture efficiently handles high‑frequency updates across more than 63,000 partitions and a table over 600 TB, with performance optimizations at both the data and infrastructure levels.

Key architecture and design principles

Partitioning and orchestration

FreeWheel uses the audience‑segment ID as the partition key. Each partition can be processed independently, allowing many Spark jobs to run in parallel. Each job upserts data to the Hudi lakehouse table.

A central scheduler allocates work based on input size, priority, and write concurrency limits. This enables dynamic scaling across more than 63,000 partitions, where per-partition input sizes range from 1 million to 100 billion records.

Decoupled ingestion pipeline

Scheduler: allocates resources based on input size and supports job priority, multi-writer concurrency control, and concurrency planning.
Ingestion job: Spark jobs process data and write it to the Hudi segment table in the lakehouse.

Challenges of input data at scale

Table size: over 600 TB.
Partition count: more than 63,000 audience‑segment partitions.
Data skew: massive variation in partition sizes, ranging from 1 million to 100 billion records.

Metrics and performance insights

Cost optimization: unit cost on AWS is ~$0.10 per million records updated.
Throughput: the pipeline supports up to 12 million upserts per second.

Operational optimizations

Handle S3 throttling by increasing partition parallelism. Hash partition prefixes and coordinate with AWS to raise per‑bucket request caps and remove I/O bottlenecks.
Balance SLA and cost with adaptive resource provisioning through the scheduler; choose resources based on input size to keep jobs stable while controlling spend.
Deduplicate before commit: group by record key, order by event timestamp, and write only the latest value to reduce churn and speed up writes.
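The deduplication step in the last item can be sketched in plain Python. The field names `id` and `event_ts` and the helper `latest_per_key` are hypothetical. In the pipeline described above this would run as a Spark transformation over much larger inputs.

```python
from typing import Dict, Iterable, List


def latest_per_key(records: Iterable[dict], key: str = "id",
                   ts: str = "event_ts") -> List[dict]:
    """Keep only the record with the highest event timestamp per record key."""
    latest: Dict[str, dict] = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[ts] > current[ts]:
            latest[record[key]] = record
    return list(latest.values())


incoming = [
    {"id": "u1", "event_ts": 100, "segment": "sports"},
    {"id": "u2", "event_ts": 105, "segment": "news"},
    {"id": "u1", "event_ts": 120, "segment": "travel"},
    {"id": "u1", "event_ts": 110, "segment": "music"},
]

to_upsert = latest_per_key(incoming)
```

Here four incoming rows collapse to two upserts, so the writer touches each record key once per batch instead of once per event.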

Hudi in practice 2: Real‑time aggregated ingestion using Spark Streaming + clustering

Pipeline overview

This implementation showcases an efficient pipeline where Spark Streaming ingests aggregated data into a Hudi lakehouse using the bulk_insert operation, followed by asynchronous clustering.

Data ingestion flow

Kafka: raw events are streamed into Kafka.
Spark SQL on Streaming: consumes Kafka messages and performs near‑real‑time aggregations.
bulk_insert into Hudi lakehouse: aggregated data is appended using bulk_insert.
Clustering plan generation: clustering plans are created asynchronously.
HoodieClusteringJob: a cron job runs hourly to execute clustering and consolidate small files.

Results at a glance

Massive file reduction: clustering reduced total file count by nearly 90%, minimizing small‑file pressure and improving metadata performance.
Write throughput boost: increased by about 114% due to optimized file layout.
Faster queries: Presto query performance improved significantly after clustering.

However, Spark Streaming is a micro‑batch system, typically executing every one or two minutes. As a result, it does not trigger clustering jobs immediately but instead generates clustering plans for later execution. In production, clustering jobs are scheduled to run hourly and apply only to stable partitions, ensuring compaction and file optimization without impacting real‑time ingestion.

Conclusion

FreeWheel’s journey with Hudi transformed its data architecture—offering unified access, real‑time freshness, and scalable operations. The team credits Hudi’s community and feature set as key to its success.

“We’re lucky to choose Hudi as our Lakehouse. Thanks to the powerful Hudi community!” – Bing Jiang

Deep Dive Into Hudi’s Indexing Subsystem (Part 1 of 2)

Wed, 29 Oct 2025 00:00:00 GMT

For decades, databases have relied on indexes—specialized data structures—to dramatically improve read and write performance by quickly locating specific records. Apache Hudi extends this fundamental principle to the data lakehouse with a unique and powerful approach. Every Hudi table contains a self-managed metadata table that functions as an indexing subsystem, enabling efficient data skipping and fast record lookups across a wide range of read and write scenarios.

This two-part series dives into Hudi’s indexing subsystem. Part 1 explains the internal layout and data-skipping capabilities. Part 2 covers advanced features—record, secondary, and expression indexes—and asynchronous index maintenance. By the end, you’ll know how to leverage Hudi’s multimodal index to build more efficient lakehouse tables.

The Metadata Table

Within a Hudi table (the data table), the metadata table itself is a Hudi Merge-on-Read (MOR) table. Unlike a typical data table, it features a specialized layout. The table is physically partitioned by index type, with each partition containing the relevant index entries. For its physical storage, the metadata table uses HFile as the base file format. This choice is deliberate: HFile is exceptionally efficient at handling key lookups—the predominant query pattern for indexing. Let’s explore the partitioned layout and HFile’s internal structure.

Multimodal indexing

The metadata table is often referred to as a multimodal index because it houses a diverse range of index types, providing versatile capabilities to accelerate various query patterns. The following diagram illustrates the layout of the metadata table and its relationship with the main data table.

The metadata table is located in the .hoodie/metadata/ directory under the data table’s base path. It contains partitions for different indexes, such as the files index (under the files/ partition) for tracking the data table’s partitions and files, and the column stats index (under the column_stats/ partition) for tracking file-level statistics (e.g., min/max values) for specific columns. Each index partition stores mapping entries tailored to its specific purpose.

This partitioned design provides great flexibility, allowing you to enable only the indexes that suit your workload. It also ensures extensibility, making it straightforward to support new index types in the future. For example, the bitmap index and the vector search index are on the roadmap and will be maintained in their own dedicated partitions.

When committing to a data table, the metadata table is updated within the same transactional write. This crucial step ensures that index entries are always synchronized with data table records, upholding data integrity across the table. Merge-on-Read (MOR) is therefore a natural table type for the metadata table: it absorbs high-frequency write operations, preventing the metadata table’s update process from becoming a bottleneck for overall table writes. To ensure efficient reading, Hudi automatically performs compaction on the metadata table based on its compaction configuration. By default, an inline compaction will be executed every 10 writes to the metadata table, merging accumulated log files with base files to produce a new set of read-optimized base files in HFile format.
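The write-then-compact pattern described above can be modeled with a toy class. `MetadataPartition` and its methods are hypothetical, and the dict-based base and log stand in for HFile base files and log files. Only the threshold of 10 writes comes from the text.

```python
from typing import Dict, List, Optional


class MetadataPartition:
    """Toy Merge-on-Read partition: cheap appends, periodic compaction."""

    def __init__(self, compact_every: int = 10) -> None:
        self.base: Dict[str, str] = {}        # read-optimized base data
        self.log: List[Dict[str, str]] = []   # one delta per write
        self.compact_every = compact_every
        self.compactions = 0

    def write(self, updates: Dict[str, str]) -> None:
        self.log.append(dict(updates))        # appending is cheap
        if len(self.log) >= self.compact_every:
            self.compact()

    def compact(self) -> None:
        for delta in self.log:                # apply deltas oldest first
            self.base.update(delta)
        self.log.clear()
        self.compactions += 1

    def lookup(self, key: str) -> Optional[str]:
        for delta in reversed(self.log):      # newest delta wins
            if key in delta:
                return delta[key]
        return self.base.get(key)


partition = MetadataPartition()
for i in range(9):
    partition.write({f"key-{i}": f"file-{i}"})
log_before = len(partition.log)
value_before = partition.lookup("key-3")      # served from the log

partition.write({"key-9": "file-9"})          # 10th write triggers compaction
log_after = len(partition.log)
value_after = partition.lookup("key-3")       # now served from the base
```

Reads stay correct at every point. Compaction only changes how many deltas a lookup has to consult.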

HFile format

The HFile format stores key-value pairs in a sorted, immutable, and block-indexed way, modeled after Google’s SSTable introduced by the Bigtable paper. Here is the description of SSTable quoted from the paper:

An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk.

Because it follows the SSTable design, HFile is especially efficient at random access, the primary query pattern for indexing: given a specific piece of information, such as a record key or a partition value, return the matching results, such as the ID of the file that contains the record key or the list of files that belong to the partition.

Because the keys in an HFile are stored in lexicographic order, a batched lookup with a common key prefix is also highly efficient, requiring only a sequential read of nearby keys.
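To make the two access patterns concrete, here is a minimal sketch of an SSTable-style lookup. It is an illustration under simplifying assumptions, not the HFile implementation: the class TinySortedFile, the two-entry block size, and the sample keys are all hypothetical, and Python lists stand in for on-disk blocks.

```python
from bisect import bisect_right


class TinySortedFile:
    """Toy stand-in for an SSTable-like file: sorted key-value pairs split into blocks."""

    def __init__(self, items, block_size=2):
        ordered = sorted(items.items())
        # Each block is a small sorted run of (key, value) pairs.
        self.blocks = [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]
        # The block index keeps only the first key of every block in memory.
        self.first_keys = [block[0][0] for block in self.blocks]

    def get(self, key):
        # Binary search the in-memory index for the last block whose first key <= key.
        pos = bisect_right(self.first_keys, key) - 1
        if pos < 0:
            return None
        # Only this one block is "read from disk" and scanned.
        for k, v in self.blocks[pos]:
            if k == key:
                return v
        return None

    def prefix_scan(self, prefix):
        # Start at the block that could hold the first key with this prefix,
        # then read sequentially until keys sort past the prefix range.
        pos = max(bisect_right(self.first_keys, prefix) - 1, 0)
        matches = []
        for block in self.blocks[pos:]:
            for k, v in block:
                if k.startswith(prefix):
                    matches.append((k, v))
                elif k > prefix:
                    return matches
        return matches


# Hypothetical keys shaped like "partition/file" with made-up values.
files_index = TinySortedFile({"A/f1": 10, "A/f2": 12, "B/f1": 7, "C/f1": 3, "C/f2": 5})
point = files_index.get("B/f1")          # one index search, one block read
batch = files_index.prefix_scan("C/")    # sequential read of neighboring keys
```

The point lookup touches a single block, and the prefix scan stops as soon as a key sorts after the prefix range, which is why sorted keys make both patterns cheap.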

Default behaviors

When a Hudi table is created, the metadata table will be enabled with three partitions by default: files, column stats, and partition stats:

Files: stores the list of all partitions and the lists of all base files and log files of each partition, located at the files/ partition of the metadata table.
Column stats: stores file-level statistics like min, max, value count, and null count for specified columns, located at the column_stats/ partition of the metadata table.
Partition stats: stores partition-level statistics like min, max, value count, and null count for specified columns, located at the partition_stats/ partition of the metadata table.

By default, when no column is specified for column_stats and partition_stats, Hudi will index the first 32 columns (controlled by hoodie.metadata.index.column.stats.max.columns.to.index) available in the table schema.

Whenever a new write is performed on the data table, the metadata table will be updated accordingly. For any available index, new index entries will be upserted to its corresponding partition. For example, if the new write creates a new partition in the data table with some new base files, the files partition will be updated and contain the latest partition and file lists. Similarly, the column stats and partition stats partitions will receive new entries indicating the updated statistics for the new files and partitions.

Note that by design, you cannot disable the files partition, as it is a fundamental index that serves both read and write processes. Although it is not recommended, you can still disable the entire metadata table by setting hoodie.metadata.enable=false during a write.

We will discuss more details about how the default indexes work to improve read and write performance. We will also introduce more indexes supported by the metadata table with usage examples in the following sections.

Data Skipping with Files, Column Stats, and Partition Stats

Data skipping is a core optimization technique that avoids unnecessary data scanning. Its most basic form is physical partitioning, where data is organized into directories based on columns like order_date in a customer order table. When a query filters on a partitioned column, the engine uses partition pruning to read only the relevant directories. More advanced techniques store lightweight statistics—such as min/max values—for data within each file. The query engine consults this metadata first; if the stats indicate a file cannot contain the required data, the engine skips reading it entirely. This reduction in I/O is a key strategy for accelerating queries and lowering compute costs.

The data skipping process

Hudi’s indexing subsystem implements a multi-level skipping strategy using a combination of indexes. Query engines like Spark or Trino can leverage Hudi’s files, partition stats, and column stats indexes to improve performance dramatically. The process, illustrated in the figure below, unfolds in several stages.

First, the query engine parses the input SQL and extracts relevant filter predicates, such as price >= 300. These predicates are pushed down to Hudi’s integration component, which manages the index lookup process.

The component then consults the files index to get an initial list of partitions. It prunes this list using the partition stats index, which holds partition-level statistics like min/max values. For example, any partition with a maximum price below 300 is skipped entirely.

After this initial pruning, the component consults the files index again to retrieve the list of data files within the remaining partitions. This file list is pruned further using the column stats index, which provides the same min/max statistics at the file level.

This multi-step process ensures that the query engine reads only the minimum set of files required to satisfy the query, significantly reducing the total amount of data processed.
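The stages above can be sketched in a few lines of Python. This is a simplified model rather than Hudi's integration code: the dictionaries FILES_INDEX, PARTITION_STATS, and COLUMN_STATS, the file names, and the statistics in them are hypothetical, and only a single predicate of the form price >= threshold is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    min_price: float
    max_price: float


# Hypothetical index contents, keyed the way each index is described above.
FILES_INDEX = {"A": ["a1.parquet", "a2.parquet"], "B": ["b1.parquet"], "C": ["c1.parquet"]}
PARTITION_STATS = {"A": Stats(120.0, 389.99), "B": Stats(59.50, 99.00), "C": Stats(5.99, 19.99)}
COLUMN_STATS = {
    "a1.parquet": Stats(300.0, 389.99),
    "a2.parquet": Stats(120.0, 199.99),
    "b1.parquet": Stats(59.50, 99.00),
    "c1.parquet": Stats(5.99, 19.99),
}


def may_match_gte(stats, threshold):
    # price >= threshold can only match if the largest value reaches the threshold.
    return stats.max_price >= threshold


def plan_scan(threshold):
    # Stage 1: the files index returns every partition.
    partitions = list(FILES_INDEX)
    # Stage 2: partition stats prune whole partitions.
    partitions = [p for p in partitions if may_match_gte(PARTITION_STATS[p], threshold)]
    # Stage 3: the files index lists the files in the surviving partitions.
    files = [f for p in partitions for f in FILES_INDEX[p]]
    # Stage 4: column stats prune individual files.
    return [f for f in files if may_match_gte(COLUMN_STATS[f], threshold)]


files_to_read = plan_scan(300)
```

With these made-up statistics, partitions B and C are dropped at the partition level and a2.parquet is dropped at the file level, so only one file remains for the engine to read.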

SQL examples

The following examples demonstrate data skipping in action. We will create a Hudi table and execute Spark SQL queries against it, starting with both partition and column stats disabled to establish a baseline.

CREATE TABLE orders (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts',
    hoodie.metadata.index.column.stats.enable = 'false',
    hoodie.metadata.index.partition.stats.enable = 'false'
);

And insert some sample data:

INSERT INTO orders VALUES
('ORD001', 389.99, 'PENDING',    17495166353, DATE '2023-01-01', 'A'),
('ORD002', 199.99, 'CONFIRMED',  17495167353, DATE '2023-01-01', 'A'),
('ORD003', 59.50,  'SHIPPED',    17495168353, DATE '2023-01-11', 'B'),
('ORD004', 99.00,  'PENDING',    17495169353, DATE '2023-02-09', 'B'),
('ORD005', 19.99,  'PENDING',    17495170353, DATE '2023-06-12', 'C'),
('ORD006', 5.99,   'SHIPPED',    17495171353, DATE '2023-07-31', 'C');

Only the files index

With both column stats and partition stats disabled, only the files index is built during the insert operation. We’ll use the SQL below for our test:

SELECT order_id, price, shipping_country
FROM orders
WHERE price > 300;

This query looks for orders with price greater than 300, which only exist in partition 'A' (shipping_country = 'A'). After running the SQL, here's what we see in the Spark UI:

Spark read all 3 partitions and 3 files to find potential matches, but only 1 record from partition A actually satisfied the query condition.

Enabling column stats

Now let's enable column stats while keeping partition stats disabled. Note that we can't do it the other way around—partition stats requires column stats to be enabled first.

CREATE TABLE orders (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts',
    hoodie.metadata.index.column.stats.enable = 'true',
    hoodie.metadata.index.partition.stats.enable = 'false'
);

Running the same SQL gives us this in the Spark UI:

Now it shows all 3 partitions but only 1 file was scanned. Without partition stats, the query engine couldn't prune partitions, but column stats successfully filtered out the non-matching files. The compute cost of examining those 2 irrelevant partitions and their files could have been avoided with partition stats enabled.

Enabling column stats and partition stats

Now let's enable partition stats as well. Since both indexes are enabled by default in Hudi 1.x, we can simply omit those additional configs from the CREATE statement:

CREATE TABLE orders (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts'
);

Running the same SQL gives us this in the Spark UI:

Now we see the full pruning effect: only 1 relevant partition and 1 relevant file were scanned, thanks to both indexes working together. The partition stats blog post reports an approximately 93% reduction in query time on a 1 TB dataset.

Configure relevant columns to be indexed

By default, Hudi indexes the first 32 columns for both partition stats and column stats. This limit prevents excessive metadata overhead, because each indexed column requires computing min, max, null-count, and value-count statistics for every partition and data file. In most cases, you only need to index the small subset of columns that are frequently used in query predicates. To reduce maintenance costs, you can specify which columns to index:

CREATE TABLE orders (
    order_id STRING,
    price DECIMAL(12,2),
    order_status STRING,
    update_ts BIGINT,
    shipping_date DATE,
    shipping_country STRING
) USING HUDI
PARTITIONED BY (shipping_country)
OPTIONS (
    primaryKey = 'order_id',
    preCombineField = 'update_ts',
    'hoodie.metadata.index.column.stats.column.list' = 'price,shipping_date'
);

The config hoodie.metadata.index.column.stats.column.list applies to both partition stats and column stats. By indexing just the price and shipping_date columns, queries filtering on price comparisons or shipping date ranges will already see significant performance improvements.

Key Takeaways and What's Next

Hudi’s metadata table is itself a Hudi Merge‑on‑Read (MOR) table that acts as a multimodal indexing subsystem. It is physically partitioned by index type (for example, files/, column_stats/, partition_stats/) and stores base files in the HFile (SSTable‑like) format. This layout provides fast point lookups and efficient batched scans by key prefix—exactly the access patterns indexing needs at lakehouse scale.

Index maintenance happens transactionally alongside data writes, keeping index entries consistent with the data table. Periodic compaction merges log files into read‑optimized HFile base files to keep point lookups fast and predictable. On the read path, Hudi composes multiple indexes to minimize I/O: the files index enumerates candidates, partition stats prune irrelevant partitions, and column stats prune non‑matching files. In effect, the engine scans only the minimum set of files required to satisfy a query.

In practice, the defaults are a strong starting point. Keep the metadata table enabled and explicitly list only the columns you frequently filter on via hoodie.metadata.index.column.stats.column.list to control metadata overhead. In part 2, we’ll go deeper into accelerating equality‑matching and expression‑based predicates using the record, secondary, and expression indexes, and discuss how asynchronous index maintenance keeps writers unblocked while indexes build in the background.

Partition Stats: Enhancing Column Stats in Hudi 1.0

Wed, 22 Oct 2025 00:00:00 GMT

For those tracking Apache Hudi's performance enhancements, the introduction of the column stats index was a significant development, as detailed in this blog. It represented a major advancement for query optimization by implementing a straightforward yet highly effective concept: storing lightweight, file-level statistics (such as min/max values and null counts) for specific columns. This gave query engines reading Hudi tables a substantial performance improvement.

Instead of blindly scanning every single file for a query, the engine could first peek at the index entries—which is far more efficient than reading all the Parquet footers—to determine which files couldn't possibly contain the relevant data. This data-skipping capability meant engines could bypass large amounts of irrelevant data, slashing query latency. But that skipping process is conducted at the file level—what if we could apply a similar skipping logic at the partition level? Since a single physical partition can contain thousands of data files, applying this logic at the partition level can further amplify the performance gains by only considering files in the relevant partitions. This is precisely the capability that Hudi 1.0’s partition stats index introduces.

Multimodal Indexing

Hudi’s multimodal indexing subsystem enhances both read and write performance in data lakehouses by supporting versatile index types optimized for different workloads. This subsystem is built on a scalable, internal metadata table that ensures ACID-compliant updates and efficient lookups, which in turn reduces full data scans. It houses various indexes—such as the files, column stats, and partition stats—which work together to improve efficiency in reads, writes, and upserts, providing scalable, low-latency query performance for large datasets in the lakehouse.

The partition stats index is built on top of the column stats index by aggregating its file-level statistics up to the partition level. As we've covered, the column stats index tracks statistics (min, max, null counts) for individual files, enabling fine-grained file pruning. The partition stats index, in contrast, summarizes these same statistics across all files within a single partition.

This partition-level aggregation allows Hudi to efficiently prune entire physical partitions before even examining file-level indexes, leading to faster query planning and execution by skipping large chunks of irrelevant data early in the process. In other words, the partition stats index provides a coarse-grained, high-level pruning layer on top of the fine-grained, file-level pruning enabled by the column stats index.
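As a rough illustration of that aggregation, the sketch below rolls file-level statistics up to one entry per partition. It is a simplification and not Hudi's implementation: the function name roll_up, the tuple layout (min, max, null count), and the sample values are assumptions, and it ignores how updates and deletes are reconciled.

```python
def roll_up(file_stats):
    """Aggregate file-level (min, max, null_count) tuples into partition-level stats."""
    partition_stats = {}
    for (partition, _file_name), (lo, hi, nulls) in file_stats.items():
        if partition not in partition_stats:
            partition_stats[partition] = (lo, hi, nulls)
        else:
            p_lo, p_hi, p_nulls = partition_stats[partition]
            # The partition range must cover every file range; null counts add up.
            partition_stats[partition] = (min(p_lo, lo), max(p_hi, hi), p_nulls + nulls)
    return partition_stats


# Hypothetical file-level entries keyed by (partition, file name).
column_stats = {
    ("CA", "f1"): ("90001", "92000", 0),
    ("CA", "f2"): ("91000", "96199", 2),
    ("NY", "f3"): ("10000", "14999", 1),
}
partition_level = roll_up(column_stats)
```

Each partition ends up with the smallest minimum and the largest maximum of its files, so a predicate that misses the partition range is guaranteed to miss every file inside it.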

Because partition-level pruning happens first, it narrows down the scope of files that the column stats index needs to inspect, improving overall query performance and reducing overhead on large datasets. The diagram below illustrates the file pruning process:

During query planning, the Hudi integration for the query engine takes the predicates parsed from user queries and queries the indexes within the metadata table.

The files index is queried first to return an initial list of all partitions in the table.
The partition stats index then filters this partition list by checking whether each partition’s min/max range for the indexed columns can satisfy the predicate. For example, with a predicate of A = 100, the index skips any partition whose min(A) is greater than 100 or whose max(A) is less than 100.
The files index is queried again to retrieve a list of all files within these pruned partitions.
This file list is then passed to the column stats index, which performs the final, fine-grained pruning by applying the query predicates to the file-level statistics.
Finally, this pruned list of files is returned to the query engine to complete query planning.

This dual-layer pruning strategy is especially impactful in production systems managing large amounts of data. By complementing the fine-grained column stats index with this coarse-grained partition skipping, Hudi’s metadata table significantly reduces I/O, computation, and cost. For end-users, this translates directly into a better experience, turning queries that once took minutes into operations that complete in seconds.

Example: US Shipping Addresses

To understand the impact, let's use the example table below, which stores US shipping addresses for online orders and is partitioned by state. This table could contain billions of records, and we want to run a query filtering on the zip_code column.

By default, the files, column stats, and partition stats indexes are all enabled in Hudi 1.0. You can create the Hudi table using Spark SQL, for example, without needing additional configs to enable column stats and partition stats:

CREATE TABLE shipping_address (
    order_id STRING,
    state STRING,
    zip_code STRING,
    ...
) USING HUDI
TBLPROPERTIES (
    primaryKey ='order_id',
    hoodie.metadata.index.column.stats.column.list = 'zip_code'
)
PARTITIONED BY (state);

Note that, in practice, you would most likely want to use hoodie.metadata.index.column.stats.column.list to indicate which column(s) to index according to your business use case. Otherwise, the first 32 columns in the table schema will be indexed by default, which probably won’t be optimal. The specified columns apply to both the column stats and partition stats indexes.

Without the column and partition stats indexes, a query for a specific ZIP code (e.g., zip_code = '90001') would force the query engine to perform a full table scan. This is highly inefficient, leading to high query latency and excessive resource consumption.

With the indexes enabled, the process is drastically different.

During write operations, the Hudi writer tracks statistics for the zip_code column. The column stats index stores min/max values for each data file, and the partition stats index aggregates and stores the min/max zip_code for each state.
At query time, suppose the partition stats index shows that the "California" partition contains ZIP codes from "90000" to "96199", while the "New York" partition contains ZIP codes from "10000" to "14999". When the query for zip_code = '90001' is executed, the query planner first consults the partition stats index. It sees that "90001" falls within the "California" partition's range but outside the "New York" partition's range.
The engine can therefore skip the entire "New York" partition (and any other partition like "Texas" or "Florida" whose ZIP code range doesn't include "90001"). The query proceeds by only reading data from the "California" partition—the only one that could possibly contain the data.

This ability to prune entire partitions before reading any files is what provides such a significant performance gain.
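The ZIP code walkthrough above reduces to a range check per partition. The sketch below uses the two ranges quoted in the example; the function name partitions_for_equality is hypothetical, and the string comparison is an assumption that holds here only because every ZIP code has five digits.

```python
# Partition-level min/max for zip_code, taken from the example above.
PARTITION_RANGES = {"California": ("90000", "96199"), "New York": ("10000", "14999")}


def partitions_for_equality(value, ranges):
    # Keep a partition only if min <= value <= max. Strings compare lexicographically,
    # which matches numeric order here because all values have the same length.
    return [name for name, (lo, hi) in ranges.items() if lo <= value <= hi]


candidates = partitions_for_equality("90001", PARTITION_RANGES)
```

A surviving partition is only a candidate: its range covers the value, but the value may still be absent, which is why file-level column stats are consulted next.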

Results: the Data Skipping Effect

We conducted a focused benchmarking exercise using a synthetic dataset generated by the open-source tool lake_loader. Specifically, we created a 1 TB table for the US shipping addresses example and built both the column stats and partition stats indexes on this dataset.

The benchmarking objective was to evaluate the performance impact from the two indexes for data skipping. To do this, we executed the following query in two scenarios:

select count(1) from shipping_address where zip_code = '10001'

The first scenario ran with the column and partition stats indexes enabled (the default); the second ran with both indexes disabled for reads, which forced a full table scan.

The Spark job was configured with:

Executor cores = 4
Executor memory = 10g
Number of executors = 60

The Spark DAGs for the two scenarios show the file pruning effect:

With both column stats and partition stats indexes enabled (the left-side DAG), the number of files read was 19,304. In contrast, the disabled setup (the right-side DAG) resulted in reading 393,360 files—about 20 times more.
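The file counts above can be turned into a ratio and a skipped fraction with a quick calculation. The two numbers come from the benchmark; the variable names are illustrative only.

```python
files_with_indexes = 19_304
files_without_indexes = 393_360

# How many times more files the full scan reads.
ratio = files_without_indexes / files_with_indexes
# Share of file reads avoided by data skipping.
skipped_fraction = 1 - files_with_indexes / files_without_indexes

print(f"full scan reads {ratio:.1f}x as many files")
print(f"data skipping avoids {skipped_fraction:.1%} of file reads")
```

This works out to roughly 20.4 times as many files for the full scan, or about 95% of file reads avoided. Note that this is a reduction in files read, which is a separate measure from the query runtime.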

The runtime comparison chart below shows the query time difference (shorter is better):

Enabling data skipping with both the column stats and partition stats indexes for the Hudi table delivers approximately a 93% reduction in query runtime compared to the full scan (no data skipping).

Conclusion

The new partition stats index is a powerful addition to Hudi's multimodal indexing subsystem, directly addressing the challenge of query performance on large-scale partitioned tables. By working in concert with the existing column stats index, it provides a crucial layer of coarse-grained pruning, allowing the query engine to eliminate entire partitions from consideration before inspecting individual files. As our benchmark showed, this two-level pruning strategy (first by partition, then by file) is not just a minor tweak. It results in a dramatic reduction in I/O, cutting query runtime by approximately 93% in our benchmark and enabling near-interactive query speeds. This feature solidifies Hudi's data-skipping capabilities, making it even more efficient to run demanding analytical queries directly on the data lakehouse, saving both time and computation costs.

Modernizing Upstox's Data Platform with Apache Hudi, dbt, and EMR Serverless

Thu, 16 Oct 2025 00:00:00 GMT

Introduction

In this community sharing session, Manish Gaurav from Upstox shared insights into the complexities of managing data ingestion at scale. Drawing from the company’s experience as a leading online trading platform in India, the discussion highlighted challenges around file-level upserts, ensuring atomic operations, and handling small files effectively. Upstox shared how they built a modern data platform using Apache Hudi and dbt to address these issues. In this blog post, we’ll break down their solution and why it matters.

Upstox is a leading online trading platform that enables millions of users to invest in equities, commodities, derivatives, and currencies. With over 12 million customers generating 300,000 data requests daily, the company's data team is responsible for delivering the real-time insights that power key products, including:

Search functionality
A customer service chatbot (powered by OpenAI)
Personalized portfolio recommendations

Data Sources

Upstox ingests 250–300 GB of structured and semi-structured data per day from a variety of sources:

Order and transaction data from exchanges
Microservice telemetry from Cloudflare
Customer support data from platforms like Freshdesk and SquadStack
Behavioral analytics from Mixpanel
Data from operational databases (MongoDB, MySQL, and MS SQL) via AWS DMS

The Challenges with Initial Data Platform

As Upstox grew, so did the complexity of its data operations. Here are some of the early bottlenecks the company faced:

Data Ingestion Issues

Prior to 2023, Upstox relied on no-code ingestion platforms like Hevo. While easy to adopt, these platforms introduced several limitations, including high licensing costs and a lack of fine-grained control over ingestion logic. File-level upserts required complex joins between incoming CDC (change data capture) datasets and target tables. Additionally, a lack of atomicity often led to inconsistent data writes, and small-file issues were rampant. To combat these problems, the team had to implement time-consuming re-partitioning and coalescing, along with complex salting strategies to distribute data evenly.

Downstream Consumption Struggles

Analytics queries were primarily served through Amazon Athena, which presented several key limitations. For instance, it frequently timed out when querying large datasets and often exceeded the maximum number of partitions it could handle. Additionally, Athena's lack of support for stored procedures made it challenging to manage and reuse complex query logic. Attempts to improve performance with bucketing often created more small files, and the lack of native support for incremental queries further complicated their analytics workflow.

The Modern Lakehouse Architecture

To tackle these problems, Upstox implemented a medallion architecture, organizing data into bronze, silver, and gold layers:

Bronze (Raw Data): Data is ingested and stored in its raw format as Parquet files.
Silver (Cleaned and Filtered): Data is cleaned, filtered, and stored in Apache Hudi tables, which are updated incrementally.
Gold (Business-Ready): Data is aggregated for specific business use cases, modeled with dbt, and stored in Hudi.

The Solution: A Modern Stack with Hudi, dbt, and EMR Serverless

Upstox re-architected its platform using Apache Hudi as the core data lake technology, dbt for transformations, and EMR Serverless for scalable compute. Airflow was used to orchestrate the entire workflow. Here's how this new stack addressed their challenges:

Simplified Data Updates: Hudi provides built-in support for record-level upserts with atomic guarantees and snapshot isolation. This helped Upstox overcome the challenge of ensuring consistent updates to their fact and dimension tables.

Improved Upsert Performance: To optimize upsert performance, the team leveraged the Bloom index, especially for transaction-heavy fact tables. Indexing strategies were chosen based on data characteristics to balance latency and efficiency.

Resolved Small-File Issues: Small files, which are common in streaming workloads, were mitigated using clustering jobs supported by Hudi. This process was scheduled to run weekly and ensured efficient file sizes and reduced storage overhead without manual intervention.

Enabled Incremental Processing: Incremental joins allowed Upstox to process only new data daily. This enabled timely updates to the aggregated tables in the gold layer that power user-facing dashboards—a task that was not feasible with traditional Athena queries.

Managed Metadata Growth: The accumulation of commit and metadata files in the Hudi table’s .hoodie/ directory increased S3 listing costs and slowed down operations. Hudi's archival feature helped manage this by archiving older commits after a certain threshold, keeping metadata lean and efficient.

Streamlined Data Modeling: The team used dbt on EMR Serverless to create materialized views over the Hudi datasets. This enabled the creation of efficient transformation layers (silver and gold) using familiar SQL workflows and managed compute.

Flexible Data Materialization: dbt supported a variety of model types, including tables, views, and ephemeral models (Common Table Expressions, or CTEs). This gave teams the flexibility to optimize for performance, reuse, or simplicity, depending on the use case.

Out-of-the-Box Lineage and Documentation: dbt helps visualize how data flows from one table to another, making it easier to debug and understand dependencies. The glossary feature allows teams to document column meanings and transformations clearly.

Enforced Data Quality: With dbt, specific data quality rules can be added to individual tables or pipelines. This adds an extra layer of validation beyond the basic checks performed during data ingestion.
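The record-level upsert mentioned at the top of this list can be pictured as a keyed merge in which the newest version of each record wins. The sketch below is a conceptual model in plain Python, not Hudi's writer and not Upstox's pipeline: the function upsert and the field names order_id and update_ts are hypothetical, and atomicity and snapshot isolation are not modeled.

```python
def upsert(target, incoming, key="order_id", ordering="update_ts"):
    """Merge a change batch into target, keeping the newest version of each key."""
    merged = dict(target)  # copy so the original snapshot stays untouched
    for row in incoming:
        current = merged.get(row[key])
        # Insert new keys; replace existing ones only if the change is at least as new.
        if current is None or row[ordering] >= current[ordering]:
            merged[row[key]] = row
    return merged


snapshot = {"ORD001": {"order_id": "ORD001", "order_status": "PENDING", "update_ts": 1}}
cdc_batch = [
    {"order_id": "ORD001", "order_status": "SHIPPED", "update_ts": 3},
    {"order_id": "ORD001", "order_status": "CONFIRMED", "update_ts": 2},  # late, older change
    {"order_id": "ORD002", "order_status": "PENDING", "update_ts": 1},
]
result = upsert(snapshot, cdc_batch)
```

Because the merge is keyed on the record, no join between the change batch and the full target table is needed, and an out-of-order change with an older timestamp does not overwrite a newer one.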

CI/CD and Orchestration

Upstox uses Apache Airflow for orchestration, with dbt pipelines deployed via a Git-based CI/CD process. Merging a pull request in GitLab triggers the CI/CD pipeline, which automatically builds a new dbt image and publishes the updated data catalog. Airflow then runs the corresponding dbt jobs daily or on-demand, automating the entire transformation workflow.

The Impact

The adoption of this modern data stack had a significant impact on Upstox's data platform. The company achieved extremely high data availability and consistency for critical datasets, reducing SLA breaches for complex joins by 70%. Furthermore, pipeline costs dropped by 40%, and query performance improved drastically thanks to Hudi's clustering and optimized joins.

Conclusion

By leveraging Apache Hudi, dbt, and EMR Serverless, Upstox built a robust and cost-efficient data platform to serve its 12M+ customers, overcoming the significant challenges of data ingestion and analytics at scale. This transformation resolved critical issues like inconsistent data writes, small-file problems, and query timeouts, leading to tangible improvements in both performance and efficiency. With a 70% reduction in SLA breaches and a 40% drop in pipeline costs, the new architecture has empowered their BI and ML teams to move faster. Ultimately, this success story demonstrates how a modern data stack can not only solve immediate technical bottlenecks but also lay the groundwork for a scalable, self-service future that enables continued innovation.
