Introducing Secondary Index in Apache Hudi Lakehouse Platform
Apache Hudi 1.0 introduces Secondary Indexes, enabling faster queries on non-primary key fields. This improves data retrieval in Lakehouse architectures by reducing data scans. Hudi also offers asynchronous indexing for scalability and efficient index maintenance without disrupting data ingestion. By the end of this blog, you'll understand how these features enhance Hudi's capabilities as a high-performance lakehouse platform.
An index is a fundamental data structure that enables efficient data retrieval by eliminating the need to scan the entire dataset for every query. In the context of a Lakehouse, where records are written as immutable data files (such as Parquet) at scale, indexing becomes crucial for reducing lookup times. Without an index, the compute engine spends much of its time locating a particular record among thousands of files in data lake storage, which is computationally expensive at scale. Indexing is not only important for reads in a lakehouse architecture but also for writes, such as upserts and deletes, because you need to know where a record resides in order to update it.
One of the standout design choices in Apache Hudi that separates it from other lakehouse formats is its indexing capability, which has been central to its architecture from the beginning. Hudi is heavily optimized to handle mutable change streams with varying write patterns, and indexing plays a pivotal role in making upserts and deletes efficient.
Hudi's indexing mechanism is designed to efficiently manage record lookups and updates by maintaining a structured mapping between records and file groups. Here's how it works:
- The first time a record is ingested into Hudi, it is assigned to a File Group - a logical grouping of files. This assignment typically remains unchanged throughout the record's lifecycle. However, in cases such as clustering or cross-partition updates, the record may be remapped to a different file group. Even in such scenarios, Hudi ensures that a given record key is associated with exactly one file group at any completed instant on the timeline.
- Hudi maintains a mapping between the incoming record’s key (unique identifier) and the File Group where it resides.
- The index is responsible for quickly locating records based on this File Group mapping, eliminating the need for full dataset scans.
This strategy allows Hudi to determine whether a record exists and pinpoint its exact location, enabling faster upserts and deletes.
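To make this concrete, here is a minimal Spark SQL sketch of the kind of mutation this mapping accelerates. The table name hudi_trips, the id key, and the values are purely illustrative assumptions, not the example developed later in this post.
-- Hypothetical Hudi table keyed on 'id': Hudi resolves the file group that
-- holds id = 'trip_1001' from its key-to-file-group mapping instead of
-- scanning every data file.
UPDATE hudi_trips SET fare = 27.70 WHERE id = 'trip_1001';

-- Deletes follow the same path: look up the key, find its file group,
-- and rewrite only that file group.
DELETE FROM hudi_trips WHERE id = 'trip_1002';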
Apache Hudi's Multi-Modal Indexing System
While Hudi’s indexes have set a benchmark for fast writes, bringing those advantages to queries was equally important. This led to the design of a generalized indexing subsystem that enhances performance in the lakehouse. Hudi’s multi-modal indexing redefines indexing in data lakes by employing multiple index types, each optimized for different workloads and query patterns. It is built on scalable metadata that supports multiple index types without extra overhead, ACID-compliant updates to keep indexes in sync with the data table, and optimized lookups that minimize full scans for low-latency queries on large datasets.
At the core of Hudi’s indexing design is its metadata table, a specialized Merge-on-Read table that houses multiple index types as separate partitions. These indexes serve various purposes, improving the efficiency of reads, writes, and upserts.

Some key indexes within Hudi’s metadata table include:
- File Index - Stores a compact listing of files, reducing the overhead of expensive file system operations.
- Column Stats Index - Tracks min/max statistics for each column, enabling more effective data pruning.
- Bloom Filter Index - Stores precomputed bloom filters for all data files, optimizing record lookups.
- Partition Stats Index - Stores aggregated statistics at the partition level, enabling efficient partition pruning by quickly skipping entire folders.
- Record-Level Index - Maintains direct mappings to individual records, facilitating faster upserts and deletes.
- Secondary Index - Allows users to create indexes on columns that are not part of the record key in Hudi tables.
By structuring these indexes as individual partitions within the metadata table, Hudi ensures efficient retrieval, quick lookups, and scalability, even as the data volume grows. In this blog, we will focus on secondary indexes and understand how they can help accelerate query performance in a lakehouse.
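As a rough sketch of how some of these index partitions are switched on, the table options below follow the hoodie.metadata.* configuration convention. Only hoodie.metadata.record.index.enable appears later in this post; the column-stats and bloom-filter keys are shown as commonly documented names and should be verified against the configuration reference for your Hudi version. The table itself is hypothetical.
DROP TABLE IF EXISTS metadata_index_demo;
CREATE TABLE metadata_index_demo (
  id STRING,
  city STRING,
  fare DOUBLE
) USING hudi
OPTIONS (
  primaryKey = 'id',
  hoodie.metadata.enable = 'true',                     -- metadata table (enabled by default)
  hoodie.metadata.record.index.enable = 'true',        -- record-level index
  hoodie.metadata.index.column.stats.enable = 'true',  -- column stats index (verify key for your version)
  hoodie.metadata.index.bloom.filter.enable = 'true'   -- bloom filter index (verify key for your version)
)
LOCATION 'file:///tmp/metadata_index_demo';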
Introducing Secondary Index
A secondary index is an indexing mechanism commonly used in database systems to provide efficient access to records based on non-primary key attributes. Unlike primary indexes, which enforce uniqueness and define the main data layout, secondary indexes serve as auxiliary data structures that accelerate lookups on fields that are frequently queried but are not the primary key.
For example, in an OLTP (Online Transaction Processing) database, a primary index might be defined on a unique order_id, whereas a secondary index could be created on customer_id to quickly fetch all orders placed by a specific customer. Secondary indexes enhance query performance by reducing the need for full table scans, especially in analytical workloads that involve complex filtering or joins.
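In plain SQL terms, that pattern looks like the sketch below; the orders table and its columns are hypothetical and stand in for any OLTP schema.
-- Primary index is defined by the unique key
CREATE TABLE orders (
  order_id    BIGINT PRIMARY KEY,
  customer_id BIGINT,
  amount      DECIMAL(10, 2)
);

-- Secondary index on a frequently filtered, non-primary-key column
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Answered via the secondary index instead of a full table scan
SELECT order_id, amount FROM orders WHERE customer_id = 42;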
Apache Hudi 1.0 introduces secondary indexes, bringing database-style indexing capabilities to the Lakehouse. Secondary indexes allow queries to scan significantly fewer files, reducing query latency and compute costs. This is especially beneficial for cloud-based query engines (such as AWS Athena), where pricing is based on the amount of data scanned. A secondary index in Hudi lets users index any column beyond the record key (primary key), making queries on non-primary key fields much faster. This extends Hudi’s existing record-level index, which optimizes writes and reads based on the record key.

Here is how secondary indexes work in Hudi:
- Indexes Non-Primary Key Columns: Unlike the record-level index, which tracks record keys, secondary indexes help accelerate queries on fields outside the primary key.
- Stores Mappings Between Secondary and Primary Keys: Hudi maintains a mapping between secondary keys (e.g., city, driver) and record keys, enabling fast lookups for non-primary key queries.
- Minimizes Data Scans via Index-Aware Query Execution: During query execution, the secondary index enables data skipping, allowing Hudi to prune unnecessary files before scanning.
- SQL-Based Index Management: Users can create, drop, and manage indexes using SQL, making secondary indexes easily accessible.
Hudi supports hash-based secondary indexes, which are horizontally scalable by distributing keys across shards for fast writes and lookups.
If you are interested in the implementation details of secondary indexes, you can read more here.
Creating a Secondary Index in Hudi
In Hudi 1.0, secondary indexes are currently supported with Apache Spark; support for Flink, Presto, and Trino is planned for Hudi 1.1.
Let’s see an example of creating a Hudi table with a secondary index.
First, let’s create a table with the record index enabled. The record index maintains mappings of record keys (id) to file groups, enabling fast updates, deletes, and lookups.
DROP TABLE IF EXISTS hudi_table;
CREATE TABLE hudi_table (
  ts BIGINT,
  id STRING,
  rider STRING,
  driver STRING,
  fare DOUBLE,
  city STRING,
  state STRING
) USING hudi
OPTIONS (
  primaryKey = 'id',
  hoodie.metadata.record.index.enable = 'true',            -- Enable record index
  hoodie.write.record.merge.mode = 'COMMIT_TIME_ORDERING'  -- Only required for the 1.0.0 release
)
PARTITIONED BY (city, state)
LOCATION 'file:///tmp/hudi_test_table';
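To have something to query, we can load a few rows; the trip values below are made up purely for illustration.
INSERT INTO hudi_table VALUES
  (1695159649, 'trip_1', 'rider-A', 'driver-K', 19.10, 'SFO', 'CA'),
  (1695159650, 'trip_2', 'rider-B', 'driver-L', 27.70, 'SEA', 'WA'),
  (1695159651, 'trip_3', 'rider-C', 'driver-M', 33.90, 'SFO', 'CA');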
Now we can create a secondary index on the city field to optimize queries that filter on this column.
CREATE INDEX idx_city ON hudi_table(city);
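To confirm the index exists, Hudi's Spark SQL layer also exposes index management statements; SHOW INDEXES and DROP INDEX are part of Hudi's SQL DDL, though it is worth double-checking the exact syntax against the documentation for your release.
-- Lists the indexes (metadata table partitions) available on the table
SHOW INDEXES FROM hudi_table;

-- If the index is no longer needed, it can be removed with:
-- DROP INDEX idx_city ON hudi_table;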
Now, when executing a query such as:
SELECT rider FROM hudi_table WHERE city = 'SFO';
✅ Hudi first checks the secondary index to determine which records match the filter condition.