Blogs
Welcome to Apache Hudi blogs! Here you'll find the latest articles, tutorials, and updates from the Hudi community.
All Blog Posts

How FreeWheel Uses Apache Hudi to Power Its Data Lakehouse

Deep Dive Into Hudi’s Indexing Subsystem (Part 1 of 2)
For decades, databases have relied on indexes—specialized data structures—to dramatically improve read and write performance by quickly locating specific records. Apache Hudi extends this fundamental principle to the data lakehouse with a unique and powerful approach. Every Hudi table contains a self-managed metadata table that functions as an indexing subsystem, enabling efficient data skipping and fast record lookups across a wide range of read and write scenarios.
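As a rough illustration of what that looks like from the write path, here is a minimal PySpark sketch that enables the metadata table and a record-level index. It assumes Hudi 0.14+ with the Spark bundle on the classpath; the table name, fields, and path are illustrative, not taken from the post.

```python
from pyspark.sql import SparkSession

# Illustrative session; requires the Hudi Spark bundle on the classpath.
spark = SparkSession.builder.appName("hudi-index-sketch").getOrCreate()
df = spark.createDataFrame([("k1", "2024-01-01", 10)], ["uuid", "ts", "fare"])

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    # The self-managed metadata table that backs the indexing subsystem.
    "hoodie.metadata.enable": "true",
    # Maintain a record-level index in the metadata table for fast key lookups.
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/hudi/trips"))  # path is illustrative
```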

Partition Stats: Enhancing Column Stats in Hudi 1.0
For those tracking Apache Hudi's performance enhancements, the introduction of the column stats index was a significant development, as detailed in this blog. It represented a major advancement for query optimization by implementing a straightforward yet highly effective concept: storing lightweight, file-level statistics (such as min/max values and null counts) for specific columns. This gave Hudi's query engine a substantial performance boost.
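To make the idea concrete, here is a hedged sketch of the write-side configuration (config keys as documented for Hudi 0.14/1.0; the table and column names are made up for illustration):

```python
# Enable file-level column stats, and (in Hudi 1.0) partition-level rollups,
# in the metadata table so the query side can prune data before reading it.
stats_options = {
    "hoodie.table.name": "events",
    "hoodie.metadata.enable": "true",
    # Store min/max values and null counts per file for the listed columns.
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.column.stats.column.list": "event_ts,user_id",
    # Hudi 1.0: aggregate the same stats at partition granularity, so whole
    # partitions can be skipped before file-level stats are even consulted.
    "hoodie.metadata.index.partition.stats.enable": "true",
}
```

With these in place, a query filtering on event_ts can discard non-matching partitions first and then non-matching files, instead of scanning everything.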

Modernizing Upstox's Data Platform with Apache Hudi, dbt, and EMR Serverless

Real-Time Cloud Security Graphs with Apache Hudi and PuppyGraph
CrowdStrike’s 2025 Global Threat Report puts the average eCrime breakout time at 48 minutes, with the fastest at 51 seconds. This means that by the time security teams are even alerted to a potential breach, attackers have long since infiltrated the system. And that’s assuming they get alerted at all. Cloud environments generate massive amounts of access logs, configuration changes, alerts, and telemetry. Reviewing these events in isolation rarely surfaces patterns like lateral movement or privilege escalation.

Automatic Record Key Generation in Apache Hudi
In database systems, the primary key is a foundational design principle for managing data at the record level. Its function is to give each record a unique and stable logical identifier, decoupling the record's identity from its physical location on storage. While direct physical address pointers (e.g., a record's position inside a file used as its key) can be convenient, those addresses change whenever records are moved around within the table by operations such as clustering or Z-ordering.
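For context, here is a hedged sketch of what automatic key generation looks like on the write path (available in Hudi 0.14+; the table name, fields, and path are illustrative):

```python
from pyspark.sql import SparkSession

# Illustrative session; requires the Hudi Spark bundle on the classpath.
spark = SparkSession.builder.appName("hudi-autokey-sketch").getOrCreate()
df = spark.createDataFrame([("add_to_cart", "2024-01-01")], ["event", "ts"])

(df.write.format("hudi")
    .option("hoodie.table.name", "clicks")
    # No hoodie.datasource.write.recordkey.field is set: Hudi generates a
    # unique, stable synthetic key per record, decoupling identity from
    # physical file position.
    .option("hoodie.datasource.write.operation", "insert")
    .mode("append")
    .save("/tmp/hudi/clicks"))  # path is illustrative
```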

Building a RAG-based AI Recommender (2/2)

A Deep Dive on Merge-on-Read (MoR) in Lakehouse Table Formats

How PayU built a secure enterprise AI assistant using Amazon Bedrock

Modernizing Data Infrastructure at Peloton Using Apache Hudi
Peloton re-architected its data platform using Apache Hudi to overcome snapshot delays, rigid service coupling, and high operational costs. By adopting CDC-based ingestion from PostgreSQL and DynamoDB, moving from CoW to MoR tables, and leveraging asynchronous services with fine-grained schema control, Peloton achieved 10-minute ingestion cycles, reduced compute/storage overhead, and enabled time travel and GDPR compliance.
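As a hedged illustration of one piece of that migration, choosing Merge-on-Read is a single write-side config set at table creation time (key per the Hudi docs; the table name is made up):

```python
# MERGE_ON_READ appends updates to log files and compacts them later,
# trading some read-side merge cost for much cheaper, faster writes than
# COPY_ON_WRITE's full base-file rewrites.
mor_options = {
    "hoodie.table.name": "workouts",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
```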