Unstructured Data

The BLOB type lets you store raw binary data — images, PDFs, audio clips, video files, model weights — directly in Hudi tables. Combined with Hudi's transactional guarantees and table services, BLOB makes the data lakehouse a single source of truth for both structured and unstructured data.

BLOB Type Overview

A BLOB column stores binary data in one of two modes:

| Mode | Storage | Table Footprint | Read Pattern |
|------|---------|-----------------|--------------|
| Inline | Raw bytes embedded in the table row | Larger (bytes stored in data files) | Direct read, no external fetch |
| Out-of-line | Pointer to external storage location | Very small (< 1% of data size) | On-demand via read_blob() |

Choose inline for small objects that are frequently read together (thumbnails, short audio clips, small documents). Choose out-of-line for large objects where you typically query metadata first and fetch raw data selectively (high-res images, video, model checkpoints).

Creating Tables with BLOB Columns

CREATE TABLE media_assets (
asset_id STRING,
file_name STRING,
mime_type STRING,
file_size BIGINT,
content BLOB
) USING hudi
TBLPROPERTIES (
primaryKey = 'asset_id',
type = 'cow'
);

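The BLOB keyword in DDL automatically configures the column with the correct internal structure: a struct with a type field (INLINE or OUT_OF_LINE), a data field for the raw bytes, and a reference field holding (external_path, offset, length). The INSERT examples in the following sections populate these fields explicitly.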

Writing Inline BLOBs

Inline BLOBs embed the raw bytes directly in the table row.

INSERT INTO media_assets VALUES (
'asset_001',
'logo.png',
'image/png',
45230,
named_struct(
'type', 'INLINE',
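'data', X'89504E470D0A1A0A', -- placeholder bytes (the PNG signature); substitute a real binary literal or, in an INSERT ... SELECT, a column reference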
'reference', CAST(NULL AS STRUCT<external_path: STRING, offset: BIGINT, length: BIGINT>)
)
);

Writing Out-of-Line BLOBs

Out-of-line BLOBs store a pointer to data in external storage. The actual bytes live elsewhere (e.g., a binary container file on S3, a separate object store path, or a shared filesystem). The table only stores the reference metadata.

INSERT INTO media_assets VALUES (
'asset_002',
'video.mp4',
'video/mp4',
1073741824, -- 1 GB
named_struct(
'type', 'OUT_OF_LINE',
'data', CAST(NULL AS BINARY),
'reference', named_struct(
'external_path', 's3://my-bucket/media/container_001.bin',
'offset', 8388608, -- byte offset in the container
'length', 1073741824 -- number of bytes
)
)
);

Container File Pattern

A common pattern for out-of-line storage is to pack multiple objects into a single binary container file:

container_001.bin
├── [offset=0, len=45230] → logo.png
├── [offset=45230, len=89012] → photo.jpg
├── [offset=134242, len=1073741824] → video.mp4
└── ...

Each BLOB row stores the (external_path, offset, length) triple. This avoids creating millions of small files on object storage and enables efficient batch access.
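The container layout can be sketched in a few lines of Python. This is only an illustration of the offset bookkeeping, not Hudi's writer: the helper names pack_container and read_range are hypothetical, the container is held in memory rather than on object storage, and the payloads are tiny stand-ins for real media files.

```python
import io

def pack_container(objects):
    """Concatenate (name, payload) pairs into one byte string.

    Returns the container bytes plus an index that maps each name to its
    (offset, length) inside the container.
    """
    buf = io.BytesIO()
    index = {}
    for name, payload in objects:
        offset = buf.tell()  # current end of the container
        buf.write(payload)
        index[name] = (offset, len(payload))
    return buf.getvalue(), index

def read_range(container, offset, length):
    """Return `length` bytes starting at `offset` (a seek plus a bounded read)."""
    stream = io.BytesIO(container)
    stream.seek(offset)
    return stream.read(length)

# Tiny stand-in payloads; real objects would be images, video, and so on.
objects = [
    ("logo.png", b"PNGDATA"),
    ("photo.jpg", b"JPEGDATA!"),
    ("video.mp4", b"MP4"),
]
container, index = pack_container(objects)
photo = read_range(container, *index["photo.jpg"])
```

Each entry in index corresponds to the (offset, length) part of the triple a BLOB row would store; the external_path part would be the location the container is uploaded to.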

Reading BLOBs

Querying Metadata (No Fetch)

Standard queries on a BLOB column return the descriptor — not the raw bytes:

SELECT asset_id, file_name, content.type, content.reference.external_path
FROM media_assets;
+-----------+-----------+-------------+----------------------------------------+
| asset_id  | file_name | type        | external_path                          |
+-----------+-----------+-------------+----------------------------------------+
| asset_001 | logo.png  | INLINE      | null                                   |
| asset_002 | video.mp4 | OUT_OF_LINE | s3://my-bucket/media/container_001.bin |
+-----------+-----------+-------------+----------------------------------------+

This is fast and lightweight — no binary data is transferred.

Resolving Raw Bytes with read_blob()

Use the read_blob() SQL function to materialize the actual bytes:

-- Returns raw binary data
SELECT asset_id, read_blob(content) AS raw_bytes
FROM media_assets
WHERE asset_id = 'asset_001';

For inline BLOBs, read_blob() simply extracts the embedded bytes.

For out-of-line BLOBs, read_blob() reads from the external path at the specified offset and length, transparently fetching the data on demand.
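The two resolution paths can be modelled with a short Python sketch. It assumes a descriptor shaped like the struct in the INSERT examples (type, data, reference) represented as a plain dict, and it uses a local temporary file as a stand-in for external storage. The function name resolve_blob is hypothetical and this is not how Hudi implements read_blob() internally.

```python
import os
import tempfile

def resolve_blob(descriptor):
    """Return the raw bytes for a BLOB descriptor dict."""
    if descriptor["type"] == "INLINE":
        # Inline: the bytes are already embedded in the row.
        return descriptor["data"]
    if descriptor["type"] == "OUT_OF_LINE":
        # Out-of-line: seek to the offset and read exactly `length` bytes.
        ref = descriptor["reference"]
        with open(ref["external_path"], "rb") as f:
            f.seek(ref["offset"])
            return f.read(ref["length"])
    raise ValueError(f"unknown BLOB type: {descriptor['type']!r}")

with tempfile.TemporaryDirectory() as tmp:
    # Local file standing in for a container on object storage.
    path = os.path.join(tmp, "container_001.bin")
    with open(path, "wb") as f:
        f.write(b"AAAABBBBBBCC")

    inline = {"type": "INLINE", "data": b"tiny", "reference": None}
    external = {
        "type": "OUT_OF_LINE",
        "data": None,
        "reference": {"external_path": path, "offset": 4, "length": 6},
    }
    inline_bytes = resolve_blob(inline)
    external_bytes = resolve_blob(external)
```

Only the out-of-line branch touches external storage, which is why filtering rows before resolving them matters most for that mode.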

Tip: Use read_blob() selectively. Filter first, then resolve. Avoid SELECT read_blob(content) FROM large_table without a WHERE clause, as this will fetch all raw data.

Use Cases

Image Datasets for Computer Vision

Store training images alongside metadata and embeddings:

CREATE TABLE training_images (
image_id STRING,
label STRING,
split STRING, -- 'train', 'val', 'test'
embedding VECTOR(1024),
raw_image BLOB
) USING hudi TBLPROPERTIES (...);

-- Get raw images for a specific label
SELECT image_id, read_blob(raw_image) AS pixels
FROM training_images
WHERE label = 'cat' AND split = 'train';

Document Store for RAG Pipelines

Store PDF documents alongside their chunk embeddings:

CREATE TABLE knowledge_base (
doc_id STRING,
chunk_id STRING,
source_url STRING,
text STRING,
embedding VECTOR(1536),
original BLOB -- original PDF bytes
) USING hudi TBLPROPERTIES (...);

-- Retrieve the full document after vector search
-- (top_matches is assumed to hold the doc_ids returned by that search)
SELECT doc_id, source_url, read_blob(original) AS pdf_bytes
FROM knowledge_base
WHERE doc_id IN (SELECT doc_id FROM top_matches);

Audio/Video Processing Pipelines

CREATE TABLE audio_clips (
clip_id STRING,
transcript STRING,
duration DOUBLE,
embedding VECTOR(512),
audio BLOB
) USING hudi TBLPROPERTIES (...);

Storage Efficiency

Out-of-line BLOBs keep the Hudi table footprint extremely small:

| Metric | Inline | Out-of-Line |
|--------|--------|-------------|
| Table size vs. raw data | ~100% | < 1% |
| Query metadata without fetch | Requires reading data files | Only reads pointer columns |
| Random access to raw data | Read full row | Seek to (offset, length) |
| Best for object size | < 1 MB | > 1 MB |
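A back-of-the-envelope calculation shows why the out-of-line footprint stays so small. The estimate below is an assumption made for illustration: it counts only the uncompressed reference fields (path string, two 8-byte integers, and the type tag) and ignores the other columns, file-format encoding, and compression, so real numbers will differ. The function names are hypothetical.

```python
def descriptor_bytes(external_path):
    """Rough uncompressed size of one out-of-line reference."""
    path_bytes = len(external_path.encode("utf-8"))
    offset_and_length = 8 + 8          # two BIGINT fields
    type_tag = len("OUT_OF_LINE")
    return path_bytes + offset_and_length + type_tag

def footprint_ratio(external_path, object_bytes):
    """Reference size as a fraction of the object it points to."""
    return descriptor_bytes(external_path) / object_bytes

path = "s3://my-bucket/media/container_001.bin"
ratio_1mb = footprint_ratio(path, 1_000_000)       # a 1 MB object
ratio_1gb = footprint_ratio(path, 1_073_741_824)   # the 1 GB video example

print(f"1 MB object: {ratio_1mb:.4%} of raw size")
print(f"1 GB object: {ratio_1gb:.8%} of raw size")
```

Under these assumptions the reference is a tiny fraction of a percent of the object size for anything in the megabyte range, and the ratio keeps shrinking as objects grow.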

Configuration Reference

| Property | Default | Description |
|----------|---------|-------------|
| hoodie.read.blob.inline.mode | CONTENT | Controls how INLINE BLOBs are read. CONTENT materializes the raw bytes in the data column. DESCRIPTOR instead surfaces (position, size) coordinates, rewritten as OUT_OF_LINE references. |
| hoodie.blob.batching.max.gap.bytes | 4096 | Maximum gap (in bytes) allowed between consecutive byte ranges for them to be merged into a single read. Larger values reduce I/O calls at the cost of reading some unused bytes. |
| hoodie.blob.batching.lookahead.size | 50 | Number of rows to buffer for batch read detection. Larger values improve batching for sorted data but increase memory usage. |

Note: DESCRIPTOR mode is only supported on Lance-backed tables. CONTENT mode is always used for internal operations (compaction, merge, log replay) regardless of this setting.
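The effect of the max-gap setting can be illustrated with a generic range-coalescing sketch. This is not Hudi's batching code: the function name coalesce_ranges is hypothetical, and merging when the gap is less than or equal to the threshold is an assumption about the exact boundary condition. The input list plays the role of the buffered lookahead window of rows.

```python
def coalesce_ranges(ranges, max_gap_bytes=4096):
    """Merge (offset, length) ranges whose gap is <= max_gap_bytes.

    Returns the merged reads as (offset, length) tuples sorted by offset.
    """
    merged = []
    for offset, length in sorted(ranges):
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= max_gap_bytes:
                # Close enough: extend the previous read to cover this range,
                # accepting that the gap bytes are read and discarded.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged

requested = [(0, 1000), (2000, 500), (100_000, 300)]
reads = coalesce_ranges(requested, max_gap_bytes=4096)
reads_no_merge = coalesce_ranges(requested, max_gap_bytes=0)
```

With the 4096-byte threshold the first two ranges collapse into one read that also covers the 1000 unused bytes between them, while the distant third range stays separate. With a threshold of 0 all three remain individual reads.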

Best Practices

  1. Choose the right mode — Use inline for small, frequently-accessed objects. Use out-of-line for anything over 1 MB.

  2. Filter before resolving — Always apply WHERE predicates before calling read_blob() to avoid unnecessary data transfer.

  3. Batch container files — When using out-of-line mode, pack multiple objects into container files rather than storing one file per object.

  4. Combine with VECTOR — Pair BLOB columns with VECTOR columns for powerful "search then retrieve" workflows: vector search narrows candidates, then read_blob() fetches just the winners.

  5. Use incremental queries — Process only new BLOBs by leveraging Hudi's incremental query support:

    SELECT * FROM hudi_table_changes('media_assets', 'latest_state', '20260401000000');
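The "search then retrieve" workflow from practice 4 can be sketched outside SQL as follows. Everything here is a stand-in chosen for illustration: the in-memory rows, the brute-force cosine ranking, and the fake_fetch callback (a hypothetical placeholder for whatever resolves the BLOB, such as a read_blob() query). The point is only that the expensive fetch runs for the top-k rows and nothing else.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_then_retrieve(query, rows, fetch, k=1):
    """Rank rows by embedding similarity, then fetch bytes for the top k only."""
    ranked = sorted(rows, key=lambda r: cosine(query, r["embedding"]), reverse=True)
    return {r["id"]: fetch(r["id"]) for r in ranked[:k]}

fetched = []  # records which ids were actually resolved

def fake_fetch(row_id):
    # Placeholder for the real byte fetch.
    fetched.append(row_id)
    return b"bytes-for-" + row_id.encode()

rows = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.9, 0.1]},
]
result = search_then_retrieve([1.0, 0.0], rows, fake_fetch, k=2)
```

Row "b" is never fetched because it falls outside the top two matches, which mirrors filtering before calling read_blob().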