Vector Search

Hudi's VECTOR type and hudi_vector_search table-valued function (TVF) bring native similarity search to the data lakehouse. Store embeddings alongside your structured data and query them with familiar Spark SQL — no external vector database required.

VECTOR Type

The VECTOR(dim[, elementType]) type declares a column that stores fixed-dimensional embedding vectors. Dimension metadata enables the query engine to validate inputs and optimize search.

Element Types

Element Type	Description	Storage	Use Case
FLOAT (default)	32-bit float	`ArrayType(FloatType)`	Standard embeddings (OpenAI, Cohere, etc.)
DOUBLE	64-bit double	`ArrayType(DoubleType)`	High-precision scientific embeddings
INT8 / BYTE	8-bit signed integer	`ArrayType(ByteType)`	Quantized embeddings for storage efficiency

-- Default (FLOAT)
embedding VECTOR(768)

-- Explicit element types
embedding VECTOR(768, FLOAT)
embedding VECTOR(768, DOUBLE)
embedding VECTOR(256, INT8)
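
To make the storage trade-off between element types concrete, the sketch below parses a declaration of the form VECTOR(dim[, elementType]) and computes the raw payload size of one vector. The parser, the helper names, and the rule that the dimension must be positive are illustrative assumptions, not Hudi's own implementation, and the byte counts ignore Parquet encoding and compression.

```python
import re

# Bytes per element for each element type listed in the table above.
ELEMENT_BYTES = {"FLOAT": 4, "DOUBLE": 8, "INT8": 1, "BYTE": 1}

# Hypothetical pattern for 'VECTOR(dim)' or 'VECTOR(dim, elementType)'.
_VECTOR_RE = re.compile(
    r"^\s*VECTOR\(\s*(\d+)\s*(?:,\s*(\w+)\s*)?\)\s*$", re.IGNORECASE
)


def parse_vector_type(decl):
    """Parse 'VECTOR(dim[, elementType])' into (dim, element_type)."""
    match = _VECTOR_RE.match(decl)
    if match is None:
        raise ValueError(f"not a VECTOR declaration: {decl!r}")
    dim = int(match.group(1))
    element_type = (match.group(2) or "FLOAT").upper()  # FLOAT is the default
    if dim <= 0:
        raise ValueError("dimension must be positive")  # assumption
    if element_type not in ELEMENT_BYTES:
        raise ValueError(f"unsupported element type: {element_type}")
    return dim, element_type


def raw_bytes_per_vector(decl):
    """Uncompressed size of one vector, ignoring file-format overhead."""
    dim, element_type = parse_vector_type(decl)
    return dim * ELEMENT_BYTES[element_type]


for decl in ("VECTOR(768)", "VECTOR(768, DOUBLE)", "VECTOR(256, INT8)"):
    print(decl, raw_bytes_per_vector(decl), "bytes")
```

Under these assumptions a VECTOR(768) value occupies 3,072 bytes before compression, a DOUBLE vector of the same dimension twice that, and a VECTOR(256, INT8) value 256 bytes.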

Declaring VECTOR Columns

Spark SQL:

CREATE TABLE products (
    product_id   STRING,
    name         STRING,
    description  STRING,
    embedding    VECTOR(768)
) USING hudi
TBLPROPERTIES (
    primaryKey = 'product_id',
    type = 'cow',
    hoodie.record.merger.impls = 'org.apache.hudi.DefaultSparkRecordMerger',
    hoodie.datasource.write.base.file.format = 'parquet'
);

When using SQL DDL, Hudi's parser automatically stamps the VECTOR(dim) metadata on the column.

DataFrame API:

import pyarrow as pa

schema = pa.schema([
    pa.field("product_id", pa.string()),
    pa.field("name",       pa.string()),
    pa.field("embedding",  pa.list_(pa.float32()),
             metadata={b"hudi_type": b"VECTOR(768)"}),
])

When using the DataFrame API, you must manually stamp hudi_type metadata on the column via PyArrow. This metadata is what distinguishes a VECTOR column from a regular array column.

Writing Vectors

Vectors are written as arrays whose elements match the column's element type (floats by default). Both the DataFrame API and SQL accept standard array syntax:

INSERT INTO products VALUES (
    'prod_001', 'Running Shoes', 'Lightweight trail runner',
    ARRAY(0.123, -0.456, 0.789, ...)   -- 768 floats
);
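
When the INSERT statement is generated from application code, it can help to check the vector before rendering it. The following is a minimal sketch: to_sql_array is a hypothetical helper, the finiteness check is a precaution of this example rather than a documented Hudi rule, and whether the resulting literal needs an explicit cast to the column's element type is not covered here.

```python
import math


def to_sql_array(values, expected_dim):
    """Render a vector as a Spark SQL ARRAY(...) literal after basic checks."""
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} elements, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinity")
    # repr() of a Python float round-trips the value exactly.
    return "ARRAY(" + ", ".join(repr(float(v)) for v in values) + ")"


print(to_sql_array([0.123, -0.456, 0.789], expected_dim=3))
# prints: ARRAY(0.123, -0.456, 0.789)
```

Checking the length on the client side surfaces a dimension mismatch before the statement reaches the engine.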

hudi_vector_search TVF

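The hudi_vector_search table-valued function performs nearest neighbor search over a VECTOR column. The only algorithm currently supported is brute_force, which compares the query against every stored vector, so results are exact rather than approximate.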

Syntax

SELECT *
FROM hudi_vector_search(
    table_name,       -- STRING: name of the Hudi table
    vector_column,    -- STRING: name of the VECTOR column
    query_vector,     -- ARRAY<FLOAT>: the query embedding
    top_k               -- INT: number of nearest neighbors to return
    [, distance_metric] -- STRING: 'cosine' (default), 'l2', or 'dot_product'
    [, algorithm]       -- STRING: 'brute_force' (default)
)

Parameters

Parameter	Type	Default	Description
`table_name`	STRING	(required)	The Hudi table to search. Can be a registered table name or a path.
`vector_column`	STRING	(required)	The name of the VECTOR column to search against.
`query_vector`	ARRAY<FLOAT>	(required)	The query embedding. Must match the declared dimension and element type of the VECTOR column.
`top_k`	INT	(required)	The number of nearest neighbors to return. Must be a positive integer.
`distance_metric`	STRING	`'cosine'`	Distance metric: `'cosine'`, `'l2'`, or `'dot_product'`.
`algorithm`	STRING	`'brute_force'`	Search algorithm. Currently only `'brute_force'` is supported.
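
The parameter rules above can be enforced before a query is sent. The sketch below assembles a hudi_vector_search call as a string. build_vector_search_sql is a hypothetical helper, not part of Hudi, and it interpolates the table and column names without escaping, so it assumes trusted inputs.

```python
VALID_METRICS = ("cosine", "l2", "dot_product")


def build_vector_search_sql(table_name, vector_column, query_vector, top_k,
                            distance_metric="cosine"):
    """Return a SELECT statement that calls hudi_vector_search."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k <= 0:
        raise ValueError("top_k must be a positive integer")
    if distance_metric not in VALID_METRICS:
        raise ValueError(f"distance_metric must be one of {VALID_METRICS}")
    if len(query_vector) == 0:
        raise ValueError("query_vector must not be empty")
    array_literal = "ARRAY(" + ", ".join(repr(float(v)) for v in query_vector) + ")"
    # Names are interpolated as-is: pass trusted values only.
    return (
        "SELECT * FROM hudi_vector_search("
        f"'{table_name}', '{vector_column}', {array_literal}, "
        f"{top_k}, '{distance_metric}')"
    )


print(build_vector_search_sql("products", "embedding", [0.12, -0.03, 0.87], 10))
```

The returned string could then be passed to a Spark session's sql method. The algorithm argument is left out because 'brute_force' is the only supported value.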

Return Schema

The TVF returns all columns from the source table (excluding the embedding column) plus:

Column	Type	Description
`_hudi_distance`	DOUBLE	The computed distance between the query vector and each result. Lower values indicate greater similarity.

Results are ordered by _hudi_distance ascending — closest matches first.
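
The shape of the result can be illustrated with a small in-memory model. This is a pure-Python sketch of the semantics described above (drop the embedding column, append _hudi_distance, sort ascending, keep top_k), not how Hudi or Spark executes the query. It uses the l2 metric for brevity, and the rows and the function name are made up for the example.

```python
import math


def vector_search(rows, vector_column, query_vector, top_k):
    """Brute-force top-k search over a list of dict rows using l2 distance."""
    scored = []
    for row in rows:
        # Keep every source column except the embedding itself.
        result = {k: v for k, v in row.items() if k != vector_column}
        result["_hudi_distance"] = math.dist(row[vector_column], query_vector)
        scored.append(result)
    # Closest matches first.
    scored.sort(key=lambda r: r["_hudi_distance"])
    return scored[:top_k]


rows = [
    {"product_id": "p1", "embedding": [1.0, 0.0]},
    {"product_id": "p2", "embedding": [0.0, 1.0]},
    {"product_id": "p3", "embedding": [1.0, 1.0]},
]

for hit in vector_search(rows, "embedding", [1.0, 0.0], top_k=2):
    print(hit)
```

With this data the two rows returned are p1 (distance 0.0) followed by p3 (distance 1.0), and neither carries the embedding column.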

hudi_vector_search_batch TVF

For searching with multiple query vectors at once, use the batch variant:

Syntax

SELECT *
FROM hudi_vector_search_batch(
    corpus_table,           -- STRING: table to search
    corpus_embedding_col,   -- STRING: VECTOR column in corpus
    query_table,            -- STRING: table containing query vectors
    query_embedding_col,    -- STRING: VECTOR column in query table
    top_k                   -- INT: neighbors per query
    [, distance_metric]     -- STRING: 'cosine' (default)
    [, algorithm]           -- STRING: 'brute_force' (default)
)

Return Schema (Batch)

Returns the corpus columns and the query columns, plus two columns describing each match:

Column	Type	Description
`_hudi_distance`	DOUBLE	Distance between query and corpus vector
`_hudi_query_index`	LONG	Index identifying which query vector produced this result

If corpus and query tables share column names, query columns are prefixed with _hudi_query_.
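
The batch result schema, including the _hudi_query_ prefix for clashing names, can be modelled in a few lines. This is an in-memory sketch under several assumptions that the text above does not confirm: both embedding columns are dropped from the output, _hudi_query_index is the position of the query row, and l2 is used as the metric for brevity.

```python
import math


def vector_search_batch(corpus_rows, corpus_col, query_rows, query_col, top_k):
    """Return the top_k corpus rows for every query row (l2 distance)."""
    results = []
    for query_index, query in enumerate(query_rows):
        scored = []
        for corpus in corpus_rows:
            out = {k: v for k, v in corpus.items() if k != corpus_col}
            for key, value in query.items():
                if key == query_col:
                    continue
                # Prefix query columns whose names clash with corpus columns.
                name = f"_hudi_query_{key}" if key in corpus else key
                out[name] = value
            out["_hudi_distance"] = math.dist(corpus[corpus_col], query[query_col])
            out["_hudi_query_index"] = query_index  # assumed: row position
            scored.append(out)
        scored.sort(key=lambda r: r["_hudi_distance"])
        results.extend(scored[:top_k])
    return results


corpus = [
    {"id": "c1", "embedding": [0.0, 0.0]},
    {"id": "c2", "embedding": [1.0, 1.0]},
]
queries = [
    {"id": "q1", "embedding": [0.1, 0.0]},
    {"id": "q2", "embedding": [0.9, 1.0]},
]

for row in vector_search_batch(corpus, "embedding", queries, "embedding", top_k=1):
    print(row)
```

Because both tables have an id column, the query's id appears in the output as _hudi_query_id while the corpus id keeps its name.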

Distance Metrics

Metric	Formula	Range	Best for
cosine	1 - cos(a, b), clamped to [0, 2]	[0, 2]	Normalized embeddings (most common). Returns 1.0 for zero vectors.
l2	sqrt(sum((a[i] - b[i])^2))	[0, +inf)	Raw (unnormalized) embeddings
dot_product	-(a · b)	(-inf, +inf)	Maximum inner product search. Negated so ascending sort = most similar.
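
The three formulas in the table translate directly into code. The functions below are a reference sketch written from those formulas, including the clamp to [0, 2] and the 1.0 result for zero vectors. The function names are chosen for the example and this is not Hudi's implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cos(a, b), clamped to [0, 2]; 1.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    # Clamp to absorb floating-point drift just outside [0, 2].
    return min(2.0, max(0.0, 1.0 - cos))


def l2_distance(a, b):
    """Euclidean distance: sqrt(sum((a[i] - b[i])^2))."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product_distance(a, b):
    """Negated inner product, so that ascending order means most similar."""
    return -sum(x * y for x, y in zip(a, b))


a, b = [1.0, 2.0], [3.0, 4.0]
print(cosine_distance(a, b), l2_distance(a, b), dot_product_distance(a, b))
```

Opposite unit vectors reach the top of the cosine range (2.0), and a larger inner product gives a more negative dot_product distance, which is why ascending order still puts the best match first.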

Tip

For best results with cosine distance, L2-normalize your embeddings before writing them to the table. Some embedding models and APIs return unit-length vectors by default (OpenAI's embeddings, for example), but this varies by provider and model, so check before relying on it. If yours does not, normalize during ingestion:

import numpy as np
embedding = embedding / np.linalg.norm(embedding)  # assumes a non-zero vector
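
Dividing by the norm fails for an all-zero vector, which can occur for empty inputs. A dependency-free sketch of a guarded version follows. Leaving zero vectors unchanged is a choice made for this example, on the grounds that the cosine metric is documented to return 1.0 for them.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit, math.sqrt(sum(x * x for x in unit)))
print(l2_normalize([0.0, 0.0]))
```

The same guard applies when using NumPy: check the norm before dividing rather than relying on the resulting NaN values being caught later.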

Examples

Find similar products:

SELECT product_id, name, _hudi_distance AS distance
FROM hudi_vector_search(
    'products', 'embedding',
    ARRAY(0.12, -0.03, 0.87, ...),  -- query embedding
    10,                               -- top 10
    'cosine'
)
ORDER BY distance;

RAG context retrieval:

-- Retrieve the 5 most relevant document chunks for an LLM prompt
SELECT chunk_id, text_content, _hudi_distance
FROM hudi_vector_search(
    'document_chunks', 'embedding',
    ARRAY(...),  -- embedding of the user's question
    5, 'cosine'
)
WHERE _hudi_distance < 0.3;  -- optional distance threshold

Cross-modal search (text-to-image):

-- Using CLIP embeddings, find images matching a text query
SELECT image_id, caption, _hudi_distance
FROM hudi_vector_search(
    'image_catalog', 'clip_embedding',
    ARRAY(...),  -- text embedding from CLIP
    20, 'cosine'
);

Best Practices

Normalize embeddings — Pre-normalize embeddings (L2 norm = 1) for cosine distance. This yields more consistent results and slightly faster search.
Right-size your dimensions — Higher dimensions capture more information but increase storage and search cost. Many use cases work well with 384–1024 dimensions.
Use incremental processing — When new data arrives, only embed and write the new records. Hudi's incremental query capabilities make this straightforward.

Constraints

VECTOR columns must be top-level fields — nesting inside STRUCT, ARRAY, or MAP is not supported.
The query vector's element type must exactly match the corpus embedding's element type (no implicit casting).
VECTOR dimension and element type cannot be changed after table creation via schema evolution.
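
Because there is no implicit casting, a client can check a query vector against the column declaration before issuing a search. The helper below is hypothetical and works only on values the caller already knows (the declared dimension and element type). It treats INT8 and BYTE as the same type, following the element type table.

```python
# INT8 and BYTE name the same element type.
_ALIASES = {"BYTE": "INT8"}


def _canonical(element_type):
    name = element_type.strip().upper()
    return _ALIASES.get(name, name)


def compatibility_problems(column_dim, column_type, query_dim, query_type):
    """Return reasons the query vector cannot be used; empty list if none."""
    problems = []
    if query_dim != column_dim:
        problems.append(
            f"dimension mismatch: column has {column_dim}, query has {query_dim}"
        )
    if _canonical(query_type) != _canonical(column_type):
        problems.append(
            f"element type mismatch: column is {column_type}, "
            f"query is {query_type}; cast the query explicitly"
        )
    return problems


print(compatibility_problems(768, "FLOAT", 768, "DOUBLE"))
print(compatibility_problems(256, "INT8", 256, "byte"))
```

An empty list means the query passes both checks. Any entry indicates the query should be re-embedded or cast explicitly before searching.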
