SQL DML
Spark SQL
SparkSQL provides several Data Manipulation Language (DML) actions for interacting with Hudi tables. These operations allow you to insert, update, merge and delete data from your Hudi tables. Let's explore them one by one.
Please refer to SQL DDL for creating Hudi tables using SQL.
Insert Into
You can use the INSERT INTO
statement to add data to a Hudi table using Spark SQL. Here are some examples:
INSERT INTO <table>
SELECT <columns> FROM <source>;
INSERT INTO
statement does not support evolving table schema. Please use DDL (e.g., ALTER TABLE
) or Datasource write (df.write.format("hudi")....save(basePath)
) to evolve table schema.
From 0.14.0, hoodie.sql.bulk.insert.enable
and hoodie.sql.insert.mode
are deprecated. Users are expected to use hoodie.spark.sql.insert.into.operation
instead.
To manage duplicates with INSERT INTO
, please check out insert dup policy config.
Examples:
-- Insert into a copy-on-write (COW) Hudi table
INSERT INTO hudi_cow_nonpcf_tbl SELECT 1, 'a1', 20;
-- Insert into a merge-on-read (MOR) Hudi table
INSERT INTO hudi_mor_tbl SELECT 1, 'a1', 20, 1000;
-- Insert into a COW Hudi table with static partition
INSERT INTO hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', hh='11') SELECT 2, 'a2', 1000;
-- Insert into a COW Hudi table with dynamic partition
INSERT INTO hudi_cow_pt_tbl PARTITION(dt, hh) SELECT 1 AS id, 'a1' AS name, 1000 AS ts, '2021-12-09' AS dt, '10' AS hh;
Hudi offers flexibility in choosing the underlying write operation of a INSERT INTO
statement using
the hoodie.spark.sql.insert.into.operation
configuration. Possible options include "bulk_insert" (large inserts), "insert" (with small file management),
and "upsert" (with deduplication/merging). If a precombine field is not set, "insert" is chosen as the default. For a table with preCombine field set,
"upsert" is chosen as the default operation.
Insert Overwrite
The INSERT OVERWRITE
statement is used to replace existing data in a Hudi table.
INSERT OVERWRITE <table>
SELECT <columns> FROM <source>;
All existing partitions that are affected by the INSERT OVERWRITE
statement will replaced with the source data.
Here are some examples:
-- Overwrite non-partitioned table
INSERT OVERWRITE hudi_mor_tbl SELECT 99, 'a99', 20.0, 900;
INSERT OVERWRITE hudi_cow_nonpcf_tbl SELECT 99, 'a99', 20.0;
-- Overwrite partitioned table with dynamic partition
INSERT OVERWRITE TABLE hudi_cow_pt_tbl SELECT 10, 'a10', 1100, '2021-12-09', '10';
-- Overwrite partitioned table with static partition
INSERT OVERWRITE hudi_cow_pt_tbl PARTITION(dt = '2021-12-09', hh='12') SELECT 13, 'a13', 1100;
Update
You can use the UPDATE
statement to modify existing data in a Hudi table directly.
UPDATE tableIdentifier SET column = EXPRESSION(,column = EXPRESSION) [ WHERE boolExpression]
Here's an example:
-- Update data in a Hudi table
UPDATE hudi_mor_tbl SET price = price * 2, ts = 1111 WHERE id = 1;
-- Update data in a partitioned Hudi table
UPDATE hudi_cow_pt_tbl SET name = 'a1_1', ts = 1001 WHERE id = 1;
-- update using non-PK field
update hudi_cow_pt_tbl set ts = 1001 where name = 'a1';
The UPDATE
operation requires the specification of a preCombineField
.