Version: 0.10.0

Data Quality

Apache Hudi provides Pre-Commit Validators, which let you check that your data meets your data quality expectations while you write with DeltaStreamer or Spark Datasource writers.

To configure pre-commit validators, set hoodie.precommit.validators=<comma separated list of validator class names>.

Example:

// df is the DataFrame being written; the SparkSession itself has no write method
df.write.format("hudi")
  .option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator")
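Because the setting takes a comma-separated list, several validators can be enabled for the same write. The following is a minimal sketch of assembling that value; `ValidatorOptions` and `validatorOptions` are hypothetical names introduced here for illustration and are not part of Hudi. The config key and validator class names are the ones used on this page.

```scala
object ValidatorOptions {
  // Config key as given on this page.
  val ValidatorsKey = "hoodie.precommit.validators"

  // Hypothetical helper (not part of Hudi): joins validator class names
  // into the comma-separated form the setting expects, dropping blanks.
  def validatorOptions(classNames: Seq[String]): Map[String, String] =
    Map(ValidatorsKey -> classNames.map(_.trim).filter(_.nonEmpty).mkString(","))
}

// Usage, e.g. in spark-shell:
val opts = ValidatorOptions.validatorOptions(Seq(
  "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator",
  "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator"
))
```

The resulting map can be handed to the writer with `df.write.format("hudi").options(opts)`. Each validator still needs its own query setting, as described in the sections that follow.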

The following validators are available today, and you can also write your own:

SQL Query Single Result

Can be used to validate that a query on the table results in a specific value.

Multiple queries are supported, separated by the ';' delimiter. The expected result is appended to each query after a '#' separator. Example query: query1#result1;query2#result2

Example, "expect exactly 0 null rows":

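import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
import org.apache.hudi.config.HoodieWriteConfig._ // assumed source of TABLE_NAME, as in the Hudi quickstart
import org.apache.spark.sql.SaveMode.Overwrite

// In SQL, "col = null" never matches any row, so null rows are counted with IS NULL.
df.write.format("hudi").mode(Overwrite).
  option(TABLE_NAME, tableName).
  option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
  option("hoodie.precommit.validators.single.value.sql.queries", "select count(*) from <TABLE_NAME> where col is null#0").
  save(basePath)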
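When several expectations are checked at once, the query1#result1;query2#result2 string can be built from pairs rather than written by hand. This is a sketch only: `SingleResultQueries` is a hypothetical helper, not part of Hudi, and the `amount` column is made up for illustration.

```scala
object SingleResultQueries {
  // Hypothetical helper (not part of Hudi): formats (query, expectedResult)
  // pairs as query1#result1;query2#result2.
  def build(expectations: Seq[(String, String)]): String = {
    expectations.foreach { case (query, expected) =>
      // ';' and '#' act as delimiters, so this sketch rejects them inside a query or result.
      require(!(query + expected).exists(c => c == ';' || c == '#'),
        s"delimiter character in expectation: $query / $expected")
    }
    expectations.map { case (query, expected) => s"$query#$expected" }.mkString(";")
  }
}

// Usage, e.g. in spark-shell ("amount" is a hypothetical column):
val queries = SingleResultQueries.build(Seq(
  "select count(*) from <TABLE_NAME> where col is null" -> "0",
  "select count(*) from <TABLE_NAME> where amount < 0" -> "0"
))
```

The value of `queries` can then be passed as the hoodie.precommit.validators.single.value.sql.queries option.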

SQL Query Equality

Can be used to validate that a query returns the same rows before and after the commit.

Example, "expect no change of null rows with this commit":

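import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
import org.apache.hudi.config.HoodieWriteConfig._ // assumed source of TABLE_NAME, as in the Hudi quickstart
import org.apache.spark.sql.SaveMode.Overwrite

// In SQL, "col = null" never matches any row, so null rows are counted with IS NULL.
df.write.format("hudi").mode(Overwrite).
  option(TABLE_NAME, tableName).
  option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryEqualityPreCommitValidator").
  option("hoodie.precommit.validators.equality.sql.queries", "select count(*) from <TABLE_NAME> where col is null").
  save(basePath)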

SQL Query Inequality

Can be used to validate that a query returns different rows before and after the commit.

Example, "expect the number of null rows to change with this commit":

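import org.apache.hudi.config.HoodiePreCommitValidatorConfig._
import org.apache.hudi.config.HoodieWriteConfig._ // assumed source of TABLE_NAME, as in the Hudi quickstart
import org.apache.spark.sql.SaveMode.Overwrite

// In SQL, "col = null" never matches any row, so null rows are counted with IS NULL.
df.write.format("hudi").mode(Overwrite).
  option(TABLE_NAME, tableName).
  option("hoodie.precommit.validators", "org.apache.hudi.client.validator.SqlQueryInequalityPreCommitValidator").
  option("hoodie.precommit.validators.inequality.sql.queries", "select count(*) from <TABLE_NAME> where col is null").
  save(basePath)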

Extend Custom Validator

You can also provide your own implementation by extending the abstract class SparkPreCommitValidator and overriding this method:

void validateRecordsBeforeAndAfter(Dataset<Row> before,
                                   Dataset<Row> after,
                                   Set<String> partitionsAffected)
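To give a sense of what such an override might check, here is a sketch of validation logic that takes the same three parameters. It is written as a standalone function rather than a full subclass, because the type parameters and constructor of SparkPreCommitValidator are not shown on this page. `RowCountCheck` is a hypothetical name, and signalling failure by throwing IllegalStateException is an assumption made for illustration.

```scala
import java.util.{Set => JSet}
import org.apache.spark.sql.{Dataset, Row}

object RowCountCheck {
  // Hypothetical logic with the same parameters as validateRecordsBeforeAndAfter;
  // in a real validator this would be the body of the overridden method.
  def validate(before: Dataset[Row],
               after: Dataset[Row],
               partitionsAffected: JSet[String]): Unit = {
    val beforeCount = before.count()
    val afterCount = after.count()
    // Assumption: a failed check is reported by throwing; the exception type is illustrative.
    if (afterCount < beforeCount) {
      throw new IllegalStateException(
        s"Row count dropped from $beforeCount to $afterCount in partitions: $partitionsAffected")
    }
  }
}
```

This particular rule rejects any write that reduces the number of rows in the affected partitions; other rules can compare the two datasets in whatever way the table's quality expectations require.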