All Configurations
This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at a few levels.
- Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick the write operation, specify how to merge records, or choose the query type to read.
- Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
- Write Client Configs: Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
- Metrics Configs: This set of configs is used to enable monitoring and reporting of key Hudi stats and metrics.
- Record Payload Config: This is the lowest level of customization offered by Hudi. Record payloads define how to produce new values to upsert based on the incoming new record and the stored old record. Hudi provides default implementations such as OverwriteWithLatestAvroPayload, which simply updates the table with the latest/last-written record. This can be overridden with a custom class extending the HoodieRecordPayload class, at both the datasource and WriteClient levels.
- Kafka Connect Configs: This set of configs is used by the Kafka Connect Sink Connector for writing Hudi tables.
- Amazon Web Services Configs: Configs specific to Amazon Web Services.
Spark Datasource Configs
These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick the write operation, specify how to merge records, or choose the query type to read.
Read Options
Options useful for reading tables via read.format.option(...)
Config Class: org.apache.hudi.DataSourceOptions.scala
hoodie.file.index.enable
Enables use of the Spark file index implementation for Hudi, which speeds up listing of large tables.
Default Value: true (Optional)
Config Param: ENABLE_HOODIE_FILE_INDEX
Deprecated Version: 0.11.0
hoodie.datasource.read.paths
Comma separated list of file paths to read within a Hudi table.
Default Value: N/A (Required)
Config Param: READ_PATHS
hoodie.datasource.read.incr.filters
For use cases like DeltaStreamer, which reads from a Hoodie incremental table and applies opaque map functions, filters appearing late in the sequence of transformations cannot be automatically pushed down. This option allows setting filters directly on the Hoodie source.
Default Value: (Optional)
Config Param: PUSH_DOWN_INCR_FILTERS
hoodie.enable.data.skipping
Enables data-skipping allowing queries to leverage indexes to reduce the search space by skipping over files
Default Value: false (Optional)
Config Param: ENABLE_DATA_SKIPPING
Since Version: 0.10.0
as.of.instant
The query instant for time travel. If this option is not specified, the latest snapshot is queried.
Default Value: N/A (Required)
Config Param: TIME_TRAVEL_AS_OF_INSTANT
hoodie.datasource.read.schema.use.end.instanttime
Uses the schema at the end instant when fetching data incrementally. Default: uses the latest instant schema.
Default Value: false (Optional)
Config Param: INCREMENTAL_READ_SCHEMA_USE_END_INSTANTTIME
hoodie.datasource.read.incr.path.glob
For use cases where users only want to incrementally pull from certain partitions instead of the full table. This option allows using a glob pattern to filter directly on the path.
Default Value: (Optional)
Config Param: INCR_PATH_GLOB
hoodie.datasource.read.end.instanttime
Instant time to limit incrementally fetched data to. New data written with an instant_time <= END_INSTANTTIME is fetched.
Default Value: N/A (Required)
Config Param: END_INSTANTTIME
hoodie.datasource.write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: READ_PRE_COMBINE_FIELD
hoodie.datasource.merge.type
For Snapshot query on merge on read table, control whether we invoke the record payload implementation to merge (payload_combine) or skip merging altogether (skip_merge)
Default Value: payload_combine (Optional)
Config Param: REALTIME_MERGE
hoodie.datasource.read.extract.partition.values.from.path
When set to true, values for partition columns (partition values) will be extracted from physical partition path (default Spark behavior). When set to false partition values will be read from the data file (in Hudi partition columns are persisted by default). This config is a fallback allowing to preserve existing behavior, and should not be used otherwise.
Default Value: false (Optional)
Config Param: EXTRACT_PARTITION_VALUES_FROM_PARTITION_PATH
Since Version: 0.11.0
hoodie.datasource.read.begin.instanttime
Instant time to start incrementally pulling data from. The instant time here need not necessarily correspond to an instant on the timeline. New data written with an instant_time > BEGIN_INSTANTTIME is fetched. For example, ‘20170901080000’ will get all new data written after Sep 1, 2017 08:00AM.
Default Value: N/A (Required)
Config Param: BEGIN_INSTANTTIME
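As a minimal sketch of the instant-time conventions described for the begin and end instant options, the following Java snippet formats a wall-clock time as an instant string and checks the range semantics (exclusive begin, inclusive end). The class and method names are hypothetical and not part of Hudi; the yyyyMMddHHmmss pattern is inferred from the example instant ‘20170901080000’, and comparing instants as plain strings is an assumption that holds only for fixed-width timestamps of this form.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of Hudi.
final class InstantTimeSketch {
    // Pattern inferred from the example instant '20170901080000'.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Formats a local date-time as an instant-time string.
    static String toInstant(LocalDateTime t) {
        return t.format(FMT);
    }

    // Mirrors the documented semantics: instant_time > begin and instant_time <= end.
    // Assumes fixed-width instants, so lexicographic order matches time order.
    static boolean inRange(String instant, String begin, String end) {
        return instant.compareTo(begin) > 0 && instant.compareTo(end) <= 0;
    }

    public static void main(String[] args) {
        String begin = toInstant(LocalDateTime.of(2017, 9, 1, 8, 0, 0));
        System.out.println(begin);
        System.out.println(inRange("20170901093000", begin, "20170902000000"));
    }
}
```

Note that a commit written exactly at the begin instant is excluded, while one written exactly at the end instant is included.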
hoodie.datasource.read.incr.fallback.fulltablescan.enable
Whether an incremental query should fall back to a full table scan if a file does not exist.
Default Value: false (Optional)
Config Param: INCREMENTAL_FALLBACK_TO_FULL_TABLE_SCAN_FOR_NON_EXISTING_FILES
hoodie.datasource.query.type
Whether data needs to be read in incremental mode (new data since an instantTime), Read Optimized mode (obtain latest view, based on base files), or Snapshot mode (obtain latest view, by merging base and (if any) log files)
Default Value: snapshot (Optional)
Config Param: QUERY_TYPE
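The read options above are plain string key/value pairs, so one way to keep them organized is to build them as maps per query style. The sketch below is illustrative only: the class and method names are hypothetical, and the lowercase values `snapshot` and `incremental` are assumed from the default value and the modes named in the description of hoodie.datasource.query.type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Hudi.
final class HudiReadOptionsSketch {
    // Snapshot query: the documented default query type.
    static Map<String, String> snapshot() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.query.type", "snapshot");
        return opts;
    }

    // Incremental query between two instants (value "incremental" is assumed).
    static Map<String, String> incremental(String beginInstant, String endInstant) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.query.type", "incremental");
        opts.put("hoodie.datasource.read.begin.instanttime", beginInstant);
        opts.put("hoodie.datasource.read.end.instanttime", endInstant);
        return opts;
    }

    // Time travel: a snapshot query pinned to a given instant.
    static Map<String, String> timeTravel(String asOfInstant) {
        Map<String, String> opts = snapshot();
        opts.put("as.of.instant", asOfInstant);
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(incremental("20170901080000", "20170902080000"));
    }
}
```

Assuming an existing SparkSession named `spark` and a table location `basePath`, such a map could then be handed to the reader with `spark.read().format("org.apache.hudi").options(opts).load(basePath)`.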
Write Options
You can pass down any of the WriteClient level configs directly using options() or option(k,v) methods.
inputDF.write()
    .format("org.apache.hudi")
    .options(clientOpts) // any of the Hudi client opts can be passed in as well
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath);
Options useful for writing tables via write.format.option(...)
Config Class: org.apache.hudi.DataSourceOptions.scala
hoodie.clustering.async.enabled
Enable running of clustering service, asynchronously as inserts happen on the table.
Default Value: false (Optional)
Config Param: ASYNC_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.datasource.write.operation
Whether to do upsert, insert or bulkinsert for the write operation. Use bulkinsert to load new data into a table, and from then on use upsert/insert. Bulk insert uses a disk-based write path to scale to large inputs without needing to cache them.
Default Value: upsert (Optional)
Config Param: OPERATION
hoodie.datasource.write.reconcile.schema
When a new write batch has records with an old schema but the table schema has since evolved, this config upgrades the records to the latest table schema (default values are injected into missing fields). If it is not enabled, the write batch fails.
Default Value: false (Optional)
Config Param: RECONCILE_SCHEMA
hoodie.datasource.write.recordkey.field
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c
Default Value: uuid (Optional)
Config Param: RECORDKEY_FIELD
hoodie.datasource.hive_sync.skip_ro_suffix
Skip the _ro suffix for Read optimized table, when registering
Default Value: false (Optional)
Config Param: HIVE_SKIP_RO_SUFFIX_FOR_READ_OPTIMIZED_TABLE
hoodie.datasource.write.partitionpath.urlencode
Whether to URL-encode the partition path value before creating the folder structure.
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING
hoodie.datasource.hive_sync.partition_extractor_class
Class which implements PartitionValueExtractor to extract the partition values, default 'SlashEncodedDayPartitionValueExtractor'.
Default Value: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)
Config Param: HIVE_PARTITION_EXTRACTOR_CLASS
hoodie.datasource.hive_sync.serde_properties
Serde properties to hive table.
Default Value: N/A (Required)
Config Param: HIVE_TABLE_SERDE_PROPERTIES
hoodie.datasource.hive_sync.sync_comment
Whether to sync the table column comments while syncing the table.
Default Value: false (Optional)
Config Param: HIVE_SYNC_COMMENT
hoodie.datasource.hive_sync.password
hive password to use
Default Value: hive (Optional)
Config Param: HIVE_PASS
hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled
When set to true, a consistent value will be generated for a logical timestamp type column, like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break pipelines that deploy either the fully row-writer path or the non row-writer path. For example, if it is kept disabled then a record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in the row-writer path, while it will be written as long value 1483023240000000 in the non row-writer path. If enabled, then the timestamp value will be written in both cases.
Default Value: false (Optional)
Config Param: KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED
hoodie.datasource.hive_sync.support_timestamp
‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. Disabled by default for backward compatibility.
Default Value: false (Optional)
Config Param: HIVE_SUPPORT_TIMESTAMP_TYPE
hoodie.datasource.hive_sync.create_managed_table
Whether to sync the table as managed table.
Default Value: false (Optional)
Config Param: HIVE_CREATE_MANAGED_TABLE
hoodie.clustering.inline
Turn on inline clustering - clustering will be run after each write operation is complete
Default Value: false (Optional)
Config Param: INLINE_CLUSTERING_ENABLE
Since Version: 0.7.0
hoodie.datasource.compaction.async.enable
Controls whether async compaction should be turned on for MOR table writing.
Default Value: true (Optional)
Config Param: ASYNC_COMPACT_ENABLE
hoodie.datasource.meta.sync.enable
Enable Syncing the Hudi Table with an external meta store or data catalog.
Default Value: false (Optional)
Config Param: META_SYNC_ENABLED
hoodie.datasource.write.streaming.ignore.failed.batch
Config to indicate whether to ignore any non exception error (e.g. writestatus error) within a streaming microbatch
Default Value: true (Optional)
Config Param: STREAMING_IGNORE_FAILED_BATCH
hoodie.datasource.write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: PRECOMBINE_FIELD
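The precombine rule described above (for records sharing a key, keep the one with the largest precombine value) can be illustrated with a small standalone sketch. This is not Hudi's implementation: the `Rec` type and the method name are hypothetical, the precombine value is assumed to be a long, and keeping the earlier record on ties is an arbitrary choice made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of precombine semantics; not Hudi code.
final class PrecombineSketch {
    // Minimal record shape: a key, a precombine value (ts) and a payload.
    static final class Rec {
        final String key;
        final long ts;
        final String value;

        Rec(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    // For each key, keeps the record with the largest ts.
    // On equal ts the record seen first is kept (an assumption of this sketch).
    static Map<String, Rec> preCombine(List<Rec> batch) {
        Map<String, Rec> latest = new LinkedHashMap<>();
        for (Rec r : batch) {
            latest.merge(r.key, r, (existing, incoming) ->
                    incoming.ts > existing.ts ? incoming : existing);
        }
        return latest;
    }

    public static void main(String[] args) {
        List<Rec> batch = java.util.Arrays.asList(
                new Rec("k1", 1L, "old"),
                new Rec("k1", 5L, "new"),
                new Rec("k2", 3L, "only"));
        preCombine(batch).forEach((k, r) -> System.out.println(k + " -> " + r.value));
    }
}
```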
hoodie.datasource.hive_sync.username
hive user name to use
Default Value: hive (Optional)
Config Param: HIVE_USER
hoodie.datasource.write.partitionpath.field
Partition path field. Value to be used as the partitionPath component of HoodieKey. Actual value obtained by invoking .toString()
Default Value: N/A (Required)
Config Param: PARTITIONPATH_FIELD
hoodie.datasource.write.streaming.retry.count
Config to indicate how many times streaming job should retry for a failed micro batch.
Default Value: 3 (Optional)
Config Param: STREAMING_RETRY_CNT
hoodie.datasource.hive_sync.partition_fields
Field in the table to use for determining hive partition columns.
Default Value: (Optional)
Config Param: HIVE_PARTITION_FIELDS
hoodie.datasource.hive_sync.sync_as_datasource
Default Value: true (Optional)
Config Param: HIVE_SYNC_AS_DATA_SOURCE_TABLE
hoodie.sql.insert.mode
Insert mode when inserting data into a pk-table. The optional modes are: upsert, strict and non-strict. In upsert mode, the insert statement performs an upsert on the pk-table, which updates duplicate records. In strict mode, the insert statement enforces the primary key uniqueness constraint and does not allow duplicate records. In non-strict mode, Hudi just performs the insert operation on the pk-table.
Default Value: upsert (Optional)
Config Param: SQL_INSERT_MODE
hoodie.datasource.hive_sync.use_jdbc
Use JDBC when hive synchronization is enabled
Default Value: true (Optional)
Config Param: HIVE_USE_JDBC
Deprecated Version: 0.9.0
hoodie.meta.sync.client.tool.class
Sync tool class name used to sync to metastore. Defaults to Hive.
Default Value: org.apache.hudi.hive.HiveSyncTool (Optional)
Config Param: META_SYNC_CLIENT_TOOL_CLASS_NAME
hoodie.datasource.write.keygenerator.class
Key generator class, that implements org.apache.hudi.keygen.KeyGenerator
Default Value: org.apache.hudi.keygen.SimpleKeyGenerator (Optional)
Config Param: KEYGENERATOR_CLASS_NAME
hoodie.datasource.write.payload.class
Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting. This will render any value set for PRECOMBINE_FIELD_OPT_VAL ineffective
Default Value: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)
Config Param: PAYLOAD_CLASS_NAME
hoodie.datasource.hive_sync.table_properties
Additional properties to store with table.
Default Value: N/A (Required)
Config Param: HIVE_TABLE_PROPERTIES
hoodie.datasource.hive_sync.jdbcurl
Hive JDBC URL
Default Value: jdbc:hive2://localhost:10000 (Optional)
Config Param: HIVE_URL
hoodie.datasource.hive_sync.batch_num
The number of partitions in one batch when synchronizing partitions to Hive.
Default Value: 1000 (Optional)
Config Param: HIVE_BATCH_SYNC_PARTITION_NUM
hoodie.datasource.hive_sync.assume_date_partitioning
Assume partitioning is yyyy/mm/dd
Default Value: false (Optional)
Config Param: HIVE_ASSUME_DATE_PARTITION
hoodie.datasource.hive_sync.bucket_sync
Whether to sync the Hive metastore bucket specification when using the bucket index. The specification is 'CLUSTERED BY (trace_id) SORTED BY (trace_id ASC) INTO 65536 BUCKETS'
Default Value: false (Optional)
Config Param: HIVE_SYNC_BUCKET_SYNC
hoodie.datasource.hive_sync.auto_create_database
Auto create hive database if it does not exist
Default Value: true (Optional)
Config Param: HIVE_AUTO_CREATE_DATABASE
hoodie.datasource.hive_sync.database
The name of the destination database that we should sync the hudi table to.
Default Value: default (Optional)
Config Param: HIVE_DATABASE
hoodie.datasource.write.streaming.retry.interval.ms
Config to indicate how long (in milliseconds) to wait before a retry is issued for a failed microbatch
Default Value: 2000 (Optional)
Config Param: STREAMING_RETRY_INTERVAL_MS
hoodie.sql.bulk.insert.enable
When set to true, the sql insert statement will use bulk insert.
Default Value: false (Optional)
Config Param: SQL_ENABLE_BULK_INSERT
hoodie.datasource.write.commitmeta.key.prefix
Option keys beginning with this prefix, are automatically added to the commit/deltacommit metadata. This is useful to store checkpointing information, in a consistent way with the hudi timeline
Default Value: _ (Optional)
Config Param: COMMIT_METADATA_KEYPREFIX
hoodie.datasource.write.drop.partition.columns
When set to true, will not write the partition columns into hudi. By default, false.
Default Value: false (Optional)
Config Param: DROP_PARTITION_COLUMNS
hoodie.datasource.hive_sync.enable
When set to true, register/sync the table to Apache Hive metastore.
Default Value: false (Optional)
Config Param: HIVE_SYNC_ENABLED
hoodie.datasource.hive_sync.table
The name of the destination table that we should sync the hudi table to.
Default Value: unknown (Optional)
Config Param: HIVE_TABLE
hoodie.datasource.hive_sync.ignore_exceptions
Ignore exceptions when syncing with Hive.
Default Value: false (Optional)
Config Param: HIVE_IGNORE_EXCEPTIONS
hoodie.datasource.hive_sync.use_pre_apache_input_format
Flag to choose InputFormat under com.uber.hoodie package instead of org.apache.hudi package. Use this when you are in the process of migrating from com.uber.hoodie to org.apache.hudi. Stop using this after you migrated the table definition to org.apache.hudi input format
Default Value: false (Optional)
Config Param: HIVE_USE_PRE_APACHE_INPUT_FORMAT
hoodie.datasource.write.table.type
The table type for the underlying data, for this write. This can’t change between writes.
Default Value: COPY_ON_WRITE (Optional)
Config Param: TABLE_TYPE
hoodie.datasource.write.row.writer.enable
When set to true, will perform write operations directly using the Spark native Row representation, avoiding any additional conversion costs.
Default Value: true (Optional)
Config Param: ENABLE_ROW_WRITER
hoodie.datasource.write.hive_style_partitioning
Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING
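To make the two folder naming schemes concrete, here is a tiny sketch of how a partition folder name would differ with and without Hive style partitioning. The helper is hypothetical and only mirrors the `<partition_column_name>=<partition_value>` format stated above; it ignores URL encoding and multi-column partition paths.

```java
// Hypothetical illustration of partition folder naming; not Hudi code.
final class PartitionFolderSketch {
    // Hive style: "<column>=<value>"; otherwise just "<value>".
    static String folderName(String column, String value, boolean hiveStyle) {
        return hiveStyle ? column + "=" + value : value;
    }

    public static void main(String[] args) {
        System.out.println(folderName("region", "emea", true));
        System.out.println(folderName("region", "emea", false));
    }
}
```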
hoodie.datasource.meta_sync.condition.sync
If true, only sync on conditions like schema change or partition change.
Default Value: false (Optional)
Config Param: HIVE_CONDITIONAL_SYNC
hoodie.datasource.hive_sync.mode
Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql.
Default Value: N/A (Required)
Config Param: HIVE_SYNC_MODE
hoodie.datasource.write.table.name
Table name for the datasource write. Also used to register the table into meta stores.
Default Value: N/A (Required)
Config Param: TABLE_NAME
hoodie.datasource.hive_sync.base_file_format
Base file format for the sync.
Default Value: PARQUET (Optional)
Config Param: HIVE_BASE_FILE_FORMAT
hoodie.deltastreamer.source.kafka.value.deserializer.class
This class is used by kafka client to deserialize the records
Default Value: io.confluent.kafka.serializers.KafkaAvroDeserializer (Optional)
Config Param: KAFKA_AVRO_VALUE_DESERIALIZER_CLASS
Since Version: 0.9.0
hoodie.datasource.hive_sync.metastore.uris
Hive metastore url
Default Value: thrift://localhost:9083 (Optional)
Config Param: METASTORE_URIS
hoodie.datasource.write.insert.drop.duplicates
If set to true, filters out all duplicate records from incoming dataframe, during insert operations.
Default Value: false (Optional)
Config Param: INSERT_DROP_DUPS
hoodie.datasource.write.partitions.to.delete
Comma separated list of partitions to delete
Default Value: N/A (Required)
Config Param: PARTITIONS_TO_DELETE
PreCommit Validator Configurations
The following set of configurations help validate new data before commits.
Config Class: org.apache.hudi.config.HoodiePreCommitValidatorConfig
hoodie.precommit.validators.single.value.sql.queries
Spark SQL queries to run on table before committing new data to validate state after commit. Multiple queries separated by ';' delimiter are supported. Expected result is included as part of query separated by '#'. Example query: 'query1#result1;query2#result2'. Note: the <TABLE_NAME> variable is expected to be present in the query.
Default Value: (Optional)
Config Param: SINGLE_VALUE_SQL_QUERIES
hoodie.precommit.validators.equality.sql.queries
Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example: "select count(*) from <TABLE_NAME>". Note: <TABLE_NAME> is replaced by table state before and after commit.
Default Value: (Optional)
Config Param: EQUALITY_SQL_QUERIES
hoodie.precommit.validators
Comma separated list of class names that can be invoked to validate commit
Default Value: (Optional)
Config Param: VALIDATOR_CLASS_NAMES
hoodie.precommit.validators.inequality.sql.queries
Spark SQL queries to run on table before committing new data to validate state before and after commit. Multiple queries separated by ';' delimiter are supported. Example query: 'select count(*) from <TABLE_NAME> where col=null'. Note: the <TABLE_NAME> variable is expected to be present in the query.
Default Value: (Optional)
Config Param: INEQUALITY_SQL_QUERIES
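The value format for hoodie.precommit.validators.single.value.sql.queries (each query paired with its expected result via '#', entries joined with ';') can be assembled programmatically. The following is a sketch under those stated rules only; the class name, the column names `id` and `col`, and the expected results are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for building a validator query string; not part of Hudi.
final class SingleValueQueriesSketch {
    // Pairs a query with its expected single-value result using '#'.
    static String entry(String query, String expectedResult) {
        return query + "#" + expectedResult;
    }

    // Joins multiple entries with the ';' delimiter.
    static String join(List<String> entries) {
        return String.join(";", entries);
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        // <TABLE_NAME> is left as a literal placeholder, as the description requires.
        entries.add(entry("select count(*) from <TABLE_NAME> where col is null", "0"));
        entries.add(entry("select count(distinct id) from <TABLE_NAME>", "10"));
        System.out.println(join(entries));
    }
}
```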
Flink Sql Configs
These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
Flink Options
Flink jobs using SQL can be configured through the options in the WITH clause. The actual datasource level configs are listed below.
Config Class: org.apache.hudi.configuration.FlinkOptions
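As a rough sketch of how a few of the options in this section might end up in a WITH clause, the snippet below renders a CREATE TABLE statement from an option map. Everything specific here is an assumption for illustration: the helper class, the table name `hudi_sink`, the columns, the `file:///tmp/hudi_sink` path, and the `'connector' = 'hudi'` entry, which is not listed in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical DDL builder for illustration only; not part of Hudi or Flink.
final class FlinkHudiDdlSketch {
    // Renders options as: WITH ( 'k1' = 'v1', 'k2' = 'v2' ), one per line.
    static String withClause(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(",\n  ", "WITH (\n  ", "\n)");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add("'" + e.getKey() + "' = '" + e.getValue() + "'");
        }
        return joiner.toString();
    }

    // Combines a table name, a column list and the WITH clause.
    static String createTable(String name, String columns, Map<String, String> options) {
        return "CREATE TABLE " + name + " (\n  " + columns + "\n) " + withClause(options);
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("connector", "hudi");              // assumed connector identifier
        options.put("path", "file:///tmp/hudi_sink");  // hypothetical base path
        options.put("table.type", "MERGE_ON_READ");
        options.put("read.streaming.enabled", "true");
        options.put("read.streaming.check-interval", "60");
        String columns = "uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,\n  name VARCHAR(10),\n  ts TIMESTAMP(3)";
        System.out.println(createTable("hudi_sink", columns, options));
    }
}
```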
read.streaming.enabled
Whether to read as streaming source, default false
Default Value: false (Optional)
Config Param: READ_AS_STREAMING
hoodie.datasource.write.keygenerator.type
Key generator type, which determines how the key is extracted from the incoming record. Note: this is being actively worked on. Please use hoodie.datasource.write.keygenerator.class instead.
Default Value: SIMPLE (Optional)
Config Param: KEYGEN_TYPE
compaction.trigger.strategy
Strategy to trigger compaction. Options are 'num_commits': trigger compaction when N delta commits are reached; 'time_elapsed': trigger compaction when time elapsed > N seconds since last compaction; 'num_and_time': trigger compaction when both NUM_COMMITS and TIME_ELAPSED are satisfied; 'num_or_time': trigger compaction when NUM_COMMITS or TIME_ELAPSED is satisfied. Default is 'num_commits'
Default Value: num_commits (Optional)
Config Param: COMPACTION_TRIGGER_STRATEGY
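The four strategies combine two conditions, which the following sketch spells out as a decision function. It is an illustration of the description above, not Hudi's scheduler: the class, the parameter names and the thresholds are hypothetical, and "reaching N delta commits" is read here as `>=` while the time condition uses the strict `>` stated in the description.

```java
// Hypothetical decision function for illustration only; not Hudi code.
final class CompactionTriggerSketch {
    static boolean shouldCompact(String strategy,
                                 int deltaCommits, long elapsedSeconds,
                                 int commitThreshold, long secondsThreshold) {
        // NUM_COMMITS condition: N delta commits reached (assumed inclusive).
        boolean byNum = deltaCommits >= commitThreshold;
        // TIME_ELAPSED condition: more than N seconds since the last compaction.
        boolean byTime = elapsedSeconds > secondsThreshold;
        switch (strategy) {
            case "num_commits":
                return byNum;
            case "time_elapsed":
                return byTime;
            case "num_and_time":
                return byNum && byTime;
            case "num_or_time":
                return byNum || byTime;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        // 5 delta commits, 10 seconds elapsed, thresholds of 5 commits and 3600 seconds.
        System.out.println(shouldCompact("num_commits", 5, 10, 5, 3600));
        System.out.println(shouldCompact("num_and_time", 5, 10, 5, 3600));
    }
}
```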
index.state.ttl
Index state ttl in days, default stores the index permanently
Default Value: 0.0 (Optional)
Config Param: INDEX_STATE_TTL
compaction.max_memory
Max memory in MB for compaction spillable map, default 100MB
Default Value: 100 (Optional)
Config Param: COMPACTION_MAX_MEMORY
hive_sync.support_timestamp
INT64 with original type TIMESTAMP_MICROS is converted to hive timestamp type. Disabled by default for backward compatibility.
Default Value: true (Optional)
Config Param: HIVE_SYNC_SUPPORT_TIMESTAMP
hive_sync.serde_properties
Serde properties to hive table, the data format is k1=v1 k2=v2
Default Value: N/A (Required)
Config Param: HIVE_SYNC_TABLE_SERDE_PROPERTIES
hive_sync.skip_ro_suffix
Skip the _ro suffix for Read optimized table when registering, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_SKIP_RO_SUFFIX
metadata.compaction.delta_commits
Max delta commits for metadata table to trigger compaction, default 10
Default Value: 10 (Optional)
Config Param: METADATA_COMPACTION_DELTA_COMMITS
hive_sync.assume_date_partitioning
Assume partitioning is yyyy/mm/dd, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_ASSUME_DATE_PARTITION
write.parquet.block.size
Parquet RowGroup size. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Default Value: 120 (Optional)
Config Param: WRITE_PARQUET_BLOCK_SIZE
hive_sync.table
Table name for hive sync, default 'unknown'
Default Value: unknown (Optional)
Config Param: HIVE_SYNC_TABLE
write.payload.class
Payload class used. Override this if you would like to roll your own merge logic when upserting/inserting. This will render any value set for the option ineffective
Default Value: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)
Config Param: PAYLOAD_CLASS_NAME
compaction.tasks
Parallelism of tasks that do actual compaction, default is 4
Default Value: 4 (Optional)
Config Param: COMPACTION_TASKS
hoodie.datasource.write.hive_style_partitioning
Whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING
table.type
Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ
Default Value: COPY_ON_WRITE (Optional)
Config Param: TABLE_TYPE
hive_sync.auto_create_db
Auto create hive database if it does not exist, default true
Default Value: true (Optional)
Config Param: HIVE_SYNC_AUTO_CREATE_DB
compaction.timeout.seconds
Max timeout time in seconds for online compaction to rollback, default 20 minutes
Default Value: 1200 (Optional)
Config Param: COMPACTION_TIMEOUT_SECONDS
hive_sync.username
Username for hive sync, default 'hive'
Default Value: hive (Optional)
Config Param: HIVE_SYNC_USERNAME
write.sort.memory
Sort memory in MB, default 128MB
Default Value: 128 (Optional)
Config Param: WRITE_SORT_MEMORY
hive_sync.enable
Asynchronously sync Hive meta to HMS, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_ENABLED
changelog.enabled
Whether to keep all the intermediate changes. When enabled, we try to keep all the changes of a record: 1) the sink accepts the UPDATE_BEFORE message; 2) the source tries to emit every change of a record. The semantics are best effort because the compaction job eventually merges all changes of a record into one. Default false, to have UPSERT semantics
Default Value: false (Optional)
Config Param: CHANGELOG_ENABLED
read.streaming.check-interval
Check interval for streaming read, in seconds; default 1 minute
Default Value: 60 (Optional)
Config Param: READ_STREAMING_CHECK_INTERVAL
write.bulk_insert.shuffle_input
Whether to shuffle the inputs by specific fields for bulk insert tasks, default true
Default Value: true (Optional)
Config Param: WRITE_BULK_INSERT_SHUFFLE_INPUT
hoodie.datasource.merge.type
For Snapshot query on merge on read table. Use this key to define how the payloads are merged: 1) skip_merge: read the base file records plus the log file records; 2) payload_combine: read the base file records first; for each record in the base file, check whether the key is in the log file records (combining the two records with the same key from the base and log files), then read the remaining log file records
Default Value: payload_combine (Optional)
Config Param: MERGE_TYPE
write.retry.times
Flag to indicate how many times streaming job should retry for a failed checkpoint batch. By default 3
Default Value: 3 (Optional)
Config Param: RETRY_TIMES
metadata.enabled
Enable the internal metadata table which serves table metadata like file listings, default false
Default Value: false (Optional)
Config Param: METADATA_ENABLED
read.tasks
Parallelism of tasks that do actual read, default is 4
Default Value: 4 (Optional)
Config Param: READ_TASKS
write.parquet.max.file.size
Target size for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Default Value: 120 (Optional)
Config Param: WRITE_PARQUET_MAX_FILE_SIZE
hoodie.bucket.index.hash.field
Index key field. Value to be used as hashing to find the bucket ID. Should be a subset of or equal to the recordKey fields. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c
Default Value: (Optional)
Config Param: INDEX_KEY_FIELD
hoodie.bucket.index.num.buckets
Hudi bucket number per partition. Only takes effect when using the Hudi bucket index.
Default Value: 4 (Optional)
Config Param: BUCKET_INDEX_NUM_BUCKETS
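To illustrate how a hash field value and a bucket count could map a record to a bucket ID, here is a minimal sketch. The hashing scheme (Java's `String.hashCode()` masked to a non-negative value, then taken modulo the bucket count) is an assumption chosen for illustration and is not claimed to be the scheme Hudi uses; the class name is hypothetical.

```java
// Hypothetical bucket assignment for illustration only; not Hudi code.
final class BucketIdSketch {
    // Maps the string form of the hash field to a bucket in [0, numBuckets).
    static int bucketId(String hashFieldValue, int numBuckets) {
        // Mask the sign bit so the modulo result is never negative.
        return (hashFieldValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // Uses the documented default of 4 buckets per partition.
        System.out.println(bucketId("abc", 4));
    }
}
```

The property that matters is that the same key always maps to the same bucket within a partition, so records with equal keys land in the same file group.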
read.end-commit
End commit instant for reading, the commit time format should be 'yyyyMMddHHmmss'
Default Value: N/A (Required)
Config Param: READ_END_COMMIT
write.log.max.size
Maximum size allowed in MB for a log file before it is rolled over to the next version, default 1GB
Default Value: 1024 (Optional)
Config Param: WRITE_LOG_MAX_SIZE
hive_sync.file_format
File format for hive sync, default 'PARQUET'
Default Value: PARQUET (Optional)
Config Param: HIVE_SYNC_FILE_FORMAT
hive_sync.mode
Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql, default 'jdbc'
Default Value: jdbc (Optional)
Config Param: HIVE_SYNC_MODE
write.retry.interval.ms
Flag to indicate how long (in milliseconds) to wait before a retry is issued for a failed checkpoint batch. By default 2000, and it is doubled on every retry
Default Value: 2000 (Optional)
Config Param: RETRY_INTERVAL_MS
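The doubling behaviour of the retry interval, combined with the write.retry.times default of 3, can be sketched as a simple schedule. This is an illustration under assumptions: the class is hypothetical, and it assumes the first retry waits the base interval and each later retry waits twice as long as the previous one.

```java
import java.util.Arrays;

// Hypothetical backoff schedule for illustration only; not Hudi code.
final class RetryBackoffSketch {
    // Returns the wait before each retry, doubling from the initial interval.
    static long[] waitsMs(long initialIntervalMs, int retryTimes) {
        long[] waits = new long[retryTimes];
        long current = initialIntervalMs;
        for (int i = 0; i < retryTimes; i++) {
            waits[i] = current;
            current *= 2;
        }
        return waits;
    }

    public static void main(String[] args) {
        // Documented defaults: 2000 ms interval, 3 retries.
        System.out.println(Arrays.toString(waitsMs(2000L, 3)));
    }
}
```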
write.partition.format
Partition path format, only valid when 'write.datetime.partitioning' is true, default is:
- 'yyyyMMddHH' for timestamp(3) WITHOUT TIME ZONE, LONG, FLOAT, DOUBLE, DECIMAL;
- 'yyyyMMdd' for DATE and INT.
Default Value: N/A (Required)
Config Param: PARTITION_FORMAT
hive_sync.db
Database name for hive sync, default 'default'
Default Value: default (Optional)
Config Param: HIVE_SYNC_DB
index.type
Index type of Flink write job, default is using state backed index.
Default Value: FLINK_STATE (Optional)
Config Param: INDEX_TYPE
hive_sync.password
Password for hive sync, default 'hive'
Default Value: hive (Optional)
Config Param: HIVE_SYNC_PASSWORD
hive_sync.use_jdbc
Use JDBC when hive synchronization is enabled, default true
Default Value: true (Optional)
Config Param: HIVE_SYNC_USE_JDBC
compaction.schedule.enabled
Schedule the compaction plan, enabled by default for MOR
Default Value: true (Optional)
Config Param: COMPACTION_SCHEDULE_ENABLED
hive_sync.jdbc_url
Jdbc URL for hive sync, default 'jdbc:hive2://localhost:10000'
Default Value: jdbc:hive2://localhost:10000 (Optional)
Config Param: HIVE_SYNC_JDBC_URL
hive_sync.partition_extractor_class
Tool to extract the partition value from HDFS path, default 'SlashEncodedDayPartitionValueExtractor'
Default Value: org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor (Optional)
Config Param: HIVE_SYNC_PARTITION_EXTRACTOR_CLASS_NAME
read.start-commit
Start commit instant for reading, the commit time format should be 'yyyyMMddHHmmss', by default reading from the latest instant for streaming read
Default Value: N/A (Required)
Config Param: READ_START_COMMIT
write.precombine
Flag to indicate whether to drop duplicates before insert/upsert. By default these cases will accept duplicates, to gain extra performance:
- insert operation;
- upsert for MOR table, since the MOR table deduplicates on reading
Default Value: false (Optional)
Config Param: PRE_COMBINE
write.batch.size
Batch buffer size in MB to flush data into the underlying filesystem, default 256MB
Default Value: 256.0 (Optional)
Config Param: WRITE_BATCH_SIZE
archive.min_commits
Min number of commits to keep before archiving older commits into a sequential log, default 40
Default Value: 40 (Optional)
Config Param: ARCHIVE_MIN_COMMITS
hoodie.datasource.write.keygenerator.class
Key generator class that extracts the key out of the incoming record
Default Value: (Optional)
Config Param: KEYGEN_CLASS_NAME
index.global.enabled
Whether to update index for the old partition path if same key record with different partition path came in, default true
Default Value: true (Optional)
Config Param: INDEX_GLOBAL_ENABLED
index.partition.regex
Whether to load partitions into state if the partition path matches this regex, default .*
Default Value: .* (Optional)
Config Param: INDEX_PARTITION_REGEX
hoodie.table.name
Table name to register to Hive metastore
Default Value: N/A (Required)
Config Param: TABLE_NAME
path
Base path for the target hoodie table. The path will be created if it does not exist; otherwise, a Hoodie table is expected to have been initialized there successfully
Default Value: N/A (Required)
Config Param: PATH
index.bootstrap.enabled
Whether to bootstrap the index state from existing hoodie table, default false
Default Value: false (Optional)
Config Param: INDEX_BOOTSTRAP_ENABLED
read.streaming.skip_compaction
Whether to skip compaction instants for streaming read, there are two cases that this option can be used to avoid reading duplicates:
- you are definitely sure that the consumer reads faster than any compaction instants, usually with a delta time compaction strategy that is long enough, e.g. one week;
- changelog mode is enabled, this option is a solution to keep data integrity
Default Value: false (Optional)
Config Param: READ_STREAMING_SKIP_COMPACT
hoodie.datasource.write.partitionpath.urlencode
Whether to encode the partition path url, default false
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING
compaction.async.enabled
Async Compaction, enabled by default for MOR
Default Value: true (Optional)
Config Param: COMPACTION_ASYNC_ENABLED
hive_sync.ignore_exceptions
Ignore exceptions during hive synchronization, default false
Default Value: false (Optional)
Config Param: HIVE_SYNC_IGNORE_EXCEPTIONS
hive_sync.table_properties
Additional properties to store with table, the data format is k1=v1 k2=v2
Default Value: N/A (Required)
Config Param: HIVE_SYNC_TABLE_PROPERTIES
write.ignore.failed
Flag to indicate whether to ignore any non-exception error (e.g. writestatus error) within a checkpoint batch. By default true (in favor of streaming progress over data integrity)
Default Value: true (Optional)
Config Param: IGNORE_FAILED
write.commit.ack.timeout
Timeout limit for a writer task after it finishes a checkpoint and waits for the instant commit success, only for internal use
Default Value: -1 (Optional)
Config Param: WRITE_COMMIT_ACK_TIMEOUT
write.operation
The write operation that this write should perform
Default Value: upsert (Optional)
Config Param: OPERATION
hoodie.datasource.write.partitionpath.field
Partition path field. Value to be used at the partitionPath component of HoodieKey. Actual value obtained by invoking .toString(), default ''
Default Value: (Optional)
Config Param: PARTITION_PATH_FIELD
write.bucket_assign.tasks
Parallelism of tasks that do bucket assign, default is the parallelism of the execution environment
Default Value: N/A (Required)
Config Param: BUCKET_ASSIGN_TASKS
source.avro-schema.path
Source avro schema file path, the parsed schema is used for deserialization
Default Value: N/A (Required)
Config Param: SOURCE_AVRO_SCHEMA_PATH
compaction.delta_commits
Max delta commits needed to trigger compaction, default 5 commits
Default Value: 5 (Optional)
Config Param: COMPACTION_DELTA_COMMITS
write.insert.cluster
Whether to merge small files for insert mode. If true, the write throughput will decrease because of the read/write of existing small files. Only valid for COW table, default false
Default Value: false (Optional)
Config Param: INSERT_CLUSTER
partition.default_name
The default partition name in case the dynamic partition column value is null/empty string
Default Value: default (Optional)
Config Param: PARTITION_DEFAULT_NAME
write.bulk_insert.sort_input
Whether to sort the inputs by specific fields for bulk insert tasks, default true
Default Value: true (Optional)
Config Param: WRITE_BULK_INSERT_SORT_INPUT
source.avro-schema
Source avro schema string, the parsed schema is used for deserialization
Default Value: N/A (Required)
Config Param: SOURCE_AVRO_SCHEMA
compaction.target_io
Target IO in MB per compaction (both read and write), default 500 GB
Default Value: 512000 (Optional)
Config Param: COMPACTION_TARGET_IO
write.rate.limit
Write record rate limit per second to prevent traffic jitter and improve stability, default 0 (no limit)
Default Value: 0 (Optional)
Config Param: WRITE_RATE_LIMIT
write.log_block.size
Max log block size in MB for log file, default 128MB
Default Value: 128 (Optional)
Config Param: WRITE_LOG_BLOCK_SIZE
write.tasks
Parallelism of tasks that do actual write, default is 4
Default Value: 4 (Optional)
Config Param: WRITE_TASKS
clean.async.enabled
Whether to clean up the old commits immediately on new commits, enabled by default
Default Value: true (Optional)
Config Param: CLEAN_ASYNC_ENABLED
clean.retain_commits
Number of commits to retain. So data will be retained for num_of_commits * time_between_commits (scheduled). This also directly translates into how much you can incrementally pull on this table, default 30
Default Value: 30 (Optional)
Config Param: CLEAN_RETAIN_COMMITS
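The retention formula above (num_of_commits * time_between_commits) can be worked through with a short calculation. This is only an arithmetic sketch: the 5 minute commit interval is an assumed example value, and the function name is hypothetical.

```python
from datetime import timedelta


def retention_window(retain_commits: int, time_between_commits: timedelta) -> timedelta:
    """Approximate how long data is retained: commits retained times commit interval."""
    return retain_commits * time_between_commits


# Default clean.retain_commits (30) with an assumed commit every 5 minutes.
window = retention_window(30, timedelta(minutes=5))
```

Under these assumptions the table retains roughly two and a half hours of history, which is also the window available for incremental pulls.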
read.utc-timezone
Use UTC timezone or local timezone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone. By default true
Default Value: true (Optional)
Config Param: UTC_TIMEZONE
archive.max_commits
Max number of commits to keep before archiving older commits into a sequential log, default 50
Default Value: 50 (Optional)
Config Param: ARCHIVE_MAX_COMMITS
hoodie.datasource.query.type
Decides how data files need to be read, in 1) Snapshot mode (obtain latest view, based on row & columnar data); 2) Incremental mode (new data since an instantTime); 3) Read Optimized mode (obtain latest view, based on columnar data). Default: snapshot
Default Value: snapshot (Optional)
Config Param: QUERY_TYPE
write.precombine.field
Field used in preCombining before actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)
Default Value: ts (Optional)
Config Param: PRECOMBINE_FIELD
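To illustrate the precombine rule described above, here is a minimal sketch that keeps, for each key, the record with the largest precombine value. It is a plain Python stand-in, not Hudi's implementation: Python's > operator replaces Object.compareTo(..), records are plain dicts, and on ties the first record seen is kept, which is an arbitrary choice made for this example.

```python
def precombine(records, key_field="uuid", precombine_field="ts"):
    """Keep, per key, the record with the largest precombine value."""
    latest = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # Replace only when the new record has a strictly larger precombine value.
        if current is None or record[precombine_field] > current[precombine_field]:
            latest[key] = record
    return list(latest.values())


# Hypothetical incoming batch with two versions of key "a".
records = [
    {"uuid": "a", "ts": 1, "v": "old"},
    {"uuid": "b", "ts": 5, "v": "only"},
    {"uuid": "a", "ts": 3, "v": "new"},
]
deduped = precombine(records)
```

The default arguments mirror the documented defaults of the record key field (uuid) and the precombine field (ts).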
write.index_bootstrap.tasks
Parallelism of tasks that do index bootstrap, default is the parallelism of the execution environment
Default Value: N/A (Required)
Config Param: INDEX_BOOTSTRAP_TASKS
write.task.max.size
Maximum memory in MB for a write task, when the threshold hits, it flushes the max size data bucket to avoid OOM, default 1GB
Default Value: 1024.0 (Optional)
Config Param: WRITE_TASK_MAX_SIZE
hoodie.datasource.write.recordkey.field
Record key field. Value to be used as the recordKey component of HoodieKey. Actual value will be obtained by invoking .toString() on the field value. Nested fields can be specified using the dot notation, e.g. a.b.c
Default Value: uuid (Optional)
Config Param: RECORD_KEY_FIELD
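The dot notation for nested fields can be sketched as follows. The example treats a record as nested Python dicts and uses str() in place of .toString(); the function name is hypothetical and the code only illustrates the lookup, not Hudi's key generator.

```python
def record_key(record: dict, field: str = "uuid") -> str:
    """Resolve a possibly nested field such as 'a.b.c' and return it as a string."""
    value = record
    for part in field.split("."):
        value = value[part]  # descend one level per dot-separated name
    return str(value)


# Hypothetical record with a nested key field.
nested = {"a": {"b": {"c": 42}}, "uuid": "id-1"}
key_from_nested = record_key(nested, "a.b.c")
key_from_default = record_key(nested)
```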
write.parquet.page.size
Parquet page size. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.
Default Value: 1 (Optional)
Config Param: WRITE_PARQUET_PAGE_SIZE
compaction.delta_seconds
Max delta seconds time needed to trigger compaction, default 1 hour
Default Value: 3600 (Optional)
Config Param: COMPACTION_DELTA_SECONDS
hive_sync.metastore.uris
Metastore uris for hive sync, default ''
Default Value: (Optional)
Config Param: HIVE_SYNC_METASTORE_URIS
hive_sync.partition_fields
Partition fields for hive sync, default ''
Default Value: (Optional)
Config Param: HIVE_SYNC_PARTITION_FIELDS
write.merge.max_memory
Max memory in MB for merge, default 100MB
Default Value: 100 (Optional)
Config Param: WRITE_MERGE_MAX_MEMORY
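As a rough sketch of how the Flink SQL options above are typically supplied, the snippet below renders a dictionary of options into the WITH clause of a CREATE TABLE statement. The table name, columns and base path are hypothetical, the connector identifier 'hudi' is an assumption not stated in this section, and the helper does no escaping of quotes inside values.

```python
def with_clause(options: dict) -> str:
    """Render options as a Flink SQL WITH (...) clause (values are not escaped)."""
    lines = [f"  '{key}' = '{value}'" for key, value in options.items()]
    return "WITH (\n" + ",\n".join(lines) + "\n)"


options = {
    "connector": "hudi",  # assumed connector identifier
    "path": "file:///tmp/hudi/demo_table",  # hypothetical base path
    "write.operation": "upsert",
    "write.precombine.field": "ts",
    "write.tasks": "4",
    "compaction.delta_commits": "5",
}

# Hypothetical table definition; only the option keys come from this page.
ddl = (
    "CREATE TABLE demo_table (\n"
    "  uuid STRING,\n"
    "  name STRING,\n"
    "  ts TIMESTAMP(3)\n"
    ") " + with_clause(options)
)
```

Keeping the options in a dictionary makes it easy to override a single value, such as write.tasks, per environment before rendering the statement.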
Write Client Configs
Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
Layout Configs
Configurations that control storage layout and data distribution, which defines how the files are organized within a table.
Config Class: org.apache.hudi.config.HoodieLayoutConfig
hoodie.storage.layout.type
Type of storage layout. Possible options are [DEFAULT | BUCKET]
Default Value: DEFAULT (Optional)
Config Param: LAYOUT_TYPE
hoodie.storage.layout.partitioner.class
Partitioner class, it is used to distribute data in a specific way.
Default Value: N/A (Required)
Config Param: LAYOUT_PARTITIONER_CLASS_NAME
Write commit callback configs
Controls callback behavior into HTTP endpoints, to push notifications on commits on hudi tables.
Config Class: org.apache.hudi.config.HoodieWriteCommitCallbackConfig
hoodie.write.commit.callback.on
Turn commit callback on/off. off by default.
Default Value: false (Optional)
Config Param: TURN_CALLBACK_ON
Since Version: 0.6.0
hoodie.write.commit.callback.http.url
Callback host to be sent along with callback messages
Default Value: N/A (Required)
Config Param: CALLBACK_HTTP_URL
Since Version: 0.6.0
hoodie.write.commit.callback.http.timeout.seconds
Callback timeout in seconds. 3 by default
Default Value: 3 (Optional)
Config Param: CALLBACK_HTTP_TIMEOUT_IN_SECONDS
Since Version: 0.6.0
hoodie.write.commit.callback.class
Full path of callback class and must be a subclass of HoodieWriteCommitCallback class, org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback by default
Default Value: org.apache.hudi.callback.impl.HoodieWriteCommitHttpCallback (Optional)
Config Param: CALLBACK_CLASS_NAME
Since Version: 0.6.0
hoodie.write.commit.callback.http.api.key
Http callback API key. hudi_write_commit_http_callback by default
Default Value: hudi_write_commit_http_callback (Optional)
Config Param: CALLBACK_HTTP_API_KEY_VALUE
Since Version: 0.6.0
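A possible way to assemble the commit callback settings above is shown below as a small sketch that renders them in key=value properties form. The callback URL is a hypothetical placeholder and the helper name is not part of Hudi; only the option keys are taken from this section.

```python
callback_options = {
    "hoodie.write.commit.callback.on": "true",
    # Hypothetical endpoint that would receive the commit notifications.
    "hoodie.write.commit.callback.http.url": "http://localhost:8080/hudi/commits",
    "hoodie.write.commit.callback.http.timeout.seconds": "3",
}


def to_properties(options: dict) -> str:
    """Render options as sorted key=value lines, one per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(options.items()))


properties_text = to_properties(callback_options)
```

The callback class and API key are left at their documented defaults here, so they are omitted from the dictionary.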
Table Configurations
Configurations that persist across writes and reads on a Hudi table, like base and log file formats, table name, creation schema, and table version layouts. Configurations are loaded from hoodie.properties; these properties are usually set when initializing a path as the hoodie base path and rarely change during the lifetime of the table. Writers'/queries' configurations are validated against these each time for compatibility.
Config Class: org.apache.hudi.common.table.HoodieTableConfig
hoodie.table.precombine.field
Field used in preCombining before actual write. By default, when two records have the same key value, the largest value for the precombine field determined by Object.compareTo(..), is picked.
Default Value: N/A (Required)
Config Param: PRECOMBINE_FIELD
hoodie.archivelog.folder
Path under the meta folder in which to store archived timeline instants.
Default Value: archived (Optional)
Config Param: ARCHIVELOG_FOLDER
hoodie.table.type
The table type for the underlying data, for this write. This can’t change between writes.
Default Value: COPY_ON_WRITE (Optional)
Config Param: TYPE
hoodie.table.timeline.timezone
User can set hoodie commit timeline timezone, such as utc, local and so on. local is default
Default Value: LOCAL (Optional)
Config Param: TIMELINE_TIMEZONE
hoodie.partition.metafile.use.base.format
If true, partition metafiles are saved in the same format as base-files for this dataset (e.g. Parquet / ORC). If false (default) partition metafiles are saved as properties files.
Default Value: false (Optional)
Config Param: PARTITION_METAFILE_USE_BASE_FORMAT
hoodie.table.checksum
Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading table config.
Default Value: N/A (Required)
Config Param: TABLE_CHECKSUM
Since Version: 0.11.0
hoodie.table.create.schema
Schema used when creating the table, for the first time.
Default Value: N/A (Required)
Config Param: CREATE_SCHEMA
hoodie.table.recordkey.fields
Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey.
Default Value: N/A (Required)
Config Param: RECORDKEY_FIELDS
hoodie.table.log.file.format
Log format used for the delta logs.
Default Value: HOODIE_LOG (Optional)
Config Param: LOG_FILE_FORMAT
hoodie.bootstrap.index.enable
Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined, default true.
Default Value: true (Optional)
Config Param: BOOTSTRAP_INDEX_ENABLE
hoodie.table.metadata.partitions
Comma-separated list of metadata partitions that have been completely built and in-sync with data table. These partitions are ready for use by the readers
Default Value: N/A (Required)
Config Param: TABLE_METADATA_PARTITIONS
Since Version: 0.11.0
hoodie.table.metadata.partitions.inflight
Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers.
Default Value: N/A (Required)
Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT
Since Version: 0.11.0
hoodie.table.partition.fields
Fields used to partition the table. Concatenated values of these fields are used as the partition path, by invoking toString()
Default Value: N/A (Required)
Config Param: PARTITION_FIELDS
hoodie.populate.meta.fields
When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing
Default Value: true (Optional)
Config Param: POPULATE_META_FIELDS
hoodie.compaction.payload.class
Payload class to use for performing compactions, i.e merge delta logs with current base file and then produce a new base file.
Default Value: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload (Optional)
Config Param: PAYLOAD_CLASS_NAME
hoodie.bootstrap.index.class
Implementation to use, for mapping base files to bootstrap base file, that contain actual data.
Default Value: org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex (Optional)
Config Param: BOOTSTRAP_INDEX_CLASS_NAME
hoodie.datasource.write.partitionpath.urlencode
Should we url encode the partition path value, before creating the folder structure.
Default Value: false (Optional)
Config Param: URL_ENCODE_PARTITIONING
hoodie.datasource.write.hive_style_partitioning
Flag to indicate whether to use Hive style partitioning. If set true, the names of partition folders follow <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only partition values)
Default Value: false (Optional)
Config Param: HIVE_STYLE_PARTITIONING_ENABLE
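The effect of Hive style partitioning and URL encoding on a partition folder name can be sketched as follows. The function is hypothetical, and urllib's percent-encoding is used as a stand-in: Hudi's own escaping rules may differ in which characters are encoded.

```python
from urllib.parse import quote


def partition_folder(column, value, hive_style=False, url_encode=False):
    """Build a partition folder name from a partition column and its value."""
    text = str(value)
    if url_encode:
        # Stand-in for Hudi's encoding: percent-encode every reserved character.
        text = quote(text, safe="")
    # Hive style prefixes the value with '<partition_column_name>='.
    return f"{column}={text}" if hive_style else text


plain = partition_folder("dt", "2022/01/01")
hive = partition_folder("dt", "2022/01/01", hive_style=True)
hive_encoded = partition_folder("dt", "2022/01/01", hive_style=True, url_encode=True)
```

With both flags left at their defaults (false), the folder name is just the partition value.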
hoodie.table.keygenerator.class
Key Generator class property for the hoodie table
Default Value: N/A (Required)
Config Param: KEY_GENERATOR_CLASS_NAME
hoodie.table.version
Version of table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards compatible changes.
Default Value: ZERO (Optional)
Config Param: VERSION
hoodie.table.base.file.format
Base file format to store all the base file data.
Default Value: PARQUET (Optional)
Config Param: BASE_FILE_FORMAT
hoodie.bootstrap.base.path
Base path of the dataset that needs to be bootstrapped as a Hudi table
Default Value: N/A (Required)
Config Param: BOOTSTRAP_BASE_PATH
hoodie.datasource.write.drop.partition.columns
When set to true, will not write the partition columns into hudi. By default, false.
Default Value: false (Optional)
Config Param: DROP_PARTITION_COLUMNS
hoodie.database.name
Database name that will be used for incremental query. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database
Default Value: N/A (Required)
Config Param: DATABASE_NAME
hoodie.timeline.layout.version
Version of timeline used, by the table.
Default Value: N/A (Required)
Config Param: TIMELINE_LAYOUT_VERSION
hoodie.table.name
Table name that will be used for registering with Hive. Needs to be same across runs.
Default Value: N/A (Required)
Config Param: NAME
Memory Configurations
Controls memory usage for compaction and merges, performed internally by Hudi.
Config Class: org.apache.hudi.config.HoodieMemoryConfig
hoodie.memory.merge.fraction
This fraction is multiplied with the user memory fraction (1 - spark.memory.fraction) to get a final fraction of heap space to use during merge
Default Value: 0.6 (Optional)
Config Param: MAX_MEMORY_FRACTION_FOR_MERGE
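The multiplication described above can be worked through numerically. This sketch assumes spark.memory.fraction is left at 0.6, which is Spark's usual default, and the function name is hypothetical; it only reproduces the arithmetic, not how Hudi derives an absolute byte limit.

```python
def merge_heap_fraction(merge_fraction=0.6, spark_memory_fraction=0.6):
    """Fraction of heap usable for merge: merge fraction times user memory fraction."""
    user_memory_fraction = 1.0 - spark_memory_fraction
    return merge_fraction * user_memory_fraction


# Documented default of hoodie.memory.merge.fraction with the assumed Spark setting.
fraction = merge_heap_fraction()
```

Under these assumptions about 24% of the heap would be available during merge (0.6 * 0.4).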
hoodie.memory.dfs.buffer.max.size
Property to control the max memory in bytes for dfs input stream buffer size
Default Value: 16777216 (Optional)
Config Param: MAX_DFS_STREAM_BUFFER_SIZE
hoodie.memory.writestatus.failure.fraction
Property to control what fraction of the failed records and exceptions is reported back to the driver. Default is 10%. If set to 100%, with a lot of failures, this can cause memory pressure, cause OOMs and mask actual data errors.
Default Value: 0.1 (Optional)
Config Param: WRITESTATUS_FAILURE_FRACTION
hoodie.memory.compaction.fraction
HoodieCompactedLogScanner reads logblocks, converts records to HoodieRecords and then merges these log blocks and records. At any point, the number of entries in a log block can be less than or equal to the number of entries in the corresponding parquet file. This can lead to OOM in the Scanner. Hence, a spillable map helps alleviate the memory pressure. Use this config to set the max allowable inMemory footprint of the spillable map
Default Value: 0.6 (Optional)
Config Param: MAX_MEMORY_FRACTION_FOR_COMPACTION
hoodie.memory.merge.max.size
Maximum amount of memory used in bytes for merge operations, before spilling to local storage.
Default Value: 1073741824 (Optional)
Config Param: MAX_MEMORY_FOR_MERGE
hoodie.memory.spillable.map.path
Default file path prefix for spillable map
Default Value: /tmp/ (Optional)
Config Param: SPILLABLE_MAP_BASE_PATH
hoodie.memory.compaction.max.size
Maximum amount of memory used in bytes for compaction operations, before spilling to local storage.
Default Value: N/A (Required)
Config Param: MAX_MEMORY_FOR_COMPACTION
Storage Configs
Configurations that control aspects around writing, sizing, reading base and log files.
Config Class: org.apache.hudi.config.HoodieStorageConfig
hoodie.logfile.data.block.max.size
LogFile Data block max size in bytes. This is the maximum size allowed for a single data block to be appended to a log file. This helps to make sure the data appended to the log file is broken up into sizable blocks to prevent OOM errors. This size should be greater than the JVM memory.
Default Value: 268435456 (Optional)
Config Param: LOGFILE_DATA_BLOCK_MAX_SIZE
hoodie.parquet.outputtimestamptype
Sets spark.sql.parquet.outputTimestampType. Parquet timestamp type to use when Spark writes data to Parquet files.
Default Value: TIMESTAMP_MICROS (Optional)
Config Param: PARQUET_OUTPUT_TIMESTAMP_TYPE
hoodie.orc.stripe.size
Size of the memory buffer in bytes for writing
Default Value: 67108864 (Optional)
Config Param: ORC_STRIPE_SIZE
hoodie.orc.block.size
ORC block size, recommended to be aligned with the target file size.
Default Value: 125829120 (Optional)
Config Param: ORC_BLOCK_SIZE
hoodie.orc.compression.codec
Compression codec to use for ORC base files.
Default Value: ZLIB (Optional)
Config Param: ORC_COMPRESSION_CODEC_NAME
hoodie.parquet.max.file.size
Target size in bytes for parquet files produced by Hudi write phases. For DFS, this needs to be aligned with the underlying filesystem block size for optimal performance.
Default Value: 125829120 (Optional)
Config Param: PARQUET_MAX_FILE_SIZE
hoodie.hfile.max.file.size
Target file size in bytes for HFile base files.
Default Value: 125829120 (Optional)
Config Param: HFILE_MAX_FILE_SIZE
hoodie.parquet.writelegacyformat.enabled
Sets spark.sql.parquet.writeLegacyFormat. If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Parquet's fixed-length byte array format which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format.
Default Value: false (Optional)
Config Param: PARQUET_WRITE_LEGACY_FORMAT_ENABLED
hoodie.parquet.block.size
Parquet RowGroup size in bytes. It's recommended to make this large enough that scan costs can be amortized by packing enough column values into a single row group.
Default Value: 125829120 (Optional)
Config Param: PARQUET_BLOCK_SIZE
hoodie.logfile.max.size
LogFile max size in bytes. This is the maximum size allowed for a log file before it is rolled over to the next version.
Default Value: 1073741824 (Optional)
Config Param: LOGFILE_MAX_SIZE
hoodie.parquet.dictionary.enabled
Whether to use dictionary encoding
Default Value: true (Optional)
Config Param: PARQUET_DICTIONARY_ENABLED
hoodie.hfile.block.size
Lower values increase the size in bytes of metadata tracked within HFile, but can offer potentially faster lookup times.
Default Value: 1048576 (Optional)
Config Param: HFILE_BLOCK_SIZE
hoodie.parquet.page.size
Parquet page size in bytes. Page is the unit of read within a parquet file. Within a block, pages are compressed separately.
Default Value: 1048576 (Optional)
Config Param: PARQUET_PAGE_SIZE
hoodie.hfile.compression.algorithm
Compression codec to use for hfile base files.
Default Value: GZ (Optional)
Config Param: HFILE_COMPRESSION_ALGORITHM_NAME
hoodie.orc.max.file.size
Target file size in bytes for ORC base files.
Default Value: 125829120 (Optional)
Config Param: ORC_FILE_MAX_SIZE
hoodie.logfile.data.block.format
Format of the data block within delta logs. Following formats are currently supported "avro", "hfile", "parquet"
Default Value: N/A (Required)
Config Param: LOGFILE_DATA_BLOCK_FORMAT
hoodie.logfile.to.parquet.compression.ratio
Expected additional compression as records move from log files to parquet. Used for merge_on_read table to send inserts into log files & control the size of compacted parquet file.
Default Value: 0.35 (Optional)
Config Param: LOGFILE_TO_PARQUET_COMPRESSION_RATIO_FRACTION
hoodie.parquet.compression.ratio
Expected compression of parquet data used by Hudi, when it tries to size new parquet files. Increase this value, if bulk_insert is producing smaller than expected sized files
Default Value: 0.1 (Optional)
Config Param: PARQUET_COMPRESSION_RATIO_FRACTION
hoodie.parquet.compression.codec
Compression Codec for parquet files
Default Value: gzip (Optional)
Config Param: PARQUET_COMPRESSION_CODEC_NAME
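Most of the size defaults above are given in raw bytes, which is hard to read at a glance. The sketch below converts a few of them into binary units (1 MB taken as 1024 * 1024 bytes, an assumption of this example). The helper name is hypothetical; the keys and values are copied from the entries above.

```python
def human_size(num_bytes: int) -> str:
    """Convert a byte count into a short string using 1024-based units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:g} {unit}"
        size /= 1024


defaults = {
    "hoodie.parquet.max.file.size": 125829120,
    "hoodie.parquet.block.size": 125829120,
    "hoodie.parquet.page.size": 1048576,
    "hoodie.logfile.max.size": 1073741824,
    "hoodie.logfile.data.block.max.size": 268435456,
    "hoodie.orc.stripe.size": 67108864,
}
readable = {key: human_size(value) for key, value in defaults.items()}
```

For example, the default target Parquet file size of 125829120 bytes corresponds to 120 MB.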
DynamoDB based Locks Configurations
Configs that control DynamoDB based locking mechanisms required for concurrency control between writers to a Hudi table. Concurrency between Hudi's own table services are auto managed internally.
Config Class: org.apache.hudi.config.DynamoDbBasedLockConfig
hoodie.write.lock.dynamodb.billing_mode
For DynamoDB based lock provider, by default it is PAY_PER_REQUEST mode
Default Value: PAY_PER_REQUEST (Optional)
Config Param: DYNAMODB_LOCK_BILLING_MODE
Since Version: 0.10.0
hoodie.write.lock.dynamodb.table
For DynamoDB based lock provider, the name of the DynamoDB table acting as lock table
Default Value: N/A (Required)
Config Param: DYNAMODB_LOCK_TABLE_NAME
Since Version: 0.10.0
hoodie.write.lock.dynamodb.region
For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service. Hudi first tries to get it from the AWS_REGION environment variable. If it is not found, us-east-1 is used by default
Default Value: us-east-1 (Optional)
Config Param: DYNAMODB_LOCK_REGION
Since Version: 0.10.0
hoodie.write.lock.dynamodb.partition_key
For DynamoDB based lock provider, the partition key for the DynamoDB lock table. Each Hudi dataset should have its own unique key so that concurrent writers can refer to the same partition key. By default, the Hudi table name is used as the partition key
Default Value: N/A (Required)
Config Param: DYNAMODB_LOCK_PARTITION_KEY
Since Version: 0.10.0
hoodie.write.lock.dynamodb.write_capacity
For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode
Default Value: 10 (Optional)
Config Param: DYNAMODB_LOCK_WRITE_CAPACITY
Since Version: 0.10.0
hoodie.write.lock.dynamodb.table_creation_timeout
For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table
Default Value: 600000 (Optional)
Config Param: DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT
Since Version: 0.10.0
hoodie.write.lock.dynamodb.read_capacity
For DynamoDB based lock provider, read capacity units when using PROVISIONED billing mode
Default Value: 20 (Optional)
Config Param: DYNAMODB_LOCK_READ_CAPACITY
Since Version: 0.10.0
hoodie.write.lock.dynamodb.endpoint_url
For DynamoDB based lock provider, the url endpoint used for Amazon DynamoDB service. Useful for development with a local dynamodb instance.
Default Value: N/A (Required)
Config Param: DYNAMODB_ENDPOINT_URL
Since Version: 0.10.1
Metadata Configs
Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries.
Config Class: org.apache.hudi.common.config.HoodieMetadataConfig
hoodie.metadata.index.column.stats.parallelism
Parallelism to use, when generating column stats index.
Default Value: 10 (Optional)
Config Param: COLUMN_STATS_INDEX_PARALLELISM
Since Version: 0.11.0
hoodie.metadata.compact.max.delta.commits
Controls how often the metadata table is compacted.
Default Value: 10 (Optional)
Config Param: COMPACT_NUM_DELTA_COMMITS
Since Version: 0.7.0
hoodie.assume.date.partitioning
Should HoodieWriteClient assume the data is partitioned by dates, i.e three levels from base path. This is a stop-gap to support tables created by versions < 0.3.1. Will be removed eventually
Default Value: false (Optional)
Config Param: ASSUME_DATE_PARTITIONING
Since Version: 0.3.0
hoodie.metadata.index.column.stats.enable
Enable indexing column ranges of user data files under metadata table key lookups. When enabled, metadata table will have a partition to store the column ranges and will be used for pruning files during the index lookups.
Default Value: false (Optional)
Config Param: ENABLE_METADATA_INDEX_COLUMN_STATS
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.column.list
Comma-separated list of columns for which bloom filter index will be built. If not set, only record key will be indexed.
Default Value: N/A (Required)
Config Param: BLOOM_FILTER_INDEX_FOR_COLUMNS
Since Version: 0.11.0
hoodie.metadata.metrics.enable
Enable publishing of metrics around metadata table.
Default Value: false (Optional)
Config Param: METRICS_ENABLE
Since Version: 0.7.0
hoodie.metadata.index.bloom.filter.file.group.count
Metadata bloom filter index partition file group count. This controls the size of the base and log files and read parallelism in the bloom filter index partition. The recommendation is to size the file group count such that the base files are under 1GB.
Default Value: 4 (Optional)
Config Param: METADATA_INDEX_BLOOM_FILTER_FILE_GROUP_COUNT
Since Version: 0.11.0
hoodie.metadata.cleaner.commits.retained
Number of commits to retain, without cleaning, on metadata table.
Default Value: 3 (Optional)
Config Param: CLEANER_COMMITS_RETAINED
Since Version: 0.7.0
hoodie.metadata.index.check.timeout.seconds
After the async indexer has finished indexing up to the base instant, it will ensure that all inflight writers reliably write index updates as well. If this timeout expires, then the indexer will abort itself safely.
Default Value: 900 (Optional)
Config Param: METADATA_INDEX_CHECK_TIMEOUT_SECONDS
Since Version: 0.11.0
_hoodie.metadata.ignore.spurious.deletes
There are cases when extra files that were never added before are requested to be deleted from the metadata table. This config determines how to handle such spurious deletes
Default Value: true (Optional)
Config Param: IGNORE_SPURIOUS_DELETES
Since Version: 0.10.0
hoodie.file.listing.parallelism
Parallelism to use, when listing the table on lake storage.
Default Value: 200 (Optional)
Config Param: FILE_LISTING_PARALLELISM_VALUE
Since Version: 0.7.0
hoodie.metadata.populate.meta.fields
When enabled, populates all meta fields. When disabled, no meta fields are populated.
Default Value: false (Optional)
Config Param: POPULATE_META_FIELDS
Since Version: 0.10.0
hoodie.metadata.index.async
Enable asynchronous indexing of metadata table.
Default Value: false (Optional)
Config Param: ASYNC_INDEX_ENABLE
Since Version: 0.11.0
hoodie.metadata.index.column.stats.column.list
Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed
Default Value: N/A (Required)
Config Param: COLUMN_STATS_INDEX_FOR_COLUMNS
Since Version: 0.11.0
hoodie.metadata.enable.full.scan.log.files
Enable full scanning of log files while reading log records. If disabled, Hudi looks up only the entries of interest.
Default Value: true (Optional)
Config Param: ENABLE_FULL_SCAN_LOG_FILES
Since Version: 0.10.0
hoodie.metadata.index.column.stats.file.group.count
Metadata column stats partition file group count. This controls the size of the base and log files and read parallelism in the column stats index partition. The recommendation is to size the file group count such that the base files are under 1GB.
Default Value: 2 (Optional)
Config Param: METADATA_INDEX_COLUMN_STATS_FILE_GROUP_COUNT
Since Version: 0.11.0
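The sizing recommendation above (keep base files under 1GB) can be turned into a rough estimate of the file group count. This sketch assumes the index data is spread evenly across file groups and that the expected total size is known in advance; both are simplifications, and the function name is hypothetical.

```python
import math

GIB = 1024 ** 3  # 1GB taken as 1024^3 bytes for this estimate


def file_group_count(expected_index_bytes: int, max_base_file_bytes: int = GIB) -> int:
    """Smallest number of file groups that keeps each base file within the target size."""
    return max(1, math.ceil(expected_index_bytes / max_base_file_bytes))


# Hypothetical index partition expected to hold about 3.5GB of data.
groups = file_group_count(int(3.5 * GIB))
```

Because real data is rarely distributed perfectly evenly, adding some headroom on top of this estimate may be reasonable.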
hoodie.metadata.enable
Enable the internal metadata table which serves table metadata like level file listings
Default Value: true (Optional)
Config Param: ENABLE
Since Version: 0.7.0
hoodie.metadata.index.bloom.filter.enable
Enable indexing bloom filters of user data files under metadata table. When enabled, metadata table will have a partition to store the bloom filter index and will be used during the index lookups.
Default Value: false (Optional)
Config Param: ENABLE_METADATA_INDEX_BLOOM_FILTER
Since Version: 0.11.0
hoodie.metadata.index.bloom.filter.parallelism
Parallelism to use for generating bloom filter index in metadata table.
Default Value: 200 (Optional)
Config Param: BLOOM_FILTER_INDEX_PARALLELISM
Since Version: 0.11.0
hoodie.metadata.clean.async
Enable asynchronous cleaning for metadata table
Default Value: false (Optional)
Config Param: ASYNC_CLEAN_ENABLE
Since Version: 0.7.0
hoodie.metadata.keep.max.commits