Basic Configurations
This page covers the basic configurations you may use to write/read Hudi tables. It features only a subset of the most frequently used configurations. For a full list of all configs, please visit the All Configurations page.
- Hudi Table Config: Basic Hudi Table configuration parameters.
- Spark Datasource Configs: These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick the write operation, specify how to merge records, or choose the query type to read.
- Flink Sql Configs: These configs control the Hudi Flink SQL source/sink connectors, providing the ability to define record keys, pick the write operation, specify how to merge records, enable/disable asynchronous compaction, or choose the query type to read.
- Write Client Configs: Internally, the Hudi datasource uses an RDD-based HoodieWriteClient API to actually perform writes to storage. These configs provide deep control over lower-level aspects like file sizing, compression, parallelism, compaction, write schema, cleaning, etc. Although Hudi provides sane defaults, from time to time these configs may need to be tweaked to optimize for specific workloads.
- Metastore and Catalog Sync Configs: Configurations used by Hudi to sync metadata to external metastores and catalogs.
- Metrics Configs: These configs are used to enable monitoring and reporting of key Hudi stats and metrics.
- Kafka Connect Configs: These configs are used by the Kafka Connect Sink Connector for writing Hudi tables.
- Hudi Streamer Configs: These configs are used by the Hudi Streamer utility, which provides a way to ingest data from different sources such as DFS or Kafka.
In the tables below, (N/A) means there is no default value set.
Hudi Table Config
Basic Hudi Table configuration parameters.
Hudi Table Basic Configs
Configurations of the Hudi table, such as the type of ingestion, storage formats, Hive table name, etc. These configurations are loaded from hoodie.properties; they are usually set when a path is initialized as a Hudi base path and never change during the lifetime of the table.
| Config Name | Default | Description |
|---|---|---|
| hoodie.bootstrap.base.path | (N/A) | Base path of the dataset that needs to be bootstrapped as a Hudi table. Config Param: BOOTSTRAP_BASE_PATH |
| hoodie.compaction.payload.class | (N/A) | Payload class to use for performing merges and compactions, i.e., merge delta logs with the current base file and then produce a new base file. Config Param: PAYLOAD_CLASS_NAME |
| hoodie.database.name | (N/A) | Database name. If different databases have the same table name during incremental query, we can set it to limit the table name under a specific database. Config Param: DATABASE_NAME |
| hoodie.record.merge.mode | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates. COMMIT_TIME_ORDERING: use transaction time to merge records, i.e., the record from the later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: use event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time; the event time or preCombine field needs to be specified by the user. CUSTOM: use custom merging logic specified by the user. Config Param: RECORD_MERGE_MODE. Since Version: 1.0.0 |
| hoodie.record.merge.strategy.id | (N/A) | Id of the merge strategy. Hudi will pick HoodieRecordMerger implementations in hoodie.write.record.merge.custom.implementation.classes that have the same merge strategy id. Config Param: RECORD_MERGE_STRATEGY_ID. Since Version: 0.13.0 |
| hoodie.table.checksum | (N/A) | Table checksum is used to guard against partial writes in HDFS. It is added as the last entry in hoodie.properties and then used to validate while reading the table config. Config Param: TABLE_CHECKSUM. Since Version: 0.11.0 |
| hoodie.table.create.schema | (N/A) | Schema used when creating the table. Config Param: CREATE_SCHEMA |
| hoodie.table.index.defs.path | (N/A) | Relative path to the table base path where the index definitions are stored. Config Param: RELATIVE_INDEX_DEFINITION_PATH. Since Version: 1.0.0 |
| hoodie.table.keygenerator.class | (N/A) | Key generator class property for the hoodie table. Config Param: KEY_GENERATOR_CLASS_NAME |
| hoodie.table.keygenerator.type | (N/A) | Key generator type, used to determine the key generator class. Config Param: KEY_GENERATOR_TYPE. Since Version: 1.0.0 |
| hoodie.table.metadata.partitions | (N/A) | Comma-separated list of metadata partitions that have been completely built and are in sync with the data table. These partitions are ready for use by the readers. Config Param: TABLE_METADATA_PARTITIONS. Since Version: 0.11.0 |
| hoodie.table.metadata.partitions.inflight | (N/A) | Comma-separated list of metadata partitions whose building is in progress. These partitions are not yet ready for use by the readers. Config Param: TABLE_METADATA_PARTITIONS_INFLIGHT. Since Version: 0.11.0 |
| hoodie.table.name | (N/A) | Table name that will be used for registering with Hive. Needs to be the same across runs. Config Param: NAME |
| hoodie.table.partition.fields | (N/A) | Comma-separated field names used to partition the table. These field names also include the partition type, which is used by custom key generators. Config Param: PARTITION_FIELDS |
| hoodie.table.precombine.field | (N/A) | Field used in preCombining before the actual write. By default, when two records have the same key value, the one with the largest value for the precombine field, determined by Object.compareTo(..), is picked. Config Param: PRECOMBINE_FIELD |
| hoodie.table.recordkey.fields | (N/A) | Columns used to uniquely identify the table. Concatenated values of these fields are used as the record key component of HoodieKey. Config Param: RECORDKEY_FIELDS |
| hoodie.table.secondary.indexes.metadata | (N/A) | The metadata of secondary indexes. Config Param: SECONDARY_INDEXES_METADATA. Since Version: 0.13.0 |
| hoodie.timeline.layout.version | (N/A) | Version of the timeline used by the table. Config Param: TIMELINE_LAYOUT_VERSION |
| hoodie.archivelog.folder | archived | Path under the meta folder to store archived timeline instants. Config Param: ARCHIVELOG_FOLDER |
| hoodie.bootstrap.index.class | org.apache.hudi.common.bootstrap.index.hfile.HFileBootstrapIndex | Implementation to use for mapping base files to the bootstrap base files that contain the actual data. Config Param: BOOTSTRAP_INDEX_CLASS_NAME |
| hoodie.bootstrap.index.enable | true | Whether or not this is a bootstrapped table, with bootstrap base data and a mapping index defined; default true. Config Param: BOOTSTRAP_INDEX_ENABLE |
| hoodie.bootstrap.index.type | HFILE | Bootstrap index type that determines which implementation to use for mapping base files to the bootstrap base files that contain the actual data. Config Param: BOOTSTRAP_INDEX_TYPE. Since Version: 1.0.0 |
| hoodie.datasource.write.hive_style_partitioning | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only the partition values). Config Param: HIVE_STYLE_PARTITIONING_ENABLE |
| hoodie.partition.metafile.use.base.format | false | If true, partition metafiles are saved in the same format as the base files for this dataset (e.g. Parquet / ORC). If false (default), partition metafiles are saved as properties files. Config Param: PARTITION_METAFILE_USE_BASE_FORMAT |
| hoodie.populate.meta.fields | true | When enabled, populates all meta fields. When disabled, no meta fields are populated and incremental queries will not be functional. This is only meant to be used for append-only/immutable data for batch processing. Config Param: POPULATE_META_FIELDS |
| hoodie.table.base.file.format | PARQUET | Base file format to store all the base file data. Config Param: BASE_FILE_FORMAT |
| hoodie.table.cdc.enabled | false | When enabled, persists the change data if necessary, so that the table can be queried in CDC query mode. Config Param: CDC_ENABLED. Since Version: 0.13.0 |
| hoodie.table.cdc.supplemental.logging.mode | DATA_BEFORE_AFTER | org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode: Change data capture supplemental logging mode. The supplemental log is used for accelerating the generation of change log details. OP_KEY_ONLY: only keep record keys in the supplemental logs, so the reader needs to figure out both the update before-image and after-image. DATA_BEFORE: keep the before-images in the supplemental logs, so the reader needs to figure out the update after-images. DATA_BEFORE_AFTER (default): keep both the before- and after-images in the supplemental logs, so the reader can generate the change details directly from the logs. Config Param: CDC_SUPPLEMENTAL_LOGGING_MODE. Since Version: 0.13.0 |
| hoodie.table.initial.version | EIGHT | Initial version of the table when it was created. Used during upgrade/downgrade to identify which upgrade/downgrade paths happened on the table. This is only configured when the table is initially set up. Config Param: INITIAL_VERSION. Since Version: 1.0.0 |
| hoodie.table.log.file.format | HOODIE_LOG | Log format used for the delta logs. Config Param: LOG_FILE_FORMAT |
| hoodie.table.multiple.base.file.formats.enable | false | When set to true, the table can support reading and writing multiple base file formats. Config Param: MULTIPLE_BASE_FILE_FORMATS_ENABLE. Since Version: 1.0.0 |
| hoodie.table.timeline.timezone | LOCAL | Timezone of the hoodie commit timeline, e.g. UTC or LOCAL. LOCAL is the default. Config Param: TIMELINE_TIMEZONE |
| hoodie.table.type | COPY_ON_WRITE | The table type for the underlying data. Config Param: TYPE |
| hoodie.table.version | EIGHT | Version of the table, used for running upgrade/downgrade steps between releases with potentially breaking/backwards-compatible changes. Config Param: VERSION |
| hoodie.timeline.history.path | history | Path under the meta folder to store timeline history. Config Param: TIMELINE_HISTORY_PATH |
| hoodie.timeline.path | timeline | Path under the meta folder to store timeline instants. Config Param: TIMELINE_PATH |
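Because these properties are persisted in hoodie.properties under the table's metadata folder, they can also be inspected programmatically. The sketch below is illustrative only: it uses Hudi's HoodieTableMetaClient, whose builder signature has changed across releases (newer versions wrap the Hadoop configuration in a storage configuration), so adjust it to your Hudi version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableConfig;
import org.apache.hudi.common.table.HoodieTableMetaClient;

public class InspectTableConfig {
  public static void main(String[] args) {
    String basePath = "/tmp/hudi_trips_cow"; // hypothetical table base path

    // Build a meta client over the base path; it loads hoodie.properties from the
    // table's metadata folder. Note: in Hudi 1.x, setConf expects a
    // StorageConfiguration wrapper instead of a raw Hadoop Configuration.
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath(basePath)
        .build();

    HoodieTableConfig tableConfig = metaClient.getTableConfig();
    System.out.println("hoodie.table.name = " + tableConfig.getTableName());
    System.out.println("hoodie.table.type = " + tableConfig.getTableType());
  }
}
```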
Spark Datasource Configs
These configs control the Hudi Spark Datasource, providing the ability to define keys/partitioning, pick the write operation, specify how to merge records, or choose the query type to read.
Read Options
Options useful for reading tables via read.format(...).option(...).
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.read.begin.instanttime | (N/A) | Required when hoodie.datasource.query.type is set to incremental. Represents the completion time to start incrementally pulling data from. The completion time here need not necessarily correspond to an instant on the timeline. New data written with completion_time >= START_COMMIT is fetched. For example, '20170901080000' will get all new data written on or after Sep 1, 2017 08:00AM. Config Param: START_COMMIT |
| hoodie.datasource.read.end.instanttime | (N/A) | Used when hoodie.datasource.query.type is set to incremental. Represents the completion time to limit incrementally fetched data to. When not specified, the latest commit completion time from the timeline is assumed by default. When specified, new data written with completion_time <= END_COMMIT is fetched. Point-in-time queries make more sense with both begin and end completion times specified. Config Param: END_COMMIT |
| hoodie.datasource.read.incr.table.version | (N/A) | The table version assumed for incremental reads. Config Param: INCREMENTAL_READ_TABLE_VERSION |
| hoodie.datasource.read.streaming.table.version | (N/A) | The table version assumed for streaming reads. Config Param: STREAMING_READ_TABLE_VERSION |
| hoodie.datasource.write.precombine.field | (N/A) | Field used in preCombining before the actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). Config Param: READ_PRE_COMBINE_FIELD |
| hoodie.datasource.query.type | snapshot | Whether data needs to be read in incremental mode (new data since an instantTime), read_optimized mode (obtain the latest view, based on base files only), or snapshot mode (obtain the latest view, by merging base and, if any, log files). Config Param: QUERY_TYPE |
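These read options are passed straight through the Spark DataFrame reader. A minimal sketch in Java using the keys from the table above, where spark, basePath, and startCommit are placeholder inputs:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HudiReadExamples {

  // Snapshot read: "snapshot" is the default query type, shown explicitly for clarity.
  static Dataset<Row> snapshotRead(SparkSession spark, String basePath) {
    return spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "snapshot")
        .load(basePath);
  }

  // Incremental read: fetches records written with completion_time >= startCommit,
  // e.g. "20170901080000" for Sep 1, 2017 08:00AM.
  static Dataset<Row> incrementalRead(SparkSession spark, String basePath, String startCommit) {
    return spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", startCommit)
        .load(basePath);
  }
}
```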
Write Options
You can pass down any of the WriteClient-level configs directly using the options() or option(k,v) methods.
```java
inputDF.write()
    .format("org.apache.hudi")
    .options(clientOpts) // any of the Hudi client opts can be passed in as well
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_row_key")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "partition")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "timestamp")
    .option(HoodieWriteConfig.TABLE_NAME, tableName)
    .mode(SaveMode.Append)
    .save(basePath);
```
Options useful for writing tables via write.format(...).option(...).
| Config Name | Default | Description |
|---|---|---|
| hoodie.datasource.hive_sync.mode | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. Config Param: HIVE_SYNC_MODE |
| hoodie.datasource.write.partitionpath.field | (N/A) | Partition path field. Value to be used as the partitionPath component of HoodieKey. The actual value is obtained by invoking .toString(). Config Param: PARTITIONPATH_FIELD |
| hoodie.datasource.write.precombine.field | (N/A) | Field used in preCombining before the actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..). Config Param: PRECOMBINE_FIELD |
| hoodie.datasource.write.recordkey.field | (N/A) | Record key field. Value to be used as the recordKey component of HoodieKey. The actual value is obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c. Config Param: RECORDKEY_FIELD |
| hoodie.datasource.write.secondarykey.column | (N/A) | Columns that constitute the secondary key component. The actual value is obtained by invoking .toString() on the field value. Nested fields can be specified using dot notation, e.g. a.b.c. Config Param: SECONDARYKEY_COLUMN_NAME |
| hoodie.write.record.merge.mode | (N/A) | org.apache.hudi.common.config.RecordMergeMode: Determines the logic of merging updates. COMMIT_TIME_ORDERING: use transaction time to merge records, i.e., the record from the later transaction overwrites the earlier record with the same key. EVENT_TIME_ORDERING: use event time as the ordering to merge records, i.e., the record with the larger event time overwrites the record with the smaller event time on the same key, regardless of transaction time; the event time or preCombine field needs to be specified by the user. CUSTOM: use custom merging logic specified by the user. Config Param: RECORD_MERGE_MODE. Since Version: 1.0.0 |
| hoodie.clustering.async.enabled | false | Enable running of the clustering service asynchronously as inserts happen on the table. Config Param: ASYNC_CLUSTERING_ENABLE. Since Version: 0.7.0 |
| hoodie.clustering.inline | false | Turn on inline clustering - clustering will be run after each write operation completes. Config Param: INLINE_CLUSTERING_ENABLE. Since Version: 0.7.0 |
| hoodie.datasource.hive_sync.enable | false | When set to true, register/sync the table to the Apache Hive metastore. Config Param: HIVE_SYNC_ENABLED |
| hoodie.datasource.hive_sync.jdbcurl | jdbc:hive2://localhost:10000 | Hive metastore url. Config Param: HIVE_URL |
| hoodie.datasource.hive_sync.metastore.uris | thrift://localhost:9083 | Hive metastore url. Config Param: METASTORE_URIS |
| hoodie.datasource.meta.sync.enable | false | Enable syncing the Hudi table with an external metastore or data catalog. Config Param: META_SYNC_ENABLED |
| hoodie.datasource.write.hive_style_partitioning | false | Flag to indicate whether to use Hive style partitioning. If set to true, the names of partition folders follow the <partition_column_name>=<partition_value> format. By default false (the names of partition folders are only the partition values). Config Param: HIVE_STYLE_PARTITIONING |
| hoodie.datasource.write.operation | upsert | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and thereafter use upsert/insert. bulk_insert uses a disk-based write path to scale to load large inputs without needing to cache them. Config Param: OPERATION |
| hoodie.datasource.write.table.type | COPY_ON_WRITE | The table type for the underlying data, for this write. This can't change between writes. Config Param: TABLE_TYPE |
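These write options can equally be passed as plain string keys rather than the constants used in the earlier snippet. A minimal upsert sketch, where df, tableName, and basePath are placeholders, the field names simply mirror the earlier example, and hive_style_partitioning is enabled just to illustrate a flag from the table above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class HudiWriteExample {

  static void upsertToHudi(Dataset<Row> df, String tableName, String basePath) {
    df.write().format("hudi")
        // keys/partitioning and precombine field, as string config keys
        .option("hoodie.datasource.write.recordkey.field", "_row_key")
        .option("hoodie.datasource.write.partitionpath.field", "partition")
        .option("hoodie.datasource.write.precombine.field", "timestamp")
        // upsert is the default operation; insert/bulk_insert are also valid
        .option("hoodie.datasource.write.operation", "upsert")
        // partition folders named <partition_column_name>=<partition_value>
        .option("hoodie.datasource.write.hive_style_partitioning", "true")
        .option("hoodie.table.name", tableName)
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```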