Hive Metastore
Hive Metastore is an RDBMS-backed service from Apache Hive that acts as a catalog for your data warehouse or data lake. It can store all the metadata about the tables, such as partitions, columns, column types, etc. One can sync the Hudi table metadata to the Hive metastore as well. This unlocks the capability to query Hudi tables not only through Hive but also using interactive query engines such as Presto and Trino. In this document, we will go through different ways to sync the Hudi table to Hive metastore.
Hive Sync Tool
Writing data with DataSource writer or HoodieStreamer supports syncing of the table's latest schema to Hive metastore, such that queries can pick up new columns and partitions.
If you prefer to run the sync from the command line or in an independent JVM, Hudi provides HiveSyncTool, which can be invoked as shown below once you have built the hudi-hive module. The following command syncs the table written by the DataSource writer above to the Hive metastore.
cd hudi-hive
./run_sync_tool.sh --jdbc-url jdbc:hive2://hiveserver:10000 --user hive --pass hive --partitioned-by partition --base-path <basePath> --database default --table <tableName>
Starting with Hudi 0.5.1, the read-optimized version of a merge-on-read table is suffixed with '_ro' by default. For backwards compatibility with older Hudi versions, an optional HiveSyncConfig flag, --skip-ro-suffix, is provided to turn off '_ro' suffixing if desired. Explore other hive sync options using the following command:
cd hudi-hive
./run_sync_tool.sh --help
Hive Sync Configuration
The arguments that can be passed to run_sync_tool are defined in HiveSyncConfig.
Among them, the following are required:
@Parameter(names = {"--database"}, description = "name of the target database in Hive", required = true);
@Parameter(names = {"--table"}, description = "name of the target table in Hive", required = true);
@Parameter(names = {"--base-path"}, description = "Basepath of hoodie table to sync", required = true);
The DataSource write options corresponding to the most commonly used hive sync configs are as follows.
In the table below, (N/A) means that no default value is set.
HiveSyncConfig | DataSourceWriteOption | Default Value | Description |
---|---|---|---|
--database | hoodie.datasource.hive_sync.database | default | Name of the target database in Hive |
--table | hoodie.datasource.hive_sync.table | (N/A) | Name of the target table in Hive. Inferred from the table name in Hudi table config if not specified. |
--user | hoodie.datasource.hive_sync.username | hive | Username for hive metastore |
--pass | hoodie.datasource.hive_sync.password | hive | Password for hive metastore |
--jdbc-url | hoodie.datasource.hive_sync.jdbcurl | jdbc:hive2://localhost:10000 | Hive server url if using jdbc mode to sync |
--sync-mode | hoodie.datasource.hive_sync.mode | (N/A) | Mode to choose for Hive ops. Valid values are hms, jdbc and hiveql. More details in the following section. |
--partitioned-by | hoodie.datasource.hive_sync.partition_fields | (N/A) | Comma-separated column names in the table to use for determining hive partition. |
--partition-value-extractor | hoodie.datasource.hive_sync.partition_extractor_class | org.apache.hudi.hive.MultiPartKeysValueExtractor | Class which implements PartitionValueExtractor to extract the partition values. Inferred automatically depending on the partition fields specified. |
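As a rough illustration of how the options in the table fit together, the sketch below assembles them into a Python dictionary that could be passed to the Spark DataSource writer. The helper name hive_sync_options and the example database, table, and field names are hypothetical. The key hoodie.datasource.hive_sync.enable does not appear in the table above and is assumed here to be the switch that turns sync on, so check it against the configuration reference for your Hudi version.

```python
def hive_sync_options(database, table, partition_fields, mode="hms"):
    """Build DataSource write options for Hive sync (hypothetical helper)."""
    # The three sync modes named in the table above.
    if mode not in ("hms", "jdbc", "hiveql"):
        raise ValueError(f"unsupported sync mode: {mode}")
    return {
        # Assumption: this key is not listed in the table above.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.mode": mode,
        # The option expects comma-separated column names.
        "hoodie.datasource.hive_sync.partition_fields": ",".join(partition_fields),
    }


opts = hive_sync_options("default", "trips", ["partition"])
for key, value in sorted(opts.items()):
    print(f"{key}={value}")
```

With a Spark DataFrame df, a dictionary like this could be supplied through df.write.format("hudi").options(**opts), together with the table name, record key, and other write configs that this sketch leaves out.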
Sync modes
HiveSyncTool supports three modes, namely HMS, HIVEQL, and JDBC, to connect to the Hive metastore server.
These modes are just three different ways of executing DDL against Hive. Among them, JDBC or HMS is preferable to HIVEQL, which is mostly used for running DML rather than DDL.
Note: All these modes assume that the Hive metastore has been configured and the corresponding properties are set in the hive-site.xml configuration file. Additionally, if you are using spark-shell or spark-sql to sync a Hudi table to Hive, the hive-site.xml file also needs to be placed under the <SPARK_HOME>/conf directory.
HMS
HMS mode uses the Hive metastore client to sync the Hudi table using thrift APIs directly.
To use this mode, pass --sync-mode=hms to run_sync_tool and set --use-jdbc=false.
Additionally, if you are using a remote metastore, hive.metastore.uris needs to be set in the hive-site.xml configuration file.
Otherwise, the tool assumes that the metastore is running locally on port 9083 by default.
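To make the flag combination concrete, here is a small Python sketch that assembles a run_sync_tool command line for HMS mode using only flags that appear in this document. The helper name hms_sync_command and the example base path, database, and table name are hypothetical, and the script is assumed to be launched from the hudi-hive directory.

```python
import shlex


def hms_sync_command(base_path, database, table, partitioned_by):
    """Return the argv list for an HMS-mode run_sync_tool call (hypothetical helper)."""
    return [
        "./run_sync_tool.sh",
        "--sync-mode=hms",    # use the metastore client over thrift
        "--use-jdbc=false",   # set alongside hms mode, as described above
        "--partitioned-by", partitioned_by,
        "--base-path", base_path,
        "--database", database,
        "--table", table,
    ]


# Example values are placeholders, not paths from a real deployment.
argv = hms_sync_command("/tmp/hudi/trips", "default", "trips", "partition")

# Print a shell-safe command string. To execute it instead, one could call
# subprocess.run(argv, cwd="hudi-hive", check=True).
print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the command as a list rather than a single string keeps each argument separate, which avoids quoting problems if the base path contains spaces.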