Troubleshooting
For performance related issues, please refer to the tuning guide
Writing Tables
org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'col1' not found
It is recommended that schema should evolve in backwards compatible way while using Hudi. Please refer here for more information on avro schema resolution - https://avro.apache.org/docs/1.8.2/spec.html. This error generally occurs when the schema has evolved in backwards incompatible way by deleting some column 'col1' and we are trying to update some record in parquet file which has ay been written with previous schema (which had 'col1'). In such cases, parquet tries to find all the present fields in the incoming record and when it finds 'col1' is not present, the mentioned exception is thrown.
The fix for this is to try and create uber schema using all the schema versions evolved so far for the concerned event and use this uber schema as the target schema. One of the good approaches can be fetching schema from hive metastore and merging it with the current schema.
Sample stacktrace where a field named "toBeDeletedStr" was omitted from new batch of updates : https://gist.github.com/nsivabalan/cafc53fc9a8681923e4e2fa4eb2133fe
java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldIntegerConverter
This error will again occur due to schema evolutions in non-backwards compatible way. Basically there is some incoming update U for a record R which is already written to your Hudi dataset in the concerned parquet file. R contains field F which is having certain data type, let us say long. U has the same field F with updated data type of int type. Such incompatible data type conversions are not supported by Parquet FS.
For such errors, please try to ensure only valid data type conversions are happening in your primary data source from where you are trying to ingest.
Sample stacktrace when trying to evolve a field from Long type to Integer type with Hudi : https://gist.github.com/nsivabalan/0d81cd60a3e7a0501e6a0cb50bfaacea
SchemaCompatabilityException: Unable to validate the rewritten record
This can possibly occur if your schema has some non-nullable field whose value is not present or is null. It is recommended to evolve schema in backwards compatible ways. In essence, this means either have every newly added field as nullable or define default values for every new field. See Schema Evolution docs for more.
INT96, INT64 and timestamp compatibility
https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncsupport_timestamp
I am seeing lot of archive files. How do I control the number of archive commit files generated?
Please note that in cloud stores that do not support log append operations, Hudi is forced to create new archive files to archive old metadata operations.
You can increase hoodie.commits.archival.batch
moving forward to increase the number of commits archived per archive file.
In addition, you can increase the difference between the 2 watermark configurations : hoodie.keep.max.commits
(default : 30)
and hoodie.keep.min.commits
(default : 20). This way, you can reduce the number of archive files created and also
at the same time increase the number of metadata archived per archive file. Note that post 0.7.0 release where we are
adding consolidated Hudi metadata (RFC-15),
the follow up work would involve re-organizing archival metadata so that we can do periodic compactions to control
file-sizing of these archive files.
How can I resolve the NoSuchMethodError from HBase when using Hudi with metadata table on HDFS?
From 0.11.0 release, we have upgraded the HBase version to 2.4.9, which is released based on Hadoop 2.x. Hudi's metadata
table uses HFile as the base file format, relying on the HBase library. When enabling metadata table in a Hudi table on
HDFS using Hadoop 3.x, NoSuchMethodError can be thrown due to compatibility issues between Hadoop 2.x and 3.x.
To address this, here's the workaround:
(1) Download HBase source code from https://github.com/apache/hbase
;
(2) Switch to the source code of 2.4.9 release with the tag rel/2.4.9
:
git checkout rel/2.4.9
(3) Package a new version of HBase 2.4.9 with Hadoop 3 version:
mvn clean install -Denforcer.skip -DskipTests -Dhadoop.profile=3.0 -Psite-install-step
(4) Package Hudi again.
How can I resolve the RuntimeException saying hbase-default.xml file seems to be for an older version of HBase
?
This usually happens when there are other HBase libs provided by the runtime environment in the classpath, such as
Cloudera CDP stack, causing the conflict. To get around the RuntimeException, you can set the
hbase.defaults.for.version.skip
to true
in the hbase-site.xml
configuration file, e.g., overwriting the config
within the Cloudera manager.
I see two different records for the same record key value, each record key with a different timestamp format. How is this possible?
This is a known issue with enabling row-writer for bulk_insert operation. When you do a bulk_insert followed by another
write operation such as upsert/insert this might be observed for timestamp fields specifically. For example, bulk_insert might produce
timestamp 2016-12-29 09:54:00.0
for record key whereas non bulk_insert write operation might produce a long value like
1483023240000000
for the record key thus creating two different records. To fix this, starting 0.10.1 a new config hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled
is introduced to bring consistency irrespective of whether row writing is enabled on not. However, for the sake of
backwards compatibility and not breaking existing pipelines, this config is set to false by default and will have to be enabled explicitly.
Querying Tables
How can I now query the Hudi dataset I just wrote?
Unless Hive sync is enabled, the dataset written by Hudi using one of the methods above can simply be queries via the Spark datasource like any other source.
val hudiSnapshotQueryDF = spark
.read()
.format("hudi")
.option("hoodie.datasource.query.type", "snapshot")
.load(basePath)
val hudiIncQueryDF = spark.read().format("hudi")
.option("hoodie.datasource.query.type", "incremental")
.option("hoodie.datasource.read.begin.instanttime", <beginInstantTime>)
.load(basePath);
if Hive Sync is enabled in the deltastreamer tool or datasource, the dataset is available in Hive as a couple of tables, that can now be read using HiveQL, Presto or SparkSQL. See here for more.
Data Quality Issues
Section below generally aids in debugging Hudi failures. Off the bat, the following metadata is added to every record to help triage issues easily using standard Hadoop SQL engines (Hive/PrestoDB/Spark)
- _hoodie_record_key - Treated as a primary key within each DFS partition, basis of all updates/inserts
- _hoodie_commit_time - Last commit that touched this record
- _hoodie_commit_seqno - This field contains a unique sequence number for each record within each transaction.
- _hoodie_file_name - Actual file name containing the record (super useful to triage duplicates)
- _hoodie_partition_path - Path from basePath that identifies the partition containing this record
Missing records
Please check if there were any write errors using the admin commands, during the window at which the record could have been written.
If you do find errors, then the record was not actually written by Hudi, but handed back to the application to decide what to do with it.
Duplicates
First of all, please confirm if you do indeed have duplicates AFTER ensuring the query is accessing the Hudi table properly .
- If confirmed, please use the metadata fields above, to identify the physical files & partition files containing the records .
- If duplicates span files across partitionpath, then this means your application is generating different partitionPaths for same recordKey, Please fix your app
- if duplicates span multiple files within the same partitionpath, please engage with mailing list. This should not happen. You can use the
records deduplicate
command to fix your data.
Ingestion
java.io.EOFException: Received -1 when reading from channel, socket has likely been closed. at kafka.utils.Utils$.read(Utils.scala:381) at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
This might happen if you are ingesting from Kafka source, your cluster is ssl enabled by default and you are using some version of Hudi older than 0.5.1. Previous versions of Hudi were using spark-streaming-kafka-0-8 library. With the release of 0.5.1 version of Hudi, spark was upgraded to 2.4.4 and spark-streaming-kafka library was upgraded to spark-streaming-kafka-0-10. SSL support was introduced from spark-streaming-kafka-0-10. Please see here for reference.
The workaround can be either use Kafka cluster which is not ssl enabled, else upgrade Hudi version to at least 0.5.1 or spark-streaming-kafka library to spark-streaming-kafka-0-10.
java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
This might happen when you are trying to ingest from ssl enabled kafka source and your setup is not able to read jars.conf file and its properties. To fix this, you need to pass the required property as part of your spark-submit command something like
--files jaas.conf,failed_tables.json --conf 'spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf' --conf 'spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf'