Developer Setup
Pre-requisites
To contribute code, you need:
- a GitHub account
- a Linux or macOS development environment with Java JDK 8 and Apache Maven (3.x+) installed
- Docker installed, for running the demo, integration tests, or building the website
- for large contributions, a signed Individual Contributor License Agreement (ICLA) to the Apache Software Foundation (ASF)
- (Recommended) Join our dev mailing list and Slack channel, listed on the community page
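To sanity-check the environment before building, you can verify the tool versions from a terminal (a minimal sketch; the exact version output varies by installation):

```sh
java -version      # expect a JDK 8 build
mvn -version       # expect Apache Maven 3.x or newer
docker --version   # required for the demo and integration tests
```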
IntelliJ Setup
IntelliJ is the recommended IDE for developing Hudi. To contribute, you need to do the following:
- Fork the Hudi code on GitHub and then clone your own fork locally. Once cloned, we recommend building as per the instructions on the Spark quickstart or Flink quickstart (see the command-line sketch after this list).
- In IntelliJ, select `File` > `New` > `Project from Existing Sources...` and select the `pom.xml` file under your local Hudi source folder.
- In `Project Structure`, select Java 1.8 as the Project SDK.
- Make the following configuration in `Preferences` (or `Settings` in newer IntelliJ) so the Hudi code can compile in the IDE:
  - Enable annotation processing in the compiler settings.
  - Configure Maven NOT to delegate IDE build/run actions to Maven, so you can run tests in IntelliJ directly.
- If you switch the Maven build profile, e.g., from Spark 3.2 to Spark 3.3, first build Hudi on the command line and then `Reload All Maven Projects` in IntelliJ, so that IntelliJ re-indexes the code.
- [Recommended] We have embraced a code style largely based on the Google format. Please set up your IDE with the style files from `<project root>/style/`. These instructions have been tested on IntelliJ.
  - Open `Settings` in IntelliJ.
  - Install and activate the CheckStyle plugin.
  - In `Settings` > `Tools` > `Checkstyle`, use a recent version, e.g., 10.17.0.
  - Click on `+`, add the `style/checkstyle.xml` file, and name the configuration "Hudi Checks".
  - Activate the checkstyle configuration by checking `Active`.
  - Open `Settings` > `Editor` > `Code Style` > `Java`.
  - Select "Project" as the "Scheme". Then, from the scheme settings, open `Import Scheme` > `CheckStyle Configuration` and select `style/checkstyle.xml` to load.
  - After loading the configuration, you should see the `Indent` and `Continuation indent` change from 4 and 8 to 2 and 4, respectively.
  - Apply/Save the changes.
- [Recommended] Set up the Save Actions plugin to auto-format and organize imports on save. The Maven compile lifecycle will fail if there are checkstyle violations (you can also check this from the command line; see the sketch after this list).
- [Recommended] Since the Apache License header must be added to all source files, configuring IntelliJ's "Copyright" settings to insert it automatically will come in handy.
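As a reference for the steps above, a minimal command-line flow for the initial fork-and-build and a checkstyle check might look like this (a sketch; `<your-username>` is a placeholder for your GitHub account, and the `checkstyle:check` invocation assumes the project's maven-checkstyle-plugin configuration):

```sh
# Clone your fork and do a first full build from the command line
git clone https://github.com/<your-username>/hudi.git
cd hudi
mvn clean package -DskipTests

# Optionally verify checkstyle compliance before pushing
# (uses the checkstyle rules configured in the project's build)
mvn checkstyle:check
```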
Useful Maven commands for developers
Some Maven commands that can be useful for developers:
- Compile/build the entire project:
```sh
mvn clean package -DskipTests
```
The default profile is spark2 and scala2.11.
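To build against a different profile, pass the profile flags explicitly, for example (using the spark3.2 and scala-2.12 flags referenced later in this guide):

```sh
mvn clean package -DskipTests -Dspark3.2 -Dscala-2.12
```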
- For continuous development, you may want to build only the modules of interest. For example, if you have been working on the deltastreamer, you can build with the command below instead of building the entire project. The majority of build time goes into building all the different bundles we have (flink bundle, presto bundle, trino bundle, etc.), so if your work is confined to hudi-utilities, you can achieve faster build times:
```sh
mvn package -DskipTests -pl packaging/hudi-utilities-bundle/ -am
```
To enable multi-threaded building, add `-T`:
```sh
mvn -T 2C package -DskipTests -pl packaging/hudi-utilities-bundle/ -am
```
This command builds with 2 threads per available CPU core. You can also confine the build to just one module if need be:
```sh
mvn -T 2C package -DskipTests -pl hudi-spark-datasource/hudi-spark -am
```
Note: `-am` builds all dependent modules as well. On a local laptop, a full project build can take somewhere close to 7 to 10 minutes, while a multi-threaded build of just hudi-spark-datasource/hudi-spark can finish compiling in 1.5 to 2 minutes.
To run a single Java test class:
```sh
mvn test -Punit-tests -pl hudi-spark-datasource/hudi-spark/ -am -B -DfailIfNoTests=false -Dtest=TestCleaner
```
To run a single Java test method:
```sh
mvn test -Punit-tests -pl hudi-spark-datasource/hudi-spark/ -am -B -DfailIfNoTests=false -Dtest=TestCleaner#testKeepLatestCommitsMOR
```
To filter a particular Scala test:
```sh
mvn -Dsuites="org.apache.spark.sql.hudi.TestSpark3DDL @Test Chinese table " -Dtest=abc -DfailIfNoTests=false test -pl packaging/hudi-spark-bundle -am
```
`-Dtest=abc` skips all Java tests (no Java test matches that name), while `-Dsuites="org.apache.spark.sql.hudi.TestSpark3DDL @Test Chinese table "` filters down to the single Scala test.
- Run an integration test:
```sh
mvn -T 2C -Pintegration-tests -DfailIfNoTests=false -Dit.test=ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable verify
```
The `verify` phase runs the integration test and cleans up the docker cluster after execution. To retain the docker cluster, use the `integration-test` phase instead (see the example after this note).
Note: If you encounter `unknown shorthand flag: 'H' in -H`, this error occurs when the local environment has a docker-compose version >= 2.0. The latest docker-compose is accessible as `docker-compose`, whereas the v1 version is accessible locally as `docker-compose-v1`. You can use the `alt def` command to define the different docker-compose versions (refer to https://github.com/dotboris/alt) and `alt use` to switch to the v1 version of docker-compose while running integration tests locally.
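For example, to run the same test while retaining the docker cluster afterwards, swap `verify` for the `integration-test` phase:

```sh
mvn -T 2C -Pintegration-tests -DfailIfNoTests=false -Dit.test=ITTestHoodieSanity#testRunHoodieJavaAppOnMultiPartitionKeysMORTable integration-test
```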
Code & Project Structure
- `docker` : Docker containers used by the demo and integration tests; brings up a mini data ecosystem locally
- `hudi-cli` : CLI to inspect, manage and administer datasets
- `hudi-client` : Spark client library to take a bunch of inserts + updates and apply them to a Hoodie table
- `hudi-common` : Common classes used across modules
- `hudi-hadoop-mr` : InputFormat implementations for ReadOptimized, Incremental, Realtime views
- `hudi-hive` : Manages Hive tables off Hudi datasets and houses the HiveSyncTool
- `hudi-integ-test` : Longer-running integration test processes
- `hudi-spark` : Spark datasource for writing and reading Hudi datasets; streaming sink
- `hudi-utilities` : Houses tools like DeltaStreamer, SnapshotCopier
- `packaging` : POMs for building out bundles for easier drop-in to Spark, Hive, Presto, Utilities
- `style` : Code formatting and checkstyle files
Code WalkThrough
This quick video gives a code walkthrough to get started: watch.
Running unit tests and the local debugger via the IntelliJ IDE
When submitting a PR, please make sure NOT to commit the changes mentioned in these steps; once testing is done, revert them before submitting the PR.
- Build the project with the intended profiles via the `mvn` CLI; for example, for Spark 3.2 use `mvn clean package -Dspark3.2 -Dscala-2.12 -DskipTests`.
- Install the "Maven Helper" plugin in the IntelliJ IDE.
- Make sure IDEA uses Maven to build/run tests:
  - Select the intended Maven profiles (using the Maven tool pane in IDEA), for example `spark2.4` and `scala-2.11`, or `spark3.2` and `scala-2.12`, etc.
  - Add a `.mvn/maven.config` file at the root of the repo with the profiles you selected in the pane:
    -Dspark3.2
    -Dscala-2.12
  - Add `.mvn/` to the `.gitignore` file located in the root of the project (see the sketch after this list).
- Make sure you change (temporarily) the `scala.binary.version` in the root `pom.xml` to the intended Scala profile version. For example, if running with Spark 3, `scala.binary.version` should be `2.12`.
- Finally, right-click on the signature of the unit test method you are trying to run; there should be an option with a mvn symbol that allows you to `run <test-name>`, as well as an option to `debug <test-name>`.
  - For debugging, make sure to first set breakpoints in the source code (see https://www.jetbrains.com/help/idea/debugging-code.html).
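As a reference for the `.mvn/maven.config` steps above, here is a minimal command-line sketch (assuming the spark3.2 and scala-2.12 profiles; substitute the flags for the profiles you actually selected):

```sh
# Record the chosen Maven profiles so Maven runs triggered from IDEA pick them up
mkdir -p .mvn
printf -- '-Dspark3.2\n-Dscala-2.12\n' > .mvn/maven.config

# Keep this local-only configuration out of your commits
echo '.mvn/' >> .gitignore
```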
Docker Setup
We encourage you to test your code on a docker cluster; please follow this for the docker setup.
Remote Debugging
If your code fails on the docker cluster, you can debug it remotely by following the steps below.
Step 1: Run your DeltaStreamer job with `--conf` as defined below. This makes the job wait until you attach IntelliJ via remote debugging on port 4044.
```sh
spark-submit \
  --conf spark.driver.extraJavaOptions="-Dconfig.resource=myapp.conf -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4044" \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
  --table-type COPY_ON_WRITE \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-ordering-field ts \
  --base-file-format parquet \
  --target-base-path /user/hive/warehouse/stock_ticks_cow \
  --target-table stock_ticks_cow --props /var/demo/config/kafka-source.properties \
  --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```
Step 2: Attach IntelliJ (tested on IntelliJ versions > 2019; the steps may vary with the IntelliJ version).
- In IntelliJ, go to Edit Configurations -> Remote -> Add Remote, enter the configs below, Apply & Save, set a breakpoint, and start:
  - Name: Hudi Remote
  - Port: 4044
  - Command line args for remote JVM: `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=4044`
  - Use module classpath: select hudi
Website
The Apache Hudi site is hosted on a special `asf-site` branch. Please follow the `README` file under `docs` on that branch for instructions on making changes to the website.