This page explains how to configure a Hudi Spark job to store data in AWS S3.
There are two configurations required for Hudi-S3 compatibility:
- Adding AWS Credentials for Hudi
- Adding required Jars to classpath
The simplest way to use Hudi with S3 is to configure your
SparkContext with S3 credentials. Hudi picks these up automatically and uses them to talk to S3.
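As a minimal sketch of this approach, assuming PySpark and the Hadoop S3A connector: Spark copies any property prefixed with `spark.hadoop.` into the Hadoop configuration, so the S3A credential keys can be supplied when the session is built. The helper names `s3a_spark_conf` and `build_session` and the credential values are illustrative and not part of Hudi.

```python
def s3a_spark_conf(access_key, secret_key):
    """Return Spark properties that forward S3A credentials to Hadoop."""
    # Spark copies every "spark.hadoop.*" property into the Hadoop
    # Configuration, where the S3A filesystem reads them.
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(app_name, conf):
    """Create a SparkSession carrying the given properties (needs PySpark)."""
    # Imported lazily so the helper above also works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Placeholder credentials; in practice read them from a secure source.
conf = s3a_spark_conf("EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY")
for key in sorted(conf):
    print(key)
```

Calling `build_session("hudi-s3-example", conf)` would then return a session whose Hadoop configuration carries the credentials. Hard-coding keys is only for illustration.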
Alternatively, add the required configs to your core-site.xml, from which Hudi can fetch them. Set
fs.defaultFS to your S3 bucket, and Hudi should be able to read from and write to the bucket.
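The shape of such a core-site.xml can be sketched by generating it with Python's standard library. This assumes the S3A connector, so the `s3a://` scheme and the `fs.s3a.access.key` / `fs.s3a.secret.key` properties are used; the bucket name, the credentials, and the helper `build_core_site` are placeholders for illustration.

```python
import xml.etree.ElementTree as ET


def build_core_site(bucket, access_key, secret_key):
    """Render a minimal core-site.xml pointing the default FS at an S3 bucket."""
    props = {
        "fs.defaultFS": f"s3a://{bucket}",   # assumed scheme: S3A connector
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Placeholder values; substitute your own bucket and credentials.
xml_text = build_core_site("example-bucket", "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY")
print(xml_text)
```

The printed `<configuration>` element can be saved as core-site.xml on the Hadoop configuration path.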
Utilities such as hudi-cli or the DeltaStreamer tool can pick up S3 credentials from environment variables prefixed with
HOODIE_ENV_. For example, exporting HOODIE_ENV_fs_DOT_s3a_DOT_access_DOT_key sets fs.s3a.access.key (each dot in the property name is written as _DOT_),
so that the CLI can work on datasets stored in S3.
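As a Python counterpart to exporting these variables in a shell, the sketch below derives the variable names from Hadoop property names: the `HOODIE_ENV_` prefix is added and each `.` becomes `_DOT_`. The helper names and the credential values are illustrative assumptions.

```python
import os
import shlex

PREFIX = "HOODIE_ENV_"


def hoodie_env_name(hadoop_property):
    """Map a Hadoop property name to its HOODIE_ENV_ environment variable."""
    return PREFIX + hadoop_property.replace(".", "_DOT_")


def hoodie_s3_env(access_key, secret_key):
    """Build the environment variables carrying S3A credentials."""
    return {
        hoodie_env_name("fs.s3a.access.key"): access_key,
        hoodie_env_name("fs.s3a.secret.key"): secret_key,
    }


# Placeholder credentials for illustration only.
env = hoodie_s3_env("EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY")

# Make the variables visible to child processes started from this script.
os.environ.update(env)

# Print the equivalent shell export lines.
for name, value in env.items():
    print(f"export {name}={shlex.quote(value)}")
```

A process launched from this script, such as the CLI started through `subprocess`, inherits the variables.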
The AWS Hadoop libraries (hadoop-aws and the AWS Java SDK) must be added to the classpath.
AWS Glue data libraries are also needed if AWS Glue data is used.
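For the Hadoop libraries, one common way to get them onto the classpath is Spark's `spark.jars.packages` property, which takes comma-separated Maven coordinates. The sketch below only assembles that value. The helper `s3_packages` is hypothetical, and the version strings are placeholders: the `hadoop-aws` version generally needs to match the Hadoop version of your Spark build.

```python
def s3_packages(hadoop_aws_version, aws_sdk_version):
    """Build a spark.jars.packages value for the AWS Hadoop libraries."""
    coordinates = [
        f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        f"com.amazonaws:aws-java-sdk:{aws_sdk_version}",
    ]
    # Spark expects a single comma-separated string of Maven coordinates.
    return ",".join(coordinates)


# Placeholder versions; replace them with versions matching your cluster.
packages = s3_packages("HADOOP_AWS_VERSION", "AWS_SDK_VERSION")
print(f"spark.jars.packages={packages}")
```

The resulting string can be set as `spark.jars.packages` when building the session or passed to `spark-submit --packages`.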
AWS S3 Versioned Bucket
With versioned buckets, every deleted object creates a delete marker. Because Hudi cleans up files with its Cleaner utility, the number of delete markers grows over time. It is important to configure a Lifecycle Rule that cleans up these delete markers, as the List operation can choke once the number of delete markers reaches 1000. We recommend cleaning up delete markers after 1 day in the Lifecycle Rule.
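One way to express such a rule is sketched below as an S3 lifecycle configuration built in Python. It rests on an assumption about S3 behavior worth checking against your retention needs: a delete marker only counts as expired once no noncurrent versions remain behind it, so the rule also expires noncurrent versions after the same number of days, which permanently deletes them. The rule ID and the helper names are illustrative; applying the rule requires boto3 and replaces any existing lifecycle configuration on the bucket.

```python
import json


def delete_marker_cleanup_config(days=1, prefix=""):
    """Build a lifecycle configuration that clears delete markers."""
    return {
        "Rules": [
            {
                "ID": "hudi-delete-marker-cleanup",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix covers the whole bucket
                # Permanently remove noncurrent versions after `days` days...
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
                # ...so that the remaining lone delete markers can be removed.
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    }


def apply_lifecycle(bucket, config):
    """Apply the configuration with boto3 (overwrites existing rules)."""
    import boto3  # imported lazily; only needed when actually applying

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = delete_marker_cleanup_config(days=1)
print(json.dumps(config, indent=2))
```

Restricting `prefix` to the Hudi table path keeps the rule from affecting unrelated objects in the same bucket.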