Stream Unzipped Files to S3 with Java: s3-stream-unzip

I’ve been spending a lot of time building a data pipeline with AWS S3 lately and encountered the surprisingly non-trivial challenge of unzipping files in an S3 bucket.
A few minutes with Google and StackOverflow made it clear that many others have faced the same problem.

I’ll explain some options for handling unzipping as well as the final solution that inspired me to create nejckorasa/s3-stream-unzip.

To sum up:

  • There is no support for unzipping files in place in S3,
  • There is no built-in unzip API in the AWS SDK.

So you need to download the files from S3, unzip them, and upload the decompressed files back.

This solution is easy to implement with the AWS SDK for Java and is probably good enough if you’re dealing with small files: if the files are small enough, you can just hold the decompressed files in memory and upload them back.

Alternatively, in case of memory constraints, the files can be persisted to disk storage. Great, it works.
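
The in-memory variant for small archives could be sketched roughly as follows. To keep the example self-contained, S3 is left out: `unzipToMemory` is a hypothetical helper (not part of the AWS SDK) that accepts any InputStream, which in practice would be the content stream of the downloaded S3 object, and returns the decompressed bytes of each entry. Uploading each value back to S3 is not shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class InMemoryUnzip {

    // Hypothetical helper: reads a whole ZIP archive and keeps every
    // decompressed entry in memory. Only suitable for small archives.
    public static Map<String, byte[]> unzipToMemory(InputStream zipped) throws IOException {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no data
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                zis.transferTo(out); // reads until the end of the current entry
                files.put(entry.getName(), out.toByteArray());
            }
        }
        return files;
    }
}
```

Memory use here grows with the total decompressed size of the archive, which is exactly what stops this approach from scaling.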

Problems arise with large files. For example, AWS Lambda has hard limits on memory and ephemeral disk space (/tmp defaults to 512MB). A dedicated EC2 instance will solve the disk space issue, but requires more maintenance. I would also argue that storing 500MB+ files on disk is far from optimal.
This will of course depend on how often the operation runs and how many files need to be unzipped: it’s fine as a one-off, but probably not if it needs to run daily. In any case, we can do better.

Streaming solution

A better approach would be to stream the file from S3, download it in chunks, unzip and upload back to S3 using Multipart Upload. This way you avoid the need for disk storage altogether and you can reduce the memory footprint by tuning the download and upload chunk sizes.

There are 2 parts to this solution that need to be integrated:

1) Download and unzip

Streaming S3 objects is natively supported by the AWS SDK: there is a getObjectContent() method that returns an input stream containing the contents of an S3 object.

Java provides ZipInputStream as an input stream filter for reading files in the Zip file format. It reads ZIP contents entry-by-entry and thus allows custom handling for each entry.

Streaming the object content from S3 and feeding it into ZipInputStream gives us decompressed chunks of the object content that we can buffer in memory.
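
As a minimal sketch of this step, the code below reads a ZIP stream entry by entry and hands out decompressed chunks of a bounded size. The names `StreamingUnzip`, `ChunkHandler` and `unzipInChunks` are made up for illustration and are not taken from the library. The input can be any InputStream, so in the S3 case it would be the stream returned by getObjectContent().

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class StreamingUnzip {

    // Hypothetical callback: receives one decompressed chunk of one entry.
    interface ChunkHandler {
        void onChunk(String entryName, byte[] chunk) throws IOException;
    }

    // Decompresses the archive while holding at most chunkSize bytes
    // of decompressed data in the read buffer at a time.
    public static void unzipInChunks(InputStream zipped, int chunkSize, ChunkHandler handler)
            throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buffer = new byte[chunkSize];
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                int filled;
                // readNBytes reads until the buffer is full or the entry ends
                while ((filled = zis.readNBytes(buffer, 0, chunkSize)) > 0) {
                    handler.onChunk(entry.getName(), Arrays.copyOf(buffer, filled));
                }
            }
        }
    }
}
```

Each chunk is copied before it is handed over, so the handler can keep it (for example, queue it for upload) while the buffer is reused for the next read.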

2) Upload the unzipped chunks to S3

Uploading files to S3 is a common task and the SDK supports several options to choose from, including multipart upload.

What is Multipart Upload?

Multipart upload allows you to upload an object as a set of parts.
Each part is a contiguous portion of the object’s data. You can upload these object parts independently and in any order.
If the transmission of a part fails, you can retransmit that part without affecting the other parts.

After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object.

In general, when your object size reaches 100MB, you should consider using multipart upload instead of uploading the object in a single operation.
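
The behaviour described above can be illustrated with a toy in-memory model. This is not the S3 API: `InMemoryMultipartUpload` is a made-up class that only mimics the semantics of numbered parts, out-of-order uploads, retransmission and final assembly. With the real SDK, the corresponding steps are to initiate the upload, upload each part, and complete the upload.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.TreeMap;

class InMemoryMultipartUpload {

    // Parts keyed by part number; TreeMap keeps them sorted for assembly.
    private final Map<Integer, byte[]> parts = new TreeMap<>();

    // Parts may arrive in any order; re-sending a part number replaces
    // only that part and leaves the others untouched.
    public void uploadPart(int partNumber, byte[] data) {
        if (partNumber < 1) {
            throw new IllegalArgumentException("part numbers start at 1");
        }
        parts.put(partNumber, data.clone());
    }

    // Assembles the final object by concatenating parts in ascending
    // part-number order.
    public byte[] complete() {
        ByteArrayOutputStream object = new ByteArrayOutputStream();
        for (byte[] part : parts.values()) {
            object.write(part, 0, part.length);
        }
        return object.toByteArray();
    }
}
```

Because the object is assembled only at the end, the total size does not need to be known when the upload starts, which is what makes multipart upload a good fit for streamed, decompressed data.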

nejckorasa/s3-stream-unzip

All that remains is to integrate the streamed download, the unzipping and the multipart upload.
I had a go at that and built nejckorasa/s3-stream-unzip.

It is a Java library for unzipping large files and data in AWS S3 without knowing the size in advance and without keeping it all in memory or writing to disk.

This makes it suitable for large data files; it has been used to unzip files of 100GB+.

It supports various unzip strategies, including the option to split zipped files (suitable for large files, such as CSV files). It is lightweight and only requires an AmazonS3 client to run.

It has a simple API:

// initialize the AmazonS3 client
AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        // customize the client
        .build();

// create an UnzipStrategy
// (MB is assumed to be a byte-count constant, e.g. 1024 * 1024)
var strategy = new NoSplitUnzipStrategy();
// or
var splitStrategy = new SplitTextUnzipStrategy()
        .withHeader(true)
        .withFileBytesLimit(100 * MB);

// or create an UnzipStrategy with additional config
var config = new S3MultipartUpload.Config()
        .withThreadCount(5)
        .withQueueSize(5)
        .withAwaitTerminationTimeSeconds(2)
        .withCannedAcl(CannedAccessControlList.BucketOwnerFullControl)
        .withUploadPartBytesLimit(20 * MB)
        .withCustomizeInitiateUploadRequest(request -> {
            // customize request
            return request;
        });

var configuredStrategy = new NoSplitUnzipStrategy(config);

// create an S3UnzipManager
var um = new S3UnzipManager(s3Client, strategy);
// or
var umWithContentTypes = new S3UnzipManager(s3Client, strategy.withContentTypes(List.of("application/zip")));

// unzip options
um.unzipObjects("bucket-name", "input-path", "output-path");
um.unzipObjectsKeyMatching("bucket-name", "input-path", "output-path", ".*\\.zip");
um.unzipObjectsKeyContaining("bucket-name", "input-path", "output-path", "-part-of-object-");
um.unzipObject(s3Object, "output-path");

The library is available on Maven Central and GitHub.

You can view the original blog post here: https://nejckorasa.github.io/posts/s3-unzip/
