We have a client doing interesting data science that depends on processing very large files (100GB) that also have to be transferred between parties.

This workflow was actually the inspiration for our open source S3S2 project, which aims to make it easy to share files securely.

As we got deeper into the problem, we realized that we hadn’t fully examined how resource intensive it would be to handle such large files. This post shares some of the initial results.

Getting Started with Large Files

First we created a very large file by doing this:

mkfile -n 100g ./100g_test_file.txt

This is, of course, naive, because we won’t see realistic compression: the file we just created is all zeros. However, it is a 100GB file as far as the tools are concerned, so let’s start here.
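As an aside, `mkfile` is macOS-specific. A hedged sketch of portable equivalents (file names and the smaller 100MB size here are just illustrative), including a way to get more realistic compression behavior:

```shell
# Instant sparse file of zeros, similar to mkfile -n (GNU/BSD truncate)
truncate -s 100M zeros_test.txt

# For a more realistic compression test, fill the file with random bytes,
# which will not compress the way a file of zeros does
# (on macOS dd, use bs=1m with a lowercase suffix)
dd if=/dev/urandom of=random_test.txt bs=1M count=100
```

Random data is close to incompressible, so it exercises the worst case, whereas the all-zeros file exercises the best case.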

Next, we’re curious how hard it is to just compress the file.

om:s3s2 mk$ time gzip 100g_test_file.txt

real    9m9.417s
user    7m53.403s
sys    0m58.835s

OK, so we’re talking ~10 minutes to gzip a 100GB file on my local SSD. It could be faster in the cloud, but our use case includes people who want the data locally as well as people who use it in the cloud, so it’s at least a reasonable first test.

So… what do you think: is the encryption going to take a lot longer than that?

GPG RSA Encryption

Let’s see! Here we’ll encrypt the same data with RSA.

om:s3s2 mk$ time gpg -e -u "Matt Konda" -r "Matt Konda" 100g_test_file.txt

real    9m25.457s
user    8m12.256s
sys    1m3.425s

Turns out it’s pretty much the same: ~10 minutes. This is GPG public-key encryption with a 4096-bit RSA key.

Cool. OK, well that’s good. I wouldn’t want the encryption to be much slower than that or this whole approach is going to be unusable.

Let’s check out the decryption.

om:s3s2 mk$ time gpg -d 100g_test_file.txt.gpg

You need a passphrase to unlock the secret key for
user: "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"
4096-bit RSA key, ID A497AEC4, created 2019-04-11 (main key ID 5664D905)

gpg: encrypted with 4096-bit RSA key, ID A497AEC4, created 2019-04-11
      "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"

real    206m31.857s
user    49m31.655s
sys    43m35.630s

Oh no! 200+ MINUTES!? As my kids would say: WHAT ON PLANET EARTH!? That is definitely unusable.

OK. Step back. Let’s think here. In practice, people usually encrypt bulk data with a symmetric algorithm like AES and use an asymmetric algorithm like RSA only to exchange the key, because symmetric encryption is much faster. Cool. Let’s see what that looks like then.
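Worth noting: this hybrid pattern is what GPG’s public-key mode already does under the hood (the data is encrypted with a symmetric session key, and only that small key is wrapped with RSA). A minimal sketch of the pattern done by hand with OpenSSL, assuming you already have a keypair as privkey.pem / pubkey.pem (illustrative names):

```shell
# 1. Generate a random 256-bit session key
openssl rand -out session.key 32

# 2. Encrypt the large file with AES-256 using that key (fast, symmetric)
openssl enc -aes-256-cbc -pbkdf2 -in bigfile.txt -out bigfile.enc \
  -pass file:session.key

# 3. Encrypt the small session key with the recipient's RSA public key
openssl pkeyutl -encrypt -pubin -inkey pubkey.pem \
  -in session.key -out session.key.enc

# The recipient reverses it: unwrap the key with RSA, then decrypt with AES
openssl pkeyutl -decrypt -inkey privkey.pem \
  -in session.key.enc -out session.key
openssl enc -d -aes-256-cbc -pbkdf2 -in bigfile.enc -out bigfile.txt \
  -pass file:session.key
```

The expensive RSA operation only ever touches 32 bytes, no matter how big the data file is.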

Symmetric Encryption

Let’s try encrypting that 100GB file with AES-256, which is presumably very fast because processors have been optimized to do it efficiently.

om:s3s2 mk$ time gpg --cipher-algo AES256 -c 100g_test_file.txt

real    14m36.886s
user    12m35.381s
sys    1m14.609s

14+ minutes is longer than the gzip or the RSA encrypt, but not terrible. What does it look like going the other way? My guess would have been about the same as on the way in. That just shows how much I know…

om:s3s2 mk$ time gpg --output 100g_test_file.aesdecrypted.txt --decrypt 100g_test_file.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

real    49m34.661s
user    44m6.992s
sys    3m32.160s

50 minutes! That’s more than 3x the time to encrypt and too long to be practically useful.

Compression

Per a colleague’s suggestion, we wondered whether GPG’s built-in compression was slowing us down, so we tested with compression disabled.

om:s3s2 mk$ time gpg --cipher-algo AES256 --compress-algo none -c 100g_test_file.txt

real    23m28.978s
user    18m29.866s
sys    3m22.856s
om:s3s2 mk$ time gpg -d 100g_test_file.nocompress.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

real    205m4.581s
user    65m6.715s
sys    47m45.610s

Everything got slower. So that was a wrong turn.

It is possible that using a library that better leverages processor support for AES (such as the AES-NI instructions) would significantly improve the result. We’re not sure how portable that would be, so we’re setting it aside for now.
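A quick way to check whether your machine (and your OpenSSL build) can use the hardware instructions; these commands are a sketch, and output formats vary by platform:

```shell
# Linux: the CPU advertises AES-NI via the 'aes' flag
grep -m1 -o '\baes\b' /proc/cpuinfo

# macOS: hw.optional.aes is 1 when the instructions are available
sysctl -a 2>/dev/null | grep -i aes

# Compare OpenSSL throughput on the accelerated EVP path vs. the generic
# C implementation; a large gap suggests AES-NI is actually being used
openssl speed -evp aes-256-cbc
openssl speed aes-256-cbc
```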

Other Approaches

We tried 7z, which can do AES encryption in-stream as it archives… but creating the archive was also just slower.

om:s3s2 mk$ time 7za a -p 100g_test_file.txt.7z 100g_test_file.txt
...
Files read from disk: 1
Archive size: 15768178 bytes (16 MiB)
Everything is Ok

real    43m48.769s
user    102m6.180s
sys     4m51.758s

Decompressing and decrypting:

om:s3s2 mk$ time 7z x 100g_test_file.txt.7z -p
...
Size:       107374182400
Compressed: 15768178

real  7m29.579s
user  5m17.954s
sys   1m11.559s

Interestingly, seven and a half minutes is the fastest decryption we had seen so far, but the 43+ minute encryption still makes this a non-starter for us.

Other Archiving: Zstd

Kudos to @runako who pointed me to the zstd library.

om:s3s2 mk$ time zstd -o 100g_test_file.txt.zstd 100g_test_file.txt
100g_test_file.txt   :  0.01%   (107374182400 => 9817620 bytes, 100g_test_file.txt.zstd)

real  1m22.134s
user  0m55.896s
sys   1m5.896s

Whoa, that’s fast! I guess sometimes the new tricks are worth learning. If we can apply really good compression this fast, it will also make the encryption step much faster, since there is far less data left to encrypt.
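One way to combine the steps is to stream zstd’s output straight into GPG, so compression and encryption happen in a single pass with no intermediate file, and GPG’s own slower compression is turned off. A sketch (output file names are illustrative):

```shell
# Compress with zstd and pipe straight into symmetric GPG encryption;
# --compress-algo none avoids re-compressing the already-compressed stream
zstd -c 100g_test_file.txt \
  | gpg --cipher-algo AES256 --compress-algo none -c \
        -o 100g_test_file.txt.zst.gpg

# Going the other way: decrypt to stdout, then decompress
gpg -d 100g_test_file.txt.zst.gpg | zstd -dc > 100g_test_file.out.txt
```

Run interactively, `gpg -c` will prompt for a passphrase; in scripts you would use loopback pinentry or public-key mode instead.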

Conclusion

I don’t know as much about encryption or compression as I thought. 😀

Dealing with very large files is still a legitimate and relevant challenge given the day-to-day work of data scientists on huge data sets. How we approach the problem has real consequences for how effectively we can handle that data.

Building a solution that pulls the effective practices together was a bigger challenge than we expected, but it’s bearing fruit with our S3S2 project. By using zstd with GPG and S3, we’ve settled on what is, at least so far, the fastest way we’ve found to safely work on and share lots of very large files.

We’d love to hear your input.

Matt Konda

Matt is a software engineer. He’s our CEO and a former OWASP Board Member and Chair.