We have a client doing interesting data science that depends on processing very large (100GB) files, which also need to be transferred between parties.
This workflow was actually the inspiration for our S3S2 Open Source project which aims to make it easy to share files securely.
As we got deeper into the problem, we realized that we hadn’t fully examined how resource intensive it would be to handle such large files. This post shares some of the initial results.
Getting Started with Large Files
First we created a very large file by doing this:
mkfile -n 100g ./100g_test_file.txt
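Note that mkfile is macOS-only. For Linux (or a portable script), dd can produce the same zero-filled file; this sketch is scaled down to 1MB for illustration:

```shell
# mkfile is macOS-specific; dd works on macOS and Linux alike.
# Writes SIZE_MB megabytes of zeros -- use count=102400 for the full 100GB test.
SIZE_MB=1
dd if=/dev/zero of=./small_test_file.txt bs=1048576 count="$SIZE_MB" 2>/dev/null
ls -l ./small_test_file.txt
```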
This is, of course, naive, because we won't see realistic compression: the file we just created is all 0s. However, it is, and looks like, a 100GB file, so let's start here.
Next, we’re curious how hard it is to just compress the file.
om:s3s2 mk$ time gzip 100g_test_file.txt

real    9m9.417s
user    7m53.403s
sys     0m58.835s
OK, so we're talking ~10 minutes to gzip a 100GB file on my local SSD. It could be faster in the cloud, but for our use case some folks want the data locally and some want to use it in the cloud, so it's at least a reasonable first test.
So… what do you think: is the encryption going to take a lot longer than that?
GPG RSA Encryption
Let’s see! Here we’ll encrypt the same data with RSA.
om:s3s2 mk$ time gpg -e -u "Matt Konda" -r "Matt Konda" 100g_test_file.txt

real    9m25.457s
user    8m12.256s
sys     1m3.425s
Turns out it's pretty much the same: ~10 minutes. This is RSA with a 4096-bit key.
Cool. OK, well that’s good. I wouldn’t want the encryption to be much slower than that or this whole approach is going to be unusable.
Let’s check out the decryption.
om:s3s2 mk$ time gpg -d 100g_test_file.txt.gpg

You need a passphrase to unlock the secret key for
user: "Matt Konda (s3s2-test-key) <email@example.com>"
4096-bit RSA key, ID A497AEC4, created 2019-04-11 (main key ID 5664D905)

gpg: encrypted with 4096-bit RSA key, ID A497AEC4, created 2019-04-11
     "Matt Konda (s3s2-test-key) <firstname.lastname@example.org>"

real    206m31.857s
user    49m31.655s
sys     43m35.630s
Oh no! 200+ MINUTES!? As my kids would say: WHAT ON PLANET EARTH!? That is definitely unusable.
OK. Step back. Let's think here. With encryption, people usually use symmetric algorithms like AES for the bulk data, after exchanging the key with asymmetric ones like RSA, because symmetric encryption is much faster. Cool. Let's see what that looks like then.
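The pattern can be sketched with openssl (all file names here are illustrative; GPG's public-key mode actually does something similar internally, encrypting the bulk data with a symmetric session key and only the small session key with RSA):

```shell
# Hybrid encryption sketch with openssl (file names are illustrative).
printf 'stand-in for the 100GB payload\n' > payload.txt

# One-time RSA keypair just for the demo.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out rsa_priv.pem 2>/dev/null
openssl pkey -in rsa_priv.pem -pubout -out rsa_pub.pem

# 1. random 256-bit key; 2. AES-encrypt the bulk data; 3. RSA-encrypt only the key.
openssl rand -hex 32 > aes.key
openssl enc -aes-256-cbc -pbkdf2 -salt -in payload.txt -out payload.enc -pass file:aes.key
openssl pkeyutl -encrypt -pubin -inkey rsa_pub.pem -in aes.key -out aes.key.enc

# Receiver side: RSA-decrypt the small key, then AES-decrypt the data.
openssl pkeyutl -decrypt -inkey rsa_priv.pem -in aes.key.enc -out aes.key.dec
openssl enc -d -aes-256-cbc -pbkdf2 -in payload.enc -out payload.dec -pass file:aes.key.dec
```

Only the 32-byte key ever touches RSA; however large the payload grows, the asymmetric step stays constant-time.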
Let’s try encrypting that 100GB file with AES-256, which is presumably very fast because processors have been optimized to do it efficiently.
om:s3s2 mk$ time gpg --cipher-algo AES256 -c 100g_test_file.txt

real    14m36.886s
user    12m35.381s
sys     1m14.609s
14+ minutes is longer than gzip or the RSA encrypt, but not terrible. What does it look like going the other way? My guess would have been about the same as on the way in. That just shows how much I know…
om:s3s2 mk$ time gpg --output 100g_test_file.aesdecrypted.txt --decrypt 100g_test_file.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

real    49m34.661s
user    44m6.992s
sys     3m32.160s
50 minutes! That's more than 3x the time to encrypt, and too long to be practically useful.
Per a colleague’s suggestion, we thought maybe the compression was slowing us down a lot so we tested with no compression.
om:s3s2 mk$ time gpg --cipher-algo AES256 --compress-algo none -c 100g_test_file.txt

real    23m28.978s
user    18m29.866s
sys     3m22.856s
om:s3s2 mk$ time gpg -d 100g_test_file.nocompress.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase

real    205m4.581s
user    65m6.715s
sys     47m45.610s
Everything got slower. So that was a wrong turn.
It is possible that using a library that better leverages processor support for AES (i.e., AES-NI) would significantly improve the results. We're not sure how portable that would be, so we're setting it aside for now.
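As a quick sanity check, you can ask the OS whether the CPU advertises AES instructions at all. This sketch assumes Linux's /proc/cpuinfo or an Intel Mac's sysctl; anything else falls through to "unknown":

```shell
# Rough hardware-AES check (assumes Linux /proc/cpuinfo or Intel macOS sysctl;
# other platforms report "unknown").
detect_hw_aes() {
    if [ -r /proc/cpuinfo ]; then
        grep -qw aes /proc/cpuinfo && echo yes || echo no
    elif sysctl -n machdep.cpu.features 2>/dev/null | grep -qi aes; then
        echo yes
    else
        echo unknown
    fi
}
detect_hw_aes
```

Even when the CPU supports AES-NI, the tool in front of it has to use an implementation that exercises it, which is exactly the portability question we punted on.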
We also tried 7z with in-stream AES… but that was just slower.
om:s3s2 mk$ time 7za a -p 100g_test_file.txt.7z 100g_test_file.txt
...
Files read from disk: 1
Archive size: 15768178 bytes (16 MiB)
Everything is Ok

real    43m48.769s
user    102m6.180s
sys     4m51.758s
Decompressing and decrypting:
om:s3s2 mk$ time 7z x 100g_test_file.txt.7z -p
...
Size:       107374182400
Compressed: 15768178

real    7m29.579s
user    5m17.954s
sys     1m11.559s
Other Archiving: Zstd
om:s3s2 mk$ time zstd -o 100g_test_file.txt.zstd 100g_test_file.txt
100g_test_file.txt : 0.01% (107374182400 => 9817620 bytes, 100g_test_file.txt.zstd)

real    1m22.134s
user    0m55.896s
sys     1m5.896s
Whoa, that's fast! I guess sometimes the new tricks are worth learning. If we can apply really good compression this fast, it will also make the encryption step much faster.
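Here is a sketch of the combined pipeline we're converging on: compress with zstd and pipe straight into gpg for symmetric encryption, then reverse the steps on the other side. The --batch, --pinentry-mode loopback, and hard-coded passphrase are only there to keep the example non-interactive; a real setup would use the keyring.

```shell
# Illustrative zstd -> gpg pipeline; the inline passphrase is demo-only.
printf 'stand-in for the 100GB payload\n' > payload.txt

zstd -q -c payload.txt | \
  gpg --batch --yes --pinentry-mode loopback --passphrase "demo-only" \
      -c -o payload.txt.zst.gpg

# And back the other way:
gpg --batch --quiet --pinentry-mode loopback --passphrase "demo-only" \
    -d payload.txt.zst.gpg 2>/dev/null | zstd -d -q -c > payload.out.txt
```

Because zstd has already squeezed out the redundancy, gpg sees a far smaller stream, so the encryption step shrinks along with the file.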
I don’t know as much about encryption or compression as I thought. 😀
Dealing with very large files is still a legitimate and relevant challenge, given the day-to-day tasks of data scientists working on huge data sets. How we approach the problem has real consequences for how effectively we can handle the data.
Pulling some of these effective practices together into one solution was a bigger challenge than we expected, but it's bearing fruit in our S3S2 project. By combining Zstd with GPG and S3, we've settled on what is, at least so far, the fastest way we've found to safely work on and share lots of very large files.
We’d love to hear your input.