We have a client doing interesting data science that depends on processing very large files (100GB) that also have to be transferred between parties.
This workflow was actually the inspiration for our S3S2 Open Source project which aims to make it easy to share files securely.
As we got deeper into the problem, we realized that we hadn’t fully examined how resource intensive it would be to handle such large files. This post shares some of the initial results.
First we created a very large file by doing this:
mkfile -n 100g ./100g_test_file.txt
This is, of course, naive, because we won't see realistic compression: the file we just created is all zeros. Still, it is a real 100GB file on disk, so let's start here.
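If you want compression numbers that look more like real data, one option (a sketch we didn't run for this post) is to fill the test file with random bytes instead of zeros:

# ~100GB of random data; much slower to create, but it won't compress
# to almost nothing the way an all-zero file does
dd if=/dev/urandom of=./100g_random_test_file.txt bs=1m count=102400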
Next, we’re curious how hard it is to just compress the file.
om:s3s2 mk$ time gzip 100g_test_file.txt
real 9m9.417s
user 7m53.403s
sys 0m58.835s
OK, so we're talking ~10 minutes to gzip a 100GB file on my local SSD. It could be faster in the cloud, but for our use case some folks want the data locally and some want to use it in the cloud, so it's at least a reasonable first test.
So… do you think the encryption is going to take a lot longer than that?
Let’s see! Here we’ll encrypt the same data with RSA.
om:s3s2 mk$ time gpg -e -r mkonda@jemurai.com 100g_test_file.txt
real 9m9.417s
user 7m53.403s
sys 0m58.835s
It turns out it's pretty much the same: ~10 minutes. This is RSA with a 4096-bit key.
Cool. OK, well that’s good. I wouldn’t want the encryption to be much slower than that or this whole approach is going to be unusable.
Let’s check out the decryption.
om:s3s2 mk$ time gpg -d 100g_test_file.txt.gpg
You need a passphrase to unlock the secret key for
user: "Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"
4096-bit RSA key, ID A497AEC4, created 2019-04-11 (main key ID 5664D905)
gpg: encrypted with 4096-bit RSA key, ID A497AEC4, created 2019-04-11
"Matt Konda (s3s2-test-key) <mkonda@jemurai.com>"
real 206m31.857s
user 49m31.655s
sys 43m35.630s
Oh no! 200+ MINUTES!? As my kids would say: WHAT ON PLANET EARTH!? That is definitely unusable.
OK. Step back. Let's think here. With encryption, people usually encrypt the bulk data with a symmetric algorithm like AES and only use an asymmetric algorithm like RSA to exchange the key, because symmetric encryption is much faster. Cool. Let's see what that looks like then.
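For reference, here's a rough sketch of that hybrid pattern done by hand with openssl; gpg does the equivalent internally when you encrypt to a recipient. The recipient_pub.pem key and file names are just placeholders:

# 1. generate a random 256-bit session key
openssl rand -hex 32 > session.key
# 2. encrypt the big file with AES-256 using that key (the fast, symmetric part)
openssl enc -aes-256-cbc -pbkdf2 -in 100g_test_file.txt -out 100g_test_file.txt.enc -pass file:./session.key
# 3. encrypt the tiny session key with the recipient's RSA public key (slow, but it's only 32 bytes)
openssl pkeyutl -encrypt -pubin -inkey recipient_pub.pem -in session.key -out session.key.enc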
Let’s try encrypting that 100GB file with AES-256, which is presumably very fast because processors have been optimized to do it efficiently.
om:s3s2 mk$ time gpg --cipher-algo AES256 -c 100g_test_file.txt
real 14m36.886s
user 12m35.381s
sys 1m14.609s
14+ minutes is longer than gzip or the RSA encrypt, but not terrible. What does it look like going the other way? My guess would have been about the same as on the way in. That just shows how much I know…
om:s3s2 mk$ time gpg --output 100g_test_file.aesdecrypted.txt --decrypt 100g_test_file.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase
real 49m34.661s
user 44m6.992s
sys 3m32.160s
50 minutes! That’s more than 3x the time to encrypt and too long to be practically useful.
Per a colleague’s suggestion, we thought maybe the compression was slowing us down a lot so we tested with no compression.
om:s3s2 mk$ time gpg --cipher-algo AES256 --compress-algo none -c 100g_test_file.txt
real 23m28.978s
user 18m29.866s
sys 3m22.856s
om:s3s2 mk$ time gpg -d 100g_test_file.nocompress.txt.gpg
gpg: AES256 encrypted data
gpg: encrypted with 1 passphrase
real 205m4.581s
user 65m6.715s
sys 47m45.610s
Everything got slower. So that was a wrong turn.
It is possible that using a library that better leverages processor support (à la AES-NI) would significantly improve the result. We're not sure how portable that would be, so we're setting it aside for now.
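If you want a rough read on whether AES-NI is available and helping, one quick check (a sketch, not part of our benchmarks) is to compare OpenSSL's software-only AES benchmark with the EVP path, which uses hardware acceleration when the CPU supports it:

# software-only AES-256-CBC benchmark
openssl speed aes-256-cbc
# EVP interface; uses AES-NI when available, so a big gap between
# the two results suggests hardware acceleration is kicking in
openssl speed -evp aes-256-cbc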
We tried 7z with AES in the stream… but the compress-and-encrypt step was even slower.
om:s3s2 mk$ time 7za a -p 100g_test_file.txt.7z 100g_test_file.txt
...
Files read from disk: 1
Archive size: 15768178 bytes (16 MiB)
Everything is Ok
real 43m48.769s
user 102m6.180s
sys 4m51.758s
Decompressing and decrypting:
om:s3s2 mk$ time 7z x 100g_test_file.txt.7z -p
...
Size: 107374182400
Compressed: 15768178
real 7m29.579s
user 5m17.954s
sys 1m11.559s
Kudos to @runako who pointed me to the zstd library.
om:s3s2 mk$ time zstd -o 100g_test_file.txt.zstd 100g_test_file.txt
100g_test_file.txt : 0.01% (107374182400 => 9817620 bytes, 100g_test_file.txt.zstd)
real 1m22.134s
user 0m55.896s
sys 1m5.896s
Whoa, that's fast! I guess sometimes the new tricks are worth learning. If we can get compression this good this quickly, it will also make the encryption step much faster, since there is far less data to encrypt.
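To make that concrete, here's the combination we're leaning toward, as a sketch (file names are examples, and we haven't benchmarked this exact sequence for this post):

# compress first with zstd, then encrypt the (much smaller) result with gpg
zstd -o 100g_test_file.txt.zst 100g_test_file.txt
gpg --cipher-algo AES256 -c 100g_test_file.txt.zst
# and to reverse it on the other end: decrypt, then decompress
gpg -o 100g_test_file.txt.zst -d 100g_test_file.txt.zst.gpg
zstd -d 100g_test_file.txt.zst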
I don’t know as much about encryption or compression as I thought. 😀
Dealing with very large files is still a legitimate and relevant challenge given the day-to-day work of data scientists on huge data sets. How we approach the problem has real consequences for how effectively we can handle the data.
Building a solution that pulls the effective practices together was a bigger challenge than we expected, but it's bearing fruit with our S3S2 project. By using zstd with gpg and S3, we've settled on what is, at least so far, the fastest way we've found to safely work on and share lots of very large files.
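As a rough sketch of the sharing half of that flow (S3S2 automates these steps; the bucket name here is just a placeholder):

# upload the compressed, encrypted artifact
aws s3 cp 100g_test_file.txt.zst.gpg s3://example-share-bucket/
# and on the receiving side, download it before decrypting and decompressing
aws s3 cp s3://example-share-bucket/100g_test_file.txt.zst.gpg .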
We’d love to hear your input.