Story time! Around twenty years ago, I was working as a reverse-engineer and cryptanalyst for AccessData while getting my Physics degree at BYU. It was the late 90s and early 2000s. The US government had been gradually easing off export restrictions on software containing cryptography, but most software still contained pretty worthless password protection. We’d buy desktop office software, I’d reverse-engineer it to figure out what algorithm it was using for access control, and then break the crypto. It was a never-ending stream of interesting-but-not-impossible math puzzles. I wrote about forty password crackers during my time there. We would sell them to home users, system administrators, and local and federal law enforcement agencies. I got to go down to the Federal Law Enforcement Training Center in Glynco a few times to teach the Secret Service, FBI, and ATF about cryptography and the use of our products.
Two projects really stand out in my memory. The first was Microsoft Word 97. Before Word 97, the files were encrypted by xoring the bytes with a repeating 16-byte string derived from the password. The most common bytes in a Word file were usually either 0x00, 0xFF, or 0x20 (space), so we’d just look at the most common character in each column and try the 3 16 variations on those. Recovering the key was usually instantaneous, but to help people feel like they’d gotten their money’s worth, we’d put on a little animated show like a Hollywood hacking scene with lots of random characters that gradually revealed the right password.
Microsoft 97 changed that. It might have been possible to find out the encryption format through MSDN, but we were small and couldn’t afford a subscription. It also wasn’t clear that we’d be allowed to write the cracker if we did get the info from them. To figure out how it worked, I used SoftICE to stop the software immediately after typing in the password, then walked slowly up the stack until I could find the algorithm. This was before IDA Pro, so I printed out a few dozen pages of assembly code, taped it to the wall, and drew all over it with a red crayon. I was very pleased when I finally worked it all out. At the time, Microsoft was only allowed to export 40 bit cryptography, so they did as much as they were legally permitted: they’d repeatedly MD5 hash the password with “salt” (randomly chosen bytes stored in the file) to get the 40 bit key, then add salt to that and repeatedly hash it again. It took roughly half a second to test a password on the computers of the time. We had to resort to a dictionary attack, because breaking it outright was pretty much impossible. We did eventually write a cracker for larger companies and agencies that brute-forced the 40-bit keyspace using those fancy MMX instructions on the Pentium. I knew of one place that ran the software for nine months before finally getting in.
The other really fun one one was zip archives. The developer of PKZIP, Phil Katz, made the decision—unusual at the time—to document his file format and include it with his software, making it a favorite of developers. Roger Schlafly designed the stream cipher used for encrypting the archives. The zip standard quickly became by far the most popular compression format on Windows, and many other formats like Java’s .jar format and OpenOffice’s document formats are really zip files with a particular directory structure inside. InfoZIP is an open-source version of the software; its implementation was used as the basis for nearly all other branded zip software, like WinZip. At the time I was trying to crack it, WinZip had 95% of the market.
Eli Biham and Paul Kocher had published a known-plaintext attack on the cipher, but the known plaintext was compressed plaintext. To get the Huffman codes at the start of a deflated file, you basically need the whole file. The attack was practically useless to law enforcement agencies.
The zip cipher has 96 bits of internal state split into three 32-bit chunks called key0, key1, and key2. key0 and key2 are the internal state of two copies of CRC32, a linear feedback shift register. Basically, to update the state with a new byte of data, you shift all the bytes left by one byte (dropping the high byte) then xor in a constant from a lookup table indexed by the byte of data. key1 is the internal state of a truncated linear congruential generator (TLCG). To update its internal state, you add the data byte, multiply by a constant I’ll call c, and add 1. The cipher works like this: you feed in a data byte to the first CRC32, then take the low byte of that and feed it into the TLCG, then take the high byte of that and feed it into the second CRC32, then take the state and (roughly) square it, then output the second byte of the result as the stream byte. To initialize the 96-bit state, you start from a known state and encrypt the password, then encrypt ten bytes of salt.
PKZIP got its salt bytes by allocating memory without initializing it, so it would contain bits of stuff from other programs you’d been running or images or documents, whatever. That worked fine on Windows, but on many Unix systems, memory gets initialized automatically when it’s allocated. InfoZIP chose the salt bytes by using C’s rand function. It would initialize the state of the random number generator by xoring the timestamp with the process ID, then generate ten bytes per file. Had they left it that way, knowing the timestamp and process ID would be enough information to recover the header bytes, which in turn would let you mount Biham and Kocher’s attack. It seems the InfoZIP authors knew about that attack, because they went one step further and encrypted the headers once using the password. That way an attacker had only doubly-encrypted plaintext, and their attack wouldn’t work.
I noticed that because the password was the same in both passes, the first stream byte was the same each pass. Just like toggling a light switch twice leaves it where it started, xoring a byte twice with the same stream byte leaves it unaffected. This let me mount a very powerful ciphertext only attack: given five encrypted files in an archive, I could derive the internal state of C’s rand function without having to look at the timestamp or knowing the process ID. That let me generate the original, unencrypted headers. Because only a few bits of each part of the cipher affected the next part, I could also guess only a few bits of the state and check whether decrypting the next bytes twice gave me the answer I was expecting. I had to guess fewer and fewer bits of key material as I went on. Each extra file also let me exclude more potential key material; at the time, it took a few hours on a single desktop machine. I published a paper on it and got to present it in Japan at Fast Software Encryption 2001.
I left AccessData in 2002. I worked for a neural networking startup for a year, spent three years on a Master’s degree in Computer Science at the University of Auckland with Cris Calude, started a PhD with the mathematical physicist John Baez at UC Riverside, worked on Google’s applied security team for six years and finished the PhD, and after a few more years ended up as CTO of a startup.
About six months ago, I received, out of the blue, a message on LinkedIn from a Russian guy. He had read that paper I’d written 19 years ago and wanted to know if the attack could work on a file with only two files. A quick analysis said not without an enormous amount of processing power and a lot of money. Because I only had two files to work with, a lot more false positives would advance at each stage. There would end up being 2 73 possible keys to test, nearly 10 sextillion. I estimated it would take a large GPU farm a year to break, with a cost on the order of $100K. He astounded me by saying he could spend that much to recover the key.
Back in January of 2016, he had bought around $10K or $15K of Bitcoin and put the keys in an encrypted zip file. Now they were worth upwards of $300K and he couldn’t remember the password. Luckily, he still had the original laptop and knew exactly when the encryption took place. Because InfoZip seeds its entropy using the timestamp, that promised to reduce the work enormously—”only” 10 quintillion—and made it quite feasible, a matter of a couple of months on a medium GPU farm. We made a contract and I got to work.
I spent a while reconstructing the attack from the overview in the paper. To my chagrin, there were certain tricky details I had skimmed over in my paper, but I worked them out again. Then I discovered I’d made a terrible mistake while estimating the work!
In my original attack, I would guess the high bytes of key1·c, key1·c², key1·c³, and key1·c⁴. By the time I guessed that fourth byte, I knew the complete state of the rest of the cipher; I just had to convert those four bytes into the original key1 and I was done. I would run through the 32-bit state space for key1 and check each one to see if it gave the proper high bytes. Here, though, I had trillions of keys to check; if I had to do 2 32 tests on each one, it would take a few hundred thousand years.
I vaguely remembered that there had been some work done on cryptanalyzing TLCGs using lattice reduction, so I dug out the original paper. It was perfect! I just needed to define a lattice with basis vectors given by 2 32 and the powers of the constant from the TLCG, then do lattice reduction. Given the reduced basis, I could recover the original state from the high bytes just by doing a matrix multiplication.
At least, that was the idea. I would need five bytes to constrain it to a single outcome, and at that point in the attack, I only had four. The process described in the paper rarely gave the right answer. However, I knew the answer had to be close to the right one, so I ran through all the possible key1 values and checked the difference from the answer it gave me and the true one. The difference was always one of 36 vectors! With this optimization, I could run through just 36 possibilities instead of four billion. We were still in business.
The next thing we ran into was the problem of transferring data between the machines with the GPUs on them. My business partner, Nash Foster, was working on the GPU implementation, advising me on how fast different operations are and writing a lot of the support structure for the application into which I put my crypto cracking code. How would we get these petabytes of candidate keys to the GPUs that would test them? They’d be almost idle, twiddling their thumbs waiting for stuff to work on. It occurred to me that each stage of my attack was guessing a lot of bits and then keeping only one candidate out of roughly 65K. If I had some way of using that information to derive bits rather than just guess and check, I’d save a lot of work, and more importantly, a lot of network traffic. The problem with that idea was that the math was too complicated; it involved mixing the math of finite fields with the math of integer rings, and those don’t play nicely together.
I thought about other cryptanalytic attacks I knew, and one that seemed promising was a meet-in-the-middle attack. A meet-in-the-middle attack applies to a block cipher when it uses one part of the key material to do the first half of the encryption and the other part of the key material to do the second half. That sort-of applied to the zip cipher, but the key material far outweighed the number of bits in the middle. Then it occurred to me that I could use the linearity of CRC32 to my advantage: by xoring together two outputs of the last CRC32, the result would be independent of key2! Rather than compute the intermediate state of the cipher and store it in a table, I’d calculate the xor of two intermediate states and store that instead.
I wrote up the differential meet-in-the-middle attack and ran it on my laptop. A stage that had taken hours on my laptop before now finished in mere seconds. The next stage, which I expected to take a week on the GPU farm finished in a few hours on a powerful CPU. I couldn’t quite make it work fast enough on the third stage to be useful in speeding up the overall attack, but it completely obviated the need to move data around: we simply computed the candidates for each GPU on the computer with the cards in it. Nash wrote the GPU code, which ran screamingly fast.
The attack ran for ten days and failed.
I was heartbroken. Had we missed one of the candidate keys? We went back and looked at the maximum process ID on his laptop and discovered it was a few bits larger than we had expected, and therefore there were a couple of other possible initial seeds for rand. I also went back and double-checked all our tests. Was there something we’d missed? Did the CPU-based version behave any differently from the GPU version? Finally I discovered that the GPU version failed to find the correct key when it was the second in a list of candidates, but it worked when it was the first one. Digging into the code, we found that we had swapped the block index with the thread index in computing the offset into the data structure.
We fixed the bug, re-ran the code, and found the correct key within a day. Our client was very happy and gave us a large bonus for finding the key so quickly and saving him so much money over our initial estimate.
I’m currently looking for work in a staff or senior staff engineering or data scientist role. If you’ve got interesting technical analysis or optimization problems, please reach out to me and let’s talk.