In early May 2020, I was contacted by two Tor developers to help them with their denial-of-service (DoS) mitigation proposal. Onion services have been plagued by DoS for many years and there are no easy solutions due to the anonymous nature of the Tor network. The vulnerability stems from the fact that the service has to perform a series of expensive operations for each connecting client, so an attacker can simply flood the service with connection requests, which are comparatively cheap to make.
One of the strategies that can be used against such attacks is client puzzles, which require the client to prove that they performed a certain amount of computational work before the request is accepted. This is not a new idea – using a client puzzle to combat email spam was first proposed in 1997 [ 1 ], and there have been proposals to add a client puzzle to the TLS protocol that we all use every day to browse the web [ 2 ].
The goal of the current Tor proposal [ 3 ] is to integrate some kind of client puzzle as part of the protocol used to connect to onion services. The client would download the hidden service descriptor, find that the service is protected by a client puzzle, calculate the solution and send it as part of the connection request. The service would give the request a priority value based on the "difficulty" of the puzzle solution.
The main issue is to find a suitable puzzle. There are three main requirements that the algorithm must meet:

1. The puzzle solution must be small enough to fit into the connection request.
2. Verifying a solution must be fast for the service.
3. Solving the puzzle must be most efficient on a general-purpose CPU.
The first requirement is due to the limited space in the message that the client sends to the hidden service to negotiate a connection. This doesn't pose a big problem in practice.
The second requirement is needed to prevent the client puzzle from becoming a DoS vector on its own. The service must be able to quickly verify if a proof provided by a client is valid or not, otherwise the attacker could just flood the service with invalid proofs.
The third requirement is the most important one. Users of the Tor Browser are equipped with an ordinary general-purpose CPU, and an attacker must not be able to use similarly priced hardware that offers much higher performance for solving the puzzle – otherwise it would be easy for an attacker to simulate many legitimate clients connecting to the service.
It should be noted that the puzzle does not need to be resistant to speed-up by specialized fixed-function hardware (ASICs). This property is called "ASIC resistance", and there are two main reasons why it's not strictly required in this case:
Therefore the client puzzle algorithm must be CPU-friendly and minimize the advantage of programmable hardware such as GPUs and FPGAs.
In 2019, we developed a new CPU-friendly proof of work algorithm called RandomX to be used by Monero [ 7 ]. It has been audited and field tested, which makes it a good candidate to be considered for a Tor client puzzle.
However, during the development of RandomX, we focused mainly on the "CPU-friendliness" of the algorithm to minimize the advantage of specialized hardware. Other aspects such as verification performance and memory requirements were secondary and only had upper limits. The result is that RandomX has quite high solution verification demands – over 2 GiB of memory and around 2 milliseconds to verify a hash. Alternatively, it is possible to use only 256 MiB of memory, but then verification takes around 15 ms. This is way too slow to be used as a client puzzle.
So I began slimming down RandomX to make it less demanding for verification. I halved the memory requirements and reduced the number of rounds from 8 to 2, which is the minimum that can be considered safe from optimizations. The result is RandomX-Tor [ 8 ], which takes around 0.5 ms to verify with slightly over 1 GiB of memory. GPUs perform very poorly when running RandomX-Tor – similar to full RandomX.
However, I see two main problems with RandomX-Tor: solution verification still requires over 1 GiB of memory, which is a lot to ask of every hidden service, and around 0.5 ms per verification is still on the slow side for a service under attack.
So I decided to continue my quest and look for a better puzzle.
I briefly tested Argon2 [ 9 ] and Yespower [ 10 ], which are password hashing algorithms claiming to be "GPU-resistant". However, both have slower verification performance than RandomX-Tor and both run faster on GPUs, so they are not suitable. See Appendix for details.
Then I researched Equihash, which is an asymmetric client puzzle offering very fast solution verification [ 11 ]. The main problem with Equihash is that GPUs perform up to 100x faster than CPUs, so it is unusable as a puzzle for CPU-based clients.
I was ready to give up when I remembered SuperscalarHash, which is a part of RandomX that's used only to initialize the DAG [ 12 ]. It could be considered a lightweight version of RandomX with only integer operations and no memory accesses. I began refactoring the code with the goal of creating a fast GPU-resistant hash function. The main idea is that each instance of the hash function would use a completely different randomly generated sequence of instructions. I managed to optimize the SuperscalarHash generator and reduce the time to generate a random program from 250 μs to 50 μs, which could mean up to 20000 verifications per second – ten times faster than RandomX-Tor.
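The idea of instance-specific random programs can be illustrated with a toy sketch. The operation set, register file, and generator below are my own simplifications for illustration – the real SuperscalarHash/HashX generator is tuned to the characteristics of out-of-order CPUs and is not shown here:

```python
import random

# Toy illustration of the core idea: every seed produces a different
# randomly generated sequence of register instructions.
OPS = ("add", "xor", "mul", "rotr")
MASK64 = (1 << 64) - 1

def generate_program(seed: int, length: int = 64):
    rng = random.Random(seed)
    # Each instruction: (operation, destination register, source register)
    return [(rng.choice(OPS), rng.randrange(8), rng.randrange(8))
            for _ in range(length)]

def execute(program, x: int) -> int:
    # Eight 64-bit registers initialized from the input value.
    regs = [(x * (2 * i + 1)) & MASK64 for i in range(8)]
    for op, dst, src in program:
        if op == "add":
            regs[dst] = (regs[dst] + regs[src]) & MASK64
        elif op == "xor":
            regs[dst] ^= regs[src]
        elif op == "mul":
            regs[dst] = (regs[dst] * (regs[src] | 1)) & MASK64
        else:  # rotate right by a data-dependent amount
            s = regs[src] & 63
            regs[dst] = ((regs[dst] >> s) | (regs[dst] << (64 - s))) & MASK64
    result = 0
    for r in regs:
        result ^= r
    return result
```

Because the instruction sequence changes with every seed, an attacker cannot prepare one optimized circuit or kernel in advance.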
I named the new algorithm "HashX" and in the course of the next week, I made several other changes compared to the old SuperscalarHash:
I also made a few tweaks to the hash finalizer to pass SMHasher [ 13 ]. The result is a CPU-friendly algorithm suitable to be used as a partial hash inversion client puzzle.
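As a sketch of how a partial hash inversion puzzle works in general, here is the generic solve/verify pattern, with SHA-256 standing in for a per-client HashX instance. The nonce encoding and difficulty format are illustrative assumptions, not the actual Tor proposal wire format:

```python
import hashlib

# Generic solve/verify pattern of a partial hash inversion puzzle.
# SHA-256 stands in for a per-client HashX instance here.
def solve(challenge: bytes, difficulty_bits: int) -> int:
    target = 1 << (256 - difficulty_bits)  # hash value must fall below this
    nonce = 0
    while True:
        h = hashlib.sha256(challenge + nonce.to_bytes(8, "little")).digest()
        if int.from_bytes(h, "big") < target:
            return nonce  # expected work: ~2^difficulty_bits hash evaluations
        nonce += 1

def verify(challenge: bytes, difficulty_bits: int, nonce: int) -> bool:
    # A single hash evaluation -- cheap enough for the service to check
    # even when flooded with requests.
    h = hashlib.sha256(challenge + nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(h, "big") < (1 << (256 - difficulty_bits))

nonce = solve(b"example-challenge", 12)
assert verify(b"example-challenge", 12, nonce)
```

The asymmetry is the point: the solver pays for many hash evaluations while the verifier pays for exactly one.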
I still felt a bit uneasy about HashX using no memory. This means that logic-only FPGAs could be a viable option to run HashX. Additionally, introducing the right amount of memory-hardness could affect GPUs more than CPUs. I remembered Equihash, which is a memory-hard algorithm, but still allows for fast solution verification. I could simply replace the Blake2b hash in the default implementation with HashX. Each solution attempt would then use a different randomly generated HashX instance.
Equihash has two main parameters that determine the difficulty of the generalized birthday problem: n is the hash size in bits and 2^k is the number of hashes that need to XOR to zero. The memory hardness is proportional to 2^(n/(k+1)). To get fast verification and a small proof size, k should be as low as possible. I chose k = 3, which means the goal would be to find preimages to 8 hashes that XOR to zero.
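These parameter trade-offs can be sanity-checked with a few lines of arithmetic (entry counts only; actual byte sizes depend on the solver implementation):

```python
# Memory hardness of Equihash is proportional to 2^(n/(k+1)) list entries,
# and a solution consists of 2^k hash indices.
def hardness_exponent(n: int, k: int) -> int:
    return n // (k + 1)

assert hardness_exponent(60, 3) == 15    # Equi-X: ~2^15 entries, 8 indices
assert hardness_exponent(200, 9) == 20   # Zcash: ~2^20 entries, 512 indices
assert hardness_exponent(144, 5) == 24   # BTG: ~2^24 entries, 32 indices
```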
I first tried n = 96, which would require about 2 GB of memory for an efficient solver. However, the problem is that each HashX instance would be used for 2^25 hashes, which could make it feasible for GPUs to compile an optimized kernel for each new solution attempt. Additionally, GPUs are much better at the sorting phase of Equihash than CPUs due to their higher memory bandwidth.
After some more benchmarking, I estimated that 2^16 hashes per solution attempt would be the sweet spot for CPUs, because one solution attempt would take roughly 6-8 ms, the HashX compilation overhead would still be under 1% and the expected memory usage would be under 2 MiB. This can fit entirely into the CPU cache, which is the only case when CPUs can compete with GPUs in memory bandwidth. The optimal Equihash parameters are then n = 60, k = 3. I call this algorithm "Equi-X" (a portmanteau of Equihash and HashX). The exact memory requirement for Equi-X is 1.81 MiB with negligible solution discarding. It could be reduced to a minimum of 1 MiB with perfect bit packing and around 25% of discarded solutions, although this is viable only for custom hardware.
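The per-attempt hash counts quoted above follow directly from the Equihash structure, where one solution attempt evaluates 2^(n/(k+1)+1) hashes:

```python
# One Equihash solution attempt evaluates 2^(n/(k+1) + 1) hashes.
def hashes_per_attempt(n: int, k: int) -> int:
    return 2 ** (n // (k + 1) + 1)

assert hashes_per_attempt(96, 3) == 2 ** 25  # first try: too many per HashX instance
assert hashes_per_attempt(60, 3) == 2 ** 16  # the chosen sweet spot for CPUs
```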
However, when testing Equi-X, I discovered a fatal flaw: the generalized birthday problem, as used by Equihash, requires a collision-resistant hash function. While HashX is preimage-resistant, it is not collision-resistant: a small fraction of HashX instances produce a significant number of hash collisions. With a 2^k-XOR, finding 2^(k-1) pairs of colliding hashes automatically produces a valid solution, because equal values XOR to zero. But it gets worse: if the number of colliding hashes is higher than 2^k, the number of valid solutions grows extremely quickly. For example, with k = 3 and 100 colliding hashes, it is possible to produce more than 180 billion valid solutions! Finding such a multicollision takes only a few minutes for HashX.
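The solution count follows from simple combinatorics: with k = 3, any 8 of the 100 colliding hashes XOR to zero, so every 8-element subset is a valid solution.

```python
import math

# With 100 equal hash values, each choice of 8 of them is a valid
# 2^3-XOR solution, because equal values XOR to zero.
solutions = math.comb(100, 8)
assert solutions == 186_087_894_300  # more than 180 billion
```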
Some modifications of HashX reduced the number of collisions, but I could not completely eliminate them, so I needed another solution. I began researching the generalized birthday problem and I found this paragraph in the seminal paper by David Wagner [ 14 ]:
The advantage of 2^k-SUM (addition modulo 2^60) compared to 2^k-XOR is that identical hashes no longer produce valid solutions (except for hashes equal to 0x800000000000000, but these are very rare). There are two additional advantages:
With this modification, an Equi-X solution for nonce X is a set of eight 16-bit indices i0, ..., i7 such that:

HX(i0) + HX(i1) + HX(i2) + HX(i3) + HX(i4) + HX(i5) + HX(i6) + HX(i7) = 0 (mod 2^60)

where HX is a HashX function generated for nonce X.
Example of a solution:
```
H(0x6c31) = 0xcfa5375c0a7f5d7 \
                               (+) = 0xa73d9777f110000 \
H(0x8803) = 0xd798601be690a29 /                        |
                                                       (+) = 0xefdaadb00000000 \
H(0x80c2) = 0xcabd8974bbee8d5 \                        |                       |
                               (+) = 0x489d16380ef0000 /                       |
H(0xa1db) = 0x7ddf8cc3530172b /                                                |
                                                                               (+) = 0
H(0x6592) = 0x348a96fd685dcba \                                                |
                               (+) = 0x357120e8ffb8000 \                       |
H(0x76b7) = 0x00e689eb975a346 /                        |                       |
                                                       (+) = 0x102552500000000 /
H(0x74a6) = 0xacccc4ad2d06bcd \                        |
                               (+) = 0xdab431670048000 /
H(0xe259) = 0x2de76cb9d341433 /
```
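The example can be checked directly: verification only has to sum the eight 60-bit values modulo 2^60.

```python
# The eight HashX outputs from the example solution, copied verbatim.
hashes = [
    0xcfa5375c0a7f5d7, 0xd798601be690a29,
    0xcabd8974bbee8d5, 0x7ddf8cc3530172b,
    0x348a96fd685dcba, 0x00e689eb975a346,
    0xacccc4ad2d06bcd, 0x2de76cb9d341433,
]
# The full sum is zero modulo 2^60 ...
assert sum(hashes) % 2**60 == 0
# ... and the intermediate pair sums match the tree, e.g. the first pair:
assert (hashes[0] + hashes[1]) % 2**60 == 0xa73d9777f110000
```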
The following table compares Equi-X with the most common variants of Equihash:
| Algorithm | n | k | memory | solution size | verification¹ | CPU perf.² | GPU perf.³ |
|---|---|---|---|---|---|---|---|
| Equi-X | 60 | 3 | 1.8 MiB | 16 bytes | ~50 μs | 2000 Sol/s | ? |
| Zcash | 200 | 9 | 144 MiB | 1344 bytes | >150 μs | 30 Sol/s | ~400 Sol/s⁴ |
| BTG | 144 | 5 | 2.5 GiB | 100 bytes | ~10 μs | 1 Sol/s | ~45 Sol/s⁵ |
| Algorithm | GPU speed⁶ | Solver memory | Verifier memory | Verif./sec.¹¹ | Solution size¹² |
|---|---|---|---|---|---|
| Equi-X¹ | <50%⁷ | 1.81 MiB | <20 KiB | 19600 | 32 bytes |
| HashX² | <50%⁷ | <20 KiB | <20 KiB | 19700 | 16 bytes |
| RandomX-Tor³ | 10%⁸ | 1041 MiB | 1041 MiB | 2200 | 16 bytes |
| Argon2⁴ | ~300%⁹ | 2.00 MiB | 2.00 MiB | 1020 | 16 bytes |
| Yespower⁵ | ~200%¹⁰ | 2.00 MiB | 2.00 MiB | 780 | 16 bytes |