The Plonk-DIZK GPU Acceleration prize category required contestants to effectively balance computational efficiency, memory management, and parallelism while preserving the privacy and security aspects of zero-knowledge proofs. Overcoming this challenge required innovative approaches, a deep understanding of cryptographic primitives, and the ability to push the boundaries of GPU computation and distributed systems. The winning team was composed of faculty members and PhD students at the Hong Kong Polytechnic University (PolyU) and the University of Hong Kong (HKU) including Xingye Lu (PolyU), Chengru Zhang (HKU), Mengling Liu (PolyU), Man Ho Au (PolyU).
The winning team reduced proof generation time by 40% by moving the most time-consuming operations in the proof generation process, namely, the multi-scalar multiplication (MSM) and fast Fourier transform (FFT) to GPUs. They also developed a new dispatcher to distribute the prover’s computation across a cluster of computers and were able to generate a proof for a much larger circuit. This new implementation can generate a proof in less than 1 hour for a circuit of size 2^28, which is the largest circuit with reported successful plonk proof generation.
The main obstacle for real-world applications of the Plonk proof system lies in the heavy computational burden and memory usage of proof generation. Plonk-DIZK combination is important because it offers an efficient and scalable solution for generating ZKPs in a distributed computing environment. There are three key components:
This prize category aimed at accelerating the Plonk proof system for gigantic circuits through parallelization. In order to accelerate a Plonk prover, the team made the following two contributions.
First, the winning team moved the most time-consuming operations in the proof generation process, multi-scalar multiplication (MSM) and fast Fourier transform (FFT) to GPUs. This reduced proof generation time by 40%.
Secondly, the team developed a new dispatcher to efficiently distribute the prover's computation across a cluster of computers. For example, to generate a proof for a circuit with 2^28 gates, the prover would typically need at least 1TB of memory. However, using their new approach, the same process can be accomplished with one computer having 120GB of memory and five additional computers, each with 66GB of memory.
For a circuit of size 228 , a single wire polynomial/selector polynomial takes 8 GB of memory and its FFT evaluations in the quotient domain take up to 64 GB. It is nearly impossible for a single server to compute complicated polynomials operations monolithically in its memory. Distributing the workload among multiple servers is non-trivial. The transfer time of sending a total of 18 different polynomials (5 wires, 13 selectors) in Plonk alone requires 1 hour with 1 GB(yte)/s bandwidth using the pre-existing distribution approach. Given the bandwidth limitation in some real-world environments, the new approach sends (parts of) the polynomials only when it is necessary.
Parallelizing FFT is challenging, particularly when the polynomial can easily exceed the memory of a single GPU in a large circuit. The team identified that the Cooley-Tukey algorithm was more suitable for parallelizing FFT in this setting.
Zero-knowledge proofs have a wide range of applications. For each application, the circuit size can vary dramatically. You can think of the circuit size as roughly corresponding to the complexity of the program.
ZCash-style payments, for example, require relatively small circuits (ranging in size from 2^15 - 2^17 gates). But training on ML model is significantly larger and scales up with depending on how much training data is used.
This is why the DIZK approach is ultimately important. By parallizing the proof generation process, it can enable zero-knowledge applications that would otherwise be impossible to prove on a single machine.
Thanks to all who participated in this category. Your contributions pushed Plonk-DIZK GPU Acceleration far into the future.