Skip to content

Finding the barrier toward better parallelism #11

@weikengchen

Description

@weikengchen

More updates will be added to this note.

I measured the cost for Groth16 to prove a constraint system with two million constraints, with a different number of cores, using cargo bench in this repo with appropriate command line parameters.

Below I focus on the main cost in my constraint system, which turns out to be in the witness map and computation of C.

Going from 1 core to 4 cores, the improvement is significant. But later when more cores are added, the time for the witness map seems to not change a lot.

Since the witness map involves a lot of FFT, it may suggest that the current implementation of FFT has a barrier toward many-many-core parallelism.

Such a barrier, maybe avoidable, maybe unavoidable. I will take a look at the detailed breakdown of the cost of the witness map.

===== Groth16 =====

1 core
R1CS to QAP witness map ... 32.102s
Compute C ................. 56.453s
Total ..................... 93.780s

4 cores
R1CS to QAP witness map ... 12.476s
Compute C ................. 16.54s
Total ..................... 34.203s

20 cores
R1CS to QAP witness map ... 8.856s
Compute C ................. 8.631s
Total ..................... 23.99s

40 cores
R1CS to QAP witness map ... 8.300s
Compute C ................. 4.414s
Total ..................... 18.319s

48 cores
R1CS to QAP witness map ... 8.166s
Compute C ................. 4.682s
Total ..................... 18.230s

Metadata

Metadata

Assignees

No one assigned

    Labels

    D-mediumDifficulty: mediumT-designType: discuss API design and/or researchT-performanceType: performance improvements

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions