Conversation

@draganaurosgrbic (Contributor) commented Jul 25, 2025

Hashing Syndrome Patterns with boost::dynamic_bitset

In this PR, I address a key performance bottleneck: the hashing of fired detector patterns (syndrome patterns). I introduce boost::dynamic_bitset from the Boost library, a data structure that combines the memory-saving bit packing of std::vector<bool> with highly optimized, vectorized bit-wise operations. Crucially, boost::dynamic_bitset also provides efficient built-in support for hashing sequences of boolean elements.


Initial Optimization: std::vector<bool> to std::vector<char>

The initial Tesseract implementation, as documented in #25, used std::vector<bool> to store patterns of fired detectors and predicates that block specific errors from being added to the current error hypothesis. While std::vector<bool> optimizes memory usage by packing elements into individual bits, accessing and modifying its elements is inefficient because every access goes through a proxy object that performs costly bit-wise operations (shifting, masking). Given how frequently Tesseract accesses and modifies these elements, this caused significant performance overhead.
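To illustrate the difference, here is a minimal standalone sketch (not Tesseract code) contrasting the two access paths:

```cpp
#include <cstddef>
#include <vector>

// std::vector<bool> packs 8 elements per byte, so operator[] returns a proxy
// object (std::vector<bool>::reference) that shifts and masks the underlying
// word on every read or write. std::vector<char> trades 8x more memory for
// plain byte loads and stores.
void flip_all(std::vector<bool>& packed, std::vector<char>& bytes) {
  for (std::size_t i = 0; i < packed.size(); ++i) {
    packed[i] = !packed[i];  // proxy: read-modify-write with shift/mask
  }
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    bytes[i] = !bytes[i];    // direct byte access for a 0/1 flag, no proxy
  }
}
```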

In #25, I transitioned from std::vector<bool> to std::vector<char>. This change made boolean elements addressable bytes, enabling efficient and direct byte-level access. Although this increased the memory footprint (each boolean was stored as a full byte), it delivered substantial performance gains by eliminating std::vector<bool>'s proxy objects and their associated overhead for element access and modification. Speedups achieved with this initial optimization were significant:

  • For Color Codes, speedups reached 17.2%-32.3%
  • For Bivariate-Bicycle Codes, speedups reached 13.0%-22.3%
  • For Surface Codes, speedups reached 33.4%-42.5%
  • For Transversal CNOT Protocols, speedups reached 12.2%-32.4%

These significant performance gains highlight the importance of choosing appropriate data structures for boolean sequences, especially in performance-sensitive applications like Tesseract. The remarkable 42.5% speedup achieved in Surface Codes with this initial switch underscores the substantial overhead caused by unsuitable data structures. The performance gain from removing std::vector<bool>'s proxy objects and their inefficient operations far outweighed any overhead from increased memory consumption.


Current Bottleneck: std::vector<char> and Hashing

Following the optimizations in #25, Tesseract continued to use std::vector<char> for storing and managing patterns of fired detectors and predicates that block errors. Subsequently, PR #34 merged the vectors of blocked errors into the DetectorCostTuple structure, which stores error_blocked and detectors_count as uint32_t fields (for the reasons explained in #34). These changes left the vectors of fired detectors as the sole remaining std::vector<char> data structure in this context.
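For reference, a minimal sketch of the layout described above (the authoritative definition is in #34; anything beyond the two uint32_t fields named here is an assumption):

```cpp
#include <cstdint>

// Sketch of DetectorCostTuple as described in this PR: two uint32_t fields
// kept side by side, so the pair stays compact and cache-friendly.
struct DetectorCostTuple {
  uint32_t error_blocked;    // predicate blocking this error from the hypothesis
  uint32_t detectors_count;  // per-error detector counter used during decoding
};
```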

After implementing and evaluating the optimizations in #25, #27, and #34, I profiled Tesseract to analyze the remaining bottlenecks. Aside from the get_detcost function, a notable bottleneck emerged: VectorCharHash (originally VectorBoolHash). This function hashes patterns of fired detectors to prevent re-exploring previously visited syndrome states; its implementation iterated through the pattern element by element, byte by byte, accumulating the hash. Even though this function sped up significantly with the initial switch from std::vector<bool> to std::vector<char>, hashing fired detector patterns still consumed considerable time: approximately 25% of decoding time in Surface Codes, 30% in Transversal CNOT Protocols, 10% in Color Codes, and 2% in Bivariate-Bicycle Codes (where get_detcost remained the primary bottleneck). I therefore decided to explore opportunities to further optimize this function and enhance decoding speed.
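The shape of that byte-by-byte hash is roughly as follows (an illustrative sketch; the functor name and mixing constant are placeholders, not Tesseract's exact code):

```cpp
#include <cstddef>
#include <vector>

// One hash-mix step per element: cost grows linearly with the number of
// detectors, which is why this loop showed up prominently in profiles.
struct VectorCharHashSketch {
  std::size_t operator()(const std::vector<char>& fired) const {
    std::size_t h = 0;
    for (char c : fired) {
      h = h * 31 + static_cast<unsigned char>(c);
    }
    return h;
  }
};
```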


Solution: Introducing boost::dynamic_bitset

This PR addresses the performance bottleneck of hashing fired detector patterns, and mitigates the increased memory footprint from the earlier switch to std::vector<char>, by introducing the boost::dynamic_bitset data structure. The C++ standard library's std::bitset offers the ideal conceptual solution: memory-efficient bit-packed storage (like std::vector<bool>) combined with highly efficient bit-wise operations that avoid the overhead of std::vector<bool>'s proxy objects. However, std::bitset requires its size to be fixed at compile time, rendering it unsuitable for Tesseract's dynamically sized syndrome patterns.
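The size constraint is easy to see in a minimal sketch:

```cpp
#include <bitset>
#include <cstddef>
#include <boost/dynamic_bitset.hpp>

void size_example(std::size_t num_detectors) {
  std::bitset<1024> fixed;                          // capacity is a template
  fixed.set(0);                                     // parameter, fixed at compile time
  boost::dynamic_bitset<> syndrome(num_detectors);  // capacity chosen at runtime,
  syndrome.set(0);                                  // storage still bit-packed
}
```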

The Boost library's boost::dynamic_bitset provides the perfect solution by offering dynamic-sized bit arrays whose dimensions can be determined at runtime. This data structure brilliantly combines the memory efficiency of std::vector<bool> (by packing elements into individual bits) with the performance benefits of vectorized bit-wise operations. This is achieved by internally storing bits within contiguous memory blocks and executing vectorized bit-wise operations across all elements from the same block, thus avoiding the overheads of std::vector<bool>'s proxy objects and costly bit-wise operations. Furthermore, boost::dynamic_bitset offers highly optimized, built-in hashing functions, replacing our custom, less efficient byte-by-byte hashing and resulting in a cleaner, faster implementation.
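As a hedged sketch of how such a hash can be computed block-wise (boost::to_block_range and boost::hash_range are real Boost APIs; whether Tesseract uses exactly this formulation is not shown here):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>
#include <boost/dynamic_bitset.hpp>
#include <boost/functional/hash.hpp>

// Hash a dynamic_bitset one machine word at a time instead of one byte per
// detector: each mix step covers an entire block (typically 64 bits).
struct DynamicBitsetHash {
  std::size_t operator()(const boost::dynamic_bitset<>& bits) const {
    std::vector<boost::dynamic_bitset<>::block_type> blocks(bits.num_blocks());
    boost::to_block_range(bits, blocks.begin());  // export the packed blocks
    return boost::hash_range(blocks.begin(), blocks.end());
  }
};

// Usage sketch: a visited-syndrome set keyed by bit-packed patterns.
using SyndromeSet = std::unordered_set<boost::dynamic_bitset<>, DynamicBitsetHash>;
```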


Performance Evaluation: Individual Impact of Optimization

I performed two types of experiments to evaluate the achieved performance gains. First, I conducted extensive benchmarks across various code families and configurations to evaluate the individual performance gains achieved by this specific optimization. Speedups achieved include:

  • For Surface Codes: 8.0%-24.7%
  • For Transversal CNOT Protocols: 12.1%-26.8%
  • For Color Codes: 3.6%-7.0%
  • For Bivariate-Bicycle Codes: 0.5%-4.8%

These results show the highest impact in Surface Codes and Transversal CNOT Protocols, which aligns with the profiling data showing that these code families spent the most time in the original VectorCharHash function.


Speedups in Surface Codes

[plot omitted]

Speedups in Transversal CNOT Protocols

[plots omitted]

Speedups in Color Codes

[plots omitted]

Speedups in Bivariate-Bicycle Codes

[plots omitted]

Performance Evaluation: Cumulative Speedup

Following the evaluation of individual performance gains, I analyzed the cumulative effect of the optimizations implemented across PRs #25, #27, #34, and this PR. The cumulative speedups achieved are:

  • For Color Codes: 40.7%-54.8%
  • For Bivariate-Bicycle Codes: 41.5%-80.3%
  • For Surface Codes: 50.0%-62.4%
  • For Transversal CNOT Protocols: 57.8%-63.6%

These results demonstrate that my optimizations achieved over 2x speedup in Color Codes, over 2.5x speedup in Surface Codes and Transversal CNOT Protocols, and over 5x speedup in Bivariate-Bicycle Codes.


Speedups in Color Codes

[plots omitted]

Speedups in Bivariate-Bicycle Codes

[plots omitted]

Speedups in Surface Codes

[plot omitted]

Speedups in Transversal CNOT Protocols

[plot omitted]

Conclusion

These results demonstrate that the boost::dynamic_bitset optimization significantly impacts code families where the original hashing function (VectorCharHash) was a primary bottleneck (Surface Codes and Transversal CNOT Protocols). The substantial speedups achieved in these code families validate that boost::dynamic_bitset provides demonstrably more efficient hashing and bit-wise operations. For code families where hashing was less of a bottleneck (Color Codes and Bivariate-Bicycle Codes), the speedups were modest, reinforcing that std::vector<char> can remain highly efficient even with increased memory usage when bit packing is not the primary performance concern. Crucially, this optimization delivers comparable or superior performance to std::vector<char> while simultaneously reducing memory footprint, providing additional speedups where hashing performance is critical.


Key Contributions

  • Identified the hashing of syndrome patterns as the primary remaining bottleneck in Surface Codes and Transversal CNOT Protocols after prior optimizations (Replace std::vector<bool> with std::vector<char> for faster computations #25, Removing unnecessary std::vector copy operations #27, Accelerating 'get_detcost' function #34).
  • Adopted boost::dynamic_bitset as a superior data structure, combining std::vector<bool>'s memory efficiency with high-performance vectorized bit-wise operations and efficient built-in hashing.
  • Replaced std::vector<char> with boost::dynamic_bitset for storing syndrome patterns.
  • Performed extensive benchmarking to evaluate both the individual impact of this optimization and its cumulative effect with prior PRs.
  • Achieved significant individual speedups (e.g., 8.0%-24.7% in Surface Codes, 12.1%-26.8% in Transversal CNOT Protocols) and substantial cumulative speedups (over 2x in Color Codes, over 2.5x in Surface Codes and Transversal CNOT Protocols, and over 5x in Bivariate-Bicycle Codes).

draganaurosgrbic and others added 20 commits June 14, 2025 14:52
Signed-off-by: Dragana Grbic <[email protected]>
@LalehB (Collaborator) left a comment

LGTM!

@draganaurosgrbic merged commit 8699f2d into main Jul 30, 2025
4 checks passed
@draganaurosgrbic deleted the optimization-cpu branch July 30, 2025 18:16
@draganaurosgrbic removed the request for review from noajshu July 30, 2025 18:18
draganaurosgrbic added a commit that referenced this pull request Aug 7, 2025
### Description
This Pull Request introduces a substantial performance optimization to
_Tesseract_'s initialization phase. While previous efforts primarily
focused on enhancing the critical decoding speed, this work addresses an
identified bottleneck in the one-time setup/initialization process. I've
targeted a highly inefficient code segment and achieved remarkable
speedups.

---

### Background
Before _Tesseract_ can decode simulations/shots of quantum circuits, it
must first read and parse the quantum circuit model. This process
involves populating and constructing internal data structures essential
for decoding. For a given quantum circuit, _Tesseract_ performs this
initialization once, then utilizes the constructed data structures and
parsed model to decode multiple shots/simulations. As such, the
initialization phase hasn't been a primary focus for optimization, as
it's a one-time operation and generally not a major time sink compared
to the iterative decoding process. However, after achieving significant
performance gains in the decoding phase, I identified an opportunity to
further improve overall efficiency by optimizing a particularly
inefficient loop within initialization.

---

### Problem: Inefficient `eneighbors` Calculation
The primary bottleneck I identified within the initialization phase was
the loop responsible for calculating `eneighbors` (error neighbors).
This data structure determines, for each error, which detectors are
affected by its neighboring errors. The original implementation, shown
below, exhibited severe performance issues:
```cpp
  // Build a hash set of each error's detectors for O(1) membership tests.
  std::vector<std::unordered_set<size_t>> edets_sets(edets.size());
  for (size_t ei = 0; ei < edets.size(); ++ei) {
    edets_sets[ei] = std::unordered_set<size_t>(edets[ei].begin(), edets[ei].end());
  }
  for (size_t ei = 0; ei < num_errors; ++ei) {
    std::set<int> neighbor_set;
    for (int d : edets[ei]) {             // each detector of this error
      for (int oei : d2e[d]) {            // each error sharing that detector
        for (int od : edets_sets[oei]) {  // each detector of that neighboring error
          if (!edets_sets[ei].contains(od)) {
            neighbor_set.insert(od);      // ordered set: O(log n) per insert
          }
        }
      }
    }
    eneighbors[ei] = std::vector<int>(neighbor_set.begin(), neighbor_set.end());
  }
```
This implementation suffered from:
1. **High Computational Complexity:** The four nested loops resulted in
a complexity proportional to `num_errors` \* `detectors_per_error` \*
`errors_per_detector` \* `detectors_per_neighbor_error`.
2. **`std::set` and `std::unordered_set` Overheads:** Frequent `insert`
operations on `std::set` (logarithmic time) and `contains` checks on
`std::unordered_set` (average constant time, but with hashing and
memory-management overhead) became costly when executed a very large
number of times.

---

### Solution: Leveraging `boost::dynamic_bitset` for Efficient Set
Operations
Drawing from the successful application of `boost::dynamic_bitset` in
optimizing syndrome pattern hashing (as implemented in #57), I replaced
`std::set` and `std::unordered_set` in this critical initialization loop
with `boost::dynamic_bitset`. This significantly accelerated the
`eneighbors` calculation. As detailed in #57, `boost::dynamic_bitset`
offers memory efficiency similar to `std::vector<bool>` but provides
highly optimized bit-wise operations for manipulating elements. This is
achieved by packing individual bits/elements into contiguous memory
blocks and enabling a single bit-wise operation to be executed across
multiple elements from the same block simultaneously, leveraging CPU
vectorization. The optimized loop is shown below:
```cpp
  std::vector<boost::dynamic_bitset<>> edets_bitsets(num_errors,
                                                     boost::dynamic_bitset<>(num_detectors));
  for (size_t ei = 0; ei < num_errors; ++ei) {
    for (int d : edets[ei]) {
      edets_bitsets[ei][d] = 1;
    }
  }

  for (size_t ei = 0; ei < num_errors; ++ei) {
    boost::dynamic_bitset<> neighbor_set(num_detectors, false);
    for (int d : edets[ei]) {
      for (int oei : d2e[d]) {
        // Unify detectors from neighboring errors
        neighbor_set |= edets_bitsets[oei];
      }
    }
    // Remove detectors from error's own set
    neighbor_set &= ~edets_bitsets[ei];

    for (size_t d = neighbor_set.find_first(); d != boost::dynamic_bitset<>::npos;
         d = neighbor_set.find_next(d)) {
      eneighbors[ei].push_back(d);
    }
  }
```
This optimization significantly improves performance by:
1. **Reduced Nested Loops:** The code now contains three nested loops
(instead of the original four), substantially decreasing the total
number of iterations.
2. **Vectorized Bit-wise Operations:** `boost::dynamic_bitset` stores
bits in contiguous memory blocks, so a single bit-wise operation
processes an entire block (typically 64 bits) at once, effectively
performing vectorized set unions and differences (see the standalone
example after this list). This dramatically reduces the overhead of the
element-wise checks and insertions in the original implementation.
3. **Memory Efficiency:** `boost::dynamic_bitset` retains the
memory-saving bit-packing feature similar to `std::vector<bool>` while
eliminating the performance overhead stemming from `std::vector<bool>`'s
proxy objects and inefficient bit-level manipulations that operate on
individual elements separately.
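The set semantics of these bit-wise operators can be seen in isolation
(a minimal standalone example, unrelated to Tesseract's data):

```cpp
#include <cassert>
#include <boost/dynamic_bitset.hpp>

int main() {
  boost::dynamic_bitset<> a(128), b(128);
  a.set(3);
  b.set(3);
  b.set(70);
  a |= b;            // set union: one word-wide OR covers 64 elements at a time
  a &= ~b;           // set difference: removes every element of b from a
  assert(a.none());  // {3, 70} minus {3, 70} is empty
  return 0;
}
```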

---

### Impact and Performance Benchmarks
The speedups I achieved in the initialization function are remarkable
across various code families and configurations.

#### Before Optimization (Initial Times)
- **Color Codes:** 1.5 - 9 seconds
- **Bivariate-Bicycle Codes:** 6 - 17 seconds
- **Surface Codes:** 0.1 - 0.8 seconds
- **Transversal CNOT Protocols:** 0.2 - 7 seconds

#### After Optimization (Speedups)
- **Color Codes:** 95.89x to 128.18x (less than a second)
- **Bivariate-Bicycle Codes:** 106.43x to 132.88x (less than a second)
- **Surface Codes:** 26.41x to 36.07x (less than a second)
- **Transversal CNOT Protocols:** 17.51x to 43.07x (less than a second)

As shown by the initial times, the initialization previously did not
exceed 17 seconds for the benchmarks I performed, with Bivariate-Bicycle
codes having the highest overhead. However, since this operation is
performed once per quantum circuit (and Tesseract then uses the
initialized knowledge to decode multiple simulations/shots, where
performance is critical), even these initial times were acceptable.

Nevertheless, initialization times after this optimization fell below a
second for all tested code families and configurations: below 0.09
seconds for Color Codes, below 0.15 seconds for Bivariate-Bicycle Codes,
below 0.03 seconds for Surface Codes, and below 0.4 seconds for
Transversal CNOT Protocols. This dramatic reduction explains the
exceptionally high speedup factors; **the initialization phase is now
extremely fast.**

Below are plots that show the performance gains I achieved across
different code families and configurations.

<img width="1790" height="989" alt="color1"
src="https://github.com/user-attachments/assets/7f550ae7-b2a0-464c-80c5-c90094e45ec9"
/>

<img width="1790" height="989" alt="color2"
src="https://github.com/user-attachments/assets/87eb9b9c-2515-4027-9d96-0f9a3814a448"
/>

<img width="1790" height="989" alt="color3"
src="https://github.com/user-attachments/assets/8b0a1b24-9d6a-485a-a079-5e272c7b611f"
/>

<img width="1790" height="989" alt="bicycle1"
src="https://github.com/user-attachments/assets/420a9794-bf10-42bf-8ed0-580470aa70ac"
/>

<img width="1790" height="989" alt="bicycle2"
src="https://github.com/user-attachments/assets/6396faac-f3e3-4ce6-88ae-5e4c2af8200f"
/>

<img width="1790" height="989" alt="bicycle3"
src="https://github.com/user-attachments/assets/ed6f9565-2fb2-413d-b9f0-06cdd3624d2e"
/>

<img width="1790" height="989" alt="surface1"
src="https://github.com/user-attachments/assets/1029b27e-8b33-4051-9dd3-30f882d60dc2"
/>

<img width="1790" height="989" alt="trans1"
src="https://github.com/user-attachments/assets/76c6b166-64a7-4af9-a4c8-b7b5bc73cfa1"
/>

<img width="1790" height="989" alt="trans2"
src="https://github.com/user-attachments/assets/27da2186-f651-44db-8ab1-00c1f260d5b8"
/>

<img width="1790" height="989" alt="trans3"
src="https://github.com/user-attachments/assets/9bc57295-509a-4dad-a3b1-42457b563285"
/>

---

### Conclusion
This optimization to the initialization function demonstrates the
substantial performance gains achievable by refactoring inefficient
loops and leveraging advanced data structures like
`boost::dynamic_bitset`. It showcases how its highly optimized bit-wise
operations (enabling vectorized execution across multiple elements at
once) can be used to implement highly efficient set operations (union,
difference). The resulting remarkable speedups further enhance the
overall efficiency of the Tesseract decoder.

---

### Key Contributions
* **Identified and Investigated:** Pinpointed a critical
inefficiency/loop within the initialization function that was consuming
significant time.
* **Leveraged Advanced Data Structures:** Applied the approach proven in
the optimization in #57 to code logic that frequently manipulates set
elements, leveraging `boost::dynamic_bitset`'s highly optimized,
vectorized bit-wise operations to perform the set operations.
* **Refactored Critical Loop:** Replaced the inefficient `eneighbors`
calculation loop, which previously used `std::unordered_set` and
`std::set`, with an improved version utilizing `boost::dynamic_bitset`
and vectorized bit-wise operations for set unions and differences.
* **Achieved Remarkable Speedups:** Delivered exceptional speedups
across various code families and configurations, reaching up to 132.88x
faster initialization in a Bivariate-Bicycle code benchmark, making
Tesseract even more robust and scalable.

---------

Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>
@NoureldinYosri mentioned this pull request Sep 5, 2025