# Hashing fired detectors with `boost::dynamic_bitset` #57
**Merged**

## Conversation
**LalehB** (Collaborator) approved these changes on Jul 30, 2025:

> LGTM!
**draganaurosgrbic** added a commit that referenced this pull request on Aug 7, 2025:
### Description
This Pull Request introduces a substantial performance optimization to
_Tesseract_'s initialization phase. While previous efforts primarily
focused on enhancing the critical decoding speed, this work addresses an
identified bottleneck in the one-time setup/initialization process. I've
targeted a highly inefficient code segment and achieved remarkable
speedups.
---
### Background
Before _Tesseract_ can decode simulations/shots of quantum circuits, it
must first read and parse the quantum circuit model. This process
involves populating and constructing internal data structures essential
for decoding. For a given quantum circuit, _Tesseract_ performs this
initialization once, then utilizes the constructed data structures and
parsed model to decode multiple shots/simulations. As such, the
initialization phase hasn't been a primary focus for optimization, as
it's a one-time operation and generally not a major time sink compared
to the iterative decoding process. However, after achieving significant
performance gains in the decoding phase, I identified an opportunity to
further improve overall efficiency by optimizing a particularly
inefficient loop within initialization.
---
### Problem: Inefficient `eneighbors` Calculation
The primary bottleneck I identified within the initialization phase was
the loop responsible for calculating `eneighbors` (error neighbors).
This data structure determines, for each error, which detectors are
affected by its neighboring errors. The original implementation, shown
below, exhibited severe performance issues:
```cpp
// Build a hash set of detectors for each error.
std::vector<std::unordered_set<size_t>> edets_sets(edets.size());
for (size_t ei = 0; ei < edets.size(); ++ei) {
  edets_sets[ei] = std::unordered_set<size_t>(edets[ei].begin(), edets[ei].end());
}
for (size_t ei = 0; ei < num_errors; ++ei) {
  std::set<int> neighbor_set;
  for (int d : edets[ei]) {
    for (int oei : d2e[d]) {
      for (int od : edets_sets[oei]) {
        // Collect detectors of neighboring errors, excluding this error's own.
        if (!edets_sets[ei].contains(od)) {
          neighbor_set.insert(od);
        }
      }
    }
  }
  eneighbors[ei] = std::vector<int>(neighbor_set.begin(), neighbor_set.end());
}
```
This implementation suffered from:
1. **High Computational Complexity:** The four nested loops resulted in
a complexity proportional to `num_errors` \* `detectors_per_error` \*
`errors_per_detector` \* `detectors_per_neighbor_error`.
2. **`std::set` and `std::unordered_set` Overheads:** Frequent `insert`
operations on `std::set` (logarithmic time complexity) and `contains`
operations on `std::unordered_set` (average constant time complexity)
introduced significant memory-management and lookup overhead, which
became substantial when repeated a large number of times (a rough
complexity comparison follows this list).
---
### Solution: Leveraging `boost::dynamic_bitset` for Efficient Set
Operations
Drawing from the successful application of `boost::dynamic_bitset` in
optimizing syndrome pattern hashing (as implemented in #57), I replaced
`std::set` and `std::unordered_set` in this critical initialization loop
with `boost::dynamic_bitset`. This significantly accelerated the
`eneighbors` calculation. As detailed in #57, `boost::dynamic_bitset`
offers memory efficiency similar to `std::vector<bool>` but provides
highly optimized bit-wise operations for manipulating elements. This is
achieved by packing individual bits/elements into contiguous memory
blocks and enabling a single bit-wise operation to be executed across
multiple elements from the same block simultaneously, leveraging CPU
vectorization. The optimized loop is shown below:
```cpp
// Build a bitset of detectors for each error (one bit per detector).
std::vector<boost::dynamic_bitset<>> edets_bitsets(
    num_errors, boost::dynamic_bitset<>(num_detectors));
for (size_t ei = 0; ei < num_errors; ++ei) {
  for (int d : edets[ei]) {
    edets_bitsets[ei][d] = 1;
  }
}
for (size_t ei = 0; ei < num_errors; ++ei) {
  boost::dynamic_bitset<> neighbor_set(num_detectors, false);
  for (int d : edets[ei]) {
    for (int oei : d2e[d]) {
      // Unify detectors from neighboring errors.
      neighbor_set |= edets_bitsets[oei];
    }
  }
  // Remove detectors from the error's own set.
  neighbor_set &= ~edets_bitsets[ei];
  for (size_t d = neighbor_set.find_first();
       d != boost::dynamic_bitset<>::npos; d = neighbor_set.find_next(d)) {
    eneighbors[ei].push_back(d);
  }
}
```
This optimization significantly improves performance by:
1. **Reduced Nested Loops:** The code now contains three nested loops
(instead of the original four), substantially decreasing the total
number of iterations.
2. **Vectorized Bit-wise Operations:** `boost::dynamic_bitset` stores
bits in contiguous memory blocks and enables highly optimized,
hardware-accelerated bit-wise operations. These operations process an
entire block in a single CPU operation, effectively performing
vectorized set unions and differences. This dramatically reduces the
overhead of the element-wise checks and insertions found in the original
implementation (a toy example of these primitives follows this list).
3. **Memory Efficiency:** `boost::dynamic_bitset` retains the
memory-saving bit-packing feature similar to `std::vector<bool>` while
eliminating the performance overhead stemming from `std::vector<bool>`'s
proxy objects and inefficient bit-level manipulations that operate on
individual elements separately.
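As a minimal, self-contained illustration of the block-level set operations the optimized loop relies on (union via `|`, difference via `&= ~`, and iteration over set bits via `find_first`/`find_next`), here is a toy example; it is not Tesseract code, just a sketch of the primitives:

```cpp
#include <cstddef>
#include <iostream>
#include <boost/dynamic_bitset.hpp>

int main() {
  // Two small "detector sets" encoded as bitsets (one bit per detector).
  boost::dynamic_bitset<> a(8), b(8);
  a.set(1); a.set(3);  // a = {1, 3}
  b.set(3); b.set(5);  // b = {3, 5}

  boost::dynamic_bitset<> u = a | b;  // union: {1, 3, 5}
  u &= ~a;                            // difference with a: {5}

  // Iterate over set bits, as the optimized eneighbors loop does.
  for (std::size_t i = u.find_first(); i != boost::dynamic_bitset<>::npos;
       i = u.find_next(i)) {
    std::cout << i << '\n';  // prints 5
  }
  return 0;
}
```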
---
### Impact and Performance Benchmarks
The speedups I achieved in the initialization function are remarkable
across various code families and configurations.
#### Before Optimization (Initial Times)
- **Color Codes:** 1.5 - 9 seconds
- **Bivariate-Bicycle Codes:** 6 - 17 seconds
- **Surface Codes:** 0.1 - 0.8 seconds
- **Transversal CNOT Protocols:** 0.2 - 7 seconds
#### After Optimization (Speedups)
- **Color Codes:** 95.89x to 128.18x (less than a second)
- **Bivariate-Bicycle Codes:** 106.43x to 132.88x (less than a second)
- **Surface Codes:** 26.41x to 36.07x (less than a second)
- **Transversal CNOT Protocols:** 17.51x to 43.07x (less than a second)
As shown by the initial times, the initialization previously did not
exceed 17 seconds for the benchmarks I performed, with Bivariate-Bicycle
codes having the highest overhead. However, since this operation is
performed once per quantum circuit (and Tesseract then uses the
initialized knowledge to decode multiple simulations/shots, where
performance is critical), even these initial times were acceptable.
Nevertheless, initialization times after this optimization fell below a
second for all tested code families and configurations. For Color Codes,
initialization fell below 0.09 seconds, for Bivariate-Bicycle Codes
below 0.15 seconds, for Surface Codes below 0.03 seconds, and for
Transversal CNOT Protocols below 0.4 seconds. This dramatic reduction
explains the exceptionally high speedup factors; **the initialization
phase is now extremely fast.**
Below are plots that show the performance gains I achieved across
different code families and configurations.
<img width="1790" height="989" alt="color1"
src="https://github.com/user-attachments/assets/7f550ae7-b2a0-464c-80c5-c90094e45ec9"
/>
<img width="1790" height="989" alt="color2"
src="https://github.com/user-attachments/assets/87eb9b9c-2515-4027-9d96-0f9a3814a448"
/>
<img width="1790" height="989" alt="color3"
src="https://github.com/user-attachments/assets/8b0a1b24-9d6a-485a-a079-5e272c7b611f"
/>
<img width="1790" height="989" alt="bicycle1"
src="https://github.com/user-attachments/assets/420a9794-bf10-42bf-8ed0-580470aa70ac"
/>
<img width="1790" height="989" alt="bicycle2"
src="https://github.com/user-attachments/assets/6396faac-f3e3-4ce6-88ae-5e4c2af8200f"
/>
<img width="1790" height="989" alt="bicycle3"
src="https://github.com/user-attachments/assets/ed6f9565-2fb2-413d-b9f0-06cdd3624d2e"
/>
<img width="1790" height="989" alt="surface1"
src="https://github.com/user-attachments/assets/1029b27e-8b33-4051-9dd3-30f882d60dc2"
/>
<img width="1790" height="989" alt="trans1"
src="https://github.com/user-attachments/assets/76c6b166-64a7-4af9-a4c8-b7b5bc73cfa1"
/>
<img width="1790" height="989" alt="trans2"
src="https://github.com/user-attachments/assets/27da2186-f651-44db-8ab1-00c1f260d5b8"
/>
<img width="1790" height="989" alt="trans3"
src="https://github.com/user-attachments/assets/9bc57295-509a-4dad-a3b1-42457b563285"
/>
---
### Conclusion
This optimization to the initialization function demonstrates the
substantial performance gains achievable by refactoring inefficient
loops and leveraging advanced data structures like
`boost::dynamic_bitset`. It showcases how its highly optimized bit-wise
operations (enabling vectorized execution across multiple elements at
once) can be used to implement highly efficient set operations (union,
difference). The resulting remarkable speedups further enhance the
overall efficiency of the Tesseract decoder.
---
### Key Contributions
* **Identified and Investigated:** Pinpointed a critically inefficient
loop within the initialization function that was consuming significant
time.
* **Leveraged Advanced Data Structures:** Applied the approach proven in
the #57 optimization, replacing code logic that frequently manipulates
individual set elements with `boost::dynamic_bitset` and its highly
optimized, vectorized bit-wise operations for set manipulation.
* **Refactored Critical Loop:** Replaced the inefficient `eneighbors`
calculation loop, which previously used `std::unordered_set` and
`std::set`, with an improved version utilizing `boost::dynamic_bitset`
and vectorized bit-wise operations for set unions and differences.
* **Achieved Remarkable Speedups:** Delivered exceptional speedups
across various code families and configurations, reaching up to 132.88x
faster initialization in a Bivariate-Bicycle code benchmark, making
Tesseract even more robust and scalable.
---------
Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>
## Hashing Syndrome Patterns with `boost::dynamic_bitset`

In this PR, I address a key performance bottleneck: the hashing of fired-detector patterns (syndrome patterns). I introduce `boost::dynamic_bitset` from the Boost library, a data structure that combines the memory-saving bit-packing of `std::vector<bool>` with highly optimized, vectorized bit-wise operations. Crucially, `boost::dynamic_bitset` also provides highly optimized, built-in functions for hashing sequences of boolean elements.

### Initial Optimization: `std::vector<bool>` to `std::vector<char>`

The initial Tesseract implementation, as documented in #25, used `std::vector<bool>` to store the patterns of fired detectors and the predicates that block specific errors from being added to the current error hypothesis. While `std::vector<bool>` optimizes memory usage by packing elements into individual bits, accessing and modifying its elements is highly inefficient due to its reliance on proxy objects that perform costly bit-wise operations (shifting, masking). Given how frequently Tesseract accesses and modifies these elements, this caused significant performance overhead.

In #25, I transitioned from `std::vector<bool>` to `std::vector<char>`. This change made boolean elements addressable bytes, enabling efficient, direct byte-level access. Although it increased the memory footprint (each boolean was stored as a full byte), it delivered substantial performance gains by eliminating `std::vector<bool>`'s proxy objects and their associated access and modification overheads. The speedups achieved with this initial optimization were significant.

These performance gains highlight the importance of choosing appropriate data structures for boolean sequences, especially in performance-sensitive applications like Tesseract. The remarkable 42.5% speedup achieved in Surface Codes with this initial switch underscores the substantial overhead caused by unsuitable data structures: the gain from removing `std::vector<bool>`'s proxy objects and their inefficient operations far outweighed any overhead from the increased memory consumption.
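To make the proxy-object overhead concrete, here is a minimal sketch (not Tesseract code; the function names are illustrative) contrasting the two access patterns:

```cpp
#include <cstddef>
#include <vector>

// std::vector<bool> packs bits, so operator[] returns a proxy object and
// every read or write pays for shift-and-mask work on the underlying word.
void flip_all(std::vector<bool>& v) {
  for (std::size_t i = 0; i < v.size(); ++i) {
    v[i] = !v[i];  // goes through std::vector<bool>::reference (a proxy)
  }
}

// std::vector<char> stores one addressable byte per element, so access is a
// plain load/store that compilers optimize (and often vectorize) easily.
void flip_all(std::vector<char>& v) {
  for (std::size_t i = 0; i < v.size(); ++i) {
    v[i] = !v[i];  // direct byte access, no proxy
  }
}
```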
### Current Bottleneck: `std::vector<char>` and Hashing

Following the optimizations in #25, Tesseract continued to use `std::vector<char>` to store and manage the patterns of fired detectors and the predicates that block errors. Subsequently, PR #34 merged the vectors of blocked errors into the `DetectorCostTuple` structure, which efficiently stores `error_blocked` and `detectors_count` as `uint32_t` fields (for the reasons explained in #34). These changes left the vectors of fired detectors as the sole remaining `std::vector<char>` data structure in this context.

After implementing and evaluating the optimizations in #25, #27, and #34, profiling Tesseract for remaining bottlenecks revealed that, aside from the `get_detcost` function, a notable bottleneck had emerged: `VectorCharHash` (originally `VectorBoolHash`). This function hashes the patterns of fired detectors to prevent re-exploring previously visited syndrome states. Its implementation iterated through each element, byte by byte, accumulating the hash (a sketch of this pattern appears below). Even though this function sped up significantly with the initial switch from `std::vector<bool>` to `std::vector<char>`, hashing fired-detector patterns still consumed considerable time. Post-optimization profiling (after #25, #27, and #34) showed that this hashing function consumed approximately 25% of decoding time in Surface Codes, 30% in Transversal CNOT Protocols, 10% in Color Codes, and 2% in Bivariate-Bicycle Codes (`get_detcost` remained the primary bottleneck for Bivariate-Bicycle Codes). I therefore explored opportunities to further optimize this function and enhance the decoding speed.
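For concreteness, here is a hypothetical byte-by-byte hasher in the spirit of `VectorCharHash`; the exact Tesseract implementation may differ, but the per-element accumulation pattern is the point:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: every byte is visited individually and folded into
// the running hash, so each lookup costs one hash-combine step per detector.
struct VectorCharHashSketch {
  std::size_t operator()(const std::vector<char>& v) const {
    std::size_t h = 0;
    for (char c : v) {
      // Boost-style hash_combine step, applied once per byte.
      h ^= static_cast<std::size_t>(c) + 0x9e3779b9 + (h << 6) + (h >> 2);
    }
    return h;
  }
};
```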
### Solution: Introducing `boost::dynamic_bitset`

This PR addresses the performance bottleneck of hashing fired-detector patterns, and mitigates the increased memory footprint from the earlier switch to `std::vector<char>`, by introducing the `boost::dynamic_bitset` data structure. The C++ standard library's `std::bitset` offers an ideal conceptual solution: memory-efficient, bit-packed storage (like `std::vector<bool>`) combined with highly efficient, vectorized bit-wise operations that avoid the performance overhead of `std::vector<bool>`'s proxy objects. However, `std::bitset` requires a static size determined at compile time, rendering it unsuitable for Tesseract's dynamically sized syndrome patterns.

The Boost library's `boost::dynamic_bitset` provides the missing piece: dynamically sized bit arrays whose dimensions can be determined at runtime. It combines the memory efficiency of `std::vector<bool>` (packing elements into individual bits) with the performance benefits of vectorized bit-wise operations, achieved by internally storing bits in contiguous memory blocks and executing bit-wise operations across all elements of a block at once, thus avoiding the overheads of `std::vector<bool>`'s proxy objects and costly per-bit manipulation. Furthermore, `boost::dynamic_bitset` offers highly optimized, built-in hashing support, replacing our custom, less efficient byte-by-byte hashing and resulting in a cleaner, faster implementation, as sketched below.
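A sketch of what block-wise hashing can look like, assuming the syndrome pattern is stored in a `boost::dynamic_bitset<>`; the struct name is illustrative, and newer Boost versions may also expose hashing support for `dynamic_bitset` directly:

```cpp
#include <cstddef>
#include <vector>
#include <boost/container_hash/hash.hpp>
#include <boost/dynamic_bitset.hpp>

// Hash the underlying machine-word blocks in bulk rather than byte by byte:
// one hash-combine step per block (e.g. 64 detectors) instead of one per byte.
struct DynamicBitsetHash {
  std::size_t operator()(const boost::dynamic_bitset<>& bits) const {
    std::vector<boost::dynamic_bitset<>::block_type> blocks(bits.num_blocks());
    boost::to_block_range(bits, blocks.begin());  // copy out the raw blocks
    return boost::hash_range(blocks.begin(), blocks.end());
  }
};
```

A visited-states set could then be declared as, for example, `std::unordered_set<boost::dynamic_bitset<>, DynamicBitsetHash>`.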
### Performance Evaluation: Individual Impact of the Optimization

I performed two types of experiments to evaluate the achieved performance gains. First, I conducted extensive benchmarks across various code families and configurations to evaluate the individual gains from this specific optimization. These results show the highest impact in Surface Codes and Transversal CNOT Protocols, which aligns with the initial profiling data showing that these code families spent the most time in the original `VectorCharHash` function.

(Plots: speedups in Surface Codes, Transversal CNOT Protocols, Color Codes, and Bivariate-Bicycle Codes.)
### Performance Evaluation: Cumulative Speedup

Following the evaluation of individual performance gains, I analyzed the cumulative effect of the optimizations implemented across PRs #25, #27, and #34 together with this one. These results demonstrate that my optimizations achieved over 2x speedup in Color Codes, over 2.5x speedup in Surface Codes and Transversal CNOT Protocols, and over 5x speedup in Bivariate-Bicycle Codes.

(Plots: cumulative speedups in Color Codes, Bivariate-Bicycle Codes, Surface Codes, and Transversal CNOT Protocols.)
### Conclusion

These results demonstrate that the `boost::dynamic_bitset` optimization significantly impacts the code families where the original hashing function (`VectorCharHash`) was a primary bottleneck (Surface Codes and Transversal CNOT Protocols). The substantial speedups achieved in these code families validate that `boost::dynamic_bitset` provides demonstrably more efficient hashing and bit-wise operations. For the code families where hashing was less of a bottleneck (Color Codes and Bivariate-Bicycle Codes), the speedups were modest, reinforcing that `std::vector<char>` can remain highly efficient, even with its increased memory usage, when bit packing is not the primary performance concern. Crucially, this optimization delivers comparable or superior performance to `std::vector<char>` while simultaneously reducing the memory footprint, providing additional speedups where hashing performance is critical.

### Key Contributions

* Identified `boost::dynamic_bitset` as a superior data structure, combining `std::vector<bool>`'s memory efficiency with high-performance, vectorized bit-wise operations and efficient built-in hashing.
* Replaced `std::vector<char>` with `boost::dynamic_bitset` for storing syndrome patterns.