Fixing the performance issue/bug in copying large vectors when `--at-most-two-errors-per-detector` flag enabled #45
Conversation
LalehB left a comment:
Thanks, Dragana!
Could you include some benchmarking results for both performance and accuracy? It would be good to quantify the impact of this change.
@LalehB Please see the updated description of the PR. It should answer all of your questions.
Thank you, @draganaurosgrbic, for addressing the comments. I noticed that you're reporting the number of low-confidence events to quantify accuracy. Could you also include the error count for completeness?
This is a pretty incredible speedup when the flag is enabled! Although the flag does affect accuracy, I wonder if we couldn't use a similar technique but possibly allow more than 2. I also wonder just how much accuracy is lost for, e.g., the large color codes.
+1, could you share the error rates as well, please?
@LalehB Please check the PR again, I added the plots that compare the speed and accuracy with and without using the `--at-most-two-errors-per-detector` flag.
@LalehB Please see the updated description of the PR; it explains the change/improvement in the accuracy of the decoder. There was a bug before this PR that occurred only when using this flag. This PR also fixes that bug.
Thank you @draganaurosgrbic for the update to this PR. I am very confused by the data chart that you included.
@LalehB Yes, when I performed these benchmarks, I was getting lower performance when using this flag, but somewhat better accuracy (check the graphs for the accuracy/low-confidence counts).
@LalehB Please also note the two very important contributions of this PR: it completely fixes the performance bug of copying large vectors, and it fixes a bug we didn't know existed before (inconsistent updates on `next_detector_cost_tuples` and `next_next_detector_cost_tuples`).
@draganaurosgrbic I thought the accuracy improvement was because of the bug fix (the inconsistent updates on `next_detector_cost_tuples` and `next_next_detector_cost_tuples`); however, would you please explain to me how the `--at-most-two-errors-per-detector` flag itself improves accuracy?
@LalehB Please check the PR description again, there are two accuracy changes/improvements I analyzed. The first one is before and after I fixed the performance issue/bug (the inconsistent updates on `next_detector_cost_tuples` and `next_next_detector_cost_tuples`). The second one is with and without the `--at-most-two-errors-per-detector` flag after that fix, i.e., the impact of the heuristic itself.
@LalehB The first analysis/comparison of accuracy is the comparison before and after I fixed the bug. I realized there was a bug because, when I completely removed the redundant `next_next_detector_cost_tuples` vector, the accuracy of the decoder improved.
@draganaurosgrbic I think it would be good to touch base on some of the project goals here.
@LalehB Thank you for your feedback. Maybe my PR description is not clear enough. Please read this: no, this PR has two logical parts: the performance/bug fix, and the analysis of the `--at-most-two-errors-per-detector` heuristic itself.
@LalehB Please check the PR description again: these graphs show the accuracy improvement coming from the performance fix (and, implicitly, the bug fix). They are already included in the PR description.
@LalehB Please check the PR description again: these graphs show the accuracy improvement coming from the heuristic itself. They are already included in the PR description.
@LalehB Hope this makes sense now. All of these graphs are already included in the PR description. I thought the logical flow of all of this was clear from the description; I have restructured the description, and it should be clearer now.
@LalehB @noajshu The PR description is updated, including more comprehensive data for analyzing the impact of the flag on performance and accuracy, a summary list of the major contributions, and the results for the benchmark we looked at during our last meeting. I have also updated the graphs to be clearer. If you run into any issues, please let me know.
LalehB left a comment:
LGTM!
### Hashing Syndrome Patterns with `boost::dynamic_bitset`

In this PR, I address a key performance bottleneck: the hashing of fired-detector patterns (syndrome patterns). I introduce `boost::dynamic_bitset` from the Boost library, a data structure that combines the memory-saving bit packing of `std::vector<bool>` with fast access and modification like `std::vector<char>`, and that additionally provides highly optimized, built-in hashing of boolean sequences.

---

### Initial Optimization: `std::vector<bool>` to `std::vector<char>`

The initial _Tesseract_ implementation, as documented in #25, used `std::vector<bool>` to store patterns of fired detectors and predicates that block specific errors from being added to the current error hypothesis. While `std::vector<bool>` optimizes memory usage by packing elements into individual bits, accessing and modifying its elements is highly inefficient due to its reliance on proxy objects that perform costly bit-wise operations (shifting, masking). Given _Tesseract_'s frequent access to and modification of these elements, this caused significant performance overhead.

In #25, I transitioned from `std::vector<bool>` to `std::vector<char>`. This change made boolean elements addressable bytes, enabling efficient, direct byte-level access. Although this increased the memory footprint (each boolean was stored as a full byte), it delivered substantial performance gains by eliminating `std::vector<bool>`'s proxy objects and their associated overhead for element access and modification. Speedups achieved with this initial optimization were significant:

* For Color Codes, speedups reached 17.2%-32.3%
* For Bivariate-Bicycle Codes, speedups reached 13.0%-22.3%
* For Surface Codes, speedups reached 33.4%-42.5%
* For Transversal CNOT Protocols, speedups reached 12.2%-32.4%

These gains highlight the importance of choosing appropriate data structures for boolean sequences in performance-sensitive applications like _Tesseract_. The remarkable 42.5% speedup in Surface Codes underscores the substantial overhead caused by an unsuitable data structure: the gain from removing `std::vector<bool>`'s proxy objects and their inefficient operations far outweighed any overhead from increased memory consumption. The sketch below illustrates the access-pattern difference.
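To make the difference concrete, here is a minimal illustrative sketch (not code from _Tesseract_; the function names are invented for the example) contrasting element access in the two representations:

```cpp
#include <cstddef>
#include <vector>

// With std::vector<bool>, operator[] returns a proxy object, so every
// read and write performs shift/mask work on a packed word.
void toggle_packed(std::vector<bool>& fired, std::size_t i) {
  fired[i] = !fired[i];  // proxy read (shift + mask), then proxy write
}

// With std::vector<char>, each flag is an addressable byte, so the same
// update is a plain byte load and store.
void toggle_bytes(std::vector<char>& fired, std::size_t i) {
  fired[i] = !fired[i];  // direct byte access
}
```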
---

### Current Bottleneck: `std::vector<char>` and Hashing

Following the optimizations in #25, _Tesseract_ continued to use `std::vector<char>` to store and manage patterns of fired detectors and predicates that block errors. Subsequently, PR #34 merged the vectors of blocked errors into the `DetectorCostTuple` structure, which stores `error_blocked` and `detectors_count` as `uint32_t` fields (for the reasons explained in #34). These changes left the vectors of fired detectors as the sole remaining `std::vector<char>` data structure in this context.

After implementing and evaluating the optimizations in #25, #27, #34, and #45, profiling _Tesseract_ for remaining bottlenecks revealed that, aside from the `get_detcost` function, a notable bottleneck remained: `VectorCharHash` (originally `VectorBoolHash`). This function hashes patterns of fired detectors to prevent re-exploring previously visited syndrome states. Its implementation iterated through each element, byte by byte, accumulating the hash. Even though this function saw significant speedups with the initial switch from `std::vector<bool>` to `std::vector<char>`, hashing patterns of fired detectors still consumed considerable time. Post-optimization profiling (after #25, #27, #34, and #45) showed that this hashing function consumed approximately 25% of decoding time in Surface Codes, 30% in Transversal CNOT Protocols, 10% in Color Codes, and 2% in Bivariate-Bicycle Codes (`get_detcost` remained the primary bottleneck for Bivariate-Bicycle Codes). I therefore explored opportunities to further optimize this function and enhance decoding speed.

---

### Solution: Introducing `boost::dynamic_bitset`

This PR addresses the hashing bottleneck, and mitigates the increased memory footprint from the earlier switch to `std::vector<char>`, by introducing `boost::dynamic_bitset`. The C++ standard library's `std::bitset` offers the ideal conceptual solution: memory-efficient bit-packed storage (like `std::vector<bool>`) combined with efficient access and modification (like `std::vector<char>`), achieved through optimized whole-word bit-wise operations rather than per-element proxy objects. However, `std::bitset` requires a size fixed at compile time, making it unsuitable for _Tesseract_'s dynamically sized syndrome patterns.

`boost::dynamic_bitset` removes that restriction by providing bit arrays whose dimensions are determined at runtime. It combines the memory efficiency of `std::vector<bool>` (packing elements into individual bits) with the performance benefits of near-direct element access, by internally storing bits in a contiguous array of fundamental integer types (e.g., `unsigned long` or `uint64_t`) and accessing them with optimized bit-wise operations, avoiding the overhead of `std::vector<bool>`'s proxy objects. Furthermore, `boost::dynamic_bitset` offers an optimized, built-in hash, replacing our custom, less efficient byte-by-byte hashing and resulting in a cleaner, faster implementation.
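As a minimal sketch of how the new representation can be used to deduplicate syndrome patterns (an illustration, not the actual _Tesseract_ code; it assumes Boost ≥ 1.71, where `dynamic_bitset` gained its built-in `hash_value` overload):

```cpp
#include <unordered_set>

#include <boost/dynamic_bitset.hpp>
#include <boost/functional/hash.hpp>

int main() {
  const std::size_t num_detectors = 1000;

  // Bit-packed like std::vector<bool>, but sized at runtime and backed
  // by a contiguous array of machine words.
  boost::dynamic_bitset<> syndrome(num_detectors);
  syndrome.set(3);    // detector 3 fired
  syndrome.flip(42);  // single word-level operation, no proxy objects

  // boost::hash picks up dynamic_bitset's hash_value() overload, which
  // hashes word by word instead of byte by byte as VectorCharHash did.
  std::unordered_set<boost::dynamic_bitset<>,
                     boost::hash<boost::dynamic_bitset<>>> visited;
  visited.insert(syndrome);

  // A previously seen pattern is rejected without re-exploration.
  const bool already_seen = !visited.insert(syndrome).second;
  return already_seen ? 0 : 1;
}
```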
---

### Performance Evaluation: Individual Impact of Optimization

I performed two types of experiments to evaluate the achieved performance gains. First, I conducted extensive benchmarks across various code families and configurations to measure the individual gains from this specific optimization. Speedups achieved include:

* For Surface Codes: 8.0%-24.7%
* For Transversal CNOT Protocols: 12.1%-26.8%
* For Color Codes: 3.6%-7.0%
* For Bivariate-Bicycle Codes: 0.5%-4.8%

The impact is highest in Surface Codes and Transversal CNOT Protocols, which aligns with the profiling data showing that these code families spent the most time in the original `VectorCharHash` function.

---

#### Speedups in Surface Codes

![img1](https://github.com/user-attachments/assets/04044da5-a980-4282-a6fe-4debfa815f41)

---

#### Speedups in Transversal CNOT Protocols

![img2](https://github.com/user-attachments/assets/f79e4d7d-5cfc-4077-be1a-13ef92a2d65a)
![img3](https://github.com/user-attachments/assets/35a9b672-07d3-45ea-9334-23dd85760925)

---

#### Speedups in Color Codes

![img4](https://github.com/user-attachments/assets/2b52c4fd-5137-47f0-9bae-7c667c740ff0)
![img5](https://github.com/user-attachments/assets/e7883dec-5a88-4b2b-914b-3d12a1843d6f)

---

#### Speedups in Bivariate-Bicycle Codes

![img6](https://github.com/user-attachments/assets/bd530a3b-da17-4ac1-bf68-702aaafe6047)
![img7](https://github.com/user-attachments/assets/2d2f2576-0b16-4f0a-b8a2-221723250945)

---

### Performance Evaluation: Cumulative Speedup

Following the evaluation of individual performance gains, I analyzed the cumulative effect of the optimizations implemented across PRs #25, #27, #34, and #45. The cumulative speedups achieved are:

* For Color Codes: 40.7%-54.8%
* For Bivariate-Bicycle Codes: 41.5%-80.3%
* For Surface Codes: 50.0%-62.4%
* For Transversal CNOT Protocols: 57.8%-63.6%

These results correspond to over 2x speedup in Color Codes, over 2.5x in Surface Codes and Transversal CNOT Protocols, and over 5x in Bivariate-Bicycle Codes.

---

#### Speedups in Color Codes

![img1](https://github.com/user-attachments/assets/cd81dc98-8599-4740-b00c-4ff396488f69)
![img2](https://github.com/user-attachments/assets/c337ddcf-44f0-4641-91df-2a6d3c586680)

---

#### Speedups in Bivariate-Bicycle Codes

![img3](https://github.com/user-attachments/assets/a57cf9e2-4c2c-44e8-8a6e-1860b1544cbd)
![img4](https://github.com/user-attachments/assets/fde60159-fd7f-4893-b30d-34da844ac452)

---

#### Speedups in Surface Codes

![img5](https://github.com/user-attachments/assets/57234d33-201b-41a9-b867-15e9ff87e666)

---

#### Speedups in Transversal CNOT Protocols

![img6](https://github.com/user-attachments/assets/5780843d-2055-4870-9454-50184a268ad1)

---

### Conclusion

These results demonstrate that the `boost::dynamic_bitset` optimization has the greatest impact in the code families where the original hashing function (`VectorCharHash`) was a primary bottleneck: Surface Codes and Transversal CNOT Protocols. The substantial speedups achieved there confirm that `boost::dynamic_bitset` provides demonstrably more efficient hashing and bit-wise operations. For code families where hashing was less of a bottleneck (Color Codes and Bivariate-Bicycle Codes), the speedups were modest, reinforcing that `std::vector<char>` can remain highly efficient, even with increased memory usage, when bit packing is not the primary performance concern.
Crucially, this optimization delivers comparable or superior performance to `std::vector<char>` while simultaneously reducing the memory footprint, providing additional speedups where hashing performance is critical.

---

### Key Contributions

* Identified the hashing of syndrome patterns as the primary remaining bottleneck in Surface Codes and Transversal CNOT Protocols after prior optimizations (#25, #27, #34, #45).
* Adopted `boost::dynamic_bitset`, which combines `std::vector<bool>`'s memory efficiency with high-performance bit-wise operations, access and modification comparable to `std::vector<char>`, and built-in hashing.
* Replaced `std::vector<char>` with `boost::dynamic_bitset` for storing syndrome patterns.
* Performed extensive benchmarking to evaluate both the individual impact of this optimization and its cumulative effect with prior PRs.
* Achieved significant individual speedups (e.g., 8.0%-24.7% in Surface Codes, 12.1%-26.8% in Transversal CNOT Protocols) and substantial cumulative speedups (over 2x in Color Codes, over 2.5x in Surface Codes and Transversal CNOT Protocols, and over 5x in Bivariate-Bicycle Codes).

PR #47 contains the scripts I used for benchmarking and plotting the results.

---------

Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>





## Fixing a Performance Bottleneck and a Related Bug When Using the `--at-most-two-errors-per-detector` Flag

This PR addresses a critical performance issue and a related bug that existed when the `--at-most-two-errors-per-detector` flag was enabled. The core of the problem was the highly inefficient copying of large `std::vector` data structures, which was necessary for state management (mentioned in #27) but became prohibitively expensive after previous optimizations enlarged these structures. This PR completely removes these costly copy operations and, in doing so, implicitly fixes a bug that was degrading decoder accuracy.

### The Performance Issue (and also a Bug)
This PR resolves a performance degradation that was present exclusively when the `--at-most-two-errors-per-detector` flag was used. While the code path for these redundant copy operations existed from the beginning, its performance impact was initially masked by other bottlenecks. After significant speedups were achieved elsewhere through the optimizations in PRs #25, #27, and #34, this flag-specific degradation became a major and prominent bottleneck; the cost of the copy operations also escalated as those PRs increased the size of the copied data structures. The current work resolves this last remaining degradation, ensuring that even when this flag is enabled, we achieve a level of speedup consistent with other scenarios.

### The Technical Solution
To solve this, I replaced the expensive copy-and-revert strategy with a more intelligent state-management method. Instead of making a copy, I now store a special value (`2`) in the `error_blocked` field of the `DetectorCostTuple` struct. This special value indicates that an error was blocked specifically due to the `--at-most-two-errors-per-detector` flag and should be unblocked (or "reverted") in the next search state. This is possible because the `error_blocked` field is a `uint32_t`, allowing it to hold values beyond a simple `true`/`false` (`1` or `0`). This new approach completely removes the need for the `next_next_detector_cost_tuples` vector: all changes are now made on a single vector (`next_detector_cost_tuples`), which is much more efficient.
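A minimal sketch of this marker-value mechanism (the struct fields follow the PR description; the helper functions and the exact blocking condition are hypothetical simplifications of the real search loop):

```cpp
#include <cstdint>
#include <vector>

struct DetectorCostTuple {
  uint32_t error_blocked;    // 0 = free, 1 = blocked normally,
                             // 2 = blocked only for the current state by
                             //     --at-most-two-errors-per-detector
  uint32_t detectors_count;
};

// Block an error for the current search state only.
void block_for_flag(std::vector<DetectorCostTuple>& tuples, std::size_t e) {
  if (tuples[e].error_blocked == 0) tuples[e].error_blocked = 2;
}

// Reverting is one cheap in-place pass that clears the markers, instead
// of copying the whole vector up front and restoring it afterwards.
void revert_flag_blocks(std::vector<DetectorCostTuple>& tuples) {
  for (DetectorCostTuple& t : tuples) {
    if (t.error_blocked == 2) t.error_blocked = 0;
  }
}
```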
### The Bug and its Fix

The old code contained a bug that was degrading the decoder's accuracy. The loop that updates the fired detector counts and blocks errors had an inconsistency when the `--at-most-two-errors-per-detector` flag was enabled: the number of fired detectors was updated on `next_detector_cost_tuples`, but the blocking of errors was performed on `next_next_detector_cost_tuples`. When the `get_detcost` function was called, it used `next_next_detector_cost_tuples`, which did not have the correct fired detector counts. This inconsistency explains why some benchmarks produced low-confidence results and errors (e.g., 3 errors in a Surface Code benchmark). The new code performs both updates on the single `next_detector_cost_tuples` vector.
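As a hypothetical reconstruction of the old and new loops (the iteration structure, the `errors_for_detector` helper, and the `c >= 2` condition are illustrative assumptions, not the actual _Tesseract_ source), the contrast looks like this:

```cpp
#include <cstdint>
#include <vector>

struct DetectorCostTuple { uint32_t error_blocked; uint32_t detectors_count; };

// Old version: the two halves of the state were written to *different*
// vectors, so get_detcost() later read stale detectors_count values.
void update_old(const std::vector<std::vector<std::size_t>>& errors_for_detector,
                const std::vector<std::size_t>& fired_detectors,
                bool at_most_two_errors_per_detector,
                std::vector<DetectorCostTuple>& next_detector_cost_tuples,
                std::vector<DetectorCostTuple>& next_next_detector_cost_tuples) {
  for (std::size_t d : fired_detectors) {
    for (std::size_t e : errors_for_detector[d]) {
      uint32_t c = ++next_detector_cost_tuples[e].detectors_count;   // counted here...
      if (at_most_two_errors_per_detector && c >= 2) {
        next_next_detector_cost_tuples[e].error_blocked = 1;         // ...blocked there
      }
    }
  }
}

// New version: a single vector receives both updates, and flag-induced
// blocks use the revertible marker value 2.
void update_new(const std::vector<std::vector<std::size_t>>& errors_for_detector,
                const std::vector<std::size_t>& fired_detectors,
                bool at_most_two_errors_per_detector,
                std::vector<DetectorCostTuple>& next_detector_cost_tuples) {
  for (std::size_t d : fired_detectors) {
    for (std::size_t e : errors_for_detector[d]) {
      uint32_t c = ++next_detector_cost_tuples[e].detectors_count;
      if (at_most_two_errors_per_detector && c >= 2) {
        next_detector_cost_tuples[e].error_blocked = 2;  // reverted in the next state
      }
    }
  }
}
```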
By completely removing the `next_next_detector_cost_tuples` vector and performing all modifications on a single vector, this PR implicitly fixes the bug and ensures consistency between the fired detector counts and the blocked-error flags.

### Benchmarking the Performance and Bug Fix
The benchmark results clearly show that a huge amount of time was lost on these expensive copy operations when using the `--at-most-two-errors-per-detector` flag. This was especially apparent in the Surface Code benchmark (r=11, p=0.002, 500 shots). The degradation escalated as previous optimizations increased the size of the copied data structures.

The bug fix also improved accuracy: the Surface Code benchmark (r=11, p=0.002, 500 shots) that previously produced 3 errors now produces 0 errors.
### Analyzing the Impact of the Flag Itself
With the performance and correctness issues fixed, I performed additional experiments to analyze the intrinsic impact of the `--at-most-two-errors-per-detector` flag. The goal was to understand its effect on performance and accuracy now that the underlying implementation is optimized. The results show consistent and surprising behavior: the flag provides somewhat better accuracy but lower performance. This is the opposite of a typical heuristic, which is meant to trade some accuracy for a performance gain.

To collect more comprehensive data, I performed additional experiments on various groups of code families. These experiments confirmed the initial findings: the flag provides better accuracy at the cost of performance, with a decoding slowdown ranging from 0.2% to 69%.
### Conclusion: This Heuristic is not an Optimization
I performed a final benchmark on a specific case that previously showed a performance benefit when using the flag. The command used for this benchmark was:

```
bazel build src:all && time ./bazel-bin/src/tesseract --pqlimit 200000 --beam 5 --num-det-orders 20 --sample-num-shots 20 --det-order-seed 13267562 --circuit testdata/colorcodes/r=9,d=9,p=0.002,noise=si1000,c=superdense_color_code_X,q=121,gates=cz.stim --sample-seed 717347 --threads 1 --print-stats
```

Before the optimizations in PR #34, the execution time was 75.91 seconds without the flag and 74.23 seconds with it. After applying the `get_detcost` optimizations from PR #34, the execution time is 69.01 seconds without the flag and 72.98 seconds with it.

This demonstrates that the speedup from the optimizations in PR #34 (a gain of ~6.9 seconds) far exceeds the speedup the heuristic flag initially provided (a gain of ~1.68 seconds). The flag's intended benefit of pruning the search space is now outweighed by the efficiency gains in the core decoding function, which is why the flag-specific degradation needed to be addressed directly; with it fixed, the optimization work provides a high speedup in all scenarios.

My conclusion is that the current version of the Tesseract algorithm is faster without this flag. The next logical step may be to remove the flag entirely, but I am leaving that decision to the original implementers of the flag.
### Key Contributions

* Removed the inefficient state-management logic used when the `--at-most-two-errors-per-detector` flag is enabled. This logic performed redundant copy operations on large vectors instead of a smarter method of reverting changes. This fixes the last remaining performance degradation that was specific to using the heuristic flag.
* Fixed a previously unknown bug (inconsistent updates on `next_detector_cost_tuples` and `next_next_detector_cost_tuples`) that occurred only when the `--at-most-two-errors-per-detector` flag was enabled and was causing accuracy issues and low-confidence results.
* Implemented a cheap revert mechanism based on a special `2` value in the `DetectorCostTuple` structure, made possible by previous data representation changes.