Conversation

@draganaurosgrbic (Contributor) commented Jul 25, 2025

Hashing Syndrome Patterns with boost::dynamic_bitset

In this PR, I address a key performance bottleneck: the hashing of fired detector patterns (syndrome patterns). I introduce boost::dynamic_bitset from the Boost library, a data structure that combines the memory-saving bit packing of std::vector<bool> with highly optimized, vectorized bit-wise operations. Crucially, boost::dynamic_bitset also provides efficient built-in support for hashing sequences of boolean elements.


Initial Optimization: std::vector<bool> to std::vector<char>

The initial Tesseract implementation, as documented in #25, used std::vector<bool> to store patterns of fired detectors and predicates that block specific errors from being added to the current error hypothesis. While std::vector<bool> optimizes memory usage by packing elements into individual bits, accessing and modifying its elements is inefficient because every access goes through a proxy object that performs costly bit-wise operations (shifting, masking). Given how frequently Tesseract accesses and modifies these elements, this caused significant performance overhead.
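To illustrate the difference, here is a minimal standalone sketch (not Tesseract code) contrasting the two access paths:

```cpp
#include <cstddef>
#include <vector>

// std::vector<bool> packs 8 elements per byte, so operator[] returns a proxy
// object (std::vector<bool>::reference) that shifts and masks the underlying
// word on every read or write. std::vector<char> trades 8x more memory for
// plain byte loads and stores.
void flip_all(std::vector<bool>& packed, std::vector<char>& bytes) {
  for (std::size_t i = 0; i < packed.size(); ++i) {
    packed[i] = !packed[i];  // proxy: read-modify-write with shift/mask
  }
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    bytes[i] = !bytes[i];    // direct byte access for a 0/1 flag, no proxy
  }
}
```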

In #25, I transitioned from std::vector<bool> to std::vector<char>. This change made boolean elements addressable bytes, enabling efficient and direct byte-level access. Although this increased the memory footprint (each boolean was stored as a full byte), it delivered substantial performance gains by eliminating std::vector<bool>'s proxy objects and their associated overhead for element access and modification. Speedups achieved with this initial optimization were significant:

  • For Color Codes, speedups reached 17.2%-32.3%
  • For Bivariate-Bicycle Codes, speedups reached 13.0%-22.3%
  • For Surface Codes, speedups reached 33.4%-42.5%
  • For Transversal CNOT Protocols, speedups reached 12.2%-32.4%

These significant performance gains highlight the importance of choosing appropriate data structures for boolean sequences, especially in performance-sensitive applications like Tesseract. The remarkable 42.5% speedup achieved in Surface Codes with this initial switch underscores the substantial overhead caused by unsuitable data structures. The performance gain from removing std::vector<bool>'s proxy objects and their inefficient operations far outweighed any overhead from increased memory consumption.


Current Bottleneck: std::vector<char> and Hashing

Following the optimizations in #25, Tesseract continued to use std::vector<char> for storing and managing patterns of fired detectors and predicates that block errors. Subsequently, PR #34 merged the vectors of blocked errors into the DetectorCostTuple structure, which stores error_blocked and detectors_count as uint32_t fields (for the reasons explained in #34). These changes left the vectors of fired detectors as the sole remaining std::vector<char> data structure in this context.
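For reference, a minimal sketch of the layout described above (the authoritative definition is in #34; anything beyond the two uint32_t fields named here is an assumption):

```cpp
#include <cstdint>

// Sketch of DetectorCostTuple as described in this PR: two uint32_t fields
// kept side by side, so the pair stays compact and cache-friendly.
struct DetectorCostTuple {
  uint32_t error_blocked;    // predicate blocking this error from the hypothesis
  uint32_t detectors_count;  // per-error detector counter used during decoding
};
```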

After implementing and evaluating the optimizations in #25, #27, and #34, I profiled Tesseract to analyze the remaining bottlenecks. Aside from the get_detcost function, a notable bottleneck emerged: VectorCharHash (originally VectorBoolHash). This function hashes patterns of fired detectors to prevent re-exploring previously visited syndrome states; its implementation iterated through the pattern element by element, byte by byte, accumulating the hash. Even though this function sped up significantly with the initial switch from std::vector<bool> to std::vector<char>, hashing fired detector patterns still consumed considerable time: approximately 25% of decoding time in Surface Codes, 30% in Transversal CNOT Protocols, 10% in Color Codes, and 2% in Bivariate-Bicycle Codes (where get_detcost remained the primary bottleneck). I therefore decided to explore opportunities to further optimize this function and enhance decoding speed.
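The shape of that byte-by-byte hash is roughly as follows (an illustrative sketch; the functor name and mixing constant are placeholders, not Tesseract's exact code):

```cpp
#include <cstddef>
#include <vector>

// One hash-mix step per element: cost grows linearly with the number of
// detectors, which is why this loop showed up prominently in profiles.
struct VectorCharHashSketch {
  std::size_t operator()(const std::vector<char>& fired) const {
    std::size_t h = 0;
    for (char c : fired) {
      h = h * 31 + static_cast<unsigned char>(c);
    }
    return h;
  }
};
```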


Solution: Introducing boost::dynamic_bitset

This PR addresses the performance bottleneck of hashing fired detector patterns, and mitigates the increased memory footprint from the earlier switch to std::vector<char>, by introducing the boost::dynamic_bitset data structure. The C++ standard library's std::bitset offers the ideal conceptual solution: memory-efficient bit-packed storage (like std::vector<bool>) combined with highly efficient bit-wise operations that avoid the overhead of std::vector<bool>'s proxy objects. However, std::bitset requires its size to be fixed at compile time, rendering it unsuitable for Tesseract's dynamically sized syndrome patterns.
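The size constraint is easy to see in a minimal sketch:

```cpp
#include <bitset>
#include <cstddef>
#include <boost/dynamic_bitset.hpp>

void size_example(std::size_t num_detectors) {
  std::bitset<1024> fixed;                          // capacity is a template
  fixed.set(0);                                     // parameter, fixed at compile time
  boost::dynamic_bitset<> syndrome(num_detectors);  // capacity chosen at runtime,
  syndrome.set(0);                                  // storage still bit-packed
}
```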

The Boost library's boost::dynamic_bitset provides the perfect solution by offering dynamic-sized bit arrays whose dimensions can be determined at runtime. This data structure brilliantly combines the memory efficiency of std::vector<bool> (by packing elements into individual bits) with the performance benefits of vectorized bit-wise operations. This is achieved by internally storing bits within contiguous memory blocks and executing vectorized bit-wise operations across all elements from the same block, thus avoiding the overheads of std::vector<bool>'s proxy objects and costly bit-wise operations. Furthermore, boost::dynamic_bitset offers highly optimized, built-in hashing functions, replacing our custom, less efficient byte-by-byte hashing and resulting in a cleaner, faster implementation.
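As a hedged sketch of how such a hash can be computed block-wise (boost::to_block_range and boost::hash_range are real Boost APIs; whether Tesseract uses exactly this formulation is not shown here):

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>
#include <boost/dynamic_bitset.hpp>
#include <boost/functional/hash.hpp>

// Hash a dynamic_bitset one machine word at a time instead of one byte per
// detector: each mix step covers an entire block (typically 64 bits).
struct DynamicBitsetHash {
  std::size_t operator()(const boost::dynamic_bitset<>& bits) const {
    std::vector<boost::dynamic_bitset<>::block_type> blocks(bits.num_blocks());
    boost::to_block_range(bits, blocks.begin());  // export the packed blocks
    return boost::hash_range(blocks.begin(), blocks.end());
  }
};

// Usage sketch: a visited-syndrome set keyed by bit-packed patterns.
using SyndromeSet = std::unordered_set<boost::dynamic_bitset<>, DynamicBitsetHash>;
```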


Performance Evaluation: Individual Impact of Optimization

I performed two types of experiments to evaluate the achieved performance gains. First, I conducted extensive benchmarks across various code families and configurations to evaluate the individual performance gains achieved by this specific optimization. Speedups achieved include:

  • For Surface Codes: 8.0%-24.7%
  • For Transversal CNOT Protocols: 12.1%-26.8%
  • For Color Codes: 3.6%-7.0%
  • For Bivariate-Bicycle Codes: 0.5%-4.8%

These results show the highest impact in Surface Codes and Transversal CNOT Protocols, which aligns with the profiling data showing that these code families spent the most time in the original VectorCharHash function.


Speedups in Surface Codes

[plot omitted]

Speedups in Transversal CNOT Protocols

[plots omitted]

Speedups in Color Codes

[plots omitted]

Speedups in Bivariate-Bicycle Codes

[plots omitted]

Performance Evaluation: Cumulative Speedup

Following the evaluation of individual performance gains, I analyzed the cumulative effect of the optimizations implemented across PRs #25, #27, #34, and this PR. The cumulative speedups achieved are:

  • For Color Codes: 40.7%-54.8%
  • For Bivariate-Bicycle Codes: 41.5%-80.3%
  • For Surface Codes: 50.0%-62.4%
  • For Transversal CNOT Protocols: 57.8%-63.6%

These results demonstrate that my optimizations achieved over 2x speedup in Color Codes, over 2.5x speedup in Surface Codes and Transversal CNOT Protocols, and over 5x speedup in Bivariate-Bicycle Codes.


Speedups in Color Codes

[plots omitted]

Speedups in Bivariate-Bicycle Codes

[plots omitted]

Speedups in Surface Codes

[plot omitted]

Speedups in Transversal CNOT Protocols

[plot omitted]

Conclusion

These results demonstrate that the boost::dynamic_bitset optimization significantly impacts code families where the original hashing function (VectorCharHash) was a primary bottleneck (Surface Codes and Transversal CNOT Protocols). The substantial speedups achieved in these code families validate that boost::dynamic_bitset provides demonstrably more efficient hashing and bit-wise operations. For code families where hashing was less of a bottleneck (Color Codes and Bivariate-Bicycle Codes), the speedups were modest, reinforcing that std::vector<char> can remain highly efficient even with increased memory usage when bit packing is not the primary performance concern. Crucially, this optimization delivers comparable or superior performance to std::vector<char> while simultaneously reducing memory footprint, providing additional speedups where hashing performance is critical.


Key Contributions

  • Identified the hashing of syndrome patterns as the primary remaining bottleneck in Surface Codes and Transversal CNOT Protocols after prior optimizations (Replace std::vector<bool> with std::vector<char> for faster computations #25, Removing unnecessary std::vector copy operations #27, Accelerating 'get_detcost' function #34).
  • Adopted boost::dynamic_bitset as a superior data structure, combining std::vector<bool>'s memory efficiency with high-performance vectorized bit-wise operations and efficient built-in hashing.
  • Replaced std::vector<char> with boost::dynamic_bitset for storing syndrome patterns.
  • Performed extensive benchmarking to evaluate both the individual impact of this optimization and its cumulative effect with prior PRs.
  • Achieved significant individual speedups (e.g., 8.0%-24.7% in Surface Codes, 12.1%-26.8% in Transversal CNOT Protocols) and substantial cumulative speedups (over 2x in Color Codes, over 2.5x in Surface Codes and Transversal CNOT Protocols, and over 5x in Bivariate-Bicycle Codes).

draganaurosgrbic and others added 20 commits June 14, 2025 14:52
Signed-off-by: Dragana Grbic <[email protected]>
@LalehB (Collaborator) left a comment

LGTM!

@draganaurosgrbic merged commit 8699f2d into main Jul 30, 2025
4 checks passed
@draganaurosgrbic deleted the optimization-cpu branch July 30, 2025 18:16
@draganaurosgrbic removed the request for review from noajshu July 30, 2025 18:18
draganaurosgrbic added a commit that referenced this pull request Aug 7, 2025
### Description
This Pull Request introduces a substantial performance optimization to
_Tesseract_'s initialization phase. While previous efforts primarily
focused on enhancing the critical decoding speed, this work addresses an
identified bottleneck in the one-time setup/initialization process. I've
targeted a highly inefficient code segment and achieved remarkable
speedups.

---

### Background
Before _Tesseract_ can decode simulations/shots of quantum circuits, it
must first read and parse the quantum circuit model. This process
involves populating and constructing internal data structures essential
for decoding. For a given quantum circuit, _Tesseract_ performs this
initialization once, then utilizes the constructed data structures and
parsed model to decode multiple shots/simulations. As such, the
initialization phase hasn't been a primary focus for optimization, as
it's a one-time operation and generally not a major time sink compared
to the iterative decoding process. However, after achieving significant
performance gains in the decoding phase, I identified an opportunity to
further improve overall efficiency by optimizing a particularly
inefficient loop within initialization.

---

### Problem: Inefficient `eneighbors` Calculation
The primary bottleneck I identified within the initialization phase was
the loop responsible for calculating `eneighbors` (error neighbors).
This data structure determines, for each error, which detectors are
affected by its neighboring errors. The original implementation, shown
below, exhibited severe performance issues:
```cpp
  // Build a hash set of each error's detectors for O(1) membership tests.
  std::vector<std::unordered_set<size_t>> edets_sets(edets.size());
  for (size_t ei = 0; ei < edets.size(); ++ei) {
    edets_sets[ei] = std::unordered_set<size_t>(edets[ei].begin(), edets[ei].end());
  }
  for (size_t ei = 0; ei < num_errors; ++ei) {
    std::set<int> neighbor_set;
    for (int d : edets[ei]) {             // each detector of this error
      for (int oei : d2e[d]) {            // each error sharing that detector
        for (int od : edets_sets[oei]) {  // each detector of that neighboring error
          if (!edets_sets[ei].contains(od)) {
            neighbor_set.insert(od);      // ordered set: O(log n) per insert
          }
        }
      }
    }
    eneighbors[ei] = std::vector<int>(neighbor_set.begin(), neighbor_set.end());
  }
```
This implementation suffered from:
1. **High Computational Complexity:** The four nested loops resulted in
a complexity proportional to `num_errors` \* `detectors_per_error` \*
`errors_per_detector` \* `detectors_per_neighbor_error`.
2. **`std::set` and `std::unordered_set` Overheads:** Frequent `insert`
operations on `std::set` (logarithmic time) and `contains` checks on
`std::unordered_set` (average constant time, but with hashing and
memory-management overhead) became costly when executed a very large
number of times.

---

### Solution: Leveraging `boost::dynamic_bitset` for Efficient Set
Operations
Drawing from the successful application of `boost::dynamic_bitset` in
optimizing syndrome pattern hashing (as implemented in #57), I replaced
`std::set` and `std::unordered_set` in this critical initialization loop
with `boost::dynamic_bitset`. This significantly accelerated the
`eneighbors` calculation. As detailed in #57, `boost::dynamic_bitset`
offers memory efficiency similar to `std::vector<bool>` but provides
highly optimized bit-wise operations for manipulating elements. This is
achieved by packing individual bits/elements into contiguous memory
blocks and enabling a single bit-wise operation to be executed across
multiple elements from the same block simultaneously, leveraging CPU
vectorization. The optimized loop is shown below:
```cpp
  std::vector<boost::dynamic_bitset<>> edets_bitsets(num_errors,
                                                     boost::dynamic_bitset<>(num_detectors));
  for (size_t ei = 0; ei < num_errors; ++ei) {
    for (int d : edets[ei]) {
      edets_bitsets[ei][d] = 1;
    }
  }

  for (size_t ei = 0; ei < num_errors; ++ei) {
    boost::dynamic_bitset<> neighbor_set(num_detectors, false);
    for (int d : edets[ei]) {
      for (int oei : d2e[d]) {
        // Unify detectors from neighboring errors
        neighbor_set |= edets_bitsets[oei];
      }
    }
    // Remove detectors from error's own set
    neighbor_set &= ~edets_bitsets[ei];

    for (size_t d = neighbor_set.find_first(); d != boost::dynamic_bitset<>::npos;
         d = neighbor_set.find_next(d)) {
      eneighbors[ei].push_back(d);
    }
  }
```
This optimization significantly improves performance by:
1. **Reduced Nested Loops:** The code now contains three nested loops
(instead of the original four), substantially decreasing the total
number of iterations.
2. **Vectorized Bit-wise Operations:** `boost::dynamic_bitset` stores
bits in contiguous memory blocks, so a single bit-wise operation
processes an entire block (typically 64 bits) at once, effectively
performing vectorized set unions and differences (see the standalone
example after this list). This dramatically reduces the overhead of the
element-wise checks and insertions in the original implementation.
3. **Memory Efficiency:** `boost::dynamic_bitset` retains the
memory-saving bit-packing feature similar to `std::vector<bool>` while
eliminating the performance overhead stemming from `std::vector<bool>`'s
proxy objects and inefficient bit-level manipulations that operate on
individual elements separately.
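The set semantics of these bit-wise operators can be seen in isolation
(a minimal standalone example, unrelated to Tesseract's data):

```cpp
#include <cassert>
#include <boost/dynamic_bitset.hpp>

int main() {
  boost::dynamic_bitset<> a(128), b(128);
  a.set(3);
  b.set(3);
  b.set(70);
  a |= b;            // set union: one word-wide OR covers 64 elements at a time
  a &= ~b;           // set difference: removes every element of b from a
  assert(a.none());  // {3, 70} minus {3, 70} is empty
  return 0;
}
```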

---

### Impact and Performance Benchmarks
The speedups I achieved in the initialization function are remarkable
across various code families and configurations.

#### Before Optimization (Initial Times)
- **Color Codes:** 1.5 - 9 seconds
- **Bivariate-Bicycle Codes:** 6 - 17 seconds
- **Surface Codes:** 0.1 - 0.8 seconds
- **Transversal CNOT Protocols:** 0.2 - 7 seconds

#### After Optimization (Speedups)
- **Color Codes:** 95.89x to 128.18x (less than a second)
- **Bivariate-Bicycle Codes:** 106.43x to 132.88x (less than a second)
- **Surface Codes:** 26.41x to 36.07x (less than a second)
- **Transversal CNOT Protocols:** 17.51x to 43.07x (less than a second)

As shown by the initial times, the initialization previously did not
exceed 17 seconds for the benchmarks I performed, with Bivariate-Bicycle
codes having the highest overhead. However, since this operation is
performed once per quantum circuit (and Tesseract then uses the
initialized knowledge to decode multiple simulations/shots, where
performance is critical), even these initial times were acceptable.

Nevertheless, initialization times after this optimization fell below a
second for all tested code families and configurations: below 0.09
seconds for Color Codes, below 0.15 seconds for Bivariate-Bicycle Codes,
below 0.03 seconds for Surface Codes, and below 0.4 seconds for
Transversal CNOT Protocols. This dramatic reduction explains the
exceptionally high speedup factors; **the initialization phase is now
extremely fast.**

Below are plots that show the performance gains I achieved across
different code families and configurations.

<img width="1790" height="989" alt="color1"
src="https://github.com/user-attachments/assets/7f550ae7-b2a0-464c-80c5-c90094e45ec9"
/>

<img width="1790" height="989" alt="color2"
src="https://github.com/user-attachments/assets/87eb9b9c-2515-4027-9d96-0f9a3814a448"
/>

<img width="1790" height="989" alt="color3"
src="https://github.com/user-attachments/assets/8b0a1b24-9d6a-485a-a079-5e272c7b611f"
/>

<img width="1790" height="989" alt="bicycle1"
src="https://github.com/user-attachments/assets/420a9794-bf10-42bf-8ed0-580470aa70ac"
/>

<img width="1790" height="989" alt="bicycle2"
src="https://github.com/user-attachments/assets/6396faac-f3e3-4ce6-88ae-5e4c2af8200f"
/>

<img width="1790" height="989" alt="bicycle3"
src="https://github.com/user-attachments/assets/ed6f9565-2fb2-413d-b9f0-06cdd3624d2e"
/>

<img width="1790" height="989" alt="surface1"
src="https://github.com/user-attachments/assets/1029b27e-8b33-4051-9dd3-30f882d60dc2"
/>

<img width="1790" height="989" alt="trans1"
src="https://github.com/user-attachments/assets/76c6b166-64a7-4af9-a4c8-b7b5bc73cfa1"
/>

<img width="1790" height="989" alt="trans2"
src="https://github.com/user-attachments/assets/27da2186-f651-44db-8ab1-00c1f260d5b8"
/>

<img width="1790" height="989" alt="trans3"
src="https://github.com/user-attachments/assets/9bc57295-509a-4dad-a3b1-42457b563285"
/>

---

### Conclusion
This optimization to the initialization function demonstrates the
substantial performance gains achievable by refactoring inefficient
loops and leveraging advanced data structures like
`boost::dynamic_bitset`. It showcases how its highly optimized bit-wise
operations (enabling vectorized execution across multiple elements at
once) can be used to implement highly efficient set operations (union,
difference). The resulting remarkable speedups further enhance the
overall efficiency of the Tesseract decoder.

---

### Key Contributions
* **Identified and Investigated:** Pinpointed a critical
inefficiency/loop within the initialization function that was consuming
significant time.
* **Leveraged Advanced Data Structures:** Applied the approach proven in
the optimization in #57 to code logic that frequently manipulates set
elements, leveraging `boost::dynamic_bitset`'s highly optimized,
vectorized bit-wise operations to perform the set operations.
* **Refactored Critical Loop:** Replaced the inefficient `eneighbors`
calculation loop, which previously used `std::unordered_set` and
`std::set`, with an improved version utilizing `boost::dynamic_bitset`
and vectorized bit-wise operations for set unions and differences.
* **Achieved Remarkable Speedups:** Delivered exceptional speedups
across various code families and configurations, reaching up to 132.88x
faster initialization in a Bivariate-Bicycle code benchmark, making
Tesseract even more robust and scalable.

---------

Signed-off-by: Dragana Grbic <[email protected]>
Co-authored-by: noajshu <[email protected]>
Co-authored-by: LaLeh <[email protected]>
@NoureldinYosri mentioned this pull request Sep 5, 2025