Skip to content

Conversation

@weslleyspereira
Copy link

Adapt tests from #3 for the new interface of xGGQRCS.

This bug was introduced in commit 446ac9a2.
Make the test slightly more robust by using non-diagonal test matrices.
This bug was discovered by the random test from commit dcd04cd.
Minimal triggering example with m=1, n=2, p=1 with non-random entries:
  A = [1, 1],
  B = [1, 0].
Fix a problem where double precision code was built when single
precision was requested.
Fix harmless out-of-bounds accesses to avoid spurious address sanitizer
(ASAN) failures.
Accidentally, the code check for infinity when finite numbers were
expected.
christoph-conrads and others added 29 commits April 27, 2021 21:38
Fix the U2 column permutation when the pre-processing of the matrix B is
enabled and rank(B) is larger than p - rank(B).
The pre-processing may make the computation of some angles by the CS
decomposition superfluous. These missing angles should properly computed
for any combination of pre-processing flags starting with this commit.
The tests are still not passing because the pre-processing of B is still
broken when B has full rank because the bottom right element of B
containing the elementary reflector is overwritten by the scalar
belonging to the elementary reflector
* add hints for pre-processing of input matrices A, B
* add parameter for matrix X and its leading dimension LDX
  (this simplifies workspace management a bit as well)
* reduce workspace size by evaluating matrix ranks
* simplify workspace management
* reserve workspace for scalar factors of QR factorization of A, B
  instead of re-using ALPHA, BETA
* avoid unnecessary xGEMM (instead of xTRMM ) call when assembling X
* insert debugging code overwriting arrays with NaNs
Whenever the norm of B was zero, the calculated norm of G would be NaN.
The new checks for NaN caught this problem immediately.
Also swap the pre-processing hints when swapping the order of the input
matrices.
Fix a stupid bug caused by replacing the identifier `L` with `RANK`
causing a multiplication with a matrix from the *R*ight instead of the
*L*eft side. Yesterday several more instances of this bug were fixed;
the fixes were squashed into the commit changing the SGGQRCS API.

This commit fixes the pre-processing of the input matrix A.
The matrix pre-processing passes all C++ tests in the branch
christoph-conrads/qr+cs-gsvd now.
When pre-processing of B is enabled, fix the column order of U2
immediately after multiplying U2 with the orthogonal factor of the QR
decomposition of B instead of performing other, unrelated computations
in between.
Streamline, comment, and speed up the code adjusting the singular values
and the rows of the matrix X for the matrix scaling.

* Exploit pre-processing information to avoid calls to expensive
  trigonometric functions when results are known to be zero or one.
* Avoid having to move angles computed by the CS decomposition
* add extensive commments
* allow more matrices with more rows than columns for testing
  pre-processing
* halve maximum xGGQRCS matrix size
SORMQR requires a workspace of size 4000+ for 1x1 input (bug?). Luckily,
the call can be omitted.
With the new pre-processing enabled, some matrices have a larger
backward error than before. Note that the constant factor in the
tolerance expression was doubled but the matrix dimension factor
reduced.
Show the output every 60 seconds in the infinite xGGQRCS test unless an
iteration takes more than 60 seconds.
Both test problems were found by the random GSVD test.
The algorithm is a simple blocked matrix transposition. In comparison to
a simple column-wise read, this approach reduces level 1 cache misses in
Cachegrind from 17% to 12%; the last-level cache performance is near 6%
for both solutions.

Cachegrind cache configuration
* level 1: 2048 bytes, 8-way associative, 64 bytes cache line size
* last level: 8192 bytes, 16-way associative, 64 bytes cache line size

Cachegrind command line:
```
valgrind --tool=cachegrind --D1=2048,8,64 --LL=8192,16,64 ./a.out
```
Previously, xGGQRCS would rely exclusively on QR factorizations for
pre-processing the input matrices. These operations were numerically
stable but if G = [A; B] had a large number of columns, then xGGQRCS
would be significantly slowed down so an LQ decomposition of G was
calculated but this caused backward errors 45 times larger than the
tolerance for matrix pairs as small as 3x20. These observations match
the numerical linear algebra theory saying that the LQ decomposition has
a bounded norm-wise round-off error in every column but there are no
such bounds for individual columns, cf. Higham: "Accuracy and Stability
of Numerical Algorithms" (2nd edition), 2002, Section 19.4 "Pivoting and
Row-Wise Stability".

There are several possible solutions:
* An LQ factorization with row pivoting. There is no such algorithm in
  LAPACK and porting the existing QR factorization with column pivoting
  in xGEQP3 will give horrible performance because xGEQP3 relies on
  matrix columns and elementary reflectors being laid out sequentially
  in memory.
* An in-place matrix transposition followed by a call to xGEQP3 followed
  by another in-place matrix transposition. In-place matrix
  transposition is not implemented in LAPACK; implementing this
  functionality is no trivial and requires a good understanding of
  modern computer architecture and number theory.
* A matrix transposition (not in place) followed by a call to xGEQP3
  followed by another matrix transposition. This is possible with the
  extra memory available when X must be stored. This is the solution
  implemented in this commit.
The C++ code was used in the performance measurements for SGETRP.
Compute the LQ decomposition with row-pivoting immediately.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants