@ekg ekg commented Dec 3, 2025

Summary

This PR integrates three tools for pangenome graph construction:

  • sweepga: Fast all-vs-all alignment using FastGA with k-mer frequency filtering
  • seqwish: Graph induction via transitive closure computation (produces proper variation graphs where shared sequences are collapsed)
  • gfasort: Graph sorting using the Ygs pipeline (path-guided SGD + grooming + topological sort)

Features

  1. New graph command for standalone graph building from FASTA files

    • Supports multiple input files and file lists
    • Configurable k-mer frequency multiplier (-f, default: genomes × 10)
    • Configurable minimum alignment length (-l, default: 100bp)
  2. Query output supports gfa format (seqwish-based variation graphs)

    • gfa or gfa-seqwish: Proper variation graph with collapsed shared sequences (default)
    • gfa-poa: POA-based partial order alignment graph
  3. Automatic graph sorting using gfasort's Ygs pipeline:

    • Y = Path-guided SGD (positions nodes to minimize path distance differences)
    • g = Grooming (ensures consistent node orientations along paths)
    • s = Topological sort (linearizes graph respecting edge directions)

Dependencies

  • sweepga @ 63ed05d52c4ac282b94f4e6c67510d3c0d4f73d2
  • seqwish @ a3369b07370cd7406e00c64224152eef11bcc665 (rust-2 branch)
  • gfasort @ b55608abeec6121671b674aae20bcc2f940ef213

Test plan

  • cargo test passes
  • cargo clippy passes (no new warnings in changed files)
  • Manual testing of impg graph --fasta-files command
  • Manual testing of impg query -o gfa with seqwish output

@ekg ekg closed this Dec 11, 2025
@ekg ekg reopened this Dec 11, 2025
ekg added 11 commits December 11, 2025 11:59

Add support for building variation graphs from sequences using sweepga
for alignment and seqwish for graph induction.

Features:
- New `graph` command for standalone graph building from FASTA files
- Query output supports `gfa` (seqwish-based, default) and `gfa-poa` modes
- Seqwish produces proper variation graphs with collapsed shared sequences
- Configurable k-mer frequency multiplier (-f, default: genomes * 10)

Dependencies:
- sweepga @ 63ed05d52c4ac282b94f4e6c67510d3c0d4f73d2
- seqwish @ a3369b07370cd7406e00c64224152eef11bcc665 (rust-2 branch)

Sort GFA output by default using gfasort's Ygs pipeline, which performs:
- Path-guided SGD (positions nodes to minimize path distance differences)
- Grooming (ensures consistent node orientations along paths)
- Topological sort (linearizes graph respecting edge directions)

This is equivalent to running `gfasort -p Ygs` on the command line.
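
The mapping from pipeline letters to stages can be sketched as below. This is only an illustration of how a string such as `Ygs` decomposes into ordered steps; `SortStage` and `parse_pipeline` are hypothetical names and are not taken from gfasort's API.

```rust
/// Stages named in this PR for the `Ygs` pipeline string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortStage {
    /// 'Y': path-guided SGD
    PathGuidedSgd,
    /// 'g': grooming
    Groom,
    /// 's': topological sort
    TopologicalSort,
}

/// Hypothetical parser: maps each pipeline character to a stage, preserving order.
pub fn parse_pipeline(spec: &str) -> Result<Vec<SortStage>, String> {
    spec.chars()
        .map(|c| match c {
            'Y' => Ok(SortStage::PathGuidedSgd),
            'g' => Ok(SortStage::Groom),
            's' => Ok(SortStage::TopologicalSort),
            other => Err(format!("unknown pipeline stage '{other}'")),
        })
        .collect()
}
```

Under these assumptions, `parse_pipeline("Ygs")` yields the three stages in the order they are applied, and any letter outside `Y`, `g`, `s` is rejected.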

Changes:
- Add gfasort dependency @ b55608abeec6121671b674aae20bcc2f940ef213
- Add sort_gfa() helper function to src/graph.rs
- Sort output in generate_gfa_seqwish_from_intervals() (query interface)
- Sort output in build_graph() (graph command)

The sorting produces well-ordered graphs suitable for visualization and
downstream analysis.

Uses sweepga @ 7a7a8d2, which defaults to /dev/shm for FastGA temp files
when available, improving performance on systems with tmpfs.

Implements the `impg align` command, which generates alignment pairs from input sequences
with various sparsification strategies to reduce the O(n²) alignment complexity:

- none: All pairwise alignments
- random:X: Random sampling at fraction X
- giant:X: Erdős-Rényi random sampling with giant component guarantee (probability X)
- tree:k_near:k_far:rand: Tree-based sampling combining k-nearest neighbors,
  k-farthest (stranger-joining), and random edges using MinHash/Mash distances

The command outputs a job list for cluster execution or can run alignments
directly using sweepga/FastGA. The sparsification strategies are based on the
allwave implementation for efficient pangenome construction.
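
A minimal sketch of how the strategy strings listed above could be parsed. The enum, the function names, and the validation rules are assumptions for illustration: in particular, it assumes that `X` and `rand` are fractions in [0, 1] and that `k_near` and `k_far` are non-negative integers. It does not reproduce impg's actual parser.

```rust
/// Hypothetical representation of the sparsification strategies.
#[derive(Debug, Clone, PartialEq)]
pub enum Sparsification {
    None,
    Random(f64),
    Giant(f64),
    Tree { k_near: usize, k_far: usize, rand_fraction: f64 },
}

/// Parses a value assumed to be a fraction or probability in [0, 1].
fn parse_fraction(s: &str) -> Result<f64, String> {
    let x = s.parse::<f64>().map_err(|_| format!("invalid number '{s}'"))?;
    if (0.0..=1.0).contains(&x) {
        Ok(x)
    } else {
        Err(format!("'{s}' is outside [0, 1]"))
    }
}

/// Parses `none`, `random:X`, `giant:X`, or `tree:k_near:k_far:rand`.
pub fn parse_sparsification(spec: &str) -> Result<Sparsification, String> {
    let parts: Vec<&str> = spec.split(':').collect();
    match parts.as_slice() {
        ["none"] => Ok(Sparsification::None),
        ["random", x] => Ok(Sparsification::Random(parse_fraction(x)?)),
        ["giant", x] => Ok(Sparsification::Giant(parse_fraction(x)?)),
        ["tree", near, far, r] => Ok(Sparsification::Tree {
            k_near: near
                .parse::<usize>()
                .map_err(|_| format!("invalid k_near '{near}'"))?,
            k_far: far
                .parse::<usize>()
                .map_err(|_| format!("invalid k_far '{far}'"))?,
            rand_fraction: parse_fraction(r)?,
        }),
        _ => Err(format!("unrecognized sparsification '{spec}'")),
    }
}
```

For example, `parse_sparsification("tree:3:2:0.1")` would select three nearest neighbors, two farthest neighbors, and a 10% random edge fraction under the interpretation assumed here.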
- Add 1:1 alignment filtering with scaffold-based chaining (default)
- Count genomes (SAMPLE#HAPLOTYPE prefixes) instead of sequences for k-mer frequency
- Add CLI options: --no-filter, --num-mappings, --scaffold-jump/mass/filter, --overlap, --min-identity
- Update README with filtering documentation and tutorial improvements
- Pin seqwish to b94ba0f (cleanup function + /dev/shm temp dir preference)
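
The genome-based default for the k-mer frequency threshold can be illustrated as follows. This is a sketch under stated assumptions: sequence names follow the `SAMPLE#HAPLOTYPE#CONTIG` convention, names with fewer than two `#` separators count as their own genome, and the function names are hypothetical rather than impg's own.

```rust
use std::collections::HashSet;

/// Returns the `SAMPLE#HAPLOTYPE` prefix of a sequence name.
/// Assumption: a name with fewer than two '#' separators is treated as its own genome.
pub fn genome_prefix(name: &str) -> &str {
    let mut hashes = name.match_indices('#');
    match (hashes.next(), hashes.next()) {
        // Cut just before the second '#'.
        (Some(_), Some((second, _))) => &name[..second],
        _ => name,
    }
}

/// Default k-mer frequency threshold: number of distinct genomes times 10.
pub fn default_kmer_frequency<'a>(names: impl IntoIterator<Item = &'a str>) -> usize {
    let genomes: HashSet<&str> = names.into_iter().map(genome_prefix).collect();
    genomes.len() * 10
}
```

With four sequences from three genomes (say two `S288C#1` chromosomes plus one each from `SK1#1` and `Y12#1`), the default comes out as 30 rather than 40, which is the difference between counting genomes and counting sequences.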
The fix in seqwish 0448c4d corrects a path length mismatch bug in
GFA emission that was causing lace validation failures. The Rust
implementation now matches the C++ behavior by iterating through
each base position and adding nodes when crossing node boundaries.

When TMPDIR is not set, use /dev/shm as the default temp directory for
seqwish and other temp file operations. This provides significant I/O
performance improvements by using RAM-based storage instead of disk.

The previous /dev/shm default caused temp files to accumulate in RAM
and could fill up shared memory on systems with limited /dev/shm space.
Now defaults to system temp (/tmp) which is safer. Users can still get
the /dev/shm performance benefit by explicitly passing --temp-dir /dev/shm.
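
The resulting temp-directory choice might look like the sketch below. The precedence shown (explicit `--temp-dir`, then `TMPDIR`, then `/tmp`) is an assumption inferred from these commit messages, and `resolve_temp_dir` is a hypothetical helper, not a function from impg. The environment value is passed in as a parameter so the logic can be tested without touching the process environment.

```rust
use std::path::PathBuf;

/// Treats an empty string the same as an unset value.
fn non_empty(value: Option<&str>) -> Option<&str> {
    value.filter(|v| !v.is_empty())
}

/// Hypothetical resolution order: explicit CLI value, then TMPDIR, then /tmp.
pub fn resolve_temp_dir(cli_temp_dir: Option<&str>, tmpdir_env: Option<&str>) -> PathBuf {
    non_empty(cli_temp_dir)
        .or(non_empty(tmpdir_env))
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from("/tmp"))
}
```

A caller could supply the second argument with `std::env::var("TMPDIR").ok().as_deref()`. Passing `Some("/dev/shm")` as the first argument corresponds to opting back in to RAM-backed temp files.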
Updates sweepga from 7a7a8d2 to 67ae1b0, which includes:
- FastGA temp directory parameter for HPC compatibility
- Batch-bytes feature for disk-controlled alignment
- ZSTD compression options for k-mer indices
- Various bug fixes

The temp_dir parameter now flows through to FastGA alignment.

Documents the effect of different partition window sizes on graph structure
using real benchmarks from the 7-strain yeast pangenome:
- 10kb: 1,537 partitions, 164k nodes, 222k edges
- 50kb: 368 partitions, 207k nodes, 280k edges
- 100kb: 227 partitions, 221k nodes, 300k edges

Includes recommendations for choosing partition sizes and a script for
experimenting with different window values.

…n from tutorial

Seqwish commit 81ceb66 adds path step length validation that catches
path construction bugs at graph build time. With this fix, lace validation
passes without needing --skip-validation.
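
The kind of check described here can be sketched as follows: the node lengths along a path must sum to the length of the sequence the path represents. This is an illustration only, assuming a graph whose links carry no overlaps; the function name and data layout are hypothetical and do not mirror seqwish's internals.

```rust
use std::collections::HashMap;

/// Checks that the nodes visited by a path add up to the expected sequence length.
/// Assumption: links have no overlaps, so path length is the plain sum of node lengths.
pub fn validate_path_length(
    path_steps: &[u64],
    node_lengths: &HashMap<u64, usize>,
    expected_len: usize,
) -> Result<(), String> {
    let mut total = 0usize;
    for id in path_steps {
        let len = node_lengths
            .get(id)
            .ok_or_else(|| format!("path references unknown node {id}"))?;
        total += *len;
    }
    if total == expected_len {
        Ok(())
    } else {
        Err(format!(
            "path length mismatch: steps cover {total} bp, sequence is {expected_len} bp"
        ))
    }
}
```

Running a check like this at graph build time surfaces a mismatch immediately, instead of leaving it to a later validation step such as the one in lace.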
@ekg ekg force-pushed the sweepqwish branch 2 times, most recently from 1caef6c to feedbcf December 11, 2025 22:06

Tests full pipeline: index -> partition -> graph -> lace
Uses 7-strain yeast chrV data (S288C, DBVPG6765, UWOPS034614, Y12, YPS128, SK1, DBVPG6044)
- ~1.5MB compressed test data
- 994 alignments
- Validates: 30k nodes, 7 paths produced correctly

- Update sweepga from 67ae1b0 to 608547a (fixes scaffold filtering bug)
- Change scaffold_filter default from inf:inf to 1:1 (now that it works)
- Use zcat for decompression in tests (handles both gzip and BGZIP)

The scaffold filter bug caused all-vs-all alignments to lose most data
when using 1:1 mode because it grouped by chromosome rather than genome
pair. The fix in sweepga 608547a correctly groups by genome pair first,
then filters by chromosome pair within each genome pair.
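
To make the grouping order concrete, here is a small sketch that buckets alignments by genome pair first and by chromosome pair within each genome pair. It only counts alignments per bucket and does not implement the 1:1 filter itself. The names are hypothetical, and it assumes sequence names of the form `SAMPLE#HAPLOTYPE#CHROMOSOME`; it is not sweepga's code.

```rust
use std::collections::BTreeMap;

/// A (query, target) pair of genome or chromosome names.
type Pair<'a> = (&'a str, &'a str);

/// Splits `SAMPLE#HAPLOTYPE#CHROM` into (genome, chromosome) at the last '#'.
/// Assumption: a name without '#' is treated as a genome with an empty chromosome.
pub fn split_name(name: &str) -> (&str, &str) {
    name.rsplit_once('#').unwrap_or((name, ""))
}

/// Counts alignments per chromosome pair, nested inside each genome pair.
pub fn group_alignments<'a>(
    alignments: &[(&'a str, &'a str)],
) -> BTreeMap<Pair<'a>, BTreeMap<Pair<'a>, usize>> {
    let mut groups: BTreeMap<Pair<'a>, BTreeMap<Pair<'a>, usize>> = BTreeMap::new();
    for &(query, target) in alignments {
        let (q_genome, q_chrom) = split_name(query);
        let (t_genome, t_chrom) = split_name(target);
        *groups
            .entry((q_genome, t_genome))
            .or_insert_with(BTreeMap::new)
            .entry((q_chrom, t_chrom))
            .or_insert(0) += 1;
    }
    groups
}
```

If the outer key were the chromosome pair alone, alignments of `S288C#1#chrV` against `SK1#1#chrV` and against `Y12#1#chrV` would share one `chrV`/`chrV` bucket, so a 1:1 filter applied per bucket could discard one genome pair's alignments in favor of another's. Keying on the genome pair first keeps them apart.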
ekg added 5 commits December 12, 2025 11:04

Add CLI options for finer control over alignment filtering:
- --scaffold-dist (-D): Maximum scaffold deviation distance
- --min-mapping-length (-b): Minimum mapping length for filtering

Both default to 0 (no filtering).

The help text claimed these options accepted k/m/g suffixes but no
parser existed. Now "10k", "5M", "1g" work as expected for scaffold_jump,
scaffold_mass, scaffold_dist, and min_mapping_length options.
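
A suffix parser of this kind could look like the sketch below. It is an illustration, not the parser added in this commit: `parse_size` is a hypothetical name, and the choice of decimal multipliers (k = 1,000, m = 1,000,000, g = 1,000,000,000), case-insensitive suffixes, and integer-only values are all assumptions.

```rust
/// Parses values such as "10k", "5M", or "1g" into a base-pair count.
/// Assumptions: decimal multipliers, case-insensitive suffix, integer values only.
pub fn parse_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    // The suffix, if present, is a single ASCII byte, so slicing off one byte is safe.
    let (digits, multiplier) = match s.chars().last() {
        Some('k') | Some('K') => (&s[..s.len() - 1], 1_000u64),
        Some('m') | Some('M') => (&s[..s.len() - 1], 1_000_000),
        Some('g') | Some('G') => (&s[..s.len() - 1], 1_000_000_000),
        _ => (s, 1),
    };
    let value = digits
        .parse::<u64>()
        .map_err(|_| format!("invalid size '{s}'"))?;
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("size '{s}' overflows u64"))
}
```

Under these assumptions, plain integers pass through unchanged, a bare suffix such as `k` is rejected, and fractional inputs such as `1.5k` are treated as errors rather than silently rounded.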
When a PAF file is provided with -a/--paf-file, the graph command
skips the FastGA alignment step and uses the provided alignments
directly. This allows using external aligners or pre-computed
alignments for graph construction.

Expose the same filtering and scaffolding options available in impg graph
to impg align, including: num_mappings, scaffold_jump, scaffold_mass,
scaffold_filter, overlap, min_identity, scaffold_dist, min_mapping_length,
and no_filter.

Note: These options are parsed and stored in AlignConfig for future use
when direct PAF/1aln output is implemented (currently align only generates
job lists for external aligners).

@ekg ekg merged commit e3e00b9 into main Jan 6, 2026
1 check failed