Integrate sweepga, seqwish, and gfasort for variation graph construction and sorting #127

ekg · 2025-12-03T00:29:02Z

Summary

This PR integrates three tools for pangenome graph construction:

sweepga: Fast all-vs-all alignment using FastGA with k-mer frequency filtering
seqwish: Graph induction via transitive closure computation (produces proper variation graphs where shared sequences are collapsed)
gfasort: Graph sorting using the Ygs pipeline (path-guided SGD + grooming + topological sort)

Features

New graph command for standalone graph building from FASTA files
- Supports multiple input files and file lists
- Configurable k-mer frequency multiplier (-f, default: genomes × 10)
- Configurable minimum alignment length (-l, default: 100bp)
Query output supports gfa format (seqwish-based variation graphs)
- gfa or gfa-seqwish: Proper variation graph with collapsed shared sequences (default)
- gfa-poa: POA-based partial order alignment graph
Automatic graph sorting using gfasort's Ygs pipeline:
- Y = Path-guided SGD (positions nodes to minimize path distance differences)
- g = Grooming (ensures consistent node orientations along paths)
- s = Topological sort (linearizes graph respecting edge directions)

Dependencies

sweepga @ 63ed05d52c4ac282b94f4e6c67510d3c0d4f73d2
seqwish @ a3369b07370cd7406e00c64224152eef11bcc665 (rust-2 branch)
gfasort @ b55608abeec6121671b674aae20bcc2f940ef213

Test plan

cargo test passes
cargo clippy passes (no new warnings in changed files)
Manual testing of impg graph --fasta-files command
Manual testing of impg query -o gfa with seqwish output

Add support for building variation graphs from sequences using sweepga for alignment and seqwish for graph induction.

Features:
- New `graph` command for standalone graph building from FASTA files
- Query output supports `gfa` (seqwish-based, default) and `gfa-poa` modes
- Seqwish produces proper variation graphs with collapsed shared sequences
- Configurable k-mer frequency multiplier (-f, default: genomes * 10)

Dependencies:
- sweepga @ 63ed05d52c4ac282b94f4e6c67510d3c0d4f73d2
- seqwish @ a3369b07370cd7406e00c64224152eef11bcc665 (rust-2 branch)

Sort GFA output by default using gfasort's Ygs pipeline, which performs:
- Path-guided SGD (positions nodes to minimize path distance differences)
- Grooming (ensures consistent node orientations along paths)
- Topological sort (linearizes graph respecting edge directions)

This is equivalent to running `gfasort -p Ygs` on the command line.

Changes:
- Add gfasort dependency @ b55608abeec6121671b674aae20bcc2f940ef213
- Add sort_gfa() helper function to src/graph.rs
- Sort output in generate_gfa_seqwish_from_intervals() (query interface)
- Sort output in build_graph() (graph command)

The sorting produces well-ordered graphs suitable for visualization and downstream analysis.
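To make the final "s" stage concrete, here is a minimal sketch of a topological sort using Kahn's algorithm. It is not gfasort's code: it assumes a plain directed acyclic graph with integer node ids, whereas real variation graphs are bidirected and can contain cycles. The function name `topological_order` is hypothetical.

```rust
use std::collections::VecDeque;

/// Kahn's algorithm on a directed graph with nodes 0..num_nodes.
/// Returns None when the graph contains a cycle.
/// Assumption: every edge endpoint is smaller than num_nodes.
pub fn topological_order(num_nodes: usize, edges: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut successors: Vec<Vec<usize>> = vec![Vec::new(); num_nodes];
    let mut in_degree = vec![0usize; num_nodes];
    for &(from, to) in edges {
        successors[from].push(to);
        in_degree[to] += 1;
    }

    // Start from every node that has no incoming edge.
    let mut ready: VecDeque<usize> = (0..num_nodes).filter(|&n| in_degree[n] == 0).collect();
    let mut order = Vec::with_capacity(num_nodes);

    while let Some(node) = ready.pop_front() {
        order.push(node);
        for &next in &successors[node] {
            in_degree[next] -= 1;
            if in_degree[next] == 0 {
                ready.push_back(next);
            }
        }
    }

    // Nodes left unvisited are part of (or downstream of) a cycle.
    if order.len() == num_nodes {
        Some(order)
    } else {
        None
    }
}
```

For a simple bubble (`0 -> 1`, `0 -> 2`, `1 -> 3`, `2 -> 3`) the source node comes first and the sink last, which is the linear layout the sort aims for.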

Uses sweepga @ 7a7a8d2 which defaults to /dev/shm for FastGA temp files when available, improving performance on systems with tmpfs.

Implements `impg align` command to generate alignment pairs from input sequences with various sparsification strategies to reduce the O(n²) alignment complexity:
- none: All pairwise alignments
- random:X: Random sampling at fraction X
- giant:X: Erdős-Rényi random sampling with giant component guarantee (probability X)
- tree:k_near:k_far:rand: Tree-based sampling combining k-nearest neighbors, k-farthest (stranger-joining), and random edges using MinHash/Mash distances

The command outputs a job list for cluster execution or can run alignments directly using sweepga/FastGA. Sparsification strategies based on allwave implementation for efficient pangenome construction.
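The strategy strings above could be parsed along these lines. This is a sketch only: the `Sparsification` enum and `parse_sparsification` function are hypothetical names, not taken from the impg source, and it assumes that `X` and the trailing `rand` value are fractions in [0, 1].

```rust
/// Hypothetical in-memory form of the sparsification strategies.
#[derive(Debug, Clone, PartialEq)]
pub enum Sparsification {
    None,
    Random(f64),
    Giant(f64),
    Tree { k_near: usize, k_far: usize, rand_frac: f64 },
}

/// Parses a value that is assumed to be a fraction in [0, 1].
fn parse_fraction(s: &str) -> Result<f64, String> {
    let x: f64 = s.parse().map_err(|_| format!("invalid number: {s:?}"))?;
    if (0.0..=1.0).contains(&x) {
        Ok(x)
    } else {
        Err(format!("{x} is outside [0, 1]"))
    }
}

/// Parses "none", "random:X", "giant:X" or "tree:k_near:k_far:rand".
pub fn parse_sparsification(spec: &str) -> Result<Sparsification, String> {
    let parts: Vec<&str> = spec.split(':').collect();
    match parts.as_slice() {
        ["none"] => Ok(Sparsification::None),
        ["random", x] => Ok(Sparsification::Random(parse_fraction(x)?)),
        ["giant", x] => Ok(Sparsification::Giant(parse_fraction(x)?)),
        ["tree", near, far, rand] => Ok(Sparsification::Tree {
            k_near: near
                .parse::<usize>()
                .map_err(|_| format!("invalid k_near: {near:?}"))?,
            k_far: far
                .parse::<usize>()
                .map_err(|_| format!("invalid k_far: {far:?}"))?,
            rand_frac: parse_fraction(rand)?,
        }),
        _ => Err(format!("unrecognized sparsification spec: {spec:?}")),
    }
}
```

Keeping the parsed form as an enum means an invalid spec fails at argument-parsing time, before any pair selection or alignment work starts.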

- Add 1:1 alignment filtering with scaffold-based chaining (default)
- Count genomes (SAMPLE#HAPLOTYPE prefixes) instead of sequences for k-mer frequency
- Add CLI options: --no-filter, --num-mappings, --scaffold-jump/mass/filter, --overlap, --min-identity
- Update README with filtering documentation and tutorial improvements
- Pin seqwish to b94ba0f (cleanup function + /dev/shm temp dir preference)
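A minimal sketch of the genome-counting idea, assuming PanSN-style names of the form `SAMPLE#HAPLOTYPE#CONTIG`. The helper names (`genome_prefix`, `count_genomes`, `default_kmer_frequency`) are hypothetical, and treating a name with fewer than two `#` separators as its own genome is an assumption rather than documented impg behavior.

```rust
use std::collections::HashSet;

/// Returns the SAMPLE#HAPLOTYPE prefix of a PanSN-style sequence name.
/// Assumption: names with fewer than two '#' are treated as their own genome.
pub fn genome_prefix(name: &str) -> &str {
    match name.match_indices('#').nth(1) {
        Some((idx, _)) => &name[..idx],
        None => name,
    }
}

/// Counts distinct genomes rather than distinct sequences.
pub fn count_genomes<'a, I>(names: I) -> usize
where
    I: IntoIterator<Item = &'a str>,
{
    names
        .into_iter()
        .map(genome_prefix)
        .collect::<HashSet<_>>()
        .len()
}

/// Default k-mer frequency threshold as described in the PR: genomes * 10.
pub fn default_kmer_frequency(num_genomes: usize) -> usize {
    num_genomes * 10
}
```

With the four names `S288C#1#chrV`, `S288C#1#chrI`, `SK1#1#chrV` and `Y12#1#chrV`, this counts three genomes, so the default threshold would be 30 rather than the 40 obtained by counting sequences.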

The fix in seqwish 0448c4d corrects a path length mismatch bug in GFA emission that was causing lace validation failures. The Rust implementation now matches the C++ behavior by iterating through each base position and adding nodes when crossing node boundaries.

When TMPDIR is not set, use /dev/shm as the default temp directory for seqwish and other temp file operations. This provides significant I/O performance improvements by using RAM-based storage instead of disk.

The previous /dev/shm default caused temp files to accumulate in RAM and could fill up shared memory on systems with limited /dev/shm space. Now defaults to system temp (/tmp) which is safer. Users can still get the /dev/shm performance benefit by explicitly passing --temp-dir /dev/shm.
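The resulting precedence can be sketched as follows. `resolve_temp_dir` is a hypothetical helper, not the function used in impg; it relies on `std::env::temp_dir()`, which on Unix returns `TMPDIR` when that variable is set and `/tmp` otherwise.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Picks the temp directory: an explicit --temp-dir value wins,
/// otherwise fall back to the system temp directory.
pub fn resolve_temp_dir(explicit: Option<&Path>) -> PathBuf {
    match explicit {
        // The user opted in, for example with `--temp-dir /dev/shm`.
        Some(dir) => dir.to_path_buf(),
        // On Unix this honors TMPDIR and defaults to /tmp.
        None => env::temp_dir(),
    }
}
```

Keeping RAM-backed storage opt-in avoids filling a small `/dev/shm` by accident while leaving the faster path one flag away.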

Updates sweepga from 7a7a8d2 to 67ae1b0, which includes:
- FastGA temp directory parameter for HPC compatibility
- Batch-bytes feature for disk-controlled alignment
- ZSTD compression options for k-mer indices
- Various bug fixes

The temp_dir parameter now flows through to FastGA alignment.

Documents the effect of different partition window sizes on graph structure using real benchmarks from the 7-strain yeast pangenome:
- 10kb: 1,537 partitions, 164k nodes, 222k edges
- 50kb: 368 partitions, 207k nodes, 280k edges
- 100kb: 227 partitions, 221k nodes, 300k edges

Includes recommendations for choosing partition sizes and a script for experimenting with different window values.

…n from tutorial

Seqwish commit 81ceb66 adds path step length validation that catches path construction bugs at graph build time. With this fix, lace validation passes without needing --skip-validation.
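The kind of check described here can be illustrated with a small sketch: the lengths of the nodes visited by a path must add up to the length of the sequence the path represents. This is an illustration of the invariant only, not seqwish's implementation; `validate_path_length` and its argument layout are hypothetical.

```rust
/// Checks that the nodes visited by a path sum to the expected sequence length.
/// `node_lengths[i]` is the length of node i; `steps` lists node indices in path order.
pub fn validate_path_length(
    node_lengths: &[usize],
    steps: &[usize],
    expected_len: usize,
) -> Result<(), String> {
    let mut total = 0usize;
    for (i, &node) in steps.iter().enumerate() {
        let len = node_lengths
            .get(node)
            .ok_or_else(|| format!("step {i} references unknown node {node}"))?;
        total += *len;
    }
    if total == expected_len {
        Ok(())
    } else {
        Err(format!(
            "path steps cover {total} bp but the sequence is {expected_len} bp"
        ))
    }
}
```

Running a check like this when the graph is built surfaces a dropped or duplicated step immediately, instead of later during lacing.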

Tests full pipeline: index -> partition -> graph -> lace

Uses 7-strain yeast chrV data (S288C, DBVPG6765, UWOPS034614, Y12, YPS128, SK1, DBVPG6044)
- ~1.5MB compressed test data
- 994 alignments
- Validates: 30k nodes, 7 paths produced correctly

- Update sweepga from 67ae1b0 to 608547a (fixes scaffold filtering bug)
- Change scaffold_filter default from inf:inf to 1:1 (now that it works)
- Use zcat for decompression in tests (handles both gzip and BGZIP)

The scaffold filter bug caused all-vs-all alignments to lose most data when using 1:1 mode because it grouped by chromosome rather than genome pair. The fix in sweepga 608547a correctly groups by genome pair first, then filters by chromosome pair within each genome pair.
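The grouping order described in this fix can be sketched as a two-level map: genome pair first, chromosome pair second. This is not sweepga's code. It assumes PanSN-style `SAMPLE#HAPLOTYPE#CONTIG` names, and `split_pansn` and `group_by_genome_then_chromosome` are hypothetical helpers.

```rust
use std::collections::BTreeMap;

type PairKey = (String, String);

/// Splits "SAMPLE#HAPLOTYPE#CONTIG" into ("SAMPLE#HAPLOTYPE", "CONTIG").
/// Assumption: names without two '#' are treated as a genome with an empty contig.
fn split_pansn(name: &str) -> (&str, &str) {
    match name.match_indices('#').nth(1) {
        Some((idx, _)) => (&name[..idx], &name[idx + 1..]),
        None => (name, ""),
    }
}

/// Groups alignment indices by (query genome, target genome),
/// then by (query contig, target contig) inside each genome pair.
pub fn group_by_genome_then_chromosome(
    alignments: &[(&str, &str)],
) -> BTreeMap<PairKey, BTreeMap<PairKey, Vec<usize>>> {
    let mut groups: BTreeMap<PairKey, BTreeMap<PairKey, Vec<usize>>> = BTreeMap::new();
    for (i, (query, target)) in alignments.iter().enumerate() {
        let (q_genome, q_contig) = split_pansn(query);
        let (t_genome, t_contig) = split_pansn(target);
        groups
            .entry((q_genome.to_string(), t_genome.to_string()))
            .or_default()
            .entry((q_contig.to_string(), t_contig.to_string()))
            .or_default()
            .push(i);
    }
    groups
}
```

A 1:1 filter applied inside each genome-pair group keeps the best mapping per pair of genomes, so alignments from one genome pair cannot displace those from another.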

Add CLI options for finer control over alignment filtering:
- --scaffold-dist (-D): Maximum scaffold deviation distance
- --min-mapping-length (-b): Minimum mapping length for filtering

Both default to 0 (no filtering).

The help text claimed these options accepted k/m/g suffixes but no parser existed. Now "10k", "5M", "1g" work as expected for scaffold_jump, scaffold_mass, scaffold_dist, and min_mapping_length options.
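A suffix parser of this kind might look like the sketch below. `parse_size` is a hypothetical name, and the decimal multipliers (k = 1,000, m = 1,000,000, g = 1,000,000,000) and integer-only input are assumptions; the commit message does not say whether impg uses decimal or binary units or accepts fractional values.

```rust
/// Parses sizes such as "100", "10k", "5M" or "1g" into a plain integer.
/// Assumption: suffixes are case-insensitive decimal multipliers.
pub fn parse_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    // The suffix is a single ASCII character, so slicing off one byte is safe.
    let (digits, multiplier) = match s.chars().last() {
        Some('k') | Some('K') => (&s[..s.len() - 1], 1_000u64),
        Some('m') | Some('M') => (&s[..s.len() - 1], 1_000_000),
        Some('g') | Some('G') => (&s[..s.len() - 1], 1_000_000_000),
        _ => (s, 1),
    };
    let value: u64 = digits
        .parse()
        .map_err(|_| format!("invalid size: {input:?}"))?;
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("size overflows u64: {input:?}"))
}
```

Using `checked_mul` turns an oversized value into a parse error rather than a silently wrapped number.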

When a PAF file is provided with -a/--paf-file, the graph command skips the FastGA alignment step and uses the provided alignments directly. This allows using external aligners or pre-computed alignments for graph construction.
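For readers preparing alignments with an external aligner, the sketch below parses the 12 mandatory tab-separated PAF columns. It is not the parser impg uses; `PafRecord` and `parse_paf_line` are hypothetical names, and `identity()` uses matches divided by block length, which is one common definition and may differ from what `--min-identity` computes.

```rust
/// The 12 mandatory PAF columns; optional tags such as cg:Z: are ignored here.
#[derive(Debug, Clone, PartialEq)]
pub struct PafRecord {
    pub query_name: String,
    pub query_len: u64,
    pub query_start: u64,
    pub query_end: u64,
    pub strand: char,
    pub target_name: String,
    pub target_len: u64,
    pub target_start: u64,
    pub target_end: u64,
    pub matches: u64,
    pub block_len: u64,
    pub mapq: u8,
}

impl PafRecord {
    /// Matches over alignment block length (0.0 for an empty block).
    pub fn identity(&self) -> f64 {
        if self.block_len == 0 {
            0.0
        } else {
            self.matches as f64 / self.block_len as f64
        }
    }
}

/// Parses one PAF line, returning an error message for malformed input.
pub fn parse_paf_line(line: &str) -> Result<PafRecord, String> {
    let fields: Vec<&str> = line.trim_end().split('\t').collect();
    if fields.len() < 12 {
        return Err(format!("expected at least 12 columns, found {}", fields.len()));
    }
    let int = |i: usize| -> Result<u64, String> {
        fields[i]
            .parse::<u64>()
            .map_err(|_| format!("column {} is not an integer: {:?}", i + 1, fields[i]))
    };
    let strand = match fields[4] {
        "+" => '+',
        "-" => '-',
        other => return Err(format!("invalid strand: {other:?}")),
    };
    Ok(PafRecord {
        query_name: fields[0].to_string(),
        query_len: int(1)?,
        query_start: int(2)?,
        query_end: int(3)?,
        strand,
        target_name: fields[5].to_string(),
        target_len: int(6)?,
        target_start: int(7)?,
        target_end: int(8)?,
        matches: int(9)?,
        block_len: int(10)?,
        mapq: fields[11]
            .parse::<u8>()
            .map_err(|_| format!("invalid mapping quality: {:?}", fields[11]))?,
    })
}
```

Checking a PAF file this way before graph construction catches truncated or space-delimited lines early.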

Expose the same filtering and scaffolding options available in impg graph to impg align, including: num_mappings, scaffold_jump, scaffold_mass, scaffold_filter, overlap, min_identity, scaffold_dist, min_mapping_length, and no_filter.

Note: These options are parsed and stored in AlignConfig for future use when direct PAF/1aln output is implemented (currently align only generates job lists for external aligners).

ekg closed this Dec 11, 2025

ekg reopened this Dec 11, 2025

ekg added 11 commits December 11, 2025 11:59

Update sweepga to use /dev/shm as default temp directory

b3ad615

Default temp directory to /dev/shm for better I/O performance

662dc86

Update seqwish to include path validation and remove --skip-validatio…

1e533d1

ekg force-pushed the sweepqwish branch 2 times, most recently from 1caef6c to feedbcf Compare December 11, 2025 22:06

ekg force-pushed the sweepqwish branch from feedbcf to 46a0249 Compare December 11, 2025 22:29

ekg force-pushed the sweepqwish branch from 6ca9b46 to 1d78379 Compare December 12, 2025 05:05

ekg added 5 commits December 12, 2025 11:04

Expose additional sweepga filtering parameters for graph command

8858605

Add k/m/g suffix parsing for size CLI options

31801ff

Add --paf-file option to graph command to skip alignment

2fe02b7

Update gfasort to e23f45e

ce4fe36

ekg merged commit e3e00b9 into main Jan 6, 2026
1 check failed

2 participants
