Integrate sweepga, seqwish, and gfasort for variation graph construction and sorting #127
Merged
Add support for building variation graphs from sequences using sweepga for alignment and seqwish for graph induction.

Features:
- New `graph` command for standalone graph building from FASTA files
- Query output supports `gfa` (seqwish-based, default) and `gfa-poa` modes
- Seqwish produces proper variation graphs with collapsed shared sequences
- Configurable k-mer frequency multiplier (-f, default: genomes * 10)

Dependencies:
- sweepga @ 63ed05d52c4ac282b94f4e6c67510d3c0d4f73d2
- seqwish @ a3369b07370cd7406e00c64224152eef11bcc665 (rust-2 branch)
Sort GFA output by default using gfasort's Ygs pipeline, which performs:
- Path-guided SGD (positions nodes to minimize path distance differences)
- Grooming (ensures consistent node orientations along paths)
- Topological sort (linearizes graph respecting edge directions)

This is equivalent to running `gfasort -p Ygs` on the command line.

Changes:
- Add gfasort dependency @ b55608abeec6121671b674aae20bcc2f940ef213
- Add sort_gfa() helper function to src/graph.rs
- Sort output in generate_gfa_seqwish_from_intervals() (query interface)
- Sort output in build_graph() (graph command)

The sorting produces well-ordered graphs suitable for visualization and downstream analysis.
Uses sweepga @ 7a7a8d2 which defaults to /dev/shm for FastGA temp files when available, improving performance on systems with tmpfs.
Implements `impg align` command to generate alignment pairs from input sequences with various sparsification strategies to reduce the O(n²) alignment complexity:
- none: All pairwise alignments
- random:X: Random sampling at fraction X
- giant:X: Erdős-Rényi random sampling with giant component guarantee (probability X)
- tree:k_near:k_far:rand: Tree-based sampling combining k-nearest neighbors, k-farthest (stranger-joining), and random edges using MinHash/Mash distances

The command outputs a job list for cluster execution or can run alignments directly using sweepga/FastGA. Sparsification strategies are based on the allwave implementation for efficient pangenome construction.
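The strategy strings listed above could be parsed along the following lines. This is a minimal sketch, not the code from this PR: the `Sparsification` enum, `parse_fraction`, and `parse_sparsification` are hypothetical names, and treating the `rand` field of the tree strategy as a fraction in [0, 1] is an assumption.

```rust
/// Hypothetical representation of the sparsification strategies
/// named in the commit message (not the actual impg types).
#[derive(Debug, PartialEq)]
enum Sparsification {
    /// All pairwise alignments.
    None,
    /// Random sampling at the given fraction.
    Random(f64),
    /// Random sampling with a giant-component guarantee at the given probability.
    Giant(f64),
    /// Tree-based sampling; `rand_frac` as a fraction is an assumption.
    Tree { k_near: usize, k_far: usize, rand_frac: f64 },
}

/// Parse a value that must lie in [0, 1].
fn parse_fraction(s: &str) -> Result<f64, String> {
    let x: f64 = s
        .parse()
        .map_err(|_| format!("invalid fraction '{}'", s))?;
    if (0.0..=1.0).contains(&x) {
        Ok(x)
    } else {
        Err(format!("fraction '{}' is outside [0, 1]", s))
    }
}

/// Parse specs such as "none", "random:0.5", "giant:0.99", "tree:3:2:0.1".
fn parse_sparsification(spec: &str) -> Result<Sparsification, String> {
    let parts: Vec<&str> = spec.split(':').collect();
    match parts.as_slice() {
        ["none"] => Ok(Sparsification::None),
        ["random", x] => Ok(Sparsification::Random(parse_fraction(x)?)),
        ["giant", x] => Ok(Sparsification::Giant(parse_fraction(x)?)),
        ["tree", near, far, rand] => Ok(Sparsification::Tree {
            k_near: near
                .parse::<usize>()
                .map_err(|_| format!("invalid k_near '{}'", near))?,
            k_far: far
                .parse::<usize>()
                .map_err(|_| format!("invalid k_far '{}'", far))?,
            rand_frac: parse_fraction(rand)?,
        }),
        _ => Err(format!("unrecognized sparsification spec '{}'", spec)),
    }
}
```

Under these assumptions, `parse_sparsification("tree:3:2:0.1")` yields a `Tree` variant with three nearest neighbors, two farthest neighbors, and a random-edge fraction of 0.1, while malformed specs return an error message instead of panicking.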
- Add 1:1 alignment filtering with scaffold-based chaining (default)
- Count genomes (SAMPLE#HAPLOTYPE prefixes) instead of sequences for k-mer frequency
- Add CLI options: --no-filter, --num-mappings, --scaffold-jump/mass/filter, --overlap, --min-identity
- Update README with filtering documentation and tutorial improvements
- Pin seqwish to b94ba0f (cleanup function + /dev/shm temp dir preference)
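Counting genomes by their SAMPLE#HAPLOTYPE prefix, and deriving the default k-mer frequency from that count, might look roughly like this. The function names are hypothetical, and the sketch assumes names of the form `SAMPLE#HAPLOTYPE#CONTIG`; names with fewer than two `#` separators are treated as their own genome, which is an assumption rather than documented behavior.

```rust
use std::collections::HashSet;

/// Return the `SAMPLE#HAPLOTYPE` prefix of a sequence name.
/// Names with fewer than two '#' characters are returned unchanged (assumption).
fn genome_prefix(name: &str) -> &str {
    // The second '#' separates SAMPLE#HAPLOTYPE from the contig name.
    match name.match_indices('#').nth(1) {
        Some((idx, _)) => &name[..idx],
        None => name,
    }
}

/// Count distinct genomes (not sequences) among the given names.
fn count_genomes(names: &[&str]) -> usize {
    names
        .iter()
        .copied()
        .map(genome_prefix)
        .collect::<HashSet<_>>()
        .len()
}

/// Default k-mer frequency: genomes * 10, as stated in the commit messages.
fn default_kmer_frequency(num_genomes: usize) -> usize {
    num_genomes * 10
}
```

With this counting, two contigs from `S288C#1` plus one from `SK1#1` count as two genomes, so the default frequency would be 20 rather than 30.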
The fix in seqwish 0448c4d corrects a path length mismatch bug in GFA emission that was causing lace validation failures. The Rust implementation now matches the C++ behavior by iterating through each base position and adding nodes when crossing node boundaries.
When TMPDIR is not set, use /dev/shm as the default temp directory for seqwish and other temp file operations. This provides significant I/O performance improvements by using RAM-based storage instead of disk.
The previous /dev/shm default caused temp files to accumulate in RAM and could fill up shared memory on systems with limited /dev/shm space. Now defaults to system temp (/tmp) which is safer. Users can still get the /dev/shm performance benefit by explicitly passing --temp-dir /dev/shm.
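The resulting temp directory choice can be sketched as a small resolution function. The name `resolve_temp_dir` and the precedence order (explicit `--temp-dir` first, then `TMPDIR`, then `/tmp`) are assumptions made for illustration; the environment value is passed in as a parameter so the logic stays easy to test.

```rust
use std::path::PathBuf;

/// Pick a temp directory (hypothetical helper, not the impg implementation).
/// Assumed precedence: --temp-dir value, then TMPDIR, then /tmp.
fn resolve_temp_dir(cli_temp_dir: Option<&str>, tmpdir_env: Option<&str>) -> PathBuf {
    // An explicit --temp-dir (for example /dev/shm) always wins.
    if let Some(dir) = cli_temp_dir {
        return PathBuf::from(dir);
    }
    // Otherwise honor a non-empty TMPDIR.
    if let Some(dir) = tmpdir_env.filter(|d| !d.is_empty()) {
        return PathBuf::from(dir);
    }
    // Fall back to the system temp directory rather than RAM-backed /dev/shm.
    PathBuf::from("/tmp")
}
```

A caller would typically pass `std::env::var("TMPDIR").ok().as_deref()` as the second argument, so users who want the RAM-backed speedup opt in explicitly with `--temp-dir /dev/shm`.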
Updates sweepga from 7a7a8d2 to 67ae1b0, which includes:
- FastGA temp directory parameter for HPC compatibility
- Batch-bytes feature for disk-controlled alignment
- ZSTD compression options for k-mer indices
- Various bug fixes

The temp_dir parameter now flows through to FastGA alignment.
Documents the effect of different partition window sizes on graph structure using real benchmarks from the 7-strain yeast pangenome:
- 10kb: 1,537 partitions, 164k nodes, 222k edges
- 50kb: 368 partitions, 207k nodes, 280k edges
- 100kb: 227 partitions, 221k nodes, 300k edges

Includes recommendations for choosing partition sizes and a script for experimenting with different window values.
…n from tutorial Seqwish commit 81ceb66 adds path step length validation that catches path construction bugs at graph build time. With this fix, lace validation passes without needing --skip-validation.
1caef6c to feedbcf
Tests full pipeline: index -> partition -> graph -> lace

Uses 7-strain yeast chrV data (S288C, DBVPG6765, UWOPS034614, Y12, YPS128, SK1, DBVPG6044)
- ~1.5MB compressed test data
- 994 alignments
- Validates: 30k nodes, 7 paths produced correctly
- Update sweepga from 67ae1b0 to 608547a (fixes scaffold filtering bug)
- Change scaffold_filter default from inf:inf to 1:1 (now that it works)
- Use zcat for decompression in tests (handles both gzip and BGZIP)

The scaffold filter bug caused all-vs-all alignments to lose most data when using 1:1 mode because it grouped by chromosome rather than genome pair. The fix in sweepga 608547a correctly groups by genome pair first, then filters by chromosome pair within each genome pair.
Add CLI options for finer control over alignment filtering:
- --scaffold-dist (-D): Maximum scaffold deviation distance
- --min-mapping-length (-b): Minimum mapping length for filtering

Both default to 0 (no filtering).
The help text claimed these options accepted k/m/g suffixes but no parser existed. Now "10k", "5M", "1g" work as expected for scaffold_jump, scaffold_mass, scaffold_dist, and min_mapping_length options.
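A suffix parser of this kind can be sketched as follows. This is an illustration rather than the parser added in the commit: `parse_size` is a hypothetical name, the decimal multipliers (k = 1,000, m = 1,000,000, g = 1,000,000,000) are an assumption, and fractional values such as "1.5k" are not handled.

```rust
/// Parse an integer with an optional, case-insensitive k/m/g suffix.
/// Decimal multipliers are assumed; the real parser may differ.
fn parse_size(input: &str) -> Result<u64, String> {
    let s = input.trim();
    let last = match s.chars().last() {
        Some(c) => c.to_ascii_lowercase(),
        None => return Err("empty value".to_string()),
    };
    // The suffix, if present, is a single ASCII byte, so slicing by len - 1 is safe.
    let (digits, multiplier): (&str, u64) = match last {
        'k' => (&s[..s.len() - 1], 1_000),
        'm' => (&s[..s.len() - 1], 1_000_000),
        'g' => (&s[..s.len() - 1], 1_000_000_000),
        _ => (s, 1),
    };
    let value: u64 = digits
        .parse()
        .map_err(|e| format!("invalid number '{}': {}", s, e))?;
    // Reject values that would overflow instead of wrapping silently.
    value
        .checked_mul(multiplier)
        .ok_or_else(|| format!("value '{}' overflows u64", s))
}
```

Under these assumptions, "10k", "5M", and "1g" parse to 10,000, 5,000,000, and 1,000,000,000, plain integers pass through unchanged, and a bare suffix such as "k" is rejected.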
When a PAF file is provided with -a/--paf-file, the graph command skips the FastGA alignment step and uses the provided alignments directly. This allows using external aligners or pre-computed alignments for graph construction.
Expose the same filtering and scaffolding options available in impg graph to impg align, including: num_mappings, scaffold_jump, scaffold_mass, scaffold_filter, overlap, min_identity, scaffold_dist, min_mapping_length, and no_filter. Note: These options are parsed and stored in AlignConfig for future use when direct PAF/1aln output is implemented (currently align only generates job lists for external aligners).
Summary
This PR integrates three tools for pangenome graph construction:
- sweepga: sequence alignment
- seqwish: variation graph induction
- gfasort: graph sorting

Features
- New `graph` command for standalone graph building from FASTA files
  - Configurable k-mer frequency multiplier (`-f`, default: genomes × 10)
  - `-l` option (default: 100bp)
- Query output supports `gfa` format (seqwish-based variation graphs)
  - `gfa` or `gfa-seqwish`: Proper variation graph with collapsed shared sequences (default)
  - `gfa-poa`: POA-based partial order alignment graph
- Automatic graph sorting using gfasort's Ygs pipeline:
  - Path-guided SGD
  - Grooming
  - Topological sort

Dependencies
- `sweepga` @ 63ed05d52c4ac282b94f4e6c67510d3c0d4f73d2
- `seqwish` @ a3369b07370cd7406e00c64224152eef11bcc665 (rust-2 branch)
- `gfasort` @ b55608abeec6121671b674aae20bcc2f940ef213

Test plan
- `cargo test` passes
- `cargo clippy` passes (no new warnings in changed files)
- `impg graph --fasta-files` command
- `impg query -o gfa` with seqwish output