-
Notifications
You must be signed in to change notification settings - Fork 212
Simulating reads with vg sim
This page gives a practical example of using vg sim. Note that this process is automated in toil-vg and we especially recommend using toil-vg if working with large graphs.
Most of the time we might want to simulate reads from one of the sample that was used to create the graph. To do this we need to index each haplotype using a GBWT. This index will be used to identify the paths in the graph corresponding to the haplotype or sample we want to simulate the reads from.
This tutorial uses small/x.fa, small/x.vcf.gz, and small/x.fa_1.fastq in the test directory of vg.
# Construct variation graph from FASTA backbone with added variants
vg construct -r small/x.fa -v small/x.vcf.gz -a -f -m 32 > x.vg
# Create GBWT index with haplotype paths based on the VCF
vg gbwt -o x.gbwt -v small/x.vcf.gz -x x.vg
# Combine GBWT index and VG into GBZ format
vg gbwt -g x.gbz -x x.vg x.gbwtWe can simulate reads from a specific sample with -m/--sample-name:
vg sim -x x.vg # simulate from variation graph x.vg \
-g x.gbwt # haplotype paths stored in GBWT index x.gbwt \
-m 1 # simulate reads from sample 1 \
-n 1000 # simulate 1000 reads \
-l 150 # simulate reads of length 150 \
-a # output in GAM format \
> sim_unpaired.gam
# You can also provide a GBZ, which contains a GBWT
vg sim -x x.gbz -m 1 -n 1000 -l 150 -a > sim_unpaired.gamBy default vg sim outputs reads in a text format. To output in other formats, use -q (FASTQ format), -a (GAM format), or -J (JSON version of GAM format).
We can specify the fragment size and its standard deviation using the -p and -v arguments. For example:
vg sim -x x.vg -g x.gbwt -m 1 -n 1000 -l 150 -p 500 -v 50 -a > sim_paired.gamA FASTQ file can be used to model the error profile using the -F argument. For example:
vg sim -x x.vg -g x.gbwt -m 1 -n 1000 -p 500 -v 50 -F small/x.fa_1.fastq -a > sim_errors.gamNote that in that case the read length argument (-l) is not necessary.