Skip to content

Simulating reads with vg sim

Faith Okamoto edited this page Dec 31, 2025 · 7 revisions

This page gives a practical example of using vg sim. Note that this process is automated in toil-vg and we especially recommend using toil-vg if working with large graphs.

Most of the time we might want to simulate reads from one of the sample that was used to create the graph. To do this we need to index each haplotype using a GBWT. This index will be used to identify the paths in the graph corresponding to the haplotype or sample we want to simulate the reads from.

Constructing a graph and indexing haplotypes

This tutorial uses small/x.fa, small/x.vcf.gz, and small/x.fa_1.fastq in the test directory of vg.

# Construct variation graph from FASTA backbone with added variants
vg construct -r small/x.fa -v small/x.vcf.gz -a -f -m 32 > x.vg
# Create GBWT index with haplotype paths based on the VCF
vg gbwt -o x.gbwt -v small/x.vcf.gz -x x.vg
# Combine GBWT index and VG into GBZ format
vg gbwt -g x.gbz -x x.vg x.gbwt

Simulating reads from a specific sample

We can simulate reads from a specific sample with -m/--sample-name:

vg sim -x x.vg # simulate from variation graph x.vg \
       -g x.gbwt # haplotype paths stored in GBWT index x.gbwt \
       -m 1 # simulate reads from sample 1 \
       -n 1000 # simulate 1000 reads \
       -l 150 # simulate reads of length 150 \
       -a # output in GAM format \
       > sim_unpaired.gam
# You can also provide a GBZ, which contains a GBWT
vg sim -x x.gbz -m 1 -n 1000 -l 150 -a > sim_unpaired.gam

By default vg sim outputs reads in a text format. To output in other formats, use -q (FASTQ format), -a (GAM format), or -J (JSON version of GAM format).

Paired-end reads

We can specify the fragment size and its standard deviation using the -p and -v arguments. For example:

vg sim -x x.vg -g x.gbwt -m 1 -n 1000 -l 150 -p 500 -v 50 -a > sim_paired.gam

Providing a real FASTQ file for error modeling

A FASTQ file can be used to model the error profile using the -F argument. For example:

vg sim -x x.vg -g x.gbwt -m 1 -n 1000 -p 500 -v 50 -F small/x.fa_1.fastq -a > sim_errors.gam

Note that in that case the read length argument (-l) is not necessary.

Clone this wiki locally