Read Structures
NEW: Validate your read structures using this online tool.
In fgbio, fqtk, sgdemux, and Picard, a Read Structure is a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's bcl2fastq software, but provides some additional capabilities.
A Read Structure is a sequence of <number><operator> pairs, called segments. Optionally, the last segment in the string may use + instead of a number for its length. The + stands for whatever bases are left after the other segments are processed, and can be thought of as meaning [0..infinity].
Read Structures are most commonly used in tools that convert from sequencer output formats (e.g. fastq files, BCLs) to downstream formats like SAM/BAM/CRAM, and in tools that process SAM/BAM/CRAM to extract non-template bases from the reads. Examples include:
- DemuxFastqs in fgbio, to demultiplex a set of multi-sample fastq files and optionally extract UMIs
- FastqToBam in fgbio, to convert from fastq to BAM while preserving sample barcode, cell barcode, and UMI information
- ExtractUmisFromBam in fgbio, which rewrites a BAM file with UMI sequences extracted from the reads and placed into tags
- IlluminaBasecallsToSam and IlluminaBasecallsToFastq in Picard, both of which process BCLs and related files in an Illumina run folder and create BAMs or FASTQs respectively
Five kinds of operator are supported:
- T or Template: the bases in the segment are reads of template (e.g. genomic DNA, RNA, etc.)
- B or Sample Barcode: the bases in the segment are an index sequence used to identify the sample being sequenced
- M or Molecular Barcode: the bases in the segment are an index sequence used to identify the unique source molecule being sequenced (i.e. a UMI)
- C or Cell Barcode: the bases in the segment are an index sequence used to identify the cell being sequenced
- S or Skip: the bases in the segment should be skipped or ignored, for example if they are monotemplate sequence generated by the library preparation
Read Structures must obey the following rules:
- Any number of segments >= 1 is valid
- The length of each segment must be a positive integer >= 1 (or +)
- Only the last segment in a read structure may use + for its length
- Adjacent segments may use the same operator. E.g. if two sample indices are ligated onto a molecule separately such that they are adjacent, a structure of 6B6B+T is perfectly acceptable.
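The rules above can be sketched as a small hand-written parser. This is an illustrative Python sketch, not the fgbio or Picard implementation; the function name `parse_read_structure` and the use of `None` to represent a `+` length are assumptions made for this example.

```python
from typing import List, Optional, Tuple

OPERATORS = set("TBMCS")  # Template, sample Barcode, Molecular barcode, Cell barcode, Skip
DIGITS = "0123456789"


def parse_read_structure(text: str) -> List[Tuple[Optional[int], str]]:
    """Parse e.g. '6B6B+T' into [(6, 'B'), (6, 'B'), (None, 'T')].

    A length of None stands for '+' (all remaining bases).
    Raises ValueError if any of the rules listed above is violated.
    """
    segments: List[Tuple[Optional[int], str]] = []
    i = 0
    while i < len(text):
        # Rule: only the last segment may use '+', so nothing may follow it.
        if segments and segments[-1][0] is None:
            raise ValueError("only the last segment may use '+' for its length")

        # Read the length: either '+' or a positive integer.
        length: Optional[int]
        if text[i] == "+":
            length = None
            i += 1
        else:
            j = i
            while j < len(text) and text[j] in DIGITS:
                j += 1
            digits = text[i:j]
            if not digits or digits[0] == "0":
                raise ValueError(f"expected a positive integer or '+' at position {i}")
            length = int(digits)
            i = j

        # Read the operator that must follow the length.
        if i >= len(text) or text[i] not in OPERATORS:
            raise ValueError(f"expected one of T, B, M, C, S at position {i}")
        segments.append((length, text[i]))
        i += 1

    # Rule: at least one segment is required.
    if not segments:
        raise ValueError("a read structure needs at least one segment")
    return segments
```

For example, `parse_read_structure("6B6B+T")` returns `[(6, 'B'), (6, 'B'), (None, 'T')]`, while `parse_read_structure("+T6B")` raises a `ValueError` because a segment follows the `+` segment.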
The following examples show the recommended way to describe a sequencing run in two different ways: first as a single Read Structure for the entire run, as you might use with IlluminaBasecallsToSam, and second as a set of Read Structures that map one-to-one to the physical reads after fastq conversion and, optionally, adapter trimming (which creates variable-length reads):
- A simple 2x150bp paired end run with no sample or molecular indices:
  - 150T150T
  - [+T, +T]
- A 2x75bp paired end run with an 8bp I1 index read:
  - 75T8B75T
  - [+T, 8B, +T]
- A 2x150bp paired end run with an 8bp I1 index read and an inline 6bp UMI in read 1:
  - 6M144T8B150T
  - [6M+T, 8B, +T]
- A 2x150bp duplex sequencing run with dual sample-barcoding (I1 and I2) and both a 10bp UMI and 5bp monotemplate at the start of both R1 and R2:
  - 10M5S135T8B8B10M5S135T
  - [10M5S+T, 8B, 8B, 10M5S+T]
- A 2x150bp single-cell sequencing run with two cell-specific barcodes separated by a skipped linker and a UMI:
  - 5C30S5C3S8M99T8B150T
  - [5C30S5C3S8M+T, 8B, +T]
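To make the examples above concrete, the following Python sketch applies a per-read Read Structure to the bases of one read, and sums the fixed lengths of a whole-run Read Structure. The helper names `split_read` and `fixed_cycles` are hypothetical and exist only for this illustration. How real tools handle reads that are shorter or longer than the structure is not specified here; this sketch raises on short reads and ignores trailing bases.

```python
import re
from typing import List, Tuple

# One segment: a length ('+' or a positive integer) followed by an operator.
SEGMENT = re.compile(r"(\+|[1-9][0-9]*)([TBMCS])")


def _tokenise(read_structure: str) -> List[Tuple[str, str]]:
    """Split a read structure into (length, operator) string pairs."""
    tokens = SEGMENT.findall(read_structure)
    # Reject the string if anything was left unmatched between segments.
    if not tokens or "".join(n + op for n, op in tokens) != read_structure:
        raise ValueError(f"not a valid read structure: {read_structure!r}")
    if any(n == "+" for n, _ in tokens[:-1]):
        raise ValueError("only the last segment may use '+'")
    return tokens


def split_read(bases: str, read_structure: str) -> List[Tuple[str, str]]:
    """Return (operator, bases) for each segment of one physical read."""
    result = []
    pos = 0
    for length, op in _tokenise(read_structure):
        # '+' takes everything that is left, which may be zero bases.
        end = len(bases) if length == "+" else pos + int(length)
        if end > len(bases):
            raise ValueError("read is shorter than the fixed-length segments")
        result.append((op, bases[pos:end]))
        pos = end
    # Assumption: bases beyond a fully fixed-length structure are ignored.
    return result


def fixed_cycles(read_structure: str) -> int:
    """Sum the fixed segment lengths, e.g. to compare against the run's cycle count."""
    return sum(int(n) for n, _ in _tokenise(read_structure) if n != "+")


# A read with a 6bp inline UMI followed by template, described as 6M+T.
print(split_read("ACGTACTTGCA", "6M+T"))   # [('M', 'ACGTAC'), ('T', 'TTGCA')]
# A 2x75bp run with an 8bp index read uses 75 + 8 + 75 cycles.
print(fixed_cycles("75T8B75T"))            # 158
```

Summing the fixed lengths is a quick sanity check for a whole-run Read Structure: the total should equal the number of cycles in the run.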
The formal grammar for Read Structures supported by fgbio is as follows:
<read-structure> ::= <fixed-structure> <variable-segment>
<fixed-structure> ::= "" | <fixed-length> <operator> <fixed-structure>
<variable-segment> ::= "" | <variable-length> <operator>
<segment> ::= <any-length><operator>
<operator> ::= "T" | "B" | "M" | "C" | "S"
<fixed-length> ::= <non-zero-digit>{<digit>}
<variable-length> ::= "+"
<any-length> ::= <fixed-length> | <variable-length>
<non-zero-digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit> ::= "0" | <non-zero-digit>
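The grammar above translates almost directly into a regular expression. The Python sketch below is one possible rendering, with each constant named after the production it mirrors; the function names are assumptions for this example. Note that the grammar as written also accepts the empty string, so the sketch adds a separate non-empty check to honour the rule that at least one segment is required.

```python
import re

# Each pattern mirrors one production of the grammar.
OPERATOR = r"[TBMCS]"
FIXED_LENGTH = r"[1-9][0-9]*"                        # <non-zero-digit>{<digit>}
FIXED_STRUCTURE = rf"(?:{FIXED_LENGTH}{OPERATOR})*"  # zero or more fixed segments
VARIABLE_SEGMENT = rf"(?:\+{OPERATOR})?"             # optional trailing '+' segment
READ_STRUCTURE = re.compile(rf"{FIXED_STRUCTURE}{VARIABLE_SEGMENT}")


def matches_grammar(text: str) -> bool:
    """True if the whole string is derivable from <read-structure>."""
    return READ_STRUCTURE.fullmatch(text) is not None


def is_valid_read_structure(text: str) -> bool:
    """Grammar match plus the 'at least one segment' rule."""
    return text != "" and matches_grammar(text)


for candidate in ["150T150T", "10M5S+T", "+T8B", "0T", ""]:
    print(repr(candidate), is_valid_read_structure(candidate))
```

A regular expression is enough here because the grammar has no nesting: it is a flat repetition of fixed segments followed by at most one variable segment.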