FASTQ Input Help¶
Currently we only support FASTQ input reads for the recalibrate command. We also support only one FASTQ file (and its corrected variant) at a time. Because of this, there are some onerous requirements for reads names and length.
We plan to support input reads from any number of BAM or FASTQ files before release. We also plan to remove the requirement of externally correcting the reads before input.
FASTQ Input Requirements¶
| Feature | Requirement |
|---|---|
--infer-rg |
Add RG: to ID and append to name delimited with _ |
| Recalibrate | Corrected read name begins with uncorrected read name |
| Paired reads | First _ delimited field of name ends with /2 for
second in pair reads |
| Benchmark | First _ delimited field of name (without /1 or
/2 suffix) must match a read name in the BAM |
Only one FASTQ file of reads is currently supported. You must supply the
file and a corrected version of the file on the command line. For example,
if your reads are in reads.fq and a corrected version is in reads.cor.fq,
you would use the recalibrate command like:
kbbq recalibrate -f reads.fq reads.cor.fq
To ensure the reads in the uncorrected file and corrected file are properly
aligned, recalibrate will check that the corrected version of each read
has a name that begins with the name of the uncorrected read. This requirement
is satisfied when lighter is used as the error corrector, but should be
satisfied by other error correctors as well.
Inferring Read Groups¶
Currently the only other command line option to recalibrate that applies
to FASTQ files is --infer-rg, which will attempt to parse the read name for
the read group the read belongs to. If you don’t want to mess with renaming your
reads or already have them in sets of 1 read group per file, you can safely use
recalibrate on each file separately; the default behavior without the
--infer-rg flag is to treat all reads in the file as belonging to the same
read group.
To properly infer the read group, the read must have the read group appended
with a _ character to the end of the read name. The read group should also
include a RG: prefix, and may include other information (such as a type
indicator) delimited by a : before the read group ID.
Thus a read name like @HJCMTCCXX160113:5:1101:7760:55965/1 in read group
HJCMT.5 is made suitable for parsing by adding _RG:HJCMT.5 to the end,
forming the full name @HJCMTCCXX160113:5:1101:7760:55965/1_RG:HJCMT.5. It
is also equally valid to include additional information before the actual ID.
For example, it’s OK to add the Z type indicator from samtools indicating
the RG is a string type, and in that case the full read name would be:
@HJCMTCCXX160113:5:1101:7760:55965/1_RG:Z:HJCMT.5
First or Second in Pair¶
The program will interpret any read where the last 2 characters of the first field
of a _ delimited name are /2 as a second-in-pair read. Any other name and the
read will be interpreted as first in pair. If your reads are paired and match this naming
scheme or if your reads are unpaired, there is nothing you must do to mark your reads.
All the reads below will be interpreted as 2nd in pair:
@HJCMTCCXX160113:5:1101:7760:55965/2_RG:Z:HJCMT.5@HJCMTCCXX160113:5:1101:7760:55965/2@HJCMTCCXX160113:5:1101:7760:55965/2_foo
All the reads below will be interpreted as not 2nd in pair:
@HJCMTCCXX160113:5:1101:7760:55965/1_RG:Z:HJCMT.5@HJCMTCCXX160113:5:1101:7760:55965/1@HJCMTCCXX160113:5:1101:7760:55965@HJCMTCCXX160113:5:1101:7760:55965/3@HJCMTCCXX160113:5:1101:7760:55965_2@HJCMTCCXX160113:5:1101:7760:55965_foo_/2
If your reads are paired, but the second-in-pair reads are not properly marked, recalibration may be less effective, though I haven’t seen data to indicate that.
Benchmark Read Matching¶
For benchmark to properly match the read in a FASTQ with the read
in the BAM file, the first _ delimited field must match the read name,
minus any /1 or /2 parts of the read name. For example, to match
the 2nd in pair read HK2WYCCXX160124:1:1219:24545:4315, the FASTQ represented
by FASTQ lines,
@HJCMTCCXX160113:5:1101:7760:55965/2_RG:Z:HJCMT.5@HJCMTCCXX160113:5:1101:7760:55965/2@HJCMTCCXX160113:5:1101:7760:55965/2_foo
will all work. However, to match a READ2-flagged read, benchmark must be able to determine that the read is 2nd in pair. Read the discussion above for how the program determines this. Reads flagged as READ1 or without either READ1 or READ2 flags set in the bam can both be matched with FASTQ reads that aren’t interpreted as 2nd in pair.
Converting from BAM to FASTQ¶
Recalibrating reads from a BAM file is not yet supported.
Use the samtools fastq command to convert your reads
to fastq format. This should go something like:
samtools sort -n -O bam input.bam > namesorted.bam
samtools fastq -t -N -F 3844 -O -0 /dev/null -s /dev/null -1 reads.1.fq -2 reads.2.fq namesorted.bam
This will sort your input by read name (required for samtools fastq),
-t will append RG tags to the output read name (currently required for recalibration),
-N add /1 or /2 to the output read name (also currentlyrequired for recalibration),
-F will exclude unmapped, not primary alignment, QC failed, optical duplicate, and supplementary alignment reads,
and -O will use OQ tags to obtain the read quality if available.
Reads without a READ1 or READ2 flag and singletons will not be output, while READ1 reads will be output to reads.1.fq
and READ2 reads will be output to reads.2.fq.
Reads must then be interleaved for input. This can be done with a tool like seqtk:
seqtk mergepe reads.1.fq reads.2.fq > reads.merged.fq
If you used the -t option to add RG tags, you’ll want to remove the spaces inserted by samtools
as many error correctors won’t support them. Currently kbbq enforces a _ character delimiter,
but this requirement will be eased in future releases.
The tr command can efficiently replace the spaces with _ like this:
cat reads.merged.fq | tr ' ' _ > reads.fixed.fq