Example Recalibration Workflow

This tutorial will guide you through a full recalibration workflow with kbbq using an example data set. The data set comes from [Li_2018]. It is a variant-calling benchmarking data set of sequenced human hydatidiform moles.

[Li_2018](1, 2) Li H, Bloom JM, Farjoun Y, Fleharty M, Gauthier L, Neale B, MacArthur D (2018) A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat Methods, 15:595-597. [PMID:30013044]

Software Requirements

This tutorial will require slightly more software than kbbq alone. These are somewhat common tools that appear in many bioinformatic workflows and will be used to download and manipulate the data. We’ll need:

  1. samtools
  2. bcftools
  3. seqtk
  4. lighter

Samtools and bcftools follow the download, ./configure, make, make install paradigm. seqtk does not require ./configure, and lighter will require manual installation to a directory on your PATH.

Note

These installation instructions will install the programs to the ~/.local directory, which is where pip installs programs with the --user flag. If you want to install somewhere else, modify the appropriate --prefix, BINDIR or install argument.

To download and install samtools:

wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
bunzip samtools-1.9.tar.bz2
./configure --prefix=~/.local
make
make install

To install bcftools:

wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
bunzip bcftools-1.9.tar.bz2
./configure --prefix=~/.local
make
make install

To download and install seqtk:

git clone https://github.com/lh3/seqtk.git
cd seqtk
make
make install BINDIR=~/.local/bin

To download and install lighter:

git clone https://github.com/mourisl/Lighter.git
cd Lighter
make
install lighter ~/.local/bin #install is like cp

Finally, you’ll need to have kbbq installed. The fast way to do this is:

pip install --user git+https://github.com/adamjorr/kbbq.git

Follow the instructions in Installation for more information.

Note

These instructions install to the .local/bin directory in your home directory. This follows the convention pip install --user uses, and is probably already on your PATH. If it isn’t, edit ~/.profile and add the following line:

export PATH=${PATH}:~/.local/bin

Then run:

. ~/.profile

to reload the variable.

Downloading the Data

Start by making a new directory we can use to store the files we’ll create in this tutorial:

mkdir kbbq_tutorial
cd kbbq_tutorial

As an example dataset, we’ll use the [Li_2018] CHM1/13 data set. This is a variant benchmarking dataset that includes:

  1. bam alignment
  2. BED file designating confident regions
  3. VCF file containing variable sites

Note

The only data required to recalibrate are the reads (in this case, from the BAM file.) We use the BED to subset the BAM to a reasonable size. If you aren’t going to benchmark, download and subset your bam and skip directly to Correcting Reads

We’ll start by downloading the CHM1/13 evaluation kit and subsetting it to a more reasonable size for a tutorial. The following lines download the tar archive, extract the BED file, uncompress it, and take the first 25 lines:

wget https://github.com/lh3/CHM-eval/releases/download/v0.5/CHM-evalkit-20180222.tar
tar -xnf CHM-evalkit-20180222.tar --to-stdout CHM-eval.kit/full.37m.bed.gz | \
zcat | \
head -n 25 > confident.bed

This will extract the first 10 confident regions to a file called confident.bed. We will limit our analyses to these 10 regions.

The evaluation kit also includes a VCF of variable sites. We will use bcftools view to subset the whole file to just the confident regions we picked earlier:

tar -xnf CHM-evalkit-20180222.tar --to-stdout CHM-eval.kit/full.37m.vcf.gz | \
bcftools view -T confident.bed -Oz -o confident.vcf.gz

Note

A reference is only required for the benchmark command. If you’re not interested in benchmarking, feel free to skip the reference download.

We also need the reference for the benchmark command. We only need the chromosomes specified in our region file. To avoid downloading the whole file, we can download the reference index first and then just specify the chromosomes that are present in the regions file. To get the chromosomes in the regions:

cut -f1 confident.bed | sort | uniq > chromosomes.txt

We can cat the contents of that file to xargs to have xargs put each chromosome on the end of the command line. So to download and subset the reference:

wget http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta.fai
cat chromosomes.txt | \
xargs samtools faidx http://www.broadinstitute.org/ftp/pub/seq/references/Homo_sapiens_assembly19.fasta > ref.fa

Now we need to download the BAM file. In my experience, downloading from the internet is much faster if we specify the regions on the command line instead of using the -L flag, so we’ll use awk to convert the regions in the BED to the proper format for samtools and xargs to add the regions to the end of the command line:

cat confident.bed | \
awk 'BEGIN{OFS=""}{print $1,":",($2+1),"-",$3}' > confident.regions

Since this is a huge amount of data, it’s better to download the index first and use samtools to just grab the reads that intersect the confident regions we want. Even though we do this, the download will still take a minute or so:

wget ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR134/ERR1341796/CHM1_CHM13_2.bam.bai
cat confident.regions | \
xargs samtools view -h -b -M -q 1 -F 3844 -o confident.bam ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR134/ERR1341796/CHM1_CHM13_2.bam

Once this download finishes you should have 4 important files:

  1. confident.bed
  2. confident.vcf.gz
  3. confident.bam
  4. ref.fa

Note

You only need confident.bam if you plan to skip the benchmarking example.

Recalibrating a BAM file

Warning

Recalibrating a BAM file is not yet supported. We hope to support this feature soon in an upcoming release. Once this feature is enabled, instructions to do so will be here.

Correcting Reads

The current implementation detects read errors by comparing the original read with a corrected one output from an error correction method. In this tutorial we use lighter [Song_2014], but any sufficiently-accurate method will work.

[Song_2014]Song, L., Florea, L. and Langmead, B., Lighter: Fast and Memory-efficient Sequencing Error Correction without Counting. Genome Biol. 2014 Nov 15;15(11):509.

Note

We intend to implement a similar algorithm to make it easier to recalibrate data without having to correct and do other processing steps. Please look forward to it.

First, we need to extract the reads in the BAM. For more information about this, check out Converting from BAM to FASTQ. To do this, sort the reads by name and use the samtools fastq command. We will discard non-paired and singleton reads by setting the output files for those reads to /dev/null:

samtools sort -@ 4 -n -o confident.nsorted.bam -O bam confident.bam
samtools fastq -t -N -F 3844 -O -0 /dev/null -s /dev/null -1 reads.1.fq -2 reads.2.fq confident.nsorted.bam

Here, the -n flag specifies that we should sort the reads by name. The -t flag to fastq tells samtools to add the RG, BC and QT flags to the read name. This will allow kbbq to place the reads in the proper read group. The -N flag adds /1 or /2 to the read name to specify pair information, and the -O flag says to use the original quality scores assigned to the read rather than any updated quality scores.

We will then merge the fastq files to interleaved format. We will also convert the whitespace to ‘_’ characters using the tr command:

seqtk mergepe reads.1.fq reads.2.fq | tr ' ' _ > reads.fq

Now we correct the reads with lighter. lighter requires an output directory, so we’ll need to create that before running the command:

mkdir lighter

Additionally, we need to calculate the best parameters to use for the correction. Lighter requires 3 parameters: a k-mer size, a genome size, and a sampling rate. I tend to use k=32 because that seems to work well. If you don’t know your approximate genome size, take a guess. To calculate the sampling rate, the lighter authors recommend using 7 / avg. read depth. You can estimate the depth by multiplying the number of reads by the read length and dividing by the genome size.

You can also use samtools depth and a short awk script to calculate these parameters:

samtools depth -b confident.bed -m0 confident.bam | \
awk '{x+=$3}END{print "bases:", NR, "\ndepth:", x/NR, "\nalpha:", 7/(x/NR)}'

The -m0 option set here ensures depth doesn’t limit the maximum depth of any site. The awk script adds up the 3rd column (specifying the number of bases) then prints the relevant information after all records are processed, dividing by the number of records (= the number of sites) to calculate the depth. This should output something like:

bases: 73465
depth: 49.1876
alpha: 0.142312

Now we can use lighter:

lighter -r reads.fq -k 32 70000 .14 -od lighter

The output file we’re interested in is lighter/reads.cor.fq, which we’ll copy to our tutorial directory while removing the space characters and replacing them with underscores as before:

cat lighter/reads.cor.fq | tr ' ' _ > reads.cor.fq

Calibrating FASTQ Reads

The reference-free and alignment-free recalibration workflow works in three phases:

  1. Find errors in reads
  2. Build a model
  3. Correct reads with the model

This is similar to GATK’s method, but errors are detected by comparing corrected and uncorrected reads rather than by comparing the aligned read to the reference.

Now that all the data has been properly prepared, to recalibrate:

kbbq recalibrate --infer-rg -f reads.fq reads.cor.fq > reads.recalibrated.fq

The --infer-rg flag will ensure each read is assigned to its proper read group. For more information on how it does this, check out Inferring Read Groups.

Benchmarking Calibration

In this case, the dataset has been validated and can serve as a truth set. In a research environment, especially with nonmodel organisms, you won’t likely have a truth set you can use to check your calibration. benchmark requires a set of variable sites in a VCF file and a set of confident regions in a BED file. It assumes any site in the confident regions that are not variable but don’t match the reference are errors.

To check the calibration, we’ll use the benchmark and plot commands. Benchmark requires that the names in the FASTQ files match up with the reads in the BAM file that comprises the truth set. Read Benchmark Read Matching for more information. Benchmark can also benchmark reads in a BAM file if a FASTQ file isn’t provided. To show both usages and to make a more interesting plot, we’ll benchmark the original qualities, the updated qualities in the bam, and the qualities produced by KBBQ. We use the -l option to specify the label that will go in the benchmark file and on the plot.

First, to benchmark the BAM reads:

kbbq benchmark -l BAM -b confident.bam -r ref.fa -v confident.vcf.gz > bam.benchmark.txt

To benchmark the original qualities from the BAM:

kbbq benchmark -l Original --use-oq -b confident.bam -r ref.fa -v confident.vcf.gz > oq.benchmark.txt

And to benchmark the qualities KBBQ assigned from the new fastq files:

kbbq benchmark -l KBBQ -b confident.bam -r ref.fa -v confident.vcf.gz -f reads.recalibrated.fq > kbbq.benchmark.txt

Plotting the Benchmark

plot is designed to make it as simple as possible to plot the output of benchmark. Since we have multiple benchmark files, we can plot them individually by calling plot multiple times or concatenate them into one file to plot all 3 lines on one plot. To do this:

cat bam.benchmark.txt oq.benchmark.txt kbbq.benchmark.txt | \
kbbq plot -o plot.pdf

The output name is passed directly to matplotlib.pyplot.savefig(), so the output type is determined by the name.

The default type of plot made is calibration, however, plot can also plot sample sizes, which may be informative if you want to know how many bases of each quality score are in your dataset. To plot these, use the -t option to change the plot type to sample-size:

cat bam.benchmark.txt oq.benchmark.txt kbbq.benchmark.txt | \
kbbq plot -t sample-size -o sample-size.pdf