Module Reference

kbbq

A package for reference-free base quality score recalibration.

kbbq.recaltable

Provides utilities for interfacing with GATK Recalibration tables.

class kbbq.recaltable.GATKReport(tables, version='1.1')[source]

Bases: object

A base class representing a GATK Report file that contains many tables.

Attributes

  • tables - a list of GATKTable objects
  • version - the GATKReport format version. The default is
    '1.1'.

Methods

__init__(tables, version='1.1')[source]

Initialize a GATKReport.

This module was written by reverse engineering a version 1.1 report. If you have a table with a newer version, please file a bug report, especially if this module fails to parse it.

Parameters:
  • tables (list(GATKTable)) – a list of tables.
  • version (str) – The report version as a string.
Returns:

a representation of a report

Return type:

GATKReport

__str__()[source]

A string containing the report.

Each table is separated by two newlines.

classmethod fromfile(filename)[source]

Initialize the GATKReport from a file.

Parameters:filename (str) – The file to be read.
Returns:a representation of a report
Return type:GATKReport
get_headerstring()[source]

Get the header line.

The header line identifies the file as a report, shows the version, and shows the number of tables that should be present (such as #:GATKReport.v1.1:5). This method returns a string that does not include a trailing newline.

Returns:the header line without a newline.
Return type:str
tables = None

A list of GATKTable objects.

version = None

The report version identifier (such as '1.1')

write(filename)[source]

Write the report to filename.

Parameters:filename (str) – The file name to write.
class kbbq.recaltable.GATKTable(title, description, data)[source]

Bases: object

A class representing a GATK report table.

The intended method of interacting with the table is using the data attribute, which is a standard Pandas dataframe.

Attributes

  • title - The title of the table
  • description - A description of the table. May be ''.
  • data - A Pandas dataframe containing the table data. Accessing this
    attribute is the primary way to interact with the data.
  • typemap - A dict mapping types to a character to be used
    for constructing the format string.
  • precisionmap - A dict mapping column headers to precision
    to be used for constructing the format string.

Methods

__init__(title, description, data)[source]

Initialize the table.

Parameters:
  • title (str) – the table title
  • description (str) – a short description of the table. May be ''
  • data (pandas.DataFrame) – the data contained by the table
__str__()[source]

Return str(self).

data = None

A Pandas dataframe containing the table data. Accessing this attribute is the primary way to interact with the data.

description = None

A short description of the table.

classmethod fromstring(tablestring)[source]

Initialize the table from a string

The first line of the table contains the a string describing the table format and is like #:GATKTable:ncol:nrow:%f:%f:%f:;. It is : delimited and ; terminated. The last format string will be empty. The second line of the table specifies the name of the table. It is formatted like #:GATKTable:title:description. description can be an empty string. It is : delimited. The third line contains the header. The header and row data are delimited by two or more spaces. The number of spaces are determined by padding each value in the row such that the the character width of the column is constant and can represent all the data in the column. Strings (including column headings) are left justified and numeric types are right justified.

get_colfmts()[source]

Get the formats for each column in the dataframe.

Uses typemap and precisionmap to create the strings. This will return something like ['%s', '%s', '%.2f', '%.4f']

Returns:format strings for each column
Return type:list(str)
get_datastring()[source]

Get the data string. Includes the header line and the data in the table.

Returns:the data table as a formatted string
Return type:str
get_fmtstring()[source]

Get the format string by inspecting the dataframe.

Uses the dtypes, number of rows, and number of columns in the data frame along with the strings returned by get_colfmts() to create the format string

which is formatted like #:GATKTable:ncol:nrow:%f:%f:%f:;.

Returns:The format string
Return type:str
get_ncols()[source]

Get the number of columns in the dataframe.

Returns:The number of columns in the dataframe.
Return type:int
get_nrows()[source]

Get the number of rows in the dataframe.

Returns:The number of rows in the dataframe.
Return type:int
get_titlestring()[source]

Get the titlestring.

Uses the title and description to create the title line. It is formatted like #:GATKTable:title:subtitle.

Returns:the title string
Return type:str
get_unindexed()[source]

Return a copy of the dataframe without the indexes included.

If the table has an unnamed index, such as the default index, return a copy of the data frame. Otherwise, reset the index.

static parse_fmtstring(header, fmtstring)[source]

Turn a fmtstring into a dictionary of {col_title : type}

Parameters:
  • header (list(str)) – The column titles as a list
  • fmtstring (str) – The format string to parse
Returns:

A dictionary mapping {col_title : type}

Return type:

dict(str, type)

precisionmap = None

A dictionary of {colname : str} to determine the precision string to prepend to each formatstring. This only affects get_fmtstring().

title = None

The title of the table.

typemap = None

A dictionary of {type : str} to determine the string for each type.

Note that in the current implementation this only affects get_fmtstring().

write(filehandle)[source]

Write the table to a filehandle.

Adds a newline to the end that isn’t present in __str__()

Parameters:filehandle (File Object) – The file object to write the table to.
Returns:Whatever the file object’s write method returns. Usually the number of characters written.
Return type:Usually int
class kbbq.recaltable.RecalibrationReport(tables, version='1.1')[source]

Bases: kbbq.recaltable.GATKReport

A class representing a GATK Recalibration Report used for Base Quality Score Recalibration.

It extends the GATKReport, so the primary method of interacting with the table is the tables attribute. The primary difference is the __init__() function will set indices and data types as they should be for a recalibration report.

  • Table tables[0] contains the arguments used to call the program
    that generated the table.
  • Table tables[1] contains the quanization map
  • Table tables[2] contains read group error information.
  • Table tables[3] contains error information subset by read group,
    then reported quality score.
  • Table tables[4] contains error information subset by read group,
    then reported quality score, then covariate name and value. Both Context and Cycle are valid for the covariate name.

See the table below for a schema.

Table Number of Columns Variables
0 2 Argument, Value
1 3 QualityScore, Count, QuantizedScore
2 6 ReadGroup, EventType, EmpiricalQuality, EstimatedQReported, Observations, Errors
3 6 ReadGroup, QualityScore, EventType, EmpiricalQuality, Observations, Errors
4 8 ReadGroup, QualityScore, CovariateValue, CovariateName, EventType, EmpiricalQuality, Observations, Errors

For current GATK parameters, EventType will be M for mismatch though I for insertion and D for deletion are possible. Valid values for CovariateName are Context and Cycle.

__init__(tables, version='1.1')[source]

Initialize the recalibration report from a list of GATKTable s.

This initializes the tables with some data wrangling to set indices and data types properly. If you don’t want this, use a GATKReport instead.

__str__()[source]

GATKReports list CovariateValue before CovariateName, which doesn’t make much sense for organizing the index. So we switch the columns before printing and switch them back after we get the string.

kbbq.compare_reads

A mish-mash of functions for recalibration. Should probably be renamed in the near future.

class kbbq.compare_reads.Dinucleotide[source]

Bases: object

A class to cache dinucleotides and maintain a consistent dinuc -> int map throughout the module.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

Dictionary for complementing nucleotides

dinuc_to_int = {'AA': 0, 'AC': 3, 'AG': 2, 'AT': 1, 'CA': 12, 'CC': 15, 'CG': 14, 'CT': 13, 'GA': 8, 'GC': 11, 'GG': 10, 'GT': 9, 'TA': 4, 'TC': 7, 'TG': 6, 'TT': 5}

Dictionary mapping dinuc -> int

dinucs = ['AA', 'AT', 'AG', 'AC', 'TA', 'TT', 'TG', 'TC', 'GA', 'GT', 'GG', 'GC', 'CA', 'CT', 'CG', 'CC']

List of valid dinucleotides

nucleotides = ['A', 'T', 'G', 'C']

List of valid nucleotides

classmethod veccomplement(*args, **kwargs)[source]
classmethod vecget(*args, **kwargs)[source]
vectorized_complement = <numpy.vectorize object>
vectorized_get = <numpy.vectorize object>
class kbbq.compare_reads.RescaledNormal[source]

Bases: object

A class to cache the rescaled normal prior used in the bayesian recalibration model.

Attributes

Methods

  • prior() - get the prior probability for a given quality score difference

Most of these attributes are nonsense; the “proper” way to interact with the class is via the prior_dist array. If that makes you uncomfortable, use the provided accessor function prior().

Under no circumstances should you attempt to replace any of these attributes, it will most likely not have the desired effect. The caching mechanism here only works because the class attributes are immediately instatiated when the class is created, so by the time you replace them it won’t matter. Reimplement it yourself if you want a different prior.

i = 93
maxscore = 93

The maximum quality score supported by this class.

oldset = {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
possible_diffs = array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93])
classmethod prior(difference)[source]

Return the prior probability for a given difference in quality score.

Parameters:difference (int) – The difference in quality score
Returns:the prior probability
Return type:np.longdouble
prior_dist = array([-1.05360516e-01, -2.10536052e+00, -8.10536052e+00, -1.81053605e+01, -3.21053605e+01, -5.01053605e+01, -7.21053605e+01, -9.81053605e+01, -1.28105361e+02, -1.62105361e+02, -2.00105361e+02, -2.42105361e+02, -2.88105361e+02, -3.38105361e+02, -3.92105361e+02, -4.50105361e+02, -5.12105361e+02, -5.78105361e+02, -6.48105361e+02, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf], dtype=float128)
kbbq.compare_reads.bamread_get_oq(read)[source]
kbbq.compare_reads.fastq_cycle_covariates(read, secondinpair=False)[source]
kbbq.compare_reads.fastq_dinuc_covariates(read, minscore=6)[source]
kbbq.compare_reads.fastq_infer_rg(read)[source]

Infer the read group from appended read information, such as produced by the samtools fastq command.

Requires the read to be formatted such the rg tag is added to the end of the read name delimited by a _. Returns the read group tag.

kbbq.compare_reads.fastq_infer_secondinpair(read)[source]
kbbq.compare_reads.find_read_errors(read, ref, variable)[source]

Use the CIGAR to find errors in the read or sites to skip.

Softclipped bases will be added to the skip array. Returns a tuple with the error array and the skip array. We don’t consider indel errors.

kbbq.compare_reads.gatk_delta_q(prior_q, numerrs, numtotal, maxscore=42)[source]

Calculate the shift in quality scores from the prior given data.

This is achieved by finding the difference between the maximum a posteriori and the prior point estimate of Q.

kbbq.compare_reads.generic_cycle_covariate(sequencelen, secondinpair=False)[source]
kbbq.compare_reads.generic_dinuc_covariate(sequences, quals, minscore=6)[source]
kbbq.compare_reads.get_rg_to_pu(bamfileobj)[source]
kbbq.compare_reads.get_var_sites(vcf)[source]

Get positions covered by any record in a VCF file.

Pass the file name. Each value in the returned dict will be a list of 0-based positions.

kbbq.compare_reads.load_positions(posfile)[source]

Get positions that are covered by a non-gzipped BED file.

Pass the file name; it will be opened with python.open(). This is slightly different than benchmark.get_bed_dict() because it doesn’t require a refdict and only provides a list of 0-based positions, not a boolean array of all positions in the reference. It also doesn’t support gzipped files.

kbbq.compare_reads.nparray_to_qualitystring(a, offset=33)[source]
kbbq.compare_reads.p_to_q(p, maxscore=42)[source]
kbbq.compare_reads.print_info(*args, **kwargs)[source]

Print input to stderr prepended by a timestamp.

kbbq.compare_reads.q_to_p(q)[source]
kbbq.compare_reads.recalibrate_fastq(read, meanq, globaldeltaq, qscoredeltaq, positiondeltaq, dinucdeltaq, rg, dinuc_to_int, secondinpair=False, minscore=6, maxscore=42)[source]
kbbq.compare_reads.regression_recalibrate(lr, q)[source]
kbbq.compare_reads.train_regression(rawquals, corrected, tol=1e-08)[source]
kbbq.compare_reads.tstamp()[source]

Return the current time up to the second in ISO format as a string.

Returns:current time with brackets
Return type:str

kbbq.plot

Utilities for plotting calibration and other relevant info.

kbbq.plot.plot_benchmark(fhin, outfile, plottype='calibration')[source]

Plot predicted vs. true probability or size of each quality bin.

This plots the predicted quality score (x) against the true quality score (y) for each quality bin. This is done by calculating the mean probability of error for each bin and converting that probability back to a quality score.

With plottype = ‘sample-size’, instead plots the size of each predicted quality score bin.

Parameters:
  • fhin (python.IOBase) – file containing benchmark data
  • outfile (str) – name of file to save plot to

kbbq.benchmark

Utilities for benchmarking calibration methods.

kbbq.benchmark.benchmark(bamfile, fafile, vcffile, fastqfile=None, label=None, use_oq=False, bedfile=None, namedelimiter='_')[source]

Perform the benchmark and print the results to stdout.

If a fastqfile is not given, the benchmark will be taken from the reads in the BAM file. If a fastqfile is given, it should have the same read names as the reads in the BAM file. The names will then be used to look up where errors occur in the reads.

kbbq.benchmark.benchmark_files(bamfh, ref, fullskips, use_oq=False, fastqfh=None, namedelimiter='_')[source]

Benchmark the given files and return the actual and predicted Q.

If fastqfh is None (the default), reads from the BAM will be directly benchmarked. Otherwise, names from the fastq files will be used to look up

kbbq.benchmark.calculate_q(numerrs, numtotal)[source]

Calculate Actual Q and Predicted Q given an array containing counts of errors and an array containing total counts for each quality score.

The index of the returned array represents the predicted quality score, the value represents the actual quality score or the number of bases.

kbbq.benchmark.get_bam_readname(read)[source]

Given a bam read, get a canonicalized name. The name is the query name + a suffix based on whether it is first or second in pair.

kbbq.benchmark.get_bed_dict(refdict, bedfh)[source]
kbbq.benchmark.get_error_dict(bamfile, refdict, fullskips)[source]

Finds errors given an authoritative BAM, ref, and sites to skip.

The returned dict has readname keys and tuple(np.array, np.array) of bools as values.

kbbq.benchmark.get_fastq_readname(read)[source]

Given a fastq read, get a canonicalized name. This is based on whether the read is first in pair.

kbbq.benchmark.get_full_skips(refdict, var_sites, bedfh=None)[source]
kbbq.benchmark.get_ref_dict(reffilename)[source]
kbbq.benchmark.get_var_sites(vcf)[source]
kbbq.benchmark.open_bedfile(bedfile)[source]

Attempt to open the given bedfile and return a filehandle.

If bedfile is None, None will be returned.

This is a context manager; the file will be closed once it goes out of scope.

kbbq.benchmark.print_benchmark(actual_q, label, nbases)[source]

Print benchmark data to a tab separated table.

Label should be a string and is used for all values in the table. No headers are printed. This means the benchmark command can be called many times and each table can be concatenated together.

kbbq.read

Class and functions for read data.

This module includes the ReadData class and a few utility functions for working with BAM reads of type pysam.AlignedSegment.

Classes

Functions

class kbbq.read.ReadData(seq, qual, skips, name, rg, second, errors)[source]

Bases: object

A class that represents some minimal information from a sequencing read.

Since this won’t hold as much data as a BAM read, if you want to recalibrate a read you should turn it into a ReadData object, then update the original read with the new information; otherwise you’ll lose annotations. This class also manages read group information. You should instantiate the object from one of the class methods instead of directly instantiating it if possible. You should never assign directly to any of the class attributes; treat them as read-only. Instance variables should be fine to manipulate.

Class Attributes

  • rg_to_pu - Dict of read group id’s -> platform unit
  • rg_to_int - Dict of read group id’s -> int
  • numrgs - int number of read groups

Instance Attributes

  • seq - Numpy array of characters representing the sequence
  • qual - Numpy array of int representing the quality score
  • skips - Numpy array of bools representing sites to skip
  • name - string representing the name of the read
  • rg - int representing the read group the read belongs to
  • second - bool representing whether the read is 2nd in pair
  • errors - Numpy array of bools representing whether the base is an error

Class Methods

Instance Methods

Note that from https://docs.python.org/3/library/stdtypes.html#dict , iteration order is guaranteed to be in insertion order. Thus we are OK saving the rg as an int on the fly so long as we don’t remove any read groups from the rg_to_pu dictionary. So don’t do that! In fact, we should probably remove __del__() and pop() from the dicts…

__getitem__(key)[source]

Return a read subset by key.

__init__(seq, qual, skips, name, rg, second, errors)[source]

Initialize self. See help(type(self)) for accurate signature.

__len__()[source]

Return sequence length.

Returns:sequence length
Return type:int
__setitem__(key, value)[source]

Set key given another read (or an object with the same attributes). Note only the arrays will be modified; name, rg, and second should all be set directly. The key should probably be a slice that is the size of len(value).

__str__()[source]

A string representation of the read.

Returns:string representation of read
Return type:str
canonical_name()[source]

The name with an added suffix based on whether the read is firstinpair or not.

If the read has its second flag set to True, /2 is added to the end. Otherwise, /1 is added.

Returns:read name with a suffix
Return type:str
errors = None

numpy.ndarray of bools representing whether the base is an error

classmethod from_bamread(bamread, use_oq=False)[source]

ReadData factory that instantiates a ReadData object from a pysam AlignedSegment.

Skips and errors are initialized to be empty bool arrays. If you need to use these attributes, make sure you set them the way you need to for your use case.

If use_oq is True, use the OQ tag for base quality. The default is to use regular quality scores.

If the read group hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of rg. To load the dictionary, call load_rgs_from_bamfile() first. If there is no RG tag for the read it will be given a generic RG of None and lumped in with all other reads that have no RG tag.

This will reverse-complement the sequence and reverse the qualities if the read aligns on the reverse strand.

Parameters:
Returns:

a read object

Return type:

ReadData

classmethod from_fastq(fastqread, rg=None, second=None, namedelimiter='_')[source]

ReadData factory that instantiates a ReadData object from a pysam FastqProxy.

Skips and errors are initialized to be empty bool arrays. If you need to use them, make sure you set them the way you need to for your use case. The read name will be set to the first field of the delimited read name, minus any trailing /1 or /2 if they exist.

If rg is None (the default), we will attempt to infer the read group id from the read name. To infer rg, the read group must be in its own field with the field beginning with RG:, such as RG:example, with fields delimited by namedelimiter. When multiple fields are found that begin with RG:, the last is chosen to be the true ID. If inference fails, the read group will remain None.

If the read group (either inferred or provided explicitly) hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of rg.

If second is None (the default) we will attempt to infer if the read is second in pair based on the name. To infer second, the first field of the read name must end with /2. If the last 2 characters of the first field are not /2, second will be inferred to be false.

Parameters:
  • fastqread (pysam.FastqProxy) – The fastq read to get data from
  • rg (str) – the read group the read belongs to
  • second (bool) – whether the read is 2nd in pair
  • namedelimiter (str) – the delimiter for parsing the read name
Returns:

a read object

Return type:

ReadData

get_cycle_array()[source]

Return an array of cycle values.

This is range(len(seq)) if second is false, otherwise it is -1..-len(seq) inclusive. This way we can hold cycle values as an array of size 2 * seqlen and store negative values at the end of the array.

Returns:cycle values
Return type:numpy.ndarray (int)
get_cycle_errors()[source]

Return an array of cycle values that were errors and of all valid cycle values.

The errors are always a subset of the valid values.

Returns:erroneous and valid cycle values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_dinuc_errors(minscore=6)[source]

Return an array of dinucleotide values that were errors and of all valid dinucleotide values.

The errors are always a subset of the valid values.

Returns:erroneous and valid dinucleotide values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_dinucleotide_array(minscore=6)[source]

Return an array of dinucleotide values.

The character to int map is stored in kbbq.compare_reads.Dinucleotide.

Returns:dinucleotide values
Return type:numpy.ndarray (int)
get_pu()[source]

Return the PU from the rg_to_pu dict.

Returns:The Read Group’s PU tag
Return type:str
get_q_errors()[source]

Return an array of q values that were errors and of all valid q values.

The errors are always a subset of the valid values.

Returns:erroneous and valid q values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_rg_errors()[source]

Return an array of rg values that were errors and all valid rg values.

The errors are always a subset of the valid sites.

For example, a read with 4 non skipped sites and 1 error will return ([rg], [rg, rg, rg, rg])

Returns:errors and valid rg values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_rg_int()[source]

Return the RG as an int suitable for indexing rather than as the actual string RG ID.

Returns:read group index
Return type:int
classmethod load_rgs_from_bamfile(bamfileobj)[source]

Load read group IDs and PUs from the header of the pysam bamfileobj.

TODO: figure out what to do with this when there are no read groups in the header.

Recalibration is done on a PU basis, so when building a model using reads in a BAM, the PU for each read is obtained by looking up the PU associated with its read group. The dictionary that controls these lookups is a class variable. To store the RG data in an int form, an rg-to-int dict is also loaded. This should be equal to dict(zip(rg_to_pu, range(len(rg_to_pu))))

Parameters:bamfileobj (pysam.AlignmentFile) – The opened bam file
name = None

string representing the name of the read

not_skipped_errors()[source]

Return a logical and of not skips and errors

Returns:array of valid errors
Return type:numpy.ndarray (bool)
numrgs = 0

Number of readgroups encountered so far.

qual = None

numpy.ndarray of int representing the quality score

rg = None

int representing the read group the read belongs to

rg_to_int = {}

Dict mapping RG ids to integer indices.

rg_to_pu = {}

Dict mapping RG ids to the PU tag for the read group.

second = None

bool representing whether the read is 2nd in pair

seq = None

numpy.ndarray of characters representing the sequence

skips = None

numpy.ndarray of bools representing sites to skip

str_qual(offset=33)[source]

Get the quality of this read as a list of characters.

offset (default 33) will be added to the ASCII value of each character.

Parameters:offset (int) – Offset to add
Returns:read quality
Return type:list(chr)
kbbq.read.bamread_get_oq(read, offset=33)[source]

Get the OQ of the given bamread as an array of int.

offset (default 33) will be subtracted from the ASCII value of each character.

Parameters:
Returns:

quality scores

Return type:

numpy.ndarray of int

kbbq.read.bamread_get_quals(read, use_oq=False)[source]

Return the qualities of a pysam bam read as an array.

If use_oq = True, use the OQ tag for base quality. The default is to use regular quality scores.

Parameters:
Returns:

quality scores

Return type:

numpy.ndarray of int

kbbq.covariate

A data class and helper functions for read covariates.

This module includes the Covariate class and subclasses, as well as the pad_axis() helper function for appending to the end of an arbitrary axis of an numpy.ndarray.

Classes
Functions
  • pad_axis() - helper function to pad an np.ndarray with zeros.
class kbbq.covariate.Covariate(shape=0)[source]

Bases: object

A base class for covariates.

Holds 2 np.ndarray, errors for number of errors and total for the total number of observations of this covariate. The indices of each array are they keys; ie. the number of times covariate 2 was observed can be found with self.total[2].

Warning

Don’t assign directly to errors or total. shape assumes both arrays have the same shape and only returns the shape of total. If the arrays’ shapes become different somehow, __getitem__() and __setitem__() may start throwing errors unexpectedly. If you want to use two arrays that aren’t guaranteed to have the same size, use two arrays. This isn’t the right class for that use case.

__getitem__() will pass the key to both arrays and return a tuple of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.

__setitem__() will also pass the key to both arrays, and will expect value to be a tuple of values, setting the first value to errors and the second to total.

Attributes

Methods

__getitem__(key)[source]

Pass key to each np.ndarray and return the result as a tuple.

The result will be in order of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.

Parameters:key (slice) – indices to get
Returns:number of errors and number of total observations
Return type:tuple(int, int) or tuple( numpy.ndarray , numpy.ndarray )
__init__(shape=0)[source]

Initialize the arrays with the given shapes.

Parameters:tuple (shape) – desired shape of error and total arrays
Returns:empty object
Return type:Covariate
__setitem__(key, value)[source]

Pass key to each np.ndarray and assign each value.

Value must be a tuple of values to assign. The first value is assigned to errors and the second to total.

Parameters:
__str__()[source]

Return str(self).

errors = None

np.ndarray (int) holding number of errors observed for each key.

increment(idx, value=(1, 1))[source]

Increment the given indices by value.

idx should be a tuple of values interpretable by np.add().

The first indices given will increment the error array. The second indices given will increment the total array.

Note that this is implemented with np.add.at(). For example, if idx[0] is [2,2], errors [2] will be incremented twice. This is equivalent to

errors[2] += 1
errors[2] += 1

If value[0] is 2, errors will be instead incremented by 4 and is equivalent to

errors[2] += 2
errors[2] += 2
Parameters:
  • idx (tuple( index , index )) – indices to increment array at
  • value (typle( int, int )) – value to add for each index given.
pad_axis(axis, n=1)[source]

Expand axis dimension by n.

Parameters:
  • axis (int) – axis to pad
  • n (int) – size of pad to apply
pad_axis_to_fit(axis, idx)[source]

Expand the axis until idx is a valid idx in that dimension.

Does nothing if idx is already valid. idx should be a single scalar value, not a slice.

Parameters:
  • axis (int) – axis to pad
  • idx (int) – index to ensure is valid
shape

Return the shape of the underlying arrays.

Returns:array shape
Return type:tuple(int)
total = None

np.ndarray (int) holding number of observations for each key.

class kbbq.covariate.CovariateData[source]

Bases: object

A class for holding all the covariate data needed to recalibrate reads.

Attributes

Methods

__init__()[source]

Initialize covariate arrays.

__str__()[source]

Return str(self).

consume_read(read)[source]

Add covariates from the read to the proper data arrays.

Parameters:read (kbbq.read.ReadData) – the read to take data from
cyclecov = None

Cycle covariates

dinuccov = None

Dinucleotide covariates

get_num_cycles()[source]

Return number of cycles in the cycle covariate

This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.

Returns:half the length of cycle vector
Return type:int
get_num_dinucs()[source]

Return number of dinucs in the dinuc covariate.

Returns:length of the dinuc vector
Return type:int
get_num_qs()[source]

Return number of qs in the q covariate

Returns:length of q vector
Return type:int
get_num_rgs()[source]

Return number of rgs in the rg covariate

Returns:length of rg vector
Return type:int
qcov = None

Q covariates

class kbbq.covariate.CycleCovariate[source]

Bases: kbbq.covariate.Covariate

A 3d covariate array in the read group, Q, and cycle dimensions.

pad_axis() is overriden here since the way cycles are stored means appending to the end of the array changes the location of the data.

Methods:

__init__()[source]

Initialize empty arrays.

num_cycles()[source]

Return number of cycles in the cycle covariate

This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.

Returns:half the length of cycle vector
Return type:int
pad_axis(axis, n=1)[source]

Expand the axis dimension by n.

This is overriden for the CycleCovariate since we use one array to handle both positive and negative cycle values. Appending 0’s to the end would ruin the array since negative indices are significant. Thus, we have to wrangle the existing data a bit.

Parameters:
  • axis (int) – axis to pad
  • n (int) – size of pad to apply
class kbbq.covariate.DinucCovariate[source]

Bases: kbbq.covariate.Covariate

A 3d covariate array in the read group, Q, and dinucleotide dimensions.

__init__()[source]

Initialize empty arrays.

num_dinucs()[source]

Return length of dinucleotide array.

Returns:number of dinucleotides
Return type:int
class kbbq.covariate.QCovariate[source]

Bases: kbbq.covariate.Covariate

A 2d covariate array in the read group and Q dimensions.

This class holds a RGCovariate to efficiently increment both objects while iterating through reads.

Attributes

Methods

__init__()[source]

Initialize an empty RGCovariate and initialize the empty q arrays.

__str__()[source]

Return str(self).

consume_read(read)[source]

Increment the rg and q covariates given a kbbq.read.ReadData

Parameters:read (kbbq.read.ReadData) – read to take covariate values from
Returns:erroneous and valid read group values and erroneous and valid q values
Return type:(( numpy.ndarray , numpy.ndarray ),( numpy.ndarray , numpy.ndarray ))
num_qs()[source]

Return size needed to hold largest q observed so far.

Returns:largest q + 1
Type:int
rgcov = None

RG covariates

class kbbq.covariate.RGCovariate[source]

Bases: kbbq.covariate.Covariate

A 1d covariate array in the read group dimension.

Methods

__init__()[source]

Initialize a 1D read group array.

Returns:empty object
Return type:Covariate
consume_read(read)[source]

Increment the rg covariates given a kbbq.read.ReadData

Parameters:read (kbbq.read.ReadData) – read to take covariate values from
Returns:erroneous and valid read group values
Return type:tuple( numpy.ndarray , numpy.ndarray )
num_rgs()[source]

Return size needed to hold largest rg observed so far.

Returns:largest rg + 1
Type:int
kbbq.covariate.pad_axis(array, axis, n)[source]

Pad the axis of np.ndarray with n zeros.

Parameters:
  • axis (int) – axis to pad
  • n (int) – size of pad to apply
Returns:

padded array

Return type:

numpy.npdarray

kbbq.gatk

This submodule contains code to emulate GATK’s BQSR and ApplyBQSR functions.

kbbq.gatk.bqsr

Utilities for emulating GATK’s BQSR tool.

BQSR model construction hard clips soft clips and trims adaptors. ApplyBQSR does not. So we need different functions for each.

kbbq.gatk.bqsr.bam_to_bqsr_covariates(bamfileobj, fastafilename, var_pos, minscore=6, maxscore=42)[source]

Given a BAM file object, FASTA reference file name and var_pos dict, get the standard covariate arrays.

kbbq.gatk.bqsr.bam_to_report(bamfileobj, fastafilename, var_pos)[source]
kbbq.gatk.bqsr.bamread_adaptor_boundary(read)[source]
kbbq.gatk.bqsr.bamread_bqsr_cycle(read)[source]
kbbq.gatk.bqsr.bamread_bqsr_dinuc(read, use_oq=True, minscore=6)[source]
kbbq.gatk.bqsr.quantize(q_errs, q_total, maxscore=93)[source]

This function doesn’t match the GATK version, but it’s not used so it’s not a priority.

kbbq.gatk.bqsr.trim_bamread(read)[source]
kbbq.gatk.bqsr.vectors_to_report(meanq, global_errs, global_total, q_errs, q_total, pos_errs, pos_total, dinuc_errs, dinuc_total, rg_order, maxscore=42)[source]

Turn the set of recalibration vectors into a kbbq.recaltable.RecalibrationReport object.

For the recalibration vectors, each dimension corresponds to a covariate. The first index is always the read group, and the second (if it exists) represents the raw quality score, the final index is either the cycle or dinucleotide covariate.

Parameters:
  • meanq (numpy.ndarray [:]) – Mean q for each read group
  • global_errs (numpy.ndarray [:]) – Number of errors for each read group
  • global_total (numpy.ndarray [:]) – Number of observations for each read group
  • q_errs (numpy.ndarray [:,:]) – Number of errors for each read group and q score subset.
  • q_total (numpy.ndarray [:,:]) – Number of observations for each read group and q score subset.
  • pos_errs (numpy.ndarray [:,:,:]) – Number of errors for each read group, q, and cycle subset.
  • pos_total (numpy.ndarray [:,:,:]) – Number of observations for each read group, q, and cycle subset.
  • dinuc_errs (numpy.ndarray [:,:,:]) – Number of errors for each read group, q, and dinucleotide subset.
  • dinuc_total (numpy.ndarray [:,:,:]) – Number of observations for each read group, q, and dinucleotide subset.
  • rg_order (list(str)) – The order of read groups
  • maxscore (int) – The maximum possible quality score
Returns:

the recalibration table

Return type:

kbbq.recaltable.RecalibrationReport

kbbq.gatk.applybqsr

Utilities for emulating GATK’s ApplyBQSR tool

BQSR model construction hard clips soft clips and trims adaptors. ApplyBQSR does not. So we need different functions for each.

class kbbq.gatk.applybqsr.ModelDQs(mean, rg, q, cycle, dinuc)

Bases: tuple

A class to hold the model delta Q’s and enforce a consistent ordering.

cycle

Alias for field number 3 Shift from the mean for each quality score. 3-dimensional numpy.ndarray with shape (number of read groups, number of quality scores, number of cycles)

dinuc

Alias for field number 4 Shift from the mean for each quality score. 3-dimensional numpy.ndarray with shape (number of read groups, number of quality scores, number of cycles)

mean

Alias for field number 0 The mean Q for each read group. 1-dimensional numpy.ndarray with length equal to the number of read groups.

q

Alias for field number 2 Shift from the mean for each quality score. 2-dimensional numpy.ndarray with shape (number of read groups, number of quality scores)

rg

Alias for field number 1 Shift from the mean for each read group. 1-dimensional numpy.ndarray with length equal to the number of read groups.

kbbq.gatk.applybqsr.bamread_cycle_covariates(read)[source]

Get the cycle covariate from a bam read.

Parameters:read (pysam.AlignedSegment) – read to get covariate values from
Returns:cycle covariate
Return type:numpy.ndarray of int
kbbq.gatk.applybqsr.bamread_dinuc_covariates(read, use_oq=True, minscore=6)[source]

Get the dinuc covariates from a bam read.

These range from 0 to 15 for each possible dinucleotide, or -1 for an invalid value. The first base is always invalid (since there is no base before it) and any base overlapping an N will be invalid. Any base with a quality score below minscore is also invalid.

Parameters:read (pysam.AlignedSegment) – read to get covariate values from
Returns:dinuc covariate
Return type:numpy.ndarray of int
kbbq.gatk.applybqsr.get_delta_qs(meanq, rg_errs, rg_total, q_errs, q_total, pos_errs, pos_total, dinuc_errs, dinuc_total)[source]

Find the covariate delta Q’s as arrays given the data arrays.

The input vector shapes are: [rg], [rg, q], [rg, q, covariate]. Returns a tuple of delta Q arrays. The indices specify the covariate values, the values specify how much the Q score should change based on its covariate values.

The preferred method of getting the covariate delta Q’s is the get_modeldqs_from_covariates() function.

Parameters:all (numpy.ndarray of int) – data array
Returns:Delta Q arrays (rgdeltaq, qscoredeltaq, positiondeltaq, dinucdeltaq)
Return type:tuple of numpy.ndarray of int
kbbq.gatk.applybqsr.get_modeldqs_from_covariates(covariates)[source]

Find the covariate delta Q’s given covariate data.

Parameters:covariates (CovariateData) – covariate data
Returns:delta Q’s
Return type:ModelDQs
kbbq.gatk.applybqsr.get_modelvecs_from_covariates(covariates)[source]

Find the model vectors given covariate data.

The model vectors are: meanq, global_errs, global_total, q_errs, q_total, pos_errs, pos_total, dinuc_errs, dinuc_total

Parameters:covariates (CovariateData) – covariate data
Returns:model vectors
Return type:tuple(numpy.ndarray)
kbbq.gatk.applybqsr.recalibrate_bamread(read, meanq, globaldeltaq, qscoredeltaq, positiondeltaq, dinucdeltaq, rg_to_int, use_oq=True, minscore=6)[source]

Recalibrate a bamread and return the new quality scores.

Note that this function doesn’t alter the given read.

Parameters:
  • read (pysam.AlignedSegment) – read to recalibrate
  • others (numpy.ndarray of int) – delta q values where the indices represent the rg, q, and dinuc or cycle covariates.
  • rg_to_int (dict(str, int)) – dictionary to convert rg IDs to ints
Returns:

recalibrated quality scores

Return type:

numpy.ndarray of int

kbbq.gatk.applybqsr.table_to_vectors(table, rg_order, maxscore=42)[source]

Get a tuple of recalibration data vectors from a RecalibrationReport.

The recal table uses the PU of the read group as the read group entry in the table. See vectors_to_report() for more info

The rg_order must be given to define the output order of read groups, since there is no natural ordering for them. If the PUs listed in the table are ‘apple’ and ‘orange’, and rg_order is [‘apple’, ‘orange’], data from the ‘apple’ read group will be in index 0 and data from the ‘orange’ read group will be in index 1.

Parameters:
  • table (RecalibrationReport) – Report to get data from
  • rg_order (list(str)) – List of PU’s controlling the order of read groups.
  • maxscore (int) – largest quality score possible
Returns:

data vectors (meanq, global_errs, global_total, q_errs, q_total, pos_errs, pos_total, dinuc_errs, dinuc_total)

Return type:

tuple of numpy.ndarray of int

kbbq.bloom

Methods to create and manipulate khmer bloom filters.

The base khmer bloom filter object is a Nodegraph. Nodegraph.get(kmer) returns 0 or 1 depending on whether the kmer is present Nodegraph.count(kmer) sets the value to 1 (present) Nodegraph.add(khmer) sets the value to 1 (present) and returns true if it’s new. Nodegraph.consume(str) does add on each kmer in the string and returns the number of kmers added Nodegraph.get_kmer_counts(str) does get on each kmer in the string and returns a list of counts Nodegraph.save(filename) saves to filename Nodegraph.load(filename) loads from filename Nodegraph.hash(kmer) returns the hash of the k-mer Nodegraph.get_kmer_hashes(kmer) hashes each k khmer.calc_expected_collisions(nodegraph, force = False, max_false_pos = .15) will return false positive rate and issue a warning if it’s too high. when force is true, don’t exit if false positive is higher than max. #may also want to try a Nodetable at some point. It should have the same fns.

ksize = 32 maxmem = khmer.khmer_args.memory_setting(‘8G’) fake_args = namedtuple(max_memory_usage = maxmem, n_tables = 4) tablesize = calculate_graphsize(fake_args, ‘nodegraph’, multiplier = 1.0) nodegraph = khmer.Nodegraph(ksize, tablesize, fake_args.n_tables)

kbbq.bloom.add_trusted_kmers(read, graph)[source]

Add trusted kmers to graph.

kbbq.bloom.calculate_thresholds(p_added, ksize)[source]

Calculate the thresholds. If the number of overlapping kmers is less than this number, the base is inferred to be erroneous.

kbbq.bloom.correction_len(seq, graph, right=True)[source]

Get the number of corrections given an ndarray of sequence.

This value is an array of values between between 1 and ksize, or len(seq) if no correction can be made. The length of the array represents the number of results if there is a tie.

The last element returned says whether there were multiple corrections that led to a trusted kmer.

This function is in desperate need of a refactor.

kbbq.bloom.count_read(read, graph, sampling_rate)[source]

Put ReadData read in the countgraph. Each k-mer will be added with p = sampling_rate

kbbq.bloom.create_empty_nodegraph(ksize=32, max_mem='8G')[source]

Create an empty nodegraph. From this point, fill it from a file or start filling it with count.

kbbq.bloom.find_longest_trusted_block(trusted_kmers)[source]

Given a boolean array describing whether each kmer is trusted, return a pair of indices to the start and end of the block.

The indices will be standard for python ranges; inclusive on the left side and exclusive on the right side.

kbbq.bloom.fix_one(seq, graph)[source]

If we don’t start with an anchor we just fix one base that produces the largest number of trusted kmers.

kbbq.bloom.fix_overcorrection(read, ksize, minqual=6, window=20, threshold=4, adjust=False)[source]

The threshold is adjusted when there are not multiple options for any correction. It is only adjusted for positions [window:-window]

kbbq.bloom.infer_errors(overlapping, possible, thresholds)[source]

Perform a binomial test to infer which bases are erroneous.

Given a vector of the number of overlapping kmers in the hash and the number of possible kmers in the hash for each position, infer whether each base is erroneous.

This is done with essentially a binomial test. We evaluate the probability the sample came from a binomial distribution with p = p_added. We do a right-tailed test with the null hypothesis that all kmers overlapping the site are erroneous. We assume the multiplicity of a weak kmer is less than ~2, so this is reflected in p_added. We only use the right tail because a lower p parameter supports the null, while a high value supports the alternative.

kbbq.bloom.infer_errors_from_trusted_kmers(read, graph)[source]

Return an array of errors and a bool describing whether multiple corrections were made.

kbbq.bloom.infer_read_errors(read, graph, thresholds)[source]

Return an array of errors given a graph and the thresholds.

kbbq.bloom.kmers_in_graph(read, graph)[source]

Query the graph for each kmer in read and return a numpy.ndarray of bools.

The returned array has length len(read) - ksize + 1

kbbq.bloom.overlapping_kmers_in_graph(read, graph)[source]

Get the number of kmers overlapping each read position that are in the graph.

The returned array has length len(read).

kbbq.bloom.overlapping_kmers_possible(read, ksize)[source]

Get the possible number of kmers overlapping each read position.

For example, position 1 will always have only 1 overlapping kmer, position 2 will have 2, etc.

This is cached.

kbbq.bloom.p_kmer_added(sampling_rate, graph)[source]

The probability a kmer was added to the graph, including the false positive rate.

P(A) = 1 - ( 1 - sampling_rate ) ^ n, where n is the assumed highest multiplicity of a weak kmer. When sampling_rate >= .1, n = 2. Otherwise, n = .2 / sampling_rate. This function returns P*(A) = P(A) + B - B*P(A), where B is the false positive rate.

The point is to perform a binomial test, with p = p_kmer_added, k = n_kmers_in_graph, N = n_kmers_possible. The null hypothesis is that the kmer is not an error, and this is p under that assumption. Since the kmer is not an error, it was observed 2 or greater times. The alternative hypothesis is that the kmer is an error, so the true probability the kmer was added to the graph was less than the number returned by this function.

I think the better model would be a negative binomial. In this case, we have no idea how many times an overlapping kmer appeared in our dataset. We do know that there are r kmers that failed to be added to the hash and that each time an overlapping kmer appeared, there was a sampling_rate probability that it was added to the hash. Then the number of overlapping kmers in the dataset can be predicted with a negative binomial distribution with r = (# in hash) and p = 1 - sampling_rate. If this distribution is X, the read coverage at that site is X / (# k-mers that could overlap the site); in most cases, X / k.

Thus with this technique, we can calculate the distribution of site-by-site coverage in the dataset. In marginalizing, we sum the probability vector of each X then divide by the number of sites. Once we’re done, we must also divide by X again, since each base with coverage x (element of X) was counted x times.

In our sequencing model, the number of sequenced reads at any position is Poisson. Because errors are correlated, the number of sequenced erroneous reads is an overdispersed Poisson. This can be accurately modeled with a different negative binomial. Thus from our inference of coverage we should be able to fit a mixture of negative binomials that represents the coverage expected in the dataset, and this binomial will tell us the probability that a read is an error given its coverage. Note that we use the negative binomial for two distinct purposes: one to estimate the coverage given a site at a read, and one to model the total coverage of the dataset.

Notably, this approach suggests an optimal choice for the sampling rate since the distribution becomes less predictive for some coverages at extreme points. A strategy for picking the theoretically optimal sampling rate can likely be found.

kbbq.bloom.rolling_window(a, window)[source]

Use stride tricks to reshape an array into an array of sliding windows.

Different values in the array will point to the same place in memory, so copy the array before altering it if that behavior is undesired.

From https://rigtorp.se/2011/01/01/rolling-statistics-numpy.html

kbbq.recalibrate

Utilities for recalibrating reads.

kbbq.recalibrate.fill_read_errors(read, trustgraph)[source]

Fill the errors attribute of the given ReadData object.

This will modify the input read.

Parameters:
  • read (ReadData) – read to modify
  • trustgraphkhmer.Nodegraph
kbbq.recalibrate.find_corrected_sites(uncorr_read, corr_read)[source]

Given a read and a corrected read, return an array of corrected sites.

Parameters:
Returns:

Array of changed sites

Return type:

numpy.ndarray of bool

kbbq.recalibrate.find_covariates(read_sources)[source]

Consume reads from the given list of iterables and load them into a kbbq.covariate.CovariateData object.

kbbq.recalibrate.find_trusted_kmers(readsource, graph, ksize, memory, thresholds)[source]

Load the trusted k-mer hash.

Parameters:
  • readsource (iterable) – source of read, original pairs (as generated by generate_reads_from_files())
  • graph (khmer.Nodegraph) – The subsampled k-mer graph
  • ksize (int) – k-mer size to use for hash
  • memory (str) – memory to use for hash
  • thresholds (numpy.ndarray of int) – error threshold array
Returns:

trusted k-mer hash

Return type:

khmer.Nodegraph

kbbq.recalibrate.generate_reads_from_files(files, bams, infer_rg=False, use_oq=False)[source]

Return a list of ReadData generators from opening each file.

This function acts as a context manager. The files will automatically be closed once they go out of context.

See contextlib for more information.

Returns:list of generators of (read, original) pairs
Return type:list(generator(ReadData, pysam.AlignedSegment or pysam.FastqProxy))
kbbq.recalibrate.get_dqs_from_corrected(readsource, correctedsource)[source]

Get kbbq.gatk.applybqsr.ModelDQs from corrected files.

Returns:

model DQs

Return type:

kbbq.gatk.applybqsr.ModelDQs

Parameters:
kbbq.recalibrate.get_dqs_with_hashes(readsource, trustgraph)[source]

Get DQs from a readsource and trustgraph.

Parameters:
  • readsource (iterable) – source of read, original pairs (as generated by generate_reads_from_files())
  • trustgraph (khmer.Nodegraph) – trusted k-mer hash
Returns:

model DQs

Return type:

kbbq.gatk.applybqsr.ModelDQs

kbbq.recalibrate.load_headers_from_bams(inputs)[source]

Load RG information from the headers of each input bam. Return which inputs are bams.

kbbq.recalibrate.load_subsampled_hash(readsource, ksize, memory, alpha)[source]

Load the initial subsampled hash.

Parameters:
  • readsource (iterable) – source of read, original pairs (as generated by generate_reads_from_files())
  • ksize (int) – k-mer size to use for hash
  • memory (str) – memory to use for hash
  • alpha (double) – sampling rate
Returns:

subsampled k-mer hash and thresholds

Return type:

( khmer.Nodegraph , numpy.ndarray (int) )

kbbq.recalibrate.open_outputs(files, output, bams)[source]

Initialize output files ensuring BAM headers are handled.

This function acts as a context manager. The outputs will automatically be closed once they go out of context.

See contextlib for more information.

kbbq.recalibrate.opens_as_bam(path)[source]

Attempt to open the path as a BAM file. Return True if no error is raised.

kbbq.recalibrate.recalibrate(files, output, corrected, infer_rg=False, use_oq=False, set_oq=False, ksize=32, memory='3G', alpha=0.1, gatkreport=None)[source]
kbbq.recalibrate.recalibrate_and_write(read, original, dqs, output, set_oq)[source]

Recalibrate a read and write it to output given the model parameters.

Parameters:
kbbq.recalibrate.recalibrate_fastq(fastq, dqs, out, infer_rg=False)[source]

Recalibrate reads in a FASTQ file given a fastq file name and CovariateData dqs. Out should be a file-like object.

kbbq.recalibrate.recalibrate_read(read, dqs, minscore=6)[source]

Return new qualities given ReadData and DQ arrays.

kbbq.recalibrate.validate_files(files)[source]

Ensures input files are regular files.

kbbq.recalibrate.yield_reads(iterable, *args, **kwargs)[source]

Return a generator of (ReadData, original_read) pairs.