Module Reference

kbbq

A package for reference-free base quality score recalibration.

kbbq.recaltable

Provides utilities for interfacing with GATK Recalibration tables.

class kbbq.recaltable.GATKReport(tables, version='1.1')[source]

Bases: object

A base class representing a GATK Report file that contains many tables.

Attributes

  • tables - a list of GATKTable objects
  • version - the GATKReport format version. The default is
    '1.1'.

Methods

__init__(tables, version='1.1')[source]

Initialize a GATKReport.

This module was written by reverse engineering a version 1.1 report. If you have a table with a newer version, please file a bug report, especially if this module fails to parse it.

Parameters:
  • tables (list(GATKTable)) – a list of tables.
  • version (str) – The report version as a string.
Returns:

a representation of a report

Return type:

GATKReport

__str__()[source]

A string containing the report.

Each table is separated by two newlines.

classmethod fromfile(filename)[source]

Initialize the GATKReport from a file.

Parameters:filename (str) – The file to be read.
Returns:a representation of a report
Return type:GATKReport
get_headerstring()[source]

Get the header line.

The header line identifies the file as a report, shows the version, and shows the number of tables that should be present (such as #:GATKReport.v1.1:5). This method returns a string that does not include a trailing newline.

Returns:the header line without a newline.
Return type:str
tables = None

A list of GATKTable objects.

version = None

The report version identifier (such as '1.1')

write(filename)[source]

Write the report to filename.

Parameters:filename (str) – The file name to write.
class kbbq.recaltable.GATKTable(title, description, data)[source]

Bases: object

A class representing a GATK report table.

The intended method of interacting with the table is using the data attribute, which is a standard Pandas dataframe.

Attributes

  • title - The title of the table
  • description - A description of the table. May be ''.
  • data - A Pandas dataframe containing the table data. Accessing this
    attribute is the primary way to interact with the data.
  • typemap - A dict mapping types to a character to be used
    for constructing the format string.
  • precisionmap - A dict mapping column headers to precision
    to be used for constructing the format string.

Methods

__init__(title, description, data)[source]

Initialize the table.

Parameters:
  • title (str) – the table title
  • description (str) – a short description of the table. May be ''
  • data (pandas.DataFrame) – the data contained by the table
__str__()[source]

Return str(self).

data = None

A Pandas dataframe containing the table data. Accessing this attribute is the primary way to interact with the data.

description = None

A short description of the table.

classmethod fromstring(tablestring)[source]

Initialize the table from a string

The first line of the table contains the a string describing the table format and is like #:GATKTable:ncol:nrow:%f:%f:%f:;. It is : delimited and ; terminated. The last format string will be empty. The second line of the table specifies the name of the table. It is formatted like #:GATKTable:title:description. description can be an empty string. It is : delimited. The third line contains the header. The header and row data are delimited by two or more spaces. The number of spaces are determined by padding each value in the row such that the the character width of the column is constant and can represent all the data in the column. Strings (including column headings) are left justified and numeric types are right justified.

get_colfmts()[source]

Get the formats for each column in the dataframe.

Uses typemap and precisionmap to create the strings. This will return something like ['%s', '%s', '%.2f', '%.4f']

Returns:format strings for each column
Return type:list(str)
get_datastring()[source]

Get the data string. Includes the header line and the data in the table.

Returns:the data table as a formatted string
Return type:str
get_fmtstring()[source]

Get the format string by inspecting the dataframe.

Uses the dtypes, number of rows, and number of columns in the data frame along with the strings returned by get_colfmts() to create the format string

which is formatted like #:GATKTable:ncol:nrow:%f:%f:%f:;.

Returns:The format string
Return type:str
get_ncols()[source]

Get the number of columns in the dataframe.

Returns:The number of columns in the dataframe.
Return type:int
get_nrows()[source]

Get the number of rows in the dataframe.

Returns:The number of rows in the dataframe.
Return type:int
get_titlestring()[source]

Get the titlestring.

Uses the title and description to create the title line. It is formatted like #:GATKTable:title:subtitle.

Returns:the title string
Return type:str
get_unindexed()[source]

Return a copy of the dataframe without the indexes included.

If the table has an unnamed index, such as the default index, return a copy of the data frame. Otherwise, reset the index.

static parse_fmtstring(header, fmtstring)[source]

Turn a fmtstring into a dictionary of {col_title : type}

Parameters:
  • header (list(str)) – The column titles as a list
  • fmtstring (str) – The format string to parse
Returns:

A dictionary mapping {col_title : type}

Return type:

dict(str, type)

precisionmap = None

A dictionary of {colname : str} to determine the precision string to prepend to each formatstring. This only affects get_fmtstring().

title = None

The title of the table.

typemap = None

A dictionary of {type : str} to determine the string for each type.

Note that in the current implementation this only affects get_fmtstring().

write(filehandle)[source]

Write the table to a filehandle.

Adds a newline to the end that isn’t present in __str__()

Parameters:filehandle (File Object) – The file object to write the table to.
Returns:Whatever the file object’s write method returns. Usually the number of characters written.
Return type:Usually int
class kbbq.recaltable.RecalibrationReport(tables, version='1.1')[source]

Bases: kbbq.recaltable.GATKReport

A class representing a GATK Recalibration Report used for Base Quality Score Recalibration.

It extends the GATKReport, so the primary method of interacting with the table is the tables attribute. The primary difference is the __init__() function will set indices and data types as they should be for a recalibration report.

  • Table tables[0] contains the arguments used to call the program
    that generated the table.
  • Table tables[1] contains the quanization map
  • Table tables[2] contains read group error information.
  • Table tables[3] contains error information subset by read group,
    then reported quality score.
  • Table tables[4] contains error information subset by read group,
    then reported quality score, then covariate name and value. Both Context and Cycle are valid for the covariate name.

See the table below for a schema.

Table Number of Columns Variables
0 2 Argument, Value
1 3 QualityScore, Count, QuantizedScore
2 6 ReadGroup, EventType, EmpiricalQuality, EstimatedQReported, Observations, Errors
3 6 ReadGroup, QualityScore, EventType, EmpiricalQuality, Observations, Errors
4 8 ReadGroup, QualityScore, CovariateValue, CovariateName, EventType, EmpiricalQuality, Observations, Errors

For current GATK parameters, EventType will be M for mismatch though I for insertion and D for deletion are possible. Valid values for CovariateName are Context and Cycle.

__init__(tables, version='1.1')[source]

Initialize the recalibration report from a list of GATKTable s.

This initializes the tables with some data wrangling to set indices and data types properly. If you don’t want this, use a GATKReport instead.

__str__()[source]

GATKReports list CovariateValue before CovariateName, which doesn’t make much sense for organizing the index. So we switch the columns before printing and switch them back after we get the string.

kbbq.compare_reads

A mish-mash of functions for recalibration. Should probably be renamed in the near future.

class kbbq.compare_reads.Dinucleotide[source]

Bases: object

A class to cache dinucleotides and maintain a consistent dinuc -> int map throughout the module.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

Dictionary for complementing nucleotides

dinuc_to_int = {'AA': 0, 'AC': 3, 'AG': 2, 'AT': 1, 'CA': 12, 'CC': 15, 'CG': 14, 'CT': 13, 'GA': 8, 'GC': 11, 'GG': 10, 'GT': 9, 'TA': 4, 'TC': 7, 'TG': 6, 'TT': 5}

Dictionary mapping dinuc -> int

dinucs = ['AA', 'AT', 'AG', 'AC', 'TA', 'TT', 'TG', 'TC', 'GA', 'GT', 'GG', 'GC', 'CA', 'CT', 'CG', 'CC']

List of valid dinucleotides

nucleotides = ['A', 'T', 'G', 'C']

List of valid nucleotides

classmethod veccomplement(*args, **kwargs)[source]
classmethod vecget(*args, **kwargs)[source]
vectorized_complement = <numpy.vectorize object>
vectorized_get = <numpy.vectorize object>
class kbbq.compare_reads.RescaledNormal[source]

Bases: object

A class to cache the rescaled normal prior used in the bayesian recalibration model.

Attributes

Methods

  • prior() - get the prior probability for a given quality score difference

Most of these attributes are nonsense; the “proper” way to interact with the class is via the prior_dist array. If that makes you uncomfortable, use the provided accessor function prior().

Under no circumstances should you attempt to replace any of these attributes, it will most likely not have the desired effect. The caching mechanism here only works because the class attributes are immediately instatiated when the class is created, so by the time you replace them it won’t matter. Reimplement it yourself if you want a different prior.

i = 42
maxscore = 42

The maximum quality score supported by this class.

oldset = {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}
possible_diffs = array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42])
classmethod prior(difference)[source]

Return the prior probability for a given difference in quality score.

Parameters:difference (int) – The difference in quality score
Returns:the prior probability
Return type:np.longdouble
prior_dist = array([-1.05360516e-01, -2.10536052e+00, -8.10536052e+00, -1.81053605e+01, -3.21053605e+01, -5.01053605e+01, -7.21053605e+01, -9.81053605e+01, -1.28105361e+02, -1.62105361e+02, -2.00105361e+02, -2.42105361e+02, -2.88105361e+02, -3.38105361e+02, -3.92105361e+02, -4.50105361e+02, -5.12105361e+02, -5.78105361e+02, -6.48105361e+02, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf], dtype=float128)
kbbq.compare_reads.bamread_get_oq(read)[source]
kbbq.compare_reads.fastq_cycle_covariates(read, secondinpair=False)[source]
kbbq.compare_reads.fastq_dinuc_covariates(read, minscore=6)[source]
kbbq.compare_reads.fastq_infer_rg(read)[source]

Infer the read group from appended read information, such as produced by the samtools fastq command.

Requires the read to be formatted such the rg tag is added to the end of the read name delimited by a _. Returns the read group tag.

kbbq.compare_reads.fastq_infer_secondinpair(read)[source]
kbbq.compare_reads.find_read_errors(read, ref, variable)[source]

Use the CIGAR to find errors in the read or sites to skip.

Softclipped bases will be added to the skip array. Returns a tuple with the error array and the skip array. We don’t consider indel errors.

kbbq.compare_reads.gatk_delta_q(prior_q, numerrs, numtotal, maxscore=42)[source]

Calculate the shift in quality scores from the prior given data.

This is achieved by finding the difference between the maximum a posteriori and the prior point estimate of Q.

kbbq.compare_reads.generic_cycle_covariate(sequencelen, secondinpair=False)[source]
kbbq.compare_reads.generic_dinuc_covariate(sequences, quals, minscore=6)[source]
kbbq.compare_reads.get_rg_to_pu(bamfileobj)[source]
kbbq.compare_reads.get_var_sites(vcf)[source]

Get positions covered by any record in a VCF file.

Pass the file name. Each value in the returned dict will be a list of 0-based positions.

kbbq.compare_reads.load_positions(posfile)[source]

Get positions that are covered by a non-zipped BED file.

Pass the file name; it will be opened with python.open(). This is slightly different than benchmark.get_bed_dict() because it doesn’t require a refdict and only provides a list of 0-based positions, not a boolean array of all positions in the reference.

kbbq.compare_reads.p_to_q(p, maxscore=42)[source]
kbbq.compare_reads.q_to_p(q)[source]
kbbq.compare_reads.recalibrate_fastq(read, meanq, globaldeltaq, qscoredeltaq, positiondeltaq, dinucdeltaq, rg, dinuc_to_int, secondinpair=False, minscore=6, maxscore=42)[source]
kbbq.compare_reads.regression_recalibrate(lr, q)[source]
kbbq.compare_reads.train_regression(rawquals, corrected, tol=1e-08)[source]
kbbq.compare_reads.tstamp()[source]

Return the current time up to the second in ISO format as a string.

Returns:current time with brackets
Return type:str

kbbq.plot

Utilities for plotting calibration and other relevant info.

kbbq.plot.plot_benchmark(fhin, outfile, plottype='calibration')[source]

Plot predicted vs. true probability or size of each quality bin.

This plots the predicted quality score (x) against the true quality score (y) for each quality bin. This is done by calculating the mean probability of error for each bin and converting that probability back to a quality score.

With plottype = ‘sample-size’, instead plots the size of each predicted quality score bin.

Parameters:
  • fhin (python.IOBase) – file containing benchmark data
  • outfile (str) – name of file to save plot to

kbbq.benchmark

Utilities for benchmarking calibration methods.

kbbq.benchmark.benchmark(bamfile, fafile, vcffile, fastqfile=None, label=None, use_oq=False, bedfh=None)[source]

Perform the benchmark and print the results to stdout.

If a fastqfile is not given, the benchmark will be taken from the reads in the BAM file. If a fastqfile is given, it should have the same read names as the reads in the BAM file. The names will then be used to look up where errors occur in the reads.

kbbq.benchmark.benchmark_bam(bamfile, ref, var_sites, use_oq=False, bedfh=None)[source]
kbbq.benchmark.benchmark_fastq(fqfile, bamfile, ref, var_sites, bedfh=None)[source]
kbbq.benchmark.calculate_q(errors, quals)[source]

Calculate Actual Q and Predicted Q given a flat array of errors and flat array of quality scores.

The index of the returned array represents the predicted quality score, the value represents the actual quality score or the number of bases.

kbbq.benchmark.get_bam_readname(read)[source]

Given a bam read, get a canonicalized name. The name is the query name + a suffix based on whether it is first or second in pair.

kbbq.benchmark.get_bamread_quals(read, use_oq=False)[source]

Return the qualities of the read as an np.array.

Use the OQ tag for read quals if use_oq = True.

kbbq.benchmark.get_bed_dict(refdict, bedfh)[source]
kbbq.benchmark.get_error_dict(bamfile, refdict, fullskips)[source]

Finds errors given an authoritative BAM, ref, and sites to skip.

The returned dict has readname keys and tuple(np.array, np.array) of bools as values.

kbbq.benchmark.get_fastq_readname(read)[source]

Given a fastq read, get a canonicalized name. This is based on whether the read is first in pair.

kbbq.benchmark.get_full_skips(refdict, var_sites, bedfh=None)[source]
kbbq.benchmark.get_ref_dict(reffilename)[source]
kbbq.benchmark.get_var_sites(vcf)[source]
kbbq.benchmark.print_benchmark(actual_q, label, nbases)[source]

Print benchmark data to a tab separated table.

Label should be a string and is used for all values in the table. No headers are printed. This means the benchmark command can be called many times and each table can be concatenated together.

kbbq.read

Class and functions for read data.

This module includes the ReadData class and a few utility functions for working with BAM reads of type pysam.AlignedSegment.

Classes

Functions

class kbbq.read.ReadData(seq, qual, skips, name, rg, second, errors)[source]

Bases: object

A class that represents some minimal information from a sequencing read.

Since this won’t hold as much data as a BAM read, if you want to recalibrate a read you should turn it into a ReadData object, then update the original read with the new information; otherwise you’ll lose annotations. This class also manages read group information. You should instantiate the object from one of the class methods instead of directly instantiating it if possible. You should never assign directly to any of the class attributes; treat them as read-only. Instance variables should be fine to manipulate.

Class Attributes

  • rg_to_pu - Dict of read group id’s -> platform unit
  • rg_to_int - Dict of read group id’s -> int
  • numrgs - int number of read groups

Instance Attributes

  • seq - Numpy array of characters representing the sequence
  • qual - Numpy array of int representing the quality score
  • skips - Numpy array of bools representing sites to skip
  • name - string representing the name of the read
  • rg - int representing the read group the read belongs to
  • second - bool representing whether the read is 2nd in pair
  • errors - Numpy array of bools representing whether the base is an error

Class Methods

Instance Methods

Note that from https://docs.python.org/3/library/stdtypes.html#dict , iteration order is guaranteed to be in insertion order. Thus we are OK saving the rg as an int on the fly so long as we don’t remove any read groups from the rg_to_pu dictionary. So don’t do that! In fact, we should probably remove __del__() and pop() from the dicts…

__init__(seq, qual, skips, name, rg, second, errors)[source]

Initialize self. See help(type(self)) for accurate signature.

__len__()[source]

Return sequence length.

Returns:sequence length
Return type:int
canonical_name()[source]

The name with an added suffix based on whether the read is firstinpair or not.

If the read has its second flag set to True, /2 is added to the end. Otherwise, /1 is added.

Returns:read name with a suffix
Return type:str
errors = None

numpy.ndarray of bools representing whether the base is an error

classmethod from_bamread(bamread, use_oq=False)[source]

ReadData factory that instantiates a ReadData object from a pysam AlignedSegment.

Skips and errors are initialized to be empty bool arrays. If you need to use these attributes, make sure you set them the way you need to for your use case.

If use_oq is True, use the OQ tag for base quality. The default is to use regular quality scores.

If the read group hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of rg. To load the dictionary, call load_rgs_from_bamfile() first. If there is no RG tag for the read it will be given a generic RG of None and lumped in with all other reads that have no RG tag.

This will reverse-complement the sequence and reverse the qualities if the read aligns on the reverse strand.

Parameters:
Returns:

a read object

Return type:

ReadData

classmethod from_fastq(fastqread, rg=None, second=None, namedelimiter='_')[source]

ReadData factory that instantiates a ReadData object from a pysam FastqProxy.

Skips and errors are initialized to be empty bool arrays. If you need to use them, make sure you set them the way you need to for your use case. The read name will be set to the first field of the delimited read name, minus any trailing /1 or /2 if they exist.

If rg is None (the default), we will attempt to infer the read group id from the read name. To infer rg, the read group must be in its own field with the field beginning with RG:, such as RG:example, with fields delimited by namedelimiter. When multiple fields are found that begin with RG:, the last is chosen to be the true ID. If inference fails, the read group will remain None.

If the read group (either inferred or provided explicitly) hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of rg.

If second is None (the default) we will attempt to infer if the read is second in pair based on the name. To infer second, the first field of the read name must end with /2. If the last 2 characters of the first field are not /2, second will be inferred to be false.

Parameters:
  • fastqread (pysam.FastqProxy) – The fastq read to get data from
  • rg (str) – the read group the read belongs to
  • second (bool) – whether the read is 2nd in pair
  • namedelimiter (str) – the delimiter for parsing the read name
Returns:

a read object

Return type:

ReadData

get_cycle_array()[source]

Return an array of cycle values.

This is range(len(seq)) if second is false, otherwise it is -1..-len(seq) inclusive. This way we can hold cycle values as an array of size 2 * seqlen and store negative values at the end of the array.

Returns:cycle values
Return type:numpy.ndarray (int)
get_cycle_errors()[source]

Return an array of cycle values that were errors and of all valid cycle values.

The errors are always a subset of the valid values.

Returns:erroneous and valid cycle values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_dinuc_errors(minscore=6)[source]

Return an array of dinucleotide values that were errors and of all valid dinucleotide values.

The errors are always a subset of the valid values.

Returns:erroneous and valid dinucleotide values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_dinucleotide_array(minscore=6)[source]

Return an array of dinucleotide values.

The character to int map is stored in kbbq.compare_reads.Dinucleotide.

Returns:dinucleotide values
Return type:numpy.ndarray (int)
get_pu()[source]

Return the PU from the rg_to_pu dict.

Returns:The Read Group’s PU tag
Return type:str
get_q_errors()[source]

Return an array of q values that were errors and of all valid q values.

The errors are always a subset of the valid values.

Returns:erroneous and valid q values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_rg_errors()[source]

Return an array of rg values that were errors and all valid rg values.

The errors are always a subset of the valid sites.

For example, a read with 4 non skipped sites and 1 error will return ([rg], [rg, rg, rg, rg])

Returns:errors and valid rg values
Return type:tuple(numpy.ndarray (int) , numpy.ndarray (int))
get_rg_int()[source]

Return the RG as an int suitable for indexing rather than as the actual string RG ID.

Returns:read group index
Return type:int
classmethod load_rgs_from_bamfile(bamfileobj)[source]

Load read group IDs and PUs from the header of the pysam bamfileobj.

Recalibration is done on a PU basis, so when building a model using reads in a BAM, the PU for each read is obtained by looking up the PU associated with its read group. The dictionary that controls these lookups is a class variable. To store the RG data in an int form, an rg-to-int dict is also loaded. This should be equal to dict(zip(rg_to_pu, range(len(rg_to_pu))))

Parameters:bamfileobj (pysam.AlignmentFile) – The opened bam file
name = None

string representing the name of the read

not_skipped_errors()[source]

Return a logical and of not skips and errors

Returns:array of valid errors
Return type:numpy.ndarray (bool)
numrgs = 0

Number of readgroups encountered so far.

qual = None

numpy.ndarray of int representing the quality score

rg = None

int representing the read group the read belongs to

rg_to_int = {}

Dict mapping RG ids to integer indices.

rg_to_pu = {}

Dict mapping RG ids to the PU tag for the read group.

second = None

bool representing whether the read is 2nd in pair

seq = None

numpy.ndarray of characters representing the sequence

skips = None

numpy.ndarray of bools representing sites to skip

str_qual(offset=33)[source]

Get the quality of this read as a list of characters.

offset (default 33) will be added to the ASCII value of each character.

Parameters:offset (int) – Offset to add
Returns:read quality
Return type:list(chr)
kbbq.read.bamread_get_oq(read, offset=33)[source]

Get the OQ of the given bamread as an array of int.

offset (default 33) will be subtracted from the ASCII value of each character.

Parameters:
Returns:

quality scores

Return type:

numpy.ndarray of int

kbbq.read.bamread_get_quals(read, use_oq=False)[source]

Return the qualities of a pysam bam read as an array.

If use_oq = True, use the OQ tag for base quality. The default is to use regular quality scores.

Parameters:
Returns:

quality scores

Return type:

numpy.ndarray of int

kbbq.covariate

A data class and helper functions for read covariates.

This module includes the Covariate class and subclasses, as well as the pad_axis() helper function for appending to the end of an arbitrary axis of an numpy.ndarray.

Classes
Functions
  • pad_axis() - helper function to pad an np.ndarray with zeros.
class kbbq.covariate.Covariate(shape=0)[source]

Bases: object

A base class for covariates.

Holds 2 np.ndarray, errors for number of errors and total for the total number of observations of this covariate. The indices of each array are they keys; ie. the number of times covariate 2 was observed can be found with self.total[2].

Warning

Don’t assign directly to errors or total. shape() assumes both arrays have the same shape and only returns the shape of total. If the arrays’ shapes become different somehow, __getitem__() and __setitem__() may start throwing errors unexpectedly. If you want to use two arrays that aren’t guaranteed to have the same size, use two arrays. This isn’t the right class for that use case.

__getitem__() will pass the key to both arrays and return a tuple of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.

__setitem__() will also pass the key to both arrays, and will expect value to be a tuple of values, setting the first value to errors and the second to total.

Attributes

Methods

  • __init__() - Initialize arrays with desired shape.
  • pad_axis() - Pad specified axis of both arrays
  • pad_axis_to_fit() - Pad specified axis so the given index is valid
  • increment() - Increment the given indices of each array.
  • shape() - return the shape of the arrays.
  • __getitem__() - get number of errors and total observations at the given index
  • __setitem__() - set errors and observations at the given index
__getitem__(key)[source]

Pass key to each np.ndarray and return the result as a tuple.

The result will be in order of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.

Parameters:key (slice) – indices to get
Returns:number of errors and number of total observations
Return type:tuple(int, int) or tuple( numpy.ndarray , numpy.ndarray )
__init__(shape=0)[source]

Initialize the arrays with the given shapes.

Parameters:tuple (shape) – desired shape of error and total arrays
Returns:empty object
Return type:Covariate
__setitem__(key, value)[source]

Pass key to each np.ndarray and assign each value.

Value must be a tuple of values to assign. The first value is assigned to errors and the second to total.

Parameters:
errors = None

np.ndarray (int) holding number of errors observed for each key.

increment(idx, value=(1, 1))[source]

Increment the given indices by value.

idx should be a tuple of values interpretable by np.add().

The first indices given will increment the error array. The second indices given will increment the total array.

Note that this is implemented with np.add.at(). For example, if idx[0] is [2,2], errors [2] will be incremented twice. This is equivalent to

errors[2] += 1
errors[2] += 1

If value[0] is 2, errors will be instead incremented by 4 and is equivalent to

errors[2] += 2
errors[2] += 2
Parameters:
  • idx (tuple( index , index )) – indices to increment array at
  • value (typle( int, int )) – value to add for each index given.
pad_axis(axis, n=1)[source]

Expand axis dimension by n.

Parameters:
  • axis (int) – axis to pad
  • n (int) – size of pad to apply
pad_axis_to_fit(axis, idx)[source]

Expand the axis until idx is a valid idx in that dimension.

Does nothing if idx is already valid. idx should be a single scalar value, not a slice.

Parameters:
  • axis (int) – axis to pad
  • idx (int) – index to ensure is valid
shape()[source]

Return the shape of the underlying arrays.

Returns:array shape
Return type:tuple(int)
total = None

np.ndarray (int) holding number of observations for each key.

class kbbq.covariate.CovariateData[source]

Bases: object

A class for holding all the covariate data needed to recalibrate reads.

Attributes

  • rgcov - RGCovariate
  • qcov - QCovariate
  • poscov - PosCovariate
  • dinuccov - DinucCovariate

Methods

__init__()[source]

Initialize covariate arrays.

consume_read(read)[source]

Add covariates from the read to the proper data arrays.

Parameters:read (kbbq.read.ReadData) – the read to take data from
cyclecov = None

Cycle covariates

dinuccov = None

Dinucleotide covariates

get_num_cycles()[source]

Return number of cycles in the cycle covariate

This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.

Returns:half the length of cycle vector
Return type:int
get_num_dinucs()[source]

Return number of dinucs in the dinuc covariate.

Returns:length of the dinuc vector
Return type:int
get_num_qs()[source]

Return number of qs in the q covariate

Returns:length of q vector
Return type:int
get_num_rgs()[source]

Return number of rgs in the rg covariate

Returns:length of rg vector
Return type:int
qcov = None

Q covariates

class kbbq.covariate.CycleCovariate[source]

Bases: kbbq.covariate.Covariate

A 3d covariate array in the read group, Q, and cycle dimensions.

pad_axis() is overriden here since the way cycles are stored means appending to the end of the array changes the location of the data.

Methods:

__init__()[source]

Initialize empty arrays.

num_cycles()[source]

Return number of cycles in the cycle covariate

This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.

Returns:half the length of cycle vector
Return type:int
pad_axis(axis, n=1)[source]

Expand the axis dimension by n.

This is overriden for the CycleCovariate since we use one array to handle both positive and negative cycle values. Appending 0’s to the end would ruin the array since negative indices are significant. Thus, we have to wrangle the existing data a bit.

Parameters:
  • axis (int) – axis to pad
  • n (int) – size of pad to apply
class kbbq.covariate.DinucCovariate[source]

Bases: kbbq.covariate.Covariate

A 3d covariate array in the read group, Q, and dinucleotide dimensions.

__init__()[source]

Initialize empty arrays.

num_dinucs()[source]

Return length of dinucleotide array.

Returns:number of dinucleotides
Return type:int
class kbbq.covariate.QCovariate[source]

Bases: kbbq.covariate.Covariate

A 2d covariate array in the read group and Q dimensions.

This class holds a RGCovariate to efficiently increment both objects while iterating through reads.

Attributes

Methods

__init__()[source]

Initialize an empty RGCovariate and initialize the empty q arrays.

consume_read(read)[source]

Increment the rg and q covariates given a kbbq.read.ReadData

Parameters:read (kbbq.read.ReadData) – read to take covariate values from
Returns:erroneous and valid read group values and erroneous and valid q values
Return type:(( numpy.ndarray , numpy.ndarray ),( numpy.ndarray , numpy.ndarray ))
num_qs()[source]

Return size needed to hold largest q observed so far.

Returns:largest q + 1
Type:int
rgcov = None

RG covariates

class kbbq.covariate.RGCovariate[source]

Bases: kbbq.covariate.Covariate

A 1d covariate array in the read group dimension.

Methods

__init__()[source]

Initialize a 1D read group array.

Returns:empty object
Return type:Covariate
consume_read(read)[source]

Increment the rg covariates given a kbbq.read.ReadData

Parameters:read (kbbq.read.ReadData) – read to take covariate values from
Returns:erroneous and valid read group values
Return type:tuple( numpy.ndarray , numpy.ndarray )
num_rgs()[source]

Return size needed to hold largest rg observed so far.

Returns:largest rg + 1
Type:int
kbbq.covariate.pad_axis(array, axis, n)[source]

Pad the axis of np.ndarray with n zeros.

Parameters:
  • axis (int) – axis to pad
  • n (int) – size of pad to apply
Returns:

padded array

Return type:

numpy.npdarray