Module Reference¶

kbbq
kbbq.recaltable
kbbq.compare_reads
kbbq.plot
kbbq.benchmark
kbbq.read
kbbq.covariate

kbbq ¶

A package for reference-free base quality score recalibration.

kbbq.recaltable ¶

Provides utilities for interfacing with GATK Recalibration tables.

class kbbq.recaltable.GATKReport(tables, version='1.1')[source]¶

Bases: object

A base class representing a GATK Report file that contains many tables.

Attributes

tables - a list of GATKTable objects

version - the GATKReport format version. The default is

'1.1'.

Methods

write() - write the report to a file

fromfile() - read the report from a file

get_headerstring() - get the header line.

Does not include a newline.

__init__(tables, version='1.1')[source]¶

Initialize a GATKReport.

This module was written by reverse engineering a version 1.1 report. If you have a table with a newer version, please file a bug report, especially if this module fails to parse it.

Parameters:	tables (list(GATKTable)) – a list of tables. version (str) – The report version as a string.
Returns:	a representation of a report
Return type:	`GATKReport`

__str__()[source]¶

A string containing the report.

Each table is separated by two newlines.

classmethod fromfile(filename)[source]¶

Initialize the GATKReport from a file.

Parameters:	filename (str) – The file to be read.
Returns:	a representation of a report
Return type:	`GATKReport`

get_headerstring()[source]¶

Get the header line.

The header line identifies the file as a report, shows the version, and shows the number of tables that should be present (such as #:GATKReport.v1.1:5). This method returns a string that does not include a trailing newline.

Returns:	the header line without a newline.
Return type:	str

tables = None¶: A list of GATKTable objects.

version = None¶: The report version identifier (such as '1.1')

write(filename)[source]¶

Write the report to filename.

Parameters:	filename (str) – The file name to write.

class kbbq.recaltable.GATKTable(title, description, data)[source]¶

Bases: object

A class representing a GATK report table.

The intended method of interacting with the table is using the data attribute, which is a standard Pandas dataframe.

Attributes

title - The title of the table

description - A description of the table. May be ''.

data - A Pandas dataframe containing the table data. Accessing this

attribute is the primary way to interact with the data.

typemap - A dict mapping types to a character to be used

for constructing the format string.

precisionmap - A dict mapping column headers to precision

to be used for constructing the format string.

Methods

fromstring() - Initialize a GATKTable from a string.

parse_fmtstring() - Parse a fmtstring into a dictionary of types

get_fmtstring() - Get a formatstring by inspecting the dataframe.

get_unindexed() - Get a copy of the dataframe with indices reset

get_colfmts() - Get a list of format strings for the data in

the dataframe

get_titlestring() - Get the title string

get_datastring() - Format the dataframe into a string

get_nrows() - Get the number of rows in the dataframe.

get_ncols() - Get the number of columns in the dataframe.

write() - Write the table to a filehandle.

__init__(title, description, data)[source]¶

Initialize the table.

Parameters:	title (str) – the table title description (str) – a short description of the table. May be `''` data (pandas.DataFrame) – the data contained by the table

__str__()[source]¶: Return str(self).

data = None¶: A Pandas dataframe containing the table data. Accessing this attribute is the primary way to interact with the data.

description = None¶: A short description of the table.

classmethod fromstring(tablestring)[source]¶

Initialize the table from a string

The first line of the table contains the a string describing the table format and is like #:GATKTable:ncol:nrow:%f:%f:%f:;. It is : delimited and ; terminated. The last format string will be empty. The second line of the table specifies the name of the table. It is formatted like #:GATKTable:title:description. description can be an empty string. It is : delimited. The third line contains the header. The header and row data are delimited by two or more spaces. The number of spaces are determined by padding each value in the row such that the the character width of the column is constant and can represent all the data in the column. Strings (including column headings) are left justified and numeric types are right justified.

get_colfmts()[source]¶

Get the formats for each column in the dataframe.

Uses typemap and precisionmap to create the strings. This will return something like ['%s', '%s', '%.2f', '%.4f']

Returns:	format strings for each column
Return type:	list(str)

get_datastring()[source]¶

Get the data string. Includes the header line and the data in the table.

Returns:	the data table as a formatted string
Return type:	str

get_fmtstring()[source]¶

Get the format string by inspecting the dataframe.

Uses the dtypes, number of rows, and number of columns in the data frame along with the strings returned by get_colfmts() to create the format string

which is formatted like #:GATKTable:ncol:nrow:%f:%f:%f:;.

Returns:	The format string
Return type:	str

get_ncols()[source]¶

Get the number of columns in the dataframe.

Returns:	The number of columns in the dataframe.
Return type:	int

get_nrows()[source]¶

Get the number of rows in the dataframe.

Returns:	The number of rows in the dataframe.
Return type:	int

get_titlestring()[source]¶

Get the titlestring.

Uses the title and description to create the title line. It is formatted like #:GATKTable:title:subtitle.

Returns:	the title string
Return type:	str

get_unindexed()[source]¶

Return a copy of the dataframe without the indexes included.

If the table has an unnamed index, such as the default index, return a copy of the data frame. Otherwise, reset the index.

static parse_fmtstring(header, fmtstring)[source]¶

Turn a fmtstring into a dictionary of {col_title : type}

Parameters:	header (list(str)) – The column titles as a list fmtstring (str) – The format string to parse
Returns:	A dictionary mapping {col_title : type}
Return type:	dict(`str`, `type`)

precisionmap = None¶: A dictionary of {colname : str} to determine the precision string to prepend to each formatstring. This only affects get_fmtstring().

title = None¶: The title of the table.

typemap = None¶

A dictionary of {type : str} to determine the string for each type.

Note that in the current implementation this only affects get_fmtstring().

write(filehandle)[source]¶

Write the table to a filehandle.

Adds a newline to the end that isn’t present in __str__()

Parameters:	filehandle (File Object) – The file object to write the table to.
Returns:	Whatever the file object’s write method returns. Usually the number of characters written.
Return type:	Usually int

class kbbq.recaltable.RecalibrationReport(tables, version='1.1')[source]¶

Bases: kbbq.recaltable.GATKReport

A class representing a GATK Recalibration Report used for Base Quality Score Recalibration.

It extends the GATKReport, so the primary method of interacting with the table is the tables attribute. The primary difference is the __init__() function will set indices and data types as they should be for a recalibration report.

Table tables[0] contains the arguments used to call the program

that generated the table.
Table tables[1] contains the quanization map
Table tables[2] contains read group error information.
Table tables[3] contains error information subset by read group,

then reported quality score.
Table tables[4] contains error information subset by read group,

then reported quality score, then covariate name and value. Both Context and Cycle are valid for the covariate name.

See the table below for a schema.

Table	Number of Columns	Variables
0	2	Argument, Value
1	3	QualityScore, Count, QuantizedScore
2	6	ReadGroup, EventType, EmpiricalQuality, EstimatedQReported, Observations, Errors
3	6	ReadGroup, QualityScore, EventType, EmpiricalQuality, Observations, Errors
4	8	ReadGroup, QualityScore, CovariateValue, CovariateName, EventType, EmpiricalQuality, Observations, Errors

For current GATK parameters, EventType will be M for mismatch though I for insertion and D for deletion are possible. Valid values for CovariateName are Context and Cycle.

__init__(tables, version='1.1')[source]¶

Initialize the recalibration report from a list of GATKTable s.

This initializes the tables with some data wrangling to set indices and data types properly. If you don’t want this, use a GATKReport instead.

__str__()[source]¶: GATKReports list CovariateValue before CovariateName, which doesn’t make much sense for organizing the index. So we switch the columns before printing and switch them back after we get the string.

kbbq.compare_reads ¶

A mish-mash of functions for recalibration. Should probably be renamed in the near future.

class kbbq.compare_reads.Dinucleotide[source]¶

Bases: object

A class to cache dinucleotides and maintain a consistent dinuc -> int map throughout the module.

complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}¶: Dictionary for complementing nucleotides

dinuc_to_int = {'AA': 0, 'AC': 3, 'AG': 2, 'AT': 1, 'CA': 12, 'CC': 15, 'CG': 14, 'CT': 13, 'GA': 8, 'GC': 11, 'GG': 10, 'GT': 9, 'TA': 4, 'TC': 7, 'TG': 6, 'TT': 5}¶: Dictionary mapping dinuc -> int

dinucs = ['AA', 'AT', 'AG', 'AC', 'TA', 'TT', 'TG', 'TC', 'GA', 'GT', 'GG', 'GC', 'CA', 'CT', 'CG', 'CC']¶: List of valid dinucleotides

nucleotides = ['A', 'T', 'G', 'C']¶: List of valid nucleotides

classmethod veccomplement(*args, **kwargs)[source]¶

classmethod vecget(*args, **kwargs)[source]¶

vectorized_complement = <numpy.vectorize object>¶

vectorized_get = <numpy.vectorize object>¶

class kbbq.compare_reads.RescaledNormal[source]¶

Bases: object

A class to cache the rescaled normal prior used in the bayesian recalibration model.

Attributes

maxscore - max score supported

prior_dist - numpy array of prior probability

Methods

prior() - get the prior probability for a given quality score difference

Most of these attributes are nonsense; the “proper” way to interact with the class is via the prior_dist array. If that makes you uncomfortable, use the provided accessor function prior().

Under no circumstances should you attempt to replace any of these attributes, it will most likely not have the desired effect. The caching mechanism here only works because the class attributes are immediately instatiated when the class is created, so by the time you replace them it won’t matter. Reimplement it yourself if you want a different prior.

i = 42¶

maxscore = 42¶: The maximum quality score supported by this class.

oldset = {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}¶

possible_diffs = array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42])¶

classmethod prior(difference)[source]¶

Return the prior probability for a given difference in quality score.

Parameters:	difference (int) – The difference in quality score
Returns:	the prior probability
Return type:	np.longdouble

prior_dist = array([-1.05360516e-01, -2.10536052e+00, -8.10536052e+00, -1.81053605e+01, -3.21053605e+01, -5.01053605e+01, -7.21053605e+01, -9.81053605e+01, -1.28105361e+02, -1.62105361e+02, -2.00105361e+02, -2.42105361e+02, -2.88105361e+02, -3.38105361e+02, -3.92105361e+02, -4.50105361e+02, -5.12105361e+02, -5.78105361e+02, -6.48105361e+02, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf], dtype=float128)¶

kbbq.compare_reads.bamread_get_oq(read)[source]¶

kbbq.compare_reads.fastq_cycle_covariates(read, secondinpair=False)[source]¶

kbbq.compare_reads.fastq_dinuc_covariates(read, minscore=6)[source]¶

kbbq.compare_reads.fastq_infer_rg(read)[source]¶

Infer the read group from appended read information, such as produced by the samtools fastq command.

Requires the read to be formatted such the rg tag is added to the end of the read name delimited by a _. Returns the read group tag.

kbbq.compare_reads.fastq_infer_secondinpair(read)[source]¶

kbbq.compare_reads.find_read_errors(read, ref, variable)[source]¶

Use the CIGAR to find errors in the read or sites to skip.

Softclipped bases will be added to the skip array. Returns a tuple with the error array and the skip array. We don’t consider indel errors.

kbbq.compare_reads.gatk_delta_q(prior_q, numerrs, numtotal, maxscore=42)[source]¶

Calculate the shift in quality scores from the prior given data.

This is achieved by finding the difference between the maximum a posteriori and the prior point estimate of Q.

kbbq.compare_reads.generic_cycle_covariate(sequencelen, secondinpair=False)[source]¶

kbbq.compare_reads.generic_dinuc_covariate(sequences, quals, minscore=6)[source]¶

kbbq.compare_reads.get_rg_to_pu(bamfileobj)[source]¶

kbbq.compare_reads.get_var_sites(vcf)[source]¶

Get positions covered by any record in a VCF file.

Pass the file name. Each value in the returned dict will be a list of 0-based positions.

kbbq.compare_reads.load_positions(posfile)[source]¶

Get positions that are covered by a non-zipped BED file.

Pass the file name; it will be opened with python.open(). This is slightly different than benchmark.get_bed_dict() because it doesn’t require a refdict and only provides a list of 0-based positions, not a boolean array of all positions in the reference.

kbbq.compare_reads.p_to_q(p, maxscore=42)[source]¶

kbbq.compare_reads.q_to_p(q)[source]¶

kbbq.compare_reads.recalibrate_fastq(read, meanq, globaldeltaq, qscoredeltaq, positiondeltaq, dinucdeltaq, rg, dinuc_to_int, secondinpair=False, minscore=6, maxscore=42)[source]¶

kbbq.compare_reads.regression_recalibrate(lr, q)[source]¶

kbbq.compare_reads.train_regression(rawquals, corrected, tol=1e-08)[source]¶

kbbq.compare_reads.tstamp()[source]¶

Return the current time up to the second in ISO format as a string.

Returns:	current time with brackets
Return type:	str

kbbq.plot ¶

Utilities for plotting calibration and other relevant info.

kbbq.plot.plot_benchmark(fhin, outfile, plottype='calibration')[source]¶

Plot predicted vs. true probability or size of each quality bin.

This plots the predicted quality score (x) against the true quality score (y) for each quality bin. This is done by calculating the mean probability of error for each bin and converting that probability back to a quality score.

With plottype = ‘sample-size’, instead plots the size of each predicted quality score bin.

Parameters:	fhin (python.IOBase) – file containing benchmark data outfile (str) – name of file to save plot to

kbbq.benchmark ¶

Utilities for benchmarking calibration methods.

kbbq.benchmark.benchmark(bamfile, fafile, vcffile, fastqfile=None, label=None, use_oq=False, bedfh=None)[source]¶

Perform the benchmark and print the results to stdout.

If a fastqfile is not given, the benchmark will be taken from the reads in the BAM file. If a fastqfile is given, it should have the same read names as the reads in the BAM file. The names will then be used to look up where errors occur in the reads.

kbbq.benchmark.benchmark_bam(bamfile, ref, var_sites, use_oq=False, bedfh=None)[source]¶

kbbq.benchmark.benchmark_fastq(fqfile, bamfile, ref, var_sites, bedfh=None)[source]¶

kbbq.benchmark.calculate_q(errors, quals)[source]¶

Calculate Actual Q and Predicted Q given a flat array of errors and flat array of quality scores.

The index of the returned array represents the predicted quality score, the value represents the actual quality score or the number of bases.

kbbq.benchmark.get_bam_readname(read)[source]¶: Given a bam read, get a canonicalized name. The name is the query name + a suffix based on whether it is first or second in pair.

kbbq.benchmark.get_bamread_quals(read, use_oq=False)[source]¶

Return the qualities of the read as an np.array.

Use the OQ tag for read quals if use_oq = True.

kbbq.benchmark.get_bed_dict(refdict, bedfh)[source]¶

kbbq.benchmark.get_error_dict(bamfile, refdict, fullskips)[source]¶

Finds errors given an authoritative BAM, ref, and sites to skip.

The returned dict has readname keys and tuple(np.array, np.array) of bools as values.

kbbq.benchmark.get_fastq_readname(read)[source]¶: Given a fastq read, get a canonicalized name. This is based on whether the read is first in pair.

kbbq.benchmark.get_full_skips(refdict, var_sites, bedfh=None)[source]¶

kbbq.benchmark.get_ref_dict(reffilename)[source]¶

kbbq.benchmark.get_var_sites(vcf)[source]¶

kbbq.benchmark.print_benchmark(actual_q, label, nbases)[source]¶

Print benchmark data to a tab separated table.

Label should be a string and is used for all values in the table. No headers are printed. This means the benchmark command can be called many times and each table can be concatenated together.

kbbq.read ¶

Class and functions for read data.

This module includes the ReadData class and a few utility functions for working with BAM reads of type pysam.AlignedSegment.

Classes

ReadData

Functions

bamread_get_oq()

bamread_get_quals()

class kbbq.read.ReadData(seq, qual, skips, name, rg, second, errors)[source]¶

Bases: object

A class that represents some minimal information from a sequencing read.

Since this won’t hold as much data as a BAM read, if you want to recalibrate a read you should turn it into a ReadData object, then update the original read with the new information; otherwise you’ll lose annotations. This class also manages read group information. You should instantiate the object from one of the class methods instead of directly instantiating it if possible. You should never assign directly to any of the class attributes; treat them as read-only. Instance variables should be fine to manipulate.

Class Attributes

rg_to_pu - Dict of read group id’s -> platform unit

rg_to_int - Dict of read group id’s -> int

numrgs - int number of read groups

Instance Attributes

seq - Numpy array of characters representing the sequence

qual - Numpy array of int representing the quality score

skips - Numpy array of bools representing sites to skip

name - string representing the name of the read

rg - int representing the read group the read belongs to

second - bool representing whether the read is 2nd in pair

errors - Numpy array of bools representing whether the base is an error

Class Methods

from_bamread() - instantiate a ReadData object from a BAM read

from_fastq() - instantiate a ReadData object from a fastq read

load_rgs_from_bamfile() - load read groups into the class from a bam file object

Instance Methods

str_qual() - Get the quality score as a list of chars

canonical_name() - Return the name with a ‘/1’ or ‘/2’ suffix

get_rg_int() - Get the read group index using the rg_to_int dictionary

get_pu() - Get the PU from the read group and rg_to_pu dictionary.

not_skipped_errors() - Return a logical and of ~:attr:skips and errors

Note that from https://docs.python.org/3/library/stdtypes.html#dict , iteration order is guaranteed to be in insertion order. Thus we are OK saving the rg as an int on the fly so long as we don’t remove any read groups from the rg_to_pu dictionary. So don’t do that! In fact, we should probably remove __del__() and pop() from the dicts…

__init__(seq, qual, skips, name, rg, second, errors)[source]¶: Initialize self. See help(type(self)) for accurate signature.

__len__()[source]¶

Return sequence length.

Returns:	sequence length
Return type:	int

canonical_name()[source]¶

The name with an added suffix based on whether the read is firstinpair or not.

If the read has its second flag set to True, /2 is added to the end. Otherwise, /1 is added.

Returns:	read name with a suffix
Return type:	str

errors = None¶: numpy.ndarray of bools representing whether the base is an error

classmethod from_bamread(bamread, use_oq=False)[source]¶

ReadData factory that instantiates a ReadData object from a pysam AlignedSegment.

Skips and errors are initialized to be empty bool arrays. If you need to use these attributes, make sure you set them the way you need to for your use case.

If use_oq is True, use the OQ tag for base quality. The default is to use regular quality scores.

If the read group hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of rg. To load the dictionary, call load_rgs_from_bamfile() first. If there is no RG tag for the read it will be given a generic RG of None and lumped in with all other reads that have no RG tag.

This will reverse-complement the sequence and reverse the qualities if the read aligns on the reverse strand.

Parameters:	bamread (`pysam.AlignedSegment`) – read to get data from use_oq (bool) – use the OQ tag for quality scores
Returns:	a read object
Return type:	`ReadData`

classmethod from_fastq(fastqread, rg=None, second=None, namedelimiter='_')[source]¶

ReadData factory that instantiates a ReadData object from a pysam FastqProxy.

Skips and errors are initialized to be empty bool arrays. If you need to use them, make sure you set them the way you need to for your use case. The read name will be set to the first field of the delimited read name, minus any trailing /1 or /2 if they exist.

If rg is None (the default), we will attempt to infer the read group id from the read name. To infer rg, the read group must be in its own field with the field beginning with RG:, such as RG:example, with fields delimited by namedelimiter. When multiple fields are found that begin with RG:, the last is chosen to be the true ID. If inference fails, the read group will remain None.

If the read group (either inferred or provided explicitly) hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of rg.

If second is None (the default) we will attempt to infer if the read is second in pair based on the name. To infer second, the first field of the read name must end with /2. If the last 2 characters of the first field are not /2, second will be inferred to be false.

Parameters:	fastqread (`pysam.FastqProxy`) – The fastq read to get data from rg (str) – the read group the read belongs to second (bool) – whether the read is 2nd in pair namedelimiter (str) – the delimiter for parsing the read name
Returns:	a read object
Return type:	`ReadData`

get_cycle_array()[source]¶

Return an array of cycle values.

This is range(len(seq)) if second is false, otherwise it is -1..-len(seq) inclusive. This way we can hold cycle values as an array of size 2 * seqlen and store negative values at the end of the array.

Returns:	cycle values
Return type:	`numpy.ndarray` (int)

get_cycle_errors()[source]¶

Return an array of cycle values that were errors and of all valid cycle values.

The errors are always a subset of the valid values.

Returns:	erroneous and valid cycle values
Return type:	tuple(`numpy.ndarray` (int) , `numpy.ndarray` (int))

get_dinuc_errors(minscore=6)[source]¶

Return an array of dinucleotide values that were errors and of all valid dinucleotide values.

The errors are always a subset of the valid values.

Returns:	erroneous and valid dinucleotide values
Return type:	tuple(`numpy.ndarray` (int) , `numpy.ndarray` (int))

get_dinucleotide_array(minscore=6)[source]¶

Return an array of dinucleotide values.

The character to int map is stored in kbbq.compare_reads.Dinucleotide.

Returns:	dinucleotide values
Return type:	`numpy.ndarray` (int)

get_pu()[source]¶

Return the PU from the rg_to_pu dict.

Returns:	The Read Group’s PU tag
Return type:	str

get_q_errors()[source]¶

Return an array of q values that were errors and of all valid q values.

The errors are always a subset of the valid values.

Returns:	erroneous and valid q values
Return type:	tuple(`numpy.ndarray` (int) , `numpy.ndarray` (int))

get_rg_errors()[source]¶

Return an array of rg values that were errors and all valid rg values.

The errors are always a subset of the valid sites.

For example, a read with 4 non skipped sites and 1 error will return ([rg], [rg, rg, rg, rg])

Returns:	errors and valid rg values
Return type:	tuple(`numpy.ndarray` (int) , `numpy.ndarray` (int))

get_rg_int()[source]¶

Return the RG as an int suitable for indexing rather than as the actual string RG ID.

Returns:	read group index
Return type:	int

classmethod load_rgs_from_bamfile(bamfileobj)[source]¶

Load read group IDs and PUs from the header of the pysam bamfileobj.

Recalibration is done on a PU basis, so when building a model using reads in a BAM, the PU for each read is obtained by looking up the PU associated with its read group. The dictionary that controls these lookups is a class variable. To store the RG data in an int form, an rg-to-int dict is also loaded. This should be equal to dict(zip(rg_to_pu, range(len(rg_to_pu))))

Parameters:	bamfileobj (`pysam.AlignmentFile`) – The opened bam file

name = None¶: string representing the name of the read

not_skipped_errors()[source]¶

Return a logical and of not skips and errors

Returns:	array of valid errors
Return type:	`numpy.ndarray` (bool)

numrgs = 0¶: Number of readgroups encountered so far.

qual = None¶: numpy.ndarray of int representing the quality score

rg = None¶: int representing the read group the read belongs to

rg_to_int = {}¶: Dict mapping RG ids to integer indices.

rg_to_pu = {}¶: Dict mapping RG ids to the PU tag for the read group.

second = None¶: bool representing whether the read is 2nd in pair

seq = None¶: numpy.ndarray of characters representing the sequence

skips = None¶: numpy.ndarray of bools representing sites to skip

str_qual(offset=33)[source]¶

Get the quality of this read as a list of characters.

offset (default 33) will be added to the ASCII value of each character.

Parameters:	offset (int) – Offset to add
Returns:	read quality
Return type:	list(chr)

kbbq.read.bamread_get_oq(read, offset=33)[source]¶

Get the OQ of the given bamread as an array of int.

offset (default 33) will be subtracted from the ASCII value of each character.

Parameters:	read (`pysam.AlignedSegment`) – Read to get quals from offset (int) – Offset to subtract
Returns:	quality scores
Return type:	`numpy.ndarray` of int

kbbq.read.bamread_get_quals(read, use_oq=False)[source]¶

Return the qualities of a pysam bam read as an array.

If use_oq = True, use the OQ tag for base quality. The default is to use regular quality scores.

Parameters:	read (`pysam.AlignedSegment`) – Read to get quals from use_oq (bool) – Use OQ tag for quality scores
Returns:	quality scores
Return type:	`numpy.ndarray` of int

kbbq.covariate ¶

A data class and helper functions for read covariates.

This module includes the Covariate class and subclasses, as well as the pad_axis() helper function for appending to the end of an arbitrary axis of an numpy.ndarray.

Classes

Covariate - a covariate base class
RGCovariate
QCovariate
CycleCovariate
DinucCovariate
CovariateData - a container for the covariates

Functions

pad_axis() - helper function to pad an np.ndarray with zeros.

class kbbq.covariate.Covariate(shape=0)[source]¶

Bases: object

A base class for covariates.

Holds 2 np.ndarray, errors for number of errors and total for the total number of observations of this covariate. The indices of each array are they keys; ie. the number of times covariate 2 was observed can be found with self.total[2].

Warning

Don’t assign directly to errors or total. shape() assumes both arrays have the same shape and only returns the shape of total. If the arrays’ shapes become different somehow, __getitem__() and __setitem__() may start throwing errors unexpectedly. If you want to use two arrays that aren’t guaranteed to have the same size, use two arrays. This isn’t the right class for that use case.

__getitem__() will pass the key to both arrays and return a tuple of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.

__setitem__() will also pass the key to both arrays, and will expect value to be a tuple of values, setting the first value to errors and the second to total.

Attributes

errors - numpy.ndarray of the number of errors observed

total - numpy.ndarray of the total number of observations

Methods

__init__() - Initialize arrays with desired shape.

pad_axis() - Pad specified axis of both arrays

pad_axis_to_fit() - Pad specified axis so the given index is valid

increment() - Increment the given indices of each array.

shape() - return the shape of the arrays.

__getitem__() - get number of errors and total observations at the given index

__setitem__() - set errors and observations at the given index

__getitem__(key)[source]¶

Pass key to each np.ndarray and return the result as a tuple.

The result will be in order of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.

Parameters:	key (slice) – indices to get
Returns:	number of errors and number of total observations
Return type:	tuple(int, int) or tuple( `numpy.ndarray` , `numpy.ndarray` )

__init__(shape=0)[source]¶

Initialize the arrays with the given shapes.

Parameters:	tuple (shape) – desired shape of error and total arrays
Returns:	empty object
Return type:	`Covariate`

__setitem__(key, value)[source]¶

Pass key to each np.ndarray and assign each value.

Value must be a tuple of values to assign. The first value is assigned to errors and the second to total.

Parameters:	key (slice) – indices to set value (tuple(int,int)) – values to set

errors = None¶: np.ndarray (int) holding number of errors observed for each key.

increment(idx, value=(1, 1))[source]¶

Increment the given indices by value.

idx should be a tuple of values interpretable by np.add().

The first indices given will increment the error array. The second indices given will increment the total array.

Note that this is implemented with np.add.at(). For example, if idx[0] is [2,2], errors [2] will be incremented twice. This is equivalent to

errors[2] += 1
errors[2] += 1

If value[0] is 2, errors will be instead incremented by 4 and is equivalent to

errors[2] += 2
errors[2] += 2

Parameters:	idx (tuple( index , index )) – indices to increment array at value (typle( int, int )) – value to add for each index given.

pad_axis(axis, n=1)[source]¶

Expand axis dimension by n.

Parameters:	axis (int) – axis to pad n (int) – size of pad to apply

pad_axis_to_fit(axis, idx)[source]¶

Expand the axis until idx is a valid idx in that dimension.

Does nothing if idx is already valid. idx should be a single scalar value, not a slice.

Parameters:	axis (int) – axis to pad idx (int) – index to ensure is valid

shape()[source]¶

Return the shape of the underlying arrays.

Returns:	array shape
Return type:	tuple(int)

total = None¶: np.ndarray (int) holding number of observations for each key.

class kbbq.covariate.CovariateData[source]¶

Bases: object

A class for holding all the covariate data needed to recalibrate reads.

Attributes

rgcov - RGCovariate

qcov - QCovariate

poscov - PosCovariate

dinuccov - DinucCovariate

Methods

consume_read() - Add covariates from the read to the proper data arrays

get_num_rgs() - Return number of rgs in the rg covariate

get_num_qs() - Return number of qs in the q covariate

get_num_cycles() - Return number of cycles in the cycle covariate

get_num_dinucs() - Return number of dinucs in the dinuc covariate

__init__()[source]¶: Initialize covariate arrays.

consume_read(read)[source]¶

Add covariates from the read to the proper data arrays.

Parameters:	read (`kbbq.read.ReadData`) – the read to take data from

cyclecov = None¶: Cycle covariates

dinuccov = None¶: Dinucleotide covariates

get_num_cycles()[source]¶

Return number of cycles in the cycle covariate

This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.

Returns:	half the length of cycle vector
Return type:	int

get_num_dinucs()[source]¶

Return number of dinucs in the dinuc covariate.

Returns:	length of the dinuc vector
Return type:	int

get_num_qs()[source]¶

Return number of qs in the q covariate

Returns:	length of q vector
Return type:	int

get_num_rgs()[source]¶

Return number of rgs in the rg covariate

Returns:	length of rg vector
Return type:	int

qcov = None¶: Q covariates

class kbbq.covariate.CycleCovariate[source]¶

Bases: kbbq.covariate.Covariate

A 3d covariate array in the read group, Q, and cycle dimensions.

pad_axis() is overriden here since the way cycles are stored means appending to the end of the array changes the location of the data.

Methods:

__init__() - Initialize empty arrays

pad_axis() - Pad given axis.

__init__()[source]¶: Initialize empty arrays.

num_cycles()[source]¶

Return number of cycles in the cycle covariate

This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.

Returns:	half the length of cycle vector
Return type:	int

pad_axis(axis, n=1)[source]¶

Expand the axis dimension by n.

This is overriden for the CycleCovariate since we use one array to handle both positive and negative cycle values. Appending 0’s to the end would ruin the array since negative indices are significant. Thus, we have to wrangle the existing data a bit.

Parameters:	axis (int) – axis to pad n (int) – size of pad to apply

class kbbq.covariate.DinucCovariate[source]¶

Bases: kbbq.covariate.Covariate

A 3d covariate array in the read group, Q, and dinucleotide dimensions.

__init__()[source]¶: Initialize empty arrays.

num_dinucs()[source]¶

Return length of dinucleotide array.

Returns:	number of dinucleotides
Return type:	int

class kbbq.covariate.QCovariate[source]¶

Bases: kbbq.covariate.Covariate

A 2d covariate array in the read group and Q dimensions.

This class holds a RGCovariate to efficiently increment both objects while iterating through reads.

Attributes

rgcov - RG covariates

Methods

__init__() - Initialize arrays

consume_read() - increment the rg and q covariates given a kbbq.read.ReadData

num_qs() - return number of qs observed so far

__init__()[source]¶: Initialize an empty RGCovariate and initialize the empty q arrays.

consume_read(read)[source]¶

Increment the rg and q covariates given a kbbq.read.ReadData

Parameters:	read (`kbbq.read.ReadData`) – read to take covariate values from
Returns:	erroneous and valid read group values and erroneous and valid q values
Return type:	(( `numpy.ndarray` , `numpy.ndarray` ),( `numpy.ndarray` , `numpy.ndarray` ))

num_qs()[source]¶

Return size needed to hold largest q observed so far.

Returns:	largest q + 1
Type:	int

rgcov = None¶: RG covariates

class kbbq.covariate.RGCovariate[source]¶

Bases: kbbq.covariate.Covariate

A 1d covariate array in the read group dimension.

Methods

__init__() - Initialize arrays

consume_read() - increment the covariates given a kbbq.read.ReadData

num_rgs() - return number of rgs observed so far

__init__()[source]¶

Initialize a 1D read group array.

Returns:	empty object
Return type:	`Covariate`

consume_read(read)[source]¶

Increment the rg covariates given a kbbq.read.ReadData

Parameters:	read (`kbbq.read.ReadData`) – read to take covariate values from
Returns:	erroneous and valid read group values
Return type:	tuple( `numpy.ndarray` , `numpy.ndarray` )

num_rgs()[source]¶

Return size needed to hold largest rg observed so far.

Returns:	largest rg + 1
Type:	int

kbbq.covariate.pad_axis(array, axis, n)[source]¶

Pad the axis of np.ndarray with n zeros.

Parameters:	axis (int) – axis to pad n (int) – size of pad to apply
Returns:	padded array
Return type:	`numpy.npdarray`

Module Reference¶

kbbq¶

kbbq.recaltable¶

kbbq.compare_reads¶

kbbq.plot¶

kbbq.benchmark¶

kbbq.read¶

kbbq.covariate¶

kbbq ¶

kbbq.recaltable ¶

kbbq.compare_reads ¶

kbbq.plot ¶

kbbq.benchmark ¶

kbbq.read ¶

kbbq.covariate ¶