Module Reference¶
kbbq.recaltable¶
Provides utilities for interfacing with GATK Recalibration tables.
-
class
kbbq.recaltable.GATKReport(tables, version='1.1')[source]¶ Bases:
objectA base class representing a GATK Report file that contains many tables.
Attributes
Methods
write()- write the report to a filefromfile()- read the report from a fileget_headerstring()- get the header line.- Does not include a newline.
-
__init__(tables, version='1.1')[source]¶ Initialize a GATKReport.
This module was written by reverse engineering a version 1.1 report. If you have a table with a newer version, please file a bug report, especially if this module fails to parse it.
Parameters: Returns: a representation of a report
Return type:
-
classmethod
fromfile(filename)[source]¶ Initialize the GATKReport from a file.
Parameters: filename (str) – The file to be read. Returns: a representation of a report Return type: GATKReport
-
get_headerstring()[source]¶ Get the header line.
The header line identifies the file as a report, shows the version, and shows the number of tables that should be present (such as
#:GATKReport.v1.1:5). This method returns a string that does not include a trailing newline.Returns: the header line without a newline. Return type: str
-
version= None¶ The report version identifier (such as
'1.1')
-
class
kbbq.recaltable.GATKTable(title, description, data)[source]¶ Bases:
objectA class representing a GATK report table.
The intended method of interacting with the table is using the data attribute, which is a standard Pandas dataframe.
Attributes
title- The title of the tabledescription- A description of the table. May be''.data- A Pandas dataframe containing the table data. Accessing this- attribute is the primary way to interact with the data.
typemap- A dict mapping types to a character to be used- for constructing the format string.
precisionmap- A dict mapping column headers to precision- to be used for constructing the format string.
Methods
fromstring()- Initialize aGATKTablefrom a string.parse_fmtstring()- Parse a fmtstring into a dictionary of typesget_fmtstring()- Get a formatstring by inspecting the dataframe.get_unindexed()- Get a copy of the dataframe with indices resetget_colfmts()- Get a list of format strings for the data in- the dataframe
get_titlestring()- Get the title stringget_datastring()- Format the dataframe into a stringget_nrows()- Get the number of rows in the dataframe.get_ncols()- Get the number of columns in the dataframe.write()- Write the table to a filehandle.
-
__init__(title, description, data)[source]¶ Initialize the table.
Parameters: - title (str) – the table title
- description (str) – a short description of the table. May be
'' - data (pandas.DataFrame) – the data contained by the table
-
data= None¶ A Pandas dataframe containing the table data. Accessing this attribute is the primary way to interact with the data.
-
description= None¶ A short description of the table.
-
classmethod
fromstring(tablestring)[source]¶ Initialize the table from a string
The first line of the table contains the a string describing the table format and is like
#:GATKTable:ncol:nrow:%f:%f:%f:;. It is:delimited and;terminated. The last format string will be empty. The second line of the table specifies the name of the table. It is formatted like#:GATKTable:title:description.descriptioncan be an empty string. It is:delimited. The third line contains the header. The header and row data are delimited by two or more spaces. The number of spaces are determined by padding each value in the row such that the the character width of the column is constant and can represent all the data in the column. Strings (including column headings) are left justified and numeric types are right justified.
-
get_colfmts()[source]¶ Get the formats for each column in the dataframe.
Uses
typemapandprecisionmapto create the strings. This will return something like['%s', '%s', '%.2f', '%.4f']Returns: format strings for each column Return type: list(str)
-
get_datastring()[source]¶ Get the data string. Includes the header line and the data in the table.
Returns: the data table as a formatted string Return type: str
-
get_fmtstring()[source]¶ Get the format string by inspecting the dataframe.
Uses the dtypes, number of rows, and number of columns in the data frame along with the strings returned by
get_colfmts()to create the format stringwhich is formatted like
#:GATKTable:ncol:nrow:%f:%f:%f:;.Returns: The format string Return type: str
-
get_ncols()[source]¶ Get the number of columns in the dataframe.
Returns: The number of columns in the dataframe. Return type: int
-
get_nrows()[source]¶ Get the number of rows in the dataframe.
Returns: The number of rows in the dataframe. Return type: int
-
get_titlestring()[source]¶ Get the titlestring.
Uses the
titleanddescriptionto create the title line. It is formatted like#:GATKTable:title:subtitle.Returns: the title string Return type: str
-
get_unindexed()[source]¶ Return a copy of the dataframe without the indexes included.
If the table has an unnamed index, such as the default index, return a copy of the data frame. Otherwise, reset the index.
-
static
parse_fmtstring(header, fmtstring)[source]¶ Turn a fmtstring into a dictionary of {col_title : type}
Parameters: Returns: A dictionary mapping {col_title : type}
Return type:
-
precisionmap= None¶ A dictionary of {colname : str} to determine the precision string to prepend to each formatstring. This only affects
get_fmtstring().
-
title= None¶ The title of the table.
-
typemap= None¶ A dictionary of {type : str} to determine the string for each type.
Note that in the current implementation this only affects
get_fmtstring().
-
write(filehandle)[source]¶ Write the table to a filehandle.
Adds a newline to the end that isn’t present in
__str__()Parameters: filehandle (File Object) – The file object to write the table to. Returns: Whatever the file object’s write method returns. Usually the number of characters written. Return type: Usually int
-
class
kbbq.recaltable.RecalibrationReport(tables, version='1.1')[source]¶ Bases:
kbbq.recaltable.GATKReportA class representing a GATK Recalibration Report used for Base Quality Score Recalibration.
It extends the
GATKReport, so the primary method of interacting with the table is thetablesattribute. The primary difference is the__init__()function will set indices and data types as they should be for a recalibration report.- Table
tables[0]contains the arguments used to call the program - that generated the table.
- Table
- Table
tables[1]contains the quanization map - Table
tables[2]contains read group error information. - Table
tables[3]contains error information subset by read group, - then reported quality score.
- Table
- Table
tables[4]contains error information subset by read group, - then reported quality score, then covariate name and value. Both Context and Cycle are valid for the covariate name.
- Table
See the table below for a schema.
Table Number of Columns Variables 0 2 Argument, Value 1 3 QualityScore, Count, QuantizedScore 2 6 ReadGroup, EventType, EmpiricalQuality, EstimatedQReported, Observations, Errors 3 6 ReadGroup, QualityScore, EventType, EmpiricalQuality, Observations, Errors 4 8 ReadGroup, QualityScore, CovariateValue, CovariateName, EventType, EmpiricalQuality, Observations, Errors For current GATK parameters, EventType will be
Mfor mismatch thoughIfor insertion andDfor deletion are possible. Valid values for CovariateName areContextandCycle.-
__init__(tables, version='1.1')[source]¶ Initialize the recalibration report from a list of
GATKTables.This initializes the tables with some data wrangling to set indices and data types properly. If you don’t want this, use a
GATKReportinstead.
kbbq.compare_reads¶
A mish-mash of functions for recalibration. Should probably be renamed in the near future.
-
class
kbbq.compare_reads.Dinucleotide[source]¶ Bases:
objectA class to cache dinucleotides and maintain a consistent dinuc -> int map throughout the module.
-
complement= {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}¶ Dictionary for complementing nucleotides
-
dinuc_to_int= {'AA': 0, 'AC': 3, 'AG': 2, 'AT': 1, 'CA': 12, 'CC': 15, 'CG': 14, 'CT': 13, 'GA': 8, 'GC': 11, 'GG': 10, 'GT': 9, 'TA': 4, 'TC': 7, 'TG': 6, 'TT': 5}¶ Dictionary mapping dinuc -> int
-
dinucs= ['AA', 'AT', 'AG', 'AC', 'TA', 'TT', 'TG', 'TC', 'GA', 'GT', 'GG', 'GC', 'CA', 'CT', 'CG', 'CC']¶ List of valid dinucleotides
-
nucleotides= ['A', 'T', 'G', 'C']¶ List of valid nucleotides
-
vectorized_complement= <numpy.vectorize object>¶
-
vectorized_get= <numpy.vectorize object>¶
-
-
class
kbbq.compare_reads.RescaledNormal[source]¶ Bases:
objectA class to cache the rescaled normal prior used in the bayesian recalibration model.
Attributes
maxscore- max score supportedprior_dist- numpy array of prior probability
Methods
prior()- get the prior probability for a given quality score difference
Most of these attributes are nonsense; the “proper” way to interact with the class is via the
prior_distarray. If that makes you uncomfortable, use the provided accessor functionprior().Under no circumstances should you attempt to replace any of these attributes, it will most likely not have the desired effect. The caching mechanism here only works because the class attributes are immediately instatiated when the class is created, so by the time you replace them it won’t matter. Reimplement it yourself if you want a different prior.
-
i= 42¶
-
maxscore= 42¶ The maximum quality score supported by this class.
-
oldset= {'divide': 'warn', 'invalid': 'warn', 'over': 'warn', 'under': 'ignore'}¶
-
possible_diffs= array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42])¶
-
classmethod
prior(difference)[source]¶ Return the prior probability for a given difference in quality score.
Parameters: difference (int) – The difference in quality score Returns: the prior probability Return type: np.longdouble
-
prior_dist= array([-1.05360516e-01, -2.10536052e+00, -8.10536052e+00, -1.81053605e+01, -3.21053605e+01, -5.01053605e+01, -7.21053605e+01, -9.81053605e+01, -1.28105361e+02, -1.62105361e+02, -2.00105361e+02, -2.42105361e+02, -2.88105361e+02, -3.38105361e+02, -3.92105361e+02, -4.50105361e+02, -5.12105361e+02, -5.78105361e+02, -6.48105361e+02, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf], dtype=float128)¶
-
kbbq.compare_reads.fastq_infer_rg(read)[source]¶ Infer the read group from appended read information, such as produced by the samtools fastq command.
Requires the read to be formatted such the rg tag is added to the end of the read name delimited by a
_. Returns the read group tag.
-
kbbq.compare_reads.find_read_errors(read, ref, variable)[source]¶ Use the CIGAR to find errors in the read or sites to skip.
Softclipped bases will be added to the skip array. Returns a tuple with the error array and the skip array. We don’t consider indel errors.
-
kbbq.compare_reads.gatk_delta_q(prior_q, numerrs, numtotal, maxscore=42)[source]¶ Calculate the shift in quality scores from the prior given data.
This is achieved by finding the difference between the maximum a posteriori and the prior point estimate of Q.
-
kbbq.compare_reads.get_var_sites(vcf)[source]¶ Get positions covered by any record in a VCF file.
Pass the file name. Each value in the returned dict will be a list of 0-based positions.
-
kbbq.compare_reads.load_positions(posfile)[source]¶ Get positions that are covered by a non-zipped BED file.
Pass the file name; it will be opened with
python.open(). This is slightly different thanbenchmark.get_bed_dict()because it doesn’t require a refdict and only provides a list of 0-based positions, not a boolean array of all positions in the reference.
kbbq.plot¶
Utilities for plotting calibration and other relevant info.
-
kbbq.plot.plot_benchmark(fhin, outfile, plottype='calibration')[source]¶ Plot predicted vs. true probability or size of each quality bin.
This plots the predicted quality score (x) against the true quality score (y) for each quality bin. This is done by calculating the mean probability of error for each bin and converting that probability back to a quality score.
With plottype = ‘sample-size’, instead plots the size of each predicted quality score bin.
Parameters: - fhin (python.IOBase) – file containing benchmark data
- outfile (str) – name of file to save plot to
kbbq.benchmark¶
Utilities for benchmarking calibration methods.
-
kbbq.benchmark.benchmark(bamfile, fafile, vcffile, fastqfile=None, label=None, use_oq=False, bedfh=None)[source]¶ Perform the benchmark and print the results to stdout.
If a fastqfile is not given, the benchmark will be taken from the reads in the BAM file. If a fastqfile is given, it should have the same read names as the reads in the BAM file. The names will then be used to look up where errors occur in the reads.
-
kbbq.benchmark.calculate_q(errors, quals)[source]¶ Calculate Actual Q and Predicted Q given a flat array of errors and flat array of quality scores.
The index of the returned array represents the predicted quality score, the value represents the actual quality score or the number of bases.
-
kbbq.benchmark.get_bam_readname(read)[source]¶ Given a bam read, get a canonicalized name. The name is the query name + a suffix based on whether it is first or second in pair.
-
kbbq.benchmark.get_bamread_quals(read, use_oq=False)[source]¶ Return the qualities of the read as an np.array.
Use the OQ tag for read quals if use_oq = True.
-
kbbq.benchmark.get_error_dict(bamfile, refdict, fullskips)[source]¶ Finds errors given an authoritative BAM, ref, and sites to skip.
The returned dict has readname keys and tuple(np.array, np.array) of bools as values.
-
kbbq.benchmark.get_fastq_readname(read)[source]¶ Given a fastq read, get a canonicalized name. This is based on whether the read is first in pair.
-
kbbq.benchmark.print_benchmark(actual_q, label, nbases)[source]¶ Print benchmark data to a tab separated table.
Label should be a string and is used for all values in the table. No headers are printed. This means the benchmark command can be called many times and each table can be concatenated together.
kbbq.read¶
Class and functions for read data.
This module includes the ReadData class and a few utility functions for
working with BAM reads of type pysam.AlignedSegment.
Classes
Functions
-
class
kbbq.read.ReadData(seq, qual, skips, name, rg, second, errors)[source]¶ Bases:
objectA class that represents some minimal information from a sequencing read.
Since this won’t hold as much data as a BAM read, if you want to recalibrate a read you should turn it into a ReadData object, then update the original read with the new information; otherwise you’ll lose annotations. This class also manages read group information. You should instantiate the object from one of the class methods instead of directly instantiating it if possible. You should never assign directly to any of the class attributes; treat them as read-only. Instance variables should be fine to manipulate.
Class Attributes
Instance Attributes
seq- Numpy array of characters representing the sequencequal- Numpy array of int representing the quality scoreskips- Numpy array of bools representing sites to skipname- string representing the name of the readrg- int representing the read group the read belongs tosecond- bool representing whether the read is 2nd in pairerrors- Numpy array of bools representing whether the base is an error
Class Methods
from_bamread()- instantiate a ReadData object from a BAM readfrom_fastq()- instantiate a ReadData object from a fastq readload_rgs_from_bamfile()- load read groups into the class from a bam file object
Instance Methods
str_qual()- Get the quality score as a list of charscanonical_name()- Return the name with a ‘/1’ or ‘/2’ suffixget_rg_int()- Get the read group index using the rg_to_int dictionaryget_pu()- Get the PU from the read group and rg_to_pu dictionary.not_skipped_errors()- Return a logical and of ~:attr:skips anderrors
Note that from https://docs.python.org/3/library/stdtypes.html#dict , iteration order is guaranteed to be in insertion order. Thus we are OK saving the rg as an int on the fly so long as we don’t remove any read groups from the rg_to_pu dictionary. So don’t do that! In fact, we should probably remove __del__() and pop() from the dicts…
-
__init__(seq, qual, skips, name, rg, second, errors)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
canonical_name()[source]¶ The name with an added suffix based on whether the read is firstinpair or not.
If the read has its second flag set to True,
/2is added to the end. Otherwise,/1is added.Returns: read name with a suffix Return type: str
-
errors= None¶ numpy.ndarrayof bools representing whether the base is an error
-
classmethod
from_bamread(bamread, use_oq=False)[source]¶ ReadData factory that instantiates a ReadData object from a pysam AlignedSegment.
Skips and errors are initialized to be empty bool arrays. If you need to use these attributes, make sure you set them the way you need to for your use case.
If
use_oqisTrue, use the OQ tag for base quality. The default is to use regular quality scores.If the read group hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of
rg. To load the dictionary, callload_rgs_from_bamfile()first. If there is no RG tag for the read it will be given a generic RG of None and lumped in with all other reads that have no RG tag.This will reverse-complement the sequence and reverse the qualities if the read aligns on the reverse strand.
Parameters: - bamread (
pysam.AlignedSegment) – read to get data from - use_oq (bool) – use the OQ tag for quality scores
Returns: a read object
Return type: - bamread (
-
classmethod
from_fastq(fastqread, rg=None, second=None, namedelimiter='_')[source]¶ ReadData factory that instantiates a ReadData object from a pysam FastqProxy.
Skips and errors are initialized to be empty bool arrays. If you need to use them, make sure you set them the way you need to for your use case. The read name will be set to the first field of the delimited read name, minus any trailing
/1or/2if they exist.If
rgis None (the default), we will attempt to infer the read group id from the read name. To infer rg, the read group must be in its own field with the field beginning withRG:, such asRG:example, with fields delimited bynamedelimiter. When multiple fields are found that begin withRG:, the last is chosen to be the true ID. If inference fails, the read group will remain None.If the read group (either inferred or provided explicitly) hasn’t been loaded in the dictionary, it will be registered with a PU equal to the value of
rg.If second is None (the default) we will attempt to infer if the read is second in pair based on the name. To infer second, the first field of the read name must end with
/2. If the last 2 characters of the first field are not/2, second will be inferred to be false.Parameters: - fastqread (
pysam.FastqProxy) – The fastq read to get data from - rg (str) – the read group the read belongs to
- second (bool) – whether the read is 2nd in pair
- namedelimiter (str) – the delimiter for parsing the read name
Returns: a read object
Return type: - fastqread (
-
get_cycle_array()[source]¶ Return an array of cycle values.
This is
range(len(seq))if second is false, otherwise it is-1..-len(seq)inclusive. This way we can hold cycle values as an array of size 2 * seqlen and store negative values at the end of the array.Returns: cycle values Return type: numpy.ndarray(int)
-
get_cycle_errors()[source]¶ Return an array of cycle values that were errors and of all valid cycle values.
The errors are always a subset of the valid values.
Returns: erroneous and valid cycle values Return type: tuple( numpy.ndarray(int) ,numpy.ndarray(int))
-
get_dinuc_errors(minscore=6)[source]¶ Return an array of dinucleotide values that were errors and of all valid dinucleotide values.
The errors are always a subset of the valid values.
Returns: erroneous and valid dinucleotide values Return type: tuple( numpy.ndarray(int) ,numpy.ndarray(int))
-
get_dinucleotide_array(minscore=6)[source]¶ Return an array of dinucleotide values.
The character to int map is stored in
kbbq.compare_reads.Dinucleotide.Returns: dinucleotide values Return type: numpy.ndarray(int)
-
get_pu()[source]¶ Return the PU from the rg_to_pu dict.
Returns: The Read Group’s PU tag Return type: str
-
get_q_errors()[source]¶ Return an array of q values that were errors and of all valid q values.
The errors are always a subset of the valid values.
Returns: erroneous and valid q values Return type: tuple( numpy.ndarray(int) ,numpy.ndarray(int))
-
get_rg_errors()[source]¶ Return an array of rg values that were errors and all valid rg values.
The errors are always a subset of the valid sites.
For example, a read with 4 non skipped sites and 1 error will return
([rg], [rg, rg, rg, rg])Returns: errors and valid rg values Return type: tuple( numpy.ndarray(int) ,numpy.ndarray(int))
-
get_rg_int()[source]¶ Return the RG as an int suitable for indexing rather than as the actual string RG ID.
Returns: read group index Return type: int
-
classmethod
load_rgs_from_bamfile(bamfileobj)[source]¶ Load read group IDs and PUs from the header of the pysam bamfileobj.
Recalibration is done on a PU basis, so when building a model using reads in a BAM, the PU for each read is obtained by looking up the PU associated with its read group. The dictionary that controls these lookups is a class variable. To store the RG data in an int form, an rg-to-int dict is also loaded. This should be equal to
dict(zip(rg_to_pu, range(len(rg_to_pu))))Parameters: bamfileobj ( pysam.AlignmentFile) – The opened bam file
-
name= None¶ string representing the name of the read
-
not_skipped_errors()[source]¶ Return a logical and of not
skipsanderrorsReturns: array of valid errors Return type: numpy.ndarray(bool)
-
numrgs= 0¶ Number of readgroups encountered so far.
-
qual= None¶ numpy.ndarrayof int representing the quality score
-
rg= None¶ int representing the read group the read belongs to
-
rg_to_int= {}¶ Dict mapping RG ids to integer indices.
-
rg_to_pu= {}¶ Dict mapping RG ids to the PU tag for the read group.
-
second= None¶ bool representing whether the read is 2nd in pair
-
seq= None¶ numpy.ndarrayof characters representing the sequence
-
skips= None¶ numpy.ndarrayof bools representing sites to skip
-
kbbq.read.bamread_get_oq(read, offset=33)[source]¶ Get the OQ of the given bamread as an array of int.
offset (default 33) will be subtracted from the ASCII value of each character.
Parameters: - read (
pysam.AlignedSegment) – Read to get quals from - offset (int) – Offset to subtract
Returns: quality scores
Return type: numpy.ndarrayof int- read (
-
kbbq.read.bamread_get_quals(read, use_oq=False)[source]¶ Return the qualities of a pysam bam read as an array.
If use_oq = True, use the OQ tag for base quality. The default is to use regular quality scores.
Parameters: - read (
pysam.AlignedSegment) – Read to get quals from - use_oq (bool) – Use OQ tag for quality scores
Returns: quality scores
Return type: numpy.ndarrayof int- read (
kbbq.covariate¶
A data class and helper functions for read covariates.
This module includes the Covariate class and subclasses,
as well as the pad_axis() helper function for appending
to the end of an arbitrary axis of an numpy.ndarray.
- Classes
Covariate- a covariate base classRGCovariateQCovariateCycleCovariateDinucCovariateCovariateData- a container for the covariates
- Functions
pad_axis()- helper function to pad annp.ndarraywith zeros.
-
class
kbbq.covariate.Covariate(shape=0)[source]¶ Bases:
objectA base class for covariates.
Holds 2
np.ndarray,errorsfor number of errors andtotalfor the total number of observations of this covariate. The indices of each array are they keys; ie. the number of times covariate 2 was observed can be found withself.total[2].Warning
Don’t assign directly to
errorsortotal.shape()assumes both arrays have the same shape and only returns the shape oftotal. If the arrays’ shapes become different somehow,__getitem__()and__setitem__()may start throwing errors unexpectedly. If you want to use two arrays that aren’t guaranteed to have the same size, use two arrays. This isn’t the right class for that use case.__getitem__()will pass the key to both arrays and return a tuple of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.__setitem__()will also pass the key to both arrays, and will expect value to be a tuple of values, setting the first value to errors and the second to total.Attributes
errors-numpy.ndarrayof the number of errors observedtotal-numpy.ndarrayof the total number of observations
Methods
__init__()- Initialize arrays with desired shape.pad_axis()- Pad specified axis of both arrayspad_axis_to_fit()- Pad specified axis so the given index is validincrement()- Increment the given indices of each array.shape()- return the shape of the arrays.__getitem__()- get number of errors and total observations at the given index__setitem__()- set errors and observations at the given index
-
__getitem__(key)[source]¶ Pass key to each
np.ndarrayand return the result as a tuple.The result will be in order of (errors, total). As a pneumonic, you can remember this as a fraction from left to right, as total should always be larger than errors.
Parameters: key (slice) – indices to get Returns: number of errors and number of total observations Return type: tuple(int, int) or tuple( numpy.ndarray,numpy.ndarray)
-
__init__(shape=0)[source]¶ Initialize the arrays with the given shapes.
Parameters: tuple (shape) – desired shape of error and total arrays Returns: empty object Return type: Covariate
-
__setitem__(key, value)[source]¶ Pass key to each
np.ndarrayand assign each value.Value must be a tuple of values to assign. The first value is assigned to errors and the second to total.
Parameters:
-
errors= None¶ np.ndarray(int) holding number of errors observed for each key.
-
increment(idx, value=(1, 1))[source]¶ Increment the given indices by value.
idxshould be a tuple of values interpretable bynp.add().The first indices given will increment the error array. The second indices given will increment the total array.
Note that this is implemented with
np.add.at(). For example, ifidx[0]is[2,2],errors[2] will be incremented twice. This is equivalent toerrors[2] += 1 errors[2] += 1
If
value[0]is 2,errorswill be instead incremented by 4 and is equivalent toerrors[2] += 2 errors[2] += 2
Parameters:
-
pad_axis_to_fit(axis, idx)[source]¶ Expand the axis until idx is a valid idx in that dimension.
Does nothing if idx is already valid. idx should be a single scalar value, not a slice.
Parameters:
-
shape()[source]¶ Return the shape of the underlying arrays.
Returns: array shape Return type: tuple(int)
-
total= None¶ np.ndarray(int) holding number of observations for each key.
-
class
kbbq.covariate.CovariateData[source]¶ Bases:
objectA class for holding all the covariate data needed to recalibrate reads.
Attributes
Methods
consume_read()- Add covariates from the read to the proper data arraysget_num_rgs()- Return number of rgs in the rg covariateget_num_qs()- Return number of qs in the q covariateget_num_cycles()- Return number of cycles in the cycle covariateget_num_dinucs()- Return number of dinucs in the dinuc covariate
-
consume_read(read)[source]¶ Add covariates from the read to the proper data arrays.
Parameters: read ( kbbq.read.ReadData) – the read to take data from
-
cyclecov= None¶ Cycle covariates
-
dinuccov= None¶ Dinucleotide covariates
-
get_num_cycles()[source]¶ Return number of cycles in the cycle covariate
This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.
Returns: half the length of cycle vector Return type: int
-
get_num_dinucs()[source]¶ Return number of dinucs in the dinuc covariate.
Returns: length of the dinuc vector Return type: int
-
get_num_qs()[source]¶ Return number of qs in the q covariate
Returns: length of q vector Return type: int
-
get_num_rgs()[source]¶ Return number of rgs in the rg covariate
Returns: length of rg vector Return type: int
-
qcov= None¶ Q covariates
-
class
kbbq.covariate.CycleCovariate[source]¶ Bases:
kbbq.covariate.CovariateA 3d covariate array in the read group, Q, and cycle dimensions.
pad_axis()is overriden here since the way cycles are stored means appending to the end of the array changes the location of the data.Methods:
__init__()- Initialize empty arrayspad_axis()- Pad given axis.
-
num_cycles()[source]¶ Return number of cycles in the cycle covariate
This is the half the length of the cycle vector; there are twice as many cycles since positive and negative cycles are treated differently.
Returns: half the length of cycle vector Return type: int
-
pad_axis(axis, n=1)[source]¶ Expand the axis dimension by n.
This is overriden for the CycleCovariate since we use one array to handle both positive and negative cycle values. Appending 0’s to the end would ruin the array since negative indices are significant. Thus, we have to wrangle the existing data a bit.
Parameters:
-
class
kbbq.covariate.DinucCovariate[source]¶ Bases:
kbbq.covariate.CovariateA 3d covariate array in the read group, Q, and dinucleotide dimensions.
-
class
kbbq.covariate.QCovariate[source]¶ Bases:
kbbq.covariate.CovariateA 2d covariate array in the read group and Q dimensions.
This class holds a
RGCovariateto efficiently increment both objects while iterating through reads.Attributes
rgcov- RG covariates
Methods
__init__()- Initialize arraysconsume_read()- increment the rg and q covariates given akbbq.read.ReadDatanum_qs()- return number of qs observed so far
-
__init__()[source]¶ Initialize an empty
RGCovariateand initialize the empty q arrays.
-
consume_read(read)[source]¶ Increment the rg and q covariates given a
kbbq.read.ReadDataParameters: read ( kbbq.read.ReadData) – read to take covariate values fromReturns: erroneous and valid read group values and erroneous and valid q values Return type: (( numpy.ndarray,numpy.ndarray),(numpy.ndarray,numpy.ndarray))
-
num_qs()[source]¶ Return size needed to hold largest q observed so far.
Returns: largest q + 1 Type: int
-
rgcov= None¶ RG covariates
-
class
kbbq.covariate.RGCovariate[source]¶ Bases:
kbbq.covariate.CovariateA 1d covariate array in the read group dimension.
Methods
__init__()- Initialize arraysconsume_read()- increment the covariates given akbbq.read.ReadDatanum_rgs()- return number of rgs observed so far
-
consume_read(read)[source]¶ Increment the rg covariates given a
kbbq.read.ReadDataParameters: read ( kbbq.read.ReadData) – read to take covariate values fromReturns: erroneous and valid read group values Return type: tuple( numpy.ndarray,numpy.ndarray)