Chromosome class¶
- class pytadbit.chromosome.Chromosome(name, species=None, assembly=None, experiment_resolutions=None, experiment_tads=None, experiment_hic_data=None, experiment_norm_data=None, experiment_names=None, max_tad_size=inf, chr_len=0, parser=None, centromere_search=False, silent=False, **kw_descr)[source]¶
A Chromosome object designed to deal with Topologically Associating Domains predictions from different experiments, in different cell types for a given chromosome of DNA, and to compare them.
- Parameters
name – name of the chromosome (might be a chromosome name for example)
species (None) – species name
assembly (None) – version number of the genomic assembly used
resolutions (None) – list of resolutions corresponding to a list of experiments passed.
experiment_hic_data (None) –
list()
of paths to files containing the Hi-C count matrices corresponding to different experimentsexperiment_tads (None) –
list()
of paths to files containing the definition of TADs corresponding to different experimentsexperiment_names (None) –
list()
of the names of each experimentmax_tad_size (infinite) – maximum TAD size allowed. TADs longer than this value will not be considered, and size of the corresponding chromosome size will be reduced accordingly
chr_len (0) – size of the DNA chromosome in bp. By default it will be inferred from the distribution of TADs
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:
chrT_001 chrT_002 chrT_003 chrT_004 chrT_001 629 164 88 105 chrT_002 164 612 175 110 chrT_003 88 175 437 100 chrT_004 105 110 100 278
the output of parser(‘example.tsv’) would be be:
[([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]
kw_descr (None) –
any other argument passed would be stored as complementary descriptive field. For example:
crm = Chromosome('19', species='mus musculus', subspecies='musculus musculus', skin_color='black') print crm # Chromosome 19: # 0 experiment loaded: # 0 alignment loaded: # species : mus musculus # assembly version: UNKNOWN # subspecies : musculus musculus # skin_color : black
note that these fields may appear in the header of generated out files
- Returns
Chromosome object
- add_experiment(name, resolution=None, tad_def=None, hic_data=None, norm_data=None, replace=False, parser=None, conditions=None, **kwargs)[source]¶
Add a Hi-C experiment to Chromosome
- Parameters
name – name of the experiment or of the Experiment object
resolution – resolution of the experiment (needed if name is not an Experiment object)
hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
tad_def (None) – a file or a dict with precomputed TADs for this experiment
replace (False) – overwrite the experiments loaded under the same name
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With a file example.tsv containing:
chrT_001 chrT_002 chrT_003 chrT_004 chrT_001 629 164 88 105 chrT_002 164 612 175 110 chrT_003 88 175 437 100 chrT_004 105 110 100 278
the output of parser(‘example.tsv’) would be:
[([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]
- align_experiments(names=None, verbose=False, randomize=False, rnd_method='interpolate', rnd_num=1000, get_score=False, **kwargs)[source]¶
Align the predicted boundaries of two different experiments. The resulting alignment will be stored in the self.experiment list.
- Parameters
names (None) – list of names of the experiments to align. If None, align all
experiment1 – name of the first experiment to align
experiment2 – name of the second experiment to align
penalty (-0.1) – penalty for inserting a gap in the alignment
max_dist (100000) – maximum distance between two boundaries allowing match (100Kb seems fair with HUMAN chromosomes)
verbose (False) – if True, print some information about the alignments
randomize (False) – check the alignment quality by comparing randomized boundaries over Chromosomes of the same size. This will return a extra value, the p-value of accepting that the observed alignment is not better than a random alignment
get_score (False) – returns alignemnt object, alignment score and percentage of identity from one side and from the other
rnd_method (interpolate) – by default uses the interpolation of TAD distribution. The alternative method is ‘shuffle’, where TADs are simply shuffled
rnd_num (1000) – number of randomizations to do
method (reciprocal) – if global, Needleman-Wunsch is used to align (see
pytadbit.boundary_aligner.globally.needleman_wunsch()
); if reciprocal, a method based on reciprocal closest boundaries is used (seepytadbit.boundary_aligner.reciprocally.reciprocal()
)
- Returns
an alignment object or, if the randomizattion was invoked, an alignment object, and a list of statistics that are, the alignment score, the probability that observed alignment performs better than randoms, the proportion of borders from the first experiment found aligned in the second experiment and the proportion of borders from the second experiment found aligned in the first experiment. Returned calues can be catched like this:
ali = crm.align_experiments()
or, with randomization test:
ali, (score, pval, prop1, prop2) = crm.align_experiments(randomize=True)
- find_tad(experiments, name=None, n_cpus=1, verbose=True, max_tad_size='max', heuristic=True, batch_mode=False, **kwargs)[source]¶
Call the
pytadbit.tadbit.tadbit()
function to calculate the position of Topologically Associated Domain boundaries- Parameters
experiment – A square matrix of interaction counts of Hi-C data or a list of such matrices for replicated experiments. The counts must be evenly sampled and not normalized. ‘experiment’ can be either a list of lists, a path to a file or a file handler
normalized (True) – if False simple normalization will be computed, as well as a simple column filtering will be applied (remove columns where value at the diagonal is null)
n_cpus (1) – The number of CPUs to allocate to TADbit. If n_cpus=’max’ the total number of CPUs will be used
max_tad_size (max) – an integer defining the maximum size of a TAD. Default (auto) defines it as the number of rows/columns
heuristic (True) – whether to use or not some heuristics
batch_mode (False) – if True, all the experiments will be concatenated into one for the search of TADs. The resulting TADs found are stored under the name ‘batch’ plus a concatenation of the experiment names passed (e.g.: if experiments=[‘exp1’, ‘exp2’], the name would be: ‘batch_exp1_exp2’).
- get_experiment(name)[source]¶
Fetch an Experiment according to its name. This can also be done directly with Chromosome.experiments[name].
- Parameters
name – name of the experiment to select
- Returns
pytadbit.Experiment
- get_tad_hic(tad, x_name, normed=True, matrix_num=0)[source]¶
Retrieve the Hi-C data matrix corresponding to a given TAD.
- Parameters
tad – a given TAD (
dict
)x_name – name of the experiment
normed (True) – if True, normalize the Hi-C data
- Returns
Hi-C data matrix for the given TAD
- iter_tads(x_name, normed=True)[source]¶
Iterate over the TADs corresponding to a given experiment.
- Parameters
x_name – name of the experiment
normed (True) – normalize Hi-C data returned
- Yields
Hi-C data corresponding to each TAD
- save_chromosome(out_f, fast=True, divide=True, force=False)[source]¶
Save a Chromosome object to a file (it uses
pickle.load()
from thepickle
). Once saved, the object can be loaded withload_chromosome()
.- Parameters
out_f – path to the file where to store the
pickle
objectfast (True) – if True, skip Hi-C data and weights
divide (True) – if True writes two pickles, one with what would result by using the fast option, and the second with the Hi-C and weights data. The second file name will be extended by ‘_hic’ (ie: with out_f=’chromosome12.pik’ we would obtain chromosome12.pik and chromosome12.pik_hic). When loaded
load_chromosome()
will automatically search for both filesforce (False) – overwrite the existing file
- set_max_tad_size(value)[source]¶
Change the maximum size allowed for TADs. It also applies to the computed experiments.
- Parameters
value – an int value (default is 5000000)
- tad_density_plot(name, axe=None, focus=None, extras=None, normalized=True, savefig=None, shape='ellipse')[source]¶
Draw an summary of the TAD found in a given experiment and their density in terms of relative Hi-C interaction count.
- Parameters
name – name of the experiment to visualize
focus (None) – can pass a tuple (bin_start, bin_stop) to display the alignment between these genomic bins
extras (None) – list of coordinates (genomic bin) where to draw a red cross
ymax (None) – limit the y axis up to a given value
) (('grey',) – successive colors for alignment
normalized (True) – normalized Hi-C count are plotted instead of raw data.
shape ('ellipse') – which kind of shape to use as schematic representation of TADs. Implemented: ‘ellipse’, ‘rectangle’, ‘triangle’
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
- visualize(names=None, tad=None, focus=None, paint_tads=False, axe=None, show=True, logarithm=True, normalized=False, relative=True, decorate=True, savefig=None, clim=None, scale=(8, 6), cmap='jet')[source]¶
Visualize the matrix of Hi-C interactions of a given experiment
- Parameters
names (None) – name of the experiment to visualize, or list of experiment names. If None, all experiments will be shown
tad (None) –
a given TAD in the form:
{'start': start, 'end' : end, 'brk' : end, 'score': score}
Alternatively a list of the TADs can be passed (all the TADs between the first and last one passed will be showed. Thus, passing more than two TADs might be superfluous)
focus (None) – a tuple with the start and end positions of the region to visualize
paint_tads (False) – draw a box around the TADs defined for this experiment
axe (None) – an axe object from matplotlib can be passed in order to customize the picture
show (True) – either to pop-up matplotlib image or not
logarithm (True) – show the logarithm values
normalized (True) – show the normalized data (weights might have been calculated previously). Note: white rows/columns may appear in the matrix displayed; these rows correspond to filtered rows (see
pytadbit.utils.hic_filtering.hic_filtering_for_modelling()
)relative (True) – color scale is relative to the whole matrix of data, not only to the region displayed
decorate (True) – draws color bar, title and axes labels
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
clim (None) – tuple with minimum and maximum value range for color scale. I.e. clim=(-4, 10)
cmap ('jet') – color map from matplotlib. Can also be a preconfigured cmap object.
Load chromosome¶
- pytadbit.chromosome.load_chromosome(in_f, fast=2)[source]¶
Load a Chromosome object from a file. A Chromosome object can be saved with the
Chromosome.save_chromosome()
function.- Parameters
in_f – path to a saved Chromosome object file
fast (2) – if fast=2 do not load the Hi-C data (in the case that they were saved in a separate file see
Chromosome.save_chromosome()
). If fast is equal to 1, the weights will be skipped from load to save memory. Finally if fast=0, both the weights and Hi-C data will be loaded
- Returns
a Chromosome object
TODO: remove first try/except type error… this is loading old experiments
ExperimentList class¶
- class pytadbit.chromosome.ExperimentList(thing, crm)[source]¶
Inherited from python built in
list()
, modified for TADbitpytadbit.Experiment
.Mainly, getitem, setitem, and append were modified in order to be able to search for experiments by index or by name, and to add experiments simply using Chromosome.experiments.append(Experiment).
The whole ExperimentList object is linked to a Chromosome instance (
pytadbit.Chromosome
).