Experiment class¶

class pytadbit.experiment.Experiment(name, resolution, hic_data=None, norm_data=None, tad_def=None, parser=None, no_warn=False, weights=None, conditions=None, identifier=None, cell_type=None, enzyme=None, exp_type='Hi-C', **kw_descr)[source]¶

Hi-C experiment.

Parameters

name – name of the experiment
resolution – the resolution of the experiment (size of a bin in bases)
identifier (None) – some identifier relative to the Hi-C data
cell_type (None) – cell type on which the experiment was done
enzyme (None) – restriction enzyme used in the Hi-C experiment
exp_type (Hi-C) – name of the experiment used (currently only Hi-C is supported)
hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
tad_def (None) – a file or a dict with precomputed TADs for this experiment

parser (None) –

a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

chrT_001    chrT_002    chrT_003    chrT_004
chrT_001    629    164    88    105
chrT_002    164    612    175    110
chrT_003    88    175    437    100
chrT_004    105    110    100    278

the output of parser(‘example.tsv’) would be be: [([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]

conditions (None) – list() of experimental conditions, e.g. the cell type, the enzyme… (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADs
filter_columns (True) – filter the columns with unexpectedly high content of low values

kw_descr (None) –

any other argument passed would be stored as complementary descriptive field. For example:

exp  = Experiment('k562_rep2', resolution=100000,
                  identifier='SRX015263', cell_type='K562',
                  enzyme='HindIII', cylce='synchronized')
print exp

# Experiment k562_rep2:
#    resolution        : 100Kb
#    TADs              : None
#    Hi-C rows         : None
#    normalized        : None
#    identifier        : SRX015263
#    cell type         : K562
#    restriction enzyme: HindIII
#    cylce             : synchronized

note that these fields may appear in the header of generated out files

TODO: doc conditions TODO: normalization

filter_columns(silent=False, draw_hist=False, savefig=None, perc_zero=99, by_mean=True, min_count=None)[source]¶

Call filtering function, to remove artifactual columns in a given Hi-C matrix. This function will detect columns with very low interaction counts. Filtered out columns will be stored in the dictionary Experiment._zeros.

Parameters

silent (False) – does not warn for removed columns
draw_hist (False) – shows the distribution of mean values by column the polynomial fit, and the cut applied.
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
perc_zero (99) – maximum percentage of cells with no interactions allowed.
min_count (None) – minimum number of reads mapped to a bin (recommended value could be 2500). If set this option overrides the perc_zero filtering… This option is slightly slower.
by_mean (True) – filter columns by mean column value using pytadbit.utils.hic_filtering.filter_by_mean() function

get_hic_matrix(focus=None, diagonal=True, normalized=False)[source]¶

Return the Hi-C matrix.

Parameters

focus (None) – if a tuple is passed (start, end), wil return a Hi-C matrix starting at start, and ending at end (all inclusive).
diagonal (True) – replace the values in the diagonal by one. Used for the filtering in order to smooth the distribution of mean values

Para False normalized

returns normalized data instead of raw Hi-C

Returns

list of lists representing the Hi-C data matrix of the current experiment

get_hic_zscores(normalized=True, zscored=True, remove_zeros=True)[source]¶

Normalize the Hi-C raw data. The result will be stored into the private Experiment._zscore list.

Parameters

normalized (True) – whether to normalize the result using the weights (see normalize_hic())
zscored (True) – calculate the z-score of the data
remove_zeros (False) – remove null interactions. Dangerous, null interaction are informative.

load_hic_data(hic_data, parser=None, wanted_resolution=None, data_resolution=None, silent=False, **kwargs)[source]¶

Add a Hi-C experiment to the Chromosome object.

Parameters

hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
name – name of the experiment
force (False) – overwrite the experiments loaded under the same name

parser (None) –

a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

chrT_001    chrT_002    chrT_003    chrT_004
chrT_001    629    164    88    105
chrT_002    86    612    175    110
chrT_003    159    216    437    105
chrT_004    100    111    146    278

the output of parser(‘example.tsv’) would be: [([629, 86, 159, 100, 164, 612, 216, 111, 88, 175, 437, 146, 105, 110, 105, 278]), 4]

resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the pytadbit.Experiment created, and no change is made
filter_columns (True) – filter the columns with unexpectedly high content of low values
silent (False) – does not warn for removed columns

load_norm_data(norm_data, parser=None, resolution=None, normalization='visibility', **kwargs)[source]¶

Add a normalized Hi-C experiment to the Chromosome object.

Parameters

norm_data (None) – whether a file or a list of lists corresponding to the normalized Hi-C data
name – name of the experiment
force (False) – overwrite the experiments loaded under the same name

parser (None) –

a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

chrT_001    chrT_002    chrT_003    chrT_004
chrT_001    12.5    164    8.8    0.5
chrT_002    8.6    61.2    1.5    1.1
chrT_003    15.9    21.6    3.7    0.5
chrT_004    0.0    1.1    1.6    2.8

resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the pytadbit.Experiment created, and no change is made
filter_columns (True) – filter the columns with unexpectedly high content of low values
silent (False) – does not warn for removed columns

load_tad_def(tad_def, weights=None)[source]¶

Add the Topologically Associated Domains definition detection to Slice

Parameters

tad_def (None) – a file or a dict with pre-computed TADs for this experiment
name (None) – name of the experiment, if None f_name will be used
weights (None) – Store information about the weights, corresponding to the normalization of the Hi-C data (see TADbit function documentation)

model_region(start=1, end=None, n_models=5000, n_keep=1000, n_cpus=1, verbose=0, keep_all=False, close_bins=1, outfile=None, config={'kbending': 0.0, 'kforce': 5, 'lowfreq': - 0.7, 'maxdist': 600, 'reference': 'victor corces dataset 2013', 'scale': 0.01, 'upfreq': 0.3}, container=None, single_particle_restraints=None, use_HiC=True)[source]¶

Generates of three-dimensional models using IMP, for a given segment of chromosome.

Parameters

start (1) – first bin to model (bin number)
end (None) – last bin to model (bin number). By default goes to the last bin.
n_models (5000) – number of modes to generate
n_keep (1000) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
keep_all (False) – whether or not to keep the discarded models (if True, models will be stored under tructuralModels.bad_models)
close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
n_cpus – number of CPUs to use
verbose (0) – the information printed can be: nothing (0), the objective function value the selected models (1), the objective function value of all the models (2), all the modeling information (3)
container (None) – restrains particle to be within a given object. Can only be a ‘cylinder’, which is, in fact a cylinder of a given height to which are added hemispherical ends. This cylinder is defined by a radius, its height (with a height of 0 the cylinder becomes a sphere) and the force applied to the restraint. E.g. for modeling E. coli genome (2 micrometers length and 0.5 micrometer of width), these values could be used: [‘cylinder’, 250, 1500, 50], and for a typical mammalian nuclei (6 micrometers diameter): [‘cylinder’, 3000, 0, 50]

config (CONFIG) –

a dictionary containing the standard parameters used to generate the models. The dictionary should contain the keys kforce, maxdist, upfreq and lowfreq. Examples can be seen by doing:

from pytadbit.imp.CONFIG import CONFIG

where CONFIG is a dictionarry of dictionnaries to be passed to this function:

CONFIG = {
 # use these paramaters with the Hi-C data from:
 'reference' : 'victor corces dataset 2013',

 # Force applied to the restraints inferred to neighbor particles
 'kforce'    : 5,

 # Maximum experimental contact distance
 'maxdist'   : 600, # OPTIMIZATION: 500-1200

 # Minimum and maximum thresholds used to decide which experimental values have to be
 # included in the computation of restraints. Z-score values bigger than upfreq
 # and less that lowfreq will be include, whereas all the others will be rejected
 'upfreq'    : 0.3, # OPTIMIZATION: min/max Z-score

 'lowfreq'   : -0.7, # OPTIMIZATION: min/max Z-score

 # How much space (radius in nm) ocupies a nucleotide
 'scale'     : 0.005
 }

use_HiC (True) – apply hic data restraints to the model

Param

None single_particle_restraints: a list containing restraints to single particles. Each restraint in the list is itself a list with the following information:

[bin, [position_x, position_y, position_z], type, kforce, radius] bin: bin number of the particle to restraint [position_x, position_y, position_z](nm): center of the sphere of the restraint.

The center of the coordinate system is the center of the base of the cylinder defined as the container.

type: ‘Harmonic’, ‘HarmonicLowerBound’, ‘HarmonicUpperBound’ kforce: weigth of the restraint radius (nm): radius of the sphere

Returns

a pytadbit.imp.structuralmodels.StructuralModels object.

normalize_hic(factor=1, iterations=0, max_dev=0.1, silent=False)[source]¶

Normalize the Hi-C data. This normalization step does the same of the pytadbit.tadbit.tadbit() function (default parameters),

It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.

The weight of a given cell in column i and row j corresponds to the square root of the product of the sum of column i by the sum of row j.

normalization is done according to this formula:

\[weight_{(I,J)} = \frac{\sum^N_{j=0}\sum^N_{i=0}(matrix(i,j))} {\sum^N_{i=0}(matrix(i,J)) \times \sum^N_{j=0}(matrix(I,j))}\]

with N being the number or rows/columns of the Hi-C matrix in both cases.

Parameters

factor (1) – final mean number of normalized interactions wanted per cell
silent (False) – does not warn when overwriting weights
rowsums (None) – input a list of rowsums calculated elsewhere

optimal_imp_parameters(start=1, end=None, n_models=500, n_keep=100, n_cpus=1, upfreq_range=(0, 1, 0.1), close_bins=1, kbending_range=0.0, lowfreq_range=(- 1, 0, 0.1), scale_range=[0.01], maxdist_range=(400, 1400, 100), dcutoff_range=[2], outfile=None, verbose=True, corr='spearman', off_diag=1, savedata=None, container=None)[source]¶

Find the optimal set of parameters to be used for the 3D modeling in IMP.

Parameters

start (1) – first bin to model (bin number)
end (None) – last bin to model (bin number). By default goes to the last bin.
n_models (500) – number of modes to generate
n_keep (100) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
n_cpus – number of CPUs to use
verbose (True) – if set to True, information about the distance, force and Z-score between particles will be printed
lowfreq_range ((-1,0,0.1)) – range of lowfreq values to be optimized. The last value of the input tuple is the incremental step for the lowfreq values
upfreq_range ((0,1,0.1,0.1)) – range of upfreq values to be optimized. The last value of the input tuple is the incremental step for the upfreq values
maxdist_range ((400,1400,100)) – upper and lower bounds used to search for the optimal maximum experimental distance. The last value of the input tuple is the incremental step for maxdist values
scale_range ([0.01]) – upper and lower bounds used to search for the optimal scale parameter (nm per nucleotide). The last value of the input tuple is the incremental step for scale parameter values
dcutoff_range ([2]) – upper and lower bounds used to search for the optimal distance cutoff parameter (distance, in number of beads, from which to consider 2 beads as being close). The last value of the input tuple is the incremental step for scale parameter values
container (None) – restrains particle to be within a given object. Can only be a ‘cylinder’, which is, in fact a cylinder of a given height to which are added hemispherical ends. This cylinder is defined by a radius, its height (with a height of 0 the cylinder becomes a sphere) and the force applied to the restraint. E.g. for modeling E. coli genome (2 micrometers length and 0.5 micrometer of width), these values could be used: [‘cylinder’, 250, 1500, 50], and for a typical mammalian nuclei (6 micrometers diameter): [‘cylinder’, 3000, 0, 50]
verbose – print the results to the standard output

Note

Each of the _range parameters accept tuples in the form

(start, end, step), or a list with the list of values to test.

E.g.:

scale_range=[0.001, 0.005, 0.006] will test these three values.
scale_range=(0.001, 0.005, 0.001) will test the values 0.001, 0.002, 0.003, 0.004 and 0.005

Returns: an pytadbit.imp.impoptimizer.IMPoptimizer object

print_hic_matrix(print_it=True, normalized=False, zeros=False)[source]¶

Return the Hi-C matrix as string

Parameters

print_it (True) – Otherwise, returns the string
normalized (False) – returns normalized data, instead of raw Hi-C
zeros (False) – take into account filtered columns

Returns

list of lists representing the Hi-C data matrix of the current experiment

set_resolution(resolution, keep_original=True)[source]¶

Set a new value for the resolution. Copy the original data into Experiment._ori_hic and replace the Experiment.hic_data with the data corresponding to new data (pytadbit.Chromosome.compare_condition()).

Parameters

resolution – an integer representing the resolution. This number must be a multiple of the original resolution, and higher than it
keep_original (True) – either to keep or not the original data

view(tad=None, focus=None, paint_tads=False, axe=None, show=True, logarithm=True, normalized=False, relative=True, decorate=True, savefig=None, where='both', clim=None, cmap='jet')[source]¶

Visualize the matrix of Hi-C interactions

Parameters

tad (None) –
a given TAD in the form:
```
{'start': start,
 'end'  : end,
 'brk'  : end,
 'score': score}
```
Alternatively a list of the TADs can be passed (all the TADs between the first and last one passed will be showed. Thus, passing more than two TADs might be superfluous)
focus (None) – a tuple with the start and end positions of the region to visualize
paint_tads (False) – draw a box around the TADs defined for this experiment
axe (None) – an axe object from matplotlib can be passed in order to customize the picture
show (True) – either to pop-up matplotlib image or not
logarithm (True) – show the logarithm values
normalized (True) – show the normalized data (weights might have been calculated previously). Note: white rows/columns may appear in the matrix displayed; these rows correspond to filtered rows (see pytadbit.utils.hic_filtering.hic_filtering_for_modelling() )
relative (True) – color scale is relative to the whole matrix of data, not only to the region displayed
decorate (True) – draws color bar, title and axes labels
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
clim (None) – tuple with minimum and maximum value range for color scale. I.e. clim=(-4, 10)
cmap ('jet') – color map from matplotlib. Can also be a preconfigured cmap object.

write_interaction_pairs(fname, normalized=True, zscored=True, diagonal=False, cutoff=None, header=False, true_position=False, uniq=True, remove_zeros=False, focus=None, format='tsv')[source]¶

Creates a tab separated file with all the pairwise interactions.

Parameters

fname – file name where to write the pairwise interactions
zscored (True) – computes the z-score of the log10(data)
normalized (True) – use the weights to normalize the data
cutoff (None) – if defined, only the zscores above the cutoff will be writen to the file
uniq (False) – only writes one representent per interacting pair
true_position (False) – if, true writes genomic coordinates, otherwise, genomic bin.
focus (None) – writes interactions between the start and stop bin passed to this parameter.
format ('tsv') – in which to write the file, can be tab separated (tsv) or JSON (json)

write_json(filename, focus=None, normalized=False)[source]¶

Save hic matrix in the json format, read by TADkit.

Parameters

filename – location where the file will be written
focus (None) – if a tuple is passed (start, end), json will contain a Hi-C matrix starting at start, and ending at end (all inclusive).

Para False normalized

use normalized data instead of raw Hi-C

write_tad_borders(density=False, savedata=None, normalized=False)[source]¶

Print a table summarizing the TADs found by tadbit. This function outputs something similar to the R function.

Parameters

density (False) – if True, adds a column with the relative interaction frequency measured within each TAD (value of 1 means an interaction frequency equal to the expectation in the experiment)
savedata (None) – path to a file where to save the density data generated (1 column per step + 1 for particle number). If None, print a table.
normalized (False) – uses normalized data to calculate the density

Load experiment¶

pytadbit.experiment.load_experiment_from_reads(name, fnam, genome_seq, resolution, conditions=None, identifier=None, cell_type=None, enzyme=None, exp_type='Hi-C', **kw_descr)[source]¶

Loads an experiment object from TADbit-generated read files, that are lists of pairs of reads mapped to a reference genome.

Parameters

fnam – tsv file with reads1 and reads2
name – name of the experiment
resolution – the resolution of the experiment (size of a bin in bases)
identifier (None) – some identifier relative to the Hi-C data
cell_type (None) – cell type on which the experiment was done
enzyme (None) – restriction enzyme used in the Hi-C experiment
exp_type (Hi-C) – name of the experiment used (currently only Hi-C is supported)
conditions (None) – list() of experimental conditions, e.g. the cell type, the enzyme… (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADs

kw_descr (None) –

any other argument passed would be stored as complementary descriptive field. For example:

exp  = Experiment('k562_rep2', resolution=100000,
                  identifier='SRX015263', cell_type='K562',
                  enzyme='HindIII', cylce='synchronized')
print exp

# Experiment k562_rep2:
#    resolution        : 100Kb
#    TADs              : None
#    Hi-C rows         : None
#    normalized        : None
#    identifier        : SRX015263
#    cell type         : K562
#    restriction enzyme: HindIII
#    cylce             : synchronized

note that these fields may appear in the header of generated out files

Experiment class¶

Load experiment¶

Table of Contents

Previous topic

Next topic