Experiment class¶
- class pytadbit.experiment.Experiment(name, resolution, hic_data=None, norm_data=None, tad_def=None, parser=None, no_warn=False, weights=None, conditions=None, identifier=None, cell_type=None, enzyme=None, exp_type='Hi-C', **kw_descr)[source]¶
Hi-C experiment.
- Parameters
name – name of the experiment
resolution – the resolution of the experiment (size of a bin in bases)
identifier (None) – some identifier relative to the Hi-C data
cell_type (None) – cell type on which the experiment was done
enzyme (None) – restriction enzyme used in the Hi-C experiment
exp_type (Hi-C) – name of the experiment used (currently only Hi-C is supported)
hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
tad_def (None) – a file or a dict with precomputed TADs for this experiment
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:
chrT_001 chrT_002 chrT_003 chrT_004 chrT_001 629 164 88 105 chrT_002 164 612 175 110 chrT_003 88 175 437 100 chrT_004 105 110 100 278
the output of parser(‘example.tsv’) would be be:
[([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]
conditions (None) –
list()
of experimental conditions, e.g. the cell type, the enzyme… (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADsfilter_columns (True) – filter the columns with unexpectedly high content of low values
kw_descr (None) –
any other argument passed would be stored as complementary descriptive field. For example:
exp = Experiment('k562_rep2', resolution=100000, identifier='SRX015263', cell_type='K562', enzyme='HindIII', cylce='synchronized') print exp # Experiment k562_rep2: # resolution : 100Kb # TADs : None # Hi-C rows : None # normalized : None # identifier : SRX015263 # cell type : K562 # restriction enzyme: HindIII # cylce : synchronized
note that these fields may appear in the header of generated out files
TODO: doc conditions TODO: normalization
- filter_columns(silent=False, draw_hist=False, savefig=None, perc_zero=99, by_mean=True, min_count=None)[source]¶
Call filtering function, to remove artifactual columns in a given Hi-C matrix. This function will detect columns with very low interaction counts. Filtered out columns will be stored in the dictionary Experiment._zeros.
- Parameters
silent (False) – does not warn for removed columns
draw_hist (False) – shows the distribution of mean values by column the polynomial fit, and the cut applied.
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
perc_zero (99) – maximum percentage of cells with no interactions allowed.
min_count (None) – minimum number of reads mapped to a bin (recommended value could be 2500). If set this option overrides the perc_zero filtering… This option is slightly slower.
by_mean (True) – filter columns by mean column value using
pytadbit.utils.hic_filtering.filter_by_mean()
function
- get_hic_matrix(focus=None, diagonal=True, normalized=False)[source]¶
Return the Hi-C matrix.
- Parameters
focus (None) – if a tuple is passed (start, end), wil return a Hi-C matrix starting at start, and ending at end (all inclusive).
diagonal (True) – replace the values in the diagonal by one. Used for the filtering in order to smooth the distribution of mean values
- Para False normalized
returns normalized data instead of raw Hi-C
- Returns
list of lists representing the Hi-C data matrix of the current experiment
- get_hic_zscores(normalized=True, zscored=True, remove_zeros=True)[source]¶
Normalize the Hi-C raw data. The result will be stored into the private Experiment._zscore list.
- Parameters
normalized (True) – whether to normalize the result using the weights (see
normalize_hic()
)zscored (True) – calculate the z-score of the data
remove_zeros (False) – remove null interactions. Dangerous, null interaction are informative.
- load_hic_data(hic_data, parser=None, wanted_resolution=None, data_resolution=None, silent=False, **kwargs)[source]¶
Add a Hi-C experiment to the Chromosome object.
- Parameters
hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
name – name of the experiment
force (False) – overwrite the experiments loaded under the same name
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:
chrT_001 chrT_002 chrT_003 chrT_004 chrT_001 629 164 88 105 chrT_002 86 612 175 110 chrT_003 159 216 437 105 chrT_004 100 111 146 278
the output of parser(‘example.tsv’) would be:
[([629, 86, 159, 100, 164, 612, 216, 111, 88, 175, 437, 146, 105, 110, 105, 278]), 4]
resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the
pytadbit.Experiment
created, and no change is madefilter_columns (True) – filter the columns with unexpectedly high content of low values
silent (False) – does not warn for removed columns
- load_norm_data(norm_data, parser=None, resolution=None, normalization='visibility', **kwargs)[source]¶
Add a normalized Hi-C experiment to the Chromosome object.
- Parameters
norm_data (None) – whether a file or a list of lists corresponding to the normalized Hi-C data
name – name of the experiment
force (False) – overwrite the experiments loaded under the same name
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:
chrT_001 chrT_002 chrT_003 chrT_004 chrT_001 12.5 164 8.8 0.5 chrT_002 8.6 61.2 1.5 1.1 chrT_003 15.9 21.6 3.7 0.5 chrT_004 0.0 1.1 1.6 2.8
resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the
pytadbit.Experiment
created, and no change is madefilter_columns (True) – filter the columns with unexpectedly high content of low values
silent (False) – does not warn for removed columns
- load_tad_def(tad_def, weights=None)[source]¶
Add the Topologically Associated Domains definition detection to Slice
- model_region(start=1, end=None, n_models=5000, n_keep=1000, n_cpus=1, verbose=0, keep_all=False, close_bins=1, outfile=None, config={'kbending': 0.0, 'kforce': 5, 'lowfreq': - 0.7, 'maxdist': 600, 'reference': 'victor corces dataset 2013', 'scale': 0.01, 'upfreq': 0.3}, container=None, single_particle_restraints=None, use_HiC=True)[source]¶
Generates of three-dimensional models using IMP, for a given segment of chromosome.
- Parameters
start (1) – first bin to model (bin number)
end (None) – last bin to model (bin number). By default goes to the last bin.
n_models (5000) – number of modes to generate
n_keep (1000) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
keep_all (False) – whether or not to keep the discarded models (if True, models will be stored under tructuralModels.bad_models)
close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
n_cpus – number of CPUs to use
verbose (0) – the information printed can be: nothing (0), the objective function value the selected models (1), the objective function value of all the models (2), all the modeling information (3)
container (None) – restrains particle to be within a given object. Can only be a ‘cylinder’, which is, in fact a cylinder of a given height to which are added hemispherical ends. This cylinder is defined by a radius, its height (with a height of 0 the cylinder becomes a sphere) and the force applied to the restraint. E.g. for modeling E. coli genome (2 micrometers length and 0.5 micrometer of width), these values could be used: [‘cylinder’, 250, 1500, 50], and for a typical mammalian nuclei (6 micrometers diameter): [‘cylinder’, 3000, 0, 50]
config (CONFIG) –
a dictionary containing the standard parameters used to generate the models. The dictionary should contain the keys kforce, maxdist, upfreq and lowfreq. Examples can be seen by doing:
from pytadbit.imp.CONFIG import CONFIG
where CONFIG is a dictionarry of dictionnaries to be passed to this function:
CONFIG = { # use these paramaters with the Hi-C data from: 'reference' : 'victor corces dataset 2013', # Force applied to the restraints inferred to neighbor particles 'kforce' : 5, # Maximum experimental contact distance 'maxdist' : 600, # OPTIMIZATION: 500-1200 # Minimum and maximum thresholds used to decide which experimental values have to be # included in the computation of restraints. Z-score values bigger than upfreq # and less that lowfreq will be include, whereas all the others will be rejected 'upfreq' : 0.3, # OPTIMIZATION: min/max Z-score 'lowfreq' : -0.7, # OPTIMIZATION: min/max Z-score # How much space (radius in nm) ocupies a nucleotide 'scale' : 0.005 }
use_HiC (True) – apply hic data restraints to the model
- Param
None single_particle_restraints: a list containing restraints to single particles. Each restraint in the list is itself a list with the following information:
[bin, [position_x, position_y, position_z], type, kforce, radius] bin: bin number of the particle to restraint [position_x, position_y, position_z](nm): center of the sphere of the restraint.
The center of the coordinate system is the center of the base of the cylinder defined as the container.
type: ‘Harmonic’, ‘HarmonicLowerBound’, ‘HarmonicUpperBound’ kforce: weigth of the restraint radius (nm): radius of the sphere
- Returns
a
pytadbit.imp.structuralmodels.StructuralModels
object.
- normalize_hic(factor=1, iterations=0, max_dev=0.1, silent=False)[source]¶
Normalize the Hi-C data. This normalization step does the same of the
pytadbit.tadbit.tadbit()
function (default parameters),It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.
The weight of a given cell in column i and row j corresponds to the square root of the product of the sum of column i by the sum of row j.
normalization is done according to this formula:
\[weight_{(I,J)} = \frac{\sum^N_{j=0}\sum^N_{i=0}(matrix(i,j))} {\sum^N_{i=0}(matrix(i,J)) \times \sum^N_{j=0}(matrix(I,j))}\]with N being the number or rows/columns of the Hi-C matrix in both cases.
- Parameters
factor (1) – final mean number of normalized interactions wanted per cell
silent (False) – does not warn when overwriting weights
rowsums (None) – input a list of rowsums calculated elsewhere
- optimal_imp_parameters(start=1, end=None, n_models=500, n_keep=100, n_cpus=1, upfreq_range=(0, 1, 0.1), close_bins=1, kbending_range=0.0, lowfreq_range=(- 1, 0, 0.1), scale_range=[0.01], maxdist_range=(400, 1400, 100), dcutoff_range=[2], outfile=None, verbose=True, corr='spearman', off_diag=1, savedata=None, container=None)[source]¶
Find the optimal set of parameters to be used for the 3D modeling in IMP.
- Parameters
start (1) – first bin to model (bin number)
end (None) – last bin to model (bin number). By default goes to the last bin.
n_models (500) – number of modes to generate
n_keep (100) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
n_cpus – number of CPUs to use
verbose (True) – if set to True, information about the distance, force and Z-score between particles will be printed
lowfreq_range ((-1,0,0.1)) – range of lowfreq values to be optimized. The last value of the input tuple is the incremental step for the lowfreq values
upfreq_range ((0,1,0.1,0.1)) – range of upfreq values to be optimized. The last value of the input tuple is the incremental step for the upfreq values
maxdist_range ((400,1400,100)) – upper and lower bounds used to search for the optimal maximum experimental distance. The last value of the input tuple is the incremental step for maxdist values
scale_range ([0.01]) – upper and lower bounds used to search for the optimal scale parameter (nm per nucleotide). The last value of the input tuple is the incremental step for scale parameter values
dcutoff_range ([2]) – upper and lower bounds used to search for the optimal distance cutoff parameter (distance, in number of beads, from which to consider 2 beads as being close). The last value of the input tuple is the incremental step for scale parameter values
container (None) – restrains particle to be within a given object. Can only be a ‘cylinder’, which is, in fact a cylinder of a given height to which are added hemispherical ends. This cylinder is defined by a radius, its height (with a height of 0 the cylinder becomes a sphere) and the force applied to the restraint. E.g. for modeling E. coli genome (2 micrometers length and 0.5 micrometer of width), these values could be used: [‘cylinder’, 250, 1500, 50], and for a typical mammalian nuclei (6 micrometers diameter): [‘cylinder’, 3000, 0, 50]
verbose – print the results to the standard output
Note
- Each of the _range parameters accept tuples in the form
(start, end, step), or a list with the list of values to test.
- E.g.:
scale_range=[0.001, 0.005, 0.006] will test these three values.
scale_range=(0.001, 0.005, 0.001) will test the values 0.001, 0.002, 0.003, 0.004 and 0.005
- Returns
an
pytadbit.imp.impoptimizer.IMPoptimizer
object
- print_hic_matrix(print_it=True, normalized=False, zeros=False)[source]¶
Return the Hi-C matrix as string
- Parameters
print_it (True) – Otherwise, returns the string
normalized (False) – returns normalized data, instead of raw Hi-C
zeros (False) – take into account filtered columns
- Returns
list of lists representing the Hi-C data matrix of the current experiment
- set_resolution(resolution, keep_original=True)[source]¶
Set a new value for the resolution. Copy the original data into Experiment._ori_hic and replace the Experiment.hic_data with the data corresponding to new data (
pytadbit.Chromosome.compare_condition()
).- Parameters
resolution – an integer representing the resolution. This number must be a multiple of the original resolution, and higher than it
keep_original (True) – either to keep or not the original data
- view(tad=None, focus=None, paint_tads=False, axe=None, show=True, logarithm=True, normalized=False, relative=True, decorate=True, savefig=None, where='both', clim=None, cmap='jet')[source]¶
Visualize the matrix of Hi-C interactions
- Parameters
tad (None) –
a given TAD in the form:
{'start': start, 'end' : end, 'brk' : end, 'score': score}
Alternatively a list of the TADs can be passed (all the TADs between the first and last one passed will be showed. Thus, passing more than two TADs might be superfluous)
focus (None) – a tuple with the start and end positions of the region to visualize
paint_tads (False) – draw a box around the TADs defined for this experiment
axe (None) – an axe object from matplotlib can be passed in order to customize the picture
show (True) – either to pop-up matplotlib image or not
logarithm (True) – show the logarithm values
normalized (True) – show the normalized data (weights might have been calculated previously). Note: white rows/columns may appear in the matrix displayed; these rows correspond to filtered rows (see
pytadbit.utils.hic_filtering.hic_filtering_for_modelling()
)relative (True) – color scale is relative to the whole matrix of data, not only to the region displayed
decorate (True) – draws color bar, title and axes labels
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
clim (None) – tuple with minimum and maximum value range for color scale. I.e. clim=(-4, 10)
cmap ('jet') – color map from matplotlib. Can also be a preconfigured cmap object.
- write_interaction_pairs(fname, normalized=True, zscored=True, diagonal=False, cutoff=None, header=False, true_position=False, uniq=True, remove_zeros=False, focus=None, format='tsv')[source]¶
Creates a tab separated file with all the pairwise interactions.
- Parameters
fname – file name where to write the pairwise interactions
zscored (True) – computes the z-score of the log10(data)
normalized (True) – use the weights to normalize the data
cutoff (None) – if defined, only the zscores above the cutoff will be writen to the file
uniq (False) – only writes one representent per interacting pair
true_position (False) – if, true writes genomic coordinates, otherwise, genomic bin.
focus (None) – writes interactions between the start and stop bin passed to this parameter.
format ('tsv') – in which to write the file, can be tab separated (tsv) or JSON (json)
- write_json(filename, focus=None, normalized=False)[source]¶
Save hic matrix in the json format, read by TADkit.
- Parameters
filename – location where the file will be written
focus (None) – if a tuple is passed (start, end), json will contain a Hi-C matrix starting at start, and ending at end (all inclusive).
- Para False normalized
use normalized data instead of raw Hi-C
- write_tad_borders(density=False, savedata=None, normalized=False)[source]¶
Print a table summarizing the TADs found by tadbit. This function outputs something similar to the R function.
- Parameters
density (False) – if True, adds a column with the relative interaction frequency measured within each TAD (value of 1 means an interaction frequency equal to the expectation in the experiment)
savedata (None) – path to a file where to save the density data generated (1 column per step + 1 for particle number). If None, print a table.
normalized (False) – uses normalized data to calculate the density
Load experiment¶
- pytadbit.experiment.load_experiment_from_reads(name, fnam, genome_seq, resolution, conditions=None, identifier=None, cell_type=None, enzyme=None, exp_type='Hi-C', **kw_descr)[source]¶
Loads an experiment object from TADbit-generated read files, that are lists of pairs of reads mapped to a reference genome.
- Parameters
fnam – tsv file with reads1 and reads2
name – name of the experiment
resolution – the resolution of the experiment (size of a bin in bases)
identifier (None) – some identifier relative to the Hi-C data
cell_type (None) – cell type on which the experiment was done
enzyme (None) – restriction enzyme used in the Hi-C experiment
exp_type (Hi-C) – name of the experiment used (currently only Hi-C is supported)
conditions (None) –
list()
of experimental conditions, e.g. the cell type, the enzyme… (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADskw_descr (None) –
any other argument passed would be stored as complementary descriptive field. For example:
exp = Experiment('k562_rep2', resolution=100000, identifier='SRX015263', cell_type='K562', enzyme='HindIII', cylce='synchronized') print exp # Experiment k562_rep2: # resolution : 100Kb # TADs : None # Hi-C rows : None # normalized : None # identifier : SRX015263 # cell type : K562 # restriction enzyme: HindIII # cylce : synchronized
note that these fields may appear in the header of generated out files