Experiment class

class pytadbit.experiment.Experiment(name, resolution, hic_data=None, norm_data=None, tad_def=None, parser=None, no_warn=False, weights=None, conditions=None, identifier=None, cell_type=None, enzyme=None, exp_type='Hi-C', **kw_descr)[source]

Hi-C experiment.

Parameters
  • name – name of the experiment

  • resolution – the resolution of the experiment (size of a bin in bases)

  • identifier (None) – some identifier relative to the Hi-C data

  • cell_type (None) – cell type on which the experiment was done

  • enzyme (None) – restriction enzyme used in the Hi-C experiment

  • exp_type (Hi-C) – name of the experiment used (currently only Hi-C is supported)

  • hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data

  • tad_def (None) – a file or a dict with precomputed TADs for this experiment

  • parser (None) –

    a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

    chrT_001    chrT_002    chrT_003    chrT_004
    chrT_001    629    164    88    105
    chrT_002    164    612    175    110
    chrT_003    88    175    437    100
    chrT_004    105    110    100    278
    

    the output of parser(‘example.tsv’) would be be: [([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]

  • conditions (None) – list() of experimental conditions, e.g. the cell type, the enzyme… (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADs

  • filter_columns (True) – filter the columns with unexpectedly high content of low values

  • kw_descr (None) –

    any other argument passed would be stored as complementary descriptive field. For example:

    exp  = Experiment('k562_rep2', resolution=100000,
                      identifier='SRX015263', cell_type='K562',
                      enzyme='HindIII', cylce='synchronized')
    print exp
    
    # Experiment k562_rep2:
    #    resolution        : 100Kb
    #    TADs              : None
    #    Hi-C rows         : None
    #    normalized        : None
    #    identifier        : SRX015263
    #    cell type         : K562
    #    restriction enzyme: HindIII
    #    cylce             : synchronized
    

    note that these fields may appear in the header of generated out files

TODO: doc conditions TODO: normalization

filter_columns(silent=False, draw_hist=False, savefig=None, perc_zero=99, by_mean=True, min_count=None)[source]

Call filtering function, to remove artifactual columns in a given Hi-C matrix. This function will detect columns with very low interaction counts. Filtered out columns will be stored in the dictionary Experiment._zeros.

Parameters
  • silent (False) – does not warn for removed columns

  • draw_hist (False) – shows the distribution of mean values by column the polynomial fit, and the cut applied.

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

  • perc_zero (99) – maximum percentage of cells with no interactions allowed.

  • min_count (None) – minimum number of reads mapped to a bin (recommended value could be 2500). If set this option overrides the perc_zero filtering… This option is slightly slower.

  • by_mean (True) – filter columns by mean column value using pytadbit.utils.hic_filtering.filter_by_mean() function

get_hic_matrix(focus=None, diagonal=True, normalized=False)[source]

Return the Hi-C matrix.

Parameters
  • focus (None) – if a tuple is passed (start, end), wil return a Hi-C matrix starting at start, and ending at end (all inclusive).

  • diagonal (True) – replace the values in the diagonal by one. Used for the filtering in order to smooth the distribution of mean values

Para False normalized

returns normalized data instead of raw Hi-C

Returns

list of lists representing the Hi-C data matrix of the current experiment

get_hic_zscores(normalized=True, zscored=True, remove_zeros=True)[source]

Normalize the Hi-C raw data. The result will be stored into the private Experiment._zscore list.

Parameters
  • normalized (True) – whether to normalize the result using the weights (see normalize_hic())

  • zscored (True) – calculate the z-score of the data

  • remove_zeros (False) – remove null interactions. Dangerous, null interaction are informative.

load_hic_data(hic_data, parser=None, wanted_resolution=None, data_resolution=None, silent=False, **kwargs)[source]

Add a Hi-C experiment to the Chromosome object.

Parameters
  • hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data

  • name – name of the experiment

  • force (False) – overwrite the experiments loaded under the same name

  • parser (None) –

    a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

    chrT_001    chrT_002    chrT_003    chrT_004
    chrT_001    629    164    88    105
    chrT_002    86    612    175    110
    chrT_003    159    216    437    105
    chrT_004    100    111    146    278
    

    the output of parser(‘example.tsv’) would be: [([629, 86, 159, 100, 164, 612, 216, 111, 88, 175, 437, 146, 105, 110, 105, 278]), 4]

  • resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the pytadbit.Experiment created, and no change is made

  • filter_columns (True) – filter the columns with unexpectedly high content of low values

  • silent (False) – does not warn for removed columns

load_norm_data(norm_data, parser=None, resolution=None, normalization='visibility', **kwargs)[source]

Add a normalized Hi-C experiment to the Chromosome object.

Parameters
  • norm_data (None) – whether a file or a list of lists corresponding to the normalized Hi-C data

  • name – name of the experiment

  • force (False) – overwrite the experiments loaded under the same name

  • parser (None) –

    a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

    chrT_001    chrT_002    chrT_003    chrT_004
    chrT_001    12.5    164    8.8    0.5
    chrT_002    8.6    61.2    1.5    1.1
    chrT_003    15.9    21.6    3.7    0.5
    chrT_004    0.0    1.1    1.6    2.8
    

  • resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the pytadbit.Experiment created, and no change is made

  • filter_columns (True) – filter the columns with unexpectedly high content of low values

  • silent (False) – does not warn for removed columns

load_tad_def(tad_def, weights=None)[source]

Add the Topologically Associated Domains definition detection to Slice

Parameters
  • tad_def (None) – a file or a dict with pre-computed TADs for this experiment

  • name (None) – name of the experiment, if None f_name will be used

  • weights (None) – Store information about the weights, corresponding to the normalization of the Hi-C data (see TADbit function documentation)

model_region(start=1, end=None, n_models=5000, n_keep=1000, n_cpus=1, verbose=0, keep_all=False, close_bins=1, outfile=None, config={'kbending': 0.0, 'kforce': 5, 'lowfreq': - 0.7, 'maxdist': 600, 'reference': 'victor corces dataset 2013', 'scale': 0.01, 'upfreq': 0.3}, container=None, single_particle_restraints=None, use_HiC=True)[source]

Generates of three-dimensional models using IMP, for a given segment of chromosome.

Parameters
  • start (1) – first bin to model (bin number)

  • end (None) – last bin to model (bin number). By default goes to the last bin.

  • n_models (5000) – number of modes to generate

  • n_keep (1000) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)

  • keep_all (False) – whether or not to keep the discarded models (if True, models will be stored under tructuralModels.bad_models)

  • close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)

  • n_cpus – number of CPUs to use

  • verbose (0) – the information printed can be: nothing (0), the objective function value the selected models (1), the objective function value of all the models (2), all the modeling information (3)

  • container (None) – restrains particle to be within a given object. Can only be a ‘cylinder’, which is, in fact a cylinder of a given height to which are added hemispherical ends. This cylinder is defined by a radius, its height (with a height of 0 the cylinder becomes a sphere) and the force applied to the restraint. E.g. for modeling E. coli genome (2 micrometers length and 0.5 micrometer of width), these values could be used: [‘cylinder’, 250, 1500, 50], and for a typical mammalian nuclei (6 micrometers diameter): [‘cylinder’, 3000, 0, 50]

  • config (CONFIG) –

    a dictionary containing the standard parameters used to generate the models. The dictionary should contain the keys kforce, maxdist, upfreq and lowfreq. Examples can be seen by doing:

    from pytadbit.imp.CONFIG import CONFIG
    

    where CONFIG is a dictionarry of dictionnaries to be passed to this function:

    CONFIG = {
     # use these paramaters with the Hi-C data from:
     'reference' : 'victor corces dataset 2013',
    
     # Force applied to the restraints inferred to neighbor particles
     'kforce'    : 5,
    
     # Maximum experimental contact distance
     'maxdist'   : 600, # OPTIMIZATION: 500-1200
    
     # Minimum and maximum thresholds used to decide which experimental values have to be
     # included in the computation of restraints. Z-score values bigger than upfreq
     # and less that lowfreq will be include, whereas all the others will be rejected
     'upfreq'    : 0.3, # OPTIMIZATION: min/max Z-score
    
     'lowfreq'   : -0.7, # OPTIMIZATION: min/max Z-score
    
     # How much space (radius in nm) ocupies a nucleotide
     'scale'     : 0.005
     }
    

  • use_HiC (True) – apply hic data restraints to the model

Param

None single_particle_restraints: a list containing restraints to single particles. Each restraint in the list is itself a list with the following information:

[bin, [position_x, position_y, position_z], type, kforce, radius] bin: bin number of the particle to restraint [position_x, position_y, position_z](nm): center of the sphere of the restraint.

The center of the coordinate system is the center of the base of the cylinder defined as the container.

type: ‘Harmonic’, ‘HarmonicLowerBound’, ‘HarmonicUpperBound’ kforce: weigth of the restraint radius (nm): radius of the sphere

Returns

a pytadbit.imp.structuralmodels.StructuralModels object.

normalize_hic(factor=1, iterations=0, max_dev=0.1, silent=False)[source]

Normalize the Hi-C data. This normalization step does the same of the pytadbit.tadbit.tadbit() function (default parameters),

It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.

The weight of a given cell in column i and row j corresponds to the square root of the product of the sum of column i by the sum of row j.

normalization is done according to this formula:

\[weight_{(I,J)} = \frac{\sum^N_{j=0}\sum^N_{i=0}(matrix(i,j))} {\sum^N_{i=0}(matrix(i,J)) \times \sum^N_{j=0}(matrix(I,j))}\]

with N being the number or rows/columns of the Hi-C matrix in both cases.

Parameters
  • factor (1) – final mean number of normalized interactions wanted per cell

  • silent (False) – does not warn when overwriting weights

  • rowsums (None) – input a list of rowsums calculated elsewhere

optimal_imp_parameters(start=1, end=None, n_models=500, n_keep=100, n_cpus=1, upfreq_range=(0, 1, 0.1), close_bins=1, kbending_range=0.0, lowfreq_range=(- 1, 0, 0.1), scale_range=[0.01], maxdist_range=(400, 1400, 100), dcutoff_range=[2], outfile=None, verbose=True, corr='spearman', off_diag=1, savedata=None, container=None)[source]

Find the optimal set of parameters to be used for the 3D modeling in IMP.

Parameters
  • start (1) – first bin to model (bin number)

  • end (None) – last bin to model (bin number). By default goes to the last bin.

  • n_models (500) – number of modes to generate

  • n_keep (100) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)

  • close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)

  • n_cpus – number of CPUs to use

  • verbose (True) – if set to True, information about the distance, force and Z-score between particles will be printed

  • lowfreq_range ((-1,0,0.1)) – range of lowfreq values to be optimized. The last value of the input tuple is the incremental step for the lowfreq values

  • upfreq_range ((0,1,0.1,0.1)) – range of upfreq values to be optimized. The last value of the input tuple is the incremental step for the upfreq values

  • maxdist_range ((400,1400,100)) – upper and lower bounds used to search for the optimal maximum experimental distance. The last value of the input tuple is the incremental step for maxdist values

  • scale_range ([0.01]) – upper and lower bounds used to search for the optimal scale parameter (nm per nucleotide). The last value of the input tuple is the incremental step for scale parameter values

  • dcutoff_range ([2]) – upper and lower bounds used to search for the optimal distance cutoff parameter (distance, in number of beads, from which to consider 2 beads as being close). The last value of the input tuple is the incremental step for scale parameter values

  • container (None) – restrains particle to be within a given object. Can only be a ‘cylinder’, which is, in fact a cylinder of a given height to which are added hemispherical ends. This cylinder is defined by a radius, its height (with a height of 0 the cylinder becomes a sphere) and the force applied to the restraint. E.g. for modeling E. coli genome (2 micrometers length and 0.5 micrometer of width), these values could be used: [‘cylinder’, 250, 1500, 50], and for a typical mammalian nuclei (6 micrometers diameter): [‘cylinder’, 3000, 0, 50]

  • verbose – print the results to the standard output

Note

Each of the _range parameters accept tuples in the form

(start, end, step), or a list with the list of values to test.

E.g.:
  • scale_range=[0.001, 0.005, 0.006] will test these three values.

  • scale_range=(0.001, 0.005, 0.001) will test the values 0.001, 0.002, 0.003, 0.004 and 0.005

Returns

an pytadbit.imp.impoptimizer.IMPoptimizer object

print_hic_matrix(print_it=True, normalized=False, zeros=False)[source]

Return the Hi-C matrix as string

Parameters
  • print_it (True) – Otherwise, returns the string

  • normalized (False) – returns normalized data, instead of raw Hi-C

  • zeros (False) – take into account filtered columns

Returns

list of lists representing the Hi-C data matrix of the current experiment

set_resolution(resolution, keep_original=True)[source]

Set a new value for the resolution. Copy the original data into Experiment._ori_hic and replace the Experiment.hic_data with the data corresponding to new data (pytadbit.Chromosome.compare_condition()).

Parameters
  • resolution – an integer representing the resolution. This number must be a multiple of the original resolution, and higher than it

  • keep_original (True) – either to keep or not the original data

view(tad=None, focus=None, paint_tads=False, axe=None, show=True, logarithm=True, normalized=False, relative=True, decorate=True, savefig=None, where='both', clim=None, cmap='jet')[source]

Visualize the matrix of Hi-C interactions

Parameters
  • tad (None) –

    a given TAD in the form:

    {'start': start,
     'end'  : end,
     'brk'  : end,
     'score': score}
    

    Alternatively a list of the TADs can be passed (all the TADs between the first and last one passed will be showed. Thus, passing more than two TADs might be superfluous)

  • focus (None) – a tuple with the start and end positions of the region to visualize

  • paint_tads (False) – draw a box around the TADs defined for this experiment

  • axe (None) – an axe object from matplotlib can be passed in order to customize the picture

  • show (True) – either to pop-up matplotlib image or not

  • logarithm (True) – show the logarithm values

  • normalized (True) – show the normalized data (weights might have been calculated previously). Note: white rows/columns may appear in the matrix displayed; these rows correspond to filtered rows (see pytadbit.utils.hic_filtering.hic_filtering_for_modelling() )

  • relative (True) – color scale is relative to the whole matrix of data, not only to the region displayed

  • decorate (True) – draws color bar, title and axes labels

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

  • clim (None) – tuple with minimum and maximum value range for color scale. I.e. clim=(-4, 10)

  • cmap ('jet') – color map from matplotlib. Can also be a preconfigured cmap object.

write_interaction_pairs(fname, normalized=True, zscored=True, diagonal=False, cutoff=None, header=False, true_position=False, uniq=True, remove_zeros=False, focus=None, format='tsv')[source]

Creates a tab separated file with all the pairwise interactions.

Parameters
  • fname – file name where to write the pairwise interactions

  • zscored (True) – computes the z-score of the log10(data)

  • normalized (True) – use the weights to normalize the data

  • cutoff (None) – if defined, only the zscores above the cutoff will be writen to the file

  • uniq (False) – only writes one representent per interacting pair

  • true_position (False) – if, true writes genomic coordinates, otherwise, genomic bin.

  • focus (None) – writes interactions between the start and stop bin passed to this parameter.

  • format ('tsv') – in which to write the file, can be tab separated (tsv) or JSON (json)

write_json(filename, focus=None, normalized=False)[source]

Save hic matrix in the json format, read by TADkit.

Parameters
  • filename – location where the file will be written

  • focus (None) – if a tuple is passed (start, end), json will contain a Hi-C matrix starting at start, and ending at end (all inclusive).

Para False normalized

use normalized data instead of raw Hi-C

write_tad_borders(density=False, savedata=None, normalized=False)[source]

Print a table summarizing the TADs found by tadbit. This function outputs something similar to the R function.

Parameters
  • density (False) – if True, adds a column with the relative interaction frequency measured within each TAD (value of 1 means an interaction frequency equal to the expectation in the experiment)

  • savedata (None) – path to a file where to save the density data generated (1 column per step + 1 for particle number). If None, print a table.

  • normalized (False) – uses normalized data to calculate the density

Load experiment

pytadbit.experiment.load_experiment_from_reads(name, fnam, genome_seq, resolution, conditions=None, identifier=None, cell_type=None, enzyme=None, exp_type='Hi-C', **kw_descr)[source]

Loads an experiment object from TADbit-generated read files, that are lists of pairs of reads mapped to a reference genome.

Parameters
  • fnam – tsv file with reads1 and reads2

  • name – name of the experiment

  • resolution – the resolution of the experiment (size of a bin in bases)

  • identifier (None) – some identifier relative to the Hi-C data

  • cell_type (None) – cell type on which the experiment was done

  • enzyme (None) – restriction enzyme used in the Hi-C experiment

  • exp_type (Hi-C) – name of the experiment used (currently only Hi-C is supported)

  • conditions (None) – list() of experimental conditions, e.g. the cell type, the enzyme… (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADs

  • kw_descr (None) –

    any other argument passed would be stored as complementary descriptive field. For example:

    exp  = Experiment('k562_rep2', resolution=100000,
                      identifier='SRX015263', cell_type='K562',
                      enzyme='HindIII', cylce='synchronized')
    print exp
    
    # Experiment k562_rep2:
    #    resolution        : 100Kb
    #    TADs              : None
    #    Hi-C rows         : None
    #    normalized        : None
    #    identifier        : SRX015263
    #    cell type         : K562
    #    restriction enzyme: HindIII
    #    cylce             : synchronized
    

    note that these fields may appear in the header of generated out files