HiC_data class

class pytadbit.hic_data.HiC_data(items, size, chromosomes=None, dict_sec=None, resolution=1, masked=None, symmetricized=False)[source]

This may also hold the print/write-to-file matrix functions

add_sections(lengths, chr_names=None, binned=False)[source]

Add genomic coordinates to the HiC_data object from a list of chromosome lengths. Order matters.

Parameters
  • lengths – list of chromosome lengths

  • chr_names (None) – list of corresponding chromosome names.

  • binned (False) – if True, lengths will not be divided by resolution
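How lengths, resolution and bin indices relate can be sketched as follows. This is a minimal illustration and not TADbit's implementation: the helper `build_sections`, its return values, the default chromosome names, and the rule `length // resolution + 1` for the number of bins are all assumptions made for the example.

```python
from collections import OrderedDict


def build_sections(lengths, chr_names=None, resolution=1, binned=False):
    """Hypothetical helper: map (chromosome, local bin) to a genome-wide bin."""
    if chr_names is None:
        # Assumed naming scheme when no names are supplied.
        chr_names = ["chr%d" % (i + 1) for i in range(len(lengths))]
    chromosomes = OrderedDict()  # chromosome name -> number of bins
    sections = {}                # (chromosome name, local bin) -> global bin
    offset = 0
    for name, length in zip(chr_names, lengths):
        # With binned=True the lengths are taken as bin counts already.
        nbins = length if binned else length // resolution + 1
        chromosomes[name] = nbins
        for local_bin in range(nbins):
            sections[(name, local_bin)] = offset + local_bin
        offset += nbins
    return chromosomes, sections


chromosomes, sections = build_sections([250, 100], ["chr1", "chr2"], resolution=100)
```

Global bin indices are assigned in input order, so swapping two entries of lengths shifts every downstream index. This is why the order of the list matters.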

add_sections_from_fasta(fasta)[source]

Add genomic coordinates to the HiC_data object by reading them from a FASTA file containing chromosome sequences.

Parameters

fasta – path to a FASTA file
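The step of deriving chromosome names and lengths from a FASTA file might look like the sketch below. The function `fasta_lengths` is hypothetical and reads from any text stream. Taking the first word of each header as the chromosome name is an assumption.

```python
import io


def fasta_lengths(handle):
    """Return (names, lengths) of the sequences in a FASTA stream, in file order."""
    names, lengths = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word as the chromosome name (assumption).
            words = line[1:].split()
            names.append(words[0] if words else "")
            lengths.append(0)
        elif names:
            # Sequence line: add its length to the current chromosome.
            lengths[-1] += len(line)
    return names, lengths


example = io.StringIO(">chr1 test\nACGTACGT\nACGT\n>chr2\nGGCC\n")
names, lengths = fasta_lengths(example)
```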

cis_trans_ratio(normalized=False, exclude=None, diagonal=True, equals=None)[source]

Counts the number of interactions occurring within chromosomes (cis) with respect to the total number of interactions

Parameters
  • normalized (False) – use normalized data

  • exclude (None) – exclude a given list of chromosomes from the ratio (for example, translocated chromosomes)

  • diagonal (True) – replace values in the diagonal by 0 or 1

  • equals (None) – can pass a function that would decide if 2 chromosomes have to be considered as the same. e.g. lambda x, y: x[:4]==y[:4] will consider chr2L and chr2R as being the same chromosome. WARNING: only working on consecutive chromosomes.

Returns

the ratio of cis interactions over the total number of interactions. This number is expected to be at least 40-60% in human classic dilution Hi-C with HindIII as restriction enzyme.
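The computation of the ratio, including the effect of exclude and equals, can be sketched on a toy data set. The function `cis_ratio` and its input layout (a dictionary of bin pairs and a list of chromosome names per bin) are assumptions for illustration only. Unlike the method described above, this sketch compares every pair of chromosomes, not only consecutive ones.

```python
def cis_ratio(interactions, bin_to_chrom, exclude=None, equals=None):
    """interactions: {(bin_i, bin_j): count}; bin_to_chrom: chromosome name per bin."""
    exclude = set(exclude or [])
    same = equals or (lambda x, y: x == y)
    cis = total = 0
    for (i, j), count in interactions.items():
        crm_i, crm_j = bin_to_chrom[i], bin_to_chrom[j]
        if crm_i in exclude or crm_j in exclude:
            continue  # skip any interaction touching an excluded chromosome
        total += count
        if same(crm_i, crm_j):
            cis += count
    return cis / total if total else float("nan")


bin_to_chrom = ["chr2L", "chr2L", "chr2R", "chr3"]
interactions = {(0, 1): 6, (0, 2): 2, (1, 3): 2}

ratio = cis_ratio(interactions, bin_to_chrom)
# Treat chr2L and chr2R as one chromosome, as in the equals example above.
merged = cis_ratio(interactions, bin_to_chrom, equals=lambda x, y: x[:4] == y[:4])
```

In this toy case the ratio rises from 0.6 to 0.8 once the two arms of chromosome 2 are considered the same chromosome.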

filter_columns(draw_hist=False, savefig=None, perc_zero=99, by_mean=True, min_count=None, silent=False)[source]

Calls the filtering function to remove artifactual columns in a given Hi-C matrix. This function detects columns with very low interaction counts. Filtered-out columns are stored in the dictionary Experiment._zeros.

Parameters
  • draw_hist (False) – shows the distribution of mean values by column, the polynomial fit, and the cut applied.

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

  • perc_zero (99) – maximum percentage of cells with no interactions allowed.

  • min_count (None) – minimum number of reads mapped to a bin (recommended value could be 2500). If set, this option overrides the perc_zero filtering. This option is slightly slower.

  • by_mean (True) – filter columns by mean column value using pytadbit.utils.hic_filtering.filter_by_mean() function
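The perc_zero and min_count criteria can be illustrated with a small sketch. The function `columns_to_filter` is hypothetical and works on a plain list-of-lists matrix. It does not reproduce the by_mean polynomial fit. Reading min_count as a threshold on the column sum is an assumption.

```python
def columns_to_filter(matrix, perc_zero=99, min_count=None):
    """Return the set of column indices flagged as bad in a square matrix."""
    size = len(matrix)
    bad = set()
    for col in range(size):
        column = [matrix[row][col] for row in range(size)]
        if min_count is not None:
            # Assumed behaviour: min_count replaces the perc_zero criterion.
            if sum(column) < min_count:
                bad.add(col)
        else:
            zeros = sum(1 for value in column if value == 0)
            if 100.0 * zeros / size > perc_zero:
                bad.add(col)
    return bad


matrix = [
    [5, 0, 3, 1],
    [0, 0, 0, 0],
    [3, 0, 8, 2],
    [1, 0, 2, 4],
]
by_zeros = columns_to_filter(matrix, perc_zero=75)
by_count = columns_to_filter(matrix, min_count=10)
```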

find_compartments(crms=None, savefig=None, savedata=None, savecorr=None, show=False, suffix='', ev_index=None, rich_in_A=None, format='png', savedir=None, max_ev=3, show_compartment_labels=False, **kwargs)[source]

Search for A/B compartments in each chromosome of the Hi-C matrix. The Hi-C matrix is normalized by the number of interactions expected at a given distance, and by visibility (one iteration of ICE). A correlation matrix is then calculated from this normalized matrix, and its first eigenvector is used to identify compartments. Changes in sign mark boundaries between compartments. The result is stored as a dictionary of compartment boundaries, keys being chromosome names.

Parameters
  • perc_zero (99) – to filter bad columns

  • signal_to_noise (0.05) – to calculate expected interaction counts, if not enough reads are observed at a given distance, the observations at distance+1 are summed in. A signal-to-noise ratio of < 0.05 corresponds to > 400 reads.

  • crms (None) – only run on the given list of chromosomes

  • savefig (None) – path to a directory to store matrices with compartment predictions, one image per chromosome, stored under ‘chromosome-name_EV1.png’.

  • format (png) – in which to save the figures.

  • show (False) – show the plot

  • savedata (None) – path to a new file to store compartment predictions, one file only.

  • savedir (None) – path to a directory to store coordinates of each eigenvector, one per chromosome. Each file contains one eigenvector per column, the first one being the one used as reference. This eigenvector is also rotated according to the prediction if a rich_in_A array was given.

  • savecorr (None) – path to a directory where to save correlation matrices of each chromosome

  • vmin (-1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).

  • vmax (1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).

  • yield_ev1 (False) – if True yields one list per chromosome with the first eigenvector used to compute compartments.

  • suffix ('') – to be placed after file names of compartment images

  • max_ev (3) – maximum number of EV to try

  • ev_index (None) – a list of numbers referring to the index of the eigenvector to be used. By default the first eigenvector is used. WARNING: index starts at 1, default is thus a list of ones. Note: if asking for only one chromosome, the list should contain only one element.

  • rich_in_A (None) – by default compartments are identified using the mean number of intra-interactions (A compartments are expected to have fewer). However, this measure is not very accurate. Using this parameter, a path to a BED or BED-Graph file with a list of genes or active epigenetic marks can be passed, and used instead of the mean interactions.

  • show_compartment_labels (False) – if True draw A and B compartment blocks.

TODO: this is really slow…

Notes: building the distance matrix using the amount of interactions instead of the mean correlation gives generally worse results.

Returns

1- a dictionary with the N (max_ev) first eigenvectors in the form {Chromosome_name: (Eigenvalue: [Eigenvector])}. The sign of each eigenvector is changed in order to match the prediction of A/B compartments (positive is A).

2- a dictionary of statistics of enrichment for A compartments (Spearman rho).
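The idea that sign changes in the eigenvector mark compartment boundaries can be sketched as below. The function `compartments_from_eigenvector` is hypothetical. It assumes the eigenvector is already oriented so that positive values correspond to A, uses 0-based inclusive bin indices, and treats zero values as B.

```python
def compartments_from_eigenvector(eigenvector):
    """Split bins into runs of constant sign; positive runs are labelled A."""
    compartments = []
    start = 0
    for pos in range(1, len(eigenvector) + 1):
        at_end = pos == len(eigenvector)
        # Close the current run at the end or when the sign flips.
        if at_end or (eigenvector[pos] > 0) != (eigenvector[start] > 0):
            label = "A" if eigenvector[start] > 0 else "B"
            compartments.append((start, pos - 1, label))
            start = pos
    return compartments


ev1 = [0.4, 0.3, -0.2, -0.5, -0.1, 0.6]
result = compartments_from_eigenvector(ev1)
```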

find_compartments_beta(crms=None, savefig=None, savedata=None, savecorr=None, show=False, suffix='', how='', label_compartments='hmm', log=None, max_mean_size=10000, ev_index=None, rich_in_A=None, max_ev=3, show_compartment_labels=False, **kwargs)[source]

Search for A/B compartments in each chromosome of the Hi-C matrix. The Hi-C matrix is normalized by the number of interactions expected at a given distance, and by visibility (one iteration of ICE). A correlation matrix is then calculated from this normalized matrix, and its first eigenvector is used to identify compartments. Changes in sign mark boundaries between compartments. The result is stored as a dictionary of compartment boundaries, keys being chromosome names.

Parameters
  • perc_zero (99) – to filter bad columns

  • signal_to_noise (0.05) – to calculate expected interaction counts, if not enough reads are observed at a given distance, the observations at distance+1 are summed in. A signal-to-noise ratio of < 0.05 corresponds to > 400 reads.

  • crms (None) – only run on the given list of chromosomes

  • savefig (None) – path to a directory to store matrices with compartment predictions, one image per chromosome, stored under ‘chromosome-name.png’.

  • show (False) – show the plot

  • savedata (None) – path to a new file to store compartment predictions, one file only.

  • savecorr (None) – path to a directory where to save correlation matrices of each chromosome

  • vmin (-1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).

  • vmax (1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).

  • yield_ev1 (False) – if True yields one list per chromosome with the first eigenvector used to compute compartments.

  • suffix ('') – to be placed after file names of compartment images

  • max_ev (3) – maximum number of EV to try

  • ev_index (None) – a list of numbers referring to the index of the eigenvector to be used. By default the first eigenvector is used. WARNING: index starts at 1, default is thus a list of ones. Note: if asking for only one chromosome, the list should contain only one element.

  • rich_in_A (None) – by default compartments are identified using the mean number of intra-interactions (A compartments are expected to have fewer). However, this measure is not very accurate. Using this parameter, a path to a BED or BED-Graph file with a list of genes or active epigenetic marks can be passed, and used instead of the mean interactions.

  • log (None) – path to a folder where to save log of the assignment of A/B compartments

  • label_compartments (hmm) – label compartments into A/B categories, otherwise just find borders (faster). Can be either hmm (default), or cluster.

  • how ('ratio') – ratio divide by column, subratio divide by compartment, diagonal only uses diagonal

  • show_compartment_labels (False) – if True draw A and B compartment blocks.

TODO: this is really slow…

Notes: building the distance matrix using the amount of interactions instead of the mean correlation gives generally worse results.

Returns

a dictionary with the N (max_ev) first eigenvectors used to define compartment borders for each chromosome (keys are chromosome names)
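The first normalization step named in the description, dividing by the number of interactions expected at a given distance, can be sketched as follows. The function `observed_over_expected` is hypothetical. It uses the plain mean of each diagonal as the expected value and leaves out the signal_to_noise pooling and the ICE step.

```python
def observed_over_expected(matrix):
    """Divide each cell by the mean count observed at its distance |i - j|."""
    size = len(matrix)
    expected = []
    for dist in range(size):
        # All cells of the upper triangle lying `dist` bins off the diagonal.
        values = [matrix[i][i + dist] for i in range(size - dist)]
        expected.append(sum(values) / len(values))
    return [
        [matrix[i][j] / expected[abs(i - j)] if expected[abs(i - j)] else 0.0
         for j in range(size)]
        for i in range(size)
    ]


matrix = [
    [8, 4, 2],
    [4, 8, 2],
    [2, 2, 8],
]
oe = observed_over_expected(matrix)
```

Values above 1 then indicate more contacts than expected for that genomic distance, and values below 1 indicate fewer.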

get_hic_data_as_csr()[source]

Returns a scipy sparse matrix in Compressed Sparse Row format of the Hi-C data in the dictionary

Returns

scipy sparse matrix in Compressed Sparse Row format
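To show what the Compressed Sparse Row layout contains, the sketch below converts a dictionary of counts into the three CSR arrays in pure Python. The function `dict_to_csr` and the key layout `row * size + col` are assumptions for illustration and are not taken from the HiC_data source.

```python
def dict_to_csr(hic, size):
    """Convert {row * size + col: count} (assumed key layout) to CSR arrays."""
    rows = [[] for _ in range(size)]
    for key, count in hic.items():
        row, col = key // size, key % size
        rows[row].append((col, count))
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, count in sorted(row):
            indices.append(col)  # column index of each stored value
            data.append(count)   # the stored value itself
        indptr.append(len(data))  # where the next row starts in data/indices
    return data, indices, indptr


hic = {0: 5, 1: 2, 3: 2, 8: 7}
data, indices, indptr = dict_to_csr(hic, size=3)
```

These are the same three arrays that `scipy.sparse.csr_matrix((data, indices, indptr), shape=(size, size))` accepts.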

get_matrix(focus=None, diagonal=True, normalized=False, masked=False)[source]

Returns a matrix.

Parameters
  • focus (None) – a tuple with the (start, end) position of the desired window of data (start, starting at 1, and both start and end are inclusive). Alternatively a chromosome name can be input or a tuple of chromosome names, in order to retrieve a specific inter-chromosomal region

  • diagonal (True) – if False, diagonal is replaced by ones, or zeroes if normalized

  • normalized (False) – get normalized data

  • masked (False) – return masked arrays using the definition of bad columns

Returns

matrix (a list of lists of values)
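The 1-based, inclusive focus convention and the diagonal replacement can be sketched as below. The function `window` is hypothetical. It handles only the (start, end) tuple form of focus, not chromosome names, and operates on a plain list of lists.

```python
def window(matrix, focus=None, diagonal=True, normalized=False):
    """Return the sub-matrix for a 1-based, inclusive (start, end) focus."""
    if focus is None:
        start, end = 1, len(matrix)
    else:
        start, end = focus
    bins = range(start - 1, end)   # convert to 0-based, end-exclusive
    fill = 0 if normalized else 1  # value described for diagonal=False
    return [
        [fill if (not diagonal and i == j) else matrix[i][j] for j in bins]
        for i in bins
    ]


matrix = [[10 * i + j for j in range(4)] for i in range(4)]
sub = window(matrix, focus=(2, 3))
no_diag = window(matrix, focus=(2, 3), diagonal=False)
no_diag_norm = window(matrix, focus=(2, 3), diagonal=False, normalized=True)
```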

load_biases(fnam, protocol=None)[source]

Load biases, decay and bad columns from pickle file

Parameters

fnam – path to input pickle file

normalize_hic(iterations=0, max_dev=0.1, silent=False, sqrt=False, factor=1)[source]

Normalize the Hi-C data.

It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.

Parameters
  • iterations (0) – number of iterations

  • max_dev (0.1) – the iterative process stops when the maximum deviation between the sums of rows reaches this number (0.1 means 10%)

  • silent (False) – does not warn when overwriting weights

  • sqrt (False) – uses the square root of the computed biases

  • factor (1) – final mean number of normalized interactions wanted per cell (excludes filtered-out, or bad, columns)
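A generic iterative balancing scheme of the kind these parameters suggest can be sketched as follows. This is not the TADbit implementation: the function `iterative_normalize` is hypothetical, assumes a square symmetric matrix with no empty rows, and ignores the sqrt and factor options. With iterations=0 it performs a single pass.

```python
def iterative_normalize(matrix, iterations=0, max_dev=0.1):
    """Return (normalized matrix, biases) after dividing by per-bin weights."""
    size = len(matrix)
    norm = [[float(value) for value in row] for row in matrix]
    biases = [1.0] * size
    for _ in range(iterations + 1):
        sums = [sum(row) for row in norm]
        mean = sum(sums) / size
        weights = [s / mean for s in sums]  # relative visibility of each bin
        for i in range(size):
            biases[i] *= weights[i]
            for j in range(size):
                norm[i][j] /= weights[i] * weights[j]
        # Stop early once all row sums are within max_dev of their mean.
        sums = [sum(row) for row in norm]
        mean = sum(sums) / size
        if max(abs(s / mean - 1) for s in sums) < max_dev:
            break
    return norm, biases


norm, biases = iterative_normalize([[4, 2], [2, 1]])
row_sums = [sum(row) for row in norm]
```

Each normalized cell equals the raw count divided by the product of the two accumulated biases, which matches the description of Hi-C values divided by the calculated weight.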

save_biases(fnam, protocol=None)[source]

Save biases, decay and bad columns in pickle format (to be loaded by the function load_hic_data_from_bam)

Parameters

fnam – path to output file
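A round trip through a pickle file might look like the sketch below, using only the standard library. The layout of `payload` and its key names are assumptions for illustration. The actual structure written by save_biases is not described here.

```python
import os
import pickle
import tempfile

# Hypothetical payload: the key names and nesting are assumptions.
payload = {
    "biases": {0: 1.2, 1: 0.8},
    "decay": {"chr1": {0: 10.0, 1: 4.5}},
    "badcol": {2: None},
    "resolution": 100000,
}

path = os.path.join(tempfile.mkdtemp(), "biases.pickle")

# Write the biases, decay and bad columns to disk.
with open(path, "wb") as handle:
    pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Read them back, as a loading function would.
with open(path, "rb") as handle:
    restored = pickle.load(handle)
```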

sum(bias=None, bads=None)[source]

Sum the Hi-C data matrix. WARNING: parameters are not meant to be used by external users.

Parameters
  • bias (None) – expects a dictionary of biases to use normalized matrix

  • bads (None) – extends computed bad columns

Returns

the sum of the Hi-C matrix skipping bad columns
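Summing while skipping bad columns can be sketched as below. The function `masked_sum`, the key layout `row * size + col`, and the division of each count by the product of two biases are assumptions made for the example.

```python
def masked_sum(hic, size, bads=None, bias=None):
    """Sum {row * size + col: count} (assumed layout), skipping bad bins."""
    bads = bads or {}
    total = 0.0
    for key, count in hic.items():
        i, j = key // size, key % size
        if i in bads or j in bads:
            continue  # cell lies in a filtered row or column
        if bias is not None:
            count = count / (bias[i] * bias[j])  # use the normalized value
        total += count
    return total


hic = {0: 4, 1: 2, 3: 2, 4: 1, 8: 9}
raw_total = masked_sum(hic, size=3, bads={2: None})
norm_total = masked_sum(hic, size=3, bads={2: None}, bias={0: 2.0, 1: 1.0, 2: 1.0})
```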

write_compartments(savedata, chroms=None, ev_nums=None)[source]

Write compartments to a file.

Parameters
  • savedata – path to a file.

  • chroms (None) – write only the given list of chromosomes (default all chromosomes are written, note that the first column corresponding to chromosome name will disappear in non default case)

write_cooler(fname, normalized=False)[source]

Writes the hic_data to a cooler file.

Parameters

normalized (False) – get normalized data

write_coord_table(fname, focus=None, diagonal=True, normalized=False, format='BED')[source]

Writes a coordinate table to a file.

Parameters
  • focus (None) – a tuple with the (start, end) position of the desired window of data (start, starting at 1, and both start and end are inclusive). Alternatively a chromosome name can be input or a tuple of chromosome names, in order to retrieve a specific inter-chromosomal region

  • diagonal (True) – if False, diagonal is replaced by zeroes

  • normalized (False) – get normalized data

  • format (BED) – either “BED”:

    chr1 111 222 chr2:333-444,55 1 .
    chr2 333 444 chr1:111-222,55 2 .

    or “long-range” format:

    chr1:111-222 chr2:333-444 55
    chr2:333-444 chr1:111-222 55
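The two layouts can be reproduced with a small formatter. The function `coord_lines` is hypothetical, and the tab separator is an assumption, since the example above does not show which delimiter is used. Each interaction is written twice, once from each side.

```python
def coord_lines(chrom1, start1, end1, chrom2, start2, end2, count,
                index=1, fmt="BED"):
    """Format one interaction as two lines in BED or long-range layout."""
    first = "%s:%d-%d" % (chrom1, start1, end1)
    second = "%s:%d-%d" % (chrom2, start2, end2)
    if fmt == "BED":
        return [
            "\t".join([chrom1, str(start1), str(end1),
                       "%s,%d" % (second, count), str(index), "."]),
            "\t".join([chrom2, str(start2), str(end2),
                       "%s,%d" % (first, count), str(index + 1), "."]),
        ]
    if fmt == "long-range":
        return [
            "\t".join([first, second, str(count)]),
            "\t".join([second, first, str(count)]),
        ]
    raise ValueError("unknown format: %r" % fmt)


bed = coord_lines("chr1", 111, 222, "chr2", 333, 444, 55)
long_range = coord_lines("chr1", 111, 222, "chr2", 333, 444, 55, fmt="long-range")
```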

write_matrix(fname, focus=None, diagonal=True, normalized=False)[source]

Writes the matrix to a file.

Parameters
  • focus (None) – a tuple with the (start, end) position of the desired window of data (start, starting at 1, and both start and end are inclusive). Alternatively a chromosome name can be input or a tuple of chromosome names, in order to retrieve a specific inter-chromosomal region

  • diagonal (True) – if False, diagonal is replaced by zeroes

  • normalized (False) – get normalized data

yield_matrix(focus=None, diagonal=True, normalized=False)[source]

Yields a matrix line by line. Bad row/columns are returned as null row/columns.

Parameters
  • focus (None) – a tuple with the (start, end) position of the desired window of data (start, starting at 1, and both start and end are inclusive). Alternatively a chromosome name can be input or a tuple of chromosome names, in order to retrieve a specific inter-chromosomal region

  • diagonal (True) – if False, diagonal is replaced by zeroes

  • normalized (False) – get normalized data

Yields

matrix line by line (a line being a list of values)
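Line-by-line yielding with bad rows and columns blanked out can be sketched with a generator. The function `yield_rows` is hypothetical, and representing a "null" row or column with zeroes is an assumption.

```python
def yield_rows(matrix, bads=None):
    """Yield each row as a list; bad rows and columns come back as zeroes."""
    bads = bads or set()
    size = len(matrix)
    for i in range(size):
        if i in bads:
            yield [0] * size  # whole row is filtered
        else:
            yield [0 if j in bads else matrix[i][j] for j in range(size)]


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rows = list(yield_rows(matrix, bads={1}))
```

A generator keeps only one row in memory at a time, which is useful when the full matrix is too large to build as a list of lists.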