HiC_data class
- class pytadbit.hic_data.HiC_data(items, size, chromosomes=None, dict_sec=None, resolution=1, masked=None, symmetricized=False)
This may also hold the print/write-to-file matrix functions
- add_sections(lengths, chr_names=None, binned=False)
Add genomic coordinates to the HiC_data object from a list of chromosome lengths. Order matters.
- Parameters
lengths – list of chromosome lengths
chr_names (None) – list of corresponding chromosome names.
binned (False) – if True, lengths will not be divided by resolution
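As an illustration of why order matters, the sketch below turns a list of chromosome lengths into consecutive bin ranges. It is not the pytadbit implementation: the helper name build_sections, the default chromosome names, and the rounding rule (ceiling division) are assumptions made for this example only.

```python
from math import ceil


def build_sections(lengths, chr_names=None, resolution=1, binned=False):
    """Map each chromosome to a (first_bin, end_bin) range, end exclusive.

    Hypothetical helper: the real method may round bin counts differently.
    """
    if chr_names is None:
        # default names are an assumption of this sketch
        chr_names = ["chr%d" % (i + 1) for i in range(len(lengths))]
    sections = {}
    offset = 0
    for name, length in zip(chr_names, lengths):
        # with binned=True the lengths are already expressed in bins
        n_bins = length if binned else ceil(length / resolution)
        sections[name] = (offset, offset + n_bins)
        # each chromosome starts where the previous one ended,
        # which is why the order of the input lists matters
        offset += n_bins
    return sections


sections = build_sections([2500, 1000], ["chr1", "chr2"], resolution=1000)
print(sections)  # {'chr1': (0, 3), 'chr2': (3, 4)}
```

Swapping the two chromosomes in the input would shift every bin index, so lengths and chr_names have to be given in the same order as the matrix.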
- add_sections_from_fasta(fasta)
Add genomic coordinates to the HiC_data object by getting them from a FASTA file containing chromosome sequences.
- Parameters
fasta – path to a FASTA file
- cis_trans_ratio(normalized=False, exclude=None, diagonal=True, equals=None)
Counts the number of interactions occurring within chromosomes (cis) with respect to the total number of interactions
- Parameters
normalized (False) – use normalized data
exclude (None) – exclude a given list of chromosomes from the ratio (may want to exclude translocated chromosomes)
diagonal (True) – replace values in the diagonal by 0 or 1
equals (None) – a function that decides whether two chromosomes have to be considered as the same, e.g. lambda x, y: x[:4]==y[:4] will consider chr2L and chr2R as being the same chromosome. WARNING: only works on consecutive chromosomes.
- Returns
the ratio of cis interactions over the total number of interactions. This number is expected to be at least 40-60% in human classic dilution Hi-C with HindIII as restriction enzyme.
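The ratio described above can be sketched on a toy sparse matrix. This is a simplified illustration, not the library code: the function name cis_ratio, the {(bin_i, bin_j): count} layout and the sections dictionary are assumptions, and the diagonal handling is left out.

```python
def cis_ratio(counts, sections, exclude=None, equals=None):
    """Fraction of interactions whose two bins fall in the same chromosome.

    counts:   {(bin_i, bin_j): count}, a hypothetical sparse layout
    sections: {chromosome: (first_bin, end_bin)}, end exclusive
    """
    exclude = set(exclude or [])
    same = equals or (lambda x, y: x == y)

    def chrom_of(bin_index):
        for name, (start, end) in sections.items():
            if start <= bin_index < end:
                return name
        raise ValueError("bin %d is outside all chromosomes" % bin_index)

    cis = total = 0
    for (i, j), count in counts.items():
        chrom_i, chrom_j = chrom_of(i), chrom_of(j)
        if chrom_i in exclude or chrom_j in exclude:
            continue
        total += count
        if same(chrom_i, chrom_j):
            cis += count
    return cis / total if total else float("nan")


counts = {(0, 1): 6, (2, 3): 2, (0, 2): 2}
print(cis_ratio(counts, {"chr1": (0, 2), "chr2": (2, 4)}))  # 0.8

# chromosome arms merged with the equals function quoted in the text
arms = {"chr2L": (0, 2), "chr2R": (2, 4)}
print(cis_ratio(counts, arms, equals=lambda x, y: x[:4] == y[:4]))  # 1.0
```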
- filter_columns(draw_hist=False, savefig=None, perc_zero=99, by_mean=True, min_count=None, silent=False)
Calls the filtering function to remove artifactual columns in a given Hi-C matrix. This function will detect columns with very low interaction counts. Filtered-out columns will be stored in the dictionary Experiment._zeros.
- Parameters
draw_hist (False) – shows the distribution of mean values by column, the polynomial fit, and the cut applied.
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
perc_zero (99) – maximum percentage of cells with no interactions allowed.
min_count (None) – minimum number of reads mapped to a bin (recommended value could be 2500). If set this option overrides the perc_zero filtering… This option is slightly slower.
by_mean (True) – filter columns by mean column value using the pytadbit.utils.hic_filtering.filter_by_mean() function
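The perc_zero criterion can be pictured with the short sketch below. It only covers the "too many empty cells" rule on a dense list-of-lists matrix; the function name is hypothetical, and the mean-based filter and the min_count option are not modelled.

```python
def columns_with_too_many_zeros(matrix, perc_zero=99):
    """Flag columns whose percentage of empty cells exceeds perc_zero.

    Returns a dictionary keyed by column index, mirroring the idea of
    storing filtered-out columns in a dictionary (layout assumed here).
    """
    size = len(matrix)
    bads = {}
    for col in range(size):
        zeros = sum(1 for row in range(size) if matrix[row][col] == 0)
        if 100.0 * zeros / size > perc_zero:
            bads[col] = True
    return bads


matrix = [
    [5, 1, 0, 2],
    [1, 4, 0, 3],
    [0, 0, 0, 0],
    [2, 3, 0, 6],
]
print(columns_with_too_many_zeros(matrix, perc_zero=75))  # {2: True}
```

Columns 0, 1 and 3 each have 25% empty cells and are kept; column 2 is entirely empty and is flagged.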
- find_compartments(crms=None, savefig=None, savedata=None, savecorr=None, show=False, suffix='', ev_index=None, rich_in_A=None, format='png', savedir=None, max_ev=3, show_compartment_labels=False, **kwargs)
Search for A/B compartments in each chromosome of the Hi-C matrix. The Hi-C matrix is normalized by the number of interactions expected at a given distance, and by visibility (one iteration of ICE). A correlation matrix is then calculated from this normalized matrix, and its first eigenvector is used to identify compartments, changes in sign marking the boundaries between compartments. The result is stored as a dictionary of compartment boundaries, keys being chromosome names.
- Parameters
perc_zero (99) – to filter bad columns
signal_to_noise (0.05) – to calculate expected interaction counts, if not enough reads are observed at a given distance the observations of the distance+1 are summed. a signal to noise ratio of < 0.05 corresponds to > 400 reads.
crms (None) – only run on the given list of chromosomes
savefig (None) – path to a directory to store matrices with compartment predictions, one image per chromosome, stored under ‘chromosome-name_EV1.png’.
format (png) – in which to save the figures.
show (False) – show the plot
savedata (None) – path to a new file to store compartment predictions, one file only.
savedir (None) – path to a directory to store coordinates of each eigenvector, one per chromosome. Each file contains one eigenvector per column, the first one being the one used as reference. This eigenvector is also rotated according to the prediction if a rich_in_A array was given.
savecorr (None) – path to a directory where to save correlation matrices of each chromosome
vmin (-1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).
vmax (1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).
yield_ev1 (False) – if True yields one list per chromosome with the first eigenvector used to compute compartments.
suffix ('') – to be placed after file names of compartment images
max_ev (3) – maximum number of EV to try
ev_index (None) – a list of numbers referring to the index of the eigenvector to be used for each chromosome. By default the first eigenvector is used. WARNING: indices start at 1, the default is thus a list of ones. Note: when asking for only one chromosome, the list should contain only one element.
rich_in_A (None) – by default compartments are identified using mean number of intra-interactions (A compartments are expected to have less). However this measure is not very accurate. Using this parameter a path to a BED or BED-Graph file with a list of genes or active epigenetic marks can be passed, and used instead of the mean interactions.
show_compartment_labels (False) – if True draw A and B compartment blocks.
TODO: this is really slow…
- Notes: building the distance matrix using the amount of interactions instead of the mean correlation generally gives worse results.
- Returns
1- a dictionary with the N (max_ev) first eigenvectors in the form: {Chromosome_name: (Eigenvalue: [Eigenvector])}. The signs of the eigenvectors are changed in order to match the prediction of A/B compartments (positive is A).
2- a dictionary of statistics of enrichment for A compartments (Spearman rho).
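To make the "changes in sign mark boundaries" idea concrete, here is a minimal sketch that splits one eigenvector into consecutive blocks. It assumes the eigenvector has already been oriented so that positive values correspond to A; the function name is hypothetical, and treating zero values as B is an arbitrary choice of this example.

```python
def blocks_from_eigenvector(ev):
    """Return (first_bin, last_bin, label) blocks, last_bin inclusive.

    A new block starts wherever the sign of the eigenvector flips.
    """
    blocks = []
    start = 0
    for pos in range(1, len(ev) + 1):
        # close the current block at the end of the vector or on a sign flip
        if pos == len(ev) or (ev[pos] > 0) != (ev[start] > 0):
            label = "A" if ev[start] > 0 else "B"
            blocks.append((start, pos - 1, label))
            start = pos
    return blocks


first_ev = [0.3, 0.2, -0.1, -0.4, 0.5]
print(blocks_from_eigenvector(first_ev))
# [(0, 1, 'A'), (2, 3, 'B'), (4, 4, 'A')]
```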
- find_compartments_beta(crms=None, savefig=None, savedata=None, savecorr=None, show=False, suffix='', how='', label_compartments='hmm', log=None, max_mean_size=10000, ev_index=None, rich_in_A=None, max_ev=3, show_compartment_labels=False, **kwargs)
Search for A/B compartments in each chromosome of the Hi-C matrix. The Hi-C matrix is normalized by the number of interactions expected at a given distance, and by visibility (one iteration of ICE). A correlation matrix is then calculated from this normalized matrix, and its first eigenvector is used to identify compartments, changes in sign marking the boundaries between compartments. The result is stored as a dictionary of compartment boundaries, keys being chromosome names.
- Parameters
perc_zero (99) – to filter bad columns
signal_to_noise (0.05) – to calculate expected interaction counts, if not enough reads are observed at a given distance the observations of the distance+1 are summed. a signal to noise ratio of < 0.05 corresponds to > 400 reads.
crms (None) – only run on the given list of chromosomes
savefig (None) – path to a directory to store matrices with compartment predictions, one image per chromosome, stored under ‘chromosome-name.png’.
show (False) – show the plot
savedata (None) – path to a new file to store compartment predictions, one file only.
savecorr (None) – path to a directory where to save correlation matrices of each chromosome
vmin (-1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).
vmax (1) – for the color scale of the plotted map (use vmin=’auto’, and vmax=’auto’ to color according to the absolute maximum found).
yield_ev1 (False) – if True yields one list per chromosome with the first eigenvector used to compute compartments.
suffix ('') – to be placed after file names of compartment images
max_ev (3) – maximum number of EV to try
ev_index (None) – a list of numbers referring to the index of the eigenvector to be used for each chromosome. By default the first eigenvector is used. WARNING: indices start at 1, the default is thus a list of ones. Note: when asking for only one chromosome, the list should contain only one element.
rich_in_A (None) – by default compartments are identified using mean number of intra-interactions (A compartments are expected to have less). However this measure is not very accurate. Using this parameter a path to a BED or BED-Graph file with a list of genes or active epigenetic marks can be passed, and used instead of the mean interactions.
log (None) – path to a folder where to save log of the assignment of A/B compartments
label_compartments (hmm) – label compartments into A/B categories, otherwise just find borders (faster). Can be either hmm (default), or cluster.
how ('') – ratio divides by column, subratio divides by compartment, diagonal only uses the diagonal
show_compartment_labels (False) – if True draw A and B compartment blocks.
TODO: this is really slow…
- Notes: building the distance matrix using the amount of interactions instead of the mean correlation generally gives worse results.
- Returns
a dictionary with the N (max_ev) first eigenvectors used to define compartment borders for each chromosome (keys are chromosome names)
- get_hic_data_as_csr()
Returns a scipy sparse matrix in Compressed Sparse Row format of the Hi-C data in the dictionary
- Returns
scipy sparse matrix in Compressed Sparse Row format
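For readers unfamiliar with the Compressed Sparse Row layout, the following pure-Python sketch builds the three CSR arrays from a dictionary of counts. The {(row, col): value} layout and the function name are assumptions for illustration; the resulting (data, indices, indptr) triple is the form that scipy.sparse.csr_matrix accepts as its first argument.

```python
def dict_to_csr(counts, size):
    """Build the CSR arrays (data, indices, indptr) of a size x size matrix.

    counts: {(row, col): value}, a hypothetical sparse layout
    """
    by_row = {}
    for (row, col), value in counts.items():
        by_row.setdefault(row, []).append((col, value))

    data, indices, indptr = [], [], [0]
    for row in range(size):
        # columns are stored in increasing order within each row
        for col, value in sorted(by_row.get(row, [])):
            indices.append(col)
            data.append(value)
        # indptr[row + 1] marks where the next row starts in data/indices
        indptr.append(len(data))
    return data, indices, indptr


counts = {(0, 0): 4, (0, 2): 1, (2, 0): 1, (2, 2): 7}
print(dict_to_csr(counts, 3))
# ([4, 1, 1, 7], [0, 2, 0, 2], [0, 2, 2, 4])
```

Row 1 is empty, so indptr repeats the value 2 for it; only non-zero cells take up memory.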
- get_matrix(focus=None, diagonal=True, normalized=False, masked=False)
Returns the Hi-C data as a matrix.
- Parameters
focus (None) – a tuple with the (start, end) position of the desired window of data (positions start at 1, and both start and end are inclusive). Alternatively, a chromosome name can be given, or a tuple of chromosome names in order to retrieve a specific inter-chromosomal region
diagonal (True) – if False, diagonal is replaced by ones, or zeroes if normalized
normalized (False) – get normalized data
masked (False) – return masked arrays using the definition of bad columns
- Returns
matrix (a list of lists of values)
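The 1-based, inclusive focus convention is easy to get wrong by one bin, so here is a small sketch of how such a window maps onto Python slicing, together with the diagonal replacement described above. The function name window is hypothetical, and only the (start, end) tuple form of focus is covered.

```python
def window(matrix, focus=None, diagonal=True, normalized=False):
    """Extract a square sub-matrix from a list-of-lists matrix.

    focus is a (start, end) tuple, 1-based, both ends inclusive.
    """
    if focus is None:
        start, end = 0, len(matrix)
    else:
        # 1-based inclusive (start, end) becomes the slice [start - 1:end]
        start, end = focus[0] - 1, focus[1]
    sub = [row[start:end] for row in matrix[start:end]]
    if not diagonal:
        # ones for raw data, zeroes for normalized data
        fill = 0 if normalized else 1
        for i in range(len(sub)):
            sub[i][i] = fill
    return sub


matrix = [
    [10, 1, 2, 3],
    [1, 20, 4, 5],
    [2, 4, 30, 6],
    [3, 5, 6, 40],
]
print(window(matrix, focus=(2, 3)))                  # [[20, 4], [4, 30]]
print(window(matrix, focus=(2, 3), diagonal=False))  # [[1, 4], [4, 1]]
```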
- load_biases(fnam, protocol=None)
Load biases, decay and bad columns from pickle file
- Parameters
fnam – path to input pickle file
- normalize_hic(iterations=0, max_dev=0.1, silent=False, sqrt=False, factor=1)
Normalize the Hi-C data.
It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.
- Parameters
iterations (0) – number of iterations
max_dev (0.1) – the iterative process stops when the maximum deviation between the sums of rows reaches this number (0.1 means 10%)
silent (False) – does not warn when overwriting weights
sqrt (False) – uses the square root of the computed biases
factor (1) – final mean number of normalized interactions wanted per cell (excludes filtered-out, or bad, columns)
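How the iterations and max_dev parameters interact can be illustrated with a generic iterative-correction loop. This is a textbook-style sketch written for this page, not the pytadbit code: the function name is hypothetical, the sqrt and factor options are ignored, and the input is assumed to be a dense symmetric matrix with no empty rows (bad columns already removed).

```python
def iterative_correction(matrix, iterations=10, max_dev=0.1):
    """Balance a symmetric matrix so that all row sums become similar.

    Returns (normalized_matrix, biases), where each normalized value equals
    the raw value divided by biases[i] * biases[j].
    """
    size = len(matrix)
    norm = [[float(value) for value in row] for row in matrix]
    biases = [1.0] * size
    for _ in range(iterations):
        sums = [sum(row) for row in norm]
        mean = sum(sums) / size
        relative = [s / mean for s in sums]
        # stop once every row sum is within max_dev of the mean row sum
        if max(abs(r - 1.0) for r in relative) < max_dev:
            break
        for i in range(size):
            for j in range(size):
                norm[i][j] /= relative[i] * relative[j]
            biases[i] *= relative[i]
    return norm, biases


raw = [
    [2, 1, 1],
    [1, 4, 3],
    [1, 3, 10],
]
norm, biases = iterative_correction(raw, iterations=50, max_dev=0.01)
print([round(sum(row), 3) for row in norm])
```

A smaller max_dev gives more uniform row sums at the cost of more passes over the matrix; iterations caps the number of passes if the threshold is never reached.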
- save_biases(fnam, protocol=None)
Save biases, decay and bad columns in pickle format (to be loaded by the function load_hic_data_from_bam)
- Parameters
fnam – path to output file
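A save/load round trip through pickle can be sketched as follows. The dictionary keys, the helper names dump_bias_bundle and read_bias_bundle, and the toy values are all assumptions for illustration; the layout actually written by save_biases may differ.

```python
import os
import pickle
import tempfile


def dump_bias_bundle(fnam, biases, decay, bads, protocol=None):
    """Write biases, decay and bad columns to one pickle file (assumed layout)."""
    payload = {"biases": biases, "decay": decay, "bads": bads}
    with open(fnam, "wb") as handle:
        pickle.dump(payload, handle, protocol=protocol)


def read_bias_bundle(fnam):
    """Read back the three objects written by dump_bias_bundle."""
    with open(fnam, "rb") as handle:
        payload = pickle.load(handle)
    return payload["biases"], payload["decay"], payload["bads"]


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "biases.pickle")
    dump_bias_bundle(path, {0: 1.2, 1: 0.8}, {"chr1": {0: 3.5}}, {2: True})
    print(read_bias_bundle(path))
```

As with any pickle file, only load files from a trusted source, since unpickling can execute arbitrary code.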
- sum(bias=None, bads=None)
Sum the Hi-C data matrix. WARNING: parameters are not meant to be used by external users.
- Parameters
bias (None) – expects a dictionary of biases to use normalized matrix
bads (None) – extends computed bad columns
- Returns
the sum of the Hi-C matrix skipping bad columns
- write_compartments(savedata, chroms=None, ev_nums=None)
Write compartments to a file.
- Parameters
savedata – path to a file.
chroms (None) – write only the given list of chromosomes (by default all chromosomes are written; note that the first column, corresponding to the chromosome name, will disappear in the non-default case)
- write_cooler(fname, normalized=False)
Writes the hic_data to a cooler file.
- Parameters
normalized (False) – get normalized data
- write_coord_table(fname, focus=None, diagonal=True, normalized=False, format='BED')
Writes a coordinate table to a file.
- Parameters
focus (None) – a tuple with the (start, end) position of the desired window of data (positions start at 1, and both start and end are inclusive). Alternatively, a chromosome name can be given, or a tuple of chromosome names in order to retrieve a specific inter-chromosomal region
diagonal (True) – if False, diagonal is replaced by zeroes
normalized (False) – get normalized data
format (BED) –
- either “BED”:
chr1 111 222 chr2:333-444,55 1 .
chr2 333 444 chr1:111-222,55 2 .
- or “long-range” format:
chr1:111-222 chr2:333-444 55
chr2:333-444 chr1:111-222 55
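The two layouts can be reproduced with a few lines of string formatting. The sketch below is only meant to show which field goes where; the function name format_pair is hypothetical and the tab separator is an assumption, since the example lines do not show which whitespace the real writer uses.

```python
def format_pair(chrom1, start1, end1, chrom2, start2, end2, count,
                index=1, fmt="BED"):
    """Format one interaction as a line of text (tab-separated by assumption)."""
    region1 = "%s:%d-%d" % (chrom1, start1, end1)
    region2 = "%s:%d-%d" % (chrom2, start2, end2)
    if fmt == "BED":
        # chrom, start, end, "partner,count", running index, placeholder
        fields = [chrom1, str(start1), str(end1),
                  "%s,%s" % (region2, count), str(index), "."]
    elif fmt == "long-range":
        fields = [region1, region2, str(count)]
    else:
        raise ValueError("unknown format: %r" % fmt)
    return "\t".join(fields)


print(format_pair("chr1", 111, 222, "chr2", 333, 444, 55))
print(format_pair("chr2", 333, 444, "chr1", 111, 222, 55, index=2))
print(format_pair("chr1", 111, 222, "chr2", 333, 444, 55, fmt="long-range"))
```

In both formats each interaction appears twice, once from the point of view of each partner region.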
- write_matrix(fname, focus=None, diagonal=True, normalized=False)
Writes the matrix to a file.
- Parameters
focus (None) – a tuple with the (start, end) position of the desired window of data (positions start at 1, and both start and end are inclusive). Alternatively, a chromosome name can be given, or a tuple of chromosome names in order to retrieve a specific inter-chromosomal region
diagonal (True) – if False, diagonal is replaced by zeroes
normalized (False) – get normalized data
- yield_matrix(focus=None, diagonal=True, normalized=False)
Yields a matrix line by line. Bad rows/columns are returned as null rows/columns.
- Parameters
focus (None) – a tuple with the (start, end) position of the desired window of data (positions start at 1, and both start and end are inclusive). Alternatively, a chromosome name can be given, or a tuple of chromosome names in order to retrieve a specific inter-chromosomal region
diagonal (True) – if False, diagonal is replaced by zeroes
normalized (False) – get normalized data
- Yields
matrix line by line (a line being a list of values)
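Yielding line by line avoids holding the full dense matrix in memory. The generator below sketches the idea on a dictionary of counts; the function name, the {(row, col): value} layout and the choice of zero as the "null" value for bad rows and columns are assumptions of this example.

```python
def yield_rows(counts, size, bads=None):
    """Yield a size x size matrix one row (a list of values) at a time.

    counts: {(row, col): value}, a hypothetical sparse layout
    bads:   dictionary (or set) of bad row/column indices
    """
    bads = bads or {}
    for i in range(size):
        if i in bads:
            # a bad row is emitted as a row of nulls (zeros in this sketch)
            yield [0] * size
            continue
        yield [0 if j in bads else counts.get((i, j), 0) for j in range(size)]


counts = {(0, 0): 4, (0, 1): 2, (1, 0): 2, (1, 1): 5,
          (0, 2): 1, (2, 0): 1, (2, 2): 3}
for row in yield_rows(counts, 3, bads={1: True}):
    print(row)
# [4, 0, 1]
# [0, 0, 0]
# [1, 0, 3]
```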