From FASTQ to matrix

Mapping

pytadbit.mapping.full_mapper.full_mapping(mapper_index_path, fastq_path, out_map_dir, mapper='gem', r_enz=None, frag_map=True, min_seq_len=15, windows=None, add_site=True, clean=False, get_nread=False, mapper_binary=None, mapper_params=None, **kwargs)[source]

Maps FASTQ reads to an indexed reference genome. Mapping can be done either without knowledge of the restriction enzyme used, or for experiments performed without one, like Micro-C (iterative mapping), or using the ligation sites created from the digested ends (fragment-based mapping).

Parameters
  • mapper_index_path – path to index file created from a reference genome using gem-index tool or bowtie2-build

  • fastq_path – PATH to FASTQ file, either compressed or not.

  • out_map_dir – path to a directory where to store mapped reads in MAP format .

  • r_enz (None) – name of the restriction enzyme used in the experiment e.g. HindIII. This is optional if frag_map option is False

  • frag_map (True) – two step mapper, first full length is mapped, then remaining, unmapped reads, are divided into restriction-enzyme fragments andeach is mapped.

  • add_site (True) – when splitting the sequence by ligated sites found, removes the ligation site, and put back the original RE site.

  • min_seq_len (15) – minimum size of a fragment to map

  • windows (None) –

    tuple of ranges for beginning and end of the mapping. This parameter allows to do classical iterative mapping, e.g.

    windows=((1,25),(1,30),(1,35),(1,40),(1,45),(1,50))

    A unique window can also be passed, for trimming, like this:

    windows=((1,101),)

  • clean (False) – remove intermediate files created in temp_dir

  • nthreads (4) – number of threads to use for mapping (number of CPUs)

  • max_edit_distance (0.04) – The maximum number of edit operations allowed while verifying candidate matches by dynamic programming.

  • mismatches (0.04) – The maximum number of nucleotide substitutions allowed while mapping each k-mer. It is always guaranteed that, however other options are chosen, all the matches up to the specified number of substitutions will be found by the program.

  • temp_dir (/tmp) – important to change. Intermediate FASTQ files will be written there.

  • get_nreads (False) – returns a list of lists where each element contains a path and the number of reads processed

  • mapper_binary (gem-mapper) – path to the binary mapper

  • mapper_params (None) – extra parameters for the mapper

Returns

a list of paths to generated outfiles. To be passed to pytadbit.parsers.map_parser.parse_map()

Quality check and plotting

pytadbit.mapping.analyze.hic_map(data, resolution=None, normalized=False, masked=None, by_chrom=False, savefig=None, show=False, savedata=None, focus=None, clim=None, perc_clim=None, cmap='jet', pdf=False, decay=True, perc=20, name=None, decay_resolution=None, **kwargs)[source]

function to retrieve data from HiC-data object. Data can be stored as a square matrix, or drawn using matplotlib

Parameters
  • data – can be either a path to a file with pre-processed reads (filtered or not), or a Hi-C-data object

  • resolution (None) – at which to bin the data (try having a dense matrix with < 10% of cells with zero interaction counts). Note: not necessary if a hic_data object is passed as ‘data’.

  • normalized (False) – used normalized data, based on precalculated biases

  • masked – a list of columns to be removed. Usually because to few interactions

  • by_chrom (False) –

    data can be stored in a partitioned way. This parameter can take the values of:

    • ’intra’: one output per each chromosome will be created

    • ’inter’: one output per each possible pair of chromosome will be

      created

    • ’all’ : both of the above outputs

  • savefig (None) – path where to store the output images. Note that, if the by_chrom option is used, then savefig will be the name of the directory containing the output files.

  • savedata (None) – path where to store the output matrices. Note that, if the by_chrom option is used, then savefig will be the name of the directory containing the output files.

  • focus (None) – can be either two number (i.e.: (1, 100)) specifying the start and end position of the sub-matrix to display (start and end, along the diagonal of the original matrix); or directly a chromosome name; or two chromosome names (i.e.: focus=(‘chr2, chrX’)), in order to store the data corresponding to inter chromosomal interactions between these two chromosomes

  • decay (True) – plot the correlation between genomic distance and interactions (usually a decay).

  • force_image (False) – force to generate an image even if resolution is crazy…

  • clim (None) – cutoff for the upper and lower bound in the coloring scale of the heatmap. (perc_clim should be set to None)

  • perc_clim (None) – cutoff for the upper and lower bound in the coloring scale of the heatmap; in percentile. (clim should be set to None)

  • pdf (False) – when using the bny_chrom option, to specify the format of the stored images

  • cmap (jet) – color map to be used for the heatmap; “tadbit” color map is also implemented and will use percentiles of the distribution of interactions to defines intensities of red.

  • decay_resolution (None) – chromatin fragment size to consider when calculating decay of the number of interactions with genomic distance. Default is equal to resolution of the matrix.

pytadbit.mapping.analyze.insert_sizes(fnam, savefig=None, nreads=None, max_size=99.9, axe=None, show=False, xlog=False, stats=('median', 'perc_max'), too_large=10000)[source]

Deprecated function, use fragment_size

pytadbit.mapping.analyze.correlate_matrices(hic_data1, hic_data2, max_dist=10, intra=False, axe=None, savefig=None, show=False, savedata=None, min_dist=1, normalized=False, remove_bad_columns=True, **kwargs)[source]
Compare the interactions of two Hi-C matrices at a given distance,

with Spearman rank correlation.

Also computes the SCC reproducibility score as in HiCrep (see

https://doi.org/10.1101/gr.220640.117). It’s implementation is inspired by the version implemented in dryhic by Enrique Vidal (https://github.com/qenvio/dryhic).

Parameters
  • hic_data1 – Hi-C-data object

  • hic_data2 – Hi-C-data object

  • resolution (1) – to be used for scaling the plot

  • max_dist (10) – maximum distance from diagonal (e.g. 10 mean we will not look further than 10 times the resolution)

  • min_dist (1) – minimum distance from diagonal (set to 0 to reproduce result from HicRep)

  • savefig (None) – path to save the plot

  • intra (False) – only takes into account intra-chromosomal contacts

  • show (False) – displays the plot

  • normalized (False) – use normalized data

  • remove_bads (True) – computes the union of bad columns between samples and exclude them from the comparison

Returns

list of correlations, list of genomic distances, SCC and standard deviation of SCC

pytadbit.mapping.analyze.plot_strand_bias_by_distance(fnam, nreads=1000000, valid_pairs=True, half_step=20, half_len=2000, full_step=500, full_len=50000, savefig=None)[source]

Classify reads into four categories depending on the strand on which each of its end is mapped, and plots the proportion of each of these categories in function of the genomic distance between them.

Only full mapped reads mapped on two diferent restriction fragments (still same chromosome) are considered.

The four categories are:
  • Both read-ends mapped on the same strand (forward)

  • Both read-ends mapped on the same strand (reverse)

  • Both read-ends mapped on the different strand (facing), like extra-dangling-ends

  • Both read-ends mapped on the different strand (opposed), like extra-self-circles

Params fnam

path to tsv file with intersection of mapped ends

Params True valid_pairs

consider only read-ends mapped on different restriction fragments. If False, considers only read-ends mapped on the same restriction fragment.

Params 1000000 nreads

number of reads used to plot (if None, all will be used)

Params 20 half_step

binning for the first part of the plot

Params 2000 half_len

maximum distance for the first part of the plot

Params 500 full_step

binning for the second part of the plot

Params 50000 full_len

maximum distance for the second part of the plot

Params None savefig

path to save figure

pytadbit.mapping.analyze.eig_correlate_matrices(hic_data1, hic_data2, nvect=6, normalized=False, savefig=None, show=False, savedata=None, remove_bad_columns=True, **kwargs)[source]

Compare the interactions of two Hi-C matrices using their 6 first eigenvectors, with Pearson correlation

Parameters
  • hic_data1 – Hi-C-data object

  • hic_data2 – Hi-C-data object

  • nvect (6) – number of eigenvectors to compare

  • savefig (None) – path to save the plot

  • show (False) – displays the plot

  • normalized (False) – use normalized data

  • remove_bads (True) – computes the union of bad columns between samples and exclude them from the comparison

  • kwargs – any argument to pass to matplotlib imshow function

Returns

matrix of correlations

pytadbit.mapping.analyze.plot_distance_vs_interactions(data, min_diff=1, max_diff=1000, show=False, genome_seq=None, resolution=None, axe=None, savefig=None, normalized=False, plot_each_cell=False)[source]

Plot the number of interactions observed versus the genomic distance between the mapped ends of the read. The slope is expected to be around -1, in logarithmic scale and between 700 kb and 10 Mb (according to the prediction of the fractal globule model).

Parameters
  • data – input file name (either tsv or TADbit generated BAM), or HiC_data object or list of lists

  • min_diff (10) – lower limit (in number of bins)

  • max_diff (1000) – upper limit (in number of bins) to look for

  • resolution (100) – group reads that are closer than this resolution parameter

  • axe (None) – a matplotlib.axes.Axes object to define the plot appearance

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

Param_hash False plot_each_cell

if false, only the mean distances by bin will be represented, otherwise each pair of interactions will be plotted.

Returns

slope, intercept and R square of each of the 3 correlations

pytadbit.mapping.analyze.plot_iterative_mapping(fnam1, fnam2, total_reads=None, axe=None, savefig=None)[source]

Plots the number of reads mapped at each step of the mapping process (in the case of the iterative mapping, each step is mapping process with a given size of fragments).

Parameters
  • fnam – input file name

  • total_reads – total number of reads in the initial FASTQ file

  • axe (None) – a matplotlib.axes.Axes object to define the plot appearance

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

Returns

a dictionary with the number of reads per mapped length

pytadbit.mapping.analyze.plot_genomic_distribution(fnam, first_read=None, resolution=10000, ylim=None, yscale=None, savefig=None, show=False, savedata=None, chr_names=None, nreads=None)[source]

Plot the number of reads in bins along the genome (or along a given chromosome).

Parameters
  • fnam – input file name

  • first_read (True) – uses first read.

  • resolution (100) – group reads that are closer than this resolution parameter

  • ylim (None) – a tuple of lower and upper bound for the y axis

  • yscale (None) – if set_bad to “log” values will be represented in log2 scale

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

  • savedata (None) – path where to store the output read counts per bin.

  • chr_names (None) – can pass a list of chromosome names in case only some them the need to be plotted (this option may last even more than default)

  • nreads (None) – number of reads to process (default: all reads)

pytadbit.utils.fastq_utils.quality_plot(fnam, r_enz=None, nreads=inf, axe=None, savefig=None, paired=False)[source]

Plots the sequencing quality of a given FASTQ file. If a restrinction enzyme (RE) name is provided, can also represent the distribution of digested and undigested RE sites and estimate an expected proportion of dangling-ends.

Proportion of dangling-ends is inferred by counting the number of times a dangling-end site, is found at the beginning of any of the reads (divided by the number of reads).

Parameters
  • fnam – path to FASTQ file

  • nreads (None) – max number of reads to read, not necesary to read all

  • savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

  • paired (False) – is input FASTQ contains both ends

Returns

the percentage of dangling-ends (sensu stricto) and the percentage of reads with at least a ligation site.

Filtering

pytadbit.mapping.restriction_enzymes.map_re_sites(enzyme_name, genome_seq, frag_chunk=100000, verbose=False)[source]

map all restriction enzyme (RE) sites of a given enzyme in a genome. Position of a RE site is defined as the genomic coordinate of the first nucleotide after the first cut (genomic coordinate starts at 1).

In the case of HindIII the genomic coordinate is this one:

123456 789…

v

—–A|AGCT T————– —–T TCGA|A————–

In this example the coordinate of the RE site would be 7.

Parameters
  • enzyme_name – name of the enzyme to map (upper/lower case are important)

  • genome_seq – a dictionary containing the genomic sequence by chromosome

  • frag_chunk (100000) – in order to optimize the search for nearby RE sites, each chromosome is splitted into chunks.

pytadbit.mapping.restriction_enzymes.repaired(r_enz)[source]

returns the resulting sequence after reparation of two digested and repaired ends, marking dangling ends.

pytadbit.mapping.get_intersection(fname1, fname2, out_path, verbose=False, compress=False)[source]
Merges the two files corresponding to each reads sides. Reads found in both

files are merged and written in an output file.

Dealing with multiple contacts:
  • a pairwise contact is created for each possible combnation of the multicontacts. The name of the read is extended by ‘# 1/3’ in case the reported pairwise contact corresponds to the first of 3 possibles

  • it may happen that different contacts are mapped on a single RE fragment (if each are on different end), in which case:

    • if no other fragment from this read are mapped than, both are kept

    • otherwise, they are merged into one longer (as if they were mapped in the positive strand)

Parameters
  • fname1 – path to a tab separated file generated by the function pytadbit.parsers.sam_parser.parse_sam()

  • fname2 – path to a tab separated file generated by the function pytadbit.parsers.sam_parser.parse_sam()

  • out_path – path to an outfile. It will written in a similar format as the inputs

  • compress (False) – compress (gzip) input files. This is done in the background while next input files are parsed.

Returns

final number of pair of interacting fragments, and a dictionary with the number of multiple contacts (keys of the dictionary being the number of fragment cought together, can be 3, 4, 5..)

pytadbit.mapping.merge_2d_beds(path1, path2, outpath)[source]
Merge two result files (file resulting from get_intersection or from

the filtering) into one.

Parameters
  • path1 – path to first file

  • path2 – path to second file

Returns

number of reads processed

pytadbit.mapping.filter.filter_reads(fnam, output=None, max_molecule_length=500, over_represented=0.005, max_frag_size=100000, min_frag_size=100, re_proximity=5, verbose=True, savedata=None, min_dist_to_re=750, strict_duplicates=False, fast=True)[source]

Filter mapped pair of reads in order to remove experimental artifacts (e.g. dangling-ends, self-circle, PCR artifacts…)

Applied filters are:
1- self-circlereads are coming from a single RE fragment and

point to the outside (—-<===—===>—)

2- dangling-endreads are coming from a single RE fragment and

point to the inside (—-===>—<===—)

3- errorreads are coming from a single RE fragment and

point in the same direction

4- extra dangling-endreads are coming from different RE fragment but

are close enough (< max_molecule length) and point to the inside

5- too close from RESsemi-dangling-end filter, start position of one

of the read is too close (5 bp by default) from RE cutting site. This filter is skipped in case read is involved in multi-contact. This filter may be too conservative for 4bp cutter REs.

6- too shortremove reads coming from small restriction less

than 100 bp (default) because they are comparable to the read length

7- too largeremove reads coming from large restriction

fragments (default: 100 Kb, P < 10-5 to occur in a randomized genome) as they likely represent poorly assembled or repeated regions

8- over-representedreads coming from the top 0.5% most frequently

detected restriction fragments, they may be prone to PCR artifacts or represent fragile regions of the genome or genome assembly errors

9- duplicatedthe combination of the start positions of the

reads is repeated -> PCR artifact (only keep one copy)

10- random breaksstart position of one of the read is too far (

more than min_dist_to_re) from RE cutting site. Non-canonical enzyme activity or random physical breakage of the chromatin.

Parameters
  • fnam – path to file containing the pair of reads in tsv format, file generated by pytadbit.mapping.mapper.get_intersection()

  • output (None) – PATH where to write files containing IDs of filtered reads. Uses fnam by default.

  • max_molecule_length (500) – facing reads that are within max_molecule_length, will be classified as ‘extra dangling-ends’

  • over_represented (0.005) – to remove the very top fragment containing more reads

  • max_frag_size (100000) – maximum fragment size allowed (fragments should not span over several bins)

  • min_frag_size (100) – remove fragment that are too short (shorter than the sequenced read length)

  • re_proximity (5) – should be adjusted according to RE site, to filter semi-dangling-ends

  • min_dist_to_re (750) – minimum distance the start of a read should be from a RE site (usually 1.5 times the insert size). Applied in filter 10

  • savedata (None) – PATH where to write the number of reads retained by each filter

  • fast (True) – parallel version, requires 4 CPUs and more RAM memory

  • strict_duplicates (False) – by default reads are considered duplicates if they coincide in genomic coordinates and strand; with strict_duplicates enabled, we also ask to consider read length (WARNING: this option is called strict, but it is more permissive).

Returns

dictionary with, as keys, the kind of filter applied, and as values a set of read IDs to be removed

Note: Filtering is not exclusive, one read can be filtered several times.

pytadbit.mapping.filter.apply_filter(fnam, outfile, masked, filters=None, reverse=False, verbose=True)[source]

Create a new file with reads filtered

Parameters
  • fnam – input file path, where non-filtered read are stored

  • outfile – output file path, where filtered read will be stored

  • masked – dictionary given by the pytadbit.mapping.filter.filter_reads()

  • filters (None) – list of numbers corresponding to the filters we want to apply (numbers correspond to the keys in the masked dictionary)

  • reverse (False) – if set, the resulting outfile will only contain the reads filtered, not the valid pairs.

  • verbose (False) –

Returns

number of reads kept