From FASTQ to matrix¶

Mapping¶

pytadbit.mapping.full_mapper.full_mapping(mapper_index_path, fastq_path, out_map_dir, mapper='gem', r_enz=None, frag_map=True, min_seq_len=15, windows=None, add_site=True, clean=False, get_nread=False, mapper_binary=None, mapper_params=None, **kwargs)[source]¶

Maps FASTQ reads to an indexed reference genome. Mapping can be done either without knowledge of the restriction enzyme used, or for experiments performed without one, like Micro-C (iterative mapping), or using the ligation sites created from the digested ends (fragment-based mapping).

Parameters

mapper_index_path – path to index file created from a reference genome using gem-index tool or bowtie2-build
fastq_path – PATH to FASTQ file, either compressed or not.
out_map_dir – path to a directory where to store mapped reads in MAP format .
r_enz (None) – name of the restriction enzyme used in the experiment e.g. HindIII. This is optional if frag_map option is False
frag_map (True) – two step mapper, first full length is mapped, then remaining, unmapped reads, are divided into restriction-enzyme fragments andeach is mapped.
add_site (True) – when splitting the sequence by ligated sites found, removes the ligation site, and put back the original RE site.
min_seq_len (15) – minimum size of a fragment to map
windows (None) –
tuple of ranges for beginning and end of the mapping. This parameter allows to do classical iterative mapping, e.g.

windows=((1,25),(1,30),(1,35),(1,40),(1,45),(1,50))

A unique window can also be passed, for trimming, like this:
windows=((1,101),)
clean (False) – remove intermediate files created in temp_dir
nthreads (4) – number of threads to use for mapping (number of CPUs)
max_edit_distance (0.04) – The maximum number of edit operations allowed while verifying candidate matches by dynamic programming.
mismatches (0.04) – The maximum number of nucleotide substitutions allowed while mapping each k-mer. It is always guaranteed that, however other options are chosen, all the matches up to the specified number of substitutions will be found by the program.
temp_dir (/tmp) – important to change. Intermediate FASTQ files will be written there.
get_nreads (False) – returns a list of lists where each element contains a path and the number of reads processed
mapper_binary (gem-mapper) – path to the binary mapper
mapper_params (None) – extra parameters for the mapper

Returns

a list of paths to generated outfiles. To be passed to pytadbit.parsers.map_parser.parse_map()

Quality check and plotting¶

pytadbit.mapping.analyze.hic_map(data, resolution=None, normalized=False, masked=None, by_chrom=False, savefig=None, show=False, savedata=None, focus=None, clim=None, perc_clim=None, cmap='jet', pdf=False, decay=True, perc=20, name=None, decay_resolution=None, **kwargs)[source]¶

function to retrieve data from HiC-data object. Data can be stored as a square matrix, or drawn using matplotlib

Parameters

data – can be either a path to a file with pre-processed reads (filtered or not), or a Hi-C-data object
resolution (None) – at which to bin the data (try having a dense matrix with < 10% of cells with zero interaction counts). Note: not necessary if a hic_data object is passed as ‘data’.
normalized (False) – used normalized data, based on precalculated biases
masked – a list of columns to be removed. Usually because to few interactions
by_chrom (False) –
data can be stored in a partitioned way. This parameter can take the values of:
- ’intra’: one output per each chromosome will be created
- ’inter’: one output per each possible pair of chromosome will be
  created
- ’all’ : both of the above outputs
savefig (None) – path where to store the output images. Note that, if the by_chrom option is used, then savefig will be the name of the directory containing the output files.
savedata (None) – path where to store the output matrices. Note that, if the by_chrom option is used, then savefig will be the name of the directory containing the output files.
focus (None) – can be either two number (i.e.: (1, 100)) specifying the start and end position of the sub-matrix to display (start and end, along the diagonal of the original matrix); or directly a chromosome name; or two chromosome names (i.e.: focus=(‘chr2, chrX’)), in order to store the data corresponding to inter chromosomal interactions between these two chromosomes
decay (True) – plot the correlation between genomic distance and interactions (usually a decay).
force_image (False) – force to generate an image even if resolution is crazy…
clim (None) – cutoff for the upper and lower bound in the coloring scale of the heatmap. (perc_clim should be set to None)
perc_clim (None) – cutoff for the upper and lower bound in the coloring scale of the heatmap; in percentile. (clim should be set to None)
pdf (False) – when using the bny_chrom option, to specify the format of the stored images
cmap (jet) – color map to be used for the heatmap; “tadbit” color map is also implemented and will use percentiles of the distribution of interactions to defines intensities of red.
decay_resolution (None) – chromatin fragment size to consider when calculating decay of the number of interactions with genomic distance. Default is equal to resolution of the matrix.

pytadbit.mapping.analyze.insert_sizes(fnam, savefig=None, nreads=None, max_size=99.9, axe=None, show=False, xlog=False, stats=('median', 'perc_max'), too_large=10000)[source]¶: Deprecated function, use fragment_size

pytadbit.mapping.analyze.correlate_matrices(hic_data1, hic_data2, max_dist=10, intra=False, axe=None, savefig=None, show=False, savedata=None, min_dist=1, normalized=False, remove_bad_columns=True, **kwargs)[source]¶

Compare the interactions of two Hi-C matrices at a given distance,: with Spearman rank correlation.
Also computes the SCC reproducibility score as in HiCrep (see: https://doi.org/10.1101/gr.220640.117). It’s implementation is inspired by the version implemented in dryhic by Enrique Vidal (https://github.com/qenvio/dryhic).

Parameters

hic_data1 – Hi-C-data object
hic_data2 – Hi-C-data object
resolution (1) – to be used for scaling the plot
max_dist (10) – maximum distance from diagonal (e.g. 10 mean we will not look further than 10 times the resolution)
min_dist (1) – minimum distance from diagonal (set to 0 to reproduce result from HicRep)
savefig (None) – path to save the plot
intra (False) – only takes into account intra-chromosomal contacts
show (False) – displays the plot
normalized (False) – use normalized data
remove_bads (True) – computes the union of bad columns between samples and exclude them from the comparison

Returns

list of correlations, list of genomic distances, SCC and standard deviation of SCC

pytadbit.mapping.analyze.plot_strand_bias_by_distance(fnam, nreads=1000000, valid_pairs=True, half_step=20, half_len=2000, full_step=500, full_len=50000, savefig=None)[source]¶

Classify reads into four categories depending on the strand on which each of its end is mapped, and plots the proportion of each of these categories in function of the genomic distance between them.

Only full mapped reads mapped on two diferent restriction fragments (still same chromosome) are considered.

The four categories are:

Both read-ends mapped on the same strand (forward)
Both read-ends mapped on the same strand (reverse)
Both read-ends mapped on the different strand (facing), like extra-dangling-ends
Both read-ends mapped on the different strand (opposed), like extra-self-circles

Params fnam: path to tsv file with intersection of mapped ends
Params True valid_pairs: consider only read-ends mapped on different restriction fragments. If False, considers only read-ends mapped on the same restriction fragment.
Params 1000000 nreads: number of reads used to plot (if None, all will be used)
Params 20 half_step: binning for the first part of the plot
Params 2000 half_len: maximum distance for the first part of the plot
Params 500 full_step: binning for the second part of the plot
Params 50000 full_len: maximum distance for the second part of the plot
Params None savefig: path to save figure

pytadbit.mapping.analyze.eig_correlate_matrices(hic_data1, hic_data2, nvect=6, normalized=False, savefig=None, show=False, savedata=None, remove_bad_columns=True, **kwargs)[source]¶

Compare the interactions of two Hi-C matrices using their 6 first eigenvectors, with Pearson correlation

Parameters

hic_data1 – Hi-C-data object
hic_data2 – Hi-C-data object
nvect (6) – number of eigenvectors to compare
savefig (None) – path to save the plot
show (False) – displays the plot
normalized (False) – use normalized data
remove_bads (True) – computes the union of bad columns between samples and exclude them from the comparison
kwargs – any argument to pass to matplotlib imshow function

Returns

matrix of correlations

pytadbit.mapping.analyze.plot_distance_vs_interactions(data, min_diff=1, max_diff=1000, show=False, genome_seq=None, resolution=None, axe=None, savefig=None, normalized=False, plot_each_cell=False)[source]¶

Plot the number of interactions observed versus the genomic distance between the mapped ends of the read. The slope is expected to be around -1, in logarithmic scale and between 700 kb and 10 Mb (according to the prediction of the fractal globule model).

Parameters

data – input file name (either tsv or TADbit generated BAM), or HiC_data object or list of lists
min_diff (10) – lower limit (in number of bins)
max_diff (1000) – upper limit (in number of bins) to look for
resolution (100) – group reads that are closer than this resolution parameter
axe (None) – a matplotlib.axes.Axes object to define the plot appearance
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

Param_hash False plot_each_cell

if false, only the mean distances by bin will be represented, otherwise each pair of interactions will be plotted.

Returns

slope, intercept and R square of each of the 3 correlations

pytadbit.mapping.analyze.plot_iterative_mapping(fnam1, fnam2, total_reads=None, axe=None, savefig=None)[source]¶

Plots the number of reads mapped at each step of the mapping process (in the case of the iterative mapping, each step is mapping process with a given size of fragments).

Parameters

fnam – input file name
total_reads – total number of reads in the initial FASTQ file
axe (None) – a matplotlib.axes.Axes object to define the plot appearance
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).

Returns

a dictionary with the number of reads per mapped length

pytadbit.mapping.analyze.plot_genomic_distribution(fnam, first_read=None, resolution=10000, ylim=None, yscale=None, savefig=None, show=False, savedata=None, chr_names=None, nreads=None)[source]¶

Plot the number of reads in bins along the genome (or along a given chromosome).

Parameters

fnam – input file name
first_read (True) – uses first read.
resolution (100) – group reads that are closer than this resolution parameter
ylim (None) – a tuple of lower and upper bound for the y axis
yscale (None) – if set_bad to “log” values will be represented in log2 scale
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
savedata (None) – path where to store the output read counts per bin.
chr_names (None) – can pass a list of chromosome names in case only some them the need to be plotted (this option may last even more than default)
nreads (None) – number of reads to process (default: all reads)

pytadbit.utils.fastq_utils.quality_plot(fnam, r_enz=None, nreads=inf, axe=None, savefig=None, paired=False)[source]¶

Plots the sequencing quality of a given FASTQ file. If a restrinction enzyme (RE) name is provided, can also represent the distribution of digested and undigested RE sites and estimate an expected proportion of dangling-ends.

Proportion of dangling-ends is inferred by counting the number of times a dangling-end site, is found at the beginning of any of the reads (divided by the number of reads).

Parameters

fnam – path to FASTQ file
nreads (None) – max number of reads to read, not necesary to read all
savefig (None) – path to a file where to save the image generated; if None, the image will be shown using matplotlib GUI (the extension of the file name will determine the desired format).
paired (False) – is input FASTQ contains both ends

Returns

the percentage of dangling-ends (sensu stricto) and the percentage of reads with at least a ligation site.

Filtering¶

pytadbit.mapping.restriction_enzymes.map_re_sites(enzyme_name, genome_seq, frag_chunk=100000, verbose=False)[source]¶

map all restriction enzyme (RE) sites of a given enzyme in a genome. Position of a RE site is defined as the genomic coordinate of the first nucleotide after the first cut (genomic coordinate starts at 1).

In the case of HindIII the genomic coordinate is this one:

123456 789…: v

—–A|AGCT T————– —–T TCGA|A————–

In this example the coordinate of the RE site would be 7.

Parameters

enzyme_name – name of the enzyme to map (upper/lower case are important)
genome_seq – a dictionary containing the genomic sequence by chromosome
frag_chunk (100000) – in order to optimize the search for nearby RE sites, each chromosome is splitted into chunks.

pytadbit.mapping.restriction_enzymes.repaired(r_enz)[source]¶: returns the resulting sequence after reparation of two digested and repaired ends, marking dangling ends.

pytadbit.mapping.get_intersection(fname1, fname2, out_path, verbose=False, compress=False)[source]¶

Merges the two files corresponding to each reads sides. Reads found in both

files are merged and written in an output file.

Dealing with multiple contacts:

a pairwise contact is created for each possible combnation of the multicontacts. The name of the read is extended by ‘# 1/3’ in case the reported pairwise contact corresponds to the first of 3 possibles
it may happen that different contacts are mapped on a single RE fragment (if each are on different end), in which case:
- if no other fragment from this read are mapped than, both are kept
- otherwise, they are merged into one longer (as if they were mapped in the positive strand)

Parameters

fname1 – path to a tab separated file generated by the function pytadbit.parsers.sam_parser.parse_sam()
fname2 – path to a tab separated file generated by the function pytadbit.parsers.sam_parser.parse_sam()
out_path – path to an outfile. It will written in a similar format as the inputs
compress (False) – compress (gzip) input files. This is done in the background while next input files are parsed.

Returns

final number of pair of interacting fragments, and a dictionary with the number of multiple contacts (keys of the dictionary being the number of fragment cought together, can be 3, 4, 5..)

pytadbit.mapping.merge_2d_beds(path1, path2, outpath)[source]¶

Merge two result files (file resulting from get_intersection or from: the filtering) into one.

Parameters

path1 – path to first file
path2 – path to second file

Returns

number of reads processed

pytadbit.mapping.filter.filter_reads(fnam, output=None, max_molecule_length=500, over_represented=0.005, max_frag_size=100000, min_frag_size=100, re_proximity=5, verbose=True, savedata=None, min_dist_to_re=750, strict_duplicates=False, fast=True)[source]¶

Filter mapped pair of reads in order to remove experimental artifacts (e.g. dangling-ends, self-circle, PCR artifacts…)

Applied filters are:

1- self-circlereads are coming from a single RE fragment and: point to the outside (—-<===—===>—)
2- dangling-endreads are coming from a single RE fragment and: point to the inside (—-===>—<===—)
3- errorreads are coming from a single RE fragment and: point in the same direction
4- extra dangling-endreads are coming from different RE fragment but: are close enough (< max_molecule length) and point to the inside
5- too close from RESsemi-dangling-end filter, start position of one: of the read is too close (5 bp by default) from RE cutting site. This filter is skipped in case read is involved in multi-contact. This filter may be too conservative for 4bp cutter REs.
6- too shortremove reads coming from small restriction less: than 100 bp (default) because they are comparable to the read length
7- too largeremove reads coming from large restriction: fragments (default: 100 Kb, P < 10-5 to occur in a randomized genome) as they likely represent poorly assembled or repeated regions
8- over-representedreads coming from the top 0.5% most frequently: detected restriction fragments, they may be prone to PCR artifacts or represent fragile regions of the genome or genome assembly errors
9- duplicatedthe combination of the start positions of the: reads is repeated -> PCR artifact (only keep one copy)
10- random breaksstart position of one of the read is too far (: more than min_dist_to_re) from RE cutting site. Non-canonical enzyme activity or random physical breakage of the chromatin.

Parameters

fnam – path to file containing the pair of reads in tsv format, file generated by pytadbit.mapping.mapper.get_intersection()
output (None) – PATH where to write files containing IDs of filtered reads. Uses fnam by default.
max_molecule_length (500) – facing reads that are within max_molecule_length, will be classified as ‘extra dangling-ends’
over_represented (0.005) – to remove the very top fragment containing more reads
max_frag_size (100000) – maximum fragment size allowed (fragments should not span over several bins)
min_frag_size (100) – remove fragment that are too short (shorter than the sequenced read length)
re_proximity (5) – should be adjusted according to RE site, to filter semi-dangling-ends
min_dist_to_re (750) – minimum distance the start of a read should be from a RE site (usually 1.5 times the insert size). Applied in filter 10
savedata (None) – PATH where to write the number of reads retained by each filter
fast (True) – parallel version, requires 4 CPUs and more RAM memory
strict_duplicates (False) – by default reads are considered duplicates if they coincide in genomic coordinates and strand; with strict_duplicates enabled, we also ask to consider read length (WARNING: this option is called strict, but it is more permissive).

Returns

dictionary with, as keys, the kind of filter applied, and as values a set of read IDs to be removed

Note: Filtering is not exclusive, one read can be filtered several times.

pytadbit.mapping.filter.apply_filter(fnam, outfile, masked, filters=None, reverse=False, verbose=True)[source]¶

Create a new file with reads filtered

Parameters

fnam – input file path, where non-filtered read are stored
outfile – output file path, where filtered read will be stored
masked – dictionary given by the pytadbit.mapping.filter.filter_reads()
filters (None) – list of numbers corresponding to the filters we want to apply (numbers correspond to the keys in the masked dictionary)
reverse (False) – if set, the resulting outfile will only contain the reads filtered, not the valid pairs.
verbose (False) –

Returns

number of reads kept

From FASTQ to matrix¶

Mapping¶

Quality check and plotting¶

Filtering¶

Table of Contents

Previous topic

Next topic