Bin filtering and normalization

from pytadbit.parsers.hic_parser import load_hic_data_from_bam
cell = 'mouse_B'
repl = 'rep1'

reso = 100000
hic_data = load_hic_data_from_bam(
    'results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}.bam'.format(cell, repl),
    reso, ncpus=8)
(Matrix size 27269x27269)                                                    [2020-02-06 11:17:21]

- Parsing BAM (122 chunks)                                                   [2020-02-06 11:17:21]
   .......... .......... .......... .......... ..........     50/122
   .......... .......... .......... .......... ..........    100/122
   .......... .......... ..                                  122/122

- Getting matrices                                                           [2020-02-06 11:17:53]
   .......... .......... .......... .......... ..........     50/122
   .......... .......... .......... .......... ..........    100/122
   .......... .......... ..                                  122/122

Filter bins with low interaction counts

In this case, the filtering we are applying is divided in two steps: 1. First we ask to remove all columns having a total of less than 10 interactions (min_count=10). This is a relatively loose filtering. 2. We apply a more complex filter. We will remove bins with a total number of counts much lower than the mean of the interaction counts.

Note: For this (very sparse) dataset, we required very few (10) interactions per bin. In normal cases, to filter bins with too low interaction counts one should require that each bin contains a minimum percentage (default 75%) of cells with no-zero counts, or a minimum number of interactions in total (for good quality data the recommended value could be 2500).

hic_data.filter_columns(draw_hist=True, min_count=10, by_mean=True, silent=True)
WARNING: Using twice min_count as the matrix was symmetricized and contains twice as many interactions as the original
../_images/tutorial_7-Bin_filtering_and_normalization_6_1.png
print('removed {:,} columns out of {:,} ({:.1f}%)'.format(len(hic_data.bads), len(hic_data),
                                                           len(hic_data.bads) / float(len(hic_data)) * 100))
removed 2,660 columns out of 27,269 (9.8%)

At 100 kb it makes no sense to view the full matrix (your matrix will have more cells than your screen have pixels), so it is much more informative to plot a specific region of the genome only using the focus parameter.

from pytadbit.mapping.analyze import hic_map
hic_map(hic_data, normalized=False, focus='chr3', show=True, cmap='viridis')
../_images/tutorial_7-Bin_filtering_and_normalization_10_0.png

Normalization algorithms

Note: if columns with a lot of zeroes are present the ICE normalization will last very long to converge, and these low-coverage columns will present, at the end of the normalization, few cells with very high values of interaction

Iterative Correction and Eigenvector decomposition (ICE)

ICE normalization (Imakaev et al., 2012) assumes an equal experimental visibility of each bin and seeks iteratively for biases that equalize the sum of counts per bin in the matrix. At each iteration, a new matrix is generated by dividing each cell by the product of the sum of counts in its row times the sum of counts in its column. The process converges to a matrix in which all bins have identical sum.

If \(W\) is the raw matrix, \(N\) is its size, and \(i\)(\(j\)) the index of the columns(rows), the normalized matrix \(M\) is iteratively computed as:

\[M_{i,j} = \frac{W_{i,j}}{\sqrt{\sum_{n=0}^N{W_{i,n}} \times \sum_{n=0}^N{W_{n,j}}}}\]

This normalization has usually a quite strong effect, and visually the matrices look very smooth and regular.

hic_data.normalize_hic(iterations=100, max_dev=0.00001)
iterative correction
  - copying matrix
  - computing biases
          1083.000        5384.660      101635.000    0  17.87492
          4407.395        5483.780        8251.728    1   0.50475
          4560.550        5512.003        6238.145    2   0.17261
          5116.644        5521.481        6119.412    3   0.10829
          5173.833        5524.944        5781.389    4   0.06355
          5374.323        5526.241        5761.161    5   0.04251
          5380.193        5526.744        5621.079    6   0.02652
          5470.168        5526.942        5623.500    7   0.01747
          5465.095        5527.021        5561.686    8   0.01120
          5506.121        5527.053        5567.812    9   0.00737
          5500.600        5527.066        5539.771   10   0.00479
          5519.386        5527.071        5544.417   11   0.00314
          5515.770        5527.073        5531.733   12   0.00205
          5524.250        5527.074        5534.470   13   0.00134
          5522.248        5527.075        5528.790   14   0.00087
          5526.033        5527.075        5530.231   15   0.00057
          5525.014        5527.075        5527.710   16   0.00037
          5526.688        5527.075        5528.422   17   0.00024
          5526.195        5527.075        5527.312   18   0.00016
          5526.929        5527.075        5527.650   19   0.00010
          5526.699        5527.075        5527.165   20   0.00007
          5527.019        5527.075        5527.321   21   0.00004
          5526.915        5527.075        5527.110   22   0.00003
          5527.053        5527.075        5527.180   23   0.00002
          5527.006        5527.075        5527.089   24   0.00001
          5527.066        5527.075        5527.120   25   0.00001
rescaling to factor 1
  - getting the sum of the matrix
    => 24609.000
  - rescaling biases
from pytadbit.mapping.analyze import hic_map

hic_map(hic_data, normalized=True, focus='chr3', show=True, cmap='viridis')
../_images/tutorial_7-Bin_filtering_and_normalization_16_0.png

Save biases

hic_data.save_biases('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}_ICE_{2}kb.biases'.format(
    cell, repl, reso // 1000))

Vanilla coverage normalization

The vanilla normalization (Rao et al., 2014) is a variation of the ICE where a single iteration is performed.

hic_data.normalize_hic(iterations=0, max_dev=0.00001)
iterative correction
  - copying matrix
  - computing biases
rescaling to factor 1
  - getting the sum of the matrix
    => 25061.997
  - rescaling biases
from pytadbit.mapping.analyze import hic_map

hic_map(hic_data, normalized=True, focus='chr3', show=True, cmap='viridis')
../_images/tutorial_7-Bin_filtering_and_normalization_22_0.png

Save biases

hic_data.save_biases('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}_Vanilla_{2}kb.biases'.format(
    cell, repl, reso // 1000))

Square root vanilla coverage (SQRT) normalization

The SQRT vanilla normalization (Rao et al., 2014) is a variation of the Vanilla coverage where each element in the matrix is divided by the square root of the product of sums of counts.

\[M_{i,j} = \frac{W_{i,j}}{\sqrt{\sum_{n=0}^N{W_{i,n}} \times \sum_{n=0}^N{W_{n,j}}}}\]
hic_data.normalize_hic(iterations=0, sqrt=True, max_dev=0.00001)
iterative correction
  - copying matrix
  - computing biases
rescaling to factor 1
  - getting the sum of the matrix
    => 1790593.806
  - rescaling biases
from pytadbit.mapping.analyze import hic_map

hic_map(hic_data, normalized=True, focus='chr3', show=True, cmap='viridis')
../_images/tutorial_7-Bin_filtering_and_normalization_28_0.png

Save biases

hic_data.save_biases('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}_SQRT_{2}kb.biases'.format(
    cell, repl, reso // 1000))

OneD normalization

OneD normalization (Vidal et al., 2018) is based on fitting a non-linear model between the total amount of contacts per bin and the known biases: - GC content - number of RE sites (the most important bias, the more cut sites, the more mapped reads) - read mappability

As the estimation of each of this statistics is very important for the normalization, they are left outside the normalization function, in order to allow user to modify them.

from pytadbit.parsers.bed_parser          import parse_mappability_bedGraph
from pytadbit.utils.normalize_hic         import oneD
from pytadbit.mapping.restriction_enzymes import RESTRICTION_ENZYMES
from pytadbit.parsers.genome_parser       import parse_fasta, get_gc_content
fasta = 'genome/Mus_musculus-GRCm38.p6/Mus_musculus-GRCm38.p6.fa'
genome = parse_fasta(fasta, verbose=False)

get mappability ~2 min

f_mappability = 'genome/Mus_musculus-GRCm38.p6/Mus_musculus-GRCm38.p6.50mer.bedGraph'

mappability = parse_mappability_bedGraph(f_mappability, reso)
from functools import reduce

for c in genome:
    if not c in mappability:
        mappability[c] = [float('nan')] * (len(genome) // reso + 1)
    if len(mappability[c]) < len(genome) // reso + 1:
        mappability[c] += [float('nan')] * (
            (len(genome) // reso + 1) - len(mappability[c]))
mappability = reduce(lambda x, y: x + y,
                     (mappability.get(c, []) for c in genome))

get GC content ~ 30 sec

gc_content = get_gc_content(genome, reso, n_cpus=8)

compute r_sites ~30 sec

re_site = RESTRICTION_ENZYMES['MboI'].replace('|', '')

n_rsites  = []
for crm in genome:
    for pos in range(0, len(genome[crm]), reso):
        seq = genome[crm][pos:pos + reso + 400]
        n_rsites.append(seq.count(re_site))
sum_cols = [float('nan') if c in hic_data.bads else
            sum(0.0 if l in hic_data.bads else hic_data[c, l] for l in range(len(hic_data)))
            for c in range(len(hic_data))]
biases = oneD(tot=sum_cols, map=mappability, res=n_rsites, cg=gc_content)

hic_data.bias = list(biases)
from pytadbit.mapping.analyze import hic_map

hic_map(hic_data, normalized=True, focus='chr19', show=True, cmap='viridis')
../_images/tutorial_7-Bin_filtering_and_normalization_45_0.png

Save biases

hic_data.save_biases('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}_oneD_{2}kb.biases'.format(
    cell, repl, reso // 1000))

Comparison

By eye, the effect of choosing one normalization over an other is often very small

import matplotlib.pyplot as plt
import numpy as np
from pytadbit.utils.extraviews import plot_HiC_matrix
plt.figure(figsize=(22, 21))
axe = plt.subplot(3, 2, 1)
plt.title('Raw data')
matrix = hic_data.get_matrix(focus='chr3:10000000-40000000', normalized=False)
plot_HiC_matrix(np.log2(matrix), axe=axe,#  vmin=0.5,
                triangular=True)
axe.axis('off')
for num, norm in enumerate(['SQRT', 'oneD', 'Vanilla', 'ICE'], 3):
    # load biases
    hic_data.load_biases('results/fragment/{0}_{1}/03_filtering/valid_reads12_{0}_{1}_{3}_{2}kb.biases'.format(
    cell, repl, reso // 1000, norm))
    # retrieve data as matrix
    matrix = hic_data.get_matrix(focus='chr3:10000000-40000000', normalized=True, masked=True)
    # define subplot
    axe = plt.subplot(3, 2, num)
    # draw plot
    plt.title(norm)
    plot_HiC_matrix(np.log2(matrix), axe=axe,
                    vmin=-0.5 if norm=='oneD' else 1,  # manually adjust color scale depending on the normalization
                    bad_color='grey', triangular=True)
    axe.axis('off')
plt.show()
/home/dcastillo/miniconda2/envs/py3_tadbit/lib/python3.7/site-packages/ipykernel_launcher.py:5: RuntimeWarning: divide by zero encountered in log2
  """
/home/dcastillo/miniconda2/envs/py3_tadbit/lib/python3.7/site-packages/ipykernel_launcher.py:18: RuntimeWarning: divide by zero encountered in log2
../_images/tutorial_7-Bin_filtering_and_normalization_51_1.png

Other normalizations

ICE and Vanilla normalizations are widely used however other, more convoluted, normalizations (Hu et al., 2012) (Yaffe and Tanay, 2011) can be used outside TADbit and then loaded in TADbit as normalized matrices for further analysis.

Best normalization

Which is the best normalization to use is a question that can not be answered easily because it depends on the type of data and the type of analysis.

Most of the time Hi-C experiments are conducted in different conditions and, for each, in several replicates. A good way to find the best normalization method may be to select the one that achieve to minimize the differences between replicates and maximize the differences between conditions (this in the context of the analysis to be performed).

References

[^](#ref-1) Imakaev, Maxim V and Fudenberg, Geoffrey and McCord, Rachel Patton and Naumova, Natalia and Goloborodko, Anton and Lajoie, Bryan R and Dekker, Job and Mirny, Leonid A. 2012. Iterative correction of Hi-C data reveals hallmarks of chromosome organization.. URL

[^](#ref-2) Rao, Suhas S P and Huntley, Miriam H and Durand, Neva C and Stamenova, Elena K and Bochkov, Ivan D. and James T. Robinson and Sanborn, Adrian L. and Machol, Ido and Omer, Arina D. and Lander, Eric S. and Lieberman-Aiden, Erez. 2014. A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. URL

[^](#ref-3) Enrique Vidal, François le Dily, Javier Quilez, Ralph Stadhouders, Yasmina Cuartero, Thomas Graf, Marc A Marti-Renom, Miguel Beato, Guillaume J Filion. 2018. OneD: increasing reproducibility of Hi-C samples with abnormal karyotypes. URL

[^](#ref-4) Hu, Ming and Deng, Ke and Selvaraj, Siddarth and Qin, Zhaohui and Ren, Bing and Liu, Jun S. 2012. HiCNorm: removing biases in Hi-C data via Poisson regression.

[^](#ref-5) Yaffe, Eitan and Tanay, Amos. 2011. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture.. URL