Retrieve HiC dataset from NCBI

We will use data from (Stadhouders R, Vidal E, Serra F, Di Stefano B et al. 2018), which comes from mouse cells where Hi-C experiment where conducted in different states during highly-efficient somatic cell reprogramming.

The data can be downloaded from:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE53463

Once downloaded the files can be converted to the FASTQ format in order for TADbit to read them.

The easiest way to download the data might be through the fastq-dump program from the SRA Toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=software&m=software&s=software).

We download 100M reads for each of 4 replicates (2 replicates from B cells and 2 from Pluripotent Stem Cells),and organize each in two files, one per read-end (this step is long and can take up to 6 hours):

%%bash

mkdir -p FASTQs

fastq-dump SRR5344921 --defline-seq '@$ac.$si' -X 100000000 --split-files --outdir FASTQs/
mv FASTQs/SRR5344921_1.fastq FASTQs/mouse_B_rep1_1.fastq
mv FASTQs/SRR5344921_2.fastq FASTQs/mouse_B_rep1_2.fastq

fastq-dump SRR5344925 --defline-seq '@$ac.$si' -X 100000000 --split-files --outdir FASTQs/
mv FASTQs/SRR5344925_1.fastq FASTQs/mouse_B_rep2_1.fastq
mv FASTQs/SRR5344925_2.fastq FASTQs/mouse_B_rep2_2.fastq

fastq-dump SRR5344969 --defline-seq '@$ac.$si' -X 100000000 --split-files --outdir FASTQs
mv FASTQs/SRR5344969_1.fastq FASTQs/mouse_PSC_rep1_1.fastq
mv FASTQs/SRR5344969_2.fastq FASTQs/mouse_PSC_rep1_2.fastq

fastq-dump SRR5344973 --defline-seq '@$ac.$si' -X 100000000 --split-files --outdir FASTQs/
mv FASTQs/SRR5344973_1.fastq FASTQs/mouse_PSC_rep2_1.fastq
mv FASTQs/SRR5344973_2.fastq FASTQs/mouse_PSC_rep2_2.fastq
Read 100000000 spots for SRR5344921
Written 100000000 spots for SRR5344921
Read 100000000 spots for SRR5344925
Written 100000000 spots for SRR5344925
Read 100000000 spots for SRR5344969
Written 100000000 spots for SRR5344969
Read 100000000 spots for SRR5344973
Written 100000000 spots for SRR5344973

Files are renamed for convenience.

Note: the parameter used here for fastq-dump are for generating simple FASTQ files, ``–defline-seq ‘@$ac.$si’`` reduces the information in the headers to the accession number and the read id, ``–split-files`` is to separate both read-ends in different files, finally ``-X 100000000`` is to download only the first 100 Million reads of each replicate

Note: alternatively you can also directly download the FASTQ from http://www.ebi.ac.uk/

Compression

Each of these 8 files, contains 100M reads of 75 nucleotides each, and occupies ~17 Gb (total 130 Gb).

Internally we use DSRC (Roguski and Deorowicz, 2014) that allows better compression ration and, more importantly, faster decompression:

%%bash

dsrc c -t8 FASTQs/mouse_B_rep1_1.fastq FASTQs/mouse_B_rep1_1.fastq.dsrc
dsrc c -t8 FASTQs/mouse_B_rep1_2.fastq FASTQs/mouse_B_rep1_2.fastq.dsrc
dsrc c -t8 FASTQs/mouse_B_rep2_1.fastq FASTQs/mouse_B_rep2_1.fastq.dsrc
dsrc c -t8 FASTQs/mouse_B_rep2_2.fastq FASTQs/mouse_B_rep2_2.fastq.dsrc
dsrc c -t8 FASTQs/mouse_PSC_rep1_1.fastq FASTQs/mouse_PSC_rep1_1.fastq.dsrc
dsrc c -t8 FASTQs/mouse_PSC_rep1_2.fastq FASTQs/mouse_PSC_rep1_2.fastq.dsrc
dsrc c -t8 FASTQs/mouse_PSC_rep2_1.fastq FASTQs/mouse_PSC_rep2_1.fastq.dsrc
dsrc c -t8 FASTQs/mouse_PSC_rep2_2.fastq FASTQs/mouse_PSC_rep2_2.fastq.dsrc

After compression we reduce the total size to 27 Gb (20% of the original size, and dsrc ensures fast reading of the compressed data)

Note: - using gzip instead reduces size to ~38 Gb (occupies ~40% more than dsrc compressed files) - using bzip2 instead reduces size to ~31 Gb (occupies ~15% more than dsrc compressed files)

Both are much slower to generate and read

Cleanup

%%bash

rm -f FASTQs/mouse_B_rep1_1.fastq
rm -f FASTQs/mouse_B_rep1_2.fastq
rm -f FASTQs/mouse_B_rep2_1.fastq
rm -f FASTQs/mouse_B_rep2_2.fastq
rm -f FASTQs/mouse_PSC_rep1_1.fastq
rm -f FASTQs/mouse_PSC_rep1_2.fastq
rm -f FASTQs/mouse_PSC_rep2_1.fastq
rm -f FASTQs/mouse_PSC_rep2_2.fastq

References

[^](#ref-1) Stadhouders R, Vidal E, Serra F, Di Stefano B et al. 2018. Transcription factors orchestrate dynamic interplay between genome topology and gene regulation during cell reprogramming.

[^](#ref-4) Roguski, :raw-latex:`Lukasz `and Deorowicz, Sebastian. 2014. DSRC 2—Industry-oriented compression of FASTQ files.