TADbit tools¶
TADbit also provides a set of command-line tools that are installed with the library and cover its main functionalities.
These tools are independent but share a working directory, where a local database is created to store the inputs/outputs of each process or job, as well as some statistics.
Each of these tools has extensive help, so here we only review their general usage and function.
TADbit map¶
Map Hi-C reads and organize results in an output working directory
usage: tadbit map [-h] [--skip_mapping] -w PATH --fastq PATH [--fastq2 PATH] --index PATH
[--genome PATH [PATH ...]] --read INT --renz STR [STR ...]
[--chr_name STR [STR ...]] [--tmp PATH] [--tmpdb PATH] [--noX] [--iterative]
[--fast_fragment] [--windows WINDOWS [WINDOWS ...]] [--species STR]
[--descr LIST [LIST ...]] [--skip] [--keep_tmp] [-C CPUS] [--mapper STR]
[--mapper_binary STR] [--mapper_param MAPPER_PARAM [MAPPER_PARAM ...]]
optional arguments:
-h, --help show this help message and exit
General options:
--skip_mapping generate a Hi-C specific quality plot from the FASTQ file and exit
-w PATH, --workdir PATH path to an output folder.
--fastq PATH path to a FASTQ file (can be compressed)
--fastq2 PATH (beta) path to the FASTQ file of read 2 (can be compressed).
Needed for fast_fragment
--index PATH paths to file(s) with indexed FASTA files of the reference genome.
--genome PATH [PATH ...]
paths to file(s) with FASTA files of the reference genome. Needed
for fast_fragment mapping. If many, files will be concatenated,
e.g.: --genome chr_1.fa chr_2.fa; in this last case, the order is
important for the rest of the analysis. Note: it can also be the path
to a previously parsed genome in pickle format.
--read INT read number
--renz STR [STR ...] restriction enzyme name(s). Use "--renz CHECK" to search for the most
probable enzyme(s) and exit, and "--renz NONE" to avoid using RE site
information.
--chr_name STR [STR ...]
[fasta header] chromosome name(s). Used in the same order as data.
--tmp PATH path to a temporary directory (default next to "workdir" directory)
--tmpdb PATH if provided uses this directory to manipulate the database
--noX no display server (X screen)
--skip [DEBUG] in case already mapped.
--keep_tmp [DEBUG] keep temporary files.
Mapping options:
--iterative the default mapping strategy is fragment based; use this flag for
iterative mapping
--fast_fragment (beta) use fast fragment mapping. Both FASTQ files are mapped using
fragment based mapping in GEM v3. The output file is an intersected
read file that can be used directly in tadbit filter (no tadbit
parse needed). Access to samtools is needed for fast_fragment to
work. --fastq2 and --genome need to be specified and the --read value
should be set to 0.
--windows WINDOWS [WINDOWS ...]
defines windows to be used to trim the input FASTQ reads, for
example an iterative mapping can be defined as: "--windows 1:20 1:25
1:30 1:35 1:40 1:45 1:50". But this parameter can also be used for
fragment based mapping, for example if paired-end reads are both in the
same FASTQ: "--windows 1:50" (if the length of the
reads is 100). Note that both numbers are inclusive.
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--mapper STR [gem] mapper used, options are gem, bowtie2 or hisat2
--mapper_binary STR [None] path to mapper binary
--mapper_param MAPPER_PARAM [MAPPER_PARAM ...]
any parameter that could be passed to the GEM, BOWTIE2 or HISAT2
mapper. e.g. if we want to set the proportion of mismatches to 0.05
and the maximum indel length to 10, (in GEM v2 it would be: -e 0.05
--max-big-indel-length 10), here we could write: "--mapper_param
e:0.05 max-big-indel-length:10". For BOWTIE2, GEM3 and HISAT2 you
can also pass directly the parameters enclosed between quotes like:
--mapper_param "-e 0.05 --alignment-local-min-score 15" IMPORTANT:
some options are incompatible with 3C-derived experiments.
Descriptive, optional arguments:
--species STR species name
--descr LIST [LIST ...] extra descriptive fields, each field separated by a comma and,
inside each, name and value separated by a colon:
--descr=cell:lymphoblast,flowcell:C68AEACXX,index:24nf
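As a sketch, a typical run maps each read-end in a separate call; the working directory, FASTQ files, GEM index, restriction enzyme and core count below are placeholders to adapt to your data:
tadbit map -w sample_rep1 --fastq sample_r1.fastq.gz --index hg38.gem --read 1 --renz MboI -C 8
tadbit map -w sample_rep1 --fastq sample_r2.fastq.gz --index hg38.gem --read 2 --renz MboI -C 8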
TADbit parse¶
Parse mapped Hi-C reads and get the intersection
usage: tadbit parse [-h] [-w PATH] [--type STR] [--read INT] [--mapped1 PATHs [PATHs ...]]
[--mapped2 PATHs [PATHs ...]] [--renz STR] [--filter_chrom FILTER_CHROM]
[--skip] [--compress_input] [--tmpdb PATH] [--genome PATH [PATH ...]]
[--jobids INT [INT ...]] [--noX]
optional arguments:
-h, --help show this help message and exit
General options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
--type STR [map] file type to be parsed: MAP (GEM-mapper), SAM or BAM
--read INT In case only one of the reads needs to be parsed
--filter_chrom FILTER_CHROM
default: --filter_chrom
"^(chr)?[A-Za-z]?[0-9]{0,3}[XVI]{0,3}(?:ito)?[A-Z-a-z]?(_dna)?$",
regexp; only chromosome names matching it are considered
--skip [DEBUG] in case already mapped.
--compress_input Compress input mapped files once parsing is done. This is done in
the background, while the next MAP file is processed or while reads
are sorted.
--tmpdb PATH if provided uses this directory to manipulate the database
--genome PATH [PATH ...]
paths to file(s) with FASTA files of the reference genome. If many,
files will be concatenated, e.g.: --genome chr_1.fa chr_2.fa; in this
last case, the order is important for the rest of the analysis. Note: it
can also be the path to a previously parsed genome in pickle format.
--jobids INT [INT ...] Use as input data generated by a job with a given jobid(s). Use
tadbit describe to find out which. In this case one jobid can be
passed per read.
--noX no display server (X screen)
Mapped outside TADbit options:
--mapped1 PATHs [PATHs ...]
paths to mapped bam files (first read-end)
--mapped2 PATHs [PATHs ...]
paths to mapped bam files (second read-end)
--renz STR restriction enzyme name
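Continuing the hypothetical example above, the two mapped read-ends stored in the working directory can then be parsed against the reference FASTA (paths are placeholders):
tadbit parse -w sample_rep1 --genome hg38.fa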
TADbit filter¶
Filter parsed Hi-C reads and get valid pairs of reads to work with
usage: tadbit filter [-h] [--force] [--resume] [--apply INT [INT ...]] [-w PATH] [-C CPUS]
[--noX] [--over_represented NUM] [--strict_duplicates]
[--max_frag_size NUM] [--min_frag_size NUM] [--re_proximity NUM]
[--mad NUM] [--max_f NUM] [--median NUM] [--tmpdb PATH]
[--pathids INT [INT ...]] [--compress_input] [--format {short,mid,long}]
[--valid] [--clean] [--samtools PATH]
optional arguments:
-h, --help show this help message and exit
General options:
--force overwrite previously run job
--resume use filters of previously run job
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--noX no display server (X screen)
--tmpdb PATH if provided uses this directory to manipulate the database
--pathids INT [INT ...] Use as input the data generated by a job under the given pathids. Use
tadbit describe to find out which. To filter an intersected file
produced with tadbit map --fast_fragment, only one PATHid is needed;
otherwise one per read is needed: first for read 1, second for read
2.
--compress_input Compress input mapped files once parsing is done. This is done in
the background, while the next MAP file is processed or while reads
are sorted.
--samtools PATH path to the samtools binary
Storage options:
--format {short,mid,long}
[mid] level of compression into pseudo-BAM format. "short" contains
only the positions of mapped reads, "mid" everything but restriction sites.
--valid stores only valid pairs, discarding filtered-out reads.
--clean remove intermediate files. WARNING: together with the "short" format or
the valid option, this may result in losing data
Filtering options:
--apply INT [INT ...] [[1, 2, 3, 4, 6, 7, 9, 10]] Use filters to define the set of valid
pairs of reads, e.g.: '--apply 1 2 3 4 6 7 8 9', where these
numbers correspond to: 1: self-circle, 2: dangling-end, 3: error, 4:
extra dangling-end, 5: too close from RES, 6: too short, 7: too
large, 8: over-represented, 9: duplicated, 10: random breaks
--over_represented NUM [0.001%] percentage of the most covered restriction-enzyme (RE)
genomic fragments to exclude (possible PCR artifacts).
--strict_duplicates by default, reads are considered duplicates if they coincide in
genomic coordinates and strand; with strict_duplicates enabled, read
length is also taken into account (WARNING: this option is called
strict, but it is more permissive)
--max_frag_size NUM [100000] to exclude large genomic RE fragments (probably resulting
from gaps in the reference genome)
--min_frag_size NUM [50] to exclude small genomic RE fragments (smaller than sequenced
reads)
--re_proximity NUM [5] to exclude read-ends falling too close to an RE site (pseudo-
dangling-ends)
--mad NUM MAD fragment length normally computed from observed distribution
--max_f NUM Maximum fragment length normally computed from observed distribution
--median NUM Median fragment length normally computed from observed distribution
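A minimal sketch, reusing the hypothetical working directory from the previous steps and applying the default set of filters:
tadbit filter -w sample_rep1 -C 8 --apply 1 2 3 4 6 7 9 10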
TADbit normalize¶
Normalize Hi-C data and generate an array of biases
usage: tadbit normalize [-h] -w PATH -r INT [--bam PATH] [-j INT] [--max_njobs INT]
[--tmpdb PATH] [-C CPUS] [--normalize_only] [--noX]
[--normalization STR] [--biases_path BIASES_PATH] [--mappability PATH]
[--fasta PATH] [--renz STR] [--factor NUM] [--prop_data FLOAT]
[--seed INT] [--min_count INT] [--cis_limit CIS_LIMIT]
[--trans_limit TRANS_LIMIT] [--ratio_limit RATIO_LIMIT]
[--cistrans_filter] [--filter_only]
[-B CHR:POS1-POS2 [CHR:POS1-POS2 ...]] [-F INT [INT ...]] [--valid]
optional arguments:
-h, --help show this help message and exit
Required options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
-r INT, --resolution INT
resolution at which to output matrices
General options:
--bam PATH path to a TADbit-generated BAM file with all reads (otherwise the
tool will guess from the working directory database)
-j INT, --jobid INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
--max_njobs INT [100] Define the maximum number of jobs for reading the BAM file (set
to higher numbers for large files and low available RAM).
--tmpdb PATH if provided uses this directory to manipulate the database
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--normalize_only skip calculation of Cis-percentage and decay
--noX no display server (X screen)
Bin filtering options:
--min_count INT [None] minimum number of reads mapped to a bin (a recommended value
could be 2500). If set, this option overrides the perc_zero
filtering. Note that this option is slightly slower.
--cis_limit CIS_LIMIT Maximum distance in bins at which to consider an interaction as cis
for the filtering. By default it is the number of bins corresponding
to 1 Mb
--trans_limit TRANS_LIMIT
Maximum distance in bins at which to consider an interaction as trans
for the filtering. By default it is five times the cis_limit (if
also default, it would correspond to the number of bins needed to
reach 5 Mb).
--ratio_limit RATIO_LIMIT
[1.0] Minimum cis/trans ratio (as defined by the cis_limit and
trans_limit parameters) used to filter out bins.
--cistrans_filter filter using cis-trans ratio.
--filter_only skip normalization
-B CHR:POS1-POS2 [CHR:POS1-POS2 ...], --badcols CHR:POS1-POS2 [CHR:POS1-POS2 ...]
extra regions to be added to bad-columns (in genomic positions). e.g.:
--badcols 1:150000000-160000000 2:1200000-1300000
Read filtering options:
-F INT [INT ...], --filter INT [INT ...]
[[1, 2, 3, 4, 6, 7, 9, 10]] Use filters to define the set of valid
pairs of reads, e.g.: '--filter 1 2 3 4 8 9 10', where these
numbers correspond to: 1: self-circle, 2: dangling-end, 3: error, 4:
extra dangling-end, 5: too close from RES, 6: too short, 7: too
large, 8: over-represented, 9: duplicated, 10: random breaks, 11:
trans-chromosomic
--valid input BAM file contains only valid pairs (already filtered).
Normalization options:
--normalization STR [Vanilla] normalization(s) to apply. Order matters. Choices:
Vanilla, ICE, SQRT, oneD, custom
--biases_path BIASES_PATH
biases file to compute decay. REQUIRED with "custom" normalization.
Format: single column with header
--mappability PATH Path to mappability bedGraph file, required for oneD normalization.
Mappability file can be generated with GEM (example from the genomic FASTA file hg38.fa):
gem-indexer -i hg38.fa -o hg38
gem-mappability -I hg38.gem -l 50 -o hg38.50mer -T 8
gem-2-wig -I hg38.gem -i hg38.50mer.mappability -o hg38.50mer
wigToBigWig hg38.50mer.wig hg38.50mer.sizes hg38.50mer.bw
bigWigToBedGraph hg38.50mer.bw hg38.50mer.bedGraph
--fasta PATH Path to FASTA file with genome sequence, to compute GC content and
number of restriction sites per bin. Required for oneD normalization
--renz STR restriction enzyme name(s). Required for oneD normalization
--factor NUM [1] target mean value of a cell after normalization (can be used to
weight experiments before merging)
--prop_data FLOAT [1] Only for oneD normalization: proportion of data to be used in
fitting (for very large datasets). Number between 0 and 1.
--seed INT [1] Only for oneD normalization: seed number for the random picking
of data when using the "prop_data" parameter
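For instance, assuming the same hypothetical working directory, a Vanilla normalization at 100 kb resolution could be run as (values are placeholders):
tadbit normalize -w sample_rep1 -r 100000 --normalization Vanilla -C 8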
TADbit bin¶
Bin Hi-C data into matrices
usage: tadbit bin [-h] -w PATH [--noX] -r INT [--bam PATH] [-j INT] [--force] [-q]
[--tmpdb PATH] [--nchunks NCHUNKS] [-C CPUS] [--chr_name STR [STR ...]]
[--matrix] [--cooler] [--rownames] [--plot] [--force_plot] [--only_plot] [-i]
[--triangular] [--xtick_rotation XTICK_ROTATION] [--cmap CMAP]
[--bad_color BAD_COLOR] [--format FORMAT] [--zrange ZRANGE]
[--transform {log2,log,none}] [--figsize FIGSIZE] [--tad_def TAD_DEF] [-c]
[-c2] [--biases PATH] [--norm STR [STR ...]] [-F INT [INT ...]] [--only_txt]
optional arguments:
-h, --help show this help message and exit
Required options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
-r INT, --resolution INT
resolution at which to output matrices
General options:
--noX no display server (X screen)
--bam PATH path to a TADbit-generated BAM file with all reads (otherwise the
tool will guess from the working directory database)
-j INT, --jobid INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
--force overwrite previously run job
-q, --quiet remove all messages
--tmpdb PATH if provided uses this directory to manipulate the database
--nchunks NCHUNKS maximum number of chunks into which to cut the BAM
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--chr_name STR [STR ...]
[fasta header] chromosome name(s). Order of chromosomes in the
output matrices.
Read filtering options:
-F INT [INT ...], --filter INT [INT ...]
[[1, 2, 3, 4, 6, 7, 9, 10]] Use filters to define the set of valid
pairs of reads, e.g.: '--filter 1 2 3 4 8 9 10', where these
numbers correspond to: 0: nothing, 1: self-circle, 2: dangling-end,
3: error, 4: extra dangling-end, 5: too close from RES, 6: too
short, 7: too large, 8: over-represented, 9: duplicated, 10: random
breaks, 11: trans-chromosomic
Normalization options:
--biases PATH path to file with pre-calculated biases by columns
--norm STR [STR ...] [['raw']] normalization(s) to apply. Choices are: [norm, decay, raw,
raw&decay]
Output options:
--matrix Write the text matrix in multiple columns (square). By default matrices
are written in BED-like format (which is also the only way to get a raw
matrix with all values, including the ones in masked columns).
--cooler Write i,j,v matrix in cooler format instead of text.
--rownames To store row names in the output text matrix. WARNING: when not in
matrix format, this results in two extra columns
--only_plot [False] Skip writing matrix in text format.
-i, --interactive [False] Open matplotlib interactive plot (nothing will be saved).
-c , --coord Coordinates of the region to retrieve. By default the whole genome;
the argument can be either a chromosome name or coordinates in
the form: "-c chr3:110000000-120000000"
-c2 , --coord2 Coordinate of a second region to retrieve the matrix in the
intersection with the first region.
--only_txt Save only text file for matrices, not images
Plotting options:
--plot Plot matrix in desired format.
--force_plot Force plotting even with demoniacally big matrices (more than
5000x5000, or 1500x1500 with the interactive option).
--triangular [False] represents only half of the matrix. Note that this also results
in truly vectorial images of the matrix.
--xtick_rotation XTICK_ROTATION
[-25] x-tick rotation
--cmap CMAP [viridis] Matplotlib color map to use.
--bad_color BAD_COLOR [white] Matplotlib color to use on bins filtered out (only used with
normalized matrices, not raw).
--format FORMAT [png] plot file format.
--zrange ZRANGE Range of the color scale, in log2 scale, e.g.: --zrange=-2,2
--transform {log2,log,none}
[log2] can be any of [log2, log, none]
--figsize FIGSIZE Size of the figure. Default for triangular
matrices: --figsize=16,10 and for square matrices: --figsize=16,14
--tad_def TAD_DEF jobid of the TAD segmentation, or alternatively a TSV file with the
TAD definition, columns: # start end score density
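As an illustration, the command below would write and plot the normalized matrix of a hypothetical region at 100 kb resolution (working directory and coordinates are placeholders):
tadbit bin -w sample_rep1 -r 100000 --norm norm --plot -c chr3:110000000-120000000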
TADbit segment¶
Finds TAD or compartment segmentation in Hi-C data.
usage: tadbit segment [-h] -w PATH [--tmpdb PATH] [--nosql] [--all_bins] [--mreads PATH]
[--biases PATH] -r INT [--norm_matrix PATH] [--raw_matrix PATH]
[-F INT [INT ...]] [--noX] [--rich_in_A PATH] [--fasta PATH] [--savecorr]
[--fix_corr_scale] [--format FORMAT] [--n_evs INT]
[--ev_index INT [INT ...]] [--only_compartments] [--only_tads] [-v]
[-j INT] [-c STR [STR ...]] [--max_tad_size INT] [-C CPUS] [--force]
optional arguments:
-h, --help show this help message and exit
General options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
--tmpdb PATH if provided uses this directory to manipulate the database
--nosql do not load/store data from/in sqlite database
--all_bins skip the filtering of bins for detection of TADs
--mreads PATH path to the valid-pairs file (TADbit BAM format)
--biases PATH path to file with precalculated biases by columns
-r INT, --resolution INT
resolution at which to output matrices
--norm_matrix PATH path to a matrix file with normalized read counts
--raw_matrix PATH path to a matrix file with raw read counts
-F INT [INT ...], --filter INT [INT ...]
[[1, 2, 3, 4, 6, 7, 9, 10]] Use filters to define the set of valid
pairs of reads, e.g.: '--filter 1 2 3 4 8 9 10', where these
numbers correspond to: 1: self-circle, 2: dangling-end, 3: error, 4:
extra dangling-end, 5: too close from RES, 6: too short, 7: too
large, 8: over-represented, 9: duplicated, 10: random breaks, 11:
trans-chromosomic
--noX no display server (X screen)
--only_compartments search A/B compartments using the first eigenvector of the correlation
matrix
--only_tads search TAD boundaries using a break-point detection algorithm
-v, --verbose print more messages
-j INT, --jobid INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
-c STR [STR ...], --chromosomes STR [STR ...]
Name of the chromosomes on which to search for TADs or compartments.
-C CPUS, --cpu CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--force overwrite previously run job
Compartment calling options:
--rich_in_A PATH path to a BED or bedGraph file with a list of protein-coding genes or
other active epigenetic marks, to be used to label compartments
instead of the relative interaction count.
--fasta PATH Path to fasta file with genome sequence, to compute GC content and
use it to label compartments
--savecorr Save correlation matrix used to predict compartments.
--fix_corr_scale Scale the correlation matrix plot between correlations 1 and -1
instead of the maximum observed values.
--format FORMAT [png] file format for figures
--n_evs INT [3] Number of eigenvectors to store. If "-1", all eigenvectors will
be calculated
--ev_index INT [INT ...]
list of indexes of eigenvectors capturing compartments signal (one
index per chromosome, in the same order as chromosomes in fasta
file). Example picking the first eigenvector for all chromosomes
except chromosome 3: '--ev_index 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1'
TAD calling options:
--max_tad_size INT an integer defining the maximum size of a TAD. By default it is set
to the number of rows/columns
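A minimal sketch, assuming the same hypothetical working directory, calling both TADs and compartments at 100 kb resolution on one chromosome:
tadbit segment -w sample_rep1 -r 100000 -c chr3 -C 8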
TADbit model¶
Generates 3D models given an input interaction matrix and a set of input parameters
usage: tadbit model [-h] -w PATH [--input_matrix PATH] [--rand INT] [--nmodels INT]
[--nkeep INT] [-j INT] [--optimization_id INT] [--restart_id INT]
[--fig_format STR] [--noX] [--corr STR] [--species STRING]
[--assembly STRING] [--cell STRING] [--exp_type STRING] [--project STRING]
[--crm NAME] [--beg INT] [--end INT] [--matrix_beg INT] [-r INT]
[--perc_zero FLOAT] [--smooth_factor INT] [--optimize] [--model]
[--model_ptadbit] [--force] [--maxdist LIST [LIST ...]]
[--upfreq LIST [LIST ...]] [--lowfreq LIST [LIST ...]]
[--scale LIST [LIST ...]] [--dcutoff LIST [LIST ...]]
[--container LIST [LIST ...]] [--analyze] [-C CPUS] [--job_list]
[--nmodels_per_job INT] [--cpus_per_job INT] [--concurrent_jobs INT]
[--timeout_job INT] [--script_cmd STR] [--script_args STR]
[--script_template STR] [--tmpdb PATH] [--analyze_list INT [INT ...]]
[--not_write_cmm] [--not_write_xyz] [--not_write_json]
optional arguments:
-h, --help show this help message and exit
General options:
-w PATH, --workdir PATH path to working directory (generated with the tool TADbit mapper)
--input_matrix PATH In case input was not generated with the TADbit tools
--rand INT [1] random initial number. NOTE: when running a single model at a
time, it should be different for each run
--nmodels INT [5000] number of models to generate for modeling
--nkeep INT [1000] number of models to keep for modeling
-j INT, --jobid INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
--optimization_id INT [None] ID of a pre-run optimization batch job
--restart_id INT [None] ID of a job to be restarted, for example after building the
models in a cluster
--fig_format STR file format and extension for figures and plots (can be any
supported by matplotlib, png, eps...)
--noX no display server (X screen)
--corr STR correlation method used to compare contact maps and the original
matrix (options are spearman, pearson, kendall, logpearson, chi2, scc)
Descriptive, optional arguments:
--species STRING species name, with no spaces, i.e.: homo_sapiens
--assembly STRING NCBI ID of the original assembly (i.e.: NCBI36 for human)
--cell STRING cell type name
--exp_type STRING experiment type name (i.e.: Hi-C)
--project STRING project name
Modeling preparation:
--crm NAME chromosome name
--beg INT genomic coordinate from which to start modeling
--end INT genomic coordinate where to end modeling
--matrix_beg INT genomic coordinate of the first row/column of the input matrix. This
has to be specified if the input matrix is not in the abc format
generated by the TADbit tools
-r INT, --reso INT resolution of the Hi-C experiment
--perc_zero FLOAT
Parameter optimization:
--optimize optimization run, store less info about models
--model modelling run
--model_ptadbit modelling run using pTADbit
--force use input parameters, and skip any precalculated optimization
--maxdist LIST [LIST ...]
range of numbers for maxdist, i.e. 400:1000:100 -- or just a number
-- or a list of numbers
--upfreq LIST [LIST ...]
range of numbers for upfreq, i.e. 0:1.2:0.3 -- or just a number --
or a list of numbers
--lowfreq LIST [LIST ...]
range of numbers for lowfreq, i.e. -1.2:0:0.3 -- or just a number --
or a list of numbers
--scale LIST [LIST ...] [0.01] range of numbers to be tested as optimal scale value, i.e.
0.005:0.01:0.001 -- Can also pass only one number -- or a list of
numbers
--dcutoff LIST [LIST ...]
[2] range of numbers to be tested as optimal distance cutoff parameter
(distance, in number of beads, from which to consider 2 beads as
being close), i.e. 1:1.5:0.5 -- Can also pass only one number -- or
a list of numbers
--container LIST [LIST ...]
restrains particles to be within a given object. Can only be a
'cylinder', which is, in fact, a cylinder of a given height to which
hemispherical ends are added. This cylinder is defined by a radius,
its height (with a height of 0 the cylinder becomes a sphere) and
the force applied to the restraint. E.g. for modeling the E. coli genome
(2 micrometers long and 0.5 micrometers wide), these values
could be used: 'cylinder' 250 1500 50, and for a typical mammalian
nucleus (6 micrometers diameter): 'cylinder' 3000 0 50
--analyze analyze models.
Analysis:
--analyze_list INT [INT ...]
[2 3 4 5 6 7 8 9 10 11 12 13] list of numbers representing the
analysis to be done. Choose between: 0) do nothing 1) optimization
plot 2) correlation real/models 3) z-score plot 4) constraints 5)
objective function 6) centroid 7) consistency 8) density 9) contact
map 10) walking angle 11) persistence length 12) accessibility 13)
interaction
--not_write_cmm [False] do not generate cmm files for each model (Chimera input)
--not_write_xyz [False] do not generate xyz files for each model (3D coordinates)
--not_write_json [False] do not generate json file.
Running jobs:
--smooth_factor INT Hi-C matrix smoothing value of the mean kernel for pTADbit. Useful
in case of using matrices with low sequencing depth
-C CPUS, --cpu CPUS [32] Maximum number of CPU cores available in the execution host. If
higher than 1, tasks with multi-threading capabilities will enabled
(if 0 all available) cores will be used
--job_list generate a list of commands stored in a file named joblist_HASH.q
(where HASH is replaced by a string specific to the parameters
used). Note that dcutoff will never be split, as it does not require
re-running models.
--nmodels_per_job INT Number of models per distributed job.
--cpus_per_job INT Number of cpu nodes per distributed job.
--concurrent_jobs INT Number of concurrent jobs in distributed mode.
--timeout_job INT Time to wait for a concurrent job to finish before canceling it in
distributed mode.
--script_cmd STR Command to call the jobs in distributed mode.
--script_args STR Arguments to script_cmd to call the jobs in distributed mode.
--script_template STR Template to generate a file that script_cmd will call for each job
in distributed mode. Each __file__ marker in the template will be
replaced by the job file, __name__ by the job name and __dir__ by the
folder.
--tmpdb PATH if provided uses this directory to manipulate the database
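As a sketch, the optimization and modeling of a hypothetical 10 Mb region could be run in two passes; region, resolution, parameter ranges and model numbers are placeholders to tune to the data:
tadbit model -w sample_rep1 --optimize --crm chr3 --beg 110000000 --end 120000000 -r 100000 --maxdist 400:1000:100 --upfreq 0:0.6:0.3 --lowfreq -0.6:0:0.3 --nmodels 100 --nkeep 100 -C 8
tadbit model -w sample_rep1 --model --crm chr3 --beg 110000000 --end 120000000 -r 100000 --nmodels 500 --nkeep 100 -C 8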
TADbit merge¶
Load two working directories with different Hi-C data samples and merge them into a new
working directory, generating some statistics.
usage: tadbit merge [-h] [-w PATH] [-w1 PATH] [-w2 PATH] [--bam1 PATH] [--noX] [--bam2 PATH]
[-C CPUS] [-r INT] [--skip_comparison] [--skip_merge]
[--save STR [STR ...]] [--jobid1 INT] [--jobid2 INT] [--force] [--norm]
[--biases1 PATH] [--biases2 PATH] [--filter INT [INT ...]]
[--samtools PATH] [--tmpdb PATH]
optional arguments:
-h, --help show this help message and exit
General options:
-w PATH, --workdir PATH path to a new output folder
-w1 PATH, --workdir1 PATH
path to working directory of the first HiC data sample to merge
-w2 PATH, --workdir2 PATH
path to working directory of the second HiC data sample to merge
--bam1 PATH path to the first TADbit-generated BAM file with all reads (otherwise
the tool will guess from the working directory database)
--noX no display server (X screen)
--bam2 PATH path to the second TADbit-generated BAM file with all reads (otherwise
the tool will guess from the working directory database)
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
-r INT, --resolution INT
resolution at which to do the comparison, and generate the matrices.
--skip_comparison skip the comparison between replicates (faster). Comparisons are
performed at 3 levels: 1) comparing the first diagonals of each
experiment (generating an SCC score and standard deviation, see
https://doi.org/10.1101/gr.220640.117); 2) comparing the first
eigenvectors of the input experiments; 3) generating a reproducibility
score using the function from https://doi.org/10.1093/bioinformatics/btx152
--skip_merge skip the merge of replicates (faster).
--save STR [STR ...] [genome] save the genomic or chromosomal matrix.
--jobid1 INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
--jobid2 INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
--force overwrite previously run job
--norm compare normalized matrices
--biases1 PATH path to file with precalculated biases by columns
--biases2 PATH path to file with precalculated biases by columns
--filter INT [INT ...] [[1, 2, 3, 4, 6, 7, 9, 10]] Use filters to define the set of valid
pairs of reads, e.g.: '--filter 1 2 3 4 8 9 10', where these
numbers correspond to: 1: self-circle, 2: dangling-end, 3: error, 4:
extra dangling-end, 5: too close from RES, 6: too short, 7: too
large, 8: over-represented, 9: duplicated, 10: random breaks, 11:
trans-chromosomic
--samtools PATH path to the samtools binary
--tmpdb PATH if provided uses this directory to manipulate the database
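For example, two hypothetical replicate working directories could be compared and merged at 100 kb resolution into a new one (directory names are placeholders):
tadbit merge -w sample_merged -w1 sample_rep1 -w2 sample_rep2 -r 100000 -C 8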
TADbit describe¶
Describe jobs and results in a given working directory
usage: tadbit describe [-h] [-w PATH] [--noX] [-t [...]] [-T [...]] [-j INT [INT ...]]
[-W STR [STR ...]] [-s STR [STR ...]] [--tmpdb PATH] [--tsv] [-o OUTPUT]
optional arguments:
-h, --help show this help message and exit
General options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit map)
--noX no display server (X screen)
-t [ ...], --tables [ ...]
[['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12',
'13']] what tables to show, write either the sequence of names or
indexes, according to this list: 1: paths, 2: jobs, 3:
mapped_outputs, 4: mapped_inputs, 5: parsed_outputs, 6:
intersection_outputs, 7: filter_outputs, 8: normalize_outputs, 9:
merge_stats, 10: merge_outputs, 11: segment_outputs, 12: models, 13:
modeled_regions
-T [ ...], --skip_tables [ ...]
[[]] what tables NOT to show, write either the sequence of names or
indexes, according to this list: 1: paths, 2: jobs, 3:
mapped_outputs, 4: mapped_inputs, 5: parsed_outputs, 6:
intersection_outputs, 7: filter_outputs, 8: normalize_outputs, 9:
merge_stats, 10: merge_outputs, 11: segment_outputs, 12: models, 13:
modeled_regions
-j INT [INT ...], --jobids INT [INT ...]
Display only items matching these jobids.
-W STR [STR ...], --where STR [STR ...]
Select rows. List pairs of keywords (column header) and values to
filter results. For example to get only results where "18" appears
in the column "Chromosome", the option should be set as: `-W
Chromosome,18`
-s STR [STR ...], --select STR [STR ...]
Select columns. List the keywords (column headers) to be displayed.
E.g. to show only the column JobIds: `-s Jobids`
--tmpdb PATH if provided uses this directory to manipulate the database
--tsv Print output in tab separated format
-o OUTPUT, --output OUTPUT
Writes output in specified file.
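For instance, to list all tables of a hypothetical working directory, or to export only the jobs table (table 2) as TSV:
tadbit describe -w sample_rep1
tadbit describe -w sample_rep1 -t 2 --tsv -o jobs.tsv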
TADbit clean¶
Delete jobs and results for a given list of jobids in a given working directory
usage: tadbit clean [-h] [-w PATH] [-j INT [INT ...]] [--delete] [--compress] [--noX]
[--change_workdir PATH] [--tmpdb PATH]
optional arguments:
-h, --help show this help message and exit
--change_workdir PATH In case the folder was moved, input the new path
General options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
-j INT [INT ...], --jobids INT [INT ...]
jobids of the files and entries to be removed
--delete delete files, otherwise only DB entries are removed.
--compress compress files and update paths accordingly
--noX no display server (X screen)
--tmpdb PATH if provided uses this directory to manipulate the database
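As a sketch, with hypothetical jobids, the first command removes only the database entries and the second also deletes the associated files:
tadbit clean -w sample_rep1 -j 5 6
tadbit clean -w sample_rep1 -j 5 6 --delete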
TADbit import¶
Import Hi-C data to TADbit toy BAM
usage: tadbit import [-h] -w PATH -r INT [--format {text,matrix,cooler}] -i STR [-c]
[--tmpdb PATH] [-C CPUS] [--samtools PATH]
optional arguments:
-h, --help show this help message and exit
Required options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
-r INT, --resolution INT
resolution at which to output matrices
--format {text,matrix,cooler}
[text] can be any of [text, matrix, cooler]
-i STR, --input STR path to input file
General options:
-c , --coord Coordinates of the region to import. By default the whole genome; the
argument can be either a chromosome name or coordinates in the form:
"-c chr3:110000000-120000000"
--tmpdb PATH if provided uses this directory to manipulate the database
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--samtools PATH path to the samtools binary
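For example, a hypothetical cooler file could be imported into a working directory at 100 kb resolution (paths are placeholders):
tadbit import -w sample_external -r 100000 --format cooler -i sample_100kb.cool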
TADbit export¶
Export Hi-C data to other formats
usage: tadbit export [-h] -w PATH -r INT [--format {text,matrix,cooler,hic}] -o STR
[--bam PATH] [-j INT] [--force] [-q] [--tmpdb PATH] [--nchunks NCHUNKS]
[-C CPUS] [--chr_name STR [STR ...]] [--juicerjar PATH] [--rownames] [-c]
[-c2] [--biases PATH] [--norm] [-F INT [INT ...]]
optional arguments:
-h, --help show this help message and exit
Required options:
-w PATH, --workdir PATH path to working directory (generated with the tool tadbit mapper)
-r INT, --resolution INT
resolution at which to output matrices
--format {text,matrix,cooler,hic}
[text] can be any of [text, matrix, cooler, hic]
-o STR, --output STR path to output file
General options:
--bam PATH path to a TADbit-generated BAM file with all reads (otherwise the
tool will guess from the working directory database)
-j INT, --jobid INT Use as input data generated by a job with a given jobid. Use tadbit
describe to find out which.
--force overwrite previously run job
-q, --quiet remove all messages
--tmpdb PATH if provided uses this directory to manipulate the database
--nchunks NCHUNKS maximum number of chunks into which to cut the BAM
-C CPUS, --cpus CPUS [32] Maximum number of CPU cores available on the execution host. If
higher than 1, tasks with multi-threading capabilities will be
enabled; if set to 0, all available cores will be used
--chr_name STR [STR ...]
[fasta header] chromosome name(s). Order of chromosomes in the
output matrices.
--juicerjar PATH path to the juicer tools jar file needed to export matrices to hic
format (check https://github.com/aidenlab/juicer/wiki/Download).
Note that you also need java available in the path.
Read filtering options:
-F INT [INT ...], --filter INT [INT ...]
[[1, 2, 3, 4, 6, 7, 9, 10]] Use filters to define the set of valid
pairs of reads, e.g.: '--filter 1 2 3 4 8 9 10', where these numbers
correspond to: 0: nothing, 1: self-circle, 2: dangling-end, 3:
error, 4: extra dangling-end, 5: too close from RES, 6: too short,
7: too large, 8: over-represented, 9: duplicated, 10: random breaks,
11: trans-chromosomic
Normalization options:
--biases PATH path to file with pre-calculated biases by columns
--norm export normalized matrix
Output options:
--rownames To store row names in the output text matrix. WARNING: when not in
matrix format, this results in two extra columns
-c , --coord Coordinates of the region to retrieve. By default the whole genome;
the argument can be either a chromosome name or coordinates in
the form: "-c chr3:110000000-120000000"
-c2 , --coord2 Coordinate of a second region to retrieve the matrix in the
intersection with the first region.
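Finally, as an example, the data of a hypothetical working directory could be exported to cooler format, or to hic format if the juicer tools jar is available (paths are placeholders):
tadbit export -w sample_rep1 -r 100000 --format cooler -o sample_100kb.cool
tadbit export -w sample_rep1 -r 100000 --format hic -o sample_100kb.hic --juicerjar juicer_tools.jar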