Parsers

pytadbit.parsers.hic_parser.read_matrix(things, parser=None, hic=True, resolution=1, **kwargs)

Reads and checks a matrix from a file (using pytadbit.parsers.hic_parser.autoreader()) or a list.

Parameters
  • things – either a file name, a file handle, or a list of lists (all of the same length)

  • parser (None) –

    a parser function that returns a tuple of lists representing the data matrix. For example, with this file example.tsv:

    chrT_001    chrT_002    chrT_003    chrT_004
    chrT_001    629    164    88    105
    chrT_002    86    612    175    110
    chrT_003    159    216    437    105
    chrT_004    100    111    146    278
    

    the output of parser('example.tsv') might be: ([629, 86, 159, 100, 164, 612, 216, 111, 88, 175, 437, 146, 105, 110, 105, 278])

  • resolution (1) – resolution of the matrix

  • hic (True) – if False, TADbit assumes that the file contains normalized data

Returns

the corresponding matrix flattened into a single list, together with the number of rows
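As an illustration of the flattened layout described above, here is a minimal sketch of a custom parser for the example.tsv file. The function name example_parser is hypothetical and is not part of TADbit; it also assumes that returning the flat list together with the row count is acceptable, which should be checked against the library before passing such a function as the parser argument.

```python
import io

def example_parser(handle):
    """Hypothetical parser (not part of TADbit): read a labelled,
    whitespace-separated square matrix and return its values flattened
    column by column, plus the number of rows."""
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or label-only lines
        try:
            # first field is the row label, the rest are counts
            rows.append([int(value) for value in fields[1:]])
        except ValueError:
            continue  # header line: remaining fields are labels, not numbers
    nrows = len(rows)
    if any(len(row) != nrows for row in rows):
        raise ValueError("matrix is not square")
    # column-major flattening, matching the order shown in the example
    flat = [rows[i][j] for j in range(nrows) for i in range(nrows)]
    return flat, nrows

# In-memory copy of example.tsv
example_tsv = io.StringIO(
    "chrT_001\tchrT_002\tchrT_003\tchrT_004\n"
    "chrT_001\t629\t164\t88\t105\n"
    "chrT_002\t86\t612\t175\t110\n"
    "chrT_003\t159\t216\t437\t105\n"
    "chrT_004\t100\t111\t146\t278\n"
)
flat, nrows = example_parser(example_tsv)
```

With this layout, the value at row i and column j of the matrix sits at index j * nrows + i of the flat list.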

pytadbit.parsers.hic_parser.load_hic_data_from_reads(fnam, resolution, **kwargs)
Parameters
  • fnam – tsv file with reads1 and reads2

  • resolution – the resolution of the experiment (size of a bin in bases)

  • genome_seq – a dictionary containing the genomic sequence by chromosome

  • get_sections (False) – for very high resolution, when the column index does not fit in memory
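To make the role of resolution concrete, the sketch below shows one common way to turn a genomic position into a genome-wide bin index, given chromosome lengths. The helper names build_bin_offsets and position_to_bin are hypothetical, and the binning rule (length // resolution + 1 bins per chromosome) is an assumption for illustration, not a description of TADbit's internals.

```python
def build_bin_offsets(chrom_lengths, resolution):
    """Hypothetical helper: return the index of the first bin of each
    chromosome and the total number of bins in the genome."""
    offsets = {}
    total = 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = total
        # assumed rule: one extra bin to hold the trailing partial bin
        total += length // resolution + 1
    return offsets, total

def position_to_bin(chrom, pos, offsets, resolution):
    """Map a position (in bases) on a chromosome to a genome-wide bin."""
    return offsets[chrom] + pos // resolution

# Toy chromosome lengths, in bases
chrom_lengths = {"chr1": 2500, "chr2": 1000}
offsets, total_bins = build_bin_offsets(chrom_lengths, resolution=1000)
bin_index = position_to_bin("chr2", 999, offsets, resolution=1000)
```

A smaller resolution value means more bins, which is why the index can become too large to hold in memory at very high resolution.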

pytadbit.parsers.genome_parser.parse_fasta(f_names, chr_names=None, chr_filter=None, chr_regexp=None, verbose=True, save_cache=True, reload_cache=False, only_length=False)

Parse a list of FASTA files, or a single FASTA file.

WARNING: The order is important

Parameters
  • f_names – list of paths to files, or just a single path

  • chr_names (None) – a list of chromosome names, or just one. If None is passed, chromosome names are inferred from the FASTA headers

  • chr_filter (None) – use only chromosomes in the input list

  • chr_regexp (None) – use only chromosomes whose names match this regular expression

  • save_cache (True) – save a cached version of this file for faster loading (~4 times faster)

  • reload_cache (False) – reload cached genome

  • only_length (False) – return a dictionary with the length of the genome, not the sequence

Returns

a sorted dictionary with chromosome names as keys, and sequences as values (sequence in upper case)
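The shape of the returned dictionary can be illustrated with a simplified, self-contained FASTA reader. This is only a sketch: read_fasta is a hypothetical function, it ignores caching and filtering, and it assumes the chromosome name is the first word of each header line.

```python
import io

def read_fasta(handle, only_length=False):
    """Hypothetical, simplified FASTA reader (not TADbit's implementation):
    map chromosome names to upper-case sequences, or to their lengths."""
    sequences = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks).upper()
            header = line[1:].split()
            # assumption: the name is the first word of the header
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks).upper()
    if only_length:
        return {chrom: len(seq) for chrom, seq in sequences.items()}
    return sequences

fasta_text = ">chr1 test\nacgt\nNNac\n>chr2\nTTga\n"
genome = read_fasta(io.StringIO(fasta_text))
lengths = read_fasta(io.StringIO(fasta_text), only_length=True)
```

Chromosomes appear in the dictionary in the order they are read, which is one reason the order of the input files matters.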

pytadbit.parsers.sam_parser.parse_sam(f_names1, f_names2=None, out_file1=None, out_file2=None, genome_seq=None, re_name=None, verbose=False, clean=True, mapper=None, **kwargs)

Parse SAM/BAM files using pysam tools.

Keep a summary of the results in 2 tab-separated files, each containing 7 columns: read ID, chromosome, position, strand (either 0 or 1), mapped sequence length, position of the closest upstream RE site, position of the closest downstream RE site

Parameters
  • f_names1 – a list of paths to SAM/BAM files corresponding to the mapping of read1; can also be a single file

  • f_names2 – a list of paths to SAM/BAM files corresponding to the mapping of read2; can also be a single file

  • out_file1 – path to the tab-separated output file containing mapped read1 information

  • out_file2 – path to the tab-separated output file containing mapped read2 information

  • genome_seq – a dictionary generated by pytadbit.parsers.genome_parser.parse_fasta(), containing the genomic sequence

  • re_name – name of the restriction enzyme used

  • mapper (None) – software used to map (supported are GEM and BOWTIE2). Guessed from file by default.
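A line of the tab-separated summary described above could be read back as follows. This is a sketch under assumptions: the MappedRead record and parse_read_line function are hypothetical names, the sample line is made up for illustration, and all positional fields are assumed to be integers.

```python
from collections import namedtuple

# Field names follow the column list in the description above
MappedRead = namedtuple(
    "MappedRead",
    ["read_id", "chrom", "pos", "strand", "length", "re_up", "re_down"],
)

def parse_read_line(line):
    """Hypothetical helper: split one tab-separated summary line into a record."""
    read_id, chrom, pos, strand, length, re_up, re_down = (
        line.rstrip("\n").split("\t")
    )
    # assumption: every field after the chromosome name is an integer
    return MappedRead(read_id, chrom, int(pos), int(strand),
                      int(length), int(re_up), int(re_down))

# Made-up line, for illustration only
line = "read_0001\tchr1\t15230\t1\t50\t15000\t15400\n"
read = parse_read_line(line)
```

Unpacking into exactly seven names makes the function fail loudly on a line with a different number of columns.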

pytadbit.parsers.tad_parser.parse_tads(handler)

Parse a tab-separated value file that contains the list of TADs of a given experiment. This file might have been generated with pytadbit.tadbit.print_result_R() or with the R binding for tadbit

Parameters
  • handler – path to file

  • bin_size (1) – resolution of the experiment

Returns

list of TADs and list of weights, each TAD being a dict of type:

{TAD_num: {'start': start,
           'end'  : end,
           'brk'  : end,
           'score': score}}
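The dictionary layout above can be used, for example, to compute TAD sizes in bases. The sketch below is illustrative only: tad_sizes is a hypothetical helper, the TAD values are invented, and it assumes that start and end are inclusive bin indices.

```python
def tad_sizes(tads, bin_size=1):
    """Hypothetical helper: size of each TAD in bases, assuming 'start'
    and 'end' are inclusive bin indices."""
    return {
        num: (tad["end"] - tad["start"] + 1) * bin_size
        for num, tad in tads.items()
    }

# Invented TADs following the structure shown above
tads = {
    1: {"start": 0, "end": 9, "brk": 9, "score": 7.0},
    2: {"start": 10, "end": 24, "brk": 24, "score": 4.0},
}
sizes = tad_sizes(tads, bin_size=100000)
```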