Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) is a powerful method to determine how transcription factors and other chromatin-associated proteins interact with DNA in order to regulate gene transcription. A single ChIP-seq experiment produces large amounts of highly reproducible data. The challenge is to extract knowledge from the data by thoughtful application of appropriate bioinformatics tools.

We have developed a set of software applications for performing common ChIP-seq data analysis tasks across the whole genome, including positional correlation analysis, peak detection, and genome partitioning into signal-rich and signal-poor regions.

The ChIP-Seq tools are stand-alone C programs:

  • chipcor : generates a positional correlation histogram for two genomic features;
  • chipextract : extracts ChIP-seq tags located within a given distance of reference anchor points (the reference feature);
  • chipcenter : shifts tags mapping to the '+' or '-' strand to the estimated center-positions of DNA fragments;
  • chippeak : finds regions (peaks) in the genome that are enriched in ChIP-seq tags relative to what would be expected by chance;
  • chippart : partitions the genome into signal-rich and signal-poor regions, e.g. to define chromatin domains from ChIP-seq data for histone modifications;
  • chipscore : scores a list of genomic positions with tag counts from a reference feature, e.g. ChIP-seq peaks with ChIP tag counts or conservation scores.
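
To make the idea of a positional correlation histogram concrete, the following is a minimal sketch in C. It is not the chipcor implementation: the function name, argument layout, and the fixed bin width of one base are assumptions made for illustration, and strand handling and normalization are left out. Both position arrays are assumed to be sorted in ascending order and to refer to the same sequence.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative sketch only (hypothetical name and signature, not chipcor).
 * For every reference position, add the counts of all target tags found at
 * relative offsets d in [-window, +window] to hist[d + window].
 * Assumptions: ref and tgt are sorted ascending and lie on one sequence;
 * hist has 2*window + 1 slots that the caller has set to zero.
 */
void pos_corr_histogram(const long *ref, size_t n_ref,
                        const long *tgt, const long *tgt_count, size_t n_tgt,
                        long window, long *hist)
{
    assert(window >= 0 && hist != NULL);

    size_t lo = 0; /* first target that can still fall inside a window */

    for (size_t i = 0; i < n_ref; i++) {
        /* Skip targets that lie left of the current window. Because ref is
           sorted, they are also left of every later window. */
        while (lo < n_tgt && tgt[lo] < ref[i] - window)
            lo++;

        /* Accumulate every target inside [ref[i] - window, ref[i] + window]. */
        for (size_t j = lo; j < n_tgt && tgt[j] <= ref[i] + window; j++)
            hist[tgt[j] - ref[i] + window] += tgt_count[j];
    }
}
```

A caller would allocate a zeroed array of 2*window + 1 counters, pass the positions and tag counts of the two features, and then read hist[d + window] as the total number of target tags observed at distance d from a reference position. Since the lower bound only moves forward, one pass over each sorted input is enough apart from the tags that fall inside overlapping windows.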

Project Description

The ChIP-Seq tools have been designed to be simple, fast and highly modular. Each program carries out one well-defined data processing step, so the programs can be combined in a pipeline.

As an internal working format, the ChIP-Seq programs use a compact format called SGA (Simplified Genome Annotation). SGA files are line-oriented, tab-delimited text files in which each line has the following five mandatory fields:

  • Sequence name (Char String)
  • Feature (Char String)
  • Sequence Position (Integer)
  • Strand (+/- or 0)
  • Tag Counts (Integer)

Any number of additional fields may be added containing application-specific information.
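
The five mandatory fields map naturally onto a small C structure. The sketch below shows one way to read a single SGA line. The type and function names, and the 63-character limit on the two string fields, are assumptions made for illustration and are not taken from the ChIP-Seq sources.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SGA_NAME_MAX 64 /* assumed limit, including the terminating NUL */

/* The five mandatory SGA fields (illustrative type, not from the tools). */
typedef struct {
    char seq[SGA_NAME_MAX];     /* sequence name */
    char feature[SGA_NAME_MAX]; /* feature */
    long pos;                   /* sequence position */
    char strand;                /* '+', '-' or '0' */
    long count;                 /* tag count */
} sga_record;

/*
 * Parse one tab-delimited SGA line into *out.
 * Returns 0 on success and -1 if a mandatory field is missing or the strand
 * is not one of '+', '-', '0'. Fields after the fifth are ignored.
 * Assumption: both string fields are shorter than SGA_NAME_MAX characters.
 */
int sga_parse_line(const char *line, sga_record *out)
{
    assert(line != NULL && out != NULL);

    int n = sscanf(line, "%63[^\t]\t%63[^\t]\t%ld\t%c\t%ld",
                   out->seq, out->feature, &out->pos,
                   &out->strand, &out->count);
    if (n != 5)
        return -1;
    if (out->strand != '+' && out->strand != '-' && out->strand != '0')
        return -1;
    return 0;
}
```

With this sketch, a made-up line holding the fields chr1, CTCF, 1000, + and 3 separated by tabs would yield a record with pos 1000, strand '+' and count 3. Any application-specific fields after the tag count are simply left unread.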

The ChIP-Seq programs require SGA files to be sorted by sequence name, position, and strand. Note that SGA is a generic format that can also represent other genome annotations, e.g. the location of transcription start sites (TSS) or cross-genome conservation scores. Orientation-less features are given a strand value of 0.
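
The required ordering can be expressed as a three-level comparison. The following sketch checks whether a list of records is sorted accordingly. The names are hypothetical, and two details are assumptions rather than documented behaviour: sequence names are compared byte by byte with strcmp, and strands are ordered by character code, which places '+' before '-' before '0'.

```c
#include <assert.h>
#include <string.h>

/* Sort key of an SGA record (illustrative type, not from the tools). */
typedef struct {
    const char *seq; /* sequence name */
    long pos;        /* sequence position */
    char strand;     /* '+', '-' or '0' */
} sga_key;

/*
 * Compare by sequence name, then position, then strand.
 * Assumption: names compare byte-wise and strands by character code.
 * Returns a negative, zero or positive value, like strcmp.
 */
int sga_key_cmp(const sga_key *a, const sga_key *b)
{
    int c = strcmp(a->seq, b->seq);
    if (c != 0)
        return c;
    if (a->pos != b->pos)
        return (a->pos < b->pos) ? -1 : 1;
    return (a->strand > b->strand) - (a->strand < b->strand);
}

/* Return the index of the first out-of-order record, or -1 if sorted. */
long sga_first_unsorted(const sga_key *keys, size_t n)
{
    assert(keys != NULL || n == 0);

    for (size_t i = 1; i < n; i++)
        if (sga_key_cmp(&keys[i - 1], &keys[i]) > 0)
            return (long)i;
    return -1;
}
```

A check of this kind is cheap to run before handing a file to a program that relies on the sort order, since it needs only one pass and one record of look-behind.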

The programs can process an entire SGA-formatted data file (which can be several hundred megabytes in size) in a few minutes, thus allowing high-throughput genomic data analysis.

The programs are documented by UNIX-style man pages and a README file that explains the installation procedure. The current distribution also contains a number of auxiliary Perl scripts for reformatting and other pre- and post-processing tasks.