Tools for the analysis of ChIP-seq data
DNA sequencing has entered a new era with the development of massively parallel sequencing technologies. Chromatin immunoprecipitation (ChIP) enriches genomic DNA fragments based on their interaction with specific proteins. Combined with high-throughput sequencing of these fragments (ChIP-seq), the technique generates millions of short sequence reads or tags (generally 30 to 50 bp in length) that are subsequently mapped back to the reference genome. ChIP-seq thereby provides a comprehensive map of the genomic loci that share a common binding site or a particular epigenetic modification.
We propose a set of software modules for performing common ChIP-seq data analysis tasks across the whole genome, including positional correlation analysis, peak detection, and genome partitioning into signal-rich and signal-poor regions.
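To make the idea of positional correlation analysis more concrete, here is a minimal Python sketch. It is not the algorithm implemented in the C programs; it only illustrates the general principle of counting, for each reference position, how many target tags fall at each relative distance within a window. The function name `positional_correlation`, its parameters, and the sample positions are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def positional_correlation(reference, target, window):
    """Count target tags at each distance from each reference position.

    reference, target: lists of integer positions on the same sequence.
    window: maximum absolute distance (in bp) to consider.
    Hypothetical helper, for illustration only.
    """
    target = sorted(target)
    histogram = Counter()
    for ref_pos in reference:
        # Binary search for the slice of target tags inside the window.
        lo = bisect_left(target, ref_pos - window)
        hi = bisect_right(target, ref_pos + window)
        for tag_pos in target[lo:hi]:
            histogram[tag_pos - ref_pos] += 1
    return histogram


# Invented example positions; real input would come from mapped tags.
hist = positional_correlation([100, 200], [90, 105, 195, 210, 400], window=10)
for distance in sorted(hist):
    print(distance, hist[distance])
```

Sorting the target positions once and using binary search keeps the work per reference position proportional to the number of tags inside the window rather than to the size of the whole data set.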
Currently, the ChIP-Seq tools exist as stand-alone C programs and include the following modules:
The ChIP-seq tools have been developed by Giovanna Ambrosini at the School of Life Sciences of the Ecole Polytechnique Fédérale de Lausanne (EPFL/FSV).
The ChIP-Seq tools are designed to be simple, fast and highly modular. Each program carries out a well-defined data processing procedure that can be used as one step in a pipeline.
As an internal working format, the ChIP-Seq programs use a compact format called SGA (Simplified Genome Annotation). SGA files are single-line-oriented and tab-delimited text files with the following five mandatory fields:
Any number of additional fields may be added containing application-specific information.
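As an illustration of reading such a record, the following Python sketch splits one tab-delimited line into five mandatory fields and keeps any additional fields as-is. The field names and their order (sequence name, feature, position, strand, count) are an assumption made for this example, as are the `SgaRecord` type, the `parse_sga_line` helper, and the sample line; none of them is part of the distributed tools.

```python
from typing import NamedTuple


class SgaRecord(NamedTuple):
    # Assumed names and order of the five mandatory fields.
    seq_name: str
    feature: str
    position: int
    strand: str
    count: int
    extra: tuple  # any additional, application-specific fields


def parse_sga_line(line):
    """Parse one tab-delimited SGA line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(
            f"expected at least 5 tab-delimited fields, got {len(fields)}"
        )
    seq_name, feature, position, strand, count = fields[:5]
    return SgaRecord(
        seq_name, feature, int(position), strand, int(count), tuple(fields[5:])
    )


# Invented sample line with one extra field.
record = parse_sga_line("chr1\tCTCF\t10500\t+\t3\tnote=example\n")
print(record)
```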
The ChIP-Seq programs require SGA files to be sorted by sequence name, position, and strand. Note that SGA is a generic format that can also represent other genome annotations, e.g. the location of transcription start sites (TSS) or cross-genome conservation scores. Features without an orientation are assigned a strand value of 0.
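The sort requirement can be checked before running an analysis. The Python sketch below assumes that each record is a tuple of (sequence name, feature, position, strand, count), that positions compare numerically, and that strands compare as plain strings. Both the field layout and the strand ordering are assumptions of this example rather than documented behaviour of the programs, and `sga_sort_key` and `is_sorted` are hypothetical helpers.

```python
def sga_sort_key(record):
    """Sort key: sequence name, then numeric position, then strand."""
    seq_name, _feature, position, strand, _count = record[:5]
    return (seq_name, position, strand)


def is_sorted(records):
    """Return True if records are in non-decreasing sort-key order."""
    keys = [sga_sort_key(r) for r in records]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Invented records, deliberately out of order.
records = [
    ("chr2", "TSS", 500, "0", 1),
    ("chr1", "TSS", 900, "-", 1),
    ("chr1", "TSS", 900, "+", 2),
    ("chr1", "TSS", 100, "+", 1),
]
print(is_sorted(records))

records.sort(key=sga_sort_key)
print(is_sorted(records))
for rec in records:
    print("\t".join(str(field) for field in rec))
```

Comparing positions as integers matters here: a purely textual sort would place position 1000 before position 900.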
The programs can process an entire SGA-formatted data file (which can be several hundred MB in size) in a few minutes, thus allowing high-throughput genomic data analysis.
The programs are documented by UNIX-style man pages and a README file that explains the installation procedure. The current distribution also contains a number of auxiliary Perl scripts for reformatting and other pre- and post-processing tasks.