Introduction

Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-seq) is a powerful method to determine how transcription factors and other chromatin-associated proteins interact with DNA in order to regulate gene transcription. A single ChIP-seq experiment produces large amounts of highly reproducible data. The challenge is to extract knowledge from the data by thoughtful application of appropriate bioinformatics tools.

We have developed a set of software applications for performing common ChIP-seq data analysis tasks across the whole genome, including positional correlation analysis, peak detection, and genome partitioning into signal-rich and signal-poor regions.

The ChIP-Seq tools are stand-alone C programs:

  • chipcor : positional correlation and generation of an aggregation plot for two genomic features;
  • chipextract : extraction of specific genome annotation features around reference genomic anchor points;
  • chipcenter : read shifting;
  • chippeak : narrow peak caller that uses a fixed width peak size;
  • chippart : broad peak caller for broad regions of enrichment (e.g. histone marks);
  • chipscore : feature selection tool based on a read count threshold.

Project Description

The ChIP-Seq tools have been designed to be simple, fast and highly modular. Each program carries out a single, well-defined data processing step, so the programs can be combined in a pipeline.

As an internal working format, the ChIP-Seq programs use a compact format called SGA (Simplified Genome Annotation). SGA files are line-oriented, tab-delimited text files with the following five mandatory fields:

  • Sequence name (Char String)
  • Feature (Char String)
  • Sequence Position (Integer)
  • Strand (+/- or 0)
  • Read Counts (Integer)

Any number of additional fields may be added containing application-specific information.
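As an illustration of the format, the following is a minimal sketch of how one SGA line could be split into its five mandatory fields. It is not the code used by the ChIP-Seq programs: the sga_record type, the sga_parse_line function and the 64-byte field limit are hypothetical names and choices made for this example only.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for illustration only. */
typedef struct {
    char seq_name[64];  /* sequence name */
    char feature[64];   /* feature */
    long position;      /* sequence position */
    char strand;        /* '+', '-' or '0' */
    long count;         /* read counts */
} sga_record;

/* Copy one non-empty, tab-terminated text field into dst and advance *p
   past the tab. Returns -1 if the field is missing, empty or too long. */
static int copy_field(const char **p, char *dst, size_t cap)
{
    const char *end = strchr(*p, '\t');
    if (end == NULL)
        return -1;
    size_t len = (size_t)(end - *p);
    if (len == 0 || len >= cap)
        return -1;
    memcpy(dst, *p, len);
    dst[len] = '\0';
    *p = end + 1;
    return 0;
}

/* Parse a decimal integer that starts exactly at *p (no leading blanks)
   and advance *p to the first character after it. */
static int parse_long(const char **p, long *out)
{
    char *end;
    if (!isdigit((unsigned char)**p) && **p != '-')
        return -1;
    *out = strtol(*p, &end, 10);
    if (end == *p)
        return -1;
    *p = end;
    return 0;
}

/* Parse one SGA line. Returns 0 on success and -1 if any of the five
   mandatory fields is missing or malformed. Additional fields after the
   read count are accepted and ignored. */
int sga_parse_line(const char *line, sga_record *rec)
{
    const char *p = line;

    if (copy_field(&p, rec->seq_name, sizeof rec->seq_name) != 0)
        return -1;
    if (copy_field(&p, rec->feature, sizeof rec->feature) != 0)
        return -1;

    if (parse_long(&p, &rec->position) != 0 || *p != '\t')
        return -1;
    p++;

    if ((*p != '+' && *p != '-' && *p != '0') || p[1] != '\t')
        return -1;
    rec->strand = *p;
    p += 2;

    if (parse_long(&p, &rec->count) != 0)
        return -1;
    /* The count ends the line or is followed by optional extra fields. */
    if (*p != '\0' && *p != '\n' && *p != '\r' && *p != '\t')
        return -1;
    return 0;
}
```

With this sketch, a line such as "chr1<TAB>CTCF<TAB>12345<TAB>+<TAB>3" yields a record with position 12345, strand '+' and a read count of 3, while a line with fewer than five fields is rejected.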

The ChIP-Seq programs require SGA files to be sorted by sequence name, position, and strand. Note that SGA is a generic format that can be used to represent other genome annotations, e.g. the location of transcription start sites (TSS) or cross-genome conservation scores. Features without orientation are assigned a strand value of 0.
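The required sort order can be sketched as a comparison function for the standard qsort routine. This is an illustrative sketch under stated assumptions, not the ordering code of the distribution: the sga_record type and the function names are hypothetical, sequence names are assumed to compare byte-wise with strcmp, and strands are assumed to be ordered by character code ('+' before '-' before '0'). The collation expected by the actual programs should be checked against their man pages.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for illustration only. */
typedef struct {
    char seq_name[64];
    char feature[64];
    long position;
    char strand;   /* '+', '-' or '0' */
    long count;
} sga_record;

/* qsort-compatible comparison: sequence name, then position, then strand.
   Assumption: byte-wise name comparison and character-code strand order. */
int sga_compare(const void *a, const void *b)
{
    const sga_record *x = a;
    const sga_record *y = b;

    int c = strcmp(x->seq_name, y->seq_name);
    if (c != 0)
        return c;
    if (x->position != y->position)
        return x->position < y->position ? -1 : 1;
    return (x->strand > y->strand) - (x->strand < y->strand);
}

/* Return 1 if the n records are already in the order defined above. */
int sga_is_sorted(const sga_record *recs, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (sga_compare(&recs[i - 1], &recs[i]) > 0)
            return 0;
    }
    return 1;
}
```

An array of records can then be ordered with qsort(recs, n, sizeof recs[0], sga_compare), and sga_is_sorted can serve as a cheap check before a file is passed on to the next step of a pipeline.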

The programs can process an entire SGA-formatted data file (which can be several hundred megabytes in size) in a few minutes, thus allowing high-throughput genomic data analysis.

The programs are documented by UNIX-style man pages and a README file that explains the installation procedure. The current distribution also contains a number of auxiliary Perl scripts and C programs for reformatting and other pre- and post-processing tasks.