CAP3 is a FragmentAssembly program (ContigAssembly program). Such programs are generally used to process sequences in the WholeGenomeShotGun sequencing method (LargeScaleSequencing).

http://seq.cs.iastate.edu/

Introduction

We have made the following improvements to the CAP sequence assembly program.

  1. Use of forward-reverse constraints to correct assembly errors and link contigs.
  2. Use of base quality values in alignment of sequence reads.
  3. Automatic clipping of 5' and 3' poor regions of reads.
  4. Generation of assembly results in ace file format for Consed.
  5. CAP3 can be used in GAP4 of the Staden package.

The improved program is named CAP3. These improvements allow CAP3 to handle longer sequences with higher error rates and to produce more accurate consensus sequences.

Use of constraints in layout generation

A forward-reverse constraint is often produced by sequencing both ends of a subclone. A forward-reverse constraint specifies that the two reads should be on the opposite strands of the DNA molecule within a specified range of distance.

CAP3 makes use of a large number of forward-reverse constraints to locate and correct errors in layout of sequence reads. This capability allows CAP3 to address assembly errors due to repeats.

CAP3 also uses constraints to link contigs separated by a gap. This feature provides useful information to sequence finishers.

The algorithm used in CAP3 is designed to tolerate wrong constraints, which are due to errors in naming and lane tracking.

Use of quality values in alignment

CAP3 makes use of base quality values in constructing an alignment of sequence reads and generating a consensus sequence for each contig.

This allows the program to use both base quality values and the depth of coverage at a position to improve the accuracy in generating a consensus base at the position.
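To illustrate how quality values and depth of coverage can interact at a single position, here is a minimal Python sketch. It is not CAP3's actual scoring procedure: it simply sums the quality values per base in one alignment column and picks the base with the largest total. The function name `consensus_base` and the sample values are hypothetical.

```python
from collections import defaultdict


def consensus_base(column):
    """Pick a consensus base for one alignment column.

    column: list of (base, quality) pairs, one per read covering
    the position. Returns (base, summed_quality).
    """
    if not column:
        raise ValueError("column has no coverage")
    totals = defaultdict(int)
    for base, quality in column:
        totals[base] += quality
    # Highest summed quality wins; ties are broken alphabetically
    # so the result is deterministic.
    best = min(totals, key=lambda b: (-totals[b], b))
    return best, totals[best]


# One high-quality G outweighs two low-quality A calls ...
print(consensus_base([("A", 10), ("A", 12), ("G", 40)]))
# ... but two moderately good A calls outweigh the same G.
print(consensus_base([("A", 20), ("A", 25), ("G", 40)]))
```

In this simplified rule, a deeper column contributes more terms to the sum and a higher-quality call contributes a larger term, so both factors influence the chosen base.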

The alignment method in CAP3 is highly tolerant of reads with high sequencing error rates.

Automatic clipping of 5' and 3' poor regions

CAP3 clips 5' and 3' poor regions of reads and uses only good regions of reads in assembly. Thus there is no need to perform clipping in advance. Note that vector sequences in reads must be masked before using CAP3.
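The idea of keeping only the good region of a read can be sketched as follows. This is a generic maximum-scoring-segment rule used as a stand-in, not CAP3's own clipping criteria, which are not described here. The function name `good_region` and the `cutoff` value of 20 are assumptions made for illustration.

```python
def good_region(quals, cutoff=20):
    """Return (start, end) of the half-open slice of `quals` that
    maximizes the sum of (quality - cutoff).

    Returns (0, 0) when no base has quality above the cutoff.
    """
    best_sum, best_start, best_end = 0, 0, 0
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        # A run whose score has dropped to zero or below cannot
        # help any later segment, so start a new run here.
        if run_sum <= 0:
            run_sum, run_start = 0, i
        run_sum += q - cutoff
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end


quals = [5, 8, 30, 35, 40, 32, 6, 4]
start, end = good_region(quals)
print(start, end)  # the 5' bases before `start` and 3' bases from `end` on would be clipped
```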

Input to CAP3

CAP3 takes as input a file of sequence reads in FASTA format. It also accepts two optional files: a file of quality values in FASTA format and a file of forward-reverse constraints.

The file of quality values must be named "xyz.qual", and the file of forward-reverse constraints must be named "xyz.con", where "xyz" is the name of the sequence file. CAP3 uses the same format of a quality file as Phrap. An example including input and output files is available.
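As a minimal sketch of the naming convention, the Python helpers below derive the two companion file names from a sequence file name and format one quality record in the Phrap-style layout (a FASTA header line followed by whitespace-separated integer quality values). The helper names `companion_files` and `format_qual_record`, and the choice of 20 values per line, are hypothetical.

```python
def companion_files(seq_file):
    """Return the (quality, constraint) file names for a sequence file.

    Both names are the sequence file name plus a suffix,
    e.g. "xyz" -> ("xyz.qual", "xyz.con").
    """
    seq_file = str(seq_file)
    return seq_file + ".qual", seq_file + ".con"


def format_qual_record(read_name, quals, per_line=20):
    """Format one read's quality values as a FASTA-style record."""
    lines = [">" + read_name]
    for i in range(0, len(quals), per_line):
        chunk = quals[i:i + per_line]
        lines.append(" ".join(str(q) for q in chunk))
    return "\n".join(lines) + "\n"


print(companion_files("xyz"))
print(format_qual_record("CPBKY55.F", [10, 20, 30]), end="")
```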

Each line of the constraint file specifies one forward-reverse constraint of the form:

ReadA ReadB MinDistance MaxDistance

where ReadA and ReadB are names of two reads, and MinDistance and MaxDistance are distances (integers) in base pairs. The constraint is satisfied if ReadA in forward orientation occurs in a contig before ReadB in reverse orientation, or ReadB in forward orientation occurs in a contig before ReadA in reverse orientation, and their distance is between MinDistance and MaxDistance.
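The satisfaction rule can be sketched in Python as follows. This is an illustration under stated assumptions, not CAP3's implementation: the `Placement` record, the helper names, and the choice to measure distance between the start positions of the two reads are all hypothetical.

```python
from typing import NamedTuple


class Constraint(NamedTuple):
    read_a: str
    read_b: str
    min_dist: int
    max_dist: int


class Placement(NamedTuple):
    contig: str
    start: int     # leftmost position of the read in the contig (assumed)
    forward: bool  # True if the read is placed in forward orientation


def parse_constraint(line):
    """Parse one 'ReadA ReadB MinDistance MaxDistance' line."""
    read_a, read_b, min_dist, max_dist = line.split()
    return Constraint(read_a, read_b, int(min_dist), int(max_dist))


def is_satisfied(c, placements):
    """Check a constraint against a dict of read name -> Placement."""
    a = placements.get(c.read_a)
    b = placements.get(c.read_b)
    if a is None or b is None or a.contig != b.contig:
        return False
    # Either read may be the forward one, as long as it comes first
    # and the other read is in reverse orientation.
    for first, second in ((a, b), (b, a)):
        if first.forward and not second.forward and first.start <= second.start:
            distance = second.start - first.start
            return c.min_dist <= distance <= c.max_dist
    return False


c = parse_constraint("CPBKY55.F CPBKY55.R 500 6000")
placements = {
    "CPBKY55.F": Placement("contig1", 100, True),
    "CPBKY55.R": Placement("contig1", 3310, False),
}
print(is_satisfied(c, placements))
```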

We have a separate program to generate a constraint file from the sequence file. CAP3 works better when more constraints are supplied.

Output from CAP3

Assembly results in CAP format go to the standard output and need to be directed to a file. Note that clipped 5' and 3' sequences of reads are not shown in CAP3 format output.

CAP3 also produces assembly results in ace file format (".ace"). This allows CAP3 output to be viewed in Consed. Note that clipped 5' and 3' sequences of reads are shown in ace format output.

CAP3 saves consensus sequences in file ".contigs" and their quality values in file ".contigs.qual". Reads that are not used in assembly are put in file ".singlets". Additional information about assembly is given in file ".info".

The CAP3 program reports whether each constraint is satisfied or not. The report is in file ".con.results". A sample report file is given here:

CPBKY55.F CPBKY55.R 500 6000 3210 satisfied
CPBKY92.F CPBKY92.R 500 6000 497 unsatisfied in distance
CPBKY28.F CPBKY28.R 500 6000 unsatisfied
CPBKY56.F CPBKY56.R 500 6000 10th link between CPBKI23.F+ and CPBKT37.R-

The first four columns are taken from the constraint file. Line 1 indicates that the constraint is satisfied; the actual distance between the two reads is given in the fifth column. Line 2 indicates that the constraint is not satisfied in distance: the two reads occur in opposite orientations in the same contig, but their distance (given in the fifth column) is outside the specified range. Line 3 indicates that the constraint is not satisfied. Line 4 indicates that this constraint is the 10th one that links two contigs, where the 3' read of one contig is "CPBKI23.F" in plus orientation and the 5' read of the other is "CPBKT37.R" in minus orientation. This information suggests that the two contigs should be joined in the gap closure phase.
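A report of this kind could be read with a small Python parser such as the one below. It is a sketch that assumes the layout inferred from the four sample report entries only (four constraint fields, an optional integer distance, then a status phrase); the function name `parse_result_line` and the returned dictionary keys are hypothetical.

```python
def parse_result_line(line):
    """Parse one report entry into a dictionary."""
    fields = line.split()
    record = {
        "read_a": fields[0],
        "read_b": fields[1],
        "min_dist": int(fields[2]),
        "max_dist": int(fields[3]),
        "distance": None,  # only present for some entries
        "status": None,
        "detail": None,
    }
    rest = fields[4:]
    # A purely numeric fifth column is the observed distance.
    if rest and rest[0].isdigit():
        record["distance"] = int(rest[0])
        rest = rest[1:]
    text = " ".join(rest)
    if "link between" in text:
        record["status"] = "link"
        record["detail"] = text
    else:
        record["status"] = text
    return record


sample = [
    "CPBKY55.F CPBKY55.R 500 6000 3210 satisfied",
    "CPBKY92.F CPBKY92.R 500 6000 497 unsatisfied in distance",
    "CPBKY28.F CPBKY28.R 500 6000 unsatisfied",
    "CPBKY56.F CPBKY56.R 500 6000 10th link between CPBKI23.F+ and CPBKT37.R-",
]
for entry in sample:
    rec = parse_result_line(entry)
    print(rec["read_a"], rec["status"], rec["distance"])
```

Entries with status "link" are the ones of interest for gap closure, since they name the reads at the facing ends of two contigs.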

CAP3 takes 20 to 60 minutes to assemble a cosmid or BAC data set on a Sun Ultra1 workstation.

Reference

Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Research, 9: 868-877.

Availability

The CAP3 program is available upon request from Xiaoqiu Huang at huang@mtu.edu

Documentation on CAP3 is available at http://genome.cs.mtu.edu/sas.html

Acknowledgments

I thank Jun Qian for producing output in ace format and other help, Kathryn Beal for incorporating CAP3 in GAP4, Tim Hunkapiller and Granger Sutton for discussion, and Bruce Roe and Granger Sutton for providing sequence data sets. This project was supported by NIH Grant R01HG01502-02 from NHGRI.



CAP3 (last edited 2011-08-03 11:00:45 by localhost)
