NucleicAcidsResearch, 2002, Vol.30, No. 1 11752248 Ensembl
Introduction
[Ensembl] integrates [Genome] annotation
- Vertical: an increasing range of data types
- Horizontal: comparative genome sequence views across organisms such as mouse, rat and zebrafish
For [Bioinformatics]
- all software is freely available and the system is designed to be completely portable
- to provide a bioinformatics framework that is easy to apply to different organisms and types of data
- spirit of open source community
[Ensembl] [Genome] Annotation
Role of Ensembl
- annotates known genes
- predicts novel genes with functional annotation from InterPro
- additional annotation by
- [OMIM]
- [SAGE] expression
GenePrediction is the most important part.
However, in eukaryotic organisms the many introns lead to a high FalsePositive rate. The emphasis is therefore on integrating all available GenePrediction methods.
Ensembl [Gene] build system
AbInitio gene predictions
Homology and GenePrediction [HMM]s
[Gene]s are placed in the [Genome] using a 3-step process
- find 'best in genome' positions for all known human proteins from SwissProt and TrEmbl using a fast protein-to-DNA matcher (pmatch, RichardDurbin, unpublished software)
(->) refine the gene structure using [Genewise]
- align paralogous human proteins and proteins from other organisms (to form a set of novel human genes)
- AbInitio GenePrediction: GenScan is run across the entire genome to create a set of genscan peptides
(->) exons from these predicted peptides that are confirmed by [BLAST] matches against vertebrate mRNA and UniGene clusters are assembled into genes
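The confirmation step above can be pictured as an interval-overlap filter. The following is a minimal sketch, not Ensembl's implementation: the `Interval` type, the `confirmed_exons` helper, the coordinates, and the rule that any overlap with a similarity hit counts as confirmation are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A feature on a genomic sequence, 1-based and inclusive (hypothetical type)."""
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base."""
    return a.start <= b.end and b.start <= a.end


def confirmed_exons(exons: List[Interval], hits: List[Interval]) -> List[Interval]:
    """Keep only the predicted exons supported by at least one similarity hit.

    Assumption: any overlap counts as support; a real pipeline would
    apply stricter criteria.
    """
    return [exon for exon in exons if any(overlaps(exon, hit) for hit in hits)]


# Made-up coordinates: three predicted exons and two similarity hits.
genscan_exons = [Interval(100, 250), Interval(400, 520), Interval(900, 1000)]
similarity_hits = [Interval(120, 240), Interval(950, 1100)]
supported = confirmed_exons(genscan_exons, similarity_hits)
```

Here the middle exon has no supporting hit and is dropped, which is the sense in which similarity evidence lowers the FalsePositive rate of purely AbInitio predictions.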
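Ensembl genes built by the above methods are regarded as accurately predicted gene structures with a low FalsePositive rate, each supported by at least one piece of sequence-similarity evidence. These genes receive identifiers beginning ENSG. Within an ENSG gene
- transcripts begin ENST
- exons begin ENSE
- translations begin ENSP
and these identifiers are kept stable across assemblies.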
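Because the identifier prefix encodes the object type, a client can tell what a stable identifier refers to without a database lookup. A minimal sketch, assuming only the four prefixes listed above; the helper name and the example identifiers are made up for illustration.

```python
# Prefixes taken from the list above; the mapping and helper names are illustrative.
PREFIX_TO_TYPE = {
    "ENSG": "gene",
    "ENST": "transcript",
    "ENSE": "exon",
    "ENSP": "translation",
}


def feature_type(stable_id: str) -> str:
    """Return the kind of object a stable identifier refers to."""
    for prefix, kind in PREFIX_TO_TYPE.items():
        if stable_id.startswith(prefix):
            return kind
    raise ValueError(f"not a recognised Ensembl stable identifier: {stable_id!r}")


# Made-up identifiers, used only to exercise the helper.
kinds = [feature_type(i) for i in ("ENSG00000000001", "ENSP00000000001")]
```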
Ensembl continually extends, refines and calibrates the gene building process against experimentally verified data. For chromosome 22, [EST] information was integrated; [EST]s are especially helpful for predicting non-coding exons located in the 3'-UTR. The programs [Exonerate] and [EST_Genome] were used for [EST]/[Genome] SequenceAlignment.
The WholeGenomeShotGun sequence of the MouseGenome is also a useful resource for identifying human genes. A very fast gapped DNA-DNA alignment program, [Exonerate], was developed; 14M mouse reads were matched against the assembled HumanGenome and used together with GenScan to pick out potential novel coding exons.
Ensembl Web site
Ensembl contigview web pages
- scroll along entire chromosome
- features are integrated from external data sources
- [HUGO] gene names
- genetic markers
- disease genes and [SNP] with links to primary [Database]s
- user can control [DAS] data sources
- matches against mouse WholeGenomeShotGun reads are displayed
Alternative views of the data
- mapview web pages
- show relationships between cytogenetic bands and the genome sequence via markers
- displays feature distribution plot
- geneview web pages
- show individual Ensembl gene with its transcripts and gene structures
- proteinview web pages
- show individual Ensembl translations with functional annotation from InterPro
- Similarity searching tools available against the entire [Genome]
- [BLAST]
- [SSAHA]
Ensembl can be accessed in a variety of ways
- [Apollo] [Java] viewer
Ensembl FTP provides a variety of data download formats
- [EMBL] and GenBank formats containing annotation of raw genomic sequence
- full dumps of the [MySQL] database
- contigview web pages allow regions to be selected and dumped in many FlatFile formats
Ensembl Software System
Development environment
For scalability and consistency (->) using BioPerl and a [MySQL] RelationalDatabase
- Mostly [Perl], with extensions in [Cee] and some alternative interfaces in [Java]
Architecture (for [Orthogonality])
- biologically meaningful object (business objects)
- database connectivity objects (adaptors)
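The split between business objects and adaptors can be sketched as follows. This is an illustration in Python rather than Ensembl's Perl code: the `Gene` and `GeneAdaptor` classes, the table layout, and the example identifier are hypothetical, and an in-memory SQLite database stands in for [MySQL].

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    """Business object: holds biological data and knows nothing about storage."""
    stable_id: str
    chromosome: str
    start: int
    end: int

    def length(self) -> int:
        return self.end - self.start + 1


class GeneAdaptor:
    """Adaptor: the only place that knows the (hypothetical) table layout."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS gene ("
            "stable_id TEXT PRIMARY KEY, chromosome TEXT, "
            "seq_start INTEGER, seq_end INTEGER)"
        )

    def store(self, gene: Gene) -> None:
        self.connection.execute(
            "INSERT INTO gene VALUES (?, ?, ?, ?)",
            (gene.stable_id, gene.chromosome, gene.start, gene.end),
        )

    def fetch_by_stable_id(self, stable_id: str) -> Optional[Gene]:
        row = self.connection.execute(
            "SELECT stable_id, chromosome, seq_start, seq_end "
            "FROM gene WHERE stable_id = ?",
            (stable_id,),
        ).fetchone()
        return Gene(*row) if row else None


adaptor = GeneAdaptor(sqlite3.connect(":memory:"))
adaptor.store(Gene("ENSG00000000001", "22", 1000, 5000))  # made-up gene
fetched = adaptor.fetch_by_stable_id("ENSG00000000001")
```

Keeping SQL out of `Gene` means the schema can change, or the storage backend can be swapped, without touching code that works with the biological objects.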
A core design feature is the virtual contig object, which allows Ensembl to handle draft genome data in a seamless way
- allows access to genomic sequence and its annotation as if it were a continuous piece of DNA
- handles reading and writing of features
- behaves identically regardless of whether the underlying sequence is stored as a single real piece (a single raw contig) or an assembly of many fragments of DNA (many raw contigs)
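The idea of a virtual contig can be illustrated with a small coordinate-mapping sketch. It is not Ensembl's implementation: the `RawContig` and `VirtualContig` classes are hypothetical, and the sketch assumes the raw contigs are laid end to end on the forward strand with no gaps or overlaps, which a real draft assembly does not guarantee.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RawContig:
    """A real, stored piece of DNA (hypothetical minimal form)."""
    name: str
    sequence: str


class VirtualContig:
    """Presents one or more raw contigs as a single continuous sequence.

    Assumption: contigs are joined in the given order, forward strand only,
    with no gaps or overlaps between them.
    """

    def __init__(self, contigs: List[RawContig]) -> None:
        self.contigs = contigs

    def length(self) -> int:
        return sum(len(c.sequence) for c in self.contigs)

    def subseq(self, start: int, end: int) -> str:
        """Return bases start..end (1-based, inclusive) in virtual coordinates."""
        if start < 1 or end > self.length() or start > end:
            raise ValueError("coordinates outside the virtual contig")
        pieces = []
        offset = 0  # number of bases in the contigs already passed
        for contig in self.contigs:
            # Translate the requested range into this contig's own coordinates.
            local_start = max(start - offset, 1)
            local_end = min(end - offset, len(contig.sequence))
            if local_start <= local_end:
                pieces.append(contig.sequence[local_start - 1:local_end])
            offset += len(contig.sequence)
        return "".join(pieces)


# The same ten bases stored as one raw contig and as three fragments.
single = VirtualContig([RawContig("c1", "ACGTACGTAC")])
fragmented = VirtualContig(
    [RawContig("a", "ACGT"), RawContig("b", "ACG"), RawContig("c", "TAC")]
)
```

Calling `subseq(3, 8)` on either object yields the same bases, so code that consumes a virtual contig does not need to know how the sequence is stored underneath.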
Access to software
- stable : FTP
- developing : [CVS]
Ensembl Data Analysis Pipeline
Integration with the HumanGenomeProjectWorkingDraft at the DNA level
Coping actively with a continuously changing assembly (->) the challenge for Ensembl
- full analysis pipeline
- encapsulates running a single analysis process
- encapsulates reading and writing the input and results of an analysis from a database
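The two kinds of encapsulation listed above can be sketched as two small classes. This is a toy illustration under stated assumptions: the class names `AnalysisRunner` and `DatabaseRunner`, the start-codon "analysis", and the use of plain dictionaries in place of database tables are all invented for the example.

```python
from typing import Callable, Dict, List


class AnalysisRunner:
    """Encapsulates running a single analysis on in-memory input."""

    def __init__(self, analysis: Callable[[str], List[str]]) -> None:
        self.analysis = analysis

    def run(self, sequence: str) -> List[str]:
        return self.analysis(sequence)


class DatabaseRunner:
    """Encapsulates reading the input of an analysis and writing its results.

    The 'database' is a pair of dictionaries standing in for real tables.
    """

    def __init__(
        self,
        runner: AnalysisRunner,
        inputs: Dict[str, str],
        outputs: Dict[str, List[str]],
    ) -> None:
        self.runner = runner
        self.inputs = inputs
        self.outputs = outputs

    def process(self, input_id: str) -> None:
        sequence = self.inputs[input_id]                    # read input
        self.outputs[input_id] = self.runner.run(sequence)  # write results


def find_start_codons(sequence: str) -> List[str]:
    """Toy analysis: report the 1-based positions of every ATG."""
    return [
        f"ATG@{i + 1}"
        for i in range(len(sequence) - 2)
        if sequence[i:i + 3] == "ATG"
    ]


inputs = {"contig1": "CCATGGATGA"}
outputs: Dict[str, List[str]] = {}
DatabaseRunner(AnalysisRunner(find_start_codons), inputs, outputs).process("contig1")
```

With this separation, the analysis can be tested on its own, and the same wrapper can re-run it whenever the assembly changes and the stored input is updated.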
DistributedAnnotationSystem ([DAS])
To enable users to easily view and compare annotation from different sources that are distributed across the Internet
Ensembl makes its annotation data available at http://servlet.sanger.ac.uk:8080/das/ using the BioJava [DAS] server [DAZZLE]
HumanGenome annotation can be viewed without setting up a 3rd party client: Ensembl contigview can be configured to act as a [DAS] client
- for users who want to serve [DAS] data without setting up a server, part of the user's annotation can be uploaded to the [DAS] server.
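A [DAS] client retrieves annotation with plain HTTP requests. The sketch below only builds the request URL and makes no network call. It assumes the DAS 1 query layout `<server>/<data source>/features?segment=<reference>:<start>,<stop>`; the data source name `example_source` and the coordinates are made up, and the server address quoted in these notes may no longer be reachable.

```python
from urllib.parse import urlencode


def das_features_url(
    base_url: str, data_source: str, reference: str, start: int, end: int
) -> str:
    """Build a DAS 'features' request URL for one segment.

    Assumption: DAS 1 style query, segment=<reference>:<start>,<stop>.
    """
    # Keep ':' and ',' unescaped so the segment stays readable.
    query = urlencode({"segment": f"{reference}:{start},{end}"}, safe=":,")
    return f"{base_url.rstrip('/')}/{data_source}/features?{query}"


url = das_features_url(
    "http://servlet.sanger.ac.uk:8080/das/",  # server address from the notes
    "example_source",                          # hypothetical data source name
    "22",
    1000000,
    1100000,
)
```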