NucleicAcidsResearch, 2002, Vol.30, No. 1 11752248 Ensembl

Introduction

[Ensembl] integrates [Genome] annotation in two directions

  • Vertical : an increasing range of data types
  • Horizontal : comparative genome sequence views such as mouse, rat, zebrafish

For [Bioinformatics]

  • to make all software freely available and design the system to be completely portable
  • to provide a bioinformatics framework that is easy to apply to different organisms and types of data
  • in the spirit of the open source community

[Ensembl] [Genome] Annotation

Role of Ensembl

  • annotates known genes
  • predicts novel genes with functional annotation from InterPro

  • additional annotation by
    • [OMIM]
    • [SAGE] expression

GenePrediction is the most important part

However, in eukaryotic organisms the large number of introns leads to a high FalsePositive rate. The emphasis is therefore on integrating all available GenePrediction methods.

Ensembl [Gene] build system

[Gene]s are placed in the [Genome] using a 3-step process

  1. 'best in genome' positions are found for all known human proteins from SwissProt and TrEmbl using a fast protein to DNA matcher (pmatch, RichardDurbin, unpublished software)

    • (->) using [Genewise], refine gene structure

  2. paralogous human proteins and proteins from other organisms are aligned in the same way (to form a set of novel human genes)
  3. the AbInitio GenePrediction program GenScan is run across the entire genome to create a set of GenScan peptides

    • (->) exons from these predicted peptides that are confirmed by [BLAST] matches to proteins, vertebrate mRNA and UniGene clusters are assembled into genes
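The confirmation step in point 3 can be pictured as an interval-overlap filter. The following is only a minimal sketch under that assumption: the function names, the coordinates and the "any overlap counts as support" rule are illustrative and are not taken from the Ensembl gene build code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def confirmed_exons(predicted: List[Interval], hits: List[Interval]) -> List[Interval]:
    """Keep only predicted exons supported by at least one similarity hit."""
    return [exon for exon in predicted if any(overlaps(exon, hit) for hit in hits)]


# Hypothetical toy data, not real GenScan or BLAST output.
genscan_exons = [(100, 200), (500, 650), (900, 1000)]
similarity_hits = [(150, 180), (940, 1100)]
supported = confirmed_exons(genscan_exons, similarity_hits)
```

In this toy run the middle exon has no supporting hit and is dropped, which is the sense in which similarity evidence lowers the FalsePositive rate of a purely AbInitio prediction.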

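Ensembl genes built by this three-step process are each supported by at least one piece of sequence-similarity evidence and are regarded as accurately predicted gene structures with a low FalsePositive rate. Each such gene receives an identifier beginning ENSG. The components of an ENSG gene have their own identifiers:

  • transcripts begin ENST
  • exons begin ENSE
  • translations begin ENSP

These identifiers are kept stable across assemblies.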
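As a small illustration of the prefix scheme, the sketch below maps an identifier to the kind of object it names. The function name and the example identifiers are made up for illustration; only the four prefixes come from the notes above.

```python
from typing import Optional

# Prefixes as listed in the notes; the type names on the right are illustrative labels.
PREFIX_TO_TYPE = {
    "ENSG": "gene",
    "ENST": "transcript",
    "ENSE": "exon",
    "ENSP": "translation",
}


def stable_id_type(stable_id: str) -> Optional[str]:
    """Return the object type implied by the 4-letter prefix, or None if unknown."""
    return PREFIX_TO_TYPE.get(stable_id[:4])


# Made-up identifiers, used only to exercise the lookup.
kind_of_gene = stable_id_type("ENSG00000000001")
kind_of_other = stable_id_type("XYZ123")
```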

Ensembl continues to extend, refine and calibrate the gene building process against experimentally verified data. For chromosome 22, [EST] information has been integrated; [EST]s are especially helpful for predicting non-coding exons located in the 3'-UTR. The programs [Exonerate] and [EST_Genome] were used for [EST]/[Genome] SequenceAlignment.

MouseGenomeWholeGenomeShotGun sequence is also a useful resource for identifying human genes. A very fast gapped DNA-DNA alignment program, [Exonerate], was developed and used to match 14M mouse reads against the assembled HumanGenome; together with GenScan, these matches have been used to pick out potential novel coding exons.

Ensembl Web site

Ensembl contigview web pages

  • scroll along entire chromosome
  • features are integrated from external data sources
    • [HUGO] gene names
    • genetic markers
    • disease genes and [SNP] with links to primary [Database]s
  • user can control [DAS] data sources
  • matches to mouse WholeGenomeShotGun reads are displayed

Alternative views of the data

  • mapview web pages
    • show relationships between cytogenetic bands and the genome sequence via markers
    • displays feature distribution plot
  • geneview web pages
    • show individual Ensembl gene with its transcripts and gene structures
  • proteinview web pages
    • show individual Ensembl translations with functional annotation from InterPro

  • Similarity searching tools available against the entire [Genome]
    • [BLAST]
    • [SSAHA]

Ensembl can be accessed in a variety of ways

  • [Apollo] [Java] viewer

Ensembl FTP provides a variety of data download formats

  • [EMBL] and GenBank formats containing annotation of raw genomic sequence
  • full dumps of the [MySQL] database
  • export from the contigview web pages, allowing regions to be selected and dumped in many FlatFile formats

Ensembl Software System

Development environment

  • For scalability and consistency (->) Using BioPerl and [MySQL] RelationalDatabase

  • Mostly [Perl], with extensions in [Cee] and some alternative interfaces in [Java]

Architecture (for [Orthogonality])

  • biologically meaningful objects (business objects)
  • database connectivity objects (adaptors)
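The split between business objects and adaptors can be sketched as follows. This is a hedged illustration, not the Ensembl API: the class names `Gene` and `GeneAdaptor`, the table layout, and the use of an in-memory SQLite database as a stand-in for [MySQL] are all assumptions made so the example runs on its own.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    """Business object: carries biological meaning and knows nothing about SQL."""
    stable_id: str
    chromosome: str
    start: int
    end: int

    def length(self) -> int:
        return self.end - self.start + 1


class GeneAdaptor:
    """Adaptor: the only place that knows about tables and queries."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn

    def store(self, gene: Gene) -> None:
        self.conn.execute(
            "INSERT INTO gene VALUES (?, ?, ?, ?)",
            (gene.stable_id, gene.chromosome, gene.start, gene.end),
        )

    def fetch_by_stable_id(self, stable_id: str) -> Optional[Gene]:
        row = self.conn.execute(
            "SELECT stable_id, chromosome, seq_start, seq_end "
            "FROM gene WHERE stable_id = ?",
            (stable_id,),
        ).fetchone()
        return Gene(*row) if row else None


# Hypothetical schema and data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (stable_id TEXT PRIMARY KEY, chromosome TEXT, "
    "seq_start INTEGER, seq_end INTEGER)"
)
adaptor = GeneAdaptor(conn)
adaptor.store(Gene("ENSG00000000001", "22", 1000, 1999))
gene = adaptor.fetch_by_stable_id("ENSG00000000001")
```

Because `Gene` never touches the connection, the storage backend could be swapped by replacing the adaptor alone, which is the kind of [Orthogonality] the architecture aims for.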

A core design feature is the virtual contig object, which allows Ensembl to handle draft genome data in a seamless way

  • allows access to genomic sequence and its annotation as if it were a continuous piece of DNA
  • handles reading and writing of features
  • behaves identically regardless of whether the underlying sequence is stored as a single real piece (a single raw contig) or an assembly of many fragments of DNA (many raw contigs)
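The idea of a virtual contig can be sketched as a coordinate mapping over an ordered list of raw contigs. This is a deliberately simplified illustration: the class below is hypothetical, assumes the raw contigs abut exactly, and ignores orientation, gaps, overlaps and feature remapping, all of which a real implementation would have to handle.

```python
from typing import List


class VirtualContig:
    """Presents an ordered list of raw contig sequences as one continuous sequence."""

    def __init__(self, raw_contigs: List[str]) -> None:
        self.raw_contigs = raw_contigs
        # 0-based start offset of each raw contig in virtual coordinates.
        self.offsets: List[int] = []
        pos = 0
        for seq in raw_contigs:
            self.offsets.append(pos)
            pos += len(seq)
        self.length = pos

    def subseq(self, start: int, end: int) -> str:
        """Return bases start..end (1-based, inclusive) in virtual coordinates."""
        if not 1 <= start <= end <= self.length:
            raise ValueError("coordinates outside the virtual contig")
        pieces = []
        for offset, seq in zip(self.offsets, self.raw_contigs):
            lo = max(start - 1, offset)       # clip the request to this raw contig
            hi = min(end, offset + len(seq))
            if lo < hi:
                pieces.append(seq[lo - offset:hi - offset])
        return "".join(pieces)


# The same ten bases stored as one raw contig or as three fragments.
single = VirtualContig(["ACGTACGTAA"])
fragmented = VirtualContig(["ACG", "TACG", "TAA"])
```

Calling `subseq` with the same coordinates on `single` and `fragmented` gives the same string, so code written against the virtual contig does not need to know how the underlying sequence is split.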

Access to software

  • stable : FTP
  • developing : [CVS]

Ensembl Data Analysis Pipeline

  • Integration with the HumanGenomeProjectWorkingDraft at the DNA level

  • Actively coping with a continuously changing assembly (->) a challenge for Ensembl

  • full analysis pipeline
    • encapsulates running a single analysis process
    • encapsulates reading and writing the input and results of an analysis from a database
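The two kinds of encapsulation in the pipeline can be sketched as two small wrapper classes. The names `Runnable` and `RunnableDB`, the dictionary used in place of a database, and the toy CG-dinucleotide analysis are all chosen for illustration; the notes do not specify the actual class layout.

```python
from typing import Callable, Dict, List


class Runnable:
    """Encapsulates running a single analysis on an in-memory sequence."""

    def __init__(self, analysis: Callable[[str], List[int]]) -> None:
        self.analysis = analysis
        self.output: List[int] = []

    def run(self, sequence: str) -> None:
        self.output = self.analysis(sequence)


class RunnableDB:
    """Encapsulates reading the input from, and writing results to, a store."""

    def __init__(self, runnable: Runnable, store: Dict[str, dict]) -> None:
        self.runnable = runnable
        self.store = store

    def process(self, input_id: str) -> None:
        sequence = self.store["sequence"][input_id]   # read input
        self.runnable.run(sequence)                   # run the analysis
        self.store["results"][input_id] = self.runnable.output  # write results


def find_cg(sequence: str) -> List[int]:
    """Toy analysis: 1-based start positions of every CG dinucleotide."""
    return [i + 1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == "CG"]


# A plain dict stands in for the database; the contig name is hypothetical.
store: Dict[str, dict] = {"sequence": {"ctg1": "ACGTCGA"}, "results": {}}
RunnableDB(Runnable(find_cg), store).process("ctg1")
```

Keeping the analysis wrapper free of storage code means the same `Runnable` could be run by hand on a sequence or driven in bulk through the database-aware wrapper.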

DistributedAnnotationSystem ([DAS])

To enable users to easily view and compare annotation from different sources that are distributed across the Internet

  1. Ensembl makes its annotation data available at http://servlet.sanger.ac.uk:8080/das/ using the BioJava [DAS] server [DAZZLE]

  2. HumanGenome annotation can be viewed without setting up a 3rd party client: Ensembl contigview can be configured to act as a [DAS] client

  3. for users who want to provide [DAS] data without setting up a server of their own, part of the user's annotation can be uploaded to a [DAS] server
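To make the client side concrete, the sketch below only builds the URL of a [DAS] features request and does not contact any server. It assumes the DAS 1.x request shape (`<server>/<source>/features?segment=<ref>:<start>,<stop>`); the data source name `example_source` and the coordinates are hypothetical, and the server listed above may no longer be reachable.

```python
def das_features_url(server: str, source: str, segment: str, start: int, stop: int) -> str:
    """Build a DAS 1.x style 'features' request URL for one segment range."""
    base = server.rstrip("/")
    return f"{base}/{source}/features?segment={segment}:{start},{stop}"


# 'example_source' is a placeholder; real source names come from the server's own listing.
url = das_features_url(
    "http://servlet.sanger.ac.uk:8080/das/", "example_source", "22", 1000000, 1100000
)
```

A client such as contigview would issue one such request per configured source and overlay the returned features on the same region.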


CategoryPaper

EnsemblGenomeDatabaseProject (last edited 2011-08-03 11:00:45 by localhost)
