BioPythonTip for GenBank

Objects available after parsing

When using RecordParser

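from Bio import GenBank

handle = open("example.gb")  # hypothetical file name
parser = GenBank.RecordParser()
iterator = GenBank.Iterator(handle, parser)  # the iterator needs the handle and the parser
while True:
    cur_record = next(iterator)
    if cur_record is None:  # None signals the end of the file
        break
    print(cur_record.locus)  # or any other attribute listed below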

This parser applies to both nucleotide and protein databases.

  • locus - The name specified after the LOCUS keyword in the GenBank record. This may be the accession number, or a clone id or something else.

  • size - The size of the record.
  • residue_type - The type of residues making up the sequence in this record. Normally something like RNA, DNA or PROTEIN, but may be as esoteric as 'ss-RNA circular'.
  • data_file_division - The division this record is stored under in GenBank (e.g. PLN -> plants; PRI -> humans, primates; BCT -> bacteria...)

  • date - The date of submission of the record, in a form like '28-JUL-1998'
  • accession - list of all accession numbers for the sequence.
  • nid - Nucleotide identifier number.
  • pid - Protein identifier number.
  • version - The accession number + version (e.g. AB01234.2)
  • db_source - Information about the database the record came from
  • gi - The NCBI gi identifier for the record.
  • keywords - A list of keywords related to the record.
  • segment - If the record is one of a series, this is info about which segment this record is (something like '1 of 6').
  • source - The source of material where the sequence came from.
  • organism - The genus and species of the organism (e.g. 'Homo sapiens')
  • taxonomy - A listing of the taxonomic classification of the organism, starting general and getting more specific.
  • references - A list of Reference objects.
    • number - The number of the reference in the listing of references.
    • bases - The bases in the sequence the reference refers to.
    • authors - String with all of the authors.
    • title - The title of the reference.
    • journal - Information about the journal where the reference appeared.
    • medline_id - The medline id for the reference.
    • pubmed_id - The pubmed_id for the reference.
    • remark - Free-form remarks about the reference.
  • comment - Text with any kind of comment about the record.
  • features - A listing of Features making up the feature table.
    • key - The key name of the feature (e.g. source)
    • location - The string specifying the location of the feature.
    • qualifiers - A listing of Qualifier objects in the feature.
      • key - The key name of the qualifier (e.g. /organism=)
      • value - The value of the qualifier ("Dictyostelium discoideum").
  • base_counts - A string with the counts of bases for the sequence.
  • origin - A string specifying info about the origin of the sequence.
  • sequence - A string with the sequence itself.
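
To make the nesting of the list above concrete (record, then features with their qualifiers, then references), here is a minimal sketch. It relies only on attribute names taken from the list, so a record returned by RecordParser should in principle work with it, but that is an assumption and has not been checked against a particular Biopython version. The function name summarize_record and the demo_record stand-in data are hypothetical and exist only for illustration.

```python
from types import SimpleNamespace


def summarize_record(record):
    """Build a few summary lines from a RecordParser-style record.

    Only attribute names from the list above are used, so any object
    exposing them (duck typing) is accepted.
    """
    lines = [
        "locus: %s" % record.locus,
        "size: %s" % record.size,
        "organism: %s" % record.organism,
    ]
    # Each feature carries a key, a location string and a list of qualifiers.
    for feature in record.features:
        lines.append("feature %s at %s" % (feature.key, feature.location))
        for qualifier in feature.qualifiers:
            lines.append("  %s%s" % (qualifier.key, qualifier.value))
    # Each reference carries, among other fields, a number and a title.
    for reference in record.references:
        lines.append("reference %s: %s" % (reference.number, reference.title))
    return lines


# Hypothetical stand-in data, not taken from a real GenBank entry.
demo_record = SimpleNamespace(
    locus="DEMO0001",
    size="60",
    organism="Dictyostelium discoideum",
    features=[
        SimpleNamespace(
            key="source",
            location="1..60",
            qualifiers=[
                SimpleNamespace(key="/organism=",
                                value='"Dictyostelium discoideum"')
            ],
        )
    ],
    references=[SimpleNamespace(number="1", title="Demo title")],
)

for line in summarize_record(demo_record):
    print(line)
```

With a real file, the same function could be called on each cur_record inside the loop shown earlier.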

When using FeatureParser

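from Bio import GenBank

handle = open("example.gb")  # hypothetical file name
parser = GenBank.FeatureParser()
iterator = GenBank.Iterator(handle, parser)  # the iterator needs the handle and the parser
while True:
    cur_record = next(iterator)
    if cur_record is None:  # None signals the end of the file
        break
    print(cur_record.features)  # or any other attribute listed below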

This parser gives the most detailed parse. The cur_record object above is a SeqRecord whose features are SeqFeature objects, and it can also interoperate with BioCorba.

  • features
    • location - the location of the feature on the sequence
      • position - The position of the boundary.
      • extension - An optional argument which must be zero since we don't have an extension. The argument is provided so that the same number of arguments can be passed to all position types.
    • type - the specified type of the feature (e.g. CDS, exon, repeat...)
    • ref - A reference to another sequence. This could be an accession number for some different sequence.
    • ref_db - A different database for the reference accession number.
    • qualifiers - A dictionary of qualifiers on the feature. These are analogous to the qualifiers from a GenBank feature table. The keys of the dictionary are qualifier names, the values are the qualifier values.

    • sub_features - Additional SeqFeatures which fall under this 'parent' feature. For instance, given a feature such as
      CDS    join(1..10,30..40,50..60)
      the top-level feature would be a CDS from 1 to 60, and the sub-features would be of 'CDS_span' type, running from 1 to 10, 30 to 40 and 50 to 60, respectively.

  • references
    • location - A list of Location objects specifying regions of the sequence that the references correspond to. If no locations are specified, the entire sequence is assumed.
    • authors - A big old string, or a list split by author, of authors for the reference.
    • title - The title of the reference.
    • journal - Journal the reference was published in.
    • medline_id - A medline reference for the article.
    • pubmed_id - A pubmed reference for the article.
    • comment - A place to stick any comments about the reference.
  • name : Version
  • annotations
  • description : Definition
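
The sub_features entry above describes how a join(...) location yields one parent span plus one span per segment. The sketch below reproduces that arithmetic on the raw location string using only the standard library. It is not Biopython's own parsing code: split_join_location is a hypothetical helper, it handles only plain start..end ranges, and it keeps the positions exactly as written in the GenBank file, whereas Biopython's location objects may number positions differently.

```python
import re


def split_join_location(location):
    """Return (parent_span, sub_spans) for a simple GenBank location string.

    Handles only plain 'start..end' ranges, optionally wrapped in join(...).
    Fuzzy positions, complement(...) and remote references are not handled.
    """
    spans = [(int(start), int(end))
             for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]
    if not spans:
        raise ValueError("no start..end ranges found in %r" % location)
    # The parent feature covers everything from the first start to the last end.
    parent = (min(s for s, _ in spans), max(e for _, e in spans))
    # A single range has no sub-features; a join has one span per segment.
    sub_spans = spans if len(spans) > 1 else []
    return parent, sub_spans


parent, subs = split_join_location("join(1..10,30..40,50..60)")
print(parent)  # (1, 60)
print(subs)    # [(1, 10), (30, 40), (50, 60)]
```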