BioPython Tip for GenBank
Objects available after parsing
When using RecordParser
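    from Bio import GenBank

    parser = GenBank.RecordParser()
    handle = open("input.gb")  # placeholder file name; the original note does not give one
    iterator = GenBank.Iterator(handle, parser)  # the iterator needs the handle and the parser
    while 1:
        cur_record = next(iterator, None)  # None once the records run out
        if cur_record is None:
            break
        print(cur_record.locus)  # or any other attribute from the list that follows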
This parser works for both nucleotide database and protein database records.
- locus - The name specified after the LOCUS keyword in the GenBank record. This may be the accession number, a clone id, or something else.
- size - The size of the record.
- residue_type - The type of residues making up the sequence in this record. Normally something like RNA, DNA or PROTEIN, but may be as esoteric as 'ss-RNA circular'.
- data_file_division - The division this record is stored under in GenBank (e.g. PLN -> plants; PRI -> humans, primates; BCT -> bacteria...)
- date - The date of submission of the record, in a form like '28-JUL-1998'
- accession - list of all accession numbers for the sequence.
- nid - Nucleotide identifier number.
- pid - Protein identifier number.
- version - The accession number + version (e.g. AB01234.2)
- db_source - Information about the database the record came from
- gi - The NCBI gi identifier for the record.
- keywords - A list of keywords related to the record.
- segment - If the record is one of a series, this is info about which segment this record is (something like '1 of 6').
- source - The source of material where the sequence came from.
- organism - The genus and species of the organism (e.g. 'Homo sapiens')
- taxonomy - A listing of the taxonomic classification of the organism, starting general and getting more specific.
- references - A list of Reference objects.
  - number - The number of the reference in the listing of references.
  - bases - The bases in the sequence the reference refers to.
  - authors - String with all of the authors.
  - title - The title of the reference.
  - journal - Information about the journal where the reference appeared.
  - medline_id - The medline id for the reference.
  - pubmed_id - The pubmed_id for the reference.
  - remark - Free-form remarks about the reference.
- comment - Text with any kind of comment about the record.
- features - A listing of Features making up the feature table.
  - key - The key name of the feature (e.g. source)
  - location - The string specifying the location of the feature.
  - qualifiers - A listing of Qualifier objects in the feature.
    - key - The key name of the qualifier (e.g. /organism=)
    - value - The value of the qualifier (e.g. "Dictyostelium discoideum").
- base_counts - A string with the counts of bases for the sequence.
- origin - A string specifying info about the origin of the sequence.
- sequence - A string with the sequence itself.
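To make the first few attributes concrete, here is a minimal sketch of how the fields of a LOCUS line map onto locus, size, residue_type, data_file_division and date. It is not Biopython's implementation: parse_locus_line is a hypothetical helper, the sample line is made up for illustration, and splitting on whitespace is an assumption that only holds for simple, well-formed LOCUS lines.

```python
def parse_locus_line(line):
    """Split a simple LOCUS line into Record-like fields (illustrative only)."""
    tokens = line.split()
    if len(tokens) < 6 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line this sketch understands: %r" % line)
    return {
        "locus": tokens[1],
        "size": tokens[2],  # kept as text, exactly as it appears on the line
        # Everything between the length unit and the division, so a value
        # such as 'ss-RNA circular' survives as a single string.
        "residue_type": " ".join(tokens[4:-2]),
        "data_file_division": tokens[-2],
        "date": tokens[-1],
    }


# Made-up sample line, not a real GenBank entry.
sample = "LOCUS       EXAMPLE1     1200 bp    ss-RNA  circular  PLN  28-JUL-1998"
fields = parse_locus_line(sample)
print(fields["locus"], fields["residue_type"], fields["date"])
```

With RecordParser the same values would simply be read from the record (cur_record.locus, cur_record.residue_type, and so on); the sketch only shows where they come from in the flat file.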
When using FeatureParser
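    from Bio import GenBank

    parser = GenBank.FeatureParser()
    handle = open("input.gb")  # placeholder file name; the original note does not give one
    iterator = GenBank.Iterator(handle, parser)  # the iterator needs the handle and the parser
    while 1:
        cur_record = next(iterator, None)  # None once the records run out
        if cur_record is None:
            break
        print(cur_record.features)  # or any other attribute from the list that follows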
This is the parser that gives the most detailed parse. The cur_record object above follows the SeqFeature model and can also be used with BioCorba.
- features
  - location - the location of the feature on the sequence
    - position - The position of the boundary.
    - extension - An optional argument which must be zero since we don't have an extension. The argument is provided so that the same number of arguments can be passed to all position types.
  - type - the specified type of the feature (e.g. CDS, exon, repeat...)
  - ref - A reference to another sequence. This could be an accession number for some different sequence.
  - ref_db - A different database for the reference accession number.
  - qualifiers - A dictionary of qualifiers on the feature. These are analogous to the qualifiers from a GenBank feature table. The keys of the dictionary are qualifier names, the values are the qualifier values.
  - sub_features - Additional SeqFeatures which fall under this 'parent' feature. For instance, if we have something like:
      CDS join(1..10,30..40,50..60)
    then the top-level feature would be a CDS from 1 to 60, and the sub-features would be of 'CDS_span' type and would run from 1 to 10, 30 to 40 and 50 to 60, respectively.
    - location - the location of the feature on the sequence
- references
  - location - A list of Location objects specifying regions of the sequence that the references correspond to. If no locations are specified, the entire sequence is assumed.
  - authors - A big old string, or a list split by author, of authors for the reference.
  - title - The title of the reference.
  - journal - Journal the reference was published in.
  - medline_id - A medline reference for the article.
  - pubmed_id - A pubmed reference for the article.
  - comment - A place to stick any comments about the reference.
- name : Version
- annotations
- description : Definition
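The join(...) example given under sub_features can be sketched in plain Python. This is not how Biopython parses locations: split_join_location is a hypothetical helper, and it is assumed that the input is a flat join of simple start..end ranges (no complement, no fuzzy positions, no references to other sequences).

```python
import re


def split_join_location(location):
    """Return (overall_span, sub_spans) for a flat 'join(a..b,c..d)' string."""
    match = re.fullmatch(r"join\((.+)\)", location.strip())
    if match is None:
        raise ValueError("expected a flat join(...) location: %r" % location)
    sub_spans = []
    for part in match.group(1).split(","):
        start, end = part.strip().split("..")
        sub_spans.append((int(start), int(end)))
    # The parent feature covers everything from the first start to the last end.
    overall = (min(s for s, _ in sub_spans), max(e for _, e in sub_spans))
    return overall, sub_spans


overall, parts = split_join_location("join(1..10,30..40,50..60)")
print(overall)  # (1, 60)
print(parts)    # [(1, 10), (30, 40), (50, 60)]
```

The overall span corresponds to the top-level CDS feature and each entry in parts to one sub-feature, matching the 1 to 60 and 1 to 10, 30 to 40, 50 to 60 ranges described in the list.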