[FASTA] format.
A sequence in FastaFormat begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
>my test sequence for 532319
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
LAAVEAQQQMLKLTIWGVK
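The layout described above (a ">" description line followed by sequence lines) can be read with a few lines of plain Python. The following is only a minimal sketch: the function name `parse_fasta` is hypothetical, the sample text is an abbreviated copy of the example above, and skipping blank lines is an assumption rather than part of the format description.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)
    # Emit the last record once the input is exhausted.
    if description is not None:
        yield description, "".join(chunks)


# Abbreviated version of the example records shown above.
example = """\
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
LAAVEAQQQMLKLTIWGVK
>my test sequence for 532319
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
"""

records = list(parse_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```

The sequence lines of each record are joined into one string, so the line length of the input file does not matter to the caller.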
BioSequence is expected to be represented in the standard IUB/IUPAC AminoAcid and NucleicAcid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in AminoAcid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown NucleicAcid residue or X for unknown AminoAcid residue).
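As a rough illustration of the clean-up described above, the sketch below upper-cases a query and removes numerical digits, or replaces them with a caller-supplied letter such as N or X. The name `clean_query` and its `replacement` parameter are hypothetical, and dropping whitespace is an added assumption; hyphens, U and * pass through unchanged.

```python
def clean_query(seq, replacement=""):
    """Upper-case a sequence and remove (or replace) numerical digits."""
    cleaned = []
    for ch in seq:
        if ch.isspace():
            continue  # assumption: whitespace is not part of the sequence
        if ch.isdigit():
            cleaned.append(replacement)  # "" removes the digit entirely
        else:
            cleaned.append(ch.upper())   # lower-case maps to upper-case
    return "".join(cleaned)


print(clean_query("atg1c-gt"))       # digits removed
print(clean_query("atg1c-gt", "N"))  # digits replaced by N
```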
Handling FastaFormat with BioPython
The following shows how to handle FastaFormat with BioPython.
Input: mainly [Parsing] of FastaFormat
from Bio import Fasta, File
from cStringIO import StringIO
#file = File.UndoHandle(StringIO(fastaStr))  # use this if the data is held as a string
file = open('file.fasta', 'r')
parser = Fasta.RecordParser()
iterator = Fasta.Iterator(file, parser)
while 1:
    curRecord = iterator.next()  # step through the records in one FASTA file, one at a time
    if curRecord is None: break
    title = curRecord.title      # title of the record
    seq = curRecord.sequence     # sequence of the record
Output: when printing to stdout
from Bio import Fasta
title = '>This is test title'  # title of the FASTA record
seq = 'ATGGGGGTGTGTGTGGGG'     # one long string
fasta = Fasta.Record()         # create a Record instance named fasta
fasta.title = title            # assign the title attribute directly
fasta.sequence = seq           # likewise for the sequence
print fasta                    # this also inserts a '\n' after every 60 characters automatically

# if you want to write to a file
wfile = open('file_to_write', 'w')
wfile.write(str(fasta))
wfile.close()
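For readers without the BioPython record class at hand, the 60-column wrapping mentioned in the comments above can be approximated in plain Python. This is a sketch under assumptions, not BioPython's own implementation: `format_fasta` and its `width` parameter are hypothetical names, and the test sequence is simply the one above repeated five times.

```python
def format_fasta(title, seq, width=60):
    """Return one FASTA record with the sequence wrapped at `width` columns."""
    # Add the leading '>' only if the caller did not supply it.
    header = title if title.startswith(">") else ">" + title
    # Slice the sequence into fixed-width pieces.
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + body) + "\n"


long_seq = "ATGGGGGTGTGTGTGGGG" * 5  # 90 characters
record = format_fasta("This is test title", long_seq)
print(record, end="")
```

Slicing is used instead of a text-wrapping helper because a sequence contains no word boundaries to break on.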
Generating directly from a RelationalDatabase
SELECT CONCAT(">gi|", annot.gi, "|sp|", annot.acc, "|", sp.name, " ", annot.descr, "\n", protein.seq)
FROM   protein INNER JOIN annot USING (prot_id) INNER JOIN sp USING (acc)
WHERE  annot.current = 1;
$ mysql seqdb -N < swissprot.sql > swissprot.fa
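The header layout that the CONCAT expression builds can also be assembled on the client side once the rows have been fetched. The sketch below only mirrors the column order of the query above; the function name `row_to_fasta` and the sample row values are hypothetical and do not come from any real database.

```python
def row_to_fasta(gi, acc, name, descr, seq):
    """Format one result row the same way as the CONCAT expression above."""
    return ">gi|%s|sp|%s|%s %s\n%s" % (gi, acc, name, descr, seq)


# Hypothetical row, for illustration only.
row = (12345, "P00000", "TEST_HUMAN", "hypothetical test protein", "MKTAYIAK")
entry = row_to_fasta(*row)
print(entry)
```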
Related code collection
Decorating FastaFormat with [HTML]
- Using DecoratorPattern: [FastaDecorator.py]
- JuneKim's code (2004-06-13): I saw that you asked the Python community about regular expressions that must still match when a newline falls in the middle of the pattern. It can also be done as follows.

import re

class Enclose:
    def __init__(self,d):
        self.d=[(v,self.fragmentable(k)) for k,v in d]
        self.p=re.compile("(?i)(%s)"%")|(".join([f for _,f in self.d]))
    # allow one optional whitespace character (e.g. a newline) between letters
    def fragmentable(self,s): return r'\s?'.join(list(s))
    def __call__(self, m):
        opener,closer=self.d[m.lastindex-1][0]
        return "%s%s%s"%(opener,m.group(),closer)
    def do(self, text):
        return self.p.sub(self, text)

if __name__ == "__main__":
    sequence = """\
TCTTCTCCTCACCTCGCTCTCGCCGCCTGCTCGCCCCGNCCGCTTTGCTCGGCGCCCCAA
AACACNCTTCCACCATGNGCCACCTCGGCGAGCCCTCCCACTTGAACAAAGGGGTGCTCG
GCGCGTGTACNNATGGCCC\
"""
    expected="""TCTTCTCCTCACCTCGCTCTCGCCGCCTGCTCGCCCCGNCCGCTTTGCTCGGCG<b>CCCCAA
AACACN</b>CTTCCACCATGNGCC<font color="red">ACCTCGGCGAGCC</font>CTCCCACTTGAACAAAGGGGTGC<i>TCG
GCGCGTG</i>TACNNATGGCCC"""

    d=(('CCCCAAAACACN',('<b>','</b>')),
       ('TCGGCGCGTG',('<i>','</i>')),
       ('ACCTCGGCGAGCC',('<font color="red">','</font>')),
      )

    r=Enclose(d).do(sequence)
    assert r==expected
A simple [Iterator]
class FastaIterator:
    def __init__(self, ifile):
        self.ifile = ifile
        self.g = self.getGenerator()
    def getGenerator(self):
        lines = []
        for line in self.ifile:
            if line.startswith('>') and lines:
                # a new header line: emit the record collected so far
                yield ''.join(lines)
                lines = [line]
            else:
                lines.append(line)
        # end of input: emit the last record (nothing is emitted for an empty file)
        if lines:
            yield ''.join(lines)
    def __iter__(self):
        return self.g
Various conversion programs (using WxPython) --> wiki/FastaConvertor
 BioHackersNet