BioinformaticsInformation에서 진행되는 BTC(Bioinformatics Training Course)의 스터디공간.

각종 assignment풀이시 궁금한점이나, 토론이 필요한 사항들. 혹은 관련 정보에 관해 공유하는 공간으로, 토론이 필요하다 싶으면, 해당 내용을 대표하는 PageName의 페이지를 만들고 그곳에서 토론한다.

BTC 강의 따라잡기

Assignment1

1.Assume that you are given a sorted array, A1, and trying to find out an item in the array. One obvious method is searching the array from the top toward the bottom, which has the ComputationalComplexity of O(n), where n is the size of the array. Devise other method that has better asymptotic complexity than O(n). If a new array, A2, is an unsorted one, what can be the best asymptotic complexity? If you have to search the array, A2, many times, then what should you do to obtain a better overall performance?

2. Describe how you would choose "the window size" and "the step size" when you use sliding window method.

3. Suppose that you are trying to choose a threshold for a sequence analysis program based on sliding window method. You have some positive examples and some negative examples, meaning that their identities are predetermined by some outside knowledge. Having these examples, how would you use them in your decision for choosing the threshold?

4. Most higher eukaryotes have GC contents not far removed from 50%. However, some prokaryotes have strongly biased GC contents. A noticeable example is thermophilic bacteria and they require it to withstand high temperature. Calculate the average Shannon's entropy (in bits) for the genome with the GC content of 50%. Also, calculate it for the genome with the GC content of 80%. Considering genomes as the recording media for protein sequences, how can the latter genome overcome the weakness of being poor recording media?

5. Keratin and collagen are the most abundant animal proteins. Study their structures from a biochemistry textbook if you have not done so yet. These structural proteins exhibit problems when doing sequence alignment. Explain.

6. Which one, DNA sequence or protein sequence, would yield better performance in general for Boyer-Moore string search algorithm?

7. A novel DNA sequence, which does not show any significant homology with GenBank sequences, was given to you by someone who is infamous for his ill nature. Since the sequence was from that person, you are starting to doubt that it may not be of biological origin. Devise as many methods as you can think of to confirm that it is not biologically originated.

8. Draw deterministic finite state automata that can identify all codons for the positively charged amino acids.

9. Below is a multiple alignment of an imaginary promoter region. First, make a consensus sequence for the region. Second, write a regular expression that may describe the region best. (You can either use the Unix style regular expression or the PROSITE style.) Third, make a position specific weight matrix. Make your own choice for the problem of unobserved nucleotides and describe how you did and why.

CCGTGTTC AGGTCGTG GCCTGGTC AGATCATT GAATGTTC AGGTGTTC

What is the scores obtained using the position specific weight matrix for the following two sequences? Which sequence would be the better candidate for the promoter when using the consensus sequence method, the regular expression method, and the position specific weight matrix method, respectively? Explain the results.

Sequence 1 : AGGCGTAC Sequence 2 : CAGTCATC

10. A fanciful looking result of a novel motif modeling method was just presented in front of you. The result looked so good that it seemed almost perfect, meaning that it could detect all the motif sites in the sequences they used and didn't detect any false site. Now, it is your turn to say something about this incredible accomplishment. What would you say?

11. Mammalian genomes are significantly lack of CG dinucleotide. CpG island is the region with higher frequency of CG dinucleotide around the translation starting site of genes. CpG island can be found in most of house keeping genes. Therefore, it is one of the important clues to find a gene. (The structure of mammalian genes are very complex and it is hard to find the starting site of gene.) Write down two possible transition matrices for the Markov models to detect CpG islands, one for CpG island regions and the other for non-CpG island regions. (For extra points, you can use real data obtained from GenBank to calculate the matrices.)

12. To have a large value in PAM or BLOSUM matrices, what property should a pair of amino acids have? Answer it by verbally describing the meaning of the log odd ratio used to calculate the matrices.

13. Write the table for the dynamic programming algorithm that aligns the following two sequences. And write the resulting alignments separately. Use the global alignment method.

ACGTAGTGAGCTCGCG AGCTGCTAGCTCGGCTG

14. Affine gap penalty is commonly used one. Logarithmic gap penalty has been attempted, too. Why do we try it even if it worsens the computing time? Also, why can't we use more complex gap penalties such as periodic gap penalty?

15.Why do BLOSUM matrices serve as better weight matrices than PAM matrices?

16. The alignment score between two aligned sequences is simply the sum of corresponding values in weight matrix minus the sum of gap penalties. However, alignments are obtained as a list by matching against a large collection of biological sequences (i.e., GenBank). In the list, the ones with high scores are considered to be due to homology (meaning common ancestry). The problem we are confronting here is that homologous proteins are not always well conserved and two non-homologous proteins can give a quite high score by random chance. Another deteriorating factor is that these sequences code for proteins that have the constraint of being physical entities (e.g., being a certain type of fold, etc.). Also, a worsening effect on this problem is that the distribution of alignment scores obtained from the search against a large collection of sequences follows extreme value distribution. Explain (within 5 lines) why this fact (i.e., following extreme value distribution) worsens the problem? Remember that the amount of biological sequences in the collection is getting ever bigger and bigger, too. Considering above discussions, what would you do to overcome the problem. (within 7 lines)

17. In spite of its weaknesses, the original BLAST (i.e., BLAST version 1.x) had been used as the primary sequence search tool for several years. FASTA was another one that was competing with BLAST. Answer the following questions (within 5 lines for each):

- What was their common weakness? - Why BLAST was more popular than FASTA? - What is the major improvement of BLAST version 2.x over BLAST version 1.x?

18. cDNA sequences are obtained by sequencing mRNA via complementary DNA. Sometimes we want to find out the corresponding genomic DNA for a cDNA sequence. For such purpose, how would you adjust gap penalties? If you are going to use BLASTX (instead of BLASTN) for the purpose, which weight matrix would you use?

19. After searching GenBank using BLAST with a protein sequence of your interest, you couldn't find any highly significant match. However, some sequences with marginal significance were detected. When you examined the amino acid compositions of the detected sequences, you found that they were rich in Pro, Gly, Lys, and Leu. Explain why these matches may not be of biological significance?

20. You are designing a procedure for a DNA computer following the original method by Adleman. The aim is to find Hamiltonian paths of the following graph:

http://www.bioinformatics.pe.kr/course/assignments/graph.jpg

Comparing to the graph used in the Adleman's original work, why is this graph much easier for DNA computer? What would be the most critical step that gets rid of the majority of non-solution oligomers?

Assignment 문제풀이시 토론사항

  • 어쩌고저쩌고문제

참고자료


RefactorMe... 관심있으신분들의 많은 RefactoringPages를 바랍니다.

web biohackers.net