
05 Computing GC Content

Date: 2017-07-30 14:50:09

Tags: ros   absolute   problem   read   data   div   where   max   tool

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
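As a minimal sketch of the definition above, the following Python snippet computes the GC-content of the example string and checks that its reverse complement gives the same value. The function names `gc_content` and `reverse_complement` are illustrative choices, not part of the problem statement, and the code assumes the input contains only the symbols A, C, G and T.

```python
def gc_content(dna):
    """Return the percentage of symbols in dna that are 'G' or 'C'."""
    dna = dna.upper()
    return 100.0 * sum(1 for base in dna if base in "GC") / len(dna)


def reverse_complement(dna):
    """Return the reverse complement (assumes only A, C, G, T)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(dna.upper()))


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps G with C and A with T, so the count of G and C symbols is unchanged, which is why the two values agree.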

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
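The FASTA layout described above can be read with a small generator. This is only a sketch: `parse_fasta` is a hypothetical helper name, and the labels and sequences in the usage lines are made-up values chosen for illustration.

```python
def parse_fasta(lines):
    """Yield (label, sequence) pairs from an iterable of FASTA lines."""
    label, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        else:
            chunks.append(line)
    # The final record has no following header, so emit it here.
    if label is not None:
        yield label, "".join(chunks)


# Made-up records, for illustration only.
records = list(parse_fasta([">Rosalind_0001", "ACGT", "GGCC", ">Rosalind_0002", "TTAA"]))
print(records)  # [('Rosalind_0001', 'ACGTGGCC'), ('Rosalind_0002', 'TTAA')]
```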

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
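One way to read the 0.001 tolerance, assuming "absolute error" means the absolute difference between the submitted value and the reference value, is sketched below. The helper name `within_tolerance` is hypothetical and is not how Rosalind itself is known to check answers.

```python
def within_tolerance(answer, expected, tol=0.001):
    """Return True if answer differs from expected by at most tol."""
    return abs(answer - expected) <= tol


print(within_tolerance(60.9195, 60.919540))  # True
print(within_tolerance(60.91, 60.919540))    # False
```

Under this reading, printing the result with six decimal places is more precision than the check requires.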

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540


Method 1:
# -*- coding: utf-8 -*-


# Read the FASTA-format sequence file
with open('Computing_GC_Content.txt', 'r') as f:
    s = f.readlines()

# Two parallel lists: one for names, one for sequences
name_list = []
seq_list = []

data = ''  # accumulates a sequence that spans several lines

for line in s:
    line = line.strip()
    if line.startswith('>'):               # header line: a new record begins
        name_list.append(line[1:])
        if data:
            seq_list.append(data)          # store the sequence joined from the preceding lines
            data = ''                      # reset data for the next record
    else:
        data = data + line.upper()         # normalize to upper case and join
seq_list.append(data)                      # the last sequence has no following header, so append it after the loop

GC_list = []
for seq in seq_list:
    i = 0
    for k in seq:
        if k == 'G' or k == 'C':
            i += 1
    GC_cont = float(i) / len(seq) * 100.0
    GC_list.append(GC_cont)


m = max(GC_list)
print(name_list[GC_list.index(m)])         # name at the index of the max GC-content
print("{:0.6f}".format(m))                 # keep 6 decimal places
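The comment in the code above asks whether the last sequence can be handled inside the loop. One possible answer, sketched here rather than taken from the original post, is to accumulate each sequence directly under its label in a dictionary, so no record is left pending when the loop ends. The function name `highest_gc` is an illustrative choice, and the input is the sample dataset from the problem statement.

```python
SAMPLE = """\
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""


def highest_gc(fasta_text):
    """Return (label, gc_percent) for the record with the highest GC-content."""
    sequences = {}  # label -> sequence, filled in as lines are read
    label = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            label = line[1:]
            sequences[label] = ""
        elif label is not None:
            sequences[label] += line.upper()
    # GC percentage per record; empty records are skipped to avoid dividing by zero.
    gc = {name: 100.0 * (seq.count("G") + seq.count("C")) / len(seq)
          for name, seq in sequences.items() if seq}
    best = max(gc, key=gc.get)
    return best, gc[best]


best, value = highest_gc(SAMPLE)
print(best)                     # Rosalind_0808
print("{:0.6f}".format(value))  # 60.919540
```

The printed result matches the sample output: Rosalind_0808 has 53 G or C symbols out of 87, which is about 60.919540%.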

  


Original article: http://www.cnblogs.com/think-and-do/p/7258980.html
