C8 Exercises
c8e1: 3-frame translation
In this exercise, the input file c8e1_input_NM_000558.5.fa contains the FASTA sequence of NM_000558, a spliced mRNA of HBA1 (hemoglobin subunit alpha 1, Homo sapiens) obtained from the NCBI repositories. See the NCBI entry for HBA1.
Detailed information on its spliced mRNA and translated protein can be found under:
NCBI Reference Sequences (RefSeq)
mRNA and Protein(s)
NM_000558.5 → NP_000549.1 hemoglobin subunit alpha
Spliced mRNA (NM_000558.5)
Corresponding protein (NP_000549.1)
From the same resource, the FASTA sequences of the mRNA and the protein can be retrieved:
mRNA: NM_000558.5
protein: NP_000549.1
Note that, as in any FASTA file, the first line (starting with >) contains the description of the sequence, and the following lines contain the sequence itself. For instance:
>NM_000558.5 Homo sapiens hemoglobin subunit alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGA...
a. Retrieve the input sequence from the FASTA file. Print it to standard output (the terminal) together with its length.
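One possible way to approach step a is sketched below. It is only a minimal sketch, not the course's reference solution: the function name read_fasta and the toy record are hypothetical, and it assumes the file holds a single record whose sequence may span several lines.

```python
def read_fasta(lines):
    """Return (description, sequence) for the first record in an iterable of FASTA lines."""
    description = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if description is not None:
                break  # a second record starts here; this sketch reads only the first
            description = line[1:]
        elif description is not None:
            chunks.append(line.upper())
    return description, "".join(chunks)


# Toy, made-up record (not the real NM_000558.5 entry), split over two sequence lines.
example = """>toy_record hypothetical example
ACTCTTCTGG
TCCCCACA
"""

description, sequence = read_fasta(example.splitlines())
print(description)
print(sequence, len(sequence))
```

With the real input, the same function could be fed an open file handle instead, for example `with open("c8e1_input_NM_000558.5.fa") as handle: description, sequence = read_fasta(handle)`.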
b. Initialize in your program a dictionary containing the standard genetic code.
For the sake of simplicity: with the information above you can easily set up a dictionary that translates a few codons. First code that dictionary by hand, without any help. Once you have checked that it works, copy the remaining items from the following dictionary:
standard_genetic_code = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
}
and display the dictionary in the standard output.
c. For each of the three frames, get a list of the “complete” codons and display them (separated by colons) in the standard output. Note: one codon is 3 nt, so if 1 or 2 nt are left over at the end of the sequence, they are simply ignored.
d. Translate the sequence of each frame to amino acids using the standard genetic code.
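The codon splitting in step c can be sketched as follows. This is a minimal sketch under stated assumptions: the function name codons_in_frame is hypothetical, frames are numbered 0, 1, 2 internally (printed as f1, f2, f3), and the toy sequence is just the first 11 nt of the example FASTA record shown earlier.

```python
def codons_in_frame(sequence, frame):
    """Split sequence into complete codons, starting at offset frame (0, 1 or 2)."""
    # Stopping at len(sequence) - 2 guarantees every slice has exactly 3 nt,
    # so 1 or 2 leftover nucleotides at the end are dropped.
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


toy = "ACTCTTCTGGT"  # 11 nt: frames 1 and 2 leave leftover nucleotides at the end

for frame in range(3):
    codons = codons_in_frame(toy, frame)
    print(f"f{frame + 1}: " + ":".join(codons))
```

Unlike the sample output, this sketch does not print a trailing colon after the last codon; appending one is a matter of output formatting only.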
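A possible shape for the translation in step d is sketched below. To keep the block self-contained, the codon table is rebuilt in a compact way (bases in T, C, A, G order paired with a 64-letter amino acid string); this is an alternative construction that should yield the same mapping as the standard_genetic_code dictionary listed above, not the form the exercise asks you to write. The names BASES, AMINO_ACIDS, genetic_code and translate are hypothetical, and mapping unknown codons (for example those containing N) to "X" is an assumption of this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
genetic_code = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame, table):
    """Translate the complete codons of sequence, starting at offset frame (0, 1 or 2)."""
    protein = []
    for i in range(frame, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        # Assumption: codons missing from the table are shown as "X".
        protein.append(table.get(codon, "X"))
    return "".join(protein)


# First 28 nt of the example record shown earlier in this exercise.
prefix = "ACTCTTCTGGTCCCCACAGACTCAGAGA"

for frame in range(3):
    print(f">f{frame + 1}")
    print(translate(prefix, frame, genetic_code))
```

On this short prefix the three translations begin like the corresponding f1, f2 and f3 entries in the sample output, which is a quick way to check that the frame offsets are handled as intended.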
e. Compare the sequence of the protein (NP_000549.1) annotated in NCBI with the 3-frame translated sequences. Use an online pairwise alignment tool. For instance:
Which of the 3-frame translated sequences corresponds to the protein? Does the whole mRNA align to the protein? How can that be possible?
Sample
Input:
c8e1_input_NM_000558.5.fa
Output a:
ACTCTTCTGGTCCCCACA...TAAAGTCTGAGTGGGCGGCA 577
Output b:
standard_genetic_code:
{'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K', 'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L', 'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*', 'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W'}
Output c:
f1: ACT:CTT:CTG:...:GTG:GGC:GGC:
f2: CTC:TTC:TGG:...:TGG:GCG:GCA:
f3: TCT:TCT:GGT:...:AGT:GGG:CGG:
Output d:
>f1
TLLVPTDSERTHHGAVSCRQDQRQGRLG*GRRARWRVWCGGPGEDVPVLP
HHQDLLPALRPEPRLCPG*GPRQEGGRRADQRRGARGRHAQRAVRPERPA
RAQASGGPGQLQAPKPLPAGDPGRPPPRRVHPCGARLPGQVPGFCEHRAD
LQIPLSWSLGGHASCPLGLPPAPPPLPAPVPPWSLNKV*VGG
>f2
LFWSPQTQREPTMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLH
AHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLT
SKYR*AGASVAMLLAPWASPQPLLPFLHPYPRGL*IKSEWAA
>f3
SSGPHRLRENPPWCCLLPTRPTSRPPGVRSARTLASMVRRPWRGCSCPSP
PPRPTSRTST*ATALPRLRATARRWPTR*PTPWRTWTTCPTRCPP*ATCT
RTSFGWTRSTSSS*ATACW*PWPPTSPPSSPLRCTPPWTSSWLL*APC*P
PNTVKLEPRWPCFLPLGPPPSPSSPSCTRTPVVFE*SLSGR
Output e:
You have to find out for yourself!