C8 Exercises

c8e1: 3-frame translation

In this exercise, the input file (c8e1_input_NM_000558.5.fa) contains the FASTA sequence of NM_000558. This is a spliced mRNA of HBA1 (hemoglobin subunit alpha 1, Homo sapiens) obtained from the NCBI repositories. See the next entry for HBA1.

Detailed information on its spliced mRNA and translated protein can be found at:

  • NCBI Reference Sequences (RefSeq)

    • mRNA and Protein(s)

      • NM_000558.5 → NP_000549.1 hemoglobin subunit alpha

In the same resource, the fasta sequences of the mRNA and protein can be retrieved:

Note that, as in any FASTA file, the first line (starting with ">") contains the description of the sequence, and the following lines contain the sequence itself. For instance:

>NM_000558.5 Homo sapiens hemoglobin subunit alpha 1 (HBA1), mRNA
ACTCTTCTGGTCCCCACAGACTCAGAGA...

a. Retrieve the input sequence from the fasta file. Display it in your standard output (terminal) together with its length.
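One possible way to approach step a is sketched below. It assumes a single-record FASTA file; the function name `read_fasta` and the toy record used in the demo are illustrative and not part of the exercise material. To use it on the real input, pass the path of c8e1_input_NM_000558.5.fa instead of the temporary file.

```python
import os
import tempfile


def read_fasta(path):
    """Return (description, sequence) for a single-record FASTA file."""
    description = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                description = line[1:]  # drop the leading '>'
            else:
                chunks.append(line.upper())  # sequence may span several lines
    return description, "".join(chunks)


# Demo on a tiny made-up record (hypothetical data, written to a temp file).
demo_record = ">demo toy sequence\nACTCTT\nCTGGTC\n"
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(demo_record)
description, sequence = read_fasta(tmp.name)
os.remove(tmp.name)

print(description)
print(sequence, len(sequence))
```

Joining the sequence lines before measuring the length matters: the length asked for in step a is the number of nucleotides, not the number of characters in the file.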

b. Initialize in your program a dictionary containing the standard genetic code.

For the sake of simplicity: first code a small dictionary by hand that translates only a few codons, without any help. Once you have checked that it works, copy the remaining items from the following dictionary:

standard_genetic_code = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
    'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W',
}

and display the dictionary in the standard output.
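When typing the dictionary by hand it is easy to skip a codon. The sketch below shows one way to check which of the 64 possible codons are still missing; the helper name `missing_codons` and the deliberately incomplete `partial_code` dictionary are assumptions made for illustration.

```python
from itertools import product


def missing_codons(genetic_code):
    """Return the codons (out of all 64) that are not keys of `genetic_code`."""
    all_codons = ("".join(triplet) for triplet in product("ACGT", repeat=3))
    return [codon for codon in all_codons if codon not in genetic_code]


# A deliberately incomplete, hand-typed start (entries taken from the table above).
partial_code = {"ATG": "M", "TAA": "*", "TAG": "*", "TGA": "*", "TGG": "W"}

missing = missing_codons(partial_code)
print(len(partial_code), "codons defined,", len(missing), "still missing")
print("partial_code:\n", partial_code)
```

Once the full dictionary has been copied in, the same call should return an empty list.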

c. For each of the three frames, get a list of the “complete” codons and display them (separated by colons) in the standard output. Note: a codon is exactly 3 nt, so if fewer than 3 nt remain at the end of the sequence, we simply ignore them.
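A minimal sketch of the codon splitting described in step c, assuming the frames are numbered by their 0-based start offset (0, 1, 2). The function name `codons_in_frame` is illustrative, and the demo uses only the first 10 nt of the sample sequence.

```python
def codons_in_frame(sequence, frame):
    """Return the complete codons of `sequence`, reading from offset `frame` (0, 1 or 2)."""
    # Stop at len - 2 so that every slice is a full 3-nt codon;
    # any 1 or 2 trailing nucleotides are ignored.
    return [sequence[i:i + 3] for i in range(frame, len(sequence) - 2, 3)]


# Demo on the first 10 nt of the sample sequence.
fragment = "ACTCTTCTGG"
for frame in range(3):
    codons = codons_in_frame(fragment, frame)
    print(f"f{frame + 1}:\t{':'.join(codons)}")
# f1:	ACT:CTT:CTG
# f2:	CTC:TTC:TGG
# f3:	TCT:TCT
```

Note that `':'.join(...)` does not print a colon after the last codon, unlike the sample output; appending one is a small change if an exact match is wanted.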

d. Translate the sequence of each frame to amino acids using the standard genetic code.
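The translation in step d can be sketched as a codon-by-codon dictionary lookup. To keep the example self-contained, the table is built here from the compact layout used for the standard code (NCBI translation table 1) rather than typed out; it should hold the same 64 entries as the dictionary from step b. The names `CODON_TABLE` and `translate`, and the choice of 'X' for codons that are not in the table, are assumptions of this sketch.

```python
from itertools import product

# Standard genetic code in compact form: codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame, table=CODON_TABLE):
    """Translate `sequence` from offset `frame` (0, 1 or 2), ignoring trailing nt."""
    protein = []
    for i in range(frame, len(sequence) - 2, 3):
        codon = sequence[i:i + 3]
        # Assumption: unknown codons (e.g. containing N) are shown as 'X'.
        protein.append(table.get(codon, "X"))
    return "".join(protein)


# Demo on the first 18 nt of the sample sequence.
fragment = "ACTCTTCTGGTCCCCACA"
for frame in range(3):
    print(f">f{frame + 1}")
    print(translate(fragment, frame))
# >f1
# TLLVPT
# >f2
# LFWSP
# >f3
# SSGPH
```

Stop codons are kept as '*' and translation continues past them, which is what makes the three full-length frame translations comparable in the alignment step.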

e. Compare the sequence of the protein (NP_000549.1) annotated in NCBI with the 3-frame translated sequences. Use an online pairwise alignment tool. For instance:

Which of the 3-frame translated sequences corresponds to the protein? Does the whole mRNA align to the protein? How can that be possible?

Sample

Input:

c8e1_input_NM_000558.5.fa

Output a:

ACTCTTCTGGTCCCCACA...TAAAGTCTGAGTGGGCGGCA 577

Output b:

standard_genetic_code:
 {'ATA': 'I', 'ATC': 'I', 'ATT': 'I', 'ATG': 'M', 'ACA': 'T', 'ACC': 'T', 'ACG': 'T', 'ACT': 'T', 'AAC': 'N', 'AAT': 'N', 'AAA': 'K', 'AAG': 'K', 'AGC': 'S', 'AGT': 'S', 'AGA': 'R', 'AGG': 'R', 'CTA': 'L', 'CTC': 'L', 'CTG': 'L', 'CTT': 'L', 'CCA': 'P', 'CCC': 'P', 'CCG': 'P', 'CCT': 'P', 'CAC': 'H', 'CAT': 'H', 'CAA': 'Q', 'CAG': 'Q', 'CGA': 'R', 'CGC': 'R', 'CGG': 'R', 'CGT': 'R', 'GTA': 'V', 'GTC': 'V', 'GTG': 'V', 'GTT': 'V', 'GCA': 'A', 'GCC': 'A', 'GCG': 'A', 'GCT': 'A', 'GAC': 'D', 'GAT': 'D', 'GAA': 'E', 'GAG': 'E', 'GGA': 'G', 'GGC': 'G', 'GGG': 'G', 'GGT': 'G', 'TCA': 'S', 'TCC': 'S', 'TCG': 'S', 'TCT': 'S', 'TTC': 'F', 'TTT': 'F', 'TTA': 'L', 'TTG': 'L', 'TAC': 'Y', 'TAT': 'Y', 'TAA': '*', 'TAG': '*', 'TGC': 'C', 'TGT': 'C', 'TGA': '*', 'TGG': 'W'}

Output c:

f1:	ACT:CTT:CTG:...:GTG:GGC:GGC:
f2:	CTC:TTC:TGG:...:TGG:GCG:GCA:
f3:	TCT:TCT:GGT:...:AGT:GGG:CGG:

Output d:

>f1
TLLVPTDSERTHHGAVSCRQDQRQGRLG*GRRARWRVWCGGPGEDVPVLP
HHQDLLPALRPEPRLCPG*GPRQEGGRRADQRRGARGRHAQRAVRPERPA
RAQASGGPGQLQAPKPLPAGDPGRPPPRRVHPCGARLPGQVPGFCEHRAD
LQIPLSWSLGGHASCPLGLPPAPPPLPAPVPPWSLNKV*VGG
>f2
LFWSPQTQREPTMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLH
AHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLT
SKYR*AGASVAMLLAPWASPQPLLPFLHPYPRGL*IKSEWAA
>f3
SSGPHRLRENPPWCCLLPTRPTSRPPGVRSARTLASMVRRPWRGCSCPSP
PPRPTSRTST*ATALPRLRATARRWPTR*PTPWRTWTTCPTRCPP*ATCT
RTSFGWTRSTSSS*ATACW*PWPPTSPPSSPLRCTPPWTSSWLL*APC*P
PNTVKLEPRWPCFLPLGPPPSPSSPSCTRTPVVFE*SLSGR

Output e:

You have to find it out yourself!