C2 Exercises
c2e1: GC and AT content#
Calculate the GC and AT content of a DNA sequence. The GC content is the percentage of bases in the sequence that are G or C; the AT content is the percentage that are A or T. For instance, “AAAGGGCCCTTT” has 50% GC content and 50% AT content.
Sample#
Input:
ATGACAGCCATCATCAAAGAGATCGTTAGCAGAAACAAAAGGAGATATCAAGAGG
Output:
GC content: 40.0 %
AT content: 60.0 %
Tips#
Test your code with a simple DNA sequence whose result you can check by hand. For instance:
ATGC
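One possible way to approach the calculation is sketched below. It assumes the sequence contains only the uppercase letters A, C, G and T; the variable names are illustrative and not part of the exercise.

```python
dna = "ATGACAGCCATCATCAAAGAGATCGTTAGCAGAAACAAAAGGAGATATCAAGAGG"

# str.count returns how many times a base occurs in the string
gc_count = dna.count("G") + dna.count("C")
at_count = dna.count("A") + dna.count("T")

# Express each count as a percentage of the sequence length
gc_percent = 100 * gc_count / len(dna)
at_percent = 100 * at_count / len(dna)

print("GC content:", gc_percent, "%")
print("AT content:", at_percent, "%")
```

Replacing the value of `dna` with `"ATGC"` should give 50.0 for both percentages, which is easy to verify by hand.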
c2e2: reverse a DNA sequence#
Given a DNA sequence, obtain the reverse sequence.
Sample#
Input:
AATGACA
Output:
ACAGTAA
Tips#
If you reverse it again, you should obtain the original DNA template.
Surround your sequence with parentheses to check that it contains no whitespace. For instance:
(ACAGTAA)
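A minimal sketch using Python's extended slicing, which is one of several ways to reverse a string. The variable names are illustrative.

```python
dna = "AATGACA"

# A slice with step -1 walks the string from the last base to the first
reverse = dna[::-1]

# Parentheses make any stray whitespace at the ends visible
print("(" + reverse + ")")

# Reversing twice should give back the original sequence
print(reverse[::-1] == dna)
```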
c2e3: complement of a sequence#
Given a DNA sequence, show the sequence together with the matching sequence on the other strand, that is, the complementary sequence:
ATCC
||||
TAGG
Sample#
Input:
AATGACA
Output:
AATGACA
|||||||
TTACTGT
Tips#
Be careful: if you replace every A with T, you cannot then simply replace every T with A to obtain the complementary sequence of the original DNA, because the second replacement would also change the T's you have just created. Or can you?
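One way to avoid the replacement pitfall described in the tip is to map all four bases in a single pass. The sketch below uses `str.maketrans` and `str.translate` from the standard library; it assumes an uppercase sequence of A, C, G and T, and is only one of several possible solutions (chained `replace` calls through lowercase placeholders would also work).

```python
dna = "AATGACA"

# Translation table: A->T, C->G, G->C, T->A, applied to every base at once,
# so a T produced from an A is never turned back into an A
table = str.maketrans("ACGT", "TGCA")
complement = dna.translate(table)

print(dna)
print("|" * len(dna))  # one bar per base pair
print(complement)
```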
c2e4: reverse complement of a sequence#
Given a DNA sequence, find its reverse complement.
Sample#
Input:
AATGACA
Output:
TGTCATT
Extra#
The reverse complement of the reverse complement should be the original DNA template
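The check above can be sketched by combining a complement step with a reversal. The translation-table approach is an assumption about how the complement is computed; any correct complement routine would serve equally well.

```python
dna = "AATGACA"

# Complement every base in one pass, then reverse the result
table = str.maketrans("ACGT", "TGCA")
revcomp = dna.translate(table)[::-1]
print(revcomp)

# Applying the same operation twice should restore the original template
print(revcomp.translate(table)[::-1] == dna)
```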
c2e5: restriction enzymes and their recognition sites#
Restriction enzymes recognize a specific sequence of nucleotides and produce a double-stranded cut in the DNA. For instance, EcoRI, a restriction endonuclease isolated from E. coli, cuts at G*AATTC, and SmaI (from Serratia marcescens) cuts at CCC*GGG, where * marks the position of the cut.
For more information, see the list of restriction enzyme cutting sites on Wikipedia.
For instance, SmaI will cut the following DNA sequence like this:
__________
          \
TCAGATGCCC \ GGGACTAGTTTTC
            \_____________
Given a DNA sequence that SmaI cuts into two fragments, provide the two fragments and their lengths, one line per fragment.
Sample#
Input:
AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCCGGGACTAGTTTTC
Output:
AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCC 59
GGGACTAGTTTTC 13
Tips#
Use your text editor to validate the length of your fragments
If the enzyme does not cut the sequence, the program will fail unless you prepare your code for that case; we have not learned how to do that yet, so be very careful.
Extra#
Can EcoRI cut the first fragment produced by SmaI? If so, provide all the final fragments and their lengths.
Extra output:
AATTCAGATGCTGTTAGTACCTACATCAGTG 31
AATTCCAACAACTTACACTTATTTTCCC 28
GGGACTAGTTTTC 13
Extra check#
If you join the fragments again, you should obtain the original DNA template.
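The whole exercise, including the extra part and the join check, could be sketched as follows. The helper `cut_once` and its `offset` parameter are hypothetical names introduced for this illustration; the helper cuts only at the first occurrence of the site, and it uses an `if` statement to handle a missing site, which goes beyond what the tips assume you know so far.

```python
def cut_once(seq, site, offset):
    """Cut seq at the first occurrence of site, offset bases into the site.

    Returns a list with two fragments, or with the whole sequence
    if the site does not occur.
    """
    pos = seq.find(site)  # str.find returns -1 when the site is absent
    if pos == -1:
        return [seq]
    cut = pos + offset
    return [seq[:cut], seq[cut:]]


dna = "AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCCGGGACTAGTTTTC"

# SmaI: CCC*GGG, so the cut falls 3 bases into the recognition site
sma_fragments = cut_once(dna, "CCCGGG", 3)

# EcoRI: G*AATTC, so the cut falls 1 base into the recognition site
eco_fragments = cut_once(sma_fragments[0], "GAATTC", 1)

# Final fragments in their original order
final = eco_fragments + sma_fragments[1:]
for fragment in final:
    print(fragment, len(fragment))

# Joining the fragments should rebuild the original template
print("".join(final) == dna)
```

Returning a list in both cases keeps the calling code the same whether or not the enzyme cuts.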
c2e6: splicing structure#
A DNA sequence (from the + strand) has the following structure:
Exon1-Intron-Exon2
The first exon runs from the start of the DNA sequence up to the position 63.
Note: be careful with your problem formulation. In this case, position 63 marks the end of the first exon but is itself within the intron, so the last base of exon 1 is position 62.
Another important note: these are biological coordinates (counted from 1), not Python string indices (counted from 0). Do not mix them up.
The second exon starts at position 91 (biological coordinates) and runs to the end of the sequence. Assume that both exons are entirely protein-coding (CDS: coding DNA sequence).
a. Print the exon sequences and their lengths, one line per sequence.
b. Calculate the percentage of the sequence that codes for protein.
c. Print the input DNA with the CDS in uppercase and the non-CDS in lowercase; this is a standard format for plain-text sequences in computational biology.
Sample#
Input:
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT
Output a:
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCG 62
ATCATCGATCGATATCGATGCATCGACTACTAT 33
Output b:
77.23577235772358
Output c:
ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGatcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT
Tips#
Because the sequence is from the + strand, it needs no modification. If the sequence were from the - strand, the reverse complement would need to be calculated.
The programmer has to be very careful with the coordinates. It is usually a good idea to check your sequence in a genome browser such as
https://genome.ucsc.edu/. However, you cannot use a genome browser in this case, because you would need to know the species and genome version that your sequence comes from.
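The coordinate handling for this exercise can be sketched as below. The sketch assumes, as the notes in the problem statement say, that position 63 is the first intron base and that a biological position p corresponds to Python index p - 1; the variable names are illustrative.

```python
dna = "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT"

exon1_end = 63    # biological position of the first intron base
exon2_start = 91  # biological position of the first base of exon 2

# Biological position p is Python index p - 1, and a slice end is exclusive,
# so dna[:exon1_end - 1] covers biological positions 1 to 62
exon1 = dna[:exon1_end - 1]
intron = dna[exon1_end - 1:exon2_start - 1]
exon2 = dna[exon2_start - 1:]

# a. exon sequences and lengths
print(exon1, len(exon1))
print(exon2, len(exon2))

# b. percentage of the sequence that is coding
print(100 * (len(exon1) + len(exon2)) / len(dna))

# c. CDS in uppercase, non-CDS in lowercase
print(exon1 + intron.lower() + exon2)
```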
Extra#
Something in this problem looks wrong from a biological point of view. Tip: look at the splice donor and acceptor sites.
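As a starting point for this check, the sketch below extracts the two bases at each end of the intron. It relies on the well-known observation that most introns begin with GT (donor site) and end with AG (acceptor site); the intron string is copied from the lowercase part of Output c, and the variable names are illustrative.

```python
intron = "atcgatcgatcgatcgatcgatcatgct"  # lowercase region of Output c

donor = intron[:2].upper()      # first two bases of the intron
acceptor = intron[-2:].upper()  # last two bases of the intron

print("donor:", donor, "acceptor:", acceptor)

# True only if the intron follows the canonical GT...AG pattern
print(donor == "GT" and acceptor == "AG")
```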