C2 Exercises#

c2e1: GC and AT content#

Calculate the GC and AT content of a DNA sequence. The GC content is the percentage of bases in the sequence that are G or C, and the AT content is the percentage that are A or T. For example, “AAAGGGCCCTTT” has 50% GC content and 50% AT content.

Sample#

Input:

ATGACAGCCATCATCAAAGAGATCGTTAGCAGAAACAAAAGGAGATATCAAGAGG

Output:

GC content:  40.0 %
AT content:  60.0 %

Tips#

  • Try a simple DNA sequence first to check that your code gives a sensible result. For instance:

ATGC
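One possible approach, sketched here with Python string methods only: count each base with `str.count` and divide by the sequence length. The variable names (`dna`, `gc_count`, and so on) are illustrative choices, and the sketch assumes the input contains only uppercase A, C, G and T.

```python
# Sample input from the exercise; assumed to contain only uppercase A, C, G, T.
dna = "ATGACAGCCATCATCAAAGAGATCGTTAGCAGAAACAAAAGGAGATATCAAGAGG"

# str.count returns how many times a base occurs in the sequence.
gc_count = dna.count("G") + dna.count("C")
at_count = dna.count("A") + dna.count("T")

# Express each count as a percentage of the total length.
gc_content = 100 * gc_count / len(dna)
at_content = 100 * at_count / len(dna)

print("GC content: ", gc_content, "%")
print("AT content: ", at_content, "%")
```

For the sample input this gives 40.0 and 60.0. Replacing `dna` with `"ATGC"` should give 50.0 for both, which is a quick way to check the arithmetic.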

c2e2: reverse a DNA sequence#

Given a DNA sequence, obtain its reverse sequence.

Sample#

Input:

AATGACA

Output:

ACAGTAA

Tips#

  • If you reverse the result again, you should obtain the original DNA template.

  • Surround your sequence with parentheses to check that no whitespace is within the sequence. For instance:

(ACAGTAA)
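A minimal sketch of one way to do this, using a slice with a step of -1. The variable names are illustrative, and other approaches (for example `reversed` combined with `"".join`) would work equally well.

```python
dna = "AATGACA"

# A slice with step -1 walks the string from the last base to the first.
reverse = dna[::-1]
print(reverse)

# Parentheses make any stray whitespace at the ends visible.
print("(" + reverse + ")")

# Reversing twice should give back the original template.
restored = reverse[::-1]
print(restored == dna)
```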

c2e3: complement of a sequence#

Given a DNA sequence, show it together with the corresponding sequence on the other strand, that is, the complementary sequence:

ATCC
||||
TAGG

Sample#

Input:

AATGACA

Output:

AATGACA
|||||||
TTACTGT

Tips#

  • Be careful: once you have replaced every A with T, you cannot simply replace every T with A to obtain the complementary sequence of the original DNA. Can you see why?
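One way around the problem described in the tip, sketched under the assumption that the input is uppercase: write each complementary base in lowercase first, so that a base that has already been replaced cannot be replaced a second time, and convert to uppercase at the end. The names `marked`, `complement` and `bonds` are illustrative.

```python
dna = "AATGACA"

# Write the complements in lowercase so that an already replaced base
# is never mistaken for one that still has to be replaced.
marked = dna.replace("A", "t").replace("T", "a").replace("C", "g").replace("G", "c")

# Convert the lowercase placeholders back to uppercase.
complement = marked.upper()

# One bond character per base pair.
bonds = "|" * len(dna)

print(dna)
print(bonds)
print(complement)
```

Python's `str.maketrans` and `str.translate` offer an alternative that replaces all four bases in a single pass.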


c2e4: reverse complement of a sequence#

Given a DNA sequence, find its reverse complement.

Sample#

Input:

AATGACA

Output:

TGTCATT

Extra#

  • The reverse complement of the reverse complement should be the original DNA template
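The check above can be sketched with a small helper function. The names `COMPLEMENT` and `reverse_complement` are hypothetical, and the sketch assumes uppercase input containing only A, C, G and T. It uses `str.maketrans` and `str.translate`, which swap all four bases in one pass.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    # Complement every base, then reverse the result.
    return seq.translate(COMPLEMENT)[::-1]


dna = "AATGACA"
result = reverse_complement(dna)
print(result)

# Applying the function twice should give back the original template.
print(reverse_complement(result) == dna)
```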


c2e5: restriction enzymes and their recognition sites#

Restriction enzymes recognize a specific nucleotide sequence and produce a double-stranded cut in the DNA. For instance, EcoRI, a restriction endonuclease isolated from E. coli, cuts at G*AATTC, and SmaI (from Serratia marcescens) cuts at CCC*GGG, where * marks the position of the cut.
For more information, see the list of restriction enzyme cutting sites on Wikipedia.

For instance, SmaI cuts the following DNA sequence like this:

    ______
          \
TCAGATGCCC \ GGGACTAGTTTTC
            \______

Suppose SmaI cuts a DNA sequence into two fragments.
Provide the two fragments and their lengths, one line per fragment.

Sample#

Input:

AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCCGGGACTAGTTTTC

Output:

AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCC 59
GGGACTAGTTTTC 13

Tips#

  • Use your text editor to validate the length of your fragments

  • If the enzyme does not cut the sequence, the program will fail unless your code is prepared for that case; we have not learned how to do that yet, so be very careful.
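A possible sketch of the SmaI cut, assuming the recognition site occurs exactly once in the sequence. The names `site` and `cut_offset` are illustrative: `cut_offset` is the number of bases of the site that stay on the left fragment (3 for CCC*GGG).

```python
dna = "AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCCGGGACTAGTTTTC"

site = "CCCGGG"   # SmaI recognition site
cut_offset = 3    # SmaI cuts between CCC and GGG

# Index of the first base of the site (assumes the site is present).
site_start = dna.find(site)

# Index of the first base to the right of the cut.
cut = site_start + cut_offset

fragment1 = dna[:cut]
fragment2 = dna[cut:]

print(fragment1, len(fragment1))
print(fragment2, len(fragment2))
```

Note that `str.find` returns -1 when the site is absent, so the slices would then be meaningless rather than raise an error; `str.index` raises a `ValueError` in that situation instead.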

Extra#

  • Can EcoRI cut the first fragment already cut by SmaI? If so, provide all the final fragments and their lengths.

Extra output:

AATTCAGATGCTGTTAGTACCTACATCAGTG 31
AATTCCAACAACTTACACTTATTTTCCC 28
GGGACTAGTTTTC 13

Extra check#

  • If you join the fragments again, you should obtain the original DNA template.
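The extra step and the final check might look like the sketch below. It assumes that each recognition site occurs exactly once, and the names `fragment1a` and `fragment1b` are made up for illustration.

```python
dna = "AATTCAGATGCTGTTAGTACCTACATCAGTGAATTCCAACAACTTACACTTATTTTCCCGGGACTAGTTTTC"

# First cut: SmaI, CCC*GGG (3 bases of the site stay on the left).
smai_cut = dna.find("CCCGGG") + 3
fragment1 = dna[:smai_cut]
fragment2 = dna[smai_cut:]

# Second cut, on the first fragment only: EcoRI, G*AATTC
# (1 base of the site stays on the left).
ecori_cut = fragment1.find("GAATTC") + 1
fragment1a = fragment1[:ecori_cut]
fragment1b = fragment1[ecori_cut:]

print(fragment1a, len(fragment1a))
print(fragment1b, len(fragment1b))
print(fragment2, len(fragment2))

# Joining the fragments in order should rebuild the original template.
rejoined = fragment1a + fragment1b + fragment2
print(rejoined == dna)
```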


c2e6: splicing structure#

A DNA sequence (from the + strand) has the following structure:

Exon1-Intron-Exon2

The first exon runs from the start of the DNA sequence up to position 63.

  • Note: be careful with your problem formulation. In this case, position 63 marks the end of the exon, but it already lies within the intron.

  • Another important note: these are biological coordinates, not Python string indices. Do not mix them up.
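The difference between the two coordinate systems can be illustrated on a short, made-up sequence (the sequence and the variable names below are assumptions for illustration, not part of the exercise). Biological coordinates count from 1 and include both ends, whereas Python slices count from 0 and exclude the end index.

```python
# Hypothetical toy sequence, used only to illustrate the conversion.
seq = "ACGTTGCA"

# Biological coordinates: 1-based, both ends included.
bio_start = 3
bio_end = 5

# Python slice: subtract 1 from the start and keep the end unchanged,
# because the end index of a slice is exclusive.
region = seq[bio_start - 1:bio_end]

print(region)        # bases 3, 4 and 5
print(len(region))   # equals bio_end - bio_start + 1
```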

The second exon starts at position 91 (biological coordinates) and runs to the end of the sequence. Assume that both exons are entirely protein-coding (CDS: coding DNA sequence).
a. Print the exon sequences and their lengths, one line per sequence.
b. Calculate the percentage of the sequence that codes for protein.
c. Print the input DNA with the CDS in uppercase and the non-CDS in lowercase; this is a standard format for plain-text sequences in computational biology.

Sample#

Input:

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT

Output a:

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCG 62
ATCATCGATCGATATCGATGCATCGACTACTAT 33

Output b:

77.23577235772358

Output c:

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGatcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT

Tips#

  • Because the sequence is from the + strand, it needs no modification. If it were from the - strand, you would need to compute its reverse complement.

  • Be very careful with the coordinates. It is usually a good idea to check your sequence in a genome browser such as
    https://genome.ucsc.edu/. You cannot use the genome browser in this case, however, because you would need to know the species and genome version corresponding to your sequence.
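A sketch of parts a, b and c using slices, under the reading that exon 1 covers biological positions 1 to 62 (position 63 being the first intron base) and exon 2 covers position 91 to the end. The variable names `exon1_end` and `exon2_start` are illustrative, and the input string is split across lines only for readability.

```python
dna = (
    "ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCG"
    "ATCGATCGATCGATCGATCGATCGATCGATCGATCATGCT"
    "ATCATCGATCGATATCGATGCATCGACTACTAT"
)

exon1_end = 62     # last base of exon 1 (biological coordinate)
exon2_start = 91   # first base of exon 2 (biological coordinate)

# Convert to Python slices: the start loses 1, the end stays as it is.
exon1 = dna[:exon1_end]
intron = dna[exon1_end:exon2_start - 1]
exon2 = dna[exon2_start - 1:]

# a. Exon sequences and their lengths.
print(exon1, len(exon1))
print(exon2, len(exon2))

# b. Percentage of the sequence that codes for protein.
coding_percentage = 100 * (len(exon1) + len(exon2)) / len(dna)
print(coding_percentage)

# c. CDS in uppercase, non-CDS in lowercase.
formatted = exon1.upper() + intron.lower() + exon2.upper()
print(formatted)
```

With these coordinates the exons are 62 and 33 bases long, out of 123 in total, so about 77.24% of the sequence is coding.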

Extra#

  • Something in this problem looks wrong from a biological point of view. Tip: look at the splice donor and acceptor sites.
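One way to look at the splice sites in code, assuming the canonical rule that most eukaryotic introns begin with GT (donor site) and end with AG (acceptor site). The intron below is the lowercase stretch of Output c, written in uppercase, and the variable names are illustrative.

```python
# Intron taken from Output c (biological positions 63 to 90).
intron = "ATCGATCGATCGATCGATCGATCATGCT"

# First two and last two bases of the intron.
donor = intron[:2]
acceptor = intron[-2:]

print(donor, acceptor)

# Compare against the canonical GT...AG pattern (an assumption of this sketch).
print(donor == "GT")
print(acceptor == "AG")
```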