C4 Exercises

c4e1: Processing DNA in a file

The input file (“c4e1_input_seqs.txt”) contains 5 different DNA sequences, one per line. Every sequence starts with the same 14 nt fragment (from a sequencing adapter) and ends with 20 nt of a repeat (poly_ATGC, that is, the repeat ATGC 5 times).

Write an “alignment” to an output file (“c4e1_output.fasta”). For each input sequence, the output should contain:

  • One line with the original sequence followed by its length.

  • Another line with the “shifted” sequence (without the 14 nt fragment and without the 20 nt poly_ATGC) followed by its length (not counting the starting fragment or the ending repeats).

  • Between the two previous lines, a line showing the alignment (with “|” characters).

See the sample for the sake of clarity.

Sample

Input:

"c4e1_input_seqs.txt" contains:  
ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCATGCATGCATGCATGC
ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGTATGCATGCATGCATGCATGC
...and so on  

Output:

"c4e1_output.fasta" contains, for 2 of the 5 DNA input sequences:  
ATTCGATTATAAGCTCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCATGCATGCATGCATGCATGC 76
              ||||||||||||||||||||||||||||||||||||||||||                    
              TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC                     42
ATTCGATTATAAGCACTGATCGATCGATCGATCGATCGATGCTATCGTCGTATGCATGCATGCATGCATGC 71
              |||||||||||||||||||||||||||||||||||||                    
              ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT                     37
...and so on
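
The layout above can be sketched in Python as follows. This is only one possible approach, not a reference solution: the names `ADAPTER_LEN`, `REPEAT_LEN`, `align_block` and `process` are made up for illustration, and the trailing padding on the second and third lines is an assumption read off the sample.

```python
ADAPTER_LEN = 14  # leading adapter fragment, as stated in the exercise
REPEAT_LEN = 20   # trailing poly_ATGC repeat (5 x "ATGC"), as stated in the exercise


def align_block(seq):
    """Return the three output lines (original, bars, shifted) for one sequence."""
    # Drop the adapter at the start and the repeat at the end.
    core = seq[ADAPTER_LEN:len(seq) - REPEAT_LEN]
    pad = " " * ADAPTER_LEN   # keeps the core under its position in the original
    tail = " " * REPEAT_LEN   # assumed padding so the lengths line up
    return [
        f"{seq} {len(seq)}",
        pad + "|" * len(core) + tail,
        f"{pad}{core}{tail} {len(core)}",
    ]


def process(in_path="c4e1_input_seqs.txt", out_path="c4e1_output.fasta"):
    """Read one sequence per line and write its three-line block."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            seq = line.strip()
            if seq:  # skip empty lines
                dst.write("\n".join(align_block(seq)) + "\n")
```

Calling `process()` with the default paths would read the input file from the current directory; `align_block` can also be tried on a single string to check the spacing before writing any file.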

c4e2: Multiple exons from genomic DNA

The file “c4e2_input_genomic_dna.txt” contains a DNA sequence on a single line. The file “c4e2_input_exons.txt” contains the positions of 4 exons within that sequence: one exon per line, given as its start and end separated by a comma.

Write the exons, concatenated with a human-readable spacer (“<--->”), to a file (“c4e2_output.txt”).

Sample

Input:

"c4e2_input_genomic_dna.txt" contains:  
TCGATCGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCA...   

"c4e2_input_exons.txt" contains:  
5,58
72,133
...

Output:

"c4e2_output.txt" contains:  
CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG<--->CGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA<--->CGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA<--->CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG

Note:

  • The exon positions (start, end) are in array coordinates [0,…], not in biological coordinates [1,…]. That is, positions are counted from 0, start is inclusive and end is exclusive.
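
Because start is inclusive and end is exclusive, each exon maps directly onto a Python slice `dna[start:end]`. The sketch below shows one way this could be done; the helper names `parse_exons`, `join_exons` and `main` are hypothetical, and writing a final newline to the output file is an assumption.

```python
SPACER = "<--->"  # human-readable spacer taken from the sample output


def parse_exons(lines):
    """Turn lines such as '5,58' into (start, end) integer pairs."""
    exons = []
    for line in lines:
        line = line.strip()
        if line:  # ignore blank lines
            start, end = line.split(",")
            exons.append((int(start), int(end)))
    return exons


def join_exons(dna, exons, spacer=SPACER):
    """Slice each exon out of dna (0-based, end exclusive) and join them."""
    return spacer.join(dna[start:end] for start, end in exons)


def main(dna_path="c4e2_input_genomic_dna.txt",
         exons_path="c4e2_input_exons.txt",
         out_path="c4e2_output.txt"):
    with open(dna_path) as f:
        dna = f.read().strip()
    with open(exons_path) as f:
        exons = parse_exons(f)
    with open(out_path, "w") as f:
        f.write(join_exons(dna, exons) + "\n")
```

With the first sample exon `5,58`, the slice `dna[5:58]` has 58 - 5 = 53 nucleotides, with no +1 or -1 adjustment needed.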