C7 Exercises

c7e1: Patterns in a file of accession names

In this exercise, the input file (c7e1_input_data.txt) contains several commented lines that should be ignored; each of them starts with “#” (hash). The data is on a single line: accession names separated by commas.

a. Retrieve the accession names in a list, ignoring the commented lines.

b. Detect the accession names that:

  • b1. Contain the number 5

  • b2. Contain the letter d or e

  • b3. Contain the letters d and e in that order (anything in between)

  • b4. Contain the letters d and e in that order, with exactly one letter between them

  • b5. Contain the letters d and e in any order

  • b6. Start with x or y

  • b7. Start with x or y and end with e

  • b8. Contain three or more consecutive digits

  • b9. End with d followed by a, r, or p

Sample

Input:

c7e1_input_data.txt

Output a:

['xjhd53e', '45da', 'de37dp', 'yhdck2', 'eihd39d9', 'xkn59438', 'chdsye847', 'hedle3455']
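
One possible approach to part a is sketched below. The function name `read_accessions` and the comment lines in `sample_lines` are made up for illustration; only the "#" comment convention and the comma-separated layout come from the exercise text.

```python
def read_accessions(lines):
    """Return accession names from an iterable of text lines, skipping '#' comments."""
    accessions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and commented lines
        if not line or line.startswith("#"):
            continue
        # Names are comma separated; drop stray whitespace and empty entries
        accessions.extend(name.strip() for name in line.split(",") if name.strip())
    return accessions


# Hypothetical file content mirroring the described layout
sample_lines = [
    "# accession names\n",
    "# comma separated\n",
    "xjhd53e,45da,de37dp,yhdck2,eihd39d9,xkn59438,chdsye847,hedle3455\n",
]
print(read_accessions(sample_lines))
```

Because the function only iterates over lines, it should also accept an open file handle, for example `with open("c7e1_input_data.txt") as handle: accessions = read_accessions(handle)`.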

Output b1:

xjhd53e
45da
xkn59438
hedle3455

Output b2:

xjhd53e
45da
de37dp
yhdck2
eihd39d9
chdsye847
hedle3455

Output b3:

xjhd53e
de37dp
chdsye847
hedle3455

Output b4:

hedle3455

Output b5:

xjhd53e
de37dp
eihd39d9
chdsye847
hedle3455

Output b6:

xjhd53e
yhdck2
xkn59438

Output b7:

xjhd53e

Output b8:

xkn59438
chdsye847
hedle3455

Output b9:

45da
de37dp
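
The sample outputs for b1 to b9 can be reproduced with one regular expression per task. The sketch below is one possible set of patterns, not the only valid one; the dictionary `patterns` and the helper `matching` are names chosen here for illustration, and b4 is read as "exactly one lowercase letter between d and e".

```python
import re

accessions = ['xjhd53e', '45da', 'de37dp', 'yhdck2', 'eihd39d9',
              'xkn59438', 'chdsye847', 'hedle3455']

# One candidate pattern per task (other equivalent patterns exist)
patterns = {
    "b1": r"5",           # contains the number 5
    "b2": r"[de]",        # contains d or e
    "b3": r"d.*e",        # d, then anything, then e
    "b4": r"d[a-z]e",     # d, exactly one letter, then e
    "b5": r"d.*e|e.*d",   # d and e in either order
    "b6": r"^[xy]",       # starts with x or y
    "b7": r"^[xy].*e$",   # starts with x or y and ends with e
    "b8": r"\d{3,}",      # three or more consecutive digits
    "b9": r"d[arp]$",     # ends with d followed by a, r, or p
}


def matching(names, pattern):
    """Return the names in which the pattern is found anywhere."""
    return [name for name in names if re.search(pattern, name)]


for label, pattern in patterns.items():
    print(f"Output {label}:")
    for name in matching(accessions, pattern):
        print(name)
    print()
```

`re.search` is used rather than `re.match` because most tasks look for a pattern anywhere in the name; the tasks tied to the start or end of the name use the `^` and `$` anchors explicitly.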

c7e2: Restriction enzymes

The input is a DNA sequence file in FASTA format (c7e2_input_data.fa). In a FASTA file, the line starting with “>” describes the sequence; in this exercise it should be ignored. The next line contains the DNA sequence.

a. Assign the DNA sequence to a variable.

b. The restriction enzyme AbcI has the recognition site ANTAAT, where N stands for any base (see the IUPAC notation). It finds ANTAAT and cuts where the start is (ANTAAT). Find the positions of the cuts in the DNA sequence.

Output b:

AbcI cuts: [1143, 1628]
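
Parts a and b could be approached along the lines of the sketch below. Everything in it is illustrative: `read_fasta_sequence`, `find_cuts`, and the toy record are made-up names and data, and `cut_offset=3` is an assumption about where the enzyme cuts within the site, so adjust it as needed. Positions are 0-based.

```python
import re


def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA, skipping the '>' header."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def find_cuts(sequence, site_regex, cut_offset):
    """Return 0-based cut positions: start of each site plus cut_offset.

    The lookahead (?=...) makes every match zero-width, so overlapping
    sites are also reported.
    """
    return [m.start() + cut_offset
            for m in re.finditer(f"(?={site_regex})", sequence)]


# IUPAC N means any base, so ANTAAT becomes the regex A[ACGT]TAAT
ABCI_SITE = "A[ACGT]TAAT"

# Toy record, made up for illustration (not the exercise file)
toy_fasta = [
    ">toy_sequence\n",
    "GGACTAATCC\n",
    "TTAGTAATGG\n",
]
dna = read_fasta_sequence(toy_fasta)

# cut_offset=3 is an assumption: the cut falls after the third base of the site
print("AbcI cuts:", find_cuts(dna, ABCI_SITE, cut_offset=3))  # AbcI cuts: [5, 15]
```

With the real file, `read_fasta_sequence` can be given the open file handle instead of `toy_fasta`.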

c. A new restriction enzyme is applied in combination with AbcI. The new enzyme, AbcII, has the recognition site GCRW*TG. See the IUPAC notation: R is a purine (A or G) and W is A or T. Find the positions of the cuts made by the two enzymes together.

Output c:

AbcI and AbcII cuts: [488, 1143, 1577, 1628]
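
For part c, the cuts of both enzymes can be collected and sorted into one list. The sketch below is self-contained and uses a made-up toy sequence; the helper `find_cuts` is a hypothetical name, the offset of 4 for AbcII assumes the asterisk in GCRW*TG marks the cut, and the offset of 3 for AbcI is an assumption for illustration only.

```python
import re


def find_cuts(sequence, site_regex, cut_offset):
    """Return 0-based cut positions: start of each site plus cut_offset."""
    # Lookahead keeps matches zero-width so overlapping sites are found too
    return [m.start() + cut_offset
            for m in re.finditer(f"(?={site_regex})", sequence)]


# (regex, cut offset) per enzyme; IUPAC codes expanded to character classes
ENZYMES = {
    "AbcI": ("A[ACGT]TAAT", 3),    # N = any base; offset 3 is an assumption
    "AbcII": ("GC[AG][AT]TG", 4),  # R = A or G, W = A or T; cut assumed at the asterisk
}

# Toy sequence, made up for illustration (not the exercise file)
toy_dna = "TTGCATTGCCACTAATGGGCGATGAA"

all_cuts = []
for name, (site_regex, cut_offset) in ENZYMES.items():
    all_cuts.extend(find_cuts(toy_dna, site_regex, cut_offset))

# Sort so the cuts appear in order along the sequence
all_cuts.sort()
print("AbcI and AbcII cuts:", all_cuts)  # AbcI and AbcII cuts: [6, 13, 22]
```

Sorting the merged list matters because the fragments produced by a double digest are defined by consecutive cut positions, regardless of which enzyme made each cut.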