C7. Pattern recognition in biology#

Regular expressions#

Detecting patterns within plain text is a frequent task for many different objectives. It can be carried out with complex tools, like those from data mining or artificial intelligence. Here we will focus on the most simple tool that Python offers us: the use of regular expressions. In other fields, like physics or engineering, numbers (int, float, complex) have a more important role, because their patterns are numerical; for instance, in statistics or subsymbolic artificial intelligence, where numerical data is detected, extracted, and analyzed.

In biology, an obvious example is the search for patterns within sequences (str). For instance:

  • Finding motifs (polyAs, Kozak sequence, repeats, start/stop codons…)

  • Restriction enzyme cuts in DNA

  • Alternative splicing sites

  • CpG islands

  • And so on …

Knowledge of regular expressions#

What you will learn here is ground programming knowledge and it has a big advantage: the same regular expressions can be used in most programming languages, they are the very same ones (Perl, Java, R, and so on). In the end, regular expressions are a very important subject to learn from the beginning, moreover in this Data Science era.

The most simple approach#

Let’s think in a sequence

str#

Question:  
Is "ATG" in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA"?
Return a boolean: True or False
# Is in? Answer
print("ATG" in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA")
True
# or negative test (*not* in)
print("KOZAK" not in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA")  # Marilyn S. Kozak
True

Note: The Kozak sequence#

surrounds the initiation codon

We already have some experience for detecting patterns#

str#

We know about count, a str method

# str.count()
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA"
print(sequence.count("ATG")) # 2
print(sequence.count("AGG")) # 1
2
1

But there are more str methods#

str.find()#

S.find(sub[, start[, end]]) -> int
   
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].
Arguments start and end are interpreted as in slice notation.
   
Return -1 on failure.
# str.find(start, end) # if there is no start, end => the whole sequence
# It returns the position (array coordinates not biological) 
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"

print(sequence.find("ATG")) # in all the string  1: "G*ATGT..."
print(sequence.find("ATG", 16, len(sequence))) #18: "...TTT*ATGC..."
print(sequence.find("ATG", 19, len(sequence))) #33: "...GGCAGG*ATG"
print(sequence.find("ATG", 34, len(sequence))) #-1: "...TG"
1
18
33
-1

str.find() within if#

Conditionals

sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"

if sequence.find("PATTERN", 34, len(sequence)) == -1: 
    print("This message means that the motif was not found")
else:
    print("This message means that the motif was found")
This message means that the motif was not found

str.rfind()#

S.rfind(sub[, start[, end]]) -> int
       
Return the highest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
       
Return -1 on failure.
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"

# str.rfind(start, end) # without start, end => the whole sequence
# It returns the LAST match (array coord.) in seq/subseq
print(sequence.rfind("ATG"))  # i.e. no start, end;   #33: "...GGCAGG*ATG"
print(sequence.rfind("ATG", 34, len(sequence)))  #-1: "...TG"
33
-1

Find all the motifs within a sequence#

We only need to combine:

  • a loop (while)

  • str.find()

sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG" # template
motif = "GC"

start = 0 
while start != -1: # while there is any match
    start = sequence.find(motif, start, len(sequence)) # returns -1 if the motif is not within the sequence
    if start >= 0: # if there is match
        print("...", sequence[start:start+len(motif)], "...\t", motif, " at position:", start, sep="")
        start = start + 1 # prepare start for the next iteration
...GC...	GC at position:6
...GC...	GC at position:20
...GC...	GC at position:22
...GC...	GC at position:25
...GC...	GC at position:28

Even more sofisticated#

Representing a kind of alignment and now you have enough knowledge to analyze the next code; if not, do not worry and just observe the output

# functions
def find_motifs_and_show_alignments(sequence, motif):
    lm = len(motif)
    start = 0
    end = len(sequence)
    number_motifs = 0
    position_motifs = []
    print(sequence)
    while end != -1:
        end = sequence.find(motif, start, len(sequence))
        if end >= 0: # if it has found something
            print("."*end + "|"*lm + "."*(len(sequence)-end-lm)) 
            print(sequence[0:end]+ motif.lower() + sequence[end+lm:])
            number_motifs = number_motifs + 1 
            position_motifs.append(end)
        start = end + 1
    print (number_motifs, motif, "found, at positions:", position_motifs) 

    
# main 
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"
motif = "GGC"
find_motifs_and_show_alignments(sequence, motif)
GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG
.....|||............................
GATGTggcGGGACTCTTTATGCGCTGCGGCAGGATG
...........................|||......
GATGTGGCGGGACTCTTTATGCGCTGCggcAGGATG
2 GGC found, at positions: [5, 27]

More on strs: a flash back on raw strings#

Remember!: We had special characters: \n, \t, …

r stands for raw: Here they are not special characters

# See the differences:
print("1.", "Hello world!\n\tWelcome to this course:\tProgramming")
print("2.", r"Hello world!\n\tWelcome to this course:\tProgramming") 
1. Hello world!
	Welcome to this course:	Programming
2. Hello world!\n\tWelcome to this course:\tProgramming

Modules: more functionality is needed!#

import module#

A module is file of Python code that offers functions, classes, variables, etc to solve specific problems. For instance, you can use a module for plotting, one specialized in phylogeny, other in sequence alignment, and so on. You can even built your own modules, libraries of code that help you with your specific coding problems

re is the module for regular expressions in Python

import re
import re  # This loads the re module for regular expressions
           # and know is ready to be used in your Python program

Pandas: a data analysis module#

It is a very important module for data analysis. For instance, a classical is to capture data from files, or the web using pandas and plot it using another specific module. In this chapter showing an example. After finishing this course will be good to learn pandas and a module for plotting like matplotlib

Then, for retrieving data from a csv table that is in a web page

# More advanced module: pandas
# It is very common for data analysis
import pandas as pd

# note that we can also easily read the data from a local csv file
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/school_earnings.csv")
print(df) # df is because of DataFrame
        School  Women  Men  Gap
0          MIT     94  152   58
1     Stanford     96  151   55
2      Harvard    112  165   53
3       U.Penn     92  141   49
4    Princeton     90  137   47
5      Chicago     78  118   40
6   Georgetown     94  131   37
7        Tufts     76  112   36
8         Yale     79  114   35
9     Columbia     86  119   33
10        Duke     93  124   31
11   Dartmouth     84  114   30
12         NYU     67   94   27
13  Notre Dame     73  100   27
14     Cornell     80  107   27
15    Michigan     62   84   22
16       Brown     72   92   20
17    Berkeley     71   88   17
18       Emory     68   82   14
19        UCLA     64   78   14
20       SoCal     72   81    9

re.search(pattern, string)#

Help on function search in module re:
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.

Important: this method returns a match object

# simple example
import re
if(re.search("ATG","AGTATGGCGAC")): # search returns an object=True or None=False
    print("The start of the translation has been found")
else:
    print("ATG not detected")
The start of the translation has been found

Note: The name of the module needs to specified when calling the function.
We run in the previous example re.search() Why? It could happen that 2 different modules have the same function name. If the name of the module is not specified, Python will raise an Error:

if(search("GT","AGTT")): # NameError: name 'search' is not defined
    print("Match")
    
>NameError: name 'search' is not defined

Searching complex conditions#

# For instance, searching for Valine: GTA, GTT, GTG, GTC
dna = "ATGGACCCGGCAGTCGCCTAA"

if re.search(r"GTA", dna) or re.search(r"GTT", dna) or re.search(r"GTG", dna) or re.search(r"GTC", dna): # or
    print("Valine found!")
else:
    print("Valine not found!")
Valine found!

A more elegant solution#

dna = "ATGGACCCGGCAGTCGCCTAA"  

if re.search(r"GT(A|T|G|C)", dna): # (A|T|G|C)= A or T or G or C
    print("Valine found!")  # Val: GTA, GTT, GTG, GTC
else:
    print("Valine not found!")
Valine found!

A counter example:#

stop codons

# For instance, searching for stop codons: TGA, TAG, TAA
# The next works well
dna = "ATGGACCCGGCAGTCGCCTAA"   

if re.search(r"TGA", dna) or re.search(r"TAG", dna) or re.search(r"TAA", dna): # with or
    print("Stop codon found!")
else:
    print("Stop codon not found!")
Stop codon found!

But the next code leads to errors:

# for stop codons: TGA, TAG, TAA
# The next is false
dna = "ATGGACCCGGCAGTCGCCTGG"   

if re.search(r"T(A|G)(A|G)", dna): # TGG or TGA! They are not stop codons
    print("Stop codon found!")
else:
    print("Stop codon not found!")
Stop codon found!

But, TGG is not a stop codon!

Character groups#

dna = "ATGGACCCGGCAGTCGCCTAA"   

if re.search(r"GT[ATGC]", dna): # [ATGC] == A or T or G or C 
                                #        == (A|T|G|C)
    print("Valine found!")      # Val: GTA, GTT, GTG, -> "GTC"
Valine found!

Negate the character group: caret [^ at the start for the character group]#

dna = "ATGGACCCGGCAGTCGCCTAC"   
if re.search(r"TA[^AG]", dna): # Tyrosine: TAT, "TAC"...
                               # not really. It is not A, not G
    print("1. Tyrosine found!")
    
if re.search(r"TA[TC]", dna): # Tyrosine: TAT, "TAC"
    print("2. Tyrosine found!")    
1. Tyrosine found!
2. Tyrosine found!
# but, be careful it will match any character other than: A or G
# It will match the next:
dna = "ATGGACCCGGCAGTCGCCTAö"   
if re.search(r"TA[^AG]", dna): # Tyrosine: TAC, TAT
    print("Tyrosine incorrectly found!") # TAö is not Tyrosine
Tyrosine incorrectly found!

Explanation. It will match:

  • TAJ

  • TAU

  • TAC

  • TAö

  • but never TAA or TAG

Any character (.)#

dna = "ATGGACCCGGCAGT#GCCTAA" # the sequence has now a "#" (hashtag)

if re.search(r"GT.", dna): # A or T or G or C... but also any character
    print("Valine incorrectly found!")      # "GT#" instead of Val
Valine incorrectly found!

Quantifiers#

With them, you can set up the number of times that a pattern is repeated. The quantifier is located after a character or a character group:

? : Zero or one time.#

GATTAC?A: GATTAA or GATTACA

GATTA(AAA)?CA: GATTACA or GATTAAAACA
Note: it is greedy

+ : one or more times.#

GATTAC+A: GATTACA, GATTACCA, GGATTACCCA,… (but not GATTAA)

GATTA(AAA)+CA: GATTAAAACA, GATTAAAAAAACA, GATTAAAAAAAAAACA,…
Note: it is greedy

* : zero or more times.#

GATTAC*A: GATTAA, GATTACA, GATTACCA, GGATTACCCA,…

GATTA(AAA)*CA: GATTACA, GATTAAAACA, GATTAAAAAAACA, GATTAAAAAAAAAACA,…
Note: it is greedy

{n}: exactly n times.#

GATTAC{3}A: GATTACCCA (but not GATTAA, GATTACA, GATTACCA, GATTACCCCA,…)

GATTA(AAA){2}CA: GATTAAAAAAACA

{n, m}: between n and m times; both inclusive.#

GATTAC{1,3}A: GATTACA, GATTACCA or GATTACCCA

GATTA(AAA){2,3}CA: GATTAAAAAAACA or GATTAAAAAAAAAACA
Note: it is greedy

Positions#

  • ^ start of the string: ^GAT will match GATTACA but not TGATTACA

  • $ end of the string: ACA$ will match GATTACA but not GATTACAT

if re.search(r"ACA$", "GATTACA"): 
    print("Found!")
else: 
    print("Not found!")
Found!
if re.search(r"ACA$", "ACAGATT"): # ACA is not at the end of the str
    print("Found!")
else: 
    print("Not found!")
Not found!

Greedy or non-greedy#

This is a step ahead (a bit more advanced). Greedy means here that in case of doubt it takes more. Here is a summary of the quantifiers with indications on greediness. There is not need to memorize the next, but try to understand it

    - "."      Matches any character except a newline.
    - "^"      Matches the start of the string.
    - "$"      Matches the end of the string or just before the newline at
               the end of the string.
    - "*"      Matches 0 or more (greedy) repetitions of the preceding regular expression (RE).
               Greedy means that it will match as many repetitions as possible.
    - "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    - "?"      Matches 0 or 1 (greedy) of the preceding RE.
    - *?,+?,?? Non-greedy versions of the previous three special characters.
    - {m,n}    Matches from m to n repetitions of the preceding RE.
    - {m,n}?   Non-greedy version of the above.
    - []       Indicates a set of characters.
               A "^" as the first character indicates a complementing set.
    - "|"      A|B, creates an RE that will match either A or B.
    - (...)    Matches the RE inside the parentheses.
               The contents can be retrieved or matched later in the string.

Combining groups#

Combining character groups makes pattern recognition quite powerful

Question: What does the next expression represent?

AUG[AUCG]{30,1000}A{5,10}

Answer:

An mRNA sequence (no U but T) with a provided length range

From left to right we can see that there is:

  • No 5’UTR. It starts with the start codon

  • Start codon: AUG

  • the rest of the RNA sequence: [AUCG]{30,1000}

  • PolyA tail: A{5,10}

re.search() vs re.match()#

Comparison of the two methods:

  • search(): the pattern is anywhere within the string (more used)

        Help on function search in module re:  
        search(pattern, string, flags=0)  
            Scan through string looking for a match to the pattern, returning  
            a Match object, or None if no match was found.  
  • match(): the pattern has to match the start of the string

        Help on function match in module re:  
        match(pattern, string, flags=0)  
            Try to apply the pattern at the start of the string, returning  
            a Match object, or None if no match was found.Help on function search in module re:  
# 4 Examples
if re.search("CG", "CATGCGT"):
    print("search: CG found")
else:
    pass # do nothing
#
#
if re.search("XX", "CATGCGT"):
    pass
else:
    print("search: XX not found")
#
#
if re.match("AT", "ATGCGT"):
    print("match: AT found")
else:
    pass
#
#
if re.match("AT", "CATGCGT"):
    pass
else:
    print("match: AT not found")
search: CG found
search: XX not found
match: AT found
match: AT not found

The group() method of the match object#

Which is the pattern found within the string?

dna = "CCCATGACTAGGGGCCC"

m = re.search(r"ATG[ATCG]{2}TAG[ATCG]{2}", dna) # ...ATG AC TAG GG...
print(m.group())  # print the entire match
ATGACTAGGG

Note:
if re.search() returns None. Then, print(None) will raise an ERROR

# Example
dna = "CCCATGACTAGGGGCCC"
m = re.search(r"ATG[ATCG]{2}NNN[ATCG]{2}", dna) # no detection :-(
print(m.group())  # error: no match

>#AttributeError: 'NoneType' object has no attribute 'group'
# The if skips the error: AttributeError: 'NoneType' object has no attribute 'group'
dna = "CCCATGACTAGGGGCCC"

m = re.search(r"ATG[ATCG]{2}NNN[ATCG]{2}", dna) # ...ATG AC TAG GG...
if m:
    print(m.group())  # print the entire match

Capturing#

We use parenthesis to capture some parts of the pattern

  • m.group()

# The captures are between parentheses in the RE
dna = "CCCATGACTAGGGGCCC"

m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG...
print(m.group())  # print the entire match
print(m.group(1)) # print the 1st capture
print(m.group(2)) # print the 2nd capture
# or, in general
print("Another way:")
for i in m.groups(): # groups
    print(i)
ATGACTAGGG
AC
GG
Another way:
AC
GG

The position of a capture#

Where is the pattern found within the string?

It gets array coordinates

  • m.start(): inclusive

  • m.end(): exclusive

dna = "CCCATGACTAGGGGCCC"

m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG...
print("length of dna:", len(dna))
print(m.group(), m.start(), m.end())  # print the entire match
print(m.group(1), m.start(1), m.end(1)) # print the 1st capture
print(m.group(2), m.start(2), m.end(2)) # print the 2nd capture
length of dna: 17
ATGACTAGGG 3 13
AC 6 8
GG 11 13

re.split(), splitting a string#

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.

Returns a list

# For instance: split a DNA sequence when there is IUPAC codes other than 
# the standard [ATCG] nucleotides
dna = "ATGNACAGMCCATCARTCAAWAGASGATCGTTAYGCAGAKAACAVAAAGGA" # dna with other IUPAC codes: N M R W S Y K V

m = re.split(r"[^ATCG]", dna) # returns a list
print(m)
['ATG', 'ACAG', 'CCATCA', 'TCAA', 'AGA', 'GATCGTTA', 'GCAGA', 'AACA', 'AAAGGA']

More advanced

# The content of the matched parenthesis is also returned
dna = "CCCATGACTAGGGGCCC"

m = re.split(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # CCC  ...ATG (AC) TAG (GG)... GCCC
print(m)
['CCC', 'AC', 'GG', 'GCCC']

re.findall(), finding multiple matches#

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
dna = "ATGTACTTTGAGTTCCCTCAGCCGTTACCTGTGTGTGGTGATATCAAAGTAGAGTTCTTCCACAAACAGAACAAGATGCTAAA" # from NM_000314.8

m_all_at = re.findall(r"[AT]{3,200}", dna) # find all the subsequences of 3-200 nt composed by A or/and T's
print(m_all_at)
['TTT', 'TTA', 'ATAT', 'AAA', 'AAA', 'TAAA']

re.finditer(), finding groups instead of sequences#

finditer(pattern, string, flags=0)
    Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a Match object.
dna = "ATGTACTTTGAGTTCCCTCAGCCGTTACCTGTGTGTGGTGATATCAAAGTAGAGTTCTTCCACAAACAGAACAAGATGCTAAA" # from NM_000314.8

m_all_groups = re.finditer(r"[AT]{3,200}", dna) # find all the subsequences composed with at least 3 A or GT's
print("length", len(dna))
for m in m_all_groups:
    print(m.group(), m.start(), m.end())
length 83
TTT 6 9
TTA 24 27
ATAT 40 44
AAA 45 48
AAA 63 66
TAAA 79 83

Last message on regular expressions: There are many/different ways (regular expressions) to find the same pattern

Summary#

  • Pattern recognition in sequences:

    • Regular expressions in Biology

    • The most simple approach:

      "ATG" in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA"
      returns a boolean: True/False
      
    • We were using already some methods. Strings:

      • str.count()

      • str.find(pattern), str.find(pattern, start, end). Return -1 on failure.

      • str.rfind(pattern), str.rfind(pattern, start, end). Return -1 on failure.

      • Strings, r stands for raw:

        • r”Hello world\nWelcome to this course:\tProgramming”

    • modules

      • import re

    • re.search(pattern, string)

    search(pattern, string, flags=0)
    Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found
    
    • Searching complex conditions

      • if re.search(r”GTA”, dna) or re.search(r”GTT”, dna) or re.search(r”GTG”, dna) or re.search(r”GTC”, dna):

      • if re.search(r”GT(A|T|G|C)”, dna):

    • Character groups:

      • if re.search(r”GT[ATGC]”, dna):

      • Negate: if re.search(r”TA[^AG]”, dna): # Tyrosine: TAC, TAT

    • greedy or non-greedy:

      • “.” Matches any character except a newline.

      • “^” Matches the start of the string.

      • “$” Matches the end of the string or just before the newline at the end of the string.

      • “*” Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible.

      • “+” Matches 1 or more (greedy) repetitions of the preceding RE.

      • “?” Matches 0 or 1 (greedy) of the preceding RE.

      • *?,+?,?? Non-greedy versions of the previous three special characters.

      • {m,n} Matches from m to n repetitions of the preceding RE.

      • {m,n}? Non-greedy version of the above.

    • Combining:

      • ATG[ATCG]{30,1000}A{5,10}

    • re.search() vs re.match()

    • Which/Where is the pattern found within the string?

        m = re.search(r"ATG[ATCG]{2}TAG[ATCG]{2}", dna) # ...ATG AC TAG GG...
        print(m.group())  # print the entire match
      
         dna = "CCCATGACTAGGGGCCC"
         m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG...
         print(m.group())  # print the entire match
         print(m.group(1)) # print the 1st capture
         print(m.group(2)) # print the 2nd capture
      
      • Getting the position of a match:

      dna = "CCCATGACTAGGGGCCC"
      m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG...
      print("length of string", len(dna))
      print(m.group(), m.start(), m.end())  # print the entire match
      print(m.group(1), m.start(1), m.end(1)) # print the 1st capture
      print(m.group(2), m.start(2), m.end(2)) # print the 2nd capture
      
      • Splitting a string (with regular expressions)

      dna = "ATGNACAGMCCATCARTCAAWAGASGATCGTTAYGCAGAKAACAVAAAGGA" # dna with other IUPAC codes: N M R W S Y K V
      m = re.split(r"[^ATCG]", dna) # returns a list
      print(m)
      
      • Finding multiple matches
        * re.findall(). Return a list of all non-overlapping matches in the string.

      • Finding groups instead of sequences:
        * re.finditer()


Exercises#