C7. Pattern recognition in biology#
Regular expressions#
Detecting patterns within plain text is a frequent task for many different objectives. It can be carried out with complex tools, like those from data mining or artificial intelligence. Here we will focus on the most simple tool that Python offers us: the use of regular expressions. In other fields, like physics or engineering, numbers (int, float, complex) have a more important role, because their patterns are numerical; for instance, in statistics or subsymbolic artificial intelligence, where numerical data is detected, extracted, and analyzed.
In biology, an obvious example is the search for patterns within sequences (str). For instance:
Finding motifs (polyAs, Kozak sequence, repeats, start/stop codons…)
Restriction enzyme cuts in DNA
Alternative splicing sites
CpG islands
And so on …
Knowledge of regular expressions#
What you will learn here is ground programming knowledge and it has a big advantage: the same regular expressions can be used in most programming languages, they are the very same ones (Perl, Java, R, and so on). In the end, regular expressions are a very important subject to learn from the beginning, moreover in this Data Science era.
The most simple approach#
Let’s think in a sequence
str#
Question:
Is "ATG" in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA"?
Return a boolean: True or False
# Is in? Answer
print("ATG" in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA")
True
# or negative test (*not* in)
print("KOZAK" not in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA") # Marilyn S. Kozak
True
Note: The Kozak sequence#
surrounds the initiation codon
We already have some experience for detecting patterns#
str#
We know about count, a str method
# str.count()
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA"
print(sequence.count("ATG")) # 2
print(sequence.count("AGG")) # 1
2
1
But there are more str methods#
str.find()#
S.find(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].
Arguments start and end are interpreted as in slice notation.
Return -1 on failure.
# str.find(start, end) # if there is no start, end => the whole sequence
# It returns the position (array coordinates not biological)
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"
print(sequence.find("ATG")) # in all the string 1: "G*ATGT..."
print(sequence.find("ATG", 16, len(sequence))) #18: "...TTT*ATGC..."
print(sequence.find("ATG", 19, len(sequence))) #33: "...GGCAGG*ATG"
print(sequence.find("ATG", 34, len(sequence))) #-1: "...TG"
1
18
33
-1
str.find() within if#
Conditionals
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"
if sequence.find("PATTERN", 34, len(sequence)) == -1:
print("This message means that the motif was not found")
else:
print("This message means that the motif was found")
This message means that the motif was not found
str.rfind()#
S.rfind(sub[, start[, end]]) -> int
Return the highest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"
# str.rfind(start, end) # without start, end => the whole sequence
# It returns the LAST match (array coord.) in seq/subseq
print(sequence.rfind("ATG")) # i.e. no start, end; #33: "...GGCAGG*ATG"
print(sequence.rfind("ATG", 34, len(sequence))) #-1: "...TG"
33
-1
Find all the motifs within a sequence#
We only need to combine:
a loop (while)
str.find()
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG" # template
motif = "GC"
start = 0
while start != -1: # while there is any match
start = sequence.find(motif, start, len(sequence)) # returns -1 if the motif is not within the sequence
if start >= 0: # if there is match
print("...", sequence[start:start+len(motif)], "...\t", motif, " at position:", start, sep="")
start = start + 1 # prepare start for the next iteration
...GC... GC at position:6
...GC... GC at position:20
...GC... GC at position:22
...GC... GC at position:25
...GC... GC at position:28
Even more sofisticated#
Representing a kind of alignment and now you have enough knowledge to analyze the next code; if not, do not worry and just observe the output
# functions
def find_motifs_and_show_alignments(sequence, motif):
lm = len(motif)
start = 0
end = len(sequence)
number_motifs = 0
position_motifs = []
print(sequence)
while end != -1:
end = sequence.find(motif, start, len(sequence))
if end >= 0: # if it has found something
print("."*end + "|"*lm + "."*(len(sequence)-end-lm))
print(sequence[0:end]+ motif.lower() + sequence[end+lm:])
number_motifs = number_motifs + 1
position_motifs.append(end)
start = end + 1
print (number_motifs, motif, "found, at positions:", position_motifs)
# main
sequence = "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG"
motif = "GGC"
find_motifs_and_show_alignments(sequence, motif)
GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGATG
.....|||............................
GATGTggcGGGACTCTTTATGCGCTGCGGCAGGATG
...........................|||......
GATGTGGCGGGACTCTTTATGCGCTGCggcAGGATG
2 GGC found, at positions: [5, 27]
More on strs: a flash back on raw strings#
Remember!: We had special characters: \n, \t, …
r stands for raw: Here they are not special characters
# See the differences:
print("1.", "Hello world!\n\tWelcome to this course:\tProgramming")
print("2.", r"Hello world!\n\tWelcome to this course:\tProgramming")
1. Hello world!
Welcome to this course: Programming
2. Hello world!\n\tWelcome to this course:\tProgramming
Modules: more functionality is needed!#
import module#
A module is file of Python code that offers functions, classes, variables, etc to solve specific problems. For instance, you can use a module for plotting, one specialized in phylogeny, other in sequence alignment, and so on. You can even built your own modules, libraries of code that help you with your specific coding problems
re is the module for regular expressions in Python
import re
import re # This loads the re module for regular expressions
# and know is ready to be used in your Python program
Pandas: a data analysis module#
It is a very important module for data analysis. For instance, a classical is to capture data from files, or the web using pandas and plot it using another specific module. In this chapter showing an example. After finishing this course will be good to learn pandas and a module for plotting like matplotlib
Then, for retrieving data from a csv table that is in a web page
# More advanced module: pandas
# It is very common for data analysis
import pandas as pd
# note that we can also easily read the data from a local csv file
df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/school_earnings.csv")
print(df) # df is because of DataFrame
School Women Men Gap
0 MIT 94 152 58
1 Stanford 96 151 55
2 Harvard 112 165 53
3 U.Penn 92 141 49
4 Princeton 90 137 47
5 Chicago 78 118 40
6 Georgetown 94 131 37
7 Tufts 76 112 36
8 Yale 79 114 35
9 Columbia 86 119 33
10 Duke 93 124 31
11 Dartmouth 84 114 30
12 NYU 67 94 27
13 Notre Dame 73 100 27
14 Cornell 80 107 27
15 Michigan 62 84 22
16 Brown 72 92 20
17 Berkeley 71 88 17
18 Emory 68 82 14
19 UCLA 64 78 14
20 SoCal 72 81 9
re.search(pattern, string)#
Help on function search in module re:
Scan through string looking for a match to the pattern, returning
a Match object, or None if no match was found.
Important: this method returns a match object
# simple example
import re
if(re.search("ATG","AGTATGGCGAC")): # search returns an object=True or None=False
print("The start of the translation has been found")
else:
print("ATG not detected")
The start of the translation has been found
Note: The name of the module needs to specified when calling the function.
We run in the previous example re.search()
Why? It could happen that 2 different modules have the same function name. If the name of the module is not specified, Python will raise an Error:
if(search("GT","AGTT")): # NameError: name 'search' is not defined
print("Match")
>NameError: name 'search' is not defined
Searching complex conditions#
# For instance, searching for Valine: GTA, GTT, GTG, GTC
dna = "ATGGACCCGGCAGTCGCCTAA"
if re.search(r"GTA", dna) or re.search(r"GTT", dna) or re.search(r"GTG", dna) or re.search(r"GTC", dna): # or
print("Valine found!")
else:
print("Valine not found!")
Valine found!
A more elegant solution#
dna = "ATGGACCCGGCAGTCGCCTAA"
if re.search(r"GT(A|T|G|C)", dna): # (A|T|G|C)= A or T or G or C
print("Valine found!") # Val: GTA, GTT, GTG, GTC
else:
print("Valine not found!")
Valine found!
A counter example:#
stop codons
# For instance, searching for stop codons: TGA, TAG, TAA
# The next works well
dna = "ATGGACCCGGCAGTCGCCTAA"
if re.search(r"TGA", dna) or re.search(r"TAG", dna) or re.search(r"TAA", dna): # with or
print("Stop codon found!")
else:
print("Stop codon not found!")
Stop codon found!
But the next code leads to errors:
# for stop codons: TGA, TAG, TAA
# The next is false
dna = "ATGGACCCGGCAGTCGCCTGG"
if re.search(r"T(A|G)(A|G)", dna): # TGG or TGA! They are not stop codons
print("Stop codon found!")
else:
print("Stop codon not found!")
Stop codon found!
But, TGG is not a stop codon!
Character groups#
dna = "ATGGACCCGGCAGTCGCCTAA"
if re.search(r"GT[ATGC]", dna): # [ATGC] == A or T or G or C
# == (A|T|G|C)
print("Valine found!") # Val: GTA, GTT, GTG, -> "GTC"
Valine found!
Negate the character group: caret [^ at the start for the character group]#
dna = "ATGGACCCGGCAGTCGCCTAC"
if re.search(r"TA[^AG]", dna): # Tyrosine: TAT, "TAC"...
# not really. It is not A, not G
print("1. Tyrosine found!")
if re.search(r"TA[TC]", dna): # Tyrosine: TAT, "TAC"
print("2. Tyrosine found!")
1. Tyrosine found!
2. Tyrosine found!
# but, be careful it will match any character other than: A or G
# It will match the next:
dna = "ATGGACCCGGCAGTCGCCTAö"
if re.search(r"TA[^AG]", dna): # Tyrosine: TAC, TAT
print("Tyrosine incorrectly found!") # TAö is not Tyrosine
Tyrosine incorrectly found!
Explanation. It will match:
TAJ
TAU
TAC
TAö
…
but never TAA or TAG
Any character (.)#
dna = "ATGGACCCGGCAGT#GCCTAA" # the sequence has now a "#" (hashtag)
if re.search(r"GT.", dna): # A or T or G or C... but also any character
print("Valine incorrectly found!") # "GT#" instead of Val
Valine incorrectly found!
Quantifiers#
With them, you can set up the number of times that a pattern is repeated. The quantifier is located after a character or a character group:
? : Zero or one time.#
GATTAC?A: GATTAA or GATTACA
GATTA(AAA)?CA: GATTACA or GATTAAAACA
Note: it is greedy
+ : one or more times.#
GATTAC+A: GATTACA, GATTACCA, GGATTACCCA,… (but not GATTAA)
GATTA(AAA)+CA: GATTAAAACA, GATTAAAAAAACA, GATTAAAAAAAAAACA,…
Note: it is greedy
* : zero or more times.#
GATTAC*A: GATTAA, GATTACA, GATTACCA, GGATTACCCA,…
GATTA(AAA)*CA: GATTACA, GATTAAAACA, GATTAAAAAAACA, GATTAAAAAAAAAACA,…
Note: it is greedy
{n}: exactly n times.#
GATTAC{3}A: GATTACCCA (but not GATTAA, GATTACA, GATTACCA, GATTACCCCA,…)
GATTA(AAA){2}CA: GATTAAAAAAACA
{n, m}: between n and m times; both inclusive.#
GATTAC{1,3}A: GATTACA, GATTACCA or GATTACCCA
GATTA(AAA){2,3}CA: GATTAAAAAAACA or GATTAAAAAAAAAACA
Note: it is greedy
Positions#
^ start of the string: ^GAT will match GATTACA but not TGATTACA
$ end of the string: ACA$ will match GATTACA but not GATTACAT
if re.search(r"ACA$", "GATTACA"):
print("Found!")
else:
print("Not found!")
Found!
if re.search(r"ACA$", "ACAGATT"): # ACA is not at the end of the str
print("Found!")
else:
print("Not found!")
Not found!
Greedy or non-greedy#
This is a step ahead (a bit more advanced). Greedy means here that in case of doubt it takes more. Here is a summary of the quantifiers with indications on greediness. There is not need to memorize the next, but try to understand it
- "." Matches any character except a newline.
- "^" Matches the start of the string.
- "$" Matches the end of the string or just before the newline at
the end of the string.
- "*" Matches 0 or more (greedy) repetitions of the preceding regular expression (RE).
Greedy means that it will match as many repetitions as possible.
- "+" Matches 1 or more (greedy) repetitions of the preceding RE.
- "?" Matches 0 or 1 (greedy) of the preceding RE.
- *?,+?,?? Non-greedy versions of the previous three special characters.
- {m,n} Matches from m to n repetitions of the preceding RE.
- {m,n}? Non-greedy version of the above.
- [] Indicates a set of characters.
A "^" as the first character indicates a complementing set.
- "|" A|B, creates an RE that will match either A or B.
- (...) Matches the RE inside the parentheses.
The contents can be retrieved or matched later in the string.
Combining groups#
Combining character groups makes pattern recognition quite powerful
Question: What does the next expression represent?
AUG[AUCG]{30,1000}A{5,10}
Answer:
An mRNA sequence (no U but T) with a provided length range
From left to right we can see that there is:
No 5’UTR. It starts with the start codon
Start codon: AUG
the rest of the RNA sequence: [AUCG]{30,1000}
PolyA tail: A{5,10}
re.search() vs re.match()#
Comparison of the two methods:
search(): the pattern is anywhere within the string (more used)
Help on function search in module re:
search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning
a Match object, or None if no match was found.
match(): the pattern has to match the start of the string
Help on function match in module re:
match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning
a Match object, or None if no match was found.Help on function search in module re:
# 4 Examples
if re.search("CG", "CATGCGT"):
print("search: CG found")
else:
pass # do nothing
#
#
if re.search("XX", "CATGCGT"):
pass
else:
print("search: XX not found")
#
#
if re.match("AT", "ATGCGT"):
print("match: AT found")
else:
pass
#
#
if re.match("AT", "CATGCGT"):
pass
else:
print("match: AT not found")
search: CG found
search: XX not found
match: AT found
match: AT not found
The group() method of the match object#
Which is the pattern found within the string?
dna = "CCCATGACTAGGGGCCC"
m = re.search(r"ATG[ATCG]{2}TAG[ATCG]{2}", dna) # ...ATG AC TAG GG...
print(m.group()) # print the entire match
ATGACTAGGG
Note:
if re.search() returns None. Then, print(None) will raise an ERROR
# Example
dna = "CCCATGACTAGGGGCCC"
m = re.search(r"ATG[ATCG]{2}NNN[ATCG]{2}", dna) # no detection :-(
print(m.group()) # error: no match
>#AttributeError: 'NoneType' object has no attribute 'group'
# The if skips the error: AttributeError: 'NoneType' object has no attribute 'group'
dna = "CCCATGACTAGGGGCCC"
m = re.search(r"ATG[ATCG]{2}NNN[ATCG]{2}", dna) # ...ATG AC TAG GG...
if m:
print(m.group()) # print the entire match
Capturing#
We use parenthesis to capture some parts of the pattern
m.group()
# The captures are between parentheses in the RE
dna = "CCCATGACTAGGGGCCC"
m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG...
print(m.group()) # print the entire match
print(m.group(1)) # print the 1st capture
print(m.group(2)) # print the 2nd capture
# or, in general
print("Another way:")
for i in m.groups(): # groups
print(i)
ATGACTAGGG
AC
GG
Another way:
AC
GG
The position of a capture#
Where is the pattern found within the string?
It gets array coordinates
m.start(): inclusive
m.end(): exclusive
dna = "CCCATGACTAGGGGCCC"
m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG...
print("length of dna:", len(dna))
print(m.group(), m.start(), m.end()) # print the entire match
print(m.group(1), m.start(1), m.end(1)) # print the 1st capture
print(m.group(2), m.start(2), m.end(2)) # print the 2nd capture
length of dna: 17
ATGACTAGGG 3 13
AC 6 8
GG 11 13
re.split(), splitting a string#
split(pattern, string, maxsplit=0, flags=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings. If
capturing parentheses are used in pattern, then the text of all
groups in the pattern are also returned as part of the resulting
list. If maxsplit is nonzero, at most maxsplit splits occur,
and the remainder of the string is returned as the final element
of the list.
Returns a list
# For instance: split a DNA sequence when there is IUPAC codes other than
# the standard [ATCG] nucleotides
dna = "ATGNACAGMCCATCARTCAAWAGASGATCGTTAYGCAGAKAACAVAAAGGA" # dna with other IUPAC codes: N M R W S Y K V
m = re.split(r"[^ATCG]", dna) # returns a list
print(m)
['ATG', 'ACAG', 'CCATCA', 'TCAA', 'AGA', 'GATCGTTA', 'GCAGA', 'AACA', 'AAAGGA']
More advanced
# The content of the matched parenthesis is also returned
dna = "CCCATGACTAGGGGCCC"
m = re.split(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # CCC ...ATG (AC) TAG (GG)... GCCC
print(m)
['CCC', 'AC', 'GG', 'GCCC']
re.findall(), finding multiple matches#
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
dna = "ATGTACTTTGAGTTCCCTCAGCCGTTACCTGTGTGTGGTGATATCAAAGTAGAGTTCTTCCACAAACAGAACAAGATGCTAAA" # from NM_000314.8
m_all_at = re.findall(r"[AT]{3,200}", dna) # find all the subsequences of 3-200 nt composed by A or/and T's
print(m_all_at)
['TTT', 'TTA', 'ATAT', 'AAA', 'AAA', 'TAAA']
re.finditer(), finding groups instead of sequences#
finditer(pattern, string, flags=0)
Return an iterator over all non-overlapping matches in the
string. For each match, the iterator returns a Match object.
dna = "ATGTACTTTGAGTTCCCTCAGCCGTTACCTGTGTGTGGTGATATCAAAGTAGAGTTCTTCCACAAACAGAACAAGATGCTAAA" # from NM_000314.8
m_all_groups = re.finditer(r"[AT]{3,200}", dna) # find all the subsequences composed with at least 3 A or GT's
print("length", len(dna))
for m in m_all_groups:
print(m.group(), m.start(), m.end())
length 83
TTT 6 9
TTA 24 27
ATAT 40 44
AAA 45 48
AAA 63 66
TAAA 79 83
Last message on regular expressions: There are many/different ways (regular expressions) to find the same pattern
Summary#
Pattern recognition in sequences:
Regular expressions in Biology
The most simple approach:
"ATG" in "GATGTGGCGGGACTCTTTATGCGCTGCGGCAGGA" returns a boolean: True/False
We were using already some methods. Strings:
str.count()
str.find(pattern), str.find(pattern, start, end). Return -1 on failure.
str.rfind(pattern), str.rfind(pattern, start, end). Return -1 on failure.
Strings, r stands for raw:
r”Hello world\nWelcome to this course:\tProgramming”
modules
import re
re.search(pattern, string)
search(pattern, string, flags=0) Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found
Searching complex conditions
if re.search(r”GTA”, dna) or re.search(r”GTT”, dna) or re.search(r”GTG”, dna) or re.search(r”GTC”, dna):
if re.search(r”GT(A|T|G|C)”, dna):
Character groups:
if re.search(r”GT[ATGC]”, dna):
Negate: if re.search(r”TA[^AG]”, dna): # Tyrosine: TAC, TAT
greedy or non-greedy:
“.” Matches any character except a newline.
“^” Matches the start of the string.
“$” Matches the end of the string or just before the newline at the end of the string.
“*” Matches 0 or more (greedy) repetitions of the preceding RE. Greedy means that it will match as many repetitions as possible.
“+” Matches 1 or more (greedy) repetitions of the preceding RE.
“?” Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n} Matches from m to n repetitions of the preceding RE.
{m,n}? Non-greedy version of the above.
Combining:
ATG[ATCG]{30,1000}A{5,10}
re.search() vs re.match()
Which/Where is the pattern found within the string?
m = re.search(r"ATG[ATCG]{2}TAG[ATCG]{2}", dna) # ...ATG AC TAG GG... print(m.group()) # print the entire match
dna = "CCCATGACTAGGGGCCC" m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG... print(m.group()) # print the entire match print(m.group(1)) # print the 1st capture print(m.group(2)) # print the 2nd capture
Getting the position of a match:
dna = "CCCATGACTAGGGGCCC" m = re.search(r"ATG([ATCG]{2})TAG([ATCG]{2})", dna) # ...ATG AC TAG GG... print("length of string", len(dna)) print(m.group(), m.start(), m.end()) # print the entire match print(m.group(1), m.start(1), m.end(1)) # print the 1st capture print(m.group(2), m.start(2), m.end(2)) # print the 2nd capture
Splitting a string (with regular expressions)
dna = "ATGNACAGMCCATCARTCAAWAGASGATCGTTAYGCAGAKAACAVAAAGGA" # dna with other IUPAC codes: N M R W S Y K V m = re.split(r"[^ATCG]", dna) # returns a list print(m)
Finding multiple matches
* re.findall(). Return a list of all non-overlapping matches in the string.Finding groups instead of sequences:
* re.finditer()