C2. Dealing with text: sequences#
Numbers#
This chapter is devoted to text, but for the sake of simplicity I will briefly introduce numbers first
Basic number types:
int, that is, integers: the natural numbers, their corresponding negatives, and 0
…-3, -2, -1, 0, 1, 2, …
float, floating-point numbers (they approximate real numbers)
E.g. 3.14159 (note that we are using a point, not a comma)
We could use E notation: 2e-4, that is 0.0002
type():
It is a built-in function that returns the type of the argument. Try the next lines of code (statements):
type(3)
type(3.14)
Basic arithmetic operations:#
- Some operations: +, -, *, / # be aware that "/" returns a float
- Exponentiation: 10 ** 2
- Parentheses: 5 * (3 + 2)
- Precedence of operators:
1- Parentheses
2- Exponentiation
3- Multiplication and division (same precedence)
4- Addition and subtraction (same precedence)
5- Operators with the same precedence are evaluated from left to right (the exception is **, which is evaluated from right to left)
Examples
3 + 3 # 6
3 - 3 # 0
3 * 3 # 9
3 / 3 # 1.0 (float)
3 ** 3 # 27
Precedence:
1 + 2 * 3 # 7, and not 9
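The precedence rules above can be checked with a few short expressions. This is only a minimal sketch with arbitrary numbers; the variable names a to e are just labels for this illustration.

```python
# Exponentiation binds tighter than multiplication
a = 2 * 3 ** 2      # read as 2 * (3 ** 2)
b = (2 * 3) ** 2    # parentheses change the order

# Same precedence: evaluated from left to right
c = 12 / 4 * 3      # read as (12 / 4) * 3
d = 12 / (4 * 3)    # parentheses change the order

# Exception: ** groups from right to left
e = 2 ** 3 ** 2     # read as 2 ** (3 ** 2)

print(a, b, c, d, e)  # 18 36 9.0 1.0 512
```

When in doubt, adding parentheses makes the intended order explicit and costs nothing.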
Exercise: operations on numbers
Think about what the following expressions will return, then quickly check with Python whether you are right:
2 + 3 - 2 # precedence?
3 * 6 * 1
3 * 6 / 1
6 / 1 # which output type?
6.0 / 1
4 - 3 * 2
(4 - 3) * 2
2 ** 3
8 / 2 ** 2
(8 / 2) ** 2
Strings (str)#
A string is a bit of text, some characters surrounded by quotes. The Python type used to represent strings is str. For instance:
"PTEN is a tumor suppressor"
print()#
For now, you need to grasp some concepts:
- function, argument, statement
print("PTEN is a tumor suppressor")
# "print" is a built-in function that displays objects
#
# the "argument" of the function is "PTEN is a tumor suppressor"
#
# this example is a line of code (aka "statement")
PTEN is a tumor suppressor
# write your code elegantly, even though Python tolerates bad style
print ( "PTEN" ) # this works, but it is horrible!
PTEN
Advice: observe and copy the style used in this lesson
# print is not only for strings, we can print numbers
print(2 * 3) # write with elegance: observe the spaces
6
More than one argument#
Some functions can have more than one argument
# if there is more than one argument in a function:
# they are separated by commas
print("PTEN", "is", "a", "phosphatase encoded by the gene PTEN.", "Isn't it?")
PTEN is a phosphatase encoded by the gene PTEN. Isn't it?
# of course! One of the arguments (str) can contain a comma
print("Darwin,", "Charles") # the comma will be shown
Darwin, Charles
# some functions accept named parameters, like sep
# sep="" indicates here that nothing
# will be displayed between the arguments
print("Darwin,", "Charles:", "Evolution", sep="")
Darwin,Charles:Evolution
# now, sep="..." indicates that three dots
# will be displayed between the arguments
print("Darwin", "Evolution", sep="...")
Darwin...Evolution
# we can even print strings combined with numbers
print("The origin of species was published in", 1860 - 1)
The origin of species was published in 1859
Quotes#
Double quotes
# double quotes
#
# be careful!
# 1. Do not use German-style quotes: „Gänsefüßchen“
# 2. Some characters look like double quotes, but they are not
# For instance: “ is not the same as "
#
# the next is correct:
print("PTENP1 is a homologous processed pseudogene of PTEN")
PTENP1 is a homologous processed pseudogene of PTEN
Single quotes
# single quotes
#
# be careful!
# 1. Some characters look like single quotes, but they are not
# For instance: ` is not the same as '
#
# the next is correct:
print('PTENP1 is not a protein')
PTENP1 is not a protein
Introducing comparison operators:#
The simplest are == (equal to) and != (not equal to). The result belongs to a new type, the Boolean type (bool in Python), whose only values are True and False
# compare numbers
print(5 == 5)
True
# compare strings
print("PTEN" == 'PTEN') # True, they are the same
True
print("PTEN" != 'PTEN') # False, because they are the very same
False
print("lncRNA" == "lncrna") # False, they look the same...but uppercase != lowercase
False
Nested quotes#
The next is an example of proper nesting
print('PTEN is a "tumor suppressor"') # right
PTEN is a "tumor suppressor"
…But, be careful
print('PTEN is a 'tumor suppressor'') # wrong
SyntaxError: invalid syntax.
…Although, the next is a correct nesting
print("PTEN is a 'tumor suppressor'") # right
PTEN is a 'tumor suppressor'
Be careful again!
print("ATG is the codon for "methionine"") # wrong
SyntaxError: invalid syntax.
Escape characters#
\ (backslash) followed by the character you want to escape…
… is a solution to the previous errors
print("ATG is the codon for \"methionine\" ") # right
print('ATG is also the \'start codon\'') # right
ATG is the codon for "methionine"
ATG is also the 'start codon'
More on escape characters#
The next two are very important:
\t (tab)
\n (newline, that is, end of line)
print("ATG\tmethionine\nCCG\t") # right
ATG methionine
CCG
Exercise:
- Try to print the next with one line of code:
1 one
2 two
3 three
Error messages#
Python tells you when it detects that something is wrong in your code
The argument of print is not a str or number#
print(the genome) # wrong
File "<stdin>", line 1
print(the genome) # wrong
^
SyntaxError: invalid syntax
A spelling mistake#
pint("Protein")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'pint' is not defined
String divided in two lines#
If the file error_2lines.py contains
print("Dear programmer,
in your first approach to programming in Biology")
And the next is run:
emuro@laptop:~$ python3 error_2lines.py
File "error_2lines.py", line 1
print("Dear programmer,
^
SyntaxError: EOL while scanning string literal
emuro@mylaptop:~$
Solution:
You can use an escape character (\n)
print("Dear programmer,\nin your first approach to programming in Biology") # \n
Dear programmer,
in your first approach to programming in Biology
Variables#
Variables are like drawers where you can store information. You can retrieve that information later simply by using the name of the variable
In this example, the information is the protein sequence of HBA1, Hemoglobin subunit alpha (human)
A bioinformatician will typically obtain the sequence from Uniprot:
- Uniprot seq
- Fasta seq
Assign a value to a variable name#
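Follow these naming rules:
Can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _)
Cannot start with a number
Variable names are case-sensitive (meth, Meth and METH are three different names)
Cannot be a reserved Python keyword, like True, if, or for. Names of built-in functions, like print or len, are not reserved, but reusing them hides the original function, so avoid them too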
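The naming rules above can be tested from Python itself. The following is a minimal sketch that uses the built-in str.isidentifier() method and the standard keyword module; the candidate names are hypothetical examples written as plain strings. Note that isidentifier() is somewhat more permissive than the rules listed here, because Python 3 also accepts non-ASCII letters in names.

```python
import keyword

# Hypothetical candidate names, checked as plain strings
print("hba1_uniprot".isidentifier())   # True: letters, digits, underscore
print("1st_sequence".isidentifier())   # False: starts with a number
print("hba1-uniprot".isidentifier())   # False: the hyphen is not allowed

# Reserved keywords pass isidentifier(), so they need a separate check
print(keyword.iskeyword("True"))       # True: reserved
print(keyword.iskeyword("print"))      # False: not reserved, but better avoided
```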
Select a nice name#
Be clear and avoid any ambiguity: the quality of your code depends on it
# P69905 is the id from Uniprot.
# Find a nice variable name for it! For instance,
hba1_uniprot = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
The following distinction is important, because many students confuse these two concepts:
# The next statements are very different
print(hba1_uniprot) # variable (it has information associated)
print("hba1_uniprot") # string (not a variable)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
hba1_uniprot
print(hba1_uniprot) # again, hba1_uniprot is a variable (no quotes!)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
# you can print strings combined with variables
hba1_uniprot = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print("The HBA1 canonical protein sequence in uniprot is", hba1_uniprot)
The HBA1 canonical protein sequence in uniprot is MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
# or variables that contain numbers
number_of_pages = 695
print("My edition of 'On the Origin of species' book has", number_of_pages, "pages")
My edition of 'On the Origin of species' book has 695 pages
# by convention, constants are variables written in UPPER_CASE
DARWIN_BORN_YEAR = 1809
OS_PUBLICATION_YEAR = 1859
print("Darwin was", OS_PUBLICATION_YEAR - DARWIN_BORN_YEAR, "years old")
Darwin was 50 years old
We continue with the same example, HBA1, Hemoglobin subunit alpha (human)
The same bioinformatician could retrieve the protein sequence from NCBI (another web service):
- NCBI::Protein::NP_000549.1
- The sequence of NP_000549.1 (fasta format)
# NP_000549.1 sequence
# Find a nice variable name!
hba1_ncbi = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1_ncbi)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
Compare the content of variables#
Do hba1_uniprot and hba1_ncbi store the same sequence?
Take into account that the sequences are obtained from different databases: NCBI and Uniprot
print(hba1_ncbi == hba1_uniprot) # True...but you have to check yourself!
True
Therefore, their lengths have to be the very same
len()#
It is another Python built-in function
len(obj, /)
Return the number of items in a container.
In plain English, the length. Just be aware that it works for many types.
We will see len() applied to different types in future lessons.
# Yes! The lengths are the same
print(len(hba1_uniprot), "amino acids")
print(len(hba1_ncbi), "amino acids")
142 amino acids
142 amino acids
Let’s learn from another example: PTEN, Phosphatase and tensin homolog (human)
Protein vs. mRNA sequence:
Protein sequence (403 aa):
mRNA (far longer):
UCSC::gene::browser (gene: introns + exons. Span > 100Kbp)
NCBI NM_000314.8: GenBank annotation (mRNA. Exons, without introns, after splicing + UTRs)
NCBI NM_000314.8: mRNA fasta sequence (8505 bp. The UTRs make up most of this sequence)
What happens if we need to assign a long sequence to a variable?#
We should not hard-code the mRNA sequence directly in a variable assignment, nor its protein sequence
i.e. pten_ncbi = "MTAIIKEIVS…
Imagine how much work it is just to join all the lines into one long line!
Bioinformatics solutions:#
- Retrieve the sequence from a fasta file
- Retrieve the sequence from a database (e.g. SQL: local or remote)
Another solution#
Use triple quotes
With this solution, you will be able to assign the long sequence to a var directly within the code.
Still, do not use it for very long sequences, like the mRNA sequence with so many lines
pten_uniprot = """MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSK
HKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVA
AIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSY
LLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMY
FEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEI
DSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEAS
SSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV"""
print(len(pten_uniprot)) # we know that this protein is 403 aa
# what happens?
409
# the triple-quoted string keeps the newlines ("\n"): 403 aa + 6 newlines = 409
# the next sequence is like the previous one, but without "\n"
pten_uniprot_1L = """MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSKHKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVAAIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSYLLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMYFEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEIDSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEASSSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV"""
print(len(pten_uniprot_1L))
403
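If retyping the sequence on one line is not practical, the newlines can also be removed in code. The sketch below uses str.replace(), a method covered later in this chapter, on a short fragment (the first 20 amino acids of the PTEN sequence above, split over two lines for illustration). The variable names fragment and fragment_1l are hypothetical.

```python
# Short two-line fragment, only for illustration
fragment = """MTAIIKEIVS
RNKRRYQEDG"""
print(len(fragment))       # 21: 20 amino acids + 1 newline

# Replace every newline with nothing (an empty string)
fragment_1l = fragment.replace("\n", "")
print(fragment_1l)         # MTAIIKEIVSRNKRRYQEDG
print(len(fragment_1l))    # 20
```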
How to change the value of a variable?#
lys = "AAA"
print(lys)
AAA
lys = "AAG" # change the value of Lysine
print(lys) # see how "AAA" is no longer assigned to the var
AAG
More on str (strings)#
Concatenation of str#
# PTEN CDS is: atg aca gcc atc atc ...aaagagatcgttagcagaaacaaaaggagatatcaagagg...
pten_cds = "atg" + "aca" + "gcc" + "atc" + "atc" + "..."
print(pten_cds)
atgacagccatcatc...
Concatenation of variables that contain strings#
These variables can be concatenated too
pten_cds = "atg" + "aca" + "gcc" + "atc" + "atc" + "..."
meth = "atg"
thr = "aca"
ala = "gcc"
ile = "atc"
new_pten_cds = meth + thr + ala + ile + ile + "..."
print(new_pten_cds)
atgacagccatcatc...
print(pten_cds == new_pten_cds)
True
Be aware of concatenating different types#
Note: + is an arithmetic operator when all the operands are of numeric type
# addition
five = 5
two = 2
addition = five + two
print(addition)
print(type(addition)) # new!
7
<class 'int'>
Important Note:
+: we cannot concatenate values of different types, such as a number and a string
# the next code is wrong
five = 5
two = "two"
addition = five + two # error!
print(addition)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Literals: treat the string as a raw string#
Example:
r"\n"
Here, \n is not a newline, but literally a backslash followed by an n:
r"\n" is the same as "\\n"...that is, the backslash itself is in the string
text_without_r = "one\ntwo" # \n is a newline
print(text_without_r)
one
two
text_with_r = r"one\ntwo" # \n is not a newline
print(text_with_r)
one\ntwo
# It can be confusing. Be careful! If you run (in the interpreter)
r"\n" # it returns '\\n', that is the string with \ escaped
'\\n'
Literal for paths#
Raw strings are very useful for file paths. Compare the following paths on different operating systems
Windows: r"c:\windows\Desktop\file.txt"
Mac: "/Users/cdarwin/file.txt"
Linux: "/home/cdarwin/file.txt"
# Big difference because of the r
windows_file = "c:\windows\Desktop\nitrogen.txt" # This is not what we want: \n is read as a newline
print(windows_file)
windows_file = r"c:\windows\Desktop\nitrogen.txt" # This is what we want
print(windows_file)
c:\windows\Desktop
itrogen.txt
c:\windows\Desktop\nitrogen.txt
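As a cross-check of the raw-string behaviour shown above, the same path can be written by escaping every backslash by hand. This is only a sketch; the variable names raw_path and escaped_path are hypothetical.

```python
# Two spellings of the same Windows path
raw_path = r"c:\windows\Desktop\nitrogen.txt"          # raw string
escaped_path = "c:\\windows\\Desktop\\nitrogen.txt"    # each backslash escaped

print(raw_path == escaped_path)   # True: both store the same characters
print(raw_path.count("\n"))       # 0: there is no newline in the path
print(raw_path.count("\\"))       # 3: three real backslashes
```

The raw string is usually easier to read, which is why it is the common choice for Windows paths.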
The length of a variable#
The value stored in a variable can be, for instance, a string. Remember that:
len()#
is a built-in function that returns a number (number of items).
Here it will be the number of characters of a string
meth = "atg"
thr = "aca"
ala = "gcc"
ile = "atc"
print("Methionine is", meth)
Methionine is atg
print("The length of the content of the var meth is: ", len(meth))
The length of the content of the var meth is: 3
peptide = meth + thr + ala + ile + ile
print(peptide)
atgacagccatcatc
print("The length of the peptide is", len(peptide), "(bp)")
The length of the peptide is 15 (bp)
A number cannot be concatenated with a string!#
It will raise an error!
"The length of the previous peptide was " + len(peptide)
#TypeError: can only concatenate str (not "int") to str
A solution is casting#
That is, converting a value from one type to another. For example, from a number to a str
str() is a built-in function
It returns a string; in the next case, it takes a number as its argument
"The length of my peptide was " + str(len(peptide))
'The length of my peptide was 15'
# Note that we previously used the next, with no problem:
print("The length of my peptide was", len(peptide)) # No need of casting
The length of my peptide was 15
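Casting also works in the other direction. The sketch below assumes the text really looks like a number; int() and float() are built-in functions, while the variable names are hypothetical.

```python
# From number to str, as in the example above
length = 15
message = "The length of my peptide was " + str(length)
print(message)             # The length of my peptide was 15

# From str to number
count_text = "21"          # a number stored as text
count_number = int(count_text)
print(count_number + 1)    # 22: now it behaves as a number

ratio = float("0.5")
print(ratio * 2)           # 1.0
```

If the text does not look like a number (for example int("AAA")), Python raises a ValueError.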
We call it string, Python calls it str#
In Python, data types are actually classes, and the values that variables refer to are instances (objects) of these classes.
For instance, for casting: str(5).
We can use type() to see the class of a variable. See the next example:
number = 5 # int
print(number)
print(type(number)) # object of class int
number = str(5) # from int to string
print(number) # looks the same, but it has a diff. type
print(type(number)) # object of class str
5
<class 'int'>
5
<class 'str'>
# help really helps!...but, it is more advanced
# help(len)
# help(str) ...this could be complicated. It is a class
Changing case#
str.lower()#
It is a method of a class, not a function. In this case, the class is str. This method returns a new str: a lowercase copy of the string
# lower()
# is a method not a function
# it belongs to the type str
dna = "GATTACA"
print(dna) # print is a function
print(dna.lower()) # variable + . + method + ()
print(dna) # dna is unchanged: lower() does NOT work in place
GATTACA
gattaca
GATTACA
# You can even do the next
print("GATTACA".lower())
gattaca
Important!
You need to be very careful, because:
dna = "GATTACA"
print(dna.lower()) # this works!
gattaca
Do not mix up methods with functions!
For instance, the next will raise an error:
print(lower(dna)) # there is no lower function: str.lower() is a method of str, not a built-in function
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Input In [117], in <cell line: 2>()
1 # ...but, the next will provide an error
----> 2 print(lower(dna))
NameError: name 'lower' is not defined
# Save the result of str.lower() to a variable
#
dna = "aTg"
dna_modified = dna.lower() # we need a var if we want to use again the result of the method
print(dna) # dna was not modified
print(dna_modified)
aTg
atg
Check more str methods at python.org
For instance, for str.lower()
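A common use of str.lower() is comparing two strings while ignoring case, which connects with the earlier comparison where "lncRNA" and "lncrna" were not equal. A minimal sketch, with hypothetical variable names:

```python
name_a = "lncRNA"
name_b = "LNCRNA"

print(name_a == name_b)                   # False: uppercase != lowercase
print(name_a.lower() == name_b.lower())   # True: both become "lncrna"

# The original variables are not modified
print(name_a, name_b)                     # lncRNA LNCRNA
```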
str.upper()#
The same applies here: upper is a method, not a function.
As with lower, it does not change the value of the string (variable)
# upper()
# is a method not a function
# it belongs to a type, in this case str
dna = "gattaca"
print(dna) # print is a function
print(dna.upper()) # upper is a method, with no argument => ()
# variable + . + method + ()
print(dna)
gattaca
GATTACA
gattaca
str.replace()#
replace is a method of the class str, not a function
hba1 = "MVLSPADKTNVK...M"
print(hba1)
print(hba1.replace("M", "V")) # it replaces all the occurrences
MVLSPADKTNVK...M
VVLSPADKTNVK...V
# note that the variable does not change
print(hba1)
MVLSPADKTNVK...M
# replace a substring
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...MVLSPADKT"
print(hba1)
print(hba1.replace("MVLSPADKT", "IAIA")) # it replaces all the occurrences
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...MVLSPADKT
IAIANVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...IAIA
print(hba1) # again the original var was not modified
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...MVLSPADKT
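Two more uses of str.replace() are sketched below under simple assumptions: writing a coding DNA sequence as RNA by swapping T for U, and limiting the number of replacements with the optional third argument. The sequence is the start of the PTEN CDS used in this chapter, written in uppercase; the variable names are hypothetical.

```python
coding_dna = "ATGACAGCCATCATC"

# Write the sequence as RNA: every T becomes U
mrna = coding_dna.replace("T", "U")
print(mrna)          # AUGACAGCCAUCAUC

# The optional third argument limits how many occurrences are replaced
first_only = coding_dna.replace("ATC", "atc", 1)
print(first_only)    # ATGACAGCCatcATC
```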
Extracting a substring#
Summary:
variable[from_inclusive:to_exclusive]
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1)
print(len(hba1))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
142
# substring
# summary: variable[from_inclusive:to_exclusive]
print(hba1[0:20]) # array coordinates (they start at 0). Not biological coordinates (they start at 1)!
# first considered, last not considered
MVLSPADKTNVKAAWGKVGA
print(hba1[1:3]) # not biological coordinates
VL
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP..."
length_hba1 = len(hba1) # next: four equivalent ways of getting the whole sequence
print(hba1[0:length_hba1])
print(hba1[0:])
print(hba1[:length_hba1])
print(hba1[:])
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
print(hba1[0:length_hba1-5])
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY
# Learn a trick:
# display the sequences like an alignment
my_sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
print(my_sequence[0:])
print("|" * len(my_sequence)) # new!: string * n repeats the string n times
print(my_sequence[:])
MVLSPADKTNVKAAWGKVGAHAGEYGAEALER
||||||||||||||||||||||||||||||||
MVLSPADKTNVKAAWGKVGAHAGEYGAEALER
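Because slices use array coordinates (starting at 0, end excluded) while databases usually report biological coordinates (starting at 1, end included), a conversion is needed. The following is a minimal sketch; bio_start and bio_end are hypothetical names for a 1-based, inclusive range.

```python
peptide = "MVLSPADKTNVKAAWGKVGA"   # first 20 amino acids of HBA1

# Biological coordinates: residues 6 to 8, both included
bio_start = 6
bio_end = 8

# Array coordinates: subtract 1 from the start, keep the end as it is
fragment = peptide[bio_start - 1:bio_end]
print(fragment)        # ADK
print(len(fragment))   # 3, that is bio_end - bio_start + 1
```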
Jump every n characters#
Summary: variable[from_inclusive:to_exclusive:step]
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1[0::2]) # jump every 2
MLPDTVAWKGHGYAAEMLFTKYPFLHSQKHKVDLNVHDMNLASLAKRDVFLSCLTAHPETAHSDFAVTLSY
print(hba1[0::3]) # jump every 3
MSDNAGGAYEEFFTYHLGQGKALAHDNSSHKVVKSLTAPFAADLVVSR
print(hba1[0::7]) # jump every 7
MKWAASYSKAVPSLFLHPDSY
Another jump example#
Summary: variable[from_inclusive:to_exclusive:step]
meth_with_spaces = "a t g "
print(meth_with_spaces[0::2])
atg
Negative means starting from the end#
-1: the last character
This is very useful when dealing with biological sequences
# for instance, retrieving the last part of the sequence
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1[-1:]) # The last character
print(hba1[-2:]) # last 2 characters
print(hba1[-3:]) # last 3 characters
R
YR
KYR
Python reads here like a sentence in a book: from left to right#
peptide = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLS"
print(peptide[-2:-1]) # The amino acid before the last
print(peptide[0:-2]) # all but the 2 last amino acids
L
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF
# but be careful: order matters!
# With the default step, start must come before end; otherwise the result is empty.
print(hba1[-1:-2]) # empty string
Summary of substring#
# variable[from_inclusive:to_exclusive]
print(hba1[5:10]) # from 5 (inclusive) to 10 (exclusive). Note: indexes start at 0
ADKTN
Reverse a str#
Very useful with biological sequences! For example, when you have the sequence of one of the DNA strands and need to read it in the opposite direction
my_seq = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK"
print(my_seq[::-1])
KGHGKVQASGHSLDFHPFYTKTTPFSLFMRELAEAGYEGAHAGVKGWAAKVNTKDAPSLVM
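For DNA, reversing is usually combined with complementing each base to obtain the other strand. The sketch below uses only tools from this chapter (str.replace(), str.upper() and slicing). It assumes an uppercase sequence that contains only A, C, G and T; the lowercase letters are a temporary trick so that a base that was already replaced is not replaced again.

```python
dna = "GATTACA"

# Complement each base: A<->T and G<->C
# Lowercase marks the bases that were already replaced
complement = dna.replace("A", "t").replace("T", "a").replace("G", "c").replace("C", "g")
complement = complement.upper()
print(complement)           # CTAATGT

# Reverse the complement to read the other strand in its own direction
reverse_complement = complement[::-1]
print(reverse_complement)   # TGTAATC
```

This is an illustration of the idea rather than the standard way to do it; dedicated bioinformatics libraries offer ready-made functions for this task.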
str.count()#
It is again a method
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1)
print(len(hba1))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
142
num_a = hba1.count("A") # count() is a method
print("Number of alanines: " + str(num_a))
Number of alanines: 21
# count substrings
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1.count("LS"))
6
# .. but only 1 "ADK"
print(hba1.count("ADK")) # count() is a method
1
count, an example#
Simple test showing that it counts correctly
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(len(hba1))
142
# the 20 amino acids (aa)
print(hba1.count("A") + hba1.count("R") + hba1.count("N") + hba1.count("D") + hba1.count("C") + hba1.count("Q") + hba1.count("E") + hba1.count("G") + hba1.count("H") + hba1.count("I") + hba1.count("L") + hba1.count("K") + hba1.count("M") + hba1.count("F") + hba1.count("P") + hba1.count("S") + hba1.count("T") + hba1.count("W") + hba1.count("Y") + hba1.count("V"))
142
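The same idea of adding counts gives a simple sequence statistic, the GC content of a DNA sequence. This is a minimal sketch that uses the start of the PTEN CDS from this chapter; the variable names are hypothetical.

```python
cds_start = "atgacagccatcatc"   # start of the PTEN CDS (lowercase)

# Count g and c, then divide by the total length
gc_count = cds_start.count("g") + cds_start.count("c")
gc_fraction = gc_count / len(cds_start)

print(gc_count, "of", len(cds_start), "bases are g or c")   # 7 of 15 bases are g or c
print(gc_fraction)
```

Note that count() is case-sensitive, so a sequence with mixed case should be converted with lower() or upper() first.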
str.find()#
It is a method: it returns the index of the first occurrence of the substring, in array coordinates (starting at 0)
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
print(hba1.find("A")) # returns the index of first occurrence of the substring
5
For a longer substring, the index returned is the position of its first character
print(hba1.find("SKY")) # returns the index of first occurrence of the substring
138
print(hba1.find("X")) # returns -1, meaning "not found" (it is not the index of the last character!)
-1
Note: to find all the occurrences or more complex patterns, we will need regular expressions
Regular expressions are more advanced material
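Before reaching for regular expressions, a later occurrence can be located with the optional second argument of str.find(), which tells the method where to start searching. A minimal sketch on the first 20 amino acids of HBA1, with hypothetical variable names:

```python
peptide = "MVLSPADKTNVKAAWGKVGA"

first_a = peptide.find("A")                 # search from the beginning
second_a = peptide.find("A", first_a + 1)   # search after the first hit
missing = peptide.find("X")                 # not present in the peptide

print(first_a)    # 5
print(second_a)   # 12
print(missing)    # -1
```

Always check for -1 before using the result as an index, because -1 is also a valid index (the last character).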
Summary#
Comments
# commentary
Statement, function, argument
meth = "ATG" # or print(meth)
Error messages
>>> primt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'primt' is not defined
>>>
Variables: assign a value to a variable
meth = "ATG"
Types
Numbers vs strings
Functions vs. methods
print() vs meth.lower()
String concatenation
dna = "ATG" + "ACG"
be careful: do not concatenate a string with a number.
Quotes
double quotes
single quotes
Special characters (escaping)
- \t
- \n
String: changing case (str.lower() and str.upper())
meth = "ATG"
meth_in_lowercase = meth.lower()
str.count() (and str.find()) substrings
hba1.count("A")
hba1.find("A")
Replacing substrings
hba1.replace("T", "U") # obtain mRNA from cDNA
Extract substrings
variable[start:end:step]
hba1[0::2]
Comments#
Comments are ignored by Python; they are written only for humans