C2. Dealing with text: sequences#

Comments#

Comments are ignored by Python; they are written only for humans

# this is a comment
print("Hello world!") # this is also a comment
Hello world!

Numbers#

This chapter is devoted to text. But, for the sake of simplicity, I will quickly introduce numbers here

Basic number types:

  • int, that is integers: natural numbers, their corresponding negatives, and 0

    • …-3, -2, -1, 0, 1, 2, …

  • float, real numbers (stored as floating-point approximations)

    • E.g. 3.14159 (note that we are using a point, not a comma)

    • We could use E notation: 2e-4, that is 0.0002

type():
type() is a built-in function that returns the type of its argument. Try the following lines of code (statements):

type(3) 
type(3.14)
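
A minimal sketch of what these statements return, assuming a standard Python 3 interpreter. The variable names are only illustrative (variables are explained later in this chapter):

```python
# type() returns the class (type) of its argument
type_of_integer = type(3)         # int
type_of_decimal = type(3.14)      # float
type_of_e_notation = type(2e-4)   # E notation produces a float too

print(type_of_integer)
print(type_of_decimal)
print(type_of_e_notation)
```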

Basic arithmetic operations:#

- Some operations: +, -, *, /  # be aware that "/" always returns a float
- Exponentiation: 10 ** 2
- Parentheses: 5 * (3 + 2)
- Precedence of operators:  
    1- Parentheses
    2- Exponentiation 
    3- Multiplication and division (same precedence)
    4- Addition and subtraction (same precedence)
    5- Operators with the same precedence are evaluated from left to right (the exception is **, which is evaluated from right to left)

Examples

3 + 3 # 6  
3 - 3 # 0  
3 * 3 # 9  
3 / 3 # 1.0 (float)
3 ** 3 # 27

Precedence: 
    1 + 2 * 3 # 7, and not 9
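
A short sketch of how operators with the same precedence are grouped. The names are illustrative only. One detail worth knowing: chained ** operations are grouped from right to left, unlike the other arithmetic operators.

```python
left_to_right = 8 / 4 / 2        # evaluated as (8 / 4) / 2
with_parentheses = 8 / (4 / 2)   # parentheses change the order
power_chain = 2 ** 3 ** 2        # evaluated as 2 ** (3 ** 2)

print(left_to_right)     # 1.0
print(with_parentheses)  # 4.0
print(power_chain)       # 512
```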

Exercise: operations on numbers

Think what the following expressions will return and check quickly with Python if you are right:

2 + 3 - 2 # precedence?
3 * 6 * 1
3 * 6 / 1  
6 / 1     # which output type?
6.0 / 1   
4 - 3 * 2    
(4 - 3) * 2  
2 ** 3  
8 / 2 ** 2   
(8 / 2) ** 2

Strings (str)#

A string is a piece of text: some characters surrounded by quotes. The Python type used to represent strings is str. For instance:

"PTEN is a tumor suppressor" 

print()#

By now, you need to grasp some concepts:
- function, argument, statement

print("PTEN is a tumor suppressor")
# "print" is a built-in function that displays objects
#
# the "argument" of the function is "PTEN is a tumor suppressor" 
#
# this example is a line of code (aka "statement")
PTEN is a tumor suppressor
# write with elegance, even though Python accepts badly styled code
print   ( "PTEN"        ) # this works, but it is horrible!
PTEN

Advice: observe and copy the style used in this lesson

# print is not only for strings, we can print numbers
print(2 * 3) # write with elegance: observe the spaces
6

More than one argument#

Some functions can have more than one argument

# if there is more than one argument in a function:
# they are separated by commas
print("PTEN", "is", "a", "phosphatase encoded by the gene PTEN.", "Isn't it?")
PTEN is a phosphatase encoded by the gene PTEN. Isn't it?
# of course! One of the arguments (str) can contain a comma 
print("Darwin,", "Charles") # the comma will be shown
Darwin, Charles
# some functions accept named parameters, like sep
# sep="" indicates here that nothing 
# will be displayed between the arguments
print("Darwin,", "Charles:", "Evolution", sep="")
Darwin,Charles:Evolution
# now, sep="..." indicates that three dots 
# will be displayed between the arguments
print("Darwin", "Evolution", sep="...")
Darwin...Evolution
# we can even print strings combined with numbers
print("The origin of species was published in", 1860 - 1)
The origin of species was published in 1859

Quotes#

Double quotes

# double quotes
#
# be careful!
# 1. Do not use German-style quotes: „Gänsefüßchen“
# 2. Some characters look like double quotes but they are not
#    For instance: “ is not the same as "
#
# the next is correct:
print("PTENP1 is a homologous processed pseudogene of PTEN")
PTENP1 is a homologous processed pseudogene of PTEN

Single quotes

# single quotes
#
# be careful!
# 1. Some characters look like single quotes but they are not
#    For instance: ` is not the same as '
#
# the next is correct:
print('PTENP1 is not a protein')
PTENP1 is not a protein

Introducing comparison operators:#

The simplest are == (equal to) and != (not equal to). The result has a new type, the Boolean type (bool in Python), whose only values are True and False

# compare numbers
print(5 == 5)
True
# compare strings
print("PTEN" == 'PTEN')  # True, they are the same
True
print("PTEN" != 'PTEN')  # False, because they are the very same
False
print("lncRNA" == "lncrna")  # False, they look the same...but uppercase != lowercase
False

Nested quotes#

The next is an example of proper nesting

print('PTEN is a "tumor suppressor"')  # right
PTEN is a "tumor suppressor"

…But, be careful

print('PTEN is a 'tumor suppressor'') # wrong
SyntaxError: invalid syntax.

…Although, the next is a correct nesting

print("PTEN is a 'tumor suppressor'")  # right
PTEN is a 'tumor suppressor'

Be careful again!

print("ATG is the codon for "methionine"") # wrong
SyntaxError: invalid syntax.

Escape characters#

A \ (backslash) followed by the character you want to escape is a solution to the previous errors

print("ATG is the codon for \"methionine\" ")  # right
print('ATG is also the \'start codon\'')       # right
ATG is the codon for "methionine" 
ATG is also the 'start codon'

More on escape characters#

The next two are very important:

  • \t (tab)

  • \n (newline, that is, end of line)

print("ATG\tmethionine\nCCG\t")  # right
ATG	methionine
CCG	

Exercise:
- Try to print the next with one line of code:

    1	one  
    2	two  
    3	three

Error messages#

Python tells you when it detects that something is wrong in your code

The argument of print is not a str or number#

print(the genome)  # wrong
File "<stdin>", line 1
  print(the genome)  # wrong
              ^
SyntaxError: invalid syntax 

A spelling mistake#

pint("Protein")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pint' is not defined 

String divided in two lines#

If the file error_2lines.py contains

print("Dear programmer,
in your first approach to programming in Biology")

And the next is run:

emuro@laptop:~$ python3 error_2lines.py
  File "error_2lines.py", line 1
    print("Dear programmer,
                          ^
SyntaxError: EOL while scanning string literal

emuro@mylaptop:~$

Solution:
You can use an escape character (\n)

print("Dear programmer,\nin your first approach to programming in Biology") # \n
Dear programmer,
in your first approach to programming in Biology
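
Another option, sketched here as a suggestion rather than as the method of this lesson: Python joins adjacent string literals written inside parentheses, so a long text can be split over several lines of code while each line keeps its own pair of quotes. The name message is illustrative (variables are explained in the next section).

```python
# adjacent string literals inside parentheses are joined into one string
message = (
    "Dear programmer,\n"
    "in your first approach to programming in Biology"
)
print(message)
```

The \n is still needed: splitting the code over two lines does not add a line break to the text itself.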

Variables#

Variables are like labelled drawers in which you can store information. You can later get that information back just by using the name of the variable

As an example, the information stored here will be the protein sequence of HBA1, Hemoglobin subunit alpha (human)

A bioinformatician will typically obtain the sequence from Uniprot:
- Uniprot seq
- Fasta seq

Assign a value to a variable name#

Follow these naming rules:

  • Can only contain alphanumeric characters and underscores (A-Z, a-z, 0-9, and _ )

  • Cannot start with a number

  • Variable names are case-sensitive (meth != Meth != METH)

  • Cannot be a reserved Python keyword, like True, if, or for. Names of built-in functions, like print, are not reserved, but reusing them hides the function, so avoid them too
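
These rules can be checked from Python itself. The sketch below uses str.isidentifier() and the standard-library module keyword; neither is used elsewhere in this lesson, and the candidate names are made up for illustration.

```python
import keyword  # standard-library module that knows the reserved keywords

valid_name = "hba1_uniprot".isidentifier()      # letters, digits and underscores only
starts_with_digit = "1st_hba1".isidentifier()   # cannot start with a number
has_hyphen = "hba1-uniprot".isidentifier()      # "-" is not allowed
true_is_reserved = keyword.iskeyword("True")    # reserved: cannot be a variable name
print_is_reserved = keyword.iskeyword("print")  # not reserved, but better avoided

print(valid_name, starts_with_digit, has_hyphen)
print(true_is_reserved, print_is_reserved)
```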

Select a nice name#

Be clear and avoid any ambiguity, for the sake of the quality of your code

# P69905 is the id from Uniprot.
# Find a nice variable name for it! For instance,
hba1_uniprot = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"

The following is important, because many students confuse these two concepts:

# The next statements are very different
print(hba1_uniprot)   # variable (it has information associated)
print("hba1_uniprot") # string (not a variable)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
hba1_uniprot
print(hba1_uniprot) # again, hba1_uniprot is a variable (no quotes!)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
# you can print strings combined with variables
hba1_uniprot = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print("The HBA1 canonical protein sequence in uniprot is", hba1_uniprot)
The HBA1 canonical protein sequence in uniprot is MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
# or variables that contain numbers
number_of_pages = 695
print("My edition of 'On the Origin of species' book has", number_of_pages, "pages")
My edition of 'On the Origin of species' book has 695 pages
# by convention, constants are variables written in UPPER_CASE  
DARWIN_BORN_YEAR = 1809  
OS_PUBLICATION_YEAR = 1859  
print("Darwin was",  OS_PUBLICATION_YEAR - DARWIN_BORN_YEAR, "years old")
Darwin was 50 years old

We continue with the same example, HBA1, Hemoglobin subunit alpha (human)

The same bioinformatician could retrieve the protein sequence from NCBI (another web service):
- NCBI::Protein::NP_000549.1
- The sequence of NP_000549.1 (fasta format)

# NP_000549.1 sequence
# Find a nice variable name!
hba1_ncbi = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1_ncbi)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR

Compare the content of variables#

Do hba1_uniprot and hba1_ncbi store the same sequence?
Take into account that the sequences are obtained from different databases: NCBI and Uniprot

print(hba1_ncbi == hba1_uniprot) # True...but you have to check yourself!
True

Therefore, their lengths have to be the very same

len()#

It is another Python built-in function

len(obj, /)
    Return the number of items in a container.

In plain English, the length. Just be aware that it works for many types.
We will see len() applied to different types in future lessons.

# Yes! The lengths are the same
print(len(hba1_uniprot), "amino acids")
print(len(hba1_ncbi),    "amino acids")
142 amino acids
142 amino acids

Let’s learn from another example: PTEN, Phosphatase and tensin homolog (human)

Protein vs. mRNA sequence:

  • Protein sequence (403 aa):

  • mRNA (far longer):

    • UCSC::gene::browser (gene: introns + exons. Span > 100Kbp)

    • NCBI NM_000314.8: GenBank annotation (mRNA. Exons, without introns, after splicing + UTRs)

    • NCBI NM_000314.8: mRNA fasta sequence (8505 bp. The UTRs make up most of this sequence)

What happens if we need to assign a long sequence to a variable?#

We should not assign the mRNA sequence, or even its protein sequence, directly to a variable in the code (hard-coded),
i.e. pten_ncbi = "MTAIIKEIVS…
Imagine how much work it is just to join all the lines into one long line!

Bioinformatics solutions:#

- Retrieve the sequence from a fasta file
- Retrieve the sequence from a database (e.g. SQL: local or remote)

Another solution#

  • Use triple quotes

With this solution, you can assign a long sequence to a variable directly within the code.
Still, do not use it for very long sequences, like the mRNA sequence, which spans many lines

pten_uniprot = """MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSK
HKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVA
AIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSY
LLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMY
FEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEI
DSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEAS
SSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV"""

print(len(pten_uniprot)) # we know that this protein is 403 aa
                         # what happens?
409
# the triple-quoted string keeps the newlines ("\n"): 403 aa + 6 newlines = 409
# the next sequence is like the previous one, but without "\n"
pten_uniprot_1L = """MTAIIKEIVSRNKRRYQEDGFDLDLTYIYPNIIAMGFPAERLEGVYRNNIDDVVRFLDSKHKNHYKIYNLCAERHYDTAKFNCRVAQYPFEDHNPPQLELIKPFCEDLDQWLSEDDNHVAAIHCKAGKGRTGVMICAYLLHRGKFLKAQEALDFYGEVRTRDKKGVTIPSQRRYVYYYSYLLKNHLDYRPVALLFHKMMFETIPMFSGGTCNPQFVVCQLKVKIYSSNSGPTRREDKFMYFEFPQPLPVCGDIKVEFFHKQNKMLKKDKMFHFWVNTFFIPGPEETSEKVENGSLCDQEIDSICSIERADNDKEYLVLTLTKNDLDKANKDKANRYFSPNFKVKLYFTKTVEEPSNPEASSSTSVTPDVSDNEPDHYRYSDTTDSDPENEPFDEDQHTQITKV"""

print(len(pten_uniprot_1L))
403
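
The extra characters can be counted and removed. The sketch below uses a short, made-up fragment instead of the full PTEN sequence, together with str.count() and str.replace(), two methods that are explained later in this chapter.

```python
# a short fragment split over three lines (so it contains 2 newline characters)
short_seq = """MTAII
KEIVS
RNK"""

print(len(short_seq))                        # 15: 13 letters + 2 newlines
print(short_seq.count("\n"))                 # 2 newline characters
one_line_seq = short_seq.replace("\n", "")   # a new string without the newlines
print(len(one_line_seq))                     # 13
```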

How to change the value of a variable?#

lys = "AAA"
print(lys)
AAA
lys = "AAG" # change the codon stored for lysine
print(lys)  # see how "AAA" is no longer assigned to the variable
AAG

More on str (strings)#

Concatenation of str#

# PTEN CDS is: atg aca gcc atc atc ...aaagagatcgttagcagaaacaaaaggagatatcaagagg...
pten_cds = "atg" + "aca" + "gcc" + "atc" + "atc" + "..."

print(pten_cds)
atgacagccatcatc...

Concatenation of variables that contain strings#

These variables can be concatenated too

pten_cds = "atg" + "aca" + "gcc" + "atc" + "atc" + "..."
meth = "atg"
thr = "aca"
ala = "gcc"
ile = "atc"
new_pten_cds = meth + thr + ala + ile + ile + "..."
print(new_pten_cds)
atgacagccatcatc...
print(pten_cds == new_pten_cds)
True

Be aware of concatenating different types#

Note: + is an arithmetic operator (addition) when both operands are of numeric type

# addition
five = 5
two = 2
addition = five + two
print(addition)
print(type(addition)) # new! 
7
<class 'int'>

Important note:
+ cannot concatenate a string with a number

# the next code is wrong
five = 5
two = "two"
addition = five + two # error!
print(addition)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Literals: treat the string as a raw string#

Example:

r"\n"

Then it is not a newline but, literally, a backslash followed by an n:

r"\n" is the same as "\\n"...that is, the backslash is in the string
text_without_r = "one\ntwo" # \n is a newline
print(text_without_r)
one
two
text_with_r = r"one\ntwo" # \n is not a newline
print(text_with_r)
one\ntwo
# It can be confusing. Be careful! If you run (in the interpreter)  
r"\n" # it returns '\\n', that is the string with \ escaped  
'\\n'
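
A small check with len() makes the difference visible. This is only a sketch; the variable names are illustrative.

```python
escaped = "\n"   # one character: a newline
raw = r"\n"      # two characters: a backslash and the letter n

print(len(escaped))   # 1
print(len(raw))       # 2
print(raw == "\\n")   # True: both spell backslash + n
```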

Literal for paths#

Raw strings are very useful here. Observe the following paths in different operating systems

  • Windows: r"c:\windows\Desktop\file.txt"

  • Mac: "/Users/cdarwin/file.txt"

  • Linux: "/home/cdarwin/file.txt"

# Big difference because of the r
windows_file = "c:\windows\Desktop\nitrogen.txt" # This is not what we want: \n becomes a newline
print(windows_file)

windows_file = r"c:\windows\Desktop\nitrogen.txt" # This is what we want
print(windows_file)
c:\windows\Desktop
itrogen.txt
c:\windows\Desktop\nitrogen.txt

The length of a variable#

The value of a variable can be, for instance, a string. Remember that:

len()#

is a built-in function that returns a number (number of items).
Here it will be the number of characters of a string

meth = "atg"
thr = "aca"
ala = "gcc"
ile = "atc"
print("Methionine is", meth)
Methionine is atg
print("The length of the content of the var meth is: ", len(meth))
The length of the content of the var meth is:  3
peptide = meth + thr + ala + ile + ile
print(peptide)
atgacagccatcatc
print("The length of the peptide is", len(peptide), "(bp)")
The length of the peptide is 15 (bp)

A number cannot be concatenated with a string!#

It will raise an error!

"The length of the previous peptide was " + len(peptide)

#TypeError: can only concatenate str (not "int") to str

A solution is casting#

That is, changing types. For example, from number to str

str() is a function

  • It returns a string; in the next case, it takes a number as its argument

"The length of my peptide was " + str(len(peptide))
'The length of my peptide was 15'
# Note that we previously used the next, with no problem:
print("The length of my peptide was", len(peptide)) # No need of casting
The length of my peptide was 15
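
Casting also works in the other direction. The sketch below pairs str() with the built-in int(), which is not used elsewhere in this lesson; the values are made up for illustration.

```python
peptide_length = 15                     # an int
length_as_text = str(peptide_length)    # int -> str
message = "The length of my peptide was " + length_as_text
print(message)

codons_as_text = "5"                    # a str that looks like a number
codons = int(codons_as_text)            # str -> int
print(codons * 3)                       # arithmetic is possible again: 15
```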

We call it string, Python calls it str#

In Python data types are actually classes and variables are instances (objects) of these classes.

For instance, for casting: str(5).
We can use type() to see the class of a variable. See the next example:

number = 5          # int 
print(number)
print(type(number)) # object of class int

number = str(5)     # from int to string
print(number)       # looks the same, but it has a diff. type
print(type(number)) # object of class str
5
<class 'int'>
5
<class 'str'>
# help really helps!...but, it is more advanced
# help(len) 
# help(str) ...this could be complicated. It is a class

Changing case#

str.lower()#

It is a method of the class str, not a function. This method returns a str: a modified copy of the string

# lower()
# is a method not a function
# it belongs to the type str
dna = "GATTACA"
print(dna)         # print is a function
print(dna.lower()) # variable + . + method + ()
print(dna)         # dna itself is unchanged: lower() does NOT work in place
GATTACA
gattaca
GATTACA
# You can even do the next
print("GATTACA".lower())
gattaca

Important!
You need to be very careful, because:

dna = "GATTACA"
print(dna.lower()) # this works!
gattaca

Do not mix up methods with functions!
For instance, the next will provide an error:

print(lower(dna)) # there is no lower() function: str.lower() is a method of str, not a built-in function

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [117], in <cell line: 2>()
      1 # ...but, the next will provide an error
----> 2 print(lower(dna))

NameError: name 'lower' is not defined
# Save the result of str.lower() to a variable
#
dna = "aTg"
dna_modified = dna.lower() # we need a variable if we want to reuse the result of the method
print(dna)  # dna was not modified 
print(dna_modified)
aTg
atg

Check more str methods at python.org
For instance, for str.lower()

str.upper()#

The very same here: upper is a method, not a function.
As with lower, it does not change the value of the string (variable)

# upper()
# is a method not a function
# it belongs to a type, in this case str
dna = "gattaca"
print(dna)         # print is a function
print(dna.upper()) # upper is a method, with no argument => ()
                   # variable + . + method + ()
print(dna)
gattaca
GATTACA
gattaca
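
One common use of these methods, sketched here with the lncRNA example from the comparison operators section: bring both strings to the same case before comparing them. The variable names are illustrative.

```python
name_a = "lncRNA"
name_b = "lncrna"

exact_match = name_a == name_b                            # uppercase != lowercase
case_insensitive_match = name_a.lower() == name_b.lower() # compare lowercase copies

print(exact_match)             # False
print(case_insensitive_match)  # True
print(name_a)                  # lncRNA: the original string was not modified
```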

str.replace()#

replace is a method of the class str, not a function

hba1 = "MVLSPADKTNVK...M" 
print(hba1)
print(hba1.replace("M", "V")) # it replaces all the occurrences
MVLSPADKTNVK...M
VVLSPADKTNVK...V
# note that the variable does not change
print(hba1)
MVLSPADKTNVK...M
# replace a substring
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...MVLSPADKT"
print(hba1)
print(hba1.replace("MVLSPADKT", "IAIA")) # it replaces all the occurrences
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...MVLSPADKT
IAIANVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...IAIA
print(hba1) # again the original var was not modified
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSA...MVLSPADKT
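
replace() also accepts an optional third argument, the maximum number of replacements. A minimal sketch, assuming the first codons of the PTEN CDS written in upper case; the variable names and the placeholder NNN are illustrative.

```python
coding_dna = "ATGACAGCCATCATC"

rna = coding_dna.replace("T", "U")                # all the occurrences
first_only = coding_dna.replace("ATC", "NNN", 1)  # only the first occurrence

print(rna)         # AUGACAGCCAUCAUC
print(first_only)  # ATGACAGCCNNNATC
print(coding_dna)  # ATGACAGCCATCATC: the original is not modified
```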

Extracting a substring#

Summary:
variable[from_inclusive:to_exclusive]

hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1)
print(len(hba1))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
142
# substring
# summary: variable[from_inclusive:to_exclusive]
print(hba1[0:20]) # array coordinates (starting at 0). Not biological coordinates (starting at 1)!
                  # the first index is included, the last one is not
MVLSPADKTNVKAAWGKVGA
print(hba1[1:3]) # not biological coordinates
VL
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP..."
length_hba1 = len(hba1) # the whole sequence can be obtained in different ways
print(hba1[0:length_hba1])  
print(hba1[0:])  
print(hba1[:length_hba1]) 
print(hba1[:]) 
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP...
print(hba1[0:length_hba1-5]) 
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY
# Learn a trick:
# display the sequences like an alignment
my_sequence = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALER"
print(my_sequence[0:])
print("|" * len(my_sequence)) # new!: repeat the string * n times
print(my_sequence[:])
MVLSPADKTNVKAAWGKVGAHAGEYGAEALER
||||||||||||||||||||||||||||||||
MVLSPADKTNVKAAWGKVGAHAGEYGAEALER
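
Since biological coordinates usually start at 1 and include the last position, a small conversion is needed before slicing. A minimal sketch, assuming 1-based inclusive coordinates; the names bio_start and bio_end are hypothetical.

```python
hba1_start = "MVLSPADKTNVKAAWGKVGA"

bio_start = 2   # biological coordinate: 1-based, inclusive
bio_end = 5     # biological coordinate: 1-based, inclusive

# subtract 1 from the start; the end stays the same because it is exclusive in Python
fragment = hba1_start[bio_start - 1:bio_end]
print(fragment)       # VLSP: residues 2 to 5
print(len(fragment))  # 4
```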

Jump every n characters#

Summary: variable[from_inclusive:to_exclusive:step]

hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1[0::2]) # jump every 2
MLPDTVAWKGHGYAAEMLFTKYPFLHSQKHKVDLNVHDMNLASLAKRDVFLSCLTAHPETAHSDFAVTLSY
print(hba1[0::3]) # jump every 3
MSDNAGGAYEEFFTYHLGQGKALAHDNSSHKVVKSLTAPFAADLVVSR
print(hba1[0::7]) # jump every 7
MKWAASYSKAVPSLFLHPDSY

Another jump example#

Summary: variable[from_inclusive:to_exclusive:step]

meth_with_spaces = "a t g "
print(meth_with_spaces[0::2])
atg
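
A step of 3 fits coding sequences well, because codons have three bases. The sketch below takes the start of the PTEN CDS used in this chapter and collects the first, second and third base of every codon; the variable names are illustrative.

```python
cds_start = "atgacagccatcatc"   # atg aca gcc atc atc

first_bases = cds_start[0::3]   # first base of each codon
second_bases = cds_start[1::3]  # second base of each codon
third_bases = cds_start[2::3]   # third base of each codon

print(first_bases)   # aagaa
print(second_bases)  # tcctt
print(third_bases)   # gaccc
```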

Negative means starting from the end#

-1: the last character
This is very useful when dealing with biological sequences

# for instance, retrieving the last part of the sequence
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1[-1:]) # The last character
print(hba1[-2:]) # last 2 characters
print(hba1[-3:]) # last 3 characters
R
YR
KYR

Python reads here like a sentence in a book: from left to right#

peptide = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLS"
print(peptide[-2:-1]) # The amino acid before the last
print(peptide[0:-2])  # all but the 2 last amino acids 
L
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMF
# but be careful: order matters! 
# The start must come before the end; otherwise the result is empty.
print(hba1[-1:-2]) # empty

Summary of substring#

# variable[from_inclusive:to_exclusive]
print(hba1[5:10]) # from 5 (inclusive) to 10 (exclusive). Note: indexes start at 0
ADKTN

Reverse a str#

Very useful with biological sequences! Imagine that you have the sequence of one of the DNA strands and need to read it in the opposite direction

my_seq = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK"
print(my_seq[::-1])
KGHGKVQASGHSLDFHPFYTKTTPFSLFMRELAEAGYEGAHAGVKGWAAKVNTKDAPSLVM
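
For DNA, reversing is usually combined with complementing each base. The sketch below is one possible way to do it, using str.maketrans() and str.translate(), two standard str methods that are not covered in this lesson; the short sequence is the start of the PTEN CDS in upper case.

```python
dna = "ATGACAGCC"

reversed_dna = dna[::-1]                          # read the strand backwards
complement_table = str.maketrans("ACGT", "TGCA")  # A<->T and C<->G
reverse_complement = reversed_dna.translate(complement_table)

print(reversed_dna)        # CCGACAGTA
print(reverse_complement)  # GGCTGTCAT
```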

str.count()#

It is again a method

hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1)
print(len(hba1))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
142
num_a = hba1.count("A") # count() is a method
print("Number of alanines: " + str(num_a))
Number of alanines: 21
# count substrings
hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1.count("LS")) 
6
# .. but only 1 "ADK" 
print(hba1.count("ADK")) # count() is a method
1

count, an example#

Simple test showing that it counts correctly

hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(len(hba1))
142

Amino acids, letters

# the counts of the 20 amino acids (aa) add up to the length of the sequence
print(hba1.count("A") + hba1.count("R") + hba1.count("N") + hba1.count("D") + hba1.count("C") + hba1.count("Q") + hba1.count("E") + hba1.count("G") + hba1.count("H") + hba1.count("I") + hba1.count("L") + hba1.count("K") + hba1.count("M") + hba1.count("F") + hba1.count("P") + hba1.count("S") + hba1.count("T") + hba1.count("W") + hba1.count("Y") + hba1.count("V"))
142
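
The same idea, counts compared with the total length, gives a classic measure for nucleotide sequences: the GC content. A minimal sketch with the start of the PTEN CDS in upper case; the variable names are illustrative.

```python
cds_start = "ATGACAGCCATCATC"

gc_count = cds_start.count("G") + cds_start.count("C")  # number of G plus number of C
gc_fraction = gc_count / len(cds_start)                 # "/" returns a float

print(gc_count)               # 7
print(round(gc_fraction, 2))  # 0.47
```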

str.find()#

It is a method: it returns the index of the first occurrence of the substring, in array coordinates (starting at 0)

hba1 = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
print(hba1)
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
print(hba1.find("A"))  # returns the index of first occurrence of the substring
5

With a longer substring, the index returned is the position where the substring starts

print(hba1.find("SKY"))  # returns the index of first occurrence of the substring
138
print(hba1.find("X"))  # returns -1: "X" is not in hba1 (here -1 does not mean the last position!)
-1

Note: to find all the occurrences or complex patterns, regular expressions are more convenient
Regular expressions are more advanced material
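
For a simple case, find() itself can look for a later occurrence: it accepts an optional second argument, the index where the search starts. A minimal sketch with the beginning of the HBA1 sequence; the variable names are illustrative.

```python
hba1_start = "MVLSPADKTNVKAAWGKVGA"

first_a = hba1_start.find("A")                # first alanine
second_a = hba1_start.find("A", first_a + 1)  # search again, after the first one
not_found = hba1_start.find("X")              # -1 means "not found"

print(first_a)    # 5
print(second_a)   # 12
print(not_found)  # -1
```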

Summary#

  • Comments

    # this is a comment
    
  • Statement, function, argument

    meth = "ATG" # a statement
    print(meth)  # print is a function, meth is its argument
    
  • Error messages

      >>> primt
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      NameError: name 'primt' is not defined
      >>> 
    
  • Variables: assign a value to a variable

    meth = "ATG"
    
  • Types
    Numbers vs strings

  • Functions vs. methods

    • print() vs meth.lower()

  • String concatenation

        dna = "ATG" + "ACG"
    
    • be careful: do not concatenate a string with a number.

  • Quotes

    • double quotes

    • single quotes

  • Special characters (escaping)
    - \t
    - \n

  • String: changing case (str.lower() and str.upper())

    meth = "ATG"
    meth_in_lowercase = meth.lower()
    
  • Counting and finding substrings (str.count() and str.find())

    hba1.count("A")
    hba1.find("A")
    
  • Replacing substrings

    dna.replace("T", "U")  # obtain the RNA sequence from the coding DNA sequence
    
  • Extract substrings

    • variable[start:end:step]

    hba1[0::2]
    

Exercises#