C4. Lists of elements: iterating over them#
A list#
A list stores a sequence of elements: for instance, the canonical protein-coding genes of a given species, ordered by their loci within the chromosomes.
list is another built-in data type in Python, like int.
Why are Lists useful?#
Until now we have used each variable to store a single piece of information (a string, a number). Examples:
The name of a programming language (e.g. “Python”)
A given day of the month (e.g. 11)
But what happens when the data has multiple pieces, or when we have a sequence of elements?
But, we could assign the information to individual variables. Couldn’t we?#
# One variable: one piece of information (str)
my_globin = "HBA1"
# One variable: more globins in a single str
globins = "HBA1, HBA2, HBB, HBD, HBE1, HBG1, HBG2, HBM, HBQ1, HBZ, MB" # a string.
Now there is a problem whenever we need to access a specific globin, or the one at a specific position within the string. We could even have thousands of ordered protein names.
Another example#
See the result of a multiple sequence alignment provided by Clustal X
# we can again assign a different variable per line of alignment
seq_alignment_1 = "AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS"
seq_alignment_2 = "AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS"
seq_alignment_3 = "DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS"
seq_alignment_4 = "AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS"
seq_alignment_5 = "AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS"
seq_alignment_6 = "AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS"
seq_alignment_7 = "FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS"
…Imagine this with many more aligned sequences. The resulting code would be a pain.
A solution consists of combining lists and loops#
We will now see what a list is, leaving loops for later
What is a list?#
Python offers the list to manage this type of data. It is a data structure designed to store, organize, retrieve and process a sequence of data items. In practical terms, a list is a new type in Python (like int, str, …), and equivalent structures are used a lot in any programming language.
# what is a list in Python?
# quite simple: square brackets with the elements in between, separated by commas
["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"] # list of str
['HBA1',
'HBA2',
'HBB',
'HBD',
'HBE1',
'HBG1',
'HBG2',
'HBM',
'HBQ1',
'HBZ',
'MB']
A list of what?#
The last example is a list of strings; but there could be lists of any other type
# for instance a list of integers
[1, 2, 3, 4, 5, 6] # list of int
[1, 2, 3, 4, 5, 6]
# or a list of floats
[1., 2., 3., 4., 5., 6.] # list of float
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# or even a list of different types. But note that this is not common
[1, 2., "protein"] # list of different types
[1, 2.0, 'protein']
list is an iterable type#
This means that we can iterate over its elements. Just for your information, in Python there are more iterable types like tuples or strings.
Assigning a list to a variable#
As we did in the previous chapters with an int, float, str or a file object
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins)
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
len()#
Remember that len() returns the number of items in a container
# len()
print(len(globins)) # numbers of elements
11
Indexes and values#
list.index()#
It is a method that returns the index (position) of the first occurrence of a value
print(globins.index("HBA1"))
print(globins.index("HBA2"))
print(globins.index("MB"))
0
1
10
If the value is not in the list, Python raises a ValueError:
print(globins.index("NOT_IN_THE_LIST"))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_11369/4034324269.py in <module>
----> 1 print(globins.index("NOT_IN_THE_LIST"))
ValueError: 'NOT_IN_THE_LIST' is not in list
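One way to avoid this error is to check first whether the value is in the list. The sketch below uses the `in` operator and an `if`/`else` block, neither of which has been introduced in this chapter yet; the shortened list, the name `query` and the use of `None` for "not found" are assumptions made only for illustration.

```python
globins = ["HBA1", "HBA2", "HBB", "MB"]  # shortened list, for illustration

# The `in` operator answers "is this value in the list?" with True or False
print("HBB" in globins)  # True
print("HBO" in globins)  # False

# Guard the call to index() so that a missing value does not stop the program
query = "HBO"  # hypothetical gene name
if query in globins:
    position = globins.index(query)
else:
    position = None  # our own convention for "not found"
print(position)
```

With this guard, the program keeps running when the value is missing instead of stopping with a traceback.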
Accessing its elements#
print(globins[0])
HBA1
print(globins[1])
HBA2
print(globins[10])
MB
Check the number of elements of the list with len()#
print(len(globins)) # numbers of elements
11
# this is the length of the first element, NOT the length of the list
print(len(globins[0]))
4
# like in str, this will be the last element
print(globins[-1])
MB
# another way to do the same:
# Lists use array coordinates. The first element is 0
print(globins[len(globins)-1])
MB
# is it really the same?
globins[-1] == globins[len(globins)-1]
True
help(list)#
Use help() to obtain more information on lists, or on anything else you want. Its output is not written for beginners, but get used to it as soon as possible.
Extracting a sublist#
If you remember our previous lesson, this is very similar. It is like extracting a substring
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins)
# sublist
print(globins[0:3]) # array coordinates
# first considered, last not
# That is: first inclusive, last exclusive
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['HBA1', 'HBA2', 'HBB']
print(globins[1:3]) # again array coordinates not biological coordinates
['HBA2', 'HBB']
Get the whole list#
# print the whole list
print(globins[0:])
print(globins[:]) # another way, same list displayed
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
Get the whole list, but jump every…#
# print the list, but jump every 1 i.e. no jump
print(globins[0::1])
# print the list, but jump every 2
print(globins[0::2])
# print the list but jump every 3
print(globins[0::3])
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['HBA1', 'HBB', 'HBE1', 'HBG2', 'HBQ1', 'MB']
['HBA1', 'HBD', 'HBG2', 'HBZ']
Get the last part of the list#
# last part of the sequence
print(globins[-1:]) # last element
print(globins[-2:]) # last 2 elements
['MB']
['HBZ', 'MB']
Reversing strings (str)#
Revisiting what we already learned, because lists are quite similar
dna = "atgccg"
print(dna[::-1])
print(dna)
gccgta
atgccg
Reversing a list (intermediate level).#
Slicing with [::-1] returns a reversed copy; it does not change the content of the list
# too advanced, but does not harm
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins[::-1]) # a nice trick, isn't it?
print(globins)
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG2', 'HBG1', 'HBE1', 'HBD', 'HBB', 'HBA2', 'HBA1']
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
List methods#
list.reverse()#
It is a method from the class list
It reverses the list in place; that is, it modifies the list itself, and it returns None. So be careful: if we assign the output of list.reverse() to a variable, we get None, not the reversed list.
In the code comments, in place is also written in uppercase: IN PLACE
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins)
# method: reverse (beginner level)
globins.reverse() # in this case, reverse *IN PLACE*
print(globins)
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG2', 'HBG1', 'HBE1', 'HBD', 'HBB', 'HBA2', 'HBA1']
A tricky issue#
Following the very same logic: list.reverse() returns None, not the list reversed
# you would expect that globins.reverse().reverse() == globins, but
# the double reverse() does not work,
# because globins.reverse() modifies the list (globins), but returns None
#
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
#print(globins.reverse().reverse()) # this will raise an error! Do you see why?
print(globins.reverse()) # do you see it now?
print(globins)
None
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG2', 'HBG1', 'HBE1', 'HBD', 'HBB', 'HBA2', 'HBA1']
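When a reversed list is needed as a value (for printing or assigning), two alternatives return a new list and leave the original untouched. A minimal sketch with a shortened list; the built-in `reversed()` is not covered in this chapter and is shown here as an extra.

```python
globins = ["HBA1", "HBA2", "HBB", "MB"]  # shortened list, for illustration

# Option 1: slicing builds a new, reversed list
reversed_copy = globins[::-1]

# Option 2: reversed() returns an iterator; list() turns it into a list
reversed_again = list(reversed(globins))

# The pitfall: reverse() works in place and returns None
result = globins.reverse()

print(reversed_copy)   # ['MB', 'HBB', 'HBA2', 'HBA1']
print(reversed_again)  # ['MB', 'HBB', 'HBA2', 'HBA1']
print(result)          # None
print(globins)         # ['MB', 'HBB', 'HBA2', 'HBA1'] (modified in place)
```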
list.sort()#
It is a method from the class list
Sorts the list in ascending order and returns None.
The sort is in-place (i.e. the list itself is modified) and stable (i.e. the order of two equal elements is maintained).
A reverse flag can be set to sort in descending order.
globins = ["HBE1", "HBG1", "HBA2", "HBZ", "MB", "HBBD", "HBB", "HBA1", "HBM", "HBQ1"]
print(globins)
# method: sort (beginner level)
globins.sort() # ascending and in-place
print(globins)
['HBE1', 'HBG1', 'HBA2', 'HBZ', 'MB', 'HBBD', 'HBB', 'HBA1', 'HBM', 'HBQ1']
['HBA1', 'HBA2', 'HBB', 'HBBD', 'HBE1', 'HBG1', 'HBM', 'HBQ1', 'HBZ', 'MB']
Descending:#
globins = ["HBE1", "HBG1", "HBA2", "HBZ", "MB", "HBBD", "HBB", "HBA1", "HBM", "HBQ1"]
print(globins)
globins.sort(reverse=True) # descending
print(globins)
['HBE1', 'HBG1', 'HBA2', 'HBZ', 'MB', 'HBBD', 'HBB', 'HBA1', 'HBM', 'HBQ1']
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG1', 'HBE1', 'HBBD', 'HBB', 'HBA2', 'HBA1']
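If the original order must be kept, the built-in `sorted()` returns a new sorted list instead of sorting in place. The sketch below also illustrates what "stable" means, using the `key` parameter (accepted by both `sorted()` and `list.sort()`); the shortened list is an assumption made for illustration.

```python
globins = ["HBE1", "MB", "HBZ", "HBA2", "HBB"]  # shortened list, for illustration

# sorted() returns a new sorted list; the original list is not modified
ordered = sorted(globins)
print(ordered)  # ['HBA2', 'HBB', 'HBE1', 'HBZ', 'MB']
print(globins)  # ['HBE1', 'MB', 'HBZ', 'HBA2', 'HBB']

# key=len sorts by the length of each name.
# Stable: names of equal length keep their original relative order
by_length = sorted(globins, key=len)
print(by_length)  # ['MB', 'HBZ', 'HBB', 'HBE1', 'HBA2']
```

Note that 'HBZ' stays before 'HBB', and 'HBE1' before 'HBA2', exactly as in the original list.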
Numerical sort:#
In the same way, we can sort a list of int or floats instead of a list of str
numbers = [111, 11, 1, 3, 9, 2, 8, 7, 77]
print(numbers)
# method: sort (beginner level)
numbers.sort() # ascending and IN-PLACE
print(numbers) # numerically not lexicographically
[111, 11, 1, 3, 9, 2, 8, 7, 77]
[1, 2, 3, 7, 8, 9, 11, 77, 111]
numbers = ["111", "11", "1", "3", "9", "2", "8", "7", "77"] # the elements are str not int
print(numbers)
# method: sort (beginner level)
numbers.sort() # ascending and IN-PLACE
print(numbers) # then it sorts lexicographically. Not numerically!
['111', '11', '1', '3', '9', '2', '8', '7', '77']
['1', '11', '111', '2', '3', '7', '77', '8', '9']
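When the numbers arrive as str (for example, after reading them from a file), one possible fix is to tell sort() to compare the elements by their integer value through the `key` parameter. A minimal sketch; the elements remain str, only the comparison changes.

```python
numbers = ["111", "11", "1", "3", "9", "2", "8", "7", "77"]  # str, not int

# key=int: each element is converted with int() only for the comparison
numbers.sort(key=int)  # ascending and IN PLACE
print(numbers)  # ['1', '2', '3', '7', '8', '9', '11', '77', '111']
```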
list.count()#
This method returns the number of occurrences of a value
globins = ["HBA1", "HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB", "HBA1"]
print(globins)
print(str(globins.count("HBA1")))
print(str(globins.count("HBA2")))
print(str(globins.count("HBO"))) # no TV channel here.
# note: do not confuse 0 with O, different but look quite similar
['HBA1', 'HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB', 'HBA1']
3
1
0
list.append()#
This frequently used method appends an object to the end of the list (in place)
globins = ["HBA1", "HBA2"]
print(globins, len(globins))
# add the next globin at the end of the list
globins.append("HBB") # *IN PLACE*
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBA1', 'HBA2', 'HBB'] 3
# add the next globin at the end of the list
globins.append("MB") # *IN PLACE*
print(globins, len(globins))
['HBA1', 'HBA2', 'HBB', 'MB'] 4
list.pop()#
Removes and returns the item at index (default last, index=-1).
Raises IndexError if list is empty or index is out of range.
also in place!
globins = ['HBA1', 'HBA2', 'HBB', 'MB']
print(globins, len(globins))
# remove "MB".
# By default removes the last element of the list
last_globin = globins.pop() # *IN PLACE*
print(last_globin)
print(globins, len(globins)) # see that we have now one globin less
['HBA1', 'HBA2', 'HBB', 'MB'] 4
MB
['HBA1', 'HBA2', 'HBB'] 3
# Init globins
globins = ['HBA1', 'HBA2', 'HBB', 'MB']
print(globins, len(globins))
# remove "HBB". It is necessary to indicate its index: 2, or -2 in this case
#
removed_globin = globins.pop(2) # *IN PLACE*
print(removed_globin)
print(globins, len(globins)) # see that we have now one globin less
['HBA1', 'HBA2', 'HBB', 'MB'] 4
HBB
['HBA1', 'HBA2', 'MB'] 3
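The IndexError mentioned above can be seen with a list that runs out of elements. The sketch below uses a `try`/`except` block, which has not been introduced in this chapter, so take it as a preview; the name `outcome` is only illustrative.

```python
globins = ["HBA1"]

first = globins.pop()  # removes and returns 'HBA1'; the list is now empty
print(first, globins)  # HBA1 []

# Popping from an empty list raises IndexError; try/except lets us handle it
try:
    globins.pop()
    outcome = "popped"
except IndexError as error:
    outcome = "IndexError: " + str(error)
print(outcome)
```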
list.extend()#
Extends the list by appending elements from an iterable (in place)
globins = ["HBA1", "HBA2"]
print(globins, len(globins))
more_globins = ["HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(more_globins, len(more_globins))
# add the list more_globins at the end of the list globins
globins.extend(more_globins) # *IN PLACE*
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 9
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 11
Another way to do the same is concatenating: +
globins = ["HBA1", "HBA2"]
print(globins, len(globins))
more_globins = ["HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(more_globins, len(more_globins))
globins = globins + more_globins # concatenate the lists
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 9
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 11
An old trick: +=
# that is: a += 1
# instead of a = a + 1
# see the next example
globins = ["HBA1", "HBA2"]
print(globins, len(globins))
more_globins = ["HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(more_globins, len(more_globins))
globins += more_globins # instead of: globins = globins + more_globins
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 9
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 11
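A common confusion is to use append() where extend() is meant. The following sketch, with shortened lists chosen for illustration, shows the difference: append() adds its argument as one single element, while extend() adds each element of the iterable.

```python
more_globins = ["HBB", "MB"]

appended = ["HBA1", "HBA2"]
appended.append(more_globins)   # the whole list becomes ONE new element
print(appended, len(appended))  # ['HBA1', 'HBA2', ['HBB', 'MB']] 3

extended = ["HBA1", "HBA2"]
extended.extend(more_globins)   # each element is added separately
print(extended, len(extended))  # ['HBA1', 'HBA2', 'HBB', 'MB'] 4
```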
Loops#
Loops are a way to
Control the flow#
That is, how do we decide upon the order of execution of the statements within our code?
Example:
Print the elements of a list; with our current knowledge we need to write the next code (very redundant):
# Init globins
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
# print information about the elements of the list globins
print(globins[0], "is a globin")
print(globins[1], "is a globin")
print(globins[2], "is a globin")
print(globins[3], "is a globin")
print(globins[4], "is a globin")
print(globins[5], "is a globin")
print(globins[6], "is a globin")
print(globins[7], "is a globin")
print(globins[8], "is a globin")
print(globins[9], "is a globin")
print(globins[10], "is a globin")
HBA1 is a globin
HBA2 is a globin
HBB is a globin
HBD is a globin
HBE1 is a globin
HBG1 is a globin
HBG2 is a globin
HBM is a globin
HBQ1 is a globin
HBZ is a globin
MB is a globin
This is painful! Imagine a list with ~20,000 human gene names!
Solution: loops
for: iterating over a list#
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"] # Init globins
for globin in globins: # even with lists containing more than 20000 elements!
    print(globin, "is a globin") # the body of the loop!
HBA1 is a globin
HBA2 is a globin
HBB is a globin
HBD is a globin
HBE1 is a globin
HBG1 is a globin
HBG2 is a globin
HBM is a globin
HBQ1 is a globin
HBZ is a globin
MB is a globin
The body of the loop#
The body is composed of all the statements that belong to the loop (see the for above). In Python these statements are indented, usually with a tab or four spaces. This makes Python quite readable in comparison with other programming languages.
A note from experience:
To avoid headaches, do not mix tabs and spaces within your code, and set up your programming editor so that one tab equals 4 spaces.
An example with a larger body of the loop:
globins = ["HBA1", "HBA2","HBZ", "MB"] # init some globins into a list
for globin in globins: # even with lists containing more than 20000 elements!
    print(globin, "is a globin")
    print("\t" + globin + " starts with " + globin[0])
    print("\t" + globin + " ends with " + globin[-1])
    print("\t" + "The gene name has " + str(len(globin)) + " letters")
HBA1 is a globin
	HBA1 starts with H
	HBA1 ends with 1
	The gene name has 4 letters
HBA2 is a globin
	HBA2 starts with H
	HBA2 ends with 2
	The gene name has 4 letters
HBZ is a globin
	HBZ starts with H
	HBZ ends with Z
	The gene name has 3 letters
MB is a globin
	MB starts with M
	MB ends with B
	The gene name has 2 letters
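Indentation is what decides which statements belong to the body. In the sketch below (the counter variable `count` is just an illustrative name), the two indented statements run once per element, while the last print, which is not indented, runs only once, after the loop has finished.

```python
globins = ["HBA1", "HBA2", "HBZ", "MB"]
count = 0
for globin in globins:
    count = count + 1              # indented: runs once per element
    print(count, globin)           # indented: runs once per element
print("Total:", count, "globins")  # not indented: runs once, after the loop
```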
The structure of a for#
Structure:
for element in list:
for character in string:
Example:
for globin in globins:
    print(globin + " is a globin")
    print("\t" + globin + " starts with " + globin[0])
    print("\t" + globin + " ends with " + globin[-1])
    print("\t" + "Its name has " + str(len(globin)) + " letters")
You need to pay attention to:
The parts of the first line, in order: for, a variable for each element (globin), in, the iterable (globins), and the colon (:)
The indentation (a tab, “\t”, or 4 spaces) of the block of code. In this case, the block of code is the body of the loop
# for example, iterating over an str instead of a list
amino_acids = "ACDEFGHIKLMNPQRSTVWY" # an str with 20 amino acids (letters)
for aa in amino_acids: # amino_acids is a str
    print(aa, "is one of the amino acids")
print("There are", len(amino_acids), "amino_acids")
A is one of the amino acids
C is one of the amino acids
D is one of the amino acids
E is one of the amino acids
F is one of the amino acids
G is one of the amino acids
H is one of the amino acids
I is one of the amino acids
K is one of the amino acids
L is one of the amino acids
M is one of the amino acids
N is one of the amino acids
P is one of the amino acids
Q is one of the amino acids
R is one of the amino acids
S is one of the amino acids
T is one of the amino acids
V is one of the amino acids
W is one of the amino acids
Y is one of the amino acids
There are 20 amino_acids
# another for example, now iterating over a list instead of a str
amino_acids = list("ACDEFGHIKLMNPQRSTVWY") # now we have converted the str to a list
for aa in amino_acids: # amino_acids is now a list
    print(aa, "is one of the amino acids")
print("There are", len(amino_acids), "amino_acids")
A is one of the amino acids
C is one of the amino acids
D is one of the amino acids
E is one of the amino acids
F is one of the amino acids
G is one of the amino acids
H is one of the amino acids
I is one of the amino acids
K is one of the amino acids
L is one of the amino acids
M is one of the amino acids
N is one of the amino acids
P is one of the amino acids
Q is one of the amino acids
R is one of the amino acids
S is one of the amino acids
T is one of the amino acids
V is one of the amino acids
W is one of the amino acids
Y is one of the amino acids
There are 20 amino_acids
Indentation error#
A single extra whitespace in the body of the loop is enough to cause an error. See the next snippet:
globins = ["HBA1", "HBA2","HBZ", "MB"] # Init globins
for globin in globins:
    print(globin + " is a globin")
    print("\t" + globin + " starts with " + globin[0])
     print("\t" + globin + " ends with " + globin[-1]) # one extra space: this will raise an error!
    print("\t" + "Its name has " + str(len(globin)) + " letters")
# IndentationError: unexpected indent
A word of advice (again):
When indenting, avoid mixing tabs and spaces
Configure your editor to translate each tab (“\t”) into 4 spaces (search online or ask your system administrator)
More on strings#
for: iterate over characters of a str#
Reminder:
suborder = "Haplorhini"
for letter in suborder: # iterates over the whole string
    print(letter.upper()*8 + " "*4 + letter.upper()*4 + " "*4 + letter.upper()*2 + " "*4 + letter.upper()) # trick: *4
HHHHHHHH    HHHH    HH    H
AAAAAAAA    AAAA    AA    A
PPPPPPPP    PPPP    PP    P
LLLLLLLL    LLLL    LL    L
OOOOOOOO    OOOO    OO    O
RRRRRRRR    RRRR    RR    R
HHHHHHHH    HHHH    HH    H
IIIIIIII    IIII    II    I
NNNNNNNN    NNNN    NN    N
IIIIIIII    IIII    II    I
Take-home message:
We can loop with for over any iterable: list, str (and also tuples, ranges, …)
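As a brief illustration of that message, the same for syntax works with a tuple and with a range (range() is explained later in this chapter). The variable names and the tiny tuple are assumptions chosen for the example.

```python
start_codon = ("A", "T", "G")  # a tuple: like a list, but with round brackets

visited = []
for base in start_codon:       # iterating over a tuple
    visited.append(base)
for position in range(3):      # iterating over a range: 0, 1, 2
    visited.append(position)
print(visited)  # ['A', 'T', 'G', 0, 1, 2]
```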
str.split()#
It is a str method that returns a list of the “words” within a string, using sep as the word delimiter. By default, sep is any kind of whitespace; see below:
help(str.split) # for now, the examples below will teach you more
Help on method_descriptor:
split(self, /, sep=None, maxsplit=-1)
Return a list of the substrings in the string, using sep as the separator string.
sep
The separator used to split the string.
When set to None (the default value), will split on any whitespace
character (including \\n \\r \\t \\f and spaces) and will discard
empty strings from the result.
maxsplit
Maximum number of splits (starting from the left).
-1 (the default value) means no limit.
Note, str.split() is mainly useful for data that has been intentionally
delimited. With natural text that includes punctuation, consider using
the regular expression module.
Split with a default delimiter#
Whitespace as delimiter
# Taxonomy from wiki
classification = "Kingdom Phylum Class Order Suborder Infraorder Family Genus Species"
taxonomy = classification.split()
print(taxonomy)
['Kingdom', 'Phylum', 'Class', 'Order', 'Suborder', 'Infraorder', 'Family', 'Genus', 'Species']
# P.abelii taxon from wiki
classification = "Animalia Chordata Mammalia Primates Haplorhini Simiiformes Hominidae Pongo P.abelii"
abelii_taxon = classification.split()
print(abelii_taxon)
['Animalia', 'Chordata', 'Mammalia', 'Primates', 'Haplorhini', 'Simiiformes', 'Hominidae', 'Pongo', 'P.abelii']
Split with a comma as delimiter#
# Pongo abelii is a Sumatran orangutan
classification = "Kingdom,Phylum,Class,Order,Suborder,Infraorder,Family,Genus,Species"
taxonomy = classification.split(',')
print(taxonomy)
['Kingdom', 'Phylum', 'Class', 'Order', 'Suborder', 'Infraorder', 'Family', 'Genus', 'Species']
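The maxsplit parameter described in the help text limits the number of splits. The sketch below also shows `str.join()`, which is not covered in this chapter, as the reverse operation of split(); the " > " separator is an arbitrary choice for illustration.

```python
classification = "Kingdom,Phylum,Class,Order,Suborder,Infraorder,Family,Genus,Species"

# maxsplit=2: split only at the first two commas; the rest stays in one piece
first_ranks = classification.split(",", 2)
print(first_ranks)

# str.join() glues a list of str back together with the given separator
taxonomy = classification.split(",")
rebuilt = " > ".join(taxonomy)
print(rebuilt)
```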
Pay attention while coding#
The next example shows that you have to be careful with whitespace, which is sometimes difficult to see
# Pongo abelii is a Sumatran orangutan
classification = "Kingdom , Phylum , Class , Order, Suborder , Infraorder, Family , Genus, Species"
taxonomy = classification.split(',')
print(taxonomy)
['Kingdom ', ' Phylum ', ' Class ', ' Order', ' Suborder ', ' Infraorder', ' Family ', ' Genus', ' Species']
Try the next solution
At the same time we learn a new built-in function, enumerate, which provides both the count (index) and the value in each iteration
# Pongo abelii is a Sumatran orangutan
classification = "Kingdom , Phylum , Class, Order , Suborder, Infraorder , Family , Genus , Species"
taxa = classification.split(',')
print("before:\n" + str(taxa)) # + on the same types => we need to cast, str()
# eliminate the annoying whitespace
for index, taxon in enumerate(taxa): # enumerate provides (index, list[index]); that is index and value
    taxa[index] = taxon.strip() # we modify all the elements of the list
print("after:\n" + str(taxa))
before:
['Kingdom ', ' Phylum ', ' Class', ' Order ', ' Suborder', ' Infraorder ', ' Family ', ' Genus ', ' Species']
after:
['Kingdom', 'Phylum', 'Class', 'Order', 'Suborder', 'Infraorder', 'Family', 'Genus', 'Species']
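By default enumerate counts from 0, like list indexes. Its optional start parameter changes the first count, which is handy for human-readable numbering. A small sketch with a shortened list; the names `rank` and `numbered` are illustrative.

```python
taxa = ["Kingdom", "Phylum", "Class"]  # shortened list, for illustration

numbered = []
for rank, taxon in enumerate(taxa, start=1):  # counts 1, 2, 3 instead of 0, 1, 2
    numbered.append(str(rank) + ". " + taxon)
print(numbered)  # ['1. Kingdom', '2. Phylum', '3. Class']
```

Keep in mind that with start=1 the count no longer matches the index of the element in the list.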
The last example raises two points to think about:
Replacing elements by index, as in the previous for example, is safe. But adding or removing elements of a list while iterating over it is quite dangerous: elements can be skipped, and the loop can even become endless
How do we copy a list?
# the next does not copy the list
list1 = [1, 2, 3]
list2 = list1 # this is not copying the list!
# see the next statement: id provides the address of memory
print("ids @", id(list1), id(list2)) # it points to the same memory address! => we prove that it was not a copy
list1.append(4) # append one element
list2[0] = 11 # change one element
print("content:", list1, list2) # it is changed in both. Obviously, it was not a copy BUT THE SAME LIST
print("ids after modification @", id(list1), id(list2))
ids @ 140701483403968 140701483403968
content: [11, 2, 3, 4] [11, 2, 3, 4]
ids after modification @ 140701483403968 140701483403968
list.copy()#
It is a list method that returns a (shallow) copy of the list
cp_of_list = original_list.copy() # cp of the original list
See the next example:
# Now we will really copy the list
list1 = [1, 2, 3]
list2 = list1.copy() # this is copying the list!
# id provides the address of memory
print("ids @", id(list1), id(list2)) # diff. memory addresses! => a copy
list1.append(4) # append another element
list2[0] = 10 # change one element
print("content:", list1, list2) # it was a copy
print("ids after modification @", id(list1), id(list2))
ids @ 140701483463296 140701483477440
content: [1, 2, 3, 4] [10, 2, 3]
ids after modification @ 140701483463296 140701483477440
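A copy is also a simple way to remove elements safely inside a loop: iterate over the copy and modify the original. The sketch below assumes a shortened list and uses `if`, `str.startswith()` and `list.remove()`, which are not introduced in this chapter.

```python
# Risky: removing from the same list we are iterating over
genes = ["HBA1", "HBA2", "MB", "HBB"]
for gene in genes:
    if gene.startswith("HBA"):
        genes.remove(gene)  # the list shrinks and the loop skips 'HBA2'
print(genes)  # ['HBA2', 'MB', 'HBB']

# Safer: iterate over a copy, remove from the original
genes_safe = ["HBA1", "HBA2", "MB", "HBB"]
for gene in genes_safe.copy():
    if gene.startswith("HBA"):
        genes_safe.remove(gene)
print(genes_safe)  # ['MB', 'HBB']
```

In the first loop 'HBA2' survives because, after 'HBA1' is removed, every element shifts one position to the left and the loop moves on to what is now the second element, 'MB'.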
More on for#
range()#
It is another iterable type, like list, tuple and str. We use it very frequently in loops
| range(start, stop[, step]) -> range object
|
| Produces a sequence of integers from start (inclusive)
| to stop (exclusive) by step.
|
| range(i, j) produces i, i+1, i+2, ..., j-1.
|
| range(stop) -> range object
| start defaults to 0, and stop is omitted! range(4) produces 0, 1, 2, 3.
| These are exactly the valid indices for a list of 4 elements.
| When step is given, it specifies the increment (or decrement).
Let’s see it with examples!
range(start, stop)#
# using a list for iterating
numbers = [1, 2, 3, 4]
print(type(numbers))
for num in numbers:
    print(num)
<class 'list'>
1
2
3
4
# using a range without step
numbers = range(1, 5) # 1 inclusive and 5 not inclusive
print(type(numbers)) # range returns a range object
for num in numbers:
    print(num)
<class 'range'>
1
2
3
4
range(stop)#
# range from 0 to stop
for num in range(5): # from 0 till 5, but 5 not inclusive
    print(num)
0
1
2
3
4
range(start, stop, step)#
# range with step
# odd numbers till 5
for num in range(1, 6, 2): # odd numbers below 6 (step=2)
    print(num)
1
3
5
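A quick way to inspect what a range produces, without writing a loop, is to convert it to a list. The sketch below also shows a negative step (a decrement) and an empty range; the variable names are illustrative.

```python
print(list(range(5)))        # [0, 1, 2, 3, 4]
print(list(range(1, 6, 2)))  # [1, 3, 5]

# Negative step: count down from 10 (inclusive) to 0 (exclusive) in steps of 3
countdown = list(range(10, 0, -3))
print(countdown)             # [10, 7, 4, 1]

# With a positive step, start must be lower than stop; otherwise the range is empty
empty = list(range(5, 1))
print(empty)                 # []
```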
# range is very useful: obtaining the indexes of a list
# use the iterations to indent the clade as you like
taxon = ['Animalia', 'Chordata', 'Mammalia', 'Primates', 'Haplorhini', 'Simiiformes', 'Hominidae', 'Pongo', 'P.abelii']
indent = ""
for index in range(0, len(taxon)): # in each iteration adds a "\t"
    print(indent + taxon[index])
    indent = indent + "\t" # or: indent += "\t"
Animalia
	Chordata
		Mammalia
			Primates
				Haplorhini
					Simiiformes
						Hominidae
							Pongo
								P.abelii
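The same indented clade can be sketched with enumerate, seen earlier in this chapter, instead of range(len(...)). Here the indentation is built by repeating "\t" with the * trick; the shortened list and the name `depth` are assumptions for illustration.

```python
taxon = ["Animalia", "Chordata", "Mammalia"]  # shortened list, for illustration

lines = []
for depth, name in enumerate(taxon):  # depth is 0, 1, 2, ...
    lines.append("\t" * depth + name)  # one more "\t" at each level
for line in lines:
    print(line)
```

Whether to use range(len(...)) or enumerate is mostly a matter of taste; enumerate avoids indexing the list by hand.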
list comprehensions#
This is a special syntax that Python has, and it is very pythonic: it creates a list based on the elements of another iterable, for instance another list or a range. See the next example
# powers of 10
powers_of_10 = [10**i for i in range(7)] # observe this syntax and try with your own examples
print(powers_of_10)
[1, 10, 100, 1000, 10000, 100000, 1000000]
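To make the syntax less mysterious, the sketch below writes the same list with an ordinary for loop and append(), and then adds an optional `if` filter at the end of a comprehension. The filter, `str.startswith()` and the variable names are extras that go beyond the text above.

```python
# The comprehension above is equivalent to this loop
powers_loop = []
for i in range(7):
    powers_loop.append(10**i)
powers_comp = [10**i for i in range(7)]
print(powers_loop == powers_comp)  # True

# A comprehension can also filter: keep only the names starting with "HBA"
globins = ["HBA1", "HBA2", "HBB", "HBZ", "MB"]
alpha_globins = [name for name in globins if name.startswith("HBA")]
print(alpha_globins)  # ['HBA1', 'HBA2']
```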
Loops in files: read lines using for#
Now, after opening a file, you can read it line by line because the file object is an iterable type
# for instance, the result of a multiple sequence alignment
file_msa = "./files/msa.fa"
file_object = open(file_msa, "r")
for line in file_object: # every line is a str
    print(line) # be careful! Each line still ends with "\n"
file_object.close()
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
Be careful with the end of line: “\n”
Solution: use str.rstrip() (or str.strip()) to remove the “\n” at the end of each line
# solution proposed
file_msa = "./files/msa.fa"
file_object = open(file_msa, "r")
for seq in file_object:
    print(seq.rstrip()) # now it removes each end of line
file_object.close() # do not forget to close the file
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
See the file “msa.fa”
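A common alternative to calling close() by hand is the `with` statement, which closes the file automatically when its block ends. This is a sketch under stated assumptions: to stay self-contained it writes a small temporary file (the name msa_demo.fa and the two shortened sequences are made up for the example) rather than reading ./files/msa.fa.

```python
import os
import tempfile

# Create a small throw-away file so that the example runs anywhere
sequences = ["AEPNAATNYATEAM", "AEGDDP---AKAAF"]        # shortened, illustrative
path = os.path.join(tempfile.mkdtemp(), "msa_demo.fa")  # hypothetical file name
with open(path, "w") as handle:
    for seq in sequences:
        handle.write(seq + "\n")

# Read it back line by line; the file is closed when the with block ends
lines_read = []
with open(path, "r") as handle:
    for line in handle:
        lines_read.append(line.rstrip())  # remove the "\n" of each line
print(lines_read)
print(handle.closed)  # True
```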
Summary#
List:
Definition: index, value
Method: index()
print(globins.index("HBA1"))
Accessing elements
print(globins[0])
len()
print(len(globins))
Extracting a sublist
# sublist
print(globins[0:3]) # array coordinates. First inclusive, last exclusive
list.reverse()
globins.reverse() # IN PLACE
print(globins)
list.sort()
globins.sort() # ascending and IN PLACE
print(globins)
list.count()
print(str(globins.count("HBA1")))
list.append()
globins.append("HBB") # IN PLACE
list.pop()
last_globin = globins.pop() # IN PLACE
print(last_globin)
list.extend()
# more_globins is a list of globins
globins.extend(more_globins) # IN PLACE
Also concatenating lists (“+”) and an old trick (“+=”)
Copying list
list.copy()
Loops: for
iteration, for val in list:, for letter in string:
indentation: block of code, the body of the loop
Method: str.split()
classification = "Kingdom Phylum Class Order Suborder Infraorder Family Genus Species"
taxa = classification.split()
range(start, stop, step)
for num in range(1,6,2): # odd numbers below 6 (step=2)
    print(num)
list comprehensions
even_nums = [2*n for n in range(1, 6)] # obtains a list of even numbers
print(even_nums) # [2, 4, 6, 8, 10]
Reading files line by line using for