C4. Lists of elements: iterating over them#

A list#

A list stores an ordered sequence of elements: for instance, the canonical protein-coding genes of a given species, ordered by their loci within the different chromosomes.
list is another built-in data type in Python, like int.

Why are Lists useful?#

Until now we have used each variable to store a single piece of information (a string, a number). Examples:

  • The name of a programming language (e.g. “Python”)

  • A given day of the month (e.g. 11)

But what happens when the data has multiple pieces, or when we have a sequence of elements?

But, we could assign the information to individual variables. Couldn’t we?#

# One variable: one piece of information (str)
my_globin = "HBA1"

# One variable: more globins in a single str
globins = "HBA1, HBA2, HBB, HBD, HBE1, HBG1, HBG2, HBM, HBQ1, HBZ, MB" # a string.

The problem appears when we need to access a specific globin, or the one at a specific position within the string. We could even have thousands of ordered protein names.

Another example#

See the result of a multiple sequence alignment provided by Clustal X

# we can again assign a different variable per line of alignment
seq_alignment_1 = "AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS"
seq_alignment_2 = "AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS"
seq_alignment_3 = "DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS"
seq_alignment_4 = "AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS"
seq_alignment_5 = "AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS"
seq_alignment_6 = "AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS"
seq_alignment_7 = "FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS"

…Imagine with more aligned sequences. The resulting code will be a pain.

A solution consists in combining lists and loops#

We will now see what a list is, leaving loops for later

What is a list?#

Python offers the list to manage this type of data. It is a data structure designed to store, organize, retrieve and process a collection of data (information). In practical terms, a list is a new type in Python (like int, str, …), and similar structures are used heavily in most programming languages

# what is a list in Python?
# quite simple: square brackets with the elements in between, separated by commas
["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"] # list of str
['HBA1',
 'HBA2',
 'HBB',
 'HBD',
 'HBE1',
 'HBG1',
 'HBG2',
 'HBM',
 'HBQ1',
 'HBZ',
 'MB']

A list of what?#

The last example is a list of strings; but there could be lists of any other type

# for instance a list of integers
[1, 2, 3, 4, 5, 6] # list of int
[1, 2, 3, 4, 5, 6]
# or a list of floats
[1., 2., 3., 4., 5., 6.] # list of float
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
# or even a list of different types. But note that this is not common
[1, 2., "protein"] # list of different types
[1, 2.0, 'protein']

list is an iterable type#

This means that we can iterate over its elements. Just for your information, in Python there are more iterable types like tuples or strings.

Assigning a list to a variable#

As we did in the previous chapters with an int, float, str or a file object

globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins)
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']

len()#

Remember that len() provides the number of items in a provided container

# len()
print(len(globins)) # number of elements
11

Indexes and values#

list.index()#

It is a method that returns the index (position) of the first occurrence of a value in the list

print(globins.index("HBA1"))
print(globins.index("HBA2"))
print(globins.index("MB"))
0
1
10

If the value is not in the list, Python raises a ValueError:

print(globins.index("NOT_IN_THE_LIST"))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11369/4034324269.py in <module>
----> 1 print(globins.index("NOT_IN_THE_LIST"))

ValueError: 'NOT_IN_THE_LIST' is not in list
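One way to avoid this error is to test whether the value is present before asking for its index. The sketch below is only an illustration: it relies on the `in` operator and an `if` statement, which this chapter does not cover, and the value stored in `gene` is made up.

```python
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]

gene = "NOT_IN_THE_LIST"      # illustrative value that is not in the list
is_present = gene in globins  # the in operator gives True or False
print(is_present)

if is_present:
    print(globins.index(gene))  # safe: the value exists
else:
    print(gene, "is not in the list")
```

With this check, index() is called only when it cannot fail.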

Accessing its elements#

print(globins[0])
HBA1
print(globins[1])
HBA2
print(globins[10])
MB
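This is where a list pays off compared with the single comma-separated string used at the beginning of the chapter. A minimal sketch of the difference, using two shortened, illustrative variables (`globins_str` and `globins_list`):

```python
globins_str = "HBA1, HBA2, HBB"          # one str holding three names
globins_list = ["HBA1", "HBA2", "HBB"]   # a list of three str

second_char = globins_str[1]    # index 1 of a str is a single character
second_name = globins_list[1]   # index 1 of a list is a whole element

print(second_char)   # B
print(second_name)   # HBA2
```

Indexing the string gives a character, while indexing the list gives the complete gene name.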

Check the number of elements of the list with len()#

print(len(globins)) # number of elements
11
# this is the length of the first element, NOT the length of the list
print(len(globins[0]))
4
# like in str, this will be the last element
print(globins[-1])
MB
# another way to do the same:
# Lists use array coordinates. The first index is 0, so the last one is len(globins)-1
print(globins[len(globins)-1])
MB
# is it really the same?
globins[-1] == globins[len(globins)-1]
True

help(list)#

Use help() to obtain more information on lists, or on any other object. Its output is not written for beginners, but get used to it as soon as possible

Extracting a sublist#

If you remember our previous lesson, this is very similar. It is like extracting a substring

globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins)
# sublist
print(globins[0:3]) # array coordinates
                    # first considered, last not
                    # That is: first inclusive, last exclusive
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['HBA1', 'HBA2', 'HBB']
print(globins[1:3]) # again array coordinates not biological coordinates
['HBA2', 'HBB']

Get the whole list#

# print the whole list
print(globins[0:])
print(globins[:]) # another way, same list displayed
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']

Get the whole list, but jump every…#

# print the list, but jump every 1 i.e. no jump
print(globins[0::1])

# print the list, but jump every 2
print(globins[0::2])

# print the list but jump every 3
print(globins[0::3]) 
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['HBA1', 'HBB', 'HBE1', 'HBG2', 'HBQ1', 'MB']
['HBA1', 'HBD', 'HBG2', 'HBZ']

Get the last part of the list#

# last part of the sequence
print(globins[-1:]) # last element
print(globins[-2:]) # last 2 elements
['MB']
['HBZ', 'MB']
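All the slices above follow the general form list[start:stop:step]. The following sketch, with a shortened list and an illustrative variable named `middle`, combines the three values and also shows that a slice is a new list: changing it leaves the original untouched.

```python
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1"]

middle = globins[1:4:2]   # start=1 (inclusive), stop=4 (exclusive), step=2
print(middle)             # ['HBA2', 'HBD']

middle[0] = "CHANGED"     # modify the sublist only
print(middle)             # ['CHANGED', 'HBD']
print(globins)            # the original list keeps 'HBA2'
```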

Reversing strings (str)#

Revisiting what we already learned, because lists are quite similar

dna = "atgccg"
print(dna[::-1])
print(dna)
gccgta
atgccg

Reversing a list (intermediate level).#

It does not change the content of the list

# too advanced, but does not harm
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins[::-1]) # a nice trick, isn't it?
print(globins)
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG2', 'HBG1', 'HBE1', 'HBD', 'HBB', 'HBA2', 'HBA1']
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']

List methods#

list.reverse()#

It is a method from the class list

In this case it reverses a list in place: it modifies the list itself and returns None. So we have to be careful: we cannot assign the output of list.reverse() and expect it to be the reversed list

in-place (in uppercase too: IN PLACE)

globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(globins)
# method: reverse (beginner level)
globins.reverse() # in this case, reverse *IN PLACE*
print(globins)
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB']
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG2', 'HBG1', 'HBE1', 'HBD', 'HBB', 'HBA2', 'HBA1']
A tricky issue#

Following the very same logic: list.reverse() returns None, not the list reversed

# you would expect that globins.reverse().reverse() == globins, but
# the double reverse() does not work, 
# because globins.reverse() modifies the list (globins), but returns None
#
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
#print(globins.reverse().reverse()) # this will raise an error! Do you see why?
print(globins.reverse())            # do you see it now?
print(globins)
None
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG2', 'HBG1', 'HBE1', 'HBD', 'HBB', 'HBA2', 'HBA1']
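When a reversed list is needed as a value (to assign it or print it) while keeping the original order, a reversed copy can be built instead. A small sketch with a shortened list; the slice was shown earlier, while the built-in reversed() is an addition that this chapter does not cover:

```python
globins = ["HBA1", "HBA2", "HBB"]

reversed_copy = globins[::-1]             # slicing builds a new, reversed list
also_reversed = list(reversed(globins))   # reversed() gives an iterator; list() turns it into a list

print(reversed_copy)   # ['HBB', 'HBA2', 'HBA1']
print(also_reversed)   # ['HBB', 'HBA2', 'HBA1']
print(globins)         # ['HBA1', 'HBA2', 'HBB'], unchanged
```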

list.sort()#

It is a method from the class list

  • Sorts the list in ascending order and returns None.

  • The sort is in-place (i.e. the list itself is modified) and stable (i.e. the order of two equal elements is maintained).

  • A reverse flag can be set to sort in descending order.

globins = ["HBE1", "HBG1", "HBA2", "HBZ", "MB", "HBBD", "HBB", "HBA1", "HBM", "HBQ1"]
print(globins)
# method: sort (beginner level)
globins.sort() # ascending and in-place
print(globins)
['HBE1', 'HBG1', 'HBA2', 'HBZ', 'MB', 'HBBD', 'HBB', 'HBA1', 'HBM', 'HBQ1']
['HBA1', 'HBA2', 'HBB', 'HBBD', 'HBE1', 'HBG1', 'HBM', 'HBQ1', 'HBZ', 'MB']
Descending:#
globins = ["HBE1", "HBG1", "HBA2", "HBZ", "MB", "HBBD", "HBB", "HBA1", "HBM", "HBQ1"]
print(globins)

globins.sort(reverse=True) # descending
print(globins)
['HBE1', 'HBG1', 'HBA2', 'HBZ', 'MB', 'HBBD', 'HBB', 'HBA1', 'HBM', 'HBQ1']
['MB', 'HBZ', 'HBQ1', 'HBM', 'HBG1', 'HBE1', 'HBBD', 'HBB', 'HBA2', 'HBA1']
Numerical sort:#

In the same way, we can sort a list of int or floats instead of a list of str

numbers = [111, 11, 1, 3, 9, 2, 8, 7, 77]
print(numbers)

# method: sort (beginner level)
numbers.sort() # ascending and IN-PLACE
print(numbers) # numerically not lexicographically
[111, 11, 1, 3, 9, 2, 8, 7, 77]
[1, 2, 3, 7, 8, 9, 11, 77, 111]
numbers = ["111", "11", "1", "3", "9", "2", "8", "7", "77"] # the elements are str not int
print(numbers) 
# method: sort (beginner level)
numbers.sort() # ascending and IN-PLACE
print(numbers) # then it sorts lexicographically. Not numerically!
['111', '11', '1', '3', '9', '2', '8', '7', '77']
['1', '11', '111', '2', '3', '7', '77', '8', '9']
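If the numbers must stay as str but still be ordered by their numeric value, list.sort() accepts a key argument: a function applied to each element before comparing. The key argument is not covered above; the sketch passes the built-in int as that function.

```python
numbers = ["111", "11", "1", "3", "9", "2", "8", "7", "77"]  # the elements are str

numbers.sort(key=int)   # compare by integer value; still in place, elements remain str
print(numbers)          # ['1', '2', '3', '7', '8', '9', '11', '77', '111']
```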

list.count()#

This method returns the number of occurrences of a value

globins = ["HBA1", "HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB", "HBA1"]
print(globins)
print(str(globins.count("HBA1"))) 
print(str(globins.count("HBA2"))) 
print(str(globins.count("HBO"))) # no TV channel here.
                                 # note: do not confuse 0 with O, different but look quite similar
['HBA1', 'HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB', 'HBA1']
3
1
0

list.append()#

This method is used very frequently: it appends an object to the end of the list (in place)

globins = ["HBA1", "HBA2"]
print(globins, len(globins))

# add the next globin at the end of the list
globins.append("HBB") # *IN PLACE*
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBA1', 'HBA2', 'HBB'] 3
# add the next globin at the end of the list
globins.append("MB") # *IN PLACE*
print(globins, len(globins))
['HBA1', 'HBA2', 'HBB', 'MB'] 4

list.pop()#

Removes and returns the item at index (default last, index=-1).
Raises IndexError if list is empty or index is out of range.
also in place!

globins = ['HBA1', 'HBA2', 'HBB', 'MB']
print(globins, len(globins))

# remove "MB".
# By default removes the last element of the list
last_globin = globins.pop() # *IN PLACE*
print(last_globin)
print(globins, len(globins)) # see that we have now one globin less
['HBA1', 'HBA2', 'HBB', 'MB'] 4
MB
['HBA1', 'HBA2', 'HBB'] 3
# Init globins
globins = ['HBA1', 'HBA2', 'HBB', 'MB']
print(globins, len(globins))
# remove "HBB". It is necessary to indicate its index: 2, or -2 in this case
# 
removed_globin = globins.pop(2) # *IN PLACE*
print(removed_globin)
print(globins, len(globins)) # see that we have now one globin less
['HBA1', 'HBA2', 'HBB', 'MB'] 4
HBB
['HBA1', 'HBA2', 'MB'] 3
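The IndexError mentioned above appears as soon as pop() is called on an empty list. The sketch below catches it with try/except, a construct this chapter does not cover, so take it as an illustration only; `error_message` is an illustrative name.

```python
globins = ["HBA1"]
print(globins.pop())   # HBA1
print(globins)         # [] : the list is now empty

error_message = ""
try:
    globins.pop()      # nothing left to remove
except IndexError as error:
    error_message = str(error)

print("IndexError:", error_message)
```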

list.extend()#

Extends the list by appending elements from an iterable (in place)

globins = ["HBA1", "HBA2"]
print(globins, len(globins))
more_globins = ["HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(more_globins, len(more_globins))

# add the list more_globins at the end of the list globins
globins.extend(more_globins) # *IN PLACE*
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 9
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 11

Another way to do the same is concatenating: +

globins = ["HBA1", "HBA2"]
print(globins, len(globins))
more_globins = ["HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(more_globins, len(more_globins))
globins = globins + more_globins # concatenate the lists
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 9
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 11

An old trick: +=

# that is: a += 1
# instead of a = a + 1
# see the next example
globins = ["HBA1", "HBA2"]
print(globins, len(globins))
more_globins = ["HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
print(more_globins, len(more_globins))
globins += more_globins # instead of: globins = globins + more_globins
print(globins, len(globins))
['HBA1', 'HBA2'] 2
['HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 9
['HBA1', 'HBA2', 'HBB', 'HBD', 'HBE1', 'HBG1', 'HBG2', 'HBM', 'HBQ1', 'HBZ', 'MB'] 11
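A common confusion is to use append() where extend() was intended. A short sketch of the difference, with illustrative variable names (`appended`, `extended`):

```python
more_globins = ["HBB", "HBD"]

appended = ["HBA1", "HBA2"]
appended.append(more_globins)     # adds ONE element: the whole list
print(appended, len(appended))    # ['HBA1', 'HBA2', ['HBB', 'HBD']] 3

extended = ["HBA1", "HBA2"]
extended.extend(more_globins)     # adds each element of more_globins
print(extended, len(extended))    # ['HBA1', 'HBA2', 'HBB', 'HBD'] 4
```

append() nests the second list inside the first one, whereas extend() adds its elements one by one.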

Loops#

Loops are a way to

Control the flow#

That is, how do we decide upon the order of execution of the statements within our code?

Example:
Print the elements of a list; with our current knowledge we need to write the next code (very redundant):

# Init globins
globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"]
# print information about the elements of the list globins
print(globins[0], "is a globin")
print(globins[1], "is a globin")
print(globins[2], "is a globin")
print(globins[3], "is a globin")
print(globins[4], "is a globin")
print(globins[5], "is a globin")
print(globins[6], "is a globin")
print(globins[7], "is a globin")
print(globins[8], "is a globin")
print(globins[9], "is a globin")
print(globins[10], "is a globin")
HBA1 is a globin
HBA2 is a globin
HBB is a globin
HBD is a globin
HBE1 is a globin
HBG1 is a globin
HBG2 is a globin
HBM is a globin
HBQ1 is a globin
HBZ is a globin
MB is a globin

This is painful! Imagine a list with ~20,000 human gene names!
Solution: loops

for: iterating over a list#

globins = ["HBA1", "HBA2", "HBB", "HBD", "HBE1", "HBG1", "HBG2", "HBM", "HBQ1", "HBZ", "MB"] # Init globins
for globin in globins: # even with lists containing more than 20000 elements!
    print(globin, "is a globin") # the body of the loop!
HBA1 is a globin
HBA2 is a globin
HBB is a globin
HBD is a globin
HBE1 is a globin
HBG1 is a globin
HBG2 is a globin
HBM is a globin
HBQ1 is a globin
HBZ is a globin
MB is a globin

several iterations

The body of the loop#

The body is composed of all the statements that belong to the loop (for, see above). In Python these statements are indented (with a tab or with spaces). This makes Python quite readable in comparison with other programming languages.
A note from experience:
to avoid headaches, do not mix tabs and spaces within your code, and set up your programming editor so that a tab equals 4 spaces.

An example with a larger body of the loop:

globins = ["HBA1", "HBA2","HBZ", "MB"] # init some globins into a list
for globin in globins: # even with lists containing more than 20000 elements!
    print(globin, "is a globin")
    print("\t" + globin + " starts with " + globin[0])
    print("\t" + globin + " ends with " + globin[-1])
    print("\t" + "The gene name has " + str(len(globin)) + " letters")
HBA1 is a globin
	HBA1 starts with H
	HBA1 ends with 1
	The gene name has 4 letters
HBA2 is a globin
	HBA2 starts with H
	HBA2 ends with 2
	The gene name has 4 letters
HBZ is a globin
	HBZ starts with H
	HBZ ends with Z
	The gene name has 3 letters
MB is a globin
	MB starts with M
	MB ends with B
	The gene name has 2 letters
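The body of a loop can also build up a result instead of only printing. A minimal sketch that combines for with list.append() and +=, both seen above; `name_lengths` and `total_letters` are illustrative names.

```python
globins = ["HBA1", "HBA2", "HBZ", "MB"]

name_lengths = []    # starts empty and grows by one element per iteration
total_letters = 0
for globin in globins:
    name_lengths.append(len(globin))
    total_letters += len(globin)

print(name_lengths)    # [4, 4, 3, 2]
print(total_letters)   # 13
```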

The structure of a for#

Structure:

  • for element in list:

  • for character in string:

Example:

for globin in globins:
    print(globin + " is a globin")
    print("\t" + globin + " starts with " + globin[0])
    print("\t" + globin + " ends with " + globin[-1])
    print("\t" + "Its name has " + str(len(globin)) + " letters")

You need to pay attention to:

  • for, variable for each element (globin), in, iterable variable (globins), the colon (:)

  • The indentation (a tab, “\t”, or 4 spaces) of the block of code. In this case, the block of code is the body of the loop

# for example, iterating over an str instead of a list
amino_acids = "ACDEFGHIKLMNPQRSTVWY" # an str with 20 amino acids (letters)
for aa in amino_acids: # amino_acids is a str
    print(aa, "is one of the amino acids")
print("There are", len(amino_acids), "amino_acids")
A is one of the amino acids
C is one of the amino acids
D is one of the amino acids
E is one of the amino acids
F is one of the amino acids
G is one of the amino acids
H is one of the amino acids
I is one of the amino acids
K is one of the amino acids
L is one of the amino acids
M is one of the amino acids
N is one of the amino acids
P is one of the amino acids
Q is one of the amino acids
R is one of the amino acids
S is one of the amino acids
T is one of the amino acids
V is one of the amino acids
W is one of the amino acids
Y is one of the amino acids
There are 20 amino_acids
# for example, iterating over a list instead of a str
amino_acids = list("ACDEFGHIKLMNPQRSTVWY") # now we have converted the str to a list
for aa in amino_acids: # amino_acids is now a list
    print(aa, "is one of the amino acids")
print("There are", len(amino_acids), "amino_acids")
A is one of the amino acids
C is one of the amino acids
D is one of the amino acids
E is one of the amino acids
F is one of the amino acids
G is one of the amino acids
H is one of the amino acids
I is one of the amino acids
K is one of the amino acids
L is one of the amino acids
M is one of the amino acids
N is one of the amino acids
P is one of the amino acids
Q is one of the amino acids
R is one of the amino acids
S is one of the amino acids
T is one of the amino acids
V is one of the amino acids
W is one of the amino acids
Y is one of the amino acids
There are 20 amino_acids

Indentation error#

A single extra whitespace in the body of the loop is enough to raise an error. See the next snippet:

globins = ["HBA1", "HBA2","HBZ", "MB"] # Init globins
for globin in globins:
    print(globin + " is a globin")
    print("\t" + globin + " starts with " + globin[0])
     print("\t" + globin + " ends with " + globin[-1]) # This will raise an error!
    print("\t" + "Its name has " + str(len(globin)) + " letters")

# IndentationError: unexpected indent

Note of advice
(again)

  • When indenting, avoid mixing tabs and spaces

  • Configure your editor to translate each tab (“\t”) into 4 spaces (google it or talk to your system administrator)

More on strings#

for: iterate over characters of a str#

Remember:

suborder="Haplorhini"
for letter in suborder: # iterates all over the string
    print(letter.upper()*8 + " "*4 + letter.upper()*4  + " "*4 + letter.upper()*2  + " "*4 + letter.upper()) # trick: *4 
HHHHHHHH    HHHH    HH    H
AAAAAAAA    AAAA    AA    A
PPPPPPPP    PPPP    PP    P
LLLLLLLL    LLLL    LL    L
OOOOOOOO    OOOO    OO    O
RRRRRRRR    RRRR    RR    R
HHHHHHHH    HHHH    HH    H
IIIIIIII    IIII    II    I
NNNNNNNN    NNNN    NN    N
IIIIIIII    IIII    II    I

Take-home message:

We can loop with for over any iterable: list, str (tuples, ranges, …)

str.split()#

It is a str method that returns a list of the “words” within a string, using sep as the word delimiter. The default value of sep is any kind of whitespace, see below:

help(str.split) # learn better by now with examples
Help on method_descriptor:

split(self, /, sep=None, maxsplit=-1)
    Return a list of the substrings in the string, using sep as the separator string.
    
      sep
        The separator used to split the string.
    
        When set to None (the default value), will split on any whitespace
        character (including \\n \\r \\t \\f and spaces) and will discard
        empty strings from the result.
      maxsplit
        Maximum number of splits (starting from the left).
        -1 (the default value) means no limit.
    
    Note, str.split() is mainly useful for data that has been intentionally
    delimited.  With natural text that includes punctuation, consider using
    the regular expression module.
Split with a default delimiter#

Whitespace as delimiter

# Taxonomy from wiki
classification = "Kingdom Phylum Class Order Suborder Infraorder Family Genus Species"
taxonomy = classification.split()
print(taxonomy)
['Kingdom', 'Phylum', 'Class', 'Order', 'Suborder', 'Infraorder', 'Family', 'Genus', 'Species']
# P.abelii taxon from wiki
classification = "Animalia Chordata Mammalia Primates Haplorhini Simiiformes Hominidae Pongo P.abelii"
abelii_taxon = classification.split()
print(abelii_taxon)
['Animalia', 'Chordata', 'Mammalia', 'Primates', 'Haplorhini', 'Simiiformes', 'Hominidae', 'Pongo', 'P.abelii']
Split with a comma as delimiter#
# Pongo abelii is a Sumatran orangutan
classification = "Kingdom,Phylum,Class,Order,Suborder,Infraorder,Family,Genus,Species"
taxonomy = classification.split(',')
print(taxonomy)
['Kingdom', 'Phylum', 'Class', 'Order', 'Suborder', 'Infraorder', 'Family', 'Genus', 'Species']
Pay attention while coding#

The next example shows that you have to be careful with whitespace, which is sometimes difficult to see

# Pongo abelii is a Sumatran orangutan
classification = "Kingdom  , Phylum , Class , Order, Suborder ,  Infraorder, Family , Genus, Species"
taxonomy = classification.split(',')
print(taxonomy)
['Kingdom  ', ' Phylum ', ' Class ', ' Order', ' Suborder ', '  Infraorder', ' Family ', ' Genus', ' Species']

Try the next solution.
At the same time we learn a new built-in function, enumerate, which provides the count and the value at each iteration

# Pongo abelii is a Sumatran orangutan
classification = "Kingdom  , Phylum  , Class, Order , Suborder, Infraorder , Family , Genus   , Species"
taxa = classification.split(',')
print("before:\n" + str(taxa)) # + on the same types => we need to cast, str()

# eliminate the annoying whitespace
for index, taxon in enumerate(taxa): # enumerate provides (index, list[index]); that is index and value
    taxa[index] = taxon.strip() # we modify all the elements of the list
print("after:\n" + str(taxa))
before:
['Kingdom  ', ' Phylum  ', ' Class', ' Order ', ' Suborder', ' Infraorder ', ' Family ', ' Genus   ', ' Species']
after:
['Kingdom', 'Phylum', 'Class', 'Order', 'Suborder', 'Infraorder', 'Family', 'Genus', 'Species']
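enumerate also accepts an optional start argument, which is handy when the count should begin at 1 rather than 0. A short sketch with a shortened list; `numbered` and `position` are illustrative names.

```python
taxa = ["Kingdom", "Phylum", "Class", "Order"]

numbered = []
for position, taxon in enumerate(taxa, start=1):   # count from 1 instead of 0
    numbered.append(str(position) + ". " + taxon)

print(numbered)   # ['1. Kingdom', '2. Phylum', '3. Class', '4. Order']
```

Note that with start=1 the count is no longer a valid index for the last element, so this form is meant for display, not for assigning back with taxa[position].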

The last example raises an important point.
Replacing the elements of a list by index while iterating over it, as in the previous for example, is safe. Adding or removing elements while iterating over the same list is dangerous: elements can be skipped, and the loop can even become endless

How do we copy a list?

# the next does not copy the list
list1 = [1, 2, 3]
list2 = list1 # this is not copying the list!

# see the next statement: id provides the address of memory
print("ids @", id(list1), id(list2)) # it points to the same memory address! => we prove that it was not a copy
list1.append(4) # append one element 
list2[0] = 11   # change one element
print("content:", list1, list2) # it is changed in both. Obviously, it was not a copy BUT THE SAME LIST
print("ids after modification @", id(list1), id(list2))
ids @ 140312491469632 140312491469632
content: [11, 2, 3, 4] [11, 2, 3, 4]
ids after modification @ 140312491469632 140312491469632

list.copy()#

It is a method that returns a copy of the list

cp_of_list = original_list.copy() # cp of the original list

See the next example:

# Now we will really copy the list
list1 = [1, 2, 3]
list2 = list1.copy() # this is copying the list!

# id provides the address of memory
print("ids @", id(list1), id(list2)) # diff. memory addresses! => a copy
list1.append(4) # append another element
list2[0] = 10 # change one element
print("content:", list1, list2) # it was a copy
print("ids after modification @", id(list1), id(list2))
ids @ 140312491325440 140312491530048
content: [1, 2, 3, 4] [10, 2, 3]
ids after modification @ 140312491325440 140312491530048
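A copy is also a way around the danger of changing a list while looping over it. The sketch below is an illustration under some assumptions: it uses an `if` statement and list.remove(), neither of which is covered in this chapter, and the lists are made up.

```python
# Removing while iterating over the same list: one element is skipped
numbers = [1, 2, 2, 3]
for n in numbers:
    if n == 2:
        numbers.remove(n)   # list.remove() deletes the first matching value
print(numbers)              # [1, 2, 3] : one 2 survived

# Iterating over a copy while removing from the original
numbers_safe = [1, 2, 2, 3]
for n in numbers_safe.copy():
    if n == 2:
        numbers_safe.remove(n)
print(numbers_safe)         # [1, 3]
```

In the first loop the list shrinks under the iteration, so the second 2 is never visited; in the second loop the iteration runs over an independent copy.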

More on for#

range()#

It is another iterable type, like list, tuple and str. We use it very frequently in loops

 |  range(start, stop[, step]) -> range object
 |  
 |  Produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.
 |
 |  range(i, j) produces i, i+1, i+2, ..., j-1.
 |
 |  range(stop) -> range object
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).

Let’s see it with examples!

range(start, stop)#

# using a list for iterating
numbers = [1, 2, 3, 4]
print(type(numbers))
for num in numbers:
    print(num)
<class 'list'>
1
2
3
4
# using a range without step
numbers = range(1, 5) # 1 inclusive and 5 not inclusive
print(type(numbers)) # range returns a range object
for num in numbers:
    print(num)
<class 'range'>
1
2
3
4

range(stop)#

# range from 0 to stop 
for num in range(5): # from 0 till 5, but 5 not inclusive
    print(num)
0
1
2
3
4

range(start, stop, step)#

# range with step
# odd numbers till 5
for num in range(1, 6, 2): # odd number till 6 (step=2)
    print(num)
1
3
5
# range is very useful: obtaining the indexes of a list
# use the iterations to indent the clade as you like
taxon = ['Animalia', 'Chordata', 'Mammalia', 'Primates', 'Haplorhini', 'Simiiformes', 'Hominidae', 'Pongo', 'P.abelii']
indent = ""
for index in range(0, len(taxon)): # in each iteration adds a "\t"
    print(indent + taxon[index])
    indent = indent + "\t"  # or: indent += "\t"
Animalia
	Chordata
		Mammalia
			Primates
				Haplorhini
					Simiiformes
						Hominidae
							Pongo
								P.abelii
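As the documentation of range says, the step can also be a decrement. A small sketch with a negative step, walking a shortened version of the taxon list from its last index down to 0; `countdown` and `visited` are illustrative names.

```python
countdown = list(range(5, 0, -1))   # start=5, stop=0 (exclusive), step=-1
print(countdown)                    # [5, 4, 3, 2, 1]

taxon = ["Animalia", "Chordata", "Mammalia"]
visited = []
for index in range(len(taxon) - 1, -1, -1):   # indexes 2, 1, 0
    visited.append(taxon[index])
    print(index, taxon[index])
```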

list comprehensions#

This is a special syntax that Python has; it is very pythonic: it creates a list based on the elements of another iterable, for instance another list or a range. See the next example

# powers of 10
powers_of_10 = [10**i for i in range(7)] # observe this syntax and try with your own examples
print(powers_of_10)
[1, 10, 100, 1000, 10000, 100000, 1000000]
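It may help to read a list comprehension as a compact form of a for loop that appends to an empty list. The sketch below writes the powers of 10 both ways and then reuses the idea to strip whitespace from a few illustrative strings (`raw_taxa` is a made-up list).

```python
# with an explicit loop
powers_loop = []
for i in range(7):
    powers_loop.append(10**i)

# the same result with a list comprehension
powers_comp = [10**i for i in range(7)]
print(powers_loop == powers_comp)   # True

# a comprehension can also transform the elements of an existing list
raw_taxa = ["Kingdom  ", " Phylum ", " Class"]
clean_taxa = [taxon.strip() for taxon in raw_taxa]
print(clean_taxa)                   # ['Kingdom', 'Phylum', 'Class']
```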

Loops in files: read lines using for#

Now, after opening a file, you can read it line by line because the file object is an iterable type

# for instance, the result of a multiple sequence alignment
file_msa = "./files/msa.fa"
file_object = open(file_msa, "r")
for line in file_object: # every line is an str
    print(line) # be careful! Each line still ends with "\n", and print() adds another one
file_object.close()
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS

AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS

DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS

AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS

AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS

AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS

FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS

Be careful with the end of line: “\n”
Solution: use str.rstrip() (or str.strip()) to remove each “\n” (end of line)

# solution proposed
file_msa = "./files/msa.fa"
file_object = open(file_msa, "r")
for seq in file_object:
    print(seq.rstrip()) # Now, it removes each end of line
file_object.close() # do not forget to close the file
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
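Reading line by line combines naturally with lists: each cleaned line can be appended to a list for later use. The sketch below is self-contained, so it first writes a small temporary file; the file name `msa_example.fa` and the shortened sequences are made up for the illustration. It also uses the with statement, which this chapter does not cover and which closes the file automatically.

```python
import os
import tempfile

# Hypothetical file, created here only so that the example runs on its own
sequences = ["AEPNAATNYA", "AEGDDP---A", "FAADDATSQA"]
path = os.path.join(tempfile.mkdtemp(), "msa_example.fa")
with open(path, "w") as file_object:
    for seq in sequences:
        file_object.write(seq + "\n")

# Read it back line by line, storing each cleaned line in a list
alignment = []
with open(path, "r") as file_object:   # the file is closed when the block ends
    for line in file_object:
        alignment.append(line.rstrip())

print(alignment)
print(len(alignment), "sequences read")
```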

Summary#

  • List:

    • Definition: index, value

    • Method: index()

    print(globins.index("HBA1"))
    
    • Accessing elements

    print(globins[0])
    
    • len()

    print(len(globins))
    
    • Extracting a sublist

    # sublist
    print(globins[0:3]) # array coordinates. First inclusive, last exclusive
    
    • list.reverse()

    globins.reverse() # IN PLACE
    print(globins)
    
    • list.sort()

    globins.sort() # ascending and IN PLACE
    print(globins)
    
    • list.count()

    print(str(globins.count("HBA1"))) 
    
    • list.append()

    globins.append("HBB") # IN PLACE
    
    • list.pop()

    last_globin=globins.pop() # IN PLACE
    print(last_globin)
    
    • list.extend()

    # more_globins is a list of globins
    globins.extend(more_globins) # IN PLACE
    

    Also concatenating lists (“+”) and an old trick (“+=”)

    • Copying list

    list.copy()
    
  • Loops: for

    • iteration, for val in list:, for letter in string:

    • indentation: block of code, the body of the loop

    • Method: str.split()

    classification = "Kingdom Phylum Class Order Suborder Infraorder Family Genus Species"
    taxa = classification.split()
    
    • range(start, stop, step)

    for num in range(1,6,2): # odd number till 6 (step=2)
        print(num)
    
    • list comprehensions

    even_nums = [2*n for n in range(1, 6)] # obtains a list of even numbers
    print(even_nums) # [2, 4, 6, 8, 10]
    
  • Reading files line by line using for


Exercises#