C8. Dictionaries: a very powerful data structure#

What are dictionaries?#

Dictionaries are data structures, but what kind? Think of a language dictionary: it contains words (the keys) and their corresponding meanings (the values). At school, children learn how to find a word quickly, without reading through thousands of other words to reach the one they are interested in. Python dictionaries work in the same way: they store keys, each with its corresponding value. A dictionary can hold many keys (e.g. millions), and looking up a key is very fast because dictionaries are built on what computer science calls a hash table (also known as a hash map). Note that all the keys must be different.

Storing biological data in pairs#

In biology it is very common to have information in pairs. For instance:

  • Species and their corresponding taxa

  • Gene names and their corresponding gene id in a repository

  • Genes and their corresponding genomic loci

  • Genes and their sequences

  • Proteins and their 3-D structure PDB id

  • Restriction enzymes and their motifs

  • and so on…

Let’s see an example#

Imagine we want to store the number of times the different nucleotides (nt) are present within a DNA sequence:

# single nt: There are 4 possibilities: 4
dna = "GAGGTTACCGCCTACGATTGGGAATTA" # a short DNA seq; it could be a long one (e.g. 100000 nt)
count_a = dna.count("A") # we studied str.count() as an str method
count_t = dna.count("T")
count_g = dna.count("G")
count_c = dna.count("C")

Now, the number of times the different dinucleotides (2-nt, that is motifs of length two) are in a sequence:

# 2-nts: There are 16 possibilities: 4*4 = 4**2 = 16
dna = "GAGGTTACCGCCTACGATTGGGAATTA"
count_aa = dna.count("AA")
count_at = dna.count("AT") # order matters: "AT" is different than "TA"
count_ag = dna.count("AG")
count_ac = dna.count("AC")
# and so on...

There are even more possible trinucleotide (3-nt) combinations:

# 3-nts: There are 64 possibilities: 4*4*4 = 4**3 = 64
dna = "GAGGTTACCGCCTACGATTGGGAATTA"
count_aaa = dna.count("AAA")
count_aat = dna.count("AAT")
count_aag = dna.count("AAG")
count_aac = dna.count("AAC")
count_ata = dna.count("ATA")
count_att = dna.count("ATT")
count_atg = dna.count("ATG")
count_atc = dna.count("ATC")
# and so on...
#
# in this case, the DNA sequence is very short and the count of most of the trinucleotides is zero

The same idea extends to longer motifs of k nucleotides (k-nts); in bioinformatics these are called k-mers.
For k-nts, there are $4^{k}$ possibilities, for any value of k = 1, 2, 3, 4, 5, …

Thus, for motifs of 5 nt we would need $4^{5} = 1024$ variables. Using one variable per count is therefore not practical at all!
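To get a feeling for how quickly the number of k-mers grows, the following sketch enumerates them with itertools.product from the standard library. This tool is not used elsewhere in this chapter, and the helper name all_kmers is only illustrative.

```python
from itertools import product

def all_kmers(k, alphabet="ATGC"):
    """Return every possible k-mer over the alphabet, as a list of strings."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

for k in range(1, 6):
    kmers = all_kmers(k)
    print(k, len(kmers))  # len(kmers) equals 4**k
```

For k = 5 the list already holds 1024 motifs, which is why one variable per motif does not scale.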

With our current knowledge we can think of a solution to store the information: using two lists, one for keys and one for values

# The 3-nt can be in any position.
dna = "GAGGTTACCGCCTACGATTGGGAATTA"
tri_nts = []
count_tri_nts = [] 
nts = ['A','T','G','C']
for nt1 in nts:
    for nt2 in nts:
        for nt3 in nts:
            tri_nts.append(nt1+nt2+nt3) # list of keys
            count_tri_nts.append(dna.count(nt1+nt2+nt3)) # list of values
print("tri_nts:\n", tri_nts)
print("count_tri_nts:\n", count_tri_nts)
# Each index in the two lists is devoted to a particular 3-nt; think of codon usage
tri_nts:
 ['AAA', 'AAT', 'AAG', 'AAC', 'ATA', 'ATT', 'ATG', 'ATC', 'AGA', 'AGT', 'AGG', 'AGC', 'ACA', 'ACT', 'ACG', 'ACC', 'TAA', 'TAT', 'TAG', 'TAC', 'TTA', 'TTT', 'TTG', 'TTC', 'TGA', 'TGT', 'TGG', 'TGC', 'TCA', 'TCT', 'TCG', 'TCC', 'GAA', 'GAT', 'GAG', 'GAC', 'GTA', 'GTT', 'GTG', 'GTC', 'GGA', 'GGT', 'GGG', 'GGC', 'GCA', 'GCT', 'GCG', 'GCC', 'CAA', 'CAT', 'CAG', 'CAC', 'CTA', 'CTT', 'CTG', 'CTC', 'CGA', 'CGT', 'CGG', 'CGC', 'CCA', 'CCT', 'CCG', 'CCC']
count_tri_nts:
 [0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 2, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0]
# and now print the trinucleotide's counts (when counts > 0)
for index, tri_nt in enumerate(tri_nts): # enumerate(list) -> returns an enumerate object: index and value 
    if count_tri_nts[index] > 0:
        print(f"There are {count_tri_nts[index]:d} {tri_nt:s} in the DNA sequence") # New format for strings
There are 1 AAT in the DNA sequence
There are 2 ATT in the DNA sequence
There are 1 AGG in the DNA sequence
There are 1 ACG in the DNA sequence
There are 1 ACC in the DNA sequence
There are 2 TAC in the DNA sequence
There are 2 TTA in the DNA sequence
There are 1 TTG in the DNA sequence
There are 1 TGG in the DNA sequence
There are 1 GAA in the DNA sequence
There are 1 GAT in the DNA sequence
There are 1 GAG in the DNA sequence
There are 1 GTT in the DNA sequence
There are 1 GGA in the DNA sequence
There are 1 GGT in the DNA sequence
There are 1 GGG in the DNA sequence
There are 1 GCC in the DNA sequence
There are 1 CTA in the DNA sequence
There are 1 CGA in the DNA sequence
There are 1 CGC in the DNA sequence
There are 1 CCT in the DNA sequence
There are 1 CCG in the DNA sequence

But there is a drawback: it is relatively complicated to find the count for a given trinucleotide. Pseudocode:

  1. Find the position of the trinucleotide (in the list of keys)

  2. Obtain the count for that trinucleotide (using the previously obtained position in the list of values)

# For instance, for 'TTA'
i = tri_nts.index('TTA') # 1.- find the position of the trinucleotide in the list of keys
print(f"There are {count_tri_nts[i]} {tri_nts[i]} in the DNA sequence") # 2.- access the count of TTA
There are 2 TTA in the DNA sequence

Note: we are storing a lot of trinucleotides whose count is zero

Dictionaries (dict) as a solution#

Storing data as key-value pairs. For instance, the key ‘TTA’ will have an associated value 2

We have previously learned that lists store collections of objects in an ordered sequence, accessed by position. Dictionaries (dict) in Python are different: their items are accessed by key, not by position. Since Python 3.7 a dict preserves the order in which items were inserted, but, unlike a language dictionary, it is not sorted; if we need the keys in a particular order, we can always retrieve and sort them. Keys are usually human-readable strings, although this is not required; they must all be different, and Python can access any key and its associated value very quickly.

More advanced (just read this for now, there is no need to understand it yet): the key must be an immutable (more precisely, hashable) type such as str, int, float, bool, or a tuple of these. The value, on the other hand, can be of any type, mutable or immutable.

Initialization of a dict#

taxonomy_id = {'pan_paniscus': 9597, 'homo_sapiens': 9606,  'gorilla_gorilla': 9593} # items separated by commas
print(taxonomy_id)
{'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593}

Dissection of a dictionary#

Items (key: value pairs, e.g. ‘pan_paniscus’: 9597) are separated by commas. Each item has:

  • key (immutable): str, int, float, bool. Usually a human-readable string

  • value (mutable or immutable)

Note that:

  • in the dictionary initialization, there is a colon between the key and the value

  • the keys need to be unique

  • each key stores only one value
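The points above can be checked with a small sketch. The chromosome positions and gene names are made up for illustration. It shows that a tuple works as a key, that a list (mutable) is rejected with a TypeError, and that assigning to an existing key replaces its value instead of creating a second entry.

```python
# Keys must be hashable: strings, numbers and tuples of those all work
positions = {
    ("chr1", 1500): "hypothetical_gene_A",   # a tuple used as key (made-up data)
    ("chr2", 300): "hypothetical_gene_B",
}

# A list is mutable, so it cannot be used as a key
try:
    positions[["chr3", 42]] = "hypothetical_gene_C"
    list_key_accepted = True
except TypeError:
    list_key_accepted = False
print("list accepted as key:", list_key_accepted)

# Keys are unique: assigning to an existing key replaces its value
positions[("chr1", 1500)] = "hypothetical_gene_D"
print(len(positions), positions[("chr1", 1500)])
```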

More advanced: the value could be a more complicated data structure. For instance, a list (or object)
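As a minimal sketch of a dict whose values are lists, the example below maps a few amino acids (one-letter code) to their codons in the standard genetic code. The variable name codons is only illustrative, and only three amino acids are included.

```python
# Amino acid (one-letter code) -> list of codons (standard genetic code)
codons = {
    'M': ['ATG'],
    'K': ['AAA', 'AAG'],
    'D': ['GAT'],
}
codons['D'].append('GAC')   # the list stored as a value can still be modified
print(codons['D'])
print(len(codons['K']))
```

Each key still stores a single value; that value just happens to be a list that can hold several elements.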

Initialization of a dict in several lines#

Writing one item per line is easier to read

# Example of dictionary
taxonomy_id = {
    'pan_paniscus': 9597,
    'homo_sapiens': 9606,
    'gorilla_gorilla': 9593
}
print(taxonomy_id)
{'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593}

Another way to initialize a dict#

taxonomy_id = dict({'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593})
print(taxonomy_id)
{'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593}

Alternative ways to initialize a dict#

List of tuples, list of lists, tuple of tuples …

# dict(), using a list of tuples
taxonomy_id = dict([('pan_paniscus', 9597), ('homo_sapiens', 9606), ('gorilla_gorilla', 9593)])
print(taxonomy_id)
{'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593}
# dict(), using a list of lists
taxonomy_id = dict([['pan_paniscus', 9597], ['homo_sapiens', 9606], ['gorilla_gorilla', 9593]])
print(taxonomy_id)
{'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593}
# dict(), using a tuple of tuples
taxonomy_id = dict((('pan_paniscus', 9597), ('homo_sapiens', 9606), ('gorilla_gorilla', 9593)))
print(taxonomy_id)
{'pan_paniscus': 9597, 'homo_sapiens': 9606, 'gorilla_gorilla': 9593}

Or even a tuple of lists

Accessing values through their keys#

taxonomy_id['homo_sapiens']
9606

Usually, we create an empty dict and add items (key-value)#

Because we often do not have the items yet when initializing the dict.

Initialization of an empty dict#

enzymes = dict()
print(enzymes)
{}
# or 
enzymes = {}
print(enzymes)
{}

Add items (key-value) to the dict#

For instance, restriction enzyme cutting sites. The IUPAC nucleic acid notation can be helpful here.

# We can even store regular expression patterns
# We then add items within our code:
enzymes = dict()
enzymes['RshI']   = r"CGATCG"    
enzymes['Lmu60I'] = r"CCT[ATGC]AGG" # r"CCTNAGG" \ N = A or T or G or C
enzymes['MabI']   = r"ACC(A|T)GGT"  # r"ACCWGGT" \ W = A or T 
print(enzymes)
{'RshI': 'CGATCG', 'Lmu60I': 'CCT[ATGC]AGG', 'MabI': 'ACC(A|T)GGT'}
# Of course, if we have all the items, we can initialize the dict in the classical way
enzymes = {
    'RshI': r"CGATCG",    
    'Lmu60I': r"CCT[ATGC]AGG",
    'MabI': r"ACC(A|T)GGT"
}
print(enzymes)
{'RshI': 'CGATCG', 'Lmu60I': 'CCT[ATGC]AGG', 'MabI': 'ACC(A|T)GGT'}
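Since the values are regular expression patterns, they can be passed directly to the re module. The sketch below is only an illustration: the sequence seq is made up, and the loop uses dict.items(), which is covered later in this chapter. It collects the 0-based start position of every match into a second dict.

```python
import re

enzymes = {
    'RshI': r"CGATCG",
    'Lmu60I': r"CCT[ATGC]AGG",
    'MabI': r"ACC(A|T)GGT",
}
# A made-up sequence used only for this illustration
seq = "TTCGATCGAACCTGAGGTTACCAGGTAA"

sites = {}  # enzyme name -> list of 0-based start positions of each match
for name, pattern in enzymes.items():
    sites[name] = [match.start() for match in re.finditer(pattern, seq)]
print(sites)
```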

Delete an item using the key: dict.pop()#

pop() is a method of dict. It removes the item with the given key and returns the associated value

enzymes = {'RshI': r"CGATCG", 'Lmu60I': r"CCT[ATGC]AGG", 'MabI': r"ACC(A|T)GGT"}
popped_enzyme = enzymes.pop('MabI') # removes MabI and returns its value
print(popped_enzyme)
print(enzymes)
ACC(A|T)GGT
{'RshI': 'CGATCG', 'Lmu60I': 'CCT[ATGC]AGG'}
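If the key might be absent, pop() also accepts a second argument: a default that is returned instead of raising a KeyError. A minimal sketch, using None as the default (any other value would work too):

```python
enzymes = {'RshI': r"CGATCG", 'Lmu60I': r"CCT[ATGC]AGG", 'MabI': r"ACC(A|T)GGT"}
first = enzymes.pop('MabI', None)    # key present: its value is returned and the item removed
second = enzymes.pop('MabI', None)   # key absent: the default is returned, no error
print(first)
print(second)
print(enzymes)
```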

If we try to access a key that is not in the dict, it will raise an error#

enzymes = {'RshI': r"CGATCG", 'Lmu60I': r"CCT[ATGC]AGG", 'MabI': r"ACC(A|T)GGT"}
print(enzymes)
print('MabI', enzymes.pop('MabI'))
enzymes.pop('MabI') # raises an error: 'MabI' is no longer in the dict

# KeyError: 'MabI'

Using a dict in our previous example#

To add or update items (key-value), we do the following:

dna = "GAGGTTACCGCCTACGATTGGGAATTA"
nts = ['A','T','G','C']
counts = dict()
for nt1 in nts:
    for nt2 in nts:
        for nt3 in nts:
            counts[nt1+nt2+nt3] = dna.count(nt1+nt2+nt3) # add all the pairs
print("counts:\n", counts)
counts:
 {'AAA': 0, 'AAT': 1, 'AAG': 0, 'AAC': 0, 'ATA': 0, 'ATT': 2, 'ATG': 0, 'ATC': 0, 'AGA': 0, 'AGT': 0, 'AGG': 1, 'AGC': 0, 'ACA': 0, 'ACT': 0, 'ACG': 1, 'ACC': 1, 'TAA': 0, 'TAT': 0, 'TAG': 0, 'TAC': 2, 'TTA': 2, 'TTT': 0, 'TTG': 1, 'TTC': 0, 'TGA': 0, 'TGT': 0, 'TGG': 1, 'TGC': 0, 'TCA': 0, 'TCT': 0, 'TCG': 0, 'TCC': 0, 'GAA': 1, 'GAT': 1, 'GAG': 1, 'GAC': 0, 'GTA': 0, 'GTT': 1, 'GTG': 0, 'GTC': 0, 'GGA': 1, 'GGT': 1, 'GGG': 1, 'GGC': 0, 'GCA': 0, 'GCT': 0, 'GCG': 0, 'GCC': 1, 'CAA': 0, 'CAT': 0, 'CAG': 0, 'CAC': 0, 'CTA': 1, 'CTT': 0, 'CTG': 0, 'CTC': 0, 'CGA': 1, 'CGT': 0, 'CGG': 0, 'CGC': 1, 'CCA': 0, 'CCT': 1, 'CCG': 1, 'CCC': 0}
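If the keys and values already exist as two parallel lists, as in the two-list approach shown earlier, they can be paired with the built-in zip() and passed to dict(). The sketch below uses only a small subset of trinucleotides to keep the output short.

```python
dna = "GAGGTTACCGCCTACGATTGGGAATTA"
tri_nts = ['ATT', 'TAC', 'TTA', 'TTT']   # list of keys (a small subset)
count_tri_nts = []                       # list of values, in the same order
for tri_nt in tri_nts:
    count_tri_nts.append(dna.count(tri_nt))

# zip() pairs the i-th key with the i-th value; dict() builds the dictionary
counts_subset = dict(zip(tri_nts, count_tri_nts))
print(counts_subset)
```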

Access to an item#

It is straightforward and fast

print(counts['ATT'])
2

dict.update()#

It is a dict method to add (or update) pairs. The following code is therefore equivalent to the previous one

dna = "GAGGTTACCGCCTACGATTGGGAATTA"
nts = ['A','T','G','C']
counts = dict()
for nt1 in nts:
    for nt2 in nts:
        for nt3 in nts:
            counts.update({nt1+nt2+nt3: dna.count(nt1+nt2+nt3)}) # equal to: counts["str"]=dna.count("str")
print("counts:\n", counts)
# The idea: counts.update({key1: val1, key2: val2, ...})
counts:
 {'AAA': 0, 'AAT': 1, 'AAG': 0, 'AAC': 0, 'ATA': 0, 'ATT': 2, 'ATG': 0, 'ATC': 0, 'AGA': 0, 'AGT': 0, 'AGG': 1, 'AGC': 0, 'ACA': 0, 'ACT': 0, 'ACG': 1, 'ACC': 1, 'TAA': 0, 'TAT': 0, 'TAG': 0, 'TAC': 2, 'TTA': 2, 'TTT': 0, 'TTG': 1, 'TTC': 0, 'TGA': 0, 'TGT': 0, 'TGG': 1, 'TGC': 0, 'TCA': 0, 'TCT': 0, 'TCG': 0, 'TCC': 0, 'GAA': 1, 'GAT': 1, 'GAG': 1, 'GAC': 0, 'GTA': 0, 'GTT': 1, 'GTG': 0, 'GTC': 0, 'GGA': 1, 'GGT': 1, 'GGG': 1, 'GGC': 0, 'GCA': 0, 'GCT': 0, 'GCG': 0, 'GCC': 1, 'CAA': 0, 'CAT': 0, 'CAG': 0, 'CAC': 0, 'CTA': 1, 'CTT': 0, 'CTG': 0, 'CTC': 0, 'CGA': 1, 'CGT': 0, 'CGG': 0, 'CGC': 1, 'CCA': 0, 'CCT': 1, 'CCG': 1, 'CCC': 0}
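update() is most useful when several pairs are added at once. In this small sketch (the counts are arbitrary illustrative values) one call adds a new key and overwrites an existing one; the keyword form is an alternative that only works when the keys are valid Python identifiers.

```python
counts = {'ATT': 2, 'TAC': 2}
counts.update({'TTA': 2, 'ATT': 3})   # adds 'TTA' and overwrites the value of 'ATT'
counts.update(GGG=1)                  # keyword form: the key is the string 'GGG'
print(counts)
```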

Again, if we ask for a key that is not in the dict, it will raise an error#

# We can add pairs when the value is > 0
dna = "GAGGTTACCGCCTACGATTGGGAATTA"
nts = ['A','T','G','C']
counts = dict()
for nt1 in nts:
    for nt2 in nts:
        for nt3 in nts:
            numberOf = dna.count(nt1+nt2+nt3)
            if numberOf > 0:  # only if they are present
                counts[nt1+nt2+nt3] = numberOf 
print("counts:\n", counts)
counts:
 {'AAT': 1, 'ATT': 2, 'AGG': 1, 'ACG': 1, 'ACC': 1, 'TAC': 2, 'TTA': 2, 'TTG': 1, 'TGG': 1, 'GAA': 1, 'GAT': 1, 'GAG': 1, 'GTT': 1, 'GGA': 1, 'GGT': 1, 'GGG': 1, 'GCC': 1, 'CTA': 1, 'CGA': 1, 'CGC': 1, 'CCT': 1, 'CCG': 1}

And now, be careful:

print(counts['TTT'])

# KeyError: 'TTT'

For the same reason, the following will raise an error too:

if(counts['TTT']):
    print(counts['TTT'])

Check that the key is in the dict#

This is one way to avoid the previous error

if 'TTT' in counts:
    print(counts['TTT'])
print("Check passed! Now, it raises no error")
Check passed! Now, it raises no error

dict.get()#

It is a dict method that returns the value for a key if the key is in the dictionary; otherwise it returns None, or whatever default we provide as a second argument. It is an alternative way to avoid the error raised when accessing a non-existent key

print('ATT has', counts.get('ATT'), "instances in the dna sequence") # get: None by default
print('TTT has', counts.get('TTT'), "instances in the dna sequence")
ATT has 2 instances in the dna sequence
TTT has None instances in the dna sequence
# dict.get():set up default to 0
print('ATT has', counts.get('ATT', 0), "instances in the dna sequence") # get: 0 by default
print('TTT has', counts.get('TTT', 0), "instances in the dna sequence") 
ATT has 2 instances in the dna sequence
TTT has 0 instances in the dna sequence
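A common use of get() with a default of 0 is counting items in a single pass, without generating all 64 trinucleotides first. The sketch below slides a window of length k along the sequence. Note that this counts overlapping occurrences, whereas str.count() counts non-overlapping ones (for example, "AAAA".count("AA") is 2, but a sliding window finds 3); for the sequence used in this chapter both approaches give the same counts.

```python
dna = "GAGGTTACCGCCTACGATTGGGAATTA"
k = 3
counts = {}
for start in range(len(dna) - k + 1):        # every window of length k
    kmer = dna[start:start + k]
    counts[kmer] = counts.get(kmer, 0) + 1   # 0 the first time a k-mer is seen
print(counts.get('ATT', 0), counts.get('TTT', 0))
print(len(counts), "different 3-mers observed")
```

Only the k-mers actually present end up in the dict, so nothing is stored for motifs with a count of zero.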

Iterating all over dictionaries#

dict.keys()#

Returns a set-like object providing a view on the keys of a dictionary

counts.keys()
dict_keys(['AAT', 'ATT', 'AGG', 'ACG', 'ACC', 'TAC', 'TTA', 'TTG', 'TGG', 'GAA', 'GAT', 'GAG', 'GTT', 'GGA', 'GGT', 'GGG', 'GCC', 'CTA', 'CGA', 'CGC', 'CCT', 'CCG'])

Note: the keys are not sorted; since Python 3.7 they are returned in the order in which they were inserted
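Because the object returned by keys() is set-like, it supports set operations such as intersection (&) and difference (-). A brief sketch with two small dicts of made-up counts:

```python
counts_a = {'ATT': 2, 'TAC': 2, 'GGG': 1}
counts_b = {'TAC': 1, 'GGG': 4, 'CCC': 2}    # made-up counts from a second sequence

shared = counts_a.keys() & counts_b.keys()   # keys present in both dicts (a set)
only_a = counts_a.keys() - counts_b.keys()   # keys present only in counts_a (a set)
print(sorted(shared))
print(sorted(only_a))
```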

dict.items()#

Returns a set-like object providing a view on the items of a dictionary

counts.items()
dict_items([('AAT', 1), ('ATT', 2), ('AGG', 1), ('ACG', 1), ('ACC', 1), ('TAC', 2), ('TTA', 2), ('TTG', 1), ('TGG', 1), ('GAA', 1), ('GAT', 1), ('GAG', 1), ('GTT', 1), ('GGA', 1), ('GGT', 1), ('GGG', 1), ('GCC', 1), ('CTA', 1), ('CGA', 1), ('CGC', 1), ('CCT', 1), ('CCG', 1)])

Iterating all over the keys#

for trinucleotide in counts.keys():
    if counts[trinucleotide] >= 2:
        print(f"{trinucleotide} has {counts[trinucleotide]:d} counts")
ATT has 2 counts
TAC has 2 counts
TTA has 2 counts

Iterating all over the items (keys-values)#

for key, val in counts.items():
    if val >= 1: # val is the same as counts[key]
        print(f"{key} has {val:d} counts")
AAT has 1 counts
ATT has 2 counts
AGG has 1 counts
ACG has 1 counts
ACC has 1 counts
TAC has 2 counts
TTA has 2 counts
TTG has 1 counts
TGG has 1 counts
GAA has 1 counts
GAT has 1 counts
GAG has 1 counts
GTT has 1 counts
GGA has 1 counts
GGT has 1 counts
GGG has 1 counts
GCC has 1 counts
CTA has 1 counts
CGA has 1 counts
CGC has 1 counts
CCT has 1 counts
CCG has 1 counts
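When a particular order is needed, the built-in sorted() can be applied to the dict. This is a minimal sketch on a subset of the counts above: sorted(counts) gives the keys in alphabetical order, and passing key=counts.get sorts the keys by their associated value instead.

```python
counts = {'AAT': 1, 'ATT': 2, 'AGG': 1, 'TAC': 2, 'TTA': 2}   # subset of the counts above

alphabetical = sorted(counts)                             # keys in alphabetical order
by_count = sorted(counts, key=counts.get, reverse=True)   # keys with the highest count first
print(alphabetical)
for kmer in by_count:
    print(f"{kmer} has {counts[kmer]:d} counts")
```

Keys with the same count keep their insertion order, because sorted() is stable.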

Summary#

  • Introducing dictionaries:

    • What is a dict?

    • Why is it useful?

    • Items within a dictionary

  • Dictionary initialization

  • Accessing an item of the dict

  • Add an item

  • Delete an item (using a key):

    • dict.pop()

  • Get a val from key (with default val):

    • dict.get()

  • Iterating over a dictionary:

    • dict.keys()

    • dict.items()


Exercises#

More online references on dictionaries:#