C5. A higher level of abstraction: functions#

Introduction to functions#

Let’s start with an example. If I had a program that generates a random sequence of DNA of length n, I would not code it again every time I need it. What I need is something at a higher level of abstraction, something I can easily call/use in my programs, a function. For instance:

get_a_random_sequence_of_DNA(n) # n is an integer, the length of the sequence  

A DNA composition example#

This will be the code with no function

# Goal: Obtain the DNA composition (percent), the bp (A, T, C, G) composition
dna = "ATGGTCGTAGCTTACGCCCCGGCACTCCTTCGTGTCCACCAGTTGGATTGCAAGGGTAGAATCACCCTATGCTGATACAT" # randomly generated

# main code
bp_percent = []
for bp in ['A', 'T', 'C', 'G']:
    bp_percent.append(100*dna.count(bp)/len(dna))   
    
print(bp_percent) # Percent of ['A', 'T', 'C', 'G']
[21.25, 26.25, 28.75, 23.75]

How can you calculate the DNA composition of ~20000 human genes?#

Lets implement a function!#

That allows us to reuse the code

def dna_bp_composition(dna):
    """Goal: Find the bp [A, T, C, G] composition of an string of DNA (dna).
    
    Arguments:
    dna -- (str) the string of DNA
    
    Returns:
    bp_percent -- (list of float) with the dna composition -percent- for ['A', 'T', 'C', 'G']
    """
    bp_percent = []
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent.append(100*dna.count(bp)/len(dna))
    return bp_percent
# main code    
dna = "ATGGTCGTAGCTTACGCCCCGGCACTCCTTCGTGTCCACCAGTTGGATTGCAAGGGTAGAATCACCCTATGCTGATACAT" # randomly generated
print ("Percent of ['A', 'T', 'C', 'G']:", dna_bp_composition(dna)) # Percent of ['A', 'T', 'C', 'G']
Percent of ['A', 'T', 'C', 'G']: [21.25, 26.25, 28.75, 23.75]

Then, for a list of genes#

For instance, for 7 gene sequences

# main code 
genes = ["ATGGTCGTAG","CTTACGCCCC","GGCACTCCTT","CGTGTCCACC","AGTTGGATTG","CAAGGGTAGA","ATCAGCCCTA"] # randomly generated
for gene in genes:
    print ("Percent of ['A', 'T', 'C', 'G'] in", gene, ":", dna_bp_composition(gene))  
Percent of ['A', 'T', 'C', 'G'] in ATGGTCGTAG : [20.0, 30.0, 10.0, 40.0]
Percent of ['A', 'T', 'C', 'G'] in CTTACGCCCC : [10.0, 20.0, 60.0, 10.0]
Percent of ['A', 'T', 'C', 'G'] in GGCACTCCTT : [10.0, 30.0, 40.0, 20.0]
Percent of ['A', 'T', 'C', 'G'] in CGTGTCCACC : [10.0, 20.0, 50.0, 20.0]
Percent of ['A', 'T', 'C', 'G'] in AGTTGGATTG : [20.0, 40.0, 0.0, 40.0]
Percent of ['A', 'T', 'C', 'G'] in CAAGGGTAGA : [40.0, 10.0, 10.0, 40.0]
Percent of ['A', 'T', 'C', 'G'] in ATCAGCCCTA : [30.0, 20.0, 40.0, 10.0]

Note: Instead of using this small list of short DNA seqs, we can run a similar code over the ~20,0000 human protein coding genes

Are functions useful?#

Here I provide some reasons:

  • Another level of abstraction

  • Clean code: order, readability

  • Flexibility: your code can have more functionality with little changes

  • Reusability: do not reinvent the wheel and use already implemented code

  • Testing: write specific functions for testing purposes

  • Maintain and extent your code

Advice:

  • Avoid writing “spaghetti code”, functions help

Parts of the definition of a function#

Note that a function is defined and latter it is called. Now we proceed with the definition

def dna_bp_composition(dna):
    """Goal: Find the bp [A, T, C, G] composition of an string of DNA (dna).
    
    Arguments:
    dna -- (str) the string of DNA
    
    Returns:
    bp_percent -- (list of num) with the dna composition -percent- for ['A', 'T', 'C', 'G']
    """
    bp_percent = []
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent.append(100*dna.count(bp)/len(dna))
    return bp_percent
  • def: defining the function

  • A Function name: dna_bp_composition. Select a nice name!

  • Function arguments: dna. Only one here, a function can have many arguments

    • dna is of the type str

  • A colon “:

  • Commentaries

  • Block of indented lines. This is the way Python works

    • It is the body of the function

  • return: bp_percent (a list)

    • In this case a list of floats

Calling a function#

When we call the function within the main code, two things can happen:

  1. The result of the function vanishes

    # main code
    dna_bp_composition("ATGGTCGTAG")  
    
  2. The result is stored in a variable for a posterior use

    # main code
    dna_comp = dna_p_composition("ATGGTCGTAG")  
    

Divide and conquer principle:#

Divide et Vinces
    Way before Julius Caesar

Divide a complex task in simpler tasks. Functions follows conceptually this idea

Input, output#

Can a function have no input parameter?#

Yes, a function can have no input parameter

def print_right_triangle():
    """
        Plot a square triangle.
        This function has no input parameter
    """
    my_line = ""
    for i in range(1,10):
        my_line += "@"  # another way of doing my_line = "@"*i
        print(my_line)
    return 1 # 1 to indicate that has ended properly

# main
print_right_triangle()
@
@@
@@@
@@@@
@@@@@
@@@@@@
@@@@@@@
@@@@@@@@
@@@@@@@@@
1

Can a function have no output/return?#

Yes, a function can have no return. It is called a procedure:

Procedure

(bbc.co.uk)

Another classical procedure is a cooking recipe

Can a function have multiple input parameters?#

Yes, a function can have multiple input parameters. They are separated by commas in the definition and also in the call of the function

def dna_bp_composition(dna, bps):  # input parameters separated by commas 
    bp_percent_local = []  
    for bp in bps:
        dna_num_bp = dna.count(bp)
        bp_percent_local.append(100*dna.count(bp)/len(dna))
    return bp_percent_local

# main code
dna = "AGTACACCCGTTATGGATCG"
bps = ['A', 'T', 'C', 'G']   # this is a *global variable* now
print(dna_bp_composition(dna, bps)) # several input parameters separated by commas 
[25.0, 25.0, 25.0, 25.0]

Scope of the variables#

Variables can only be accessed in the area where they are defined, this is called scope

Local variable#

That is a variable with local scope

def dna_bp_composition(dna):
    bps = ['A', 'T', 'C', 'G'] # this is a *local variable*
    bp_percent_local = []  
    for bp in bps:
        bp_percent_local.append(100*dna.count(bp)/len(dna))
    return bp_percent_local

Then, a local variable can be accessed in its local scope

def dna_bp_composition(dna):
    bps_local = ['A', 'T', 'C', 'G'] # this is a *local variable*
    bp_percent_local = []  
    for bp in bps_local:
        bp_percent_local.append(100*dna.count(bp)/len(dna))
    print("Works! Local scope (function):", bps_local)  # the local variable can be locally accessed
    return bp_percent_local

# main code 
my_dna = "AAGGTCATAT" 
dna_bp_composition(my_dna)
Works! Local scope (function): ['A', 'T', 'C', 'G']
[40.0, 30.0, 10.0, 20.0]

But, a local variable can not be accessed outside its local scope.

# main
print("Fails! Out of scope:", bps_local)  # the local variable can not be accessed outside its scope!

>NameError: name 'bps_local' is not defined

Global variable#

A global variable can be accessed from the whole programm (global scope)

def dna_bp_composition():
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(100*main_dna.count(bp)/len(dna))
    print("Works! Global scope:", main_dna)  # the variable can be accessed from the function
    return bp_percent_local

# main code 
main_dna = "AAGGTCATAT" 
print(dna_bp_composition())
Works! Global scope: AAGGTCATAT
[20.0, 15.0, 5.0, 10.0]

But be careful, because breaks the divide and conquer concept
In the example, the function have dependences other than the function parameters, you have to be careful with the global variable main_dna

It would be better to pass the DNA seq as input parameter to the function

Fixing detected errors in functions#

Some errors can now be fixed in an easier way due to the “divide and conquer principle”:
They are localized within the function. For instance:

def dna_bp_composition(dna):
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(100*dna.count(bp)/len(dna))
    return bp_percent_local

# main
dna_comp = dna_bp_composition("tttttttttTTTttttttttt") # [0.0, 14.285714285714286, 0.0, 0.0]
                                                       # should be 100% of "thymine". But "T" != "t"
print(dna_comp)                                        
[0.0, 14.285714285714286, 0.0, 0.0]

Solution: The function’s code can be modified. This it is simpler than looking up within the whole code

def dna_bp_composition(dna):
    dna = dna.upper()   # this fix the ERROR; 
                        # note that in place does not happen
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(100*dna.count(bp)/len(dna))
    return bp_percent_local

dna_comp = dna_bp_composition("tttttttttTTTttttttttt") # now it is 100% of "thymine"
print(dna_comp) 
[0.0, 100.0, 0.0, 0.0]
...Note that the code is always prone to errors and fixing them can be a difficult task.
For instance, what happens when different IUPAC nt symbols (ie. 'N') are used?

Improving functions#

A big advantage is that the code can be improved just modifying the function

# the next function returns too many decimals (quite difficult to read!)
def dna_bp_composition(dna):
    dna = dna.upper()  # note that in place does not happen
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(100*dna.count(bp)/len(dna))
    return bp_percent_local

print(['A', 'T', 'C', 'G'], "composition: ", dna_bp_composition("CCCTTTCCCCCCCCCCCCCCC")) # too many decimals!
['A', 'T', 'C', 'G'] composition:  [0.0, 14.285714285714286, 85.71428571428571, 0.0]

Solution: use the built-in function round()

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  
    Otherwise, the return value has the same type as the number.
def dna_bp_composition(dna):
    dna = dna.upper() # note that in place does not happen
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        percent = 100*dna.count(bp)/len(dna)
        bp_percent_local.append(round(percent, 1)) # we now round to 1 decimal
    return bp_percent_local

print(['A', 'T', 'C', 'G'], "composition: ", dna_bp_composition("CCCTTTCCCCCCCCCCCCCCC"))
['A', 'T', 'C', 'G'] composition:  [0.0, 14.3, 85.7, 0.0]

Default arguments and positional arguments#

# see it with the next example:
def dna_bp_composition(dna="ATCG", number_of_decimals=3): # setting default parameters
    dna = dna.upper()  # note that in place does not happen
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(round(100*dna.count(bp)/len(dna), number_of_decimals)) # rounded
    return bp_percent_local

# main code (calling the function)
print(dna_bp_composition("ACCTGCGGTTTCCAGCT", 2)) # round to 2 decimals
print(dna_bp_composition("ACCTGCGGTTTCCAGCT"))    # we skip the 2nd argument (round to 3 decimals, default)
print(dna_bp_composition())                       # calling the function even with no parameter: All by default
print(dna_bp_composition(number_of_decimals=4, dna="ACCTGCGGTTTCCAGCT")) # in this case the order is not needed
                                                                         # because we provide the keywords
[11.76, 29.41, 35.29, 23.53]
[11.765, 29.412, 35.294, 23.529]
[25.0, 25.0, 25.0, 25.0]
[11.7647, 29.4118, 35.2941, 23.5294]

But, the next will raise an error

# main code
print(dna_bp_composition(7))    # ERROR, is 7 the value of dna?
                                # The code believes that 7 is the sequence and raises an ERROR.

The order of the parameters is very important
As we saw before, the next will work well

# main code: calling the function 
print(dna_bp_composition("GAACGCAAACCTGGGTG"))  # round to 3 decimals (set up by default)
[29.412, 11.765, 23.529, 35.294]

Testing functions: assert#

Ends the program if the assert runs false.
This kind of advancing material, it will be more clear when knowing what are conditionals

def dna_bp_composition(dna="ATCG", number_of_decimals=3): # setting default parameters
    dna = dna.upper()  # note that in place does not happen
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(round(100*dna.count(bp)/len(dna), number_of_decimals)) # rounded
    return bp_percent_local

# main code (calling the function)
print(dna_bp_composition("ATCG", 1))
assert dna_bp_composition("ATCG", 1) == [25.0, 25.0, 25.0, 25.0] # rounded to one decimal
[25.0, 25.0, 25.0, 25.0]

It has passed this test. A battery of asserts can be tested!

It can also fail the test:

# fails an raises an AssertionError 
assert dna_bp_composition("ATCG", 1) == [25.0, 25.0, 25.0, 26.0]  # if fails raises an "AssertionError"

#Assertion Error

New concepts here:

  • == is a new operator, the comparative operator

  • Assert is equivalent to

    if not condition: 
       raise AssertionError()
  • The functionality of assert will be:

    • pass or raise an error

The functionality of assert is: pass or raise an error

assert dna_bp_composition("ATCG", 1) == [25.0, 25.0, 25.0, 25.0]  # no "AssertionError"

The previous works but, advice:

It is a very bad idea to compare 2 real numbers: integers is ok, but not real numbers.  
It can easily get unexpected results due to the number precision

Another example: we can detect errors when testing agains ‘N’ IUPAC nts. For instance:

# main code (calling the function)
print(dna_bp_composition("ATCGNNNNNNNNNNN", 1))  # This will be [6.7, 6.7, 6.7, 6.7]
assert dna_bp_composition("ATCGNNNNNNNNNNN", 1) == [25.0, 25.0, 25.0, 25.0]

#AssertionError

Then, we easily fix the error: N’s are removed in the next example

def dna_bp_composition(dna="ATCG", number_of_decimals=3): # setting default parameters
    dna = dna.upper().replace('N', '') # 2 consecutive methods (left to right)
    # or 
    # dna = dna.upper()          # in place: false
    # dna = dna.replace('N', '') # in place: false
    #
    bp_percent_local = []  
    for bp in ['A', 'T', 'C', 'G']:
        bp_percent_local.append(round(100*dna.count(bp)/len(dna), number_of_decimals)) # rounded
    return bp_percent_local

# main code (calling the function)
print(dna_bp_composition("ATCGNNNNNNNNNNN", 1))  # round to one decimal
assert dna_bp_composition("ATCGNNNNNNNNNNN", 1) == [25.0, 25.0, 25.0, 25.0] # pass the assert
[25.0, 25.0, 25.0, 25.0]

Summary#

  • Functions:

    • An example

    • Why are functions useful?

    • Defining and calling a function

    • Parts of a function

    • Divide and conquer:

      • All is the sum of its parts

      • From one complex task to several simple tasks

    • Functions with

      • No input parameters

      • Multiple input parameters

      • No return: “procedures”

    • Variable scope:

      • Local variables

      • Global variables

    • Errors:

      • Raising errors

      • Fixing errors

    • Input parameters of a function:

      • Position

      • Default and keywords

    • Testing functions

      • Assert

More material now?#

Yes, it is very important!
Function commentaries: the code needs to be well documented. There are many ways of commenting. For instance:

# FUNCTIONS
###########

def dna_bp_composition(dna):
    """
    Goal: Find the bp [A, T, C, G] composition of an string of DNA (dna)
    
    Arguments:
    dna -- (str) the string of DNA
    
    Returns:
    bp_percent -- (list of floats) with the dna composition -percent- for ['A', 'T', 'C', 'G']
    """
    
    Body of the function



def is_a_leap_year(year):
    """
    Is the year a leap year?

    @type  year: int
    @param year: a year

    @rtype:   bool
    @return:  is a leap year?
    """
    
    Body of the function

Exercises#