C5. A higher level of abstraction: functions#
Introduction to functions#
Let’s start with an example. If I had a program that generates a random sequence of DNA of length n, I would not code it again every time I need it. What I need is something at a higher level of abstraction, something I can easily call/use in my programs, a function. For instance:
get_a_random_sequence_of_DNA(n) # n is an integer, the length of the sequence
A DNA composition example#
This will be the code with no function
# Goal: Obtain the DNA composition (percent), the bp (A, T, C, G) composition
dna = "ATGGTCGTAGCTTACGCCCCGGCACTCCTTCGTGTCCACCAGTTGGATTGCAAGGGTAGAATCACCCTATGCTGATACAT" # randomly generated
# main code
bp_percent = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent.append(100*dna.count(bp)/len(dna))
print(bp_percent) # Percent of ['A', 'T', 'C', 'G']
[21.25, 26.25, 28.75, 23.75]
How can you calculate the DNA composition of ~20000 human genes?#
Lets implement a function!#
That allows us to reuse the code
def dna_bp_composition(dna):
"""Goal: Find the bp [A, T, C, G] composition of an string of DNA (dna).
Arguments:
dna -- (str) the string of DNA
Returns:
bp_percent -- (list of float) with the dna composition -percent- for ['A', 'T', 'C', 'G']
"""
bp_percent = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent.append(100*dna.count(bp)/len(dna))
return bp_percent
# main code
dna = "ATGGTCGTAGCTTACGCCCCGGCACTCCTTCGTGTCCACCAGTTGGATTGCAAGGGTAGAATCACCCTATGCTGATACAT" # randomly generated
print ("Percent of ['A', 'T', 'C', 'G']:", dna_bp_composition(dna)) # Percent of ['A', 'T', 'C', 'G']
Percent of ['A', 'T', 'C', 'G']: [21.25, 26.25, 28.75, 23.75]
Then, for a list of genes#
For instance, for 7 gene sequences
# main code
genes = ["ATGGTCGTAG","CTTACGCCCC","GGCACTCCTT","CGTGTCCACC","AGTTGGATTG","CAAGGGTAGA","ATCAGCCCTA"] # randomly generated
for gene in genes:
print ("Percent of ['A', 'T', 'C', 'G'] in", gene, ":", dna_bp_composition(gene))
Percent of ['A', 'T', 'C', 'G'] in ATGGTCGTAG : [20.0, 30.0, 10.0, 40.0]
Percent of ['A', 'T', 'C', 'G'] in CTTACGCCCC : [10.0, 20.0, 60.0, 10.0]
Percent of ['A', 'T', 'C', 'G'] in GGCACTCCTT : [10.0, 30.0, 40.0, 20.0]
Percent of ['A', 'T', 'C', 'G'] in CGTGTCCACC : [10.0, 20.0, 50.0, 20.0]
Percent of ['A', 'T', 'C', 'G'] in AGTTGGATTG : [20.0, 40.0, 0.0, 40.0]
Percent of ['A', 'T', 'C', 'G'] in CAAGGGTAGA : [40.0, 10.0, 10.0, 40.0]
Percent of ['A', 'T', 'C', 'G'] in ATCAGCCCTA : [30.0, 20.0, 40.0, 10.0]
Note: Instead of using this small list of short DNA seqs, we can run a similar code over the ~20,0000 human protein coding genes
Are functions useful?#
Here I provide some reasons:
Another level of abstraction
Clean code: order, readability
Flexibility: your code can have more functionality with little changes
Reusability: do not reinvent the wheel and use already implemented code
Testing: write specific functions for testing purposes
Maintain and extent your code
Advice:
Avoid writing “spaghetti code”, functions help
Parts of the definition of a function#
Note that a function is defined and latter it is called. Now we proceed with the definition
def dna_bp_composition(dna):
"""Goal: Find the bp [A, T, C, G] composition of an string of DNA (dna).
Arguments:
dna -- (str) the string of DNA
Returns:
bp_percent -- (list of num) with the dna composition -percent- for ['A', 'T', 'C', 'G']
"""
bp_percent = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent.append(100*dna.count(bp)/len(dna))
return bp_percent
def: defining the function
A Function name: dna_bp_composition. Select a nice name!
Function arguments: dna. Only one here, a function can have many arguments
dna is of the type str
A colon “:”
Commentaries
Block of indented lines. This is the way Python works
It is the body of the function
return: bp_percent (a list)
In this case a list of floats
Calling a function#
When we call the function within the main code, two things can happen:
The result of the function vanishes
# main code dna_bp_composition("ATGGTCGTAG")
The result is stored in a variable for a posterior use
# main code dna_comp = dna_p_composition("ATGGTCGTAG")
Divide and conquer principle:#
Divide et Vinces
Way before Julius Caesar
Divide a complex task in simpler tasks. Functions follows conceptually this idea
Input, output#
Can a function have no input parameter?#
Yes, a function can have no input parameter
def print_right_triangle():
"""
Plot a square triangle.
This function has no input parameter
"""
my_line = ""
for i in range(1,10):
my_line += "@" # another way of doing my_line = "@"*i
print(my_line)
return 1 # 1 to indicate that has ended properly
# main
print_right_triangle()
@
@@
@@@
@@@@
@@@@@
@@@@@@
@@@@@@@
@@@@@@@@
@@@@@@@@@
1
Can a function have no output/return?#
Yes, a function can have no return. It is called a procedure:
(bbc.co.uk)
Another classical procedure is a cooking recipe
Can a function have multiple input parameters?#
Yes, a function can have multiple input parameters. They are separated by commas in the definition and also in the call of the function
def dna_bp_composition(dna, bps): # input parameters separated by commas
bp_percent_local = []
for bp in bps:
dna_num_bp = dna.count(bp)
bp_percent_local.append(100*dna.count(bp)/len(dna))
return bp_percent_local
# main code
dna = "AGTACACCCGTTATGGATCG"
bps = ['A', 'T', 'C', 'G'] # this is a *global variable* now
print(dna_bp_composition(dna, bps)) # several input parameters separated by commas
[25.0, 25.0, 25.0, 25.0]
Scope of the variables#
Variables can only be accessed in the area where they are defined, this is called scope
Local variable#
That is a variable with local scope
def dna_bp_composition(dna):
bps = ['A', 'T', 'C', 'G'] # this is a *local variable*
bp_percent_local = []
for bp in bps:
bp_percent_local.append(100*dna.count(bp)/len(dna))
return bp_percent_local
Then, a local variable can be accessed in its local scope
def dna_bp_composition(dna):
bps_local = ['A', 'T', 'C', 'G'] # this is a *local variable*
bp_percent_local = []
for bp in bps_local:
bp_percent_local.append(100*dna.count(bp)/len(dna))
print("Works! Local scope (function):", bps_local) # the local variable can be locally accessed
return bp_percent_local
# main code
my_dna = "AAGGTCATAT"
dna_bp_composition(my_dna)
Works! Local scope (function): ['A', 'T', 'C', 'G']
[40.0, 30.0, 10.0, 20.0]
But, a local variable can not be accessed outside its local scope.
# main
print("Fails! Out of scope:", bps_local) # the local variable can not be accessed outside its scope!
>NameError: name 'bps_local' is not defined
Global variable#
A global variable can be accessed from the whole programm (global scope)
def dna_bp_composition():
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(100*main_dna.count(bp)/len(main_dna))
print("Works! Global scope:", main_dna) # the variable can be accessed from the function
return bp_percent_local
# main code
main_dna = "AAGGTCATAT"
print(dna_bp_composition())
Works! Global scope: AAGGTCATAT
[40.0, 30.0, 10.0, 20.0]
But be careful, because breaks the divide and conquer concept
In the example, the function have dependences other than the function parameters, you have to be careful with the global variable main_dna
It would be better to pass the DNA seq as input parameter to the function
Fixing detected errors in functions#
Some errors can now be fixed in an easier way due to the “divide and conquer principle”:
They are localized within the function. For instance:
def dna_bp_composition(dna):
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(100*dna.count(bp)/len(dna))
return bp_percent_local
# main
dna_comp = dna_bp_composition("tttttttttTTTttttttttt") # [0.0, 14.285714285714286, 0.0, 0.0]
# should be 100% of "thymine". But "T" != "t"
print(dna_comp)
[0.0, 14.285714285714286, 0.0, 0.0]
Solution: The function’s code can be modified. This it is simpler than looking up within the whole code
def dna_bp_composition(dna):
dna = dna.upper() # this fix the ERROR;
# note that in place does not happen
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(100*dna.count(bp)/len(dna))
return bp_percent_local
dna_comp = dna_bp_composition("tttttttttTTTttttttttt") # now it is 100% of "thymine"
print(dna_comp)
[0.0, 100.0, 0.0, 0.0]
...Note that the code is always prone to errors and fixing them can be a difficult task.
For instance, what happens when different IUPAC nt symbols (ie. 'N') are used?
Improving functions#
A big advantage is that the code can be improved just modifying the function
# the next function returns too many decimals (quite difficult to read!)
def dna_bp_composition(dna):
dna = dna.upper() # note that in place does not happen
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(100*dna.count(bp)/len(dna))
return bp_percent_local
print(['A', 'T', 'C', 'G'], "composition: ", dna_bp_composition("CCCTTTCCCCCCCCCCCCCCC")) # too many decimals!
['A', 'T', 'C', 'G'] composition: [0.0, 14.285714285714286, 85.71428571428571, 0.0]
Solution: use the built-in function round()
round(number, ndigits=None)
Round a number to a given precision in decimal digits.
The return value is an integer if ndigits is omitted or None.
Otherwise, the return value has the same type as the number.
def dna_bp_composition(dna):
dna = dna.upper() # note that in place does not happen
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
percent = 100*dna.count(bp)/len(dna)
bp_percent_local.append(round(percent, 1)) # we now round to 1 decimal
return bp_percent_local
print(['A', 'T', 'C', 'G'], "composition: ", dna_bp_composition("CCCTTTCCCCCCCCCCCCCCC"))
['A', 'T', 'C', 'G'] composition: [0.0, 14.3, 85.7, 0.0]
Default arguments and positional arguments#
# see it with the next example:
def dna_bp_composition(dna="ATCG", number_of_decimals=3): # setting default parameters
dna = dna.upper() # note that in place does not happen
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(round(100*dna.count(bp)/len(dna), number_of_decimals)) # rounded
return bp_percent_local
# main code (calling the function)
print(dna_bp_composition("ACCTGCGGTTTCCAGCT", 2)) # round to 2 decimals
print(dna_bp_composition("ACCTGCGGTTTCCAGCT")) # we skip the 2nd argument (round to 3 decimals, default)
print(dna_bp_composition()) # calling the function even with no parameter: All by default
print(dna_bp_composition(number_of_decimals=4, dna="ACCTGCGGTTTCCAGCT")) # in this case the order is not needed
# because we provide the keywords
[11.76, 29.41, 35.29, 23.53]
[11.765, 29.412, 35.294, 23.529]
[25.0, 25.0, 25.0, 25.0]
[11.7647, 29.4118, 35.2941, 23.5294]
But, the next will raise an error
# main code
print(dna_bp_composition(7)) # ERROR, is 7 the value of dna?
# The code believes that 7 is the sequence and raises an ERROR.
The order of the parameters is very important
As we saw before, the next will work well
# main code: calling the function
print(dna_bp_composition("GAACGCAAACCTGGGTG")) # round to 3 decimals (set up by default)
[29.412, 11.765, 23.529, 35.294]
Testing functions: assert#
Ends the program if the assert runs false.
This kind of advancing material, it will be more clear when knowing what are conditionals
def dna_bp_composition(dna="ATCG", number_of_decimals=3): # setting default parameters
dna = dna.upper() # note that in place does not happen
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(round(100*dna.count(bp)/len(dna), number_of_decimals)) # rounded
return bp_percent_local
# main code (calling the function)
print(dna_bp_composition("ATCG", 1))
assert dna_bp_composition("ATCG", 1) == [25.0, 25.0, 25.0, 25.0] # rounded to one decimal
[25.0, 25.0, 25.0, 25.0]
It has passed this test. A battery of asserts can be tested!
It can also fail the test:
# fails an raises an AssertionError
assert dna_bp_composition("ATCG", 1) == [25.0, 25.0, 25.0, 26.0] # if fails raises an "AssertionError"
#Assertion Error
New concepts here:
== is a new operator, the comparative operator
Assert is equivalent to
if not condition:
raise AssertionError()
The functionality of assert will be:
pass or raise an error
The functionality of assert is: pass or raise an error
assert dna_bp_composition("ATCG", 1) == [25.0, 25.0, 25.0, 25.0] # no "AssertionError"
The previous works but, advice:
It is a very bad idea to compare 2 real numbers: integers is ok, but not real numbers.
It can easily get unexpected results due to the number precision
Another example: we can detect errors when testing agains ‘N’ IUPAC nts. For instance:
# main code (calling the function)
print(dna_bp_composition("ATCGNNNNNNNNNNN", 1)) # This will be [6.7, 6.7, 6.7, 6.7]
assert dna_bp_composition("ATCGNNNNNNNNNNN", 1) == [25.0, 25.0, 25.0, 25.0]
#AssertionError
Then, we easily fix the error: N’s are removed in the next example
def dna_bp_composition(dna="ATCG", number_of_decimals=3): # setting default parameters
dna = dna.upper().replace('N', '') # 2 consecutive methods (left to right)
# or
# dna = dna.upper() # in place: false
# dna = dna.replace('N', '') # in place: false
#
bp_percent_local = []
for bp in ['A', 'T', 'C', 'G']:
bp_percent_local.append(round(100*dna.count(bp)/len(dna), number_of_decimals)) # rounded
return bp_percent_local
# main code (calling the function)
print(dna_bp_composition("ATCGNNNNNNNNNNN", 1)) # round to one decimal
assert dna_bp_composition("ATCGNNNNNNNNNNN", 1) == [25.0, 25.0, 25.0, 25.0] # pass the assert
[25.0, 25.0, 25.0, 25.0]
Summary#
Functions:
An example
Why are functions useful?
Defining and calling a function
Parts of a function
Divide and conquer:
All is the sum of its parts
From one complex task to several simple tasks
Functions with
No input parameters
Multiple input parameters
No return: “procedures”
Variable scope:
Local variables
Global variables
Errors:
Raising errors
Fixing errors
Input parameters of a function:
Position
Default and keywords
Testing functions
Assert
More material now?#
Yes, it is very important!
Function commentaries: the code needs to be well documented. There are many ways of commenting. For instance:
# FUNCTIONS
###########
def dna_bp_composition(dna):
"""
Goal: Find the bp [A, T, C, G] composition of an string of DNA (dna)
Arguments:
dna -- (str) the string of DNA
Returns:
bp_percent -- (list of floats) with the dna composition -percent- for ['A', 'T', 'C', 'G']
"""
Body of the function
def is_a_leap_year(year):
"""
Is the year a leap year?
@type year: int
@param year: a year
@rtype: bool
@return: is a leap year?
"""
Body of the function