C3. Managing information: files#
Usefulness#
Files can store large amounts of data. In many cases a file is the right place to write the output of our programs. Conversely, when an input file contains a lot of information, as gene annotation files do, our program can retrieve only the data we are interested in
Which types of files are of interest to a biologist?#
For instance:
Input and output files that your own code needs or generates
Output files from other programs (e.g. BLAST)
Sequences (fasta and multifasta files)
Annotations in general (structural, genes, phylogeny…)
High Throughput Sequencing reads
Internet (html,…)
Formats: txt, json, xml, etc
Media: photos, videos, audios
Compressed files: *.gz, *.bz2
…and as many more types of files as you can imagine
Opening a file#
open()#
open() is a function
It opens the file whose name (or path) is given as first argument. A bare file name is looked up in your working directory
It returns a file object
file_object = open("my_first_file.txt", "r") # opening for reading
# the next one is also opened for reading
file_object = open("my_first_file.txt") # reading is the default mode if none is indicated
In this case open() raises an error, because the file does not exist:
>>> file_object = open("my_first_file.txt")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'my_first_file.txt'
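If you prefer that your program does not stop at this error, one possible approach is to catch it. The next lines are only a minimal sketch: the `try`/`except` construct is not explained in this section, and the file name is a hypothetical one that is assumed not to exist in your working directory.

```python
# Hypothetical file name, assumed NOT to exist in the working directory
file_name = "a_file_that_does_not_exist.txt"

try:
    file_object = open(file_name)    # raises FileNotFoundError if the file is missing
except FileNotFoundError:
    file_found = False
    print("Cannot find the file:", file_name)
else:
    file_found = True
    file_object.close()              # it was opened, so close it again

print(file_found)
```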
Try the next little exercise#
Create a file in a directory of your computer
Assign the complete path, file name included, to a variable
Open the file, using the variable as argument
This should work. Then:
Change the name of the file in the assignment to the variable
Save your program and run it again. It should raise an error. Does it?
Note that in order to create a file on a computer you need permission to do so
On your own laptop you usually have administrator rights, but if, for instance, you are running Jupyter on a remote server, you need to know which rights you have on that system. You will most likely have some restrictions there
Which one is my current working directory?#
You need to import the os module. The examples below also use the input file my_first_file.txt
import os
print(os.getcwd()) # other functions of os can change the working directory
# you do not need to know them for now
/home/emuro/Desktop/goingOn/teaching/p4b__jupyter_book_WiSe24/p4b__jb_built/p4b__web-book/c3_reading_and_writing_files
# another solution is
!pwd # this works in Jupyter, but not within a Python program
/home/emuro/Desktop/goingOn/teaching/p4b__jupyter_book_WiSe24/p4b__jb_built/p4b__web-book/c3_reading_and_writing_files
my_first_file_object = open("./files/my_first_file.txt")
# !ls -l files
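As a small sketch of how the working directory and a file name fit together, the next lines build a complete path with `os.path.join()` and check it with `os.path.exists()`. The names `files` and `my_first_file.txt` are the ones used in this chapter; whether the file exists on your computer depends on your own setup, so the second `print()` may well show `False`.

```python
import os

# Build a complete path starting from the current working directory
file_path = os.path.join(os.getcwd(), "files", "my_first_file.txt")
print(file_path)

# Check whether the file is really there before trying to open it
print(os.path.exists(file_path))
```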
File objects#
A file object is not the plain text contained in “my_first_file.txt”. It is the object that open() returns
# show the file object
print(my_first_file_object)
<_io.TextIOWrapper name='./files/my_first_file.txt' mode='r' encoding='UTF-8'>
We use methods to deal with file objects
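A minimal sketch of what a file object carries with it. The first lines only create a small throwaway file so that the example runs anywhere (writing is explained later in this chapter); the file name `example.txt` is hypothetical.

```python
import os
import tempfile

# Setup only: create a small throwaway file
file_name = os.path.join(tempfile.mkdtemp(), "example.txt")
setup_h = open(file_name, "w")
setup_h.write("ATG\n")
setup_h.close()

file_object = open(file_name)        # reading mode by default
print(type(file_object))             # the type of object that open() returns
print(file_object.name)              # the path that was given to open()
file_mode = file_object.mode
print(file_mode)                     # r
file_content = file_object.read()    # read() is one of its methods
file_object.close()
```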
Reading a file object#
The read() method#
file_name = "./files/my_first_file.txt" # This file exists
file_object = open(file_name)
file_content = file_object.read()
print(file_content)
`-:-. ,-;"`-:-. ,-;"`-:-. ,-;"`-:-. ,-;"
`=`,'=/ `=`,'=/ `=`,'=/ `=`,'=/
y==/ y==/ y==/ y==/
,=,-<=`. ,=,-<=`. ,=,-<=`. ,=,-<=`.
,-'-' `-=_,-'-' `-=_,-'-' `-=_,-'-' `-=_
___
.---'- \
.-----------/ \
/ ( ^ | __
& ( \ O / / .'
'._/( '-' (. (_.' /
\ \ ./
| | | |/ '._.'
) @).____\| @ |
. / / ( |
\|, '_:::\ . .. '_:::\ ..\)....ATCGGTGTATGGC...wolly mammoth paleogenomics?
The name of the variables matters#
In the last example there were 3 variables.
You can name them whatever you want, but uninformative names quickly become a mess:
# Use better names for the variables than the next ones:
var1 = "./files/my_first_file.txt"
var2 = open(var1)
var3 = var2.read()
print(var3) # these names are not informative at all
`-:-. ,-;"`-:-. ,-;"`-:-. ,-;"`-:-. ,-;"
`=`,'=/ `=`,'=/ `=`,'=/ `=`,'=/
y==/ y==/ y==/ y==/
,=,-<=`. ,=,-<=`. ,=,-<=`. ,=,-<=`.
,-'-' `-=_,-'-' `-=_,-'-' `-=_,-'-' `-=_
___
.---'- \
.-----------/ \
/ ( ^ | __
& ( \ O / / .'
'._/( '-' (. (_.' /
\ \ ./
| | | |/ '._.'
) @).____\| @ |
. / / ( |
\|, '_:::\ . .. '_:::\ ..\)....ATCGGTGTATGGC...wolly mammoth paleogenomics?
Closing a file object#
close()#
It is a file object method with no arguments
Do not forget to close any opened file object!#
file_h = open("your_file.txt", "r") # open the file for reading
...
file_h.close() # the file needs to be closed when you are done with the file object
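To see what close() does, you can look at the `closed` attribute of the file object before and after closing. This is only a sketch: it first creates a throwaway file with a hypothetical name so that it runs anywhere, and it uses `try`/`except` (not covered here) to show that a closed file object cannot be read any more.

```python
import os
import tempfile

# Setup only: create a small throwaway file
file_name = os.path.join(tempfile.mkdtemp(), "example.txt")
setup_h = open(file_name, "w")
setup_h.write("ATG\n")
setup_h.close()

file_h = open(file_name, "r")
print(file_h.closed)            # False: the file is open
file_content = file_h.read()
file_h.close()
print(file_h.closed)            # True: the file is closed

# The content that was read before closing is still available as a str...
print(file_content)

# ...but reading from the closed file object raises a ValueError
try:
    file_h.read()
    error_raised = False
except ValueError as error:
    error_raised = True
    print("Error:", error)
```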
Try the next little exercise:#
Create an ASCII file (*.txt) on your computer. The extension (.txt) indicates that it is plain text
Using your favourite plain-text editor, add some text to the file. Do not forget to save and close it
Then, using Python: open, read and display its content
And, of course, close the file object
A bit more on strings (str)#
How to remove whitespaces and/or “\n” at the end of a str?#
For instance:
"dunaliella_salina_kozac_seq__Microalga.txt" # file name, the extension indicates that is plain text
The file “dunaliella_salina_kozac_seq__Microalga.txt” contains a single line that ends with a return of line “\n”. The sequence “gccaagATGgcg” has length 12, but len() reports 13 because it counts that final return too.
file_obj = open("./files/dunaliella_salina_kozac_seq__Microalga.txt") # file name not stored in a var
file_content = file_obj.read()
print("(" + file_content +")", len(file_content)) # gccaagATGgcg has len 12... but here we see 13 (because of "\n")
(gccaagATGgcg
) 13
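One way to make the invisible return of line visible is the built-in function `repr()`. In this sketch the content of the file is typed in by hand as a string, so that it runs without the file.

```python
# The same content as in the example above, typed in by hand
file_content = "gccaagATGgcg\n"

print(file_content)          # the return of line cannot be seen
print(repr(file_content))    # 'gccaagATGgcg\n'
print(len(file_content))     # 13 = 12 nucleotides + 1 return of line
```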
Solution: use specific str methods#
str.rstrip()#
It is a string method that removes any whitespace at the end of the string, including any return of line
As explained in the previous chapter, str.rstrip() does not modify the string: it returns a copy of the string without the trailing whitespace. This is because str is an immutable type
message = " ATG "
print("." + message + ".") # the whitespaces are there
print("[" + message.rstrip() + "]") # the whitespaces on the right are removed
print("(" + message + ")") # all the whitespaces (left and right) are still there
. ATG .
[ ATG]
( ATG )
help(str.rstrip) # help is a built-in function
Help on method_descriptor:
rstrip(self, chars=None, /)
Return a copy of the string with trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
print("ccgaaaaaaaaaaaaaaaaaa".rstrip("a")) # in RNA, this would remove the poly(A) tail
print("atga.aazt".rstrip("zat."))
ccg
atg
file_obj = open("./files/dunaliella_salina_kozac_seq__Microalga.txt") # file name not stored in a var
file_content = file_obj.read()
file_content.rstrip() # the method does not change the string
print(file_content, len(file_content)) # still 13: gccaagATGgcg (12) plus
# an invisible return of line
file_content = file_content.rstrip() # This is the way to modify file_content
print(file_content, len(file_content)) # Now it is right!
gccaagATGgcg
13
gccaagATGgcg 12
# Some examples in books can be confusing
# For instance:
text = "atgatg \n"
print("(" + text.rstrip("\n") + ")") # Seen in textbooks
# it removes only "\n", not the whitespaces before it
# This is confusing!
(atgatg )
# now it removes ending " " + "\n"
print("("+ text.rstrip() + ")")
(atgatg)
print( "("+ text.rstrip(" \n") + ")") # both, whitespace and "\n"...works but...
print( "("+ text.rstrip() + ")") # better!
print( "("+ text.rstrip("\n") + ")") # dangerous!
(atgatg)
(atgatg)
(atgatg )
str.lstrip()#
This is a str method. It removes any white spaces at the beginning of the string (and/or return of line)
Note that the string can contain a return of line at the beginning of the string
message = " ATG "
print("." + message + ".") # the whitespaces are there
print("[" + message.lstrip() + "]") # the whitespaces on the left are removed
print("(" + message + ")") # all the whitespaces (left and right) are still there
. ATG .
[ATG ]
( ATG )
Try the next example, or something of your own#
It is based on a discussion between Schrödinger and Einstein on quantum superposition: Schrödinger's cat
:): (smiling and sad at the very same time)
schroedinger_cat = ":):" # both smiling and sad at the same time!
schroedinger_smile = schroedinger_cat.rstrip(":") # smiling; from a microscopic world to a macroscopic one!
schroedinger_sad = schroedinger_cat.lstrip(":") # sad
print("Schroedinger_cat", schroedinger_cat)
print("smile", schroedinger_smile)
print("sad", schroedinger_sad)
Schroedinger_cat :):
smile :)
sad ):
str.strip()#
Again a str method. It removes any whitespace (including “\n”) at the beginning and at the end of the string
schroedinger_cat = ":):"
print("Uppss... I see no eyes", schroedinger_cat.strip(":"))
Uppss... I see no eyes )
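The example above passes ":" as argument. Without any argument, strip() removes whitespace at both ends, which is the most common use when reading lines from files. A short sketch, where the line is a hypothetical one typed in by hand:

```python
line = "  \n gccaagATGgcg \t\n"    # hypothetical line, as it could be read from a file

clean = line.strip()               # no argument: removes spaces, tabs and returns of line
print("(" + clean + ")", len(clean))      # (gccaagATGgcg) 12

# Only the two ends are affected, never the middle of the string
inner = " atg atg ".strip()
print("(" + inner + ")")                  # (atg atg)
```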
Concatenating methods in Python#
Methods can be combined in Python within the same statement!
It can be less readable, but it is very common in Python code. It is usually called method chaining
It always reads from left to right
Check that what each method returns is a valid object for the next method in the chain
file_obj = open("./files/dunaliella_salina_kozac_seq__Microalga.txt") # file name not stored in a var
aux_content = file_obj.read().rstrip() # read() returns a str, and rstrip() is applied to that str
print(aux_content, len(aux_content)) # gccaagATGgcg has len 12, and not 13
gccaagATGgcg 12
# We can even make a more complicated combination
# using the same rule (read from left to right)
# Note: here the file object is not stored in a variable, so it cannot be closed explicitly
file_content = open("./files/dunaliella_salina_kozac_seq__Microalga.txt").read().rstrip() # read from left to right
print(file_content, len(file_content)) # gccaagATGgcg has len 12, and not 13
gccaagATGgcg 12
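To check that each method returns a proper input for the next one, it can help to write the chain step by step first. A small sketch with a hypothetical line typed in by hand, using the str methods rstrip(), upper() and count():

```python
raw_line = "gccaagatggcg \n"            # hypothetical line, as read from a file

# Step by step, with one intermediate variable per method
stripped = raw_line.rstrip()            # str -> str
upper_case = stripped.upper()           # str -> str
n_atg = upper_case.count("ATG")         # str -> int

# The same in one statement, read from left to right
n_atg_chained = raw_line.rstrip().upper().count("ATG")

print(n_atg, n_atg_chained)             # 1 1
```

count() returns an int, so the chain has to end there: calling a str method such as rstrip() on that int would raise an error.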
Writing in a file object#
write()#
It is a method on a file object, but the file object has to be previously opened in writing mode
These are the most important modes to open a file:
========= ===============================================================
Character Meaning
--------- ---------------------------------------------------------------
'r' open for reading (default)
'w' open for writing, truncating the file first
'a' open for writing, appending to the end of the file if it exists
'b' binary mode (for non-text data, e.g. images or compressed files)
....
========= ===============================================================
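The difference between 'w' and 'a' is easy to miss, so here is a small sketch. It works on a throwaway file with a hypothetical name, so that none of your own files is overwritten.

```python
import os
import tempfile

# A throwaway file (hypothetical name)
file_name = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

file_h = open(file_name, "w")    # "w": creates the file, or empties it if it exists
file_h.write("ATG")
file_h.close()

file_h = open(file_name, "a")    # "a": keeps the content and writes at the end
file_h.write("TAA")
file_h.close()

file_h = open(file_name)         # "r" is the default mode
content_after_append = file_h.read()
file_h.close()

file_h = open(file_name, "w")    # "w" again: the previous content is lost
file_h.write("GGG")
file_h.close()

file_h = open(file_name, "r")
content_after_rewrite = file_h.read()
file_h.close()

print(content_after_append)      # ATGTAA
print(content_after_rewrite)     # GGG
```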
file_h = open("methionine.txt", "w") # in writing mode the file does not need to exist beforehand
file_h.write("ATG") # write "ATG" in the file
# write "acgatg"
file_h.write("acg" + "atg")
met = "atg"
# write "atc"
file_h.write(met.replace("g","c"))
met = "atg"
# write "3": write() only accepts a str, so the int has to be converted with str()
file_h.write(str(len(met)))
met = "atg"
# write ATG
file_h.write(met.upper())
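Two details of write() are worth checking for yourself: it does not add a return of line, and it returns the number of characters written. A minimal sketch on a throwaway file with a hypothetical name:

```python
import os
import tempfile

# A throwaway file (hypothetical name)
file_name = os.path.join(tempfile.mkdtemp(), "write_demo.txt")

file_h = open(file_name, "w")
n_chars = file_h.write("ATG")    # write() returns the number of characters written
file_h.write("acg\n")            # a return of line has to be written explicitly
file_h.write("atg\n")
file_h.close()

file_h = open(file_name)
file_content = file_h.read()
file_h.close()

print(n_chars)                   # 3
print(repr(file_content))        # 'ATGacg\natg\n'
```

"ATG" and "acg" end up on the same line because no "\n" was written between them.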
Closing a file object opened for writing#
Just as with reading, you have to close the file object after using it. This matters even more when writing, because part of what you wrote may not reach the file until it is closed
close()#
It is a file object method with no arguments
file_h = open("methionine.txt", "w")
...
file_h.close() # the file needs to be closed at the end, when you are done with it
Try the next little exercise#
Open your own file (writing mode), write something in it and close it.
Using your OS, check that it looks fine
Then:
Open the very same file (now in reading mode), read and display its content
Do not forget to close the file object
Using “with” to close the file automatically#
with open("methionine.txt", "w") as my_file:
    # ...block of code (it has to be indented)
    my_file.write("whatever")
# the file does not need to be closed explicitly because "with" was used:
# Python closes it automatically at the end of the block. The same applies to files opened for reading
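You can verify that "with" really closes the file by looking at the `closed` attribute inside and outside the block. A short sketch, using a throwaway file with a hypothetical name:

```python
import os
import tempfile

# A throwaway file (hypothetical name)
file_name = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(file_name, "w") as my_file:
    my_file.write("whatever")
    print(my_file.closed)        # False: still inside the block

print(my_file.closed)            # True: closed on leaving the block

# The very same for reading
with open(file_name) as my_file:
    file_content = my_file.read()

print(file_content)              # whatever
```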
A note of advice#
Perhaps you are wondering: what about opening a file for writing and reading at the same time? When a file is open, there is something like a pointer that indicates where you currently are within the file. In order not to get confused by that pointer, and until you have more experience, I suggest that you open each file either for reading or for writing, not for both
Summary#
open()
file_object = open("my_first_file.txt")
* For reading
file_name = "./files/my_first_file.txt"
file_object = open(file_name, "r") # also by default with no "r"
file_content = file_object.read()
print(file_content)
file_object.close() # close it after using
* For writing
file_object = open("my_first_file.txt", "w") # use "a" instead of "w" to append
close()
file_object.close() # if you open a file you have to close it
read()
file_content = file_object.read()
print(file_content)
write()
file_object.write("ATG") # write "ATG" in the file
str.rstrip()
file_content = open("my_first_file.txt").read().rstrip()
* str.lstrip(), str.strip()