C3. Managing information: files#

Usefulness#

Files can store large amounts of data. In many cases a file is the right place to write the output of our programs. Note also that when an input file contains a lot of information, as gene annotation files do, our program can retrieve only the data we are interested in.

Which types of files are of interest to a biologist?#

For instance:

  • Your own code will need/generate input/output files

  • Output files from other programs (e.g. BLAST)

  • Sequences (fasta and multifasta files)

  • Annotations in general (structural, genes, phylogeny…)

  • High Throughput Sequencing reads

  • Internet (html,…)

  • Formats: txt, json, xml, etc

  • Media: photos, videos, audios

  • Compressed files: *.gz, *.bz2

  • …and as many more types of files as you can imagine

Opening a file#

open()#

  • open() is a function

    • It opens the file whose path you give. A bare file name (or any relative path) is looked up starting from your working directory.

    • Returns a file object

file_object = open("my_first_file.txt", "r") # opening for reading 
# the next call also opens the file for reading
file_object = open("my_first_file.txt") # reading is the default mode if none is indicated
In this case open() raises an error, because the file does not exist in the working directory:
    >>> file_object = open("my_first_file.txt")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    FileNotFoundError: [Errno 2] No such file or directory: 'my_first_file.txt'
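One possible way to keep a program running when a file is missing is to catch the exception. This is only a minimal sketch: the file name is hypothetical and was chosen so that the file does not exist, and the try/except construct is used here without further explanation.

```python
# Hypothetical file name, chosen so that the file does not exist
file_name = "this_file_does_not_exist_12345.txt"

try:
    file_object = open(file_name)
except FileNotFoundError as error:
    # error.filename holds the path that could not be opened
    message = "Cannot open: " + error.filename
    print(message)
else:
    # This part runs only if open() succeeded
    file_object.close()
    message = "File opened and closed"
    print(message)
```

With a file name that does exist, the else branch would run instead.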

Try the next little exercise#

  • Create a file in a directory on your computer

  • Assign the complete path, with the file included, to a variable

  • Open the file, using the variable as argument

This should be working by now. Then:

  • Change the file name in the path assigned to the variable, so that it no longer matches an existing file

  • Save your program and run it again. It should raise an error. Does it?

Note that in order to create a file on a computer you need permission to do so.
On your own laptop you usually have administrator rights, but if, for instance, you are running Jupyter on a remote server, you need to know which rights you have on that system. Most likely you will only be allowed to write in certain directories there.
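As a rough illustration of the point about rights, Python can ask the operating system whether you may write in a directory before you try to create a file there. This is only a sketch: the temporary directory is a stand-in for any directory of yours, and on some systems the answer given by os.access() is only approximate. If you try to create a file where you are not allowed to, open() raises a PermissionError.

```python
import os
import tempfile

# A throwaway directory, used here as a stand-in for one of your own
my_dir = tempfile.mkdtemp()

can_write = os.access(my_dir, os.W_OK)  # True if writing there is allowed
print(my_dir)
print("Can I create files here?", can_write)
```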

Which one is my current working directory?#

You need to import the os module

import os
print(os.getcwd()) # other functions (e.g. os.chdir) can change the working directory
                   # you do not need to know them for now
/home/emuro/Desktop/goingOn/teaching/p4b__jupyter_book_SoSe23/p4b__jb_built/p4b-web_book/c3_reading_and_writing_files
# another solution is
!pwd # this works only in a Jupyter cell, not within a Python program
/home/emuro/Desktop/goingOn/teaching/p4b__jupyter_book_SoSe23/p4b__jb_built/p4b-web_book/c3_reading_and_writing_files
my_first_file_object = open("./files/my_first_file.txt")
# !ls -l files
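To see how a relative path such as "./files/my_first_file.txt" relates to the working directory, it can be converted to an absolute path. This is a small sketch; the file does not need to exist for the conversion, and os.path.exists() reports whether it is really there.

```python
import os

relative_path = os.path.join("files", "my_first_file.txt")
absolute_path = os.path.abspath(relative_path)  # resolved against os.getcwd()

print(os.getcwd())
print(absolute_path)
print(os.path.exists(relative_path))  # True only if the file is really there
```

The absolute path is simply the working directory followed by the relative path, which is why open() finds different files depending on where the program is run.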

File objects#

A file object is not the plain text contained in “my_first_file.txt”. It is the object that open() returns

# show the file object
print(my_first_file_object)
<_io.TextIOWrapper name='./files/my_first_file.txt' mode='r' encoding='UTF-8'>

We use methods to deal with file objects
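Besides methods, a file object also carries some information about itself. The sketch below first creates a small throwaway file so that it runs anywhere. The temporary directory and the name "demo.txt" are assumptions made for the illustration, and writing and closing are explained later in this chapter.

```python
import os
import tempfile

# Create a small throwaway file (directory and name are only for illustration)
tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "demo.txt")
file_object = open(file_name, "w")
file_object.write("ATG\n")
file_object.close()

file_object = open(file_name)
print(type(file_object))   # the class of the object, not the text of the file
print(file_object.name)    # the path that was given to open()
print(file_object.mode)    # the mode: r
file_mode = file_object.mode
file_object.close()
```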

Reading a file object#

The read() method#

file_name = "./files/my_first_file.txt" # This file exists
file_object = open(file_name)
file_content = file_object.read()
print(file_content)
`-:-.   ,-;"`-:-.   ,-;"`-:-.   ,-;"`-:-.   ,-;"
   `=`,'=/     `=`,'=/     `=`,'=/     `=`,'=/
     y==/        y==/        y==/        y==/
   ,=,-<=`.    ,=,-<=`.    ,=,-<=`.    ,=,-<=`.
,-'-'   `-=_,-'-'   `-=_,-'-'   `-=_,-'-'   `-=_

                         ___
                   .---'-    \
      .-----------/           \
     /           (         ^  |   __
&   (             \        O  /  / .'
'._/(              '-'  (.   (_.' /
     \                    \     ./
      |    |       |    |/ '._.'
       )   @).____\|  @ |
   .  /    /       (    | 
  \|, '_:::\  . ..  '_:::\ ..\)....ATCGGTGTATGGC...mammut paleogenomics?

The name of the variables matters#

In the last example there were 3 variables.
You can call them whatever you want, but uninformative names quickly turn the code into a mess:

# Use better names for the vars than the next ones:
var1 = "./files/my_first_file.txt"
var2 = open(var1)
var3 = var2.read()
print(var3) # these names are not informative at all
`-:-.   ,-;"`-:-.   ,-;"`-:-.   ,-;"`-:-.   ,-;"
   `=`,'=/     `=`,'=/     `=`,'=/     `=`,'=/
     y==/        y==/        y==/        y==/
   ,=,-<=`.    ,=,-<=`.    ,=,-<=`.    ,=,-<=`.
,-'-'   `-=_,-'-'   `-=_,-'-'   `-=_,-'-'   `-=_

                         ___
                   .---'-    \
      .-----------/           \
     /           (         ^  |   __
&   (             \        O  /  / .'
'._/(              '-'  (.   (_.' /
     \                    \     ./
      |    |       |    |/ '._.'
       )   @).____\|  @ |
   .  /    /       (    | 
  \|, '_:::\  . ..  '_:::\ ..\)....ATCGGTGTATGGC...mammut paleogenomics?

Closing a file object#

close()#

It is a file object method with no arguments

Do not forget to close any opened file object!#

file_h = open("your_file.txt", "r")  # open for reading
...  # work with the file object here
file_h.close()  # the file needs to be closed when you are done with the file object
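A file object remembers whether it has been closed, and it refuses to work once it has. The following sketch shows this with a throwaway file; the temporary directory and the file name are assumptions made so that the example runs anywhere.

```python
import os
import tempfile

# Create a small throwaway file (directory and name are only for illustration)
tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "demo.txt")
file_h = open(file_name, "w")
file_h.write("ATG\n")
file_h.close()

file_h = open(file_name, "r")
file_content = file_h.read()
file_h.close()
print(file_h.closed)  # True: the file object is closed

try:
    file_h.read()     # reading from a closed file object is not allowed
except ValueError as error:
    closed_error = str(error)
    print("Error:", closed_error)
```

The content that was read before closing stays available in file_content, because it is an ordinary str.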

Try the next little exercise:#

  • Create an ASCII file (*.txt) on your computer. The extension (.txt) indicates that it is plain text

  • Using your favourite plain text editor, add some text to the file. Do not forget to save and close it

  • Then, using Python: open, read and display its content

  • And, of course, close the file object

A bit more on strings (str)#

How to remove whitespace and/or “\n” at the end of a str?#

For instance:

"dunaliella_salina_kozac_seq__Microalga.txt"  # file name, the extension indicates that it is plain text

The file “./files/dunaliella_salina_kozac_seq__Microalga.txt” contains a single line that ends with a return of line, “\n”. The sequence “gccaagATGgcg” has length 12, not 13, but len() reports 13 because it counts that final “\n” too.

file_obj = open("./files/dunaliella_salina_kozac_seq__Microalga.txt")  # file name not stored in a var
file_content = file_obj.read()
print("(" + file_content +")", len(file_content)) # gccaagATGgcg has len 12...but here we see 13 (because of the "\n")
(gccaagATGgcg
) 13

Solution: use specific str methods#

str.rstrip()#

  • It is a string method that removes any whitespace at the end of the string, including any return of line (“\n”)

  • As explained in the previous chapter, str.rstrip() does not modify the string; it returns a copy of the string without the trailing whitespace. This is because str is an immutable type

message =  "   ATG   " 
print("." + message + ".")          # the whitespaces are there
print("[" + message.rstrip() + "]") # the right whitespaces are not shown
print("(" + message + ")")          # all the whitespace (left and right) is still there
.   ATG   .
[   ATG]
(   ATG   )
help(str.rstrip) # help is a built-in function
Help on method_descriptor:

rstrip(self, chars=None, /)
    Return a copy of the string with trailing whitespace removed.
    
    If chars is given and not None, remove characters in chars instead.
print("ccgaaaaaaaaaaaaaaaaaa".rstrip("a"))  # in RNA, this would remove the poly(A) tail
print("atga.aazt".rstrip("zat."))
ccg
atg
file_obj = open("./files/dunaliella_salina_kozac_seq__Microalga.txt")  # file name not stored in a var
file_content = file_obj.read()

file_content.rstrip()  # the method does not change the string; the returned copy is discarded here
print(file_content, len(file_content)) # still 13, not 12:
                                       # it contains an invisible return of line

file_content = file_content.rstrip() # This is the way to modify file_content
print(file_content, len(file_content)) # Now it is right: 12
gccaagATGgcg
 13
gccaagATGgcg 12
# Some examples in books can be confusing
# For instance:
text = "atgatg    \n"
print("(" + text.rstrip("\n") + ")") # Seen in text-books
                                     # it does not remove whitespaces
                                     # This is confusing!
(atgatg    )
# now it removes ending " " + "\n"
print("("+ text.rstrip() + ")")
(atgatg)
print( "("+ text.rstrip(" \n") + ")") # both, whitespace and "\n"...works but...
print( "("+ text.rstrip()      + ")") # better!
print( "("+ text.rstrip("\n")  + ")") # dangerous!
(atgatg)
(atgatg)
(atgatg    )

str.lstrip()#

  • This is a str method. It removes any whitespace at the beginning of the string, including any return of line (“\n”)

  • Note that a string can indeed start with a return of line

message =  "   ATG   " 
print("." + message + ".")          # the whitespaces are there
print("[" + message.lstrip() + "]") # the left whitespaces are not shown
print("(" + message + ")")          # all the whitespace (left and right) is still there
.   ATG   .
[ATG   ]
(   ATG   )
Try the next example, or something of your own#

Based on a discussion between Schrödinger and Einstein on quantum superposition: Schrödinger's cat

:): (smiling and sad at the very same time) 
schroedinger_cat = ":):" # both smiling and sad at the same time!
schroedinger_smile = schroedinger_cat.rstrip(":") # smiling; from a microscopic world to a macroscopic one!
schroedinger_sad = schroedinger_cat.lstrip(":") # sad
print("Schroedinger_cat", schroedinger_cat) 
print("smile", schroedinger_smile)
print("sad", schroedinger_sad)       
Schroedinger_cat :):
smile :)
sad ):

str.strip()#

  • Again a str method. It removes any whitespace (including “\n”) at both the beginning and the end of the string

schroedinger_cat = ":):"
print("Uppss... I see no eyes", schroedinger_cat.strip(":")) 
Uppss... I see no eyes )
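To compare the three methods side by side, it can help to display the strings with the built-in repr(), which makes tabs and returns of line visible. This is a small sketch with a made-up line such as one might read from a sequence file.

```python
# A made-up line with spaces, a tab and a return of line around the sequence
line = "  \tgccaagATGgcg  \n"

right_stripped = line.rstrip()  # only the end is cleaned
left_stripped = line.lstrip()   # only the beginning is cleaned
both_stripped = line.strip()    # both ends are cleaned

print(repr(line))
print(repr(right_stripped))
print(repr(left_stripped))
print(repr(both_stripped), len(both_stripped))
```

Only strip() leaves the bare sequence of length 12. The original string in line is unchanged in all three cases.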

Concatenating methods in Python#

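Methods can be combined in Python within the same statement! This is usually called method chaining.

  • It can be less readable, but it is very common in Python code

  • It always reads from left to right: each method is applied to whatever the previous call returned

  • It is important to check that what a method returns is a valid object for the next method (for instance, read() returns a str, so a str method such as rstrip() can follow it)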

file_obj = open("./files/dunaliella_salina_kozac_seq__Microalga.txt")  # file name not stored in a var
aux_content = file_obj.read().rstrip()  # read() returns a str, and rstrip() is applied to that str
print(aux_content, len(aux_content)) # gccaagATGgcg has len 12, and not 13
gccaagATGgcg 12
# We can even make a more complicated combination
# using the same rule (read from left to right)
file_content = open("./files/dunaliella_salina_kozac_seq__Microalga.txt").read().rstrip() # read from left to right
print(file_content, len(file_content)) # gccaagATGgcg has len 12, and not 13
gccaagATGgcg 12
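To check what each link of a chain returns, the chain can be written step by step and compared with the one-line version. This is a sketch under assumptions: it first writes its own copy of the sequence to a throwaway file (writing is explained later in this chapter), so the directory and the name "kozak_demo.txt" are only illustrative.

```python
import os
import tempfile

# Create a throwaway file containing the sequence and a final return of line
tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "kozak_demo.txt")
file_obj = open(file_name, "w")
file_obj.write("gccaagATGgcg\n")
file_obj.close()

# Step by step: every call returns the object that the next call works on
file_obj = open(file_name)              # open() returns a file object
raw_content = file_obj.read()           # read() returns a str
clean_content = raw_content.rstrip()    # rstrip() returns a new str
file_obj.close()

# The same steps chained, with one more str method at the end
file_obj = open(file_name)
chained_content = file_obj.read().rstrip().upper()
file_obj.close()

print(clean_content, len(clean_content))
print(chained_content)
```

Keeping the file object in its own variable, as done here, makes it possible to close it explicitly. When open() itself is part of the chain there is no variable left to call close() on.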

Writing in a file object#

write()#

It is a method of a file object, but the file object must have been opened in a writing mode first

These are the most important modes for opening a file:

    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode (for binary code: 1001001)
     ....
    ========= ===============================================================
file_h = open("methionine.txt", "w") # in this case, the file does not need to exist beforehand
file_h.write("ATG")  # write "ATG" in the file
# write "acgatg"
file_h.write("acg" + "atg")
met = "atg"
# write "atc"
file_h.write(met.replace("g","c"))
met = "atg"
# write "3": write() only accepts a str, so the int returned by len() is converted with str()
file_h.write(str(len(met)))
met = "atg"
# write "ATG"
file_h.write(met.upper())
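The difference between the "w" and "a" modes can be checked by reading the file back after each step. This is a minimal sketch with a throwaway file; the temporary directory and the file name are assumptions made so that it runs anywhere.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "methionine_demo.txt")

file_h = open(file_name, "w")
n_written = file_h.write("ATG")  # write() returns the number of characters written
file_h.write("acg\n")            # no return of line is added unless you write it
file_h.close()

file_h = open(file_name, "a")    # "a": keep the content and add at the end
file_h.write("appended\n")
file_h.close()

file_h = open(file_name)
after_append = file_h.read()
file_h.close()

file_h = open(file_name, "w")    # "w": the previous content is erased
file_h.write("new\n")
file_h.close()

file_h = open(file_name)
after_truncate = file_h.read()
file_h.close()

print(n_written)
print(repr(after_append))
print(repr(after_truncate))
```

Note that consecutive write() calls put their text one after another on the same line, and that reopening with "w" silently discards everything that was in the file.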

Closing a file object opened for writing#

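Exactly the same as for reading: you have to close the file object after using it. It is quite important, because the text you write may be held in memory and only be fully saved to the file once it is closed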

close()#

It is a file object method with no arguments

file_h = open("methionine.txt", "w")  
...
file_h.close()  # the file needs to be closed at the end, when you are done with it 

Try the next little exercise#

  • Open your own file (in writing mode), write something in it and close it.

  • Using your OS, check that it looks fine

Then:

  • Open the very same file (now in reading mode), read and display its content

  • Do not forget to close the file object

Using “with” to close the file automatically#

with open("methionine.txt", "w") as my_file:
    # ... block of code that uses my_file
    my_file.write("whatever")

# the file does not need to be closed explicitly because "with" was used:
# Python closes it automatically at the end of the indented block.
# The very same applies to files opened for reading
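That the file really is closed after the indented block can be verified with the closed attribute of the file object. A small sketch, using a throwaway file whose directory and name are assumptions made for the illustration:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "with_demo.txt")

with open(file_name, "w") as my_file:
    my_file.write("ATG\n")
    open_inside = not my_file.closed  # True: still open inside the block

print(open_inside)
print(my_file.closed)  # True: closed as soon as the block ended

with open(file_name) as my_file:
    content = my_file.read().rstrip()

print(content)
```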

A note of advice#

Perhaps you are wondering: what about opening a file for reading and writing at the same time? An open file keeps something like a pointer that indicates where you currently are within the file, and every read or write moves it. So, in order not to get confused by that pointer while you gain more experience, I suggest that you open each file either for reading or for writing, not both
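The pointer can be observed even in a file opened only for reading. In the sketch below a second read() returns an empty string, because the pointer is already at the end of the file, and the seek() method moves it back to the beginning. The throwaway file and its name are assumptions made for the illustration.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
file_name = os.path.join(tmp_dir, "pointer_demo.txt")
file_h = open(file_name, "w")
file_h.write("ATGGCG\n")
file_h.close()

file_h = open(file_name)
first_read = file_h.read()    # the pointer moves to the end of the file
second_read = file_h.read()   # nothing is left to read: an empty string
file_h.seek(0)                # move the pointer back to the beginning
third_read = file_h.read()    # the whole content again
file_h.close()

print(repr(first_read))
print(repr(second_read))
print(repr(third_read))
```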

Summary#

  • open()

file_object = open("my_first_file.txt")

    • For reading

file_name = "./files/my_first_file.txt"
file_object = open(file_name, "r")  # "r" is the default, so it can be omitted
file_content = file_object.read()
print(file_content)
file_object.close()  # close it after using

    • For writing

file_object = open("my_first_file.txt", "w")  # use "a" instead of "w" to append

  • close()

    file_object.close()  # if you open a file you have to close it
  • read()

    file_content = file_object.read()
    print(file_content)
  • write()

    file_object.write("ATG")  # write "ATG" in the file
  • str.rstrip()

file_content = open("my_first_file.txt").read().rstrip()
  • str.lstrip(), str.strip()

Exercises#