C9 Exercises#

c9e1: Saccharomyces_cerevisiae ENSEMBL peptide analysis#

In this exercise, the input file is “c9e1_input__15FirstSeqs_Saccharomyces_cerevisiae.R64-1-1.pep.all.fa”, that contains the first 15 sequences from the file Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.
Saccharomyces_cerevisiae.R64-1-1.pep.all.fa can be downloaded from ENSEMBL v106 repositories, see below.

The input file has a classical multifasta structure:

>id1 description (Note that there is no space between > and id)
seq1
>id2 description
seq2
...and so on

For example, the first two sequences are:

>YPL071C pep chromosome:R64-1-1:XVI:420048:420518:-1 gene:YPL071C transcript:YPL071C_mRNA gene_biotype:protein_coding transcript_biotype:protein_coding description:Putative protein of unknown function; green fluorescent protein (GFP)-fusion protein localizes to both the cytoplasm and the nucleus [Source:SGD;Acc:S000005992]
MSSRFARSNGNPNHIRKRNHSPDPIGIDNYKRKRLIIDLENLSLNDKGPKNGHADDNNLI
HNNIVFTDAIDDKVLKEIIKCSTSKRGDNDLFYDKIWERLREKRLQIIKWVDYKEIAYLS
WWKWFHNQMTSKYTYDGEADTDVEMMAVDTDVDMDA
>YLL050C pep chromosome:R64-1-1:XII:39804:40414:-1 gene:YLL050C transcript:YLL050C_mRNA gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:COF1 description:Cofilin, involved in pH-dependent actin filament depolarization; binds both actin monomers and filaments and severs filaments; involved in the selective sorting, export of the secretory cargo from the late golgi; genetically interacts with pmr1; thought to be regulated by phosphorylation at SER4; ubiquitous and essential in eukaryotes [Source:SGD;Acc:S000003973]
MSRSGVAVADESLTAFNDLKLGKKYKFILFGLNDAKTEIVVKETSTDPSYDAFLEKLPEN
DCLYAIYDFEYEINGNEGKRSKIVFFTWSPDTAPVRSKMVYASSKDALRRALNGVSTDVQ
GTDFSEVSYDSVLERVSRGAGSH

Then,

a. When executing your program from the terminal (“python3 your_program_name.py first_argument second_argument third_argument…”), pass the input-file-name as a command line argument to your program (here we have a unique argument). The name of your program will be c9e1_Saccharomyces_cerevisiae_ENSEMBL_peptide_analysis.py and your program has to check that you are not passing more or less command line arguments. Then, you call your program as:

python3 c9e1_Saccharomyces_cerevisiae_ENSEMBL_peptide_analysis.py c9e1_input__15FirstSeqs_cerevisiae.fa

b. Steps:

  • Retrieve the fasta file sequences

  • Create a dictionary (key:id, val:sequence) with the fasta sequences

  • Show the dictionary

c. Steps:

  • Create a new dictionary with the length of each sequence (key:id, val:length)

  • Show the dictionary

d. Create a directory (using Python) to store the outputs of this problem. It should be located in the dir where your program is (show also your current working directory). After creating the dir, you should have the next directory structure:

├── c9e1_Saccharomyces_cerevisiae_ENSEMBL_peptide_analysis.py            # This is your main program
├── c9e1_input__15FirstSeqs_Saccharomyces_cerevisiae.R64-1-1.pep.all.fa  # Your input file
└── c9e1_output_files                                                    # The dir, ready to store the outputs

e. For each sequence:

  1. Calculate the last digit from its length. For instance, for ‘YPL071C’, that was a sequence of length 156, the last digit is 6.

  2. Create a subdirectory within c9e1_output_files. The name of the subdirectory depends on the previous last digit (ie. last_digit_6/). Check first if the subdirectory already exists, in that case just do nothing.

  3. Create a file for the sequence (using its id: id.txt). The file needs to be in the subdirectory (last_digit_6/ in the case of YPL071C.txt). The file will contain the whole length of the sequence.

f. Steps:

  • Delete the information from the dictionaries: sequences, lengths, etc. To be in the safe side: do not erase dirs or files

my_dict.clear() # This method deletes all the items of my_dict
  • list the subdirectories and files within c9e1_output_files (using python) and show how many sequences end up in each digit, show also the sequences for each digit.

Sample#

Input:

c9e1_input__15FirstSeqs_Saccharomyces_cerevisiae.R64-1-1.pep.all.fa

Output b:

{'YLL050C': 'MSRSGV...', 'YMR172W': 'MSGMG...', ...} # Note that there is no key order in a dictionary

Output c:

{YLL050C': 143,'YMR172W': 719, ...}

Output d:

├── c9e1_Saccharomyces_cerevisiae_ENSEMBL_peptide_analysis.py            # This is your main program
├── c9e1_input__15FirstSeqs_Saccharomyces_cerevisiae.R64-1-1.pep.all.fa  # Your input file
└── c9e1_output_files                                                    # The dir, ready to store the outputs

Output e:

c9e1_output_files
├── last_digit_0
│   ├── YBR225W.txt
│   ├── YMR027W.txt
...and so on

For instance:
YMR027W.txt will contain:
470

Output f:

last_digit_0	3	['YOR185C.txt', 'YBR225W.txt', 'YMR027W.txt']
last_digit_2	3	['YOL048C.txt', 'YML089C.txt', 'YNR017W.txt']
last_digit_3	1	['YLL050C.txt']
...and so on

Extra task#

Download by yourself the original file from ENSEMBL 106 and run against all the proteome:

1.-Create a directory/folder to store the ENSEMBL input files in the dir where your code/program (c9e1_Saccharomyces_cerevisiae_ENSEMBL_peptide_analysis.py) is. For instance for Linux/MacOS:

mkdir c9e1_input_files

2.-Download the files (README and Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz) within c9e1_input_files/:
http://ftp.ensembl.org/pub/release-106/fasta/saccharomyces_cerevisiae/pep/

By now, you should have the next directory structure:

c9e1_Saccharomyces_cerevisiae_ENSEMBL_peptide_analysis.py # This is your code-solution
c9e1_input_files
├── README
└── Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz

3.-Read the README file and Inspect the Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz compressed file you have just download. You can decompress it and read it, but you can also read it without decompressing the file, from your OS’s commands. From a Linux-like OS it will be:

 gzcat Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz | more

 This shows the content of the gzip compressed file without decompressing it

Now, it is clear that the file is a compressed (gzip) multifasta containing all the polypeptides from Saccharomyces_cerevisiae genes. But, how many? Again the solution is simple from the terminal:

 zgrep -e "^>" Saccharomyces_cerevisiae.R64-1-1.pep.all.fa.gz | wc -l

 This counts the number of lines that contains ">", the number of fasta sequences

There are 6600 peptides in the multifasta.#

Note#

Be careful parsing fasta files. The next sequence descriptor could lead to error, because it contains a “>” in them middle of the description. That is why you can consider to include “\n>” as a pattern to split sequences.

>YBL055C pep chromosome:R64-1-1:II:115573:116829:-1 gene:YBL055C transcript:YBL055C_mRNA gene_biotype:protein_coding transcript_biotype:protein_coding description:3'-->5' exonuclease and endonuclease with a possible role in apoptosis; has similarity to mammalian and C. elegans apoptotic nucleases [Source:SGD;Acc:S000000151]