Project
Proteins in UniRef50 come from clustering the UniProtKB database: every sequence in a cluster has at least 50% sequence identity to the longest sequence in that cluster. Because UniRef50 is already a simplified version of UniProtKB, it is taken as the starting point for simplifying UniProtKB further. The cluster representatives therefore share less than 50% identity with one another. UniRef50 was split into 201 bins of up to 100,000 sequences, and an all-versus-all comparison was performed within each bin separately. The 201 resulting datasets were joined to form an intermediate database (Intermediate1), a 37.55% compression compared to the initial 20,083,468 proteins. A second iteration of the same procedure was performed, using a length-ordered version of the Intermediate1 database as the initial dataset. This reduced it further to an Intermediate2 database (31.30% compression compared to Intermediate1). As a last iteration, a randomly shuffled version of the Intermediate2 database was used as the initial dataset. In this case the compression achieved was low (4.54%), and a dataset called preUEP (pre-Unique Earth's Proteome) with 8,225,772 sequences was generated. At this point the simplified, bin-based all-versus-all strategy was no longer able to reduce the redundancy of the intermediate datasets. To calculate the UEP dataset, one would have to compare all the proteins in the preUEP database against each other, which we are not yet capable of doing due to computational and time limitations.
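The bin-based procedure described above can be sketched roughly as follows. This is only a minimal illustration, not the pipeline actually used: the identity function `toy_identity`, the greedy keep/drop rule, and the `threshold` parameter are all assumptions made for the example, and `compression` assumes that "compression" means the percentage of sequences removed.

```python
import random
from typing import Callable, List


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for a real alignment-based identity score:
    fraction of matching positions relative to the longer sequence."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def reduce_bin(seqs: List[str],
               identity: Callable[[str, str], float] = toy_identity,
               threshold: float = 0.5) -> List[str]:
    """All-versus-all inside one bin (assumed greedy rule): visit sequences
    longest first and keep one only if it is below `threshold` identity
    to every sequence already kept."""
    kept: List[str] = []
    for s in sorted(seqs, key=len, reverse=True):
        if all(identity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


def reduce_in_bins(seqs: List[str], bin_size: int,
                   identity: Callable[[str, str], float] = toy_identity,
                   threshold: float = 0.5) -> List[str]:
    """Split the dataset into consecutive bins, reduce each bin
    independently, and join the per-bin results."""
    joined: List[str] = []
    for start in range(0, len(seqs), bin_size):
        joined.extend(reduce_bin(seqs[start:start + bin_size],
                                 identity, threshold))
    return joined


def compression(before: int, after: int) -> float:
    """Percentage of sequences removed (assumed meaning of 'compression')."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    data = ["AAAA", "AAAT", "CCCC", "AAAA"]  # toy sequences, not real proteins

    # Iteration 1: original order.
    inter1 = reduce_in_bins(data, bin_size=2)
    # Iteration 2: same procedure on a length-ordered version.
    inter2 = reduce_in_bins(sorted(inter1, key=len, reverse=True), bin_size=2)
    # Iteration 3: same procedure on a randomly shuffled version.
    shuffled = inter2[:]
    random.Random(0).shuffle(shuffled)
    pre_uep = reduce_in_bins(shuffled, bin_size=2)

    for name, before, after in [("Intermediate1", data, inter1),
                                ("Intermediate2", inter1, inter2),
                                ("preUEP", inter2, pre_uep)]:
        print(name, len(after), f"{compression(len(before), len(after)):.2f}%")
```

Reordering the dataset between iterations changes which sequences share a bin, which is why each pass can remove redundancy that the previous one missed. Redundant sequences that never land in the same bin survive every pass, so only a full all-versus-all comparison would guarantee a fully reduced dataset.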
Download datasets
If you have questions or suggestions, please contact Dr. Pablo Mier.