Towards completion of the Earth's proteome: an update a decade later

Project

UniRef50 is built by clustering the UniProtKB database so that every sequence in a cluster has at least 50% sequence identity to the longest sequence in that cluster. The cluster representatives therefore share less than 50% identity with one another. Because UniRef50 is already a simplified version of UniProtKB, it is taken as the starting point for simplifying the database further. UniRef50 was split into 201 bins of 100,000 sequences, and an all-versus-all comparison was performed within each bin separately. The 201 resulting datasets were joined to form an intermediate database (Intermediate1), a 37.55% compression relative to the initial 20,083,468 proteins. A second iteration of the same procedure, starting from a length-ordered version of Intermediate1, reduced it further to an Intermediate2 database (31.30% compression relative to Intermediate1). As a last iteration, a randomly shuffled version of Intermediate2 was used as the initial dataset. Here the compression achieved was low (4.54%), and a dataset of 8,225,772 sequences called preUEP (pre-Unique Earth's Proteome) was generated. At this point the simplified all-versus-all strategy using bins was no longer able to reduce the redundancy of the intermediate datasets. To calculate the UEP dataset, one would have to compare all the proteins in preUEP against one another, which we are still not capable of doing due to computational and time limitations.
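The binned procedure described above can be illustrated with a minimal Python sketch. It is not the pipeline used in the project: the function names, the toy sequences, the bin size of 4 and the substring-based `toy_redundant` criterion are all hypothetical stand-ins, since the text does not state which comparison tool or redundancy threshold was applied. Sorting by decreasing length in the second pass is also an assumption, as the text only says "ordered by length".

```python
import random
from typing import Callable, List

Redundant = Callable[[str, str], bool]


def reduce_in_bins(seqs: List[str], bin_size: int, is_redundant: Redundant) -> List[str]:
    """One iteration: split into consecutive bins, compare all-versus-all
    inside each bin only, then join the survivors of every bin."""
    kept: List[str] = []
    for start in range(0, len(seqs), bin_size):
        survivors: List[str] = []
        for seq in seqs[start:start + bin_size]:
            # Keep a sequence only if it is not redundant with any
            # sequence already kept from the same bin.
            if not any(is_redundant(seq, rep) for rep in survivors):
                survivors.append(seq)
        kept.extend(survivors)
    return kept


def compression(before: int, after: int) -> float:
    """Percentage of sequences removed relative to the input size."""
    return 100.0 * (before - after) / before


def toy_redundant(a: str, b: str) -> bool:
    # Hypothetical placeholder criterion: one sequence contains the other.
    # A real run would use a sequence-similarity search instead.
    return a in b or b in a


toy = ["MKT", "MKTAYIAK", "GGS", "PLV", "MKTAYIAKQR", "GGSGGS", "PLVW", "HHH"]
BIN_SIZE = 4  # the project used bins of 100,000 sequences

# Iteration 1: original order.
intermediate1 = reduce_in_bins(toy, BIN_SIZE, toy_redundant)

# Iteration 2: length-ordered input (longest first is an assumption).
by_length = sorted(intermediate1, key=len, reverse=True)
intermediate2 = reduce_in_bins(by_length, BIN_SIZE, toy_redundant)

# Iteration 3: randomly shuffled input.
shuffled = list(intermediate2)
random.Random(0).shuffle(shuffled)
pre_uep = reduce_in_bins(shuffled, BIN_SIZE, toy_redundant)

for name, before, after in [
    ("Intermediate1", toy, intermediate1),
    ("Intermediate2", intermediate1, intermediate2),
    ("preUEP", intermediate2, pre_uep),
]:
    print(f"{name}: {len(after)} sequences, "
          f"{compression(len(before), len(after)):.2f}% compression")
```

Reordering the input between iterations matters because redundant pairs are only detected when both members land in the same bin. In the toy data, `GGS` and `GGSGGS` can survive an iteration simply because they are never placed together, which mirrors why a full all-versus-all comparison of preUEP would be needed to reach the UEP dataset.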


Towards completion of the Earth's proteome: what has happened in the last ten years?

Pablo Mier and Miguel A. Andrade-Navarro

Faculty of Biology, Johannes Gutenberg University Mainz, Gresemundweg 2, 55128 Mainz, Germany

Institute of Molecular Biology, Ackermannweg 4, 55128 Mainz, Germany
