Groups similar protein sequences of comparable lengths



FastaHerder2

This page contains basic information about FastaHerder2 and its four usage modes.

FastaHerder2 is a web resource based on the clustering of similar sequences (in length and in %identity). It allows the user to...

  ...cluster a set of sequences (mode 1).
  ...co-cluster a sequence to a previously-clustered database (mode 2). Examples: ABC2_SCHPO, CYSJ_ECOLI, DCR1_SCHPO.
  ...find a sequence and the clusters it belongs to using an AC or an ID (mode 3). Examples: P53_HUMAN, Q9BZZ5, ADA_BOVIN.
  ...search clusters using a combination of selected annotations (mode 4). Examples: example 1, example 2.

The pre-clustering step reduces the complexity of the protein database, easing the interpretation of the results of a sequence similarity search.



MODE 1: CLUSTER

Given a set of protein sequences, FastaHerder2 mode 1 clusters them based first on the protein length and identity, and then the ones that were not successfully clustered go through a second clustering method based purely on sequence identity.


To use FastaHerder2 mode 1: CLUSTER, upload a file of protein sequences in FASTA format, or paste the sequences into the text area provided. More than one sequence must be submitted for FastaHerder2 to run. The input file can be up to 2 Mb; a larger file produces an error. If you want to cluster bigger files, please contact us and we will provide you with a solution.
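
The input requirements above can be checked locally before uploading. The following is a minimal sketch and not FastaHerder2's own code: the names `parse_fasta` and `check_mode1_input` are hypothetical, and the 2 Mb limit is assumed here to mean 2 × 1024 × 1024 bytes.

```python
MAX_INPUT_BYTES = 2 * 1024 * 1024  # assumption: "2 Mb" read as 2 MiB


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_mode1_input(text):
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    if len(text.encode("utf-8")) > MAX_INPUT_BYTES:
        problems.append("input is larger than 2 Mb")
    if len(parse_fasta(text)) < 2:
        problems.append("more than one sequence is required")
    return problems
```

Calling `check_mode1_input` on the text of a FASTA file returns an empty list when both conditions are met.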

Once your sequences are entered, click the GO button to run FastaHerder2.

The threshold tolerance parameter controls the stringency of the clustering. FastaHerder2 clusters near-full-length homologs, allowing for lower sequence identity thresholds. Longer sequences can be clustered together with larger differences in length, which increases the compression. The parameter's value depends on the query's length:

Query's length              Maximum difference in length
Length ≥ 200 aa             32 aa
100 aa ≤ Length < 200 aa    20 aa
60 aa ≤ Length < 100 aa     10 aa
40 aa ≤ Length < 60 aa      7 aa
20 aa ≤ Length < 40 aa      5 aa
Length < 20 aa              2 aa
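
The table can be expressed as a small lookup function. This is an illustrative sketch only: the function names are hypothetical, each row is read as having an inclusive lower bound and an exclusive upper bound, and treating a difference exactly equal to the maximum as acceptable is an assumption.

```python
def max_length_difference(query_length):
    """Maximum difference in length (in aa) allowed for a query of this length."""
    if query_length >= 200:
        return 32
    if query_length >= 100:
        return 20
    if query_length >= 60:
        return 10
    if query_length >= 40:
        return 7
    if query_length >= 20:
        return 5
    return 2


def lengths_compatible(query_length, other_length):
    """True if the two lengths differ by no more than the query's tolerance.

    Assumption: a difference equal to the maximum is still accepted.
    """
    return abs(query_length - other_length) <= max_length_difference(query_length)
```

For example, under these assumptions a 250 aa query is compatible with a 282 aa sequence but not with a 283 aa one.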

FastaHerder2 will cluster the input sequences to produce three different output files, which will appear on the results page:

.cluster: File containing clustered IDs, one line per cluster. Clustered sequences are annotated as
 '(1:)', identifying them as clustered with high confidence, based on both length and identity.
.recluster: File containing clustered (tagged with '(1:)') and reclustered (tagged with '(2:)') sequences,
 one line per cluster. The former are high-quality clusters, taking into account the threshold
 tolerance parameter, while the latter are medium-quality clusters, clustered by identity alone
 (at least 53% identity).
.leader: FASTA file containing only the cluster leaders.

Results also provide information about the number of initial sequences, clusters and reclusters, and %compression of the initial set of sequences.
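
The page does not state how %compression is computed. One plausible reading, shown here purely as an assumption, is the percentage by which the number of entries shrinks when every cluster is represented by a single leader; the function name is hypothetical.

```python
def percent_compression(n_sequences, n_clusters):
    """Percentage reduction from n_sequences inputs to n_clusters clusters.

    Assumption: compression = 100 * (sequences - clusters) / sequences.
    This is not a formula documented by FastaHerder2.
    """
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    return 100.0 * (n_sequences - n_clusters) / n_sequences
```

Under this reading, 1000 input sequences grouped into 250 clusters would correspond to a compression of 75%.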


MODE 2: CO-CLUSTER

Given one protein sequence in FASTA format, FastaHerder2 mode 2 looks in a pre-clustered dataset for the cluster that best fits the query.


To use FastaHerder2 mode 2: CO-CLUSTER, upload a file with one sequence in FASTA format, or paste it into the text area provided. Once your sequence is entered, click the GO button to run FastaHerder2.

FastaHerder2 will co-cluster the input sequence to a previously-clustered database. It finds the most appropriate cluster for the input sequence in each of the available clustered databases. As of July 2015, we have available a clustered version of SwissProt (release 2015_05) along with 50 complete reference proteomes.

To summarize the results, the page first presents a heatmap featuring all positional domain annotations of the leaders of the clusters found. It therefore provides information about the domain architecture of a protein in different taxonomic groups. Next to each leader is the organism it belongs to, colored according to its taxonomic group:

Bacteria
Archaea
Fungi
Early Eukarya
Viridiplantae
Metazoa no Chordata
Metazoa Chordata


The results page then shows the information drawn from each cluster the input sequence can belong to. Clicking on a cluster displays the following information:

Database: Clustered database or proteome to which the cluster found belongs.
%Identity: Identity between the query and the cluster's leader.
Bit score: Score obtained from the alignment between the query and the cluster's leader.
Aln overview: Overview of the alignments between the query and the cluster's leader. An "*" is shown if
 both sequences have identities in at least two out of three amino acids (2/3), four out of
 five (4/5), or eight out of ten (8/10). If not, an "_" is shown.
Joint taxonomy (JT): Minimum common taxonomy from the complete taxonomies of all the proteins within the cluster.
Pfam information: Recurrence of domain ensembles obtained from the proteins in the cluster.
PDB entries: Set of PDB entries obtained from the proteins in the cluster.
PolyX regions: Amino acid repeats (AAR) obtained from the proteins in the cluster. An AAR is reported if
 the protein sequence has at least eight occurrences of a given amino acid out of ten.
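
The polyX criterion in the list above can be illustrated with a sliding-window scan. This is a sketch under assumptions, not the tool's implementation: "eight out of ten" is read as eight occurrences of one amino acid within ten consecutive residues, the function name is hypothetical, and positions are 0-based.

```python
from collections import Counter


def find_polyx_windows(sequence, window=10, min_count=8):
    """Return (start, residue) for each window where one residue reaches min_count.

    With the defaults (8 of 10), at most one residue can qualify per window,
    so looking only at the most common residue is sufficient.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        residue, count = Counter(sequence[start:start + window]).most_common(1)[0]
        if count >= min_count:
            hits.append((start, residue))
    return hits
```

Overlapping hits for the same residue would normally be merged into a single region; that step is left out to keep the sketch short.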


MODE 3: FIND SEQUENCE IN CLUSTERS

Given a protein identifier, FastaHerder2 mode 3 retrieves the clusters containing that protein, if it is present in a pre-clustered dataset.


To use FastaHerder2 mode 3: FIND SEQUENCE IN CLUSTERS, enter any UniProt AC or ID and click the GO button to run FastaHerder2. It will search the clustered databases (see mode 2) using the query identifier.

The results page shows the information drawn from each cluster the input sequence belongs to. Clicking on a database displays the following information:

Cluster: Clustered database or proteome to which the query sequence belongs.
Joint taxonomy (JT): Minimum common taxonomy from the complete taxonomies of all the proteins within the cluster.
Pfam information: Recurrence of domain ensembles obtained from the proteins in the cluster.
PDB entries: Set of PDB entries obtained from the proteins in the cluster.
PolyX regions: Amino acid repeats (AAR) obtained from the proteins in the cluster. An AAR is reported if
 the protein sequence has at least eight occurrences of a given amino acid out of ten.
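
The joint taxonomy can be pictured as the part of the lineage shared by every protein in the cluster. The sketch below assumes each lineage is an ordered list of taxon names starting at the root, and that "minimum common taxonomy" means the longest shared prefix of those lists; the function name is hypothetical and the tool's own procedure may differ.

```python
def joint_taxonomy(lineages):
    """Return the longest prefix shared by all lineages (root-first lists of taxa)."""
    if not lineages:
        return []
    shared = []
    # zip stops at the shortest lineage, so the prefix never overruns any of them.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared
```

For instance, a cluster with one primate and one rodent protein would share the lineage down to Mammalia, while adding an arthropod protein would shorten the joint taxonomy to Metazoa.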


MODE 4: SEARCH CLUSTERS

Given a set of annotations, FastaHerder2 mode 4 retrieves the clusters in the pre-clustered dataset that match the restrictions.


To use FastaHerder2 mode 4: SEARCH CLUSTERS, you must select at least one of the available annotations to restrict the search. The default selection for each annotation is DM ("doesn't mind"), but it can be set to YES (at least one sequence from the cluster must have that annotation) or NO (the cluster must not have any sequence with that annotation).
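
The three-way selection can be read as a simple filter over clusters. The following sketch only illustrates the DM/YES/NO semantics described above; the function name and the data layout (a set of annotation names present in at least one sequence of the cluster) are assumptions.

```python
VALID_CHOICES = {"DM", "YES", "NO"}


def cluster_matches(annotations_present, restrictions):
    """True if a cluster satisfies every restriction.

    annotations_present: set of annotation names found in at least one sequence.
    restrictions: dict mapping annotation name to "DM", "YES" or "NO".
    """
    for annotation, choice in restrictions.items():
        if choice not in VALID_CHOICES:
            raise ValueError(f"unknown choice {choice!r} for {annotation!r}")
        present = annotation in annotations_present
        # YES requires the annotation, NO forbids it, DM ignores it.
        if (choice == "YES" and not present) or (choice == "NO" and present):
            return False
    return True
```

A cluster with PDB and Pfam annotations would pass `{"PDB": "YES", "PolyX": "NO", "PMID": "DM"}`, because the PMID setting places no constraint.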

Clusters are built from SwissProt release 2015_05 (see mode 2). The restricted search allows the user to locate the complete set of clusters that match the selected restrictions. The available annotations to select are:

Length: Minimum and/or maximum length of the sequences in a cluster. The minimum can't be greater than
 the maximum.
Number of proteins: Minimum and/or maximum number of proteins in a cluster. The minimum can't be greater than the
 maximum.
PDB annotation: The user can select whether the cluster must or mustn't have proteins with PDB entries.
PolyX regions: Presence of polyX regions in the proteins from the cluster. They are computed by scanning each
 protein in windows of ten amino acids and accepting a polyX region if a window contains at least
 eight occurrences of the same amino acid (Schaefer et al., 2012).
Pfam domains: Presence and/or absence of a set of Pfam domains in the proteins from one cluster. To input more
 than one Pfam domain name, link them together using "+". Example: "smn+tudor". Searches are case
 insensitive. The user can also select whether the cluster must or mustn't have proteins with Pfam
 domains in general.
PMID: Link to PubMed articles drawn from the information from each protein in the cluster. The user can
 select whether the cluster must or mustn't have proteins with PMID information. Only PMIDs cited
 fewer than 13 times in SwissProt are taken into account, to discard literature reporting genomes,
 cDNA libraries, large scale reports, etc.
Organism: Clusters with sequences from a specific organism or taxonomic group. To select an organism, the
 user must use its taxonomic id (e.g. 9606 for Homo sapiens). If you don't know the taxonomic id
 associated with a specific organism, please use our tool to find taxonomic id. To search using a
 complete taxonomic group, its name must be used (e.g. Mammalia).
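
The PMID cutoff in the list above can be sketched as a citation count followed by a filter. This is an illustration under assumptions: "cited N times in SwissProt" is taken to mean cited by N distinct entries, and the function name and input layout are hypothetical.

```python
from collections import Counter

MAX_CITATIONS = 13  # PMIDs cited this many times or more are discarded


def informative_pmids(pmids_per_protein, max_citations=MAX_CITATIONS):
    """Return the PMIDs cited by fewer than max_citations proteins.

    pmids_per_protein: dict mapping a protein identifier to its PMIDs.
    """
    counts = Counter()
    for pmids in pmids_per_protein.values():
        # set() ensures a PMID listed twice in one entry is counted once.
        counts.update(set(pmids))
    return {pmid for pmid, n in counts.items() if n < max_citations}
```

A PMID attached to 13 or more entries, as is typical for a genome paper, is dropped, while one attached to a handful of entries is kept.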

The first section of the results shows the search settings, that is, the selected restrictions. It is followed by the information drawn from each cluster that matches the restrictions. Clicking on a cluster's leader displays the same information as in mode 3. The user can also display the whole cluster. If there are more than 200 results, they are not displayed; the user should then restrict the search further to obtain fewer results.



