FastaHerder2

Groups similar protein sequences of comparable lengths

This page contains basic information about FastaHerder2 and its four usage modes. FastaHerder2 is a web resource based on the clustering of similar sequences (in length and in %identity). It allows the user to:

+ cluster a set of sequences (mode 1).
+ co-cluster a sequence to a previously-clustered database (mode 2). Examples: ABC2_SCHPO, CYSJ_ECOLI, DCR1_SCHPO.
+ find a sequence and the clusters it belongs to using an AC or an ID (mode 3). Examples: P53_HUMAN, Q9BZZ5, ADA_BOVIN.
+ search clusters using a combination of selected annotations (mode 4). Examples: example 1, example 2.

The pre-clustering step reduces the complexity of the protein database, easing the interpretation of the results of a sequence similarity search.

MODE 1: CLUSTER

Given a set of protein sequences, FastaHerder2 mode 1 first clusters them based on protein length and sequence identity. The sequences that were not successfully clustered then go through a second clustering method based purely on sequence identity.

To use FastaHerder2 mode 1: CLUSTER, upload a file of protein sequences in fasta format, or paste the sequences in the available text area. More than one sequence must be submitted to start the execution of FastaHerder2. The input file can be up to 2 MB; a bigger file will produce an error. If you want to cluster bigger files, please contact us and we will provide you with a solution.
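The input requirements above can be checked locally before submitting. The following is a minimal sketch, not FastaHerder2's own code: the function names are hypothetical, and reading the size limit as 2 megabytes (2 * 1024 * 1024 bytes) is an assumption.

```python
MAX_INPUT_BYTES = 2 * 1024 * 1024  # assumption: the size limit read as 2 megabytes


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_mode1_input(text):
    """Hypothetical pre-submission check: size limit and more than one sequence."""
    if len(text.encode("utf-8")) > MAX_INPUT_BYTES:
        raise ValueError("input is larger than the size limit")
    records = parse_fasta(text)
    if len(records) < 2:
        raise ValueError("mode 1 needs more than one sequence")
    return records
```

For example, check_mode1_input(">a\nMKV\nLLA\n>b\nMKT\n") returns the two parsed records, while a single-sequence input raises ValueError.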

Once the sequences are provided, click the GO button to run FastaHerder2.

The threshold tolerance parameter controls the stringency of the clustering. FastaHerder2 clusters near-full length homologs, allowing for lower sequence identity thresholds. Longer sequences can be clustered together with larger differences in length, which increases the compression. The parameter's value depends on the query's length:

Query's length		Maximum difference in length
Length ≥ 200aa		32 aa
100aa ≤ Length < 200aa		20 aa
60aa ≤ Length < 100aa		10 aa
40aa ≤ Length < 60aa		7 aa
20aa ≤ Length < 40aa		5 aa
Length < 20aa		2 aa
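
The table above can be expressed as a small lookup. This is a sketch only: it reads each row as a range with an inclusive lower bound and an exclusive upper bound, and the function names are hypothetical, not part of FastaHerder2.

```python
# (inclusive lower bound of the query length, maximum difference in length), in aa
LENGTH_TOLERANCE = (
    (200, 32),
    (100, 20),
    (60, 10),
    (40, 7),
    (20, 5),
    (0, 2),
)


def max_length_difference(query_length):
    """Return the maximum allowed difference in length for a query length."""
    if query_length < 0:
        raise ValueError("length must not be negative")
    for lower_bound, tolerance in LENGTH_TOLERANCE:
        if query_length >= lower_bound:
            return tolerance


def within_length_tolerance(query_length, candidate_length):
    """True if the candidate is close enough in length to the query."""
    difference = abs(query_length - candidate_length)
    return difference <= max_length_difference(query_length)
```

With these values, a query of 150 aa tolerates candidates between 130 and 170 aa, whereas a query of 30 aa tolerates only 25 to 35 aa.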

FastaHerder2 will cluster the input sequences to produce three different output files, which will appear in the results page:

.cluster		File containing clustered IDs, one line per cluster. Clustered sequences are annotated as '(1:)', identifying them as clustered with high confidence, based on both length and identity.
.recluster		File containing clustered (tagged with '(1:)') and reclustered (tagged with '(2:)') sequences, one line per cluster. The former are high-quality clusters, built taking into account the threshold tolerance parameter, while the latter are medium-quality clusters, built on sequence identity alone (at least 53% identity).
.leader		Fasta file containing only the cluster's leaders.

Results also provide information about the number of initial sequences, clusters and reclusters, and %compression of the initial set of sequences.
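
This page does not state how %compression is computed. As an illustration only, the sketch below assumes it is the percentage by which the number of sequences is reduced when each cluster is represented by one sequence; both the formula and the function name are assumptions.

```python
def percent_compression(n_sequences, n_clusters):
    """Assumed definition: percentage reduction from sequences to clusters."""
    if n_sequences <= 0:
        raise ValueError("at least one sequence is needed")
    if not 0 < n_clusters <= n_sequences:
        raise ValueError("clusters must be between 1 and the number of sequences")
    return 100.0 * (1 - n_clusters / n_sequences)
```

Under this assumed definition, 1000 input sequences grouped into 250 clusters give a compression of 75%.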

MODE 2: CO-CLUSTER

Given one protein sequence in fasta format, FastaHerder2 mode 2 looks in a pre-clustered dataset for the most suitable cluster for the query.

To use FastaHerder2 mode 2: CO-CLUSTER, upload a file with one sequence in fasta format, or paste it in the available text area. Once the sequence is provided, click the GO button to run FastaHerder2.

FastaHerder2 will co-cluster the input sequence to a previously-clustered database. It finds the most appropriate cluster for the input sequence in each of the available clustered databases. As of July 2015, we have available a clustered version of SwissProt (release 2015_05) along with 50 complete reference proteomes.

BACTERIA (5)
+ Escherichia coli s.K12 (Proteobacteria)
+ Bacillus subtilis s.168 (Firmicutes)
+ Helicobacter pylori s.ATCC (Proteobacteria)
+ Deinococcus radiodurans s.ATCC (Deinococcus-Thermus)
+ Synechocystis sp s.PCC 6803/Kazusa (Cyanobacteria)

ARCHAEA (4)
+ Halobacterium salinarum strain ATCC (Euryarchaeota)
+ Korarchaeum cryptofilum strain OPF8 (Korarchaeota)
+ Nanoarchaeum equitans (Nanoarchaeota)
+ Sulfolobus solfataricus strain ATCC (Crenarchaeota)

EARLY EUKARYA (8)
+ Trypanosoma cruzi (Euglenozoa)
+ Monosiga brevicollis (Choanoflagellate)
+ Dictyostelium discoideum (Amoebozoa)
+ Giardia lamblia (Diplomonadida)
+ Paramecium tetraurelia (Alveolata, Ciliophora)
+ Plasmodium falciparum isolate 3D7 (Alveolata, Apicomplexa)
+ Reticulomyxa filosa (Rhizaria)
+ Thalassiosira oceanica (Stramenopiles)

FUNGI (6)
+ Encephalitozoon cuniculi s.GB-M1 (Microsporidia)
+ Schizosaccharomyces pombe (Ascomycota)
+ Saccharomyces cerevisiae s.ATCC (Ascomycota)
+ Kluyveromyces lactis s.ATCC (Ascomycota)
+ Neurospora crassa s.ATCC (Ascomycota)
+ Rhizopus delemar (Rhizopus)

METAZOA NO CHORDATA (10)
+ Trichoplax adhaerens (Placozoa)
+ Strongylocentrotus purpuratus (Echinodermata)
+ Nematostella vectensis (Cnidaria)
+ Schistosoma mansoni (Platyhelminthes)
+ Crassostrea gigas (Mollusca)
+ Caenorhabditis elegans (Nematoda)
+ Drosophila melanogaster (Arthropoda, Insecta)
+ Apis mellifera (Arthropoda, Insecta)
+ Anopheles gambiae (Arthropoda, Insecta)
+ Daphnia pulex (Arthropoda, Crustacea)

PLANTAE (5)
+ Volvox carteri (Chlorophyta)
+ Chlamydomonas reinhardtii (Chlorophyta)
+ Selaginella moellendorffii (Streptophyta)
+ Oryza sativa subsp. japonica (Angiosperms)
+ Arabidopsis thaliana (Angiosperms)

METAZOA CHORDATA (12)
+ Ciona intestinalis (Tunicata)
+ Branchiostoma floridae (Cephalochordata)
+ Anolis carolinensis (Iguania)
+ Takifugu rubripes (Teleostei)
+ Danio rerio (Teleostei)
+ Xenopus tropicalis (Amphibia)
+ Taeniopygia guttata (Aves)
+ Gallus gallus (Aves)
+ Ornithorhynchus anatinus (Mammalia)
+ Bos taurus (Mammalia)
+ Mus musculus (Mammalia)
+ Homo sapiens (Mammalia)

To summarize the results, the page first presents a heatmap featuring all positional domain annotations of the leaders of all the clusters found. It therefore provides information about the domain architecture of a protein in different taxonomic groups. Next to each leader is the organism it belongs to, colored according to its taxonomic group:

	Bacteria
	Archaea
	Fungi
	Early Eukarya
	Viridiplantae
	Metazoa no Chordata
	Metazoa Chordata

Then, the results page features the information drawn from each cluster the input sequence can belong to. When a cluster is displayed (by clicking on it), the information shown is:

Database		Clustered database or proteome to which the found cluster belongs.
%Identity		Identity between the query and the cluster's leader.
Bit score		Score obtained from the alignment between the query and the cluster's leader.
Aln overview		Overview of the alignment between the query and the cluster's leader. An "*" is shown if both sequences have identities in at least two out of three amino acids (2/3), four out of five (4/5), or eight out of ten (8/10). If not, an "_" is shown.
Joint taxonomy (JT)		Minimum common taxonomy from the complete taxonomies of all the proteins within the cluster.
Pfam information		Recurrence of domain ensembles obtained from the proteins in the cluster.
PDB entries		Set of PDB entries obtained from the proteins in the cluster.
PolyX regions		Amino acid repeats (AAR) obtained from the proteins in the cluster. An AAR is reported if the protein sequence has at least eight occurrences of a given amino acid within a stretch of ten.
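
The "Aln overview" rule above can be illustrated with a short sketch. The page does not say how the windows are placed, so this example assumes sliding windows of 3, 5 and 10 alignment columns and marks every column covered by a window that reaches the identity threshold. The function name and the use of "-" for gaps are also assumptions.

```python
# (window length, minimum identical positions): 2/3, 4/5 and 8/10
WINDOW_RULES = ((3, 2), (5, 4), (10, 8))


def alignment_overview(aligned_query, aligned_leader):
    """Return a string of '*' and '_' with one character per alignment column."""
    if len(aligned_query) != len(aligned_leader):
        raise ValueError("aligned sequences must have the same length")
    # A column is an identity if both residues match and neither is a gap.
    identical = [a == b and a != "-" for a, b in zip(aligned_query, aligned_leader)]
    n = len(identical)
    marked = [False] * n
    for window, minimum in WINDOW_RULES:
        for start in range(n - window + 1):
            if sum(identical[start:start + window]) >= minimum:
                for column in range(start, start + window):
                    marked[column] = True
    return "".join("*" if m else "_" for m in marked)
```

Note that, under this window-based reading, a mismatched column can still be shown as "*" when it lies inside a window that is similar overall.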

MODE 3: FIND SEQUENCE IN CLUSTERS

Given a protein identifier, FastaHerder2 mode 3 retrieves the clusters containing that protein, if it is present in a pre-clustered dataset.

To use FastaHerder2 mode 3: FIND SEQUENCE IN CLUSTERS, enter any UniProt AC or ID and click the GO button to run FastaHerder2. It will search the clustered databases (see mode 2) using the query identifier.

The results page features the information drawn from each cluster the input sequence belongs to. When a cluster is displayed (by clicking on the database), the information shown is:

Cluster		Clustered database or proteome to which the query sequence belongs.
Joint taxonomy (JT)		Minimum common taxonomy from the complete taxonomies of all the proteins within the cluster.
Pfam information		Recurrence of domain ensembles obtained from the proteins in the cluster.
PDB entries		Set of PDB entries obtained from the proteins in the cluster.
PolyX regions		Amino acid repeats (AAR) obtained from the proteins in the cluster. An AAR is reported if the protein sequence has at least eight occurrences of a given amino acid within a stretch of ten.
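
The joint taxonomy (JT) in the table above can be pictured as the part of the lineage shared by every protein in the cluster. The sketch below assumes each lineage is given as a list of names ordered from the root downwards; the function name and this input format are hypothetical.

```python
def joint_taxonomy(lineages):
    """Return the leading part of the lineage shared by all the proteins."""
    common = []
    # zip stops at the shortest lineage, so the result never exceeds it.
    for names in zip(*lineages):
        if all(name == names[0] for name in names):
            common.append(names[0])
        else:
            break
    return common
```

For a cluster holding a human and a mouse protein, with lineages ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates"] and ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Rodentia"], the shared part ends at Mammalia.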

MODE 4: SEARCH CLUSTERS

Given a set of annotations, FastaHerder2 mode 4 retrieves the clusters in the pre-clustered dataset that match the restrictions.

To use FastaHerder2 mode 4: SEARCH CLUSTERS, you must select at least one of the available annotations to restrict the search. The default selection for each annotation is DM ("doesn't mind"), but it can be set to YES (at least one sequence from the cluster must have that annotation) or NO (the cluster must not have any sequence with that annotation).
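
The three settings described above behave like a three-state filter. The following is a minimal sketch with hypothetical names, where each cluster is reduced to one boolean per sequence stating whether that sequence carries the annotation.

```python
def matches_restriction(sequence_has_annotation, setting):
    """Apply a DM / YES / NO setting to the sequences of one cluster."""
    if setting == "DM":
        return True  # "doesn't mind": the annotation is not used as a filter
    if setting == "YES":
        return any(sequence_has_annotation)  # at least one annotated sequence
    if setting == "NO":
        return not any(sequence_has_annotation)  # no annotated sequence at all
    raise ValueError("setting must be DM, YES or NO")


def matches_all(cluster_annotations, settings):
    """True if the cluster satisfies every selected restriction."""
    return all(
        matches_restriction(cluster_annotations[name], setting)
        for name, setting in settings.items()
    )
```

For instance, a cluster with annotations {"pdb": [True, False], "polyx": [False, False]} matches the settings {"pdb": "YES", "polyx": "NO"}.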

Clusters are built from SwissProt release 2015_05 (see mode 2). The restricted search allows the user to locate the complete set of clusters that match the selected restrictions. The available annotations to select are:

Length		Minimum and/or maximum length from the sequences in a cluster. The minimum can't be greater than the maximum.
Number of proteins		Minimum and/or maximum number of proteins in a cluster. The minimum can't be greater than the maximum.
PDB annotation		The user can select whether the cluster must or mustn't have proteins with PDB entries.
PolyX regions		Presence of polyX regions in the proteins from the cluster. They are computed by taking successive subsets of ten amino acids in each protein; a polyX region is validated if a subset contains at least eight occurrences of the same amino acid (Schaefer et al., 2012).
Pfam domains		Presence and/or absence of a set of Pfam domains in the proteins from one cluster. To input more than one Pfam domain name, link them together using "+". Example: "smn+tudor". Searches are case insensitive. The user can also select whether the cluster must or mustn't have proteins with Pfam domains in general.
PMID		Link to PubMed articles drawn from the information from each protein in the cluster. The user can select whether the cluster must or mustn't have proteins with PMID information. Only PMIDs cited fewer than 13 times in SwissProt are taken into account, to discard literature reporting genomes, cDNA libraries, large scale reports, etc.
Organism		Clusters with sequences from a specific organism or taxonomic group. To select an organism, the user must use its taxonomic id (e.g. 9606 for Homo sapiens). If you don't know the taxonomic id associated with a specific organism, please use our tool to find the taxonomic id. To search using a complete taxonomic group, its name must be used (e.g. Mammalia).
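
The polyX criterion in the table above (at least eight occurrences of one amino acid in a subset of ten) can be sketched as a sliding-window scan. The function name, the one-residue step between windows and the returned format are assumptions for illustration, not the published implementation.

```python
from collections import Counter


def polyx_windows(sequence, window=10, min_count=8):
    """Return (start, residue) for each window dominated by a single residue."""
    hits = []
    for start in range(len(sequence) - window + 1):
        counts = Counter(sequence[start:start + window])
        residue, count = counts.most_common(1)[0]
        if count >= min_count:
            hits.append((start, residue))
    return hits
```

A sequence such as "MK" followed by ten glutamines and "LV" yields polyQ windows, while a sequence with no dominant residue returns an empty list.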

The first section of the results shows the search settings, that is, the selected restrictions. It is followed by the information drawn from each cluster that matches the restrictions. When a cluster is displayed (by clicking on the cluster's leader), the information shown is the same as in mode 3. The user can also display the whole cluster. If there are more than 200 results, they are not displayed; the user should then restrict the search further to obtain fewer results.
