Homorepeats

PolyX2 is a web tool that searches for homorepeats in a given protein dataset. It processes ~500 proteins per second. File uploads are limited to 50 MB (~2 minutes of running time). For larger datasets, we recommend using the standalone version of the script (see below).

Mier P and Andrade-Navarro MA. PolyX2: fast detection of homorepeats in large protein datasets. Genes 13(2022), 758. PMID:35627143.

EXECUTION

PRECOMPUTED

The following precomputed datasets and their homorepeats (default parameters) are available.

Dataset		Results
Drosophila melanogaster proteome
Homo sapiens proteome
Isoforms (UniProt v2020_06)
SwissProt (UniProt v2020_06)

DOWNLOAD

Alternatively, you can download the source code and run it locally. By default, all possible polyX are searched for.

HELP

Input

A FASTA file with one or more protein sequences.
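To illustrate the expected input, here is a minimal sketch of a FASTA reader in Python. It is not part of PolyX2; the function name `read_fasta` is hypothetical, and the sketch assumes plain FASTA in which each record starts with a `>` header line followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration only; not taken from PolyX2.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    example = [">protein_1", "MKQQ", "QQQQ", ">protein_2", "PPPP"]
    for name, sequence in read_fasta(example):
        print(name, sequence)
```

With a file on disk, the same function could be called as `read_fasta(open(path))`, where `path` points to the FASTA file.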

Thresholds

[X] > Minimum number of identical residues in the polyX. Must be greater than half of the window length [Y]. Default = 8.

[Y] > Window length: minimum length of the polyX. Default = 10.

These thresholds determine the parameter k, which is the maximum number of guest amino acids (residues other than the repeated one) allowed in a window (k = [Y] - [X]). This parameter must be smaller than half of the window length. Otherwise, execution is halted and an error message is shown.
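The relation between the thresholds and k can be sketched as follows. This is an illustrative check only, not the code used by PolyX2; the function name `guest_limit` and the error messages are hypothetical.

```python
def guest_limit(x=8, y=10):
    """Return k = y - x, the maximum number of guest residues per window.

    Hypothetical helper: raises ValueError when the thresholds violate
    the constraint described above (k must be smaller than y / 2).
    """
    if x < 1 or y < 1 or x > y:
        raise ValueError("thresholds must satisfy 1 <= X <= Y")
    k = y - x
    if not k < y / 2:
        raise ValueError("k = Y - X must be smaller than half of Y")
    return k


if __name__ == "__main__":
    print(guest_limit())      # default thresholds X = 8, Y = 10
    print(guest_limit(4, 6))  # short polyX setting
```

Requiring k < [Y] / 2 is equivalent to requiring [X] > [Y] / 2, so the two conditions stated above describe the same constraint.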

Depending on the selected thresholds, the script will locate different polyX. However, because the thresholds are by definition minimums (minimum number of identical residues and minimum window length), long pure polyX will be found irrespective of the thresholds.

The selection of the thresholds is at the discretion of the user. For example, choosing parameters [X = 6] and [Y = 10] can lead to detecting 'SESRSDVSSS' as a polyS region, which does not look like a real polyS. We recommend using one of the following settings:

[X = 8] and [Y = 10] (default), to look for long polyX.

[X = 4] and [Y = 6], to look for short polyX. Long polyX will also be found.
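The effect of the thresholds can be explored with a simple sliding-window count. The sketch below is an assumption-laden illustration, not the PolyX2 algorithm: it only reports which windows of length Y contain at least X copies of one residue, and it does not attempt to reproduce how PolyX2 delimits or reports regions. The function name `qualifying_windows` is hypothetical.

```python
def qualifying_windows(seq, residue, x=8, y=10):
    """Return 0-based start indices of windows of length y that contain
    at least x copies of `residue`.

    Illustrative sketch only; not the PolyX2 implementation.
    """
    starts = []
    if len(seq) < y:
        return starts
    # Count the residue in the first window, then update incrementally.
    count = seq[:y].count(residue)
    if count >= x:
        starts.append(0)
    for i in range(1, len(seq) - y + 1):
        # One residue enters on the right, one leaves on the left.
        count += (seq[i + y - 1] == residue) - (seq[i - 1] == residue)
        if count >= x:
            starts.append(i)
    return starts


if __name__ == "__main__":
    # 'SESRSDVSSS' has 6 S in 10 residues: it passes X = 6, Y = 10
    # but not the default X = 8, Y = 10.
    print(qualifying_windows("SESRSDVSSS", "S", x=6, y=10))
    print(qualifying_windows("SESRSDVSSS", "S", x=8, y=10))
```

A long pure stretch such as twelve consecutive Q passes under both recommended settings, which is consistent with the note that long polyX are also found with the short-polyX thresholds.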

Output

A file with the polyX regions found with the selected thresholds in the input file. Example for the protein HD_HUMAN (example 1), with default parameters:

Start  End   Aa  +Aa  Aa/len  ID                  polyX
18     38    Q   -    21/21   sp|P42858|HD_HUMAN  QQQQQQQQQQQQQQQQQQQQQ
39     52    P   LQ   12/14   sp|P42858|HD_HUMAN  PPPPPPPPPPPQLP
63     78    P   GQ   13/16   sp|P42858|HD_HUMAN  PQPQPPPPPPPPPPGP
2633   2643  E   WD   9/11    sp|P42858|HD_HUMAN  EEEWDEEEEEE

Columns in the output file:

Start: starting coordinate of the polyX.

End: finishing coordinate of the polyX.

Aa: most prevalent amino acid in the polyX.

+Aa: other amino acids, apart from the most prevalent, in the polyX.

Aa/len: number of residues of the most prevalent amino acid versus polyX length.

ID: protein ID.

polyX: sequence of the polyX.
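For downstream processing, a result line can be split into the seven columns described above. The following is a minimal sketch that assumes the columns are separated by whitespace, as in the example; the names `PolyXRecord` and `parse_polyx_line` are hypothetical and not part of PolyX2.

```python
from typing import NamedTuple


class PolyXRecord(NamedTuple):
    start: int       # Start
    end: int         # End
    aa: str          # Aa: most prevalent amino acid
    other_aa: str    # +Aa: other amino acids ("" when the column is "-")
    aa_count: int    # numerator of Aa/len
    length: int      # denominator of Aa/len
    protein_id: str  # ID
    sequence: str    # polyX


def parse_polyx_line(line):
    """Parse one whitespace-separated result line into a PolyXRecord.

    Illustrative sketch; assumes exactly seven columns per line.
    """
    start, end, aa, other, ratio, protein_id, sequence = line.split()
    count, length = ratio.split("/")
    return PolyXRecord(
        int(start), int(end), aa,
        "" if other == "-" else other,
        int(count), int(length), protein_id, sequence,
    )


if __name__ == "__main__":
    record = parse_polyx_line("2633 2643 E WD 9/11 sp|P42858|HD_HUMAN EEEWDEEEEEE")
    print(record.aa, record.aa_count / record.length)
```

In the example rows, End - Start + 1 equals the polyX length, which suggests that both coordinates are inclusive.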

On the results page, there is also an overview table with the number of homorepeats found per amino acid.
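A comparable per-amino-acid tally can be computed from an output file obtained with the standalone script. The sketch below is not the code behind the results page; the function name `count_by_amino_acid` is hypothetical, and it assumes whitespace-separated lines with seven columns, where the third column is the most prevalent amino acid.

```python
from collections import Counter


def count_by_amino_acid(lines):
    """Count polyX regions per most prevalent amino acid (column Aa).

    Illustrative sketch; lines that do not look like result rows
    (for example the header line) are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 7 or not fields[0].isdigit():
            continue  # header, blank, or malformed line
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    rows = [
        "Start End Aa +Aa Aa/len ID polyX",
        "18 38 Q - 21/21 sp|P42858|HD_HUMAN QQQQQQQQQQQQQQQQQQQQQ",
        "39 52 P LQ 12/14 sp|P42858|HD_HUMAN PPPPPPPPPPPQLP",
        "63 78 P GQ 13/16 sp|P42858|HD_HUMAN PQPQPPPPPPPPPPGP",
        "2633 2643 E WD 9/11 sp|P42858|HD_HUMAN EEEWDEEEEEE",
    ]
    for aa, n in sorted(count_by_amino_acid(rows).items()):
        print(aa, n)
```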
