pSTR: protein Short Tandem Repeats

Protein Short Tandem Repeats (pSTR) are sequence units located directly adjacent to an identical, conserved copy of the same unit. When pSTR are detected in a comparison between two protein sequences, they are checked for variation in their unit number. In the following example, units colored in blue are pSTR, and units colored in red are pSTR with unit number variation. In both cases, an identical unit is fully conserved directly N- or C-terminal to them.

AAAACDCDEFGH--IIAKLKLTPPPP

AAAACD--EFGHIIIIAKL--TPW--

******--****--*****--**---
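
The example above can be reproduced with a short Python sketch. It is not the code of the pSTR tool, only an illustration under simplifying assumptions: each gap run is treated as exactly one repeat unit, and a unit counts as a pSTR with unit number variation only if an identical copy sits directly next to it and is identical in both aligned sequences. The function name and the min_unit parameter are hypothetical.

```python
def pstr_with_variation(aln_a, aln_b, min_unit=2):
    """Find gap runs whose missing unit has a conserved identical neighbour.

    Returns a list of (unit, alignment_column) tuples, 0-based columns.
    Assumption: one gap run corresponds to exactly one repeat unit.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aln_a)
    hits = []
    # Look for gaps in either sequence of the pair.
    for gapped, other in ((aln_a, aln_b), (aln_b, aln_a)):
        i = 0
        while i < n:
            if gapped[i] != "-":
                i += 1
                continue
            j = i
            while j < n and gapped[j] == "-":
                j += 1
            unit = other[i:j]  # residues present only in the other sequence
            k = len(unit)
            if k >= min_unit and "-" not in unit:
                # The same unit, conserved in both sequences, N-terminal to the gap
                left = i - k >= 0 and other[i - k:i] == unit and gapped[i - k:i] == unit
                # ... or C-terminal to the gap
                right = j + k <= n and other[j:j + k] == unit and gapped[j:j + k] == unit
                if left or right:
                    hits.append((unit, i))
            i = j
    return sorted(hits, key=lambda hit: hit[1])


seq1 = "AAAACDCDEFGH--IIAKLKLTPPPP"
seq2 = "AAAACD--EFGHIIIIAKL--TPW--"
print(pstr_with_variation(seq1, seq2))
# → [('CD', 6), ('II', 12), ('KL', 19)]
```

The final gap is not reported: the missing "PP" is preceded by "PP" in one sequence but by "PW" in the other, so no conserved identical unit is adjacent to it.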

The pSTR web tool looks for pSTR in a given multifasta file. It performs pairwise comparisons between all provided sequences and detects pSTR, which are then mapped to a query sequence (the first of the provided sequences). Only pSTR with a unit length of more than one amino acid are considered.
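
As an illustration of the all-against-all design, the following Python sketch reads a multifasta text, takes the first record as the query, and lists every pair of sequences to compare. It is a minimal sketch and not the tool's implementation; the function names and the example sequences are hypothetical.

```python
from itertools import combinations


def read_fasta(text):
    """Parse multifasta text into a list of (header, sequence), keeping input order."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def all_pairs(records):
    """Every unordered pair of records, in input order."""
    return list(combinations(records, 2))


example = """>query
AAAACDCDEFGH
>seqB
AAAACDEFGH
>seqC
AAAACDCDCDEFGH
"""

records = read_fasta(example)
query = records[0]          # the first sequence is the query
pairs = all_pairs(records)  # n * (n - 1) / 2 comparisons
print(query[0], len(pairs))
```

The number of comparisons grows quadratically with the number of sequences: 30 sequences give 435 pairs and 100 sequences give 4950 pairs.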

How to cite us? Mier P and Andrade-Navarro MA. (2023) Evolutionary study of protein short tandem repeats in protein families. Biomolecules, 13:1116. PMID:37509152.

Help

Theoretical explanation. If "ACD" is identified as a pSTR with unit number variation, it means that at least one sequence in the input file contains "ACDACD", and that this region is aligned with a region "ACD" in a different sequence (see example above).

Raw output. pSTR with unit number variation are reported in the form [ACD], together with the pair of proteins in which they were identified. The protein containing the pSTR is also aligned with the query sequence, and the pSTR is mapped to it. The position in the query sequence to which the pSTR maps is reported as well (for example [ACD(102)]), which allows a global comparison of all the pSTR found in the search.
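
For downstream analysis, the two notations described above can be parsed with a small Python sketch. Only the bracketed tokens are taken from the description; the layout of the rest of the raw output file is not assumed here, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches [ACD] and [ACD(102)]: a unit, optionally followed by a query position.
TOKEN = re.compile(r"\[([A-Za-z]+)(?:\((\d+)\))?\]")


def parse_tokens(text):
    """Return (unit, query_position) tuples; position is None when absent."""
    return [
        (m.group(1), int(m.group(2)) if m.group(2) else None)
        for m in TOKEN.finditer(text)
    ]


def group_by_query_position(tokens):
    """Collect units by the query position they map to, for a global comparison."""
    groups = defaultdict(list)
    for unit, position in tokens:
        if position is not None:
            groups[position].append(unit)
    return dict(groups)


tokens = parse_tokens("[ACD] [ACD(102)] [KL(57)]")
print(tokens)
# → [('ACD', None), ('ACD', 102), ('KL', 57)]
print(group_by_query_position(tokens))
# → {102: ['ACD'], 57: ['KL']}
```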

Alignments. All alignments in the pSTR web tool are done with MUSCLE v3.8.1551 with default parameters.
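
To reproduce an alignment locally, MUSCLE v3.8 can be called with only its input and output options, which leaves all other parameters at their defaults. The Python sketch below assumes that a MUSCLE v3.8 executable named "muscle" is available on the PATH; the file names and function names are hypothetical, and newer MUSCLE releases use a different command line.

```python
import subprocess


def muscle_command(in_fasta, out_fasta, binary="muscle"):
    """Build a MUSCLE v3.8 command line with default parameters."""
    # v3.8 syntax: -in <input FASTA> -out <aligned FASTA>
    return [binary, "-in", in_fasta, "-out", out_fasta]


def run_muscle(in_fasta, out_fasta, binary="muscle"):
    """Run MUSCLE and raise CalledProcessError if it exits with an error."""
    subprocess.run(muscle_command(in_fasta, out_fasta, binary), check=True)


print(" ".join(muscle_command("pair.fasta", "pair.aln")))
# → muscle -in pair.fasta -out pair.aln
```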

Tip. Input sequences are compared pairwise to detect pSTR with unit number variation. The results of the pairwise comparisons are shown as a heatmap, with the sequences in the same order as in the input file. We therefore advise you to arrange the sequences in the input file in the order in which you want them to appear in the output.
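
The relation between input order and heatmap layout can be sketched in Python as follows. This is an illustration only: it assumes that each heatmap cell holds the number of pSTR with unit number variation found for that pair, which is an assumption about the displayed value, and the function name and example counts are hypothetical.

```python
def heatmap_matrix(names, pair_counts):
    """Build a symmetric matrix whose rows and columns follow the input order.

    names: sequence names in the order of the input file.
    pair_counts: {(name_a, name_b): count} for the compared pairs.
    """
    index = {name: i for i, name in enumerate(names)}
    size = len(names)
    matrix = [[0] * size for _ in range(size)]
    for (name_a, name_b), count in pair_counts.items():
        i, j = index[name_a], index[name_b]
        matrix[i][j] = matrix[j][i] = count  # a pair is unordered
    return matrix


names = ["query", "seqB", "seqC"]
counts = {("query", "seqB"): 3, ("seqB", "seqC"): 1}
for name, row in zip(names, heatmap_matrix(names, counts)):
    print(name, row)
# → query [0, 3, 0]
# → seqB [3, 0, 1]
# → seqC [0, 1, 0]
```

Reordering the names list reorders rows and columns together, which is why the order of the input file determines the layout of the output.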

Input limitation. The web tool allows a maximum of 100 protein sequences as an input. For larger datasets, please use the standalone version of the tool.

Execution time depends on the number of sequences in the input file, and on their length. As a guideline:

30 sequences

100 sequences

Who are we? Dr. Pablo Mier and Prof. Dr. Miguel A. Andrade-Navarro, from the CBDM group (JGU Mainz, Germany).