REP2 Server

Kamel M, Kastano K, Mier P, Andrade-Navarro MA. REP2: a web server to detect common tandem repeats in protein sequences. Journal of Molecular Biology (2021). 433(11):166895.
https://doi.org/10.1016/j.jmb.2021.166895

About REP2


REP2 is a method for detecting a few types of tandem repeats in protein sequences. It uses an iterative search procedure with a profile manually optimized for each repeat type considered. How the profile is compiled and used is described briefly here. For further details, see Reference 1.

The method focuses on identifying protein repeats between 20 and 40 amino acids long. These repeats are usually very divergent, and recognizing them is difficult even with a good profile of the repeat. We manually generated profiles from selected known instances of several types of tandem repeats. For each of them, we observed that the scores of optimal and sub-optimal hits follow Extreme Value Distributions (EVD). This allowed us to estimate E-values from these scores by fitting the distributions of successive hits obtained in a randomized version of the sequence database.
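As a rough illustration of how a score can be turned into an E-value once an EVD has been fitted, the sketch below fits a Gumbel (type I extreme value) distribution to background scores and evaluates its upper tail. It is not the REP2 implementation: the method-of-moments fit, the function names, and the `n_sequences` database-size parameter are assumptions made for this example, and the actual fitting procedure is the one described in Reference 1.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel (location, scale) from background scores.

    Method of moments (an assumption of this sketch, not the REP2 procedure):
    variance = (pi * scale)^2 / 6 and mean = location + gamma * scale.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    scale = std * math.sqrt(6) / math.pi
    location = mean - EULER_GAMMA * scale
    return location, scale


def gumbel_survival(score, location, scale):
    """P(S >= score) for a Gumbel distribution: 1 - exp(-exp(-z))."""
    z = (score - location) / scale
    if z < -700:  # exp(-z) would overflow; the tail probability is 1
        return 1.0
    return -math.expm1(-math.exp(-z))


def e_value(score, location, scale, n_sequences):
    """Expected number of random hits scoring at least `score`.

    `n_sequences` is a hypothetical database size used for illustration.
    """
    return n_sequences * gumbel_survival(score, location, scale)
```

In this sketch, `scores` would come from hits against the randomized database, with a separate fit for the first hit, the second hit, and so on, since the distributions of successive hits differ.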

For each repeat type, an E-value threshold and the minimum number of repeats required to report tandem repeats were manually curated by analyzing known examples.

To recognize (non-overlapping) repeats in a query sequence, the profile is compared to the sequence. If a hit is detected with a score that corresponds to an E-value below the predetermined E-value cut-off, the repeat sequence is masked and the profile is compared again to check for another repeat. The score threshold becomes less restrictive with each successive iteration because the precomputed EVDs for successive hits are different.

If a repeat fails the test, it is put on hold and the procedure continues with successive searches to check whether a further hit passes the test. If one does, all repeats on hold are validated.
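The search loop described above (find the best hit, mask it, test it, hold or validate) can be sketched as follows. This is a toy model under stated assumptions, not the REP2 code: the "profile" is replaced by a plain motif string scored by counting identical positions, `passes(score, rank)` is a hypothetical stand-in for the E-value test of the hit found at iteration `rank`, and stopping when no unmasked window with a positive score remains is a choice made for this example.

```python
def best_hit(sequence, motif, masked):
    """Return (score, start) of the best fully unmasked window, or None.

    Toy scoring (assumption): number of positions identical to `motif`.
    """
    width = len(motif)
    best = None
    for start in range(len(sequence) - width + 1):
        if any(masked[start:start + width]):
            continue  # repeats must not overlap already masked hits
        window = sequence[start:start + width]
        score = sum(a == b for a, b in zip(window, motif))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def find_repeats(sequence, motif, passes, min_repeats=2, max_iterations=50):
    """Iteratively collect non-overlapping hits as (start, score) pairs."""
    masked = [False] * len(sequence)
    accepted, on_hold = [], []
    for rank in range(max_iterations):
        hit = best_hit(sequence, motif, masked)
        if hit is None or hit[0] <= 0:
            break  # nothing left to match
        score, start = hit
        for i in range(start, start + len(motif)):
            masked[i] = True  # mask the hit before searching again
        if passes(score, rank):
            # A passing hit validates every hit that was put on hold before it.
            accepted.extend(on_hold)
            on_hold.clear()
            accepted.append((start, score))
        else:
            on_hold.append((start, score))
    # Hits still on hold at the end are discarded.
    return sorted(accepted) if len(accepted) >= min_repeats else []
```

For example, with `passes = lambda score, rank: score >= [8, 8, 5][min(rank, 2)]`, a second hit scoring 7 fails its test and is put on hold, but it is validated if a third hit scoring at least 5 is found afterwards.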

It is possible to run the profile without thresholds (the "obtain all hits" option). In this mode, the iterative comparison runs until there are no further matches, and all matches found are reported irrespective of their score. Only the scores are reported in this case, because converting them to E-values is not possible: the model requires a previous hit above the cut-off.

The tool can be used via API access.

For example:

https://cbdm-01.zdv.uni-mainz.de/~munoz/cgi-bin/rep/search3.pl?sequence=MSELEQLRQEAEQLRNQIQDARKACNDATLVQITSNMDSVGRIQMRTRRTLRGHLAKIYAMHWGYDSRLLVSASQDGKLIIWDSYTTNKMHAIPLRSSWVMTCAYAPSGNYVACGGLDNICSIYNLKTREGNVRVSRELPGHTGYLSCCRFLDDSQIVTSSGDTTCALWDIETAQQTTTFTGHSGDVMSLSLSPDMRTFVSGACDASSKLWDIRDGMCRQSFTGHVSDINAVSFFPNGYAFATGSDDATCRLFDLRADQELLLYSHDNIICGITSVAFSKSGRLLLAGYDDFNCNVWDTLKGDRAGVLAGHDNRVSCLGVTDDGMAVATGSWDSFLRIWN&format=csv&repeat_type=ALL
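A request of this form can also be built programmatically. The sketch below uses only the Python standard library and the three query parameters visible in the example URL (`sequence`, `format`, `repeat_type`). The function names are hypothetical, the only parameter values known from this page are `csv` and `ALL`, and treating the response as UTF-8 text is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://cbdm-01.zdv.uni-mainz.de/~munoz/cgi-bin/rep/search3.pl"


def build_rep2_url(sequence, output_format="csv", repeat_type="ALL"):
    """Build a query URL, removing any whitespace or line breaks from the sequence."""
    clean = "".join(sequence.split())
    query = urlencode(
        {"sequence": clean, "format": output_format, "repeat_type": repeat_type}
    )
    return f"{BASE_URL}?{query}"


def fetch_rep2(sequence, **kwargs):
    """Send the request and return the response body as text (assumed UTF-8)."""
    with urlopen(build_rep2_url(sequence, **kwargs), timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_rep2(my_sequence)` with a protein sequence string would then return the server's response for that query, for instance the CSV output requested above.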

Reference 1: Andrade MA, Ponting CP, Gibson TJ, Bork P. Homology-based method for identification of protein repeats using statistical significance estimates. J. Mol. Biol. (2000). 298:521-537.

Reference 2: Kamel M, Kastano K, Mier P, Andrade-Navarro MA. REP2: a web server to detect common tandem repeats in protein sequences. J. Mol. Biol. (2021). 433(11):166895.