How to cite:
Gebhardt M.L., Mer A.S., Andrade-Navarro M. - mBISON: Finding miRNA target over-representation in gene lists from ChIP-sequencing data. BMC Res Notes. 2015 Apr 16;8:157. (PMID: 25889572)
Gebhardt M.L., Reuter S., Mrowka R., Andrade-Navarro M. - Similarity in targets with REST points to neural and glioblastoma related miRNAs. Nucleic Acids Research. 2014 May 1; 42(9):5436-46. (PMID: 24728992)
For questions and comments regarding the web-tool please contact Marie Gebhardt at
"first namelast name" at "gmx.de"
The tool calculates over-representation of miRNA targets in a gene list or in a set of genomic positions (e.g. from ChIP-seq experiments). Enrichment is assessed by simulation. For each miRNA family, we count how many of your genes of interest are predicted targets according to the miRNA binding site predictions from TargetScan 6.2 (http://www.targetscan.org/) and compare this count x times to the count in a random background sample drawn from the same set of predictions (x = 1,000, 10,000 or 100,000, according to your choice). You can either use the default background of all 11,161 genes in TargetScan or supply a custom gene list for this purpose. The p-value reflects how often random sampling produces a result at least as good as the one from your query set. The p-values are then corrected for multiple testing with the Benjamini-Hochberg procedure to obtain False Discovery Rates (FDRs).
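The simulation described above can be illustrated with a minimal Python sketch. It is not the tool's actual code: the function names, the sampling without replacement and the fixed random seed are assumptions made for illustration, and the correction for average miRNA count is left out here.

```python
import random


def empirical_p_value(query_genes, background_genes, family_targets,
                      n_randomizations=1000, seed=0):
    """Fraction of random gene sets with at least as many targets as the query.

    All names are hypothetical; family_targets is the set of genes
    predicted to be targeted by one miRNA family.
    """
    rng = random.Random(seed)
    query = set(query_genes)
    observed = len(query & family_targets)
    background = sorted(background_genes)
    hits = 0
    for _ in range(n_randomizations):
        # Draw a random gene set of the same size as the query set.
        sample = rng.sample(background, len(query))
        if sum(gene in family_targets for gene in sample) >= observed:
            hits += 1
    return hits / n_randomizations


def benjamini_hochberg(p_values):
    """Convert p-values to FDRs with the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    fdr = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the FDRs monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        fdr[i] = running_min
    return fdr
```

With 1,000 randomizations the smallest p-value that can be told apart from zero is 0.001, which is one reason why more randomizations give more accurate results.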
Some gene lists have a high average number of miRNA binding sites (e.g. because their 3'UTRs are long in comparison to all other genes). This can bias the calculation, so that too many miRNA families appear significant. We therefore correct for a possible bias in the average miRNA count by multiplying each background count by a correcting factor. This factor is the ratio of the number of all miRNA-gene associations in the test set to the number of miRNA-gene associations in the respective background set. For further explanation please refer to Gebhardt et al., 2014.
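As an illustration of this correction, the following Python sketch computes the factor for one random background set and applies it to a background count. The function names and the dictionary layout (gene mapped to the set of miRNA families predicted to target it) are assumptions, not the tool's internal data structures.

```python
def correcting_factor(test_genes, background_sample, associations):
    """Ratio of miRNA-gene associations in the test set to the background set.

    associations is a hypothetical dict: gene -> set of miRNA families.
    """
    test_total = sum(len(associations.get(g, ())) for g in test_genes)
    background_total = sum(len(associations.get(g, ())) for g in background_sample)
    if background_total == 0:
        raise ValueError("background sample has no miRNA-gene associations")
    return test_total / background_total


def corrected_background_count(background_sample, family, associations, factor):
    """Number of background genes targeted by one family, scaled by the factor."""
    raw_count = sum(family in associations.get(g, ()) for g in background_sample)
    return raw_count * factor


# Toy example: the test genes carry twice as many associations as the background.
associations = {
    "A": {"f1", "f2", "f3"},
    "B": {"f1"},
    "C": {"f1"},
    "D": {"f2"},
}
factor = correcting_factor(["A", "B"], ["C", "D"], associations)
scaled = corrected_background_count(["C", "D"], "f1", associations, factor)
```

In this toy example the factor is 2.0, so the single background hit for family f1 is counted as 2.0 when it is compared with the test set.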
1. Choose 1,000 randomizations when you only want to have a first look at your data. Very significant results will pass the test, but many will not. It is recommended to use a relaxed FDR cutoff when this option is used. For trustworthy results you should choose 10,000 or even 100,000 randomizations. The calculation will take longer, but will be much more accurate.
2. You can choose to run the analysis on genes from human or mouse.
Gene list input format:
Gene lists can be entered as a simple newline-separated list, i.e. one identifier per line. Duplicate identifiers are removed by the tool, and identifiers that cannot be found in the TargetScan 6.2 predictions are skipped. You can choose between different identifier types, which are converted to Entrez-IDs for the analysis. Supplying Entrez-IDs is recommended. Lists with between 20 and 4,000 unique gene identifiers that are part of the TargetScan predictions will be processed. Larger lists cannot be expected to give significant results, and processing them takes very long.
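The filtering rules above can be sketched in Python as follows. This only illustrates the described behaviour; the function name, the parameter names and the error message are hypothetical, and identifier conversion to Entrez-IDs is not shown.

```python
def prepare_gene_list(text, known_ids, min_genes=20, max_genes=4000):
    """Return unique identifiers from text that occur in known_ids.

    known_ids stands in for the set of genes present in the
    TargetScan predictions (an assumption for this sketch).
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        identifier = line.strip()
        if not identifier or identifier in seen:
            continue  # skip empty lines and duplicates
        seen.add(identifier)
        if identifier in known_ids:
            kept.append(identifier)  # unknown identifiers are skipped
    if not min_genes <= len(kept) <= max_genes:
        raise ValueError(
            f"{len(kept)} usable identifiers; expected {min_genes} to {max_genes}"
        )
    return kept
```

Note that the size limits are checked after duplicates and unknown identifiers have been removed, which matches the wording "unique gene identifiers that are part of the TargetScan predictions".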
Genomic positions input format:
Genomic positions have to be entered as txt- or bed-files in the following tab-separated format:
In any case, please do not use a header. Bed-files with more than three columns will be accepted and truncated to the first three columns.
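A minimal Python sketch of reading such a file is given below. It assumes that the three columns are the standard BED fields chromosome, start and end; the function name and the error handling are hypothetical and not taken from the tool.

```python
def read_positions(lines):
    """Parse tab-separated positions, keeping only the first three columns.

    Assumes the standard BED layout: chromosome, start, end.
    """
    regions = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore empty lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least three columns: {line!r}")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        regions.append((chrom, start, end))
    return regions


example = ["chr1\t100\t200\tpeak1\t5\n", "chr2\t5\t50\n"]
regions = read_positions(example)
```

A header line would fail the integer conversion in this sketch, which is one reason why the input must not contain one.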
You can choose the method for assigning the genomic positions to genes. Either you search for the closest TSS (within 5, 10 or 20 kb) of RefSeq genes (hg19) and miRNAs (miRBase, release 20 (ref.)) to your binding sites, or you use the custom ranking method. With this method, peak locations from ChIP-seq data are assigned to RefSeq genes of the reference genome hg19 according to prioritized criteria: if the peaks are situated
(i) in their known or predicted promoter region (according to a database on human and murine promoters: MPromDb (ref.));
(ii) up to 1,000 bp upstream of their transcription start site (TSS);
(iii-a) for a single exon gene, up to 1,000 bp downstream of the TSS;
(iii-b) for a multi-exon gene, anywhere between the TSS and the coding start, plus the first intron of the gene;
(iv) up to 5,000 bp upstream of their TSS; and
(v) up to 5,000 bp downstream of their transcription end site.
This method stems from the observation that binding sites for certain TFs are found most frequently in the first intron or in the core promoter region (ref.). In the strict version of this method, criterion (v) is not used for the mapping.
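The idea of prioritized criteria can be sketched in Python as follows. The sketch assumes that it is already known which criteria a peak satisfies for each candidate gene; the function name, the input layout and the tie-breaking rule (the first gene listed wins) are assumptions, and criteria (iii-a) and (iii-b) are merged into one label "iii".

```python
# Criteria in order of priority, highest first.
PRIORITY = ["i", "ii", "iii", "iv", "v"]


def assign_peak(matches, strict=False):
    """Pick the gene whose best satisfied criterion has the highest priority.

    matches is a hypothetical dict: gene -> set of satisfied criterion labels.
    In strict mode criterion (v) is ignored.
    """
    allowed = PRIORITY[:-1] if strict else PRIORITY
    best = None
    for gene, criteria in matches.items():
        for rank, criterion in enumerate(allowed):
            if criterion in criteria:
                if best is None or rank < best[0]:
                    best = (rank, gene)
                break  # only the best criterion of each gene matters
    if best is None:
        return None
    return best[1], allowed[best[0]]
```

For example, a peak that lies 3,000 bp upstream of one gene (criterion iv) and 500 bp upstream of another (criterion ii) would be assigned to the second gene.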
Up to three replicates can be uploaded. A miRNA or gene is considered to be potentially regulated by the factor if it is identified in at least two replicates.
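The replicate rule can be illustrated with a short Python sketch. The function name and the representation of each replicate as a collection of gene or miRNA identifiers are assumptions for illustration.

```python
from collections import Counter


def consensus(replicates, min_support=2):
    """Return the items that were identified in at least min_support replicates."""
    counts = Counter()
    for replicate in replicates:
        # Count each item once per replicate, even if it was hit several times.
        counts.update(set(replicate))
    return {item for item, n in counts.items() if n >= min_support}


replicates = [{"A", "B"}, {"B", "C"}, {"C", "A", "D"}]
supported = consensus(replicates)
```

Here "D" is dropped because it occurs in only one of the three replicates.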
miRNAs are considered to be bound by your factor of interest if a binding site is found within 10 kb of a transcription start site (TSS) according to miRBase, release 20 (ref.).
How to choose cutoffs:
1. You can choose an FDR cutoff from 0.005 to 0.2. A cutoff of 0.2 means that about 20% of the results reported as significant are expected to be false positives. Depending on the network you analyze and the questions you want to answer, it can be informative to look for a tendency of a certain miRNA to be present in a gene list, and therefore to use a relaxed cutoff or to look at the p-values instead. If your miRNA of interest is not significantly over-represented but has a low p-value, this still gives you information about your network.
2. The second cutoff is the minimal number of genes that have to be targeted by a miRNA family in the TargetScan dataset for the family to be able to give a significant result. It is specified as a percentage of the number of your input genes. The default is 5%. If your input contains 1,000 genes that show up in TargetScan 6.2, this means that at least 50 genes have to be targeted by a miRNA family before the result can be called significant. Some miRNAs, however, have a very limited number of predicted targets. A miRNA that has a very significant FDR but does not pass the second cutoff can still be an important result.
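The arithmetic behind the second cutoff can be sketched in Python. The function names are hypothetical, and rounding the required number up to the next whole gene is an assumption, since the text does not say how fractional values are handled.

```python
import math


def required_targets(n_input_genes, percent=5.0):
    """Minimal number of targeted genes for a given input size and percentage."""
    # Rounding up is an assumption made for this sketch.
    return math.ceil(n_input_genes * percent / 100)


def passes_target_cutoff(n_targeted, n_input_genes, percent=5.0):
    """True if a miRNA family targets enough genes to pass the second cutoff."""
    return n_targeted >= required_targets(n_input_genes, percent)
```

With the default of 5% and 1,000 input genes found in TargetScan 6.2, `required_targets(1000)` gives 50, as in the example above.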
Why input your factor name?
When you enter the Entrez-ID of the factor you analyzed in the experiment, the tool can output the miRNA families that are predicted to bind your factor of interest. This points to possible loops between the miRNA families and your factor.
Understand the output:
Two separate output files can be accessed from the results page. The first contains all results for over-represented miRNAs, the parameters and additional information. The second contains a list of miRNA-gene associations that are more likely to be true than the other predictions for the analyzed network.
All output shown on the webpage is also contained in the txt output. The txt file additionally contains a list of all gene identifiers found (as Entrez-IDs). If you entered the name of your analyzed factor, the output can include information on over-represented miRNAs predicted to regulate the factor of interest. When genomic positions are provided instead of gene lists, binding close to miRNAs can also be detected and is displayed in the output.
Display output in Internet Explorer:
Most browsers display the text output of the program properly, but Internet Explorer does not. If you cannot use another browser such as Firefox or Google Chrome, please download the file (right-click, "Save link as...") and open it in a suitable text editor.
Interpret the correcting factor:
The correcting factor tells you whether the 3'UTRs of the genes you entered have more or fewer predicted miRNA-gene pairs than the average gene. If, for example, your gene list includes many genes with a long 3'UTR, there tend to be more predicted miRNA-gene interactions than on average. In this case the average correcting factor will be greater than 1; otherwise it will be smaller. It cannot be negative, since it is by definition the number of predicted miRNA-gene interactions in the test set divided by the number of predicted miRNA-gene interactions in the respective random background gene set.
Where does the example come from?
The example file is a bed-formatted file from the Gene Expression Omnibus database (record GSE53927). It contains beta-catenin binding regions in SW480 colorectal cancer cells.