Promoter finder tools

Mann et al. This tool was reported to have much higher prediction accuracy. Despite these efforts, all these tools tend to produce many false positives or show poor sensitivity, particularly when they are applied to long sequences or whole genomes.

Such tests are insufficient to adequately evaluate their prediction accuracy. Therefore, novel, more accurate and efficient tools are required for the computational recognition of different classes of promoters in a broader taxonomical scope. In this study, we present a novel method for predicting TSSs in E. We characterized promoters of E. The prediction models are implemented in a tool, bTSSfinder, which is available as a standalone program as well as a web application.

The main goal of this study was to develop a tool that predicts promoters for the different sigma classes in Cyanobacteria and E. coli. The success of any promoter prediction tool depends mainly on: (i) the features used to distinguish promoters from non-promoters, (ii) the size and diversity of the positive and negative datasets used for learning and (iii) the quality of both the positive and the negative datasets.

Unfortunately, most of the characteristic features studied are not consistent within the same promoter class. Therefore, different studies have applied various combinations of features to improve the recognition of promoters. Furthermore, most of the reported tools were trained and tested on relatively small datasets, due to the lack of genome-wide TSS maps with experimental validation.

As for the quality of the datasets, here we face two problems: (i) the accuracy of experimental data on real TSSs varies significantly depending on the experimental method applied; (ii) the choice of the negative set (non-promoters) could have significant ramifications on the predictive power of the model, as it is a great challenge to define DNA regions that never serve as promoters. To address these challenges: (i) we analyzed as many promoter sequences with experimentally validated TSSs as available; (ii) from the whole pool of initially extracted negative samples (see below), we used different subsets of randomly chosen negative samples in both training and testing procedures and (iii) we checked different features that may allow a DNA region to serve as a potential promoter.

We compare bTSSfinder with other available methods. Boundaries of promoter regions remain poorly defined. RegulonDB, version v8. For the non-marine cyanobacterium Nostoc sp. As for the freshwater cyanobacterium Synechocystis sp. For the freshwater cyanobacterium S. Our preliminary assessment of the experimental sets revealed that some TSSs are located close to each other (a few nucleotides apart). To remove redundancy, an intra-species pairwise comparison of all TSS positions was performed.

After redundancy removal, the final TSS count was as follows: i E. Promoter sequences that do not satisfy the upstream length requirement for either set are excluded. The protocol for the generation of negative sets is described in the Supplementary material. The final counts for the negative sets were and 32 for E.

To build our PWMs for E. We propose that this new motif is another descriptive element for the recognition of promoters (see Supplementary material for PWMs and coverage). As for the cyanobacterial species, there are no published PWMs to use as initial matrices. Expectation maximization (EM) was also used to classify the cyanobacterial promoters into different classes. Following the same approach, we used PWMs for E. It should be noted that each round of classification is applied only to the sequences left unassigned in the previous step.
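As a rough illustration of how a position weight matrix (PWM) scores a candidate promoter box, the sketch below builds a log-odds matrix from a handful of aligned sites and slides it along a sequence. The aligned sites, the pseudocount, the uniform background and the function names are all assumptions made for this example; they are not the matrices or the procedure used by bTSSfinder.

```python
import math

BASES = "ACGT"

def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length aligned sites (toy procedure)."""
    length = len(aligned_sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score_site(pwm, site):
    """Sum the per-position log-odds scores for one candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_hit(pwm, sequence):
    """Return (offset, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [(score_site(pwm, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    score, offset = max(hits)
    return offset, score

# Hypothetical aligned sites resembling a -10 box; illustration only.
toy_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
toy_pwm = build_pwm(toy_sites)
offset, score = best_hit(toy_pwm, "GGGCGCTATAATGCGC")
```

In an EM-style refinement, the best hits found in each promoter would be used to rebuild the matrix and the scan repeated until the assignments stop changing.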

Oligomer frequencies (triplets, tetramers, pentamers and hexamers) were used to calculate scores as described in Rani and Bapi. We also used four physico-chemical properties of DNA (free energy, base stacking, melting temperature and entropy) as additional features to describe true and false promoter regions (see Supplementary material).
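The oligomer features start from simple counts of overlapping k-mers. The sketch below computes relative frequencies for k = 3 to 6; it shows only the counting step and does not reproduce the scoring scheme of Rani and Bapi. The function names are made up for this example.

```python
from collections import Counter

def kmer_frequencies(sequence, k):
    """Relative frequency of each overlapping k-mer in the sequence."""
    total = len(sequence) - k + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}

def oligomer_profile(sequence, sizes=(3, 4, 5, 6)):
    """Frequencies for triplets, tetramers, pentamers and hexamers."""
    return {k: kmer_frequencies(sequence, k) for k in sizes}

profile = oligomer_profile("ATATATGCGC")
```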

We evaluated the predictive power of these features, as well as the aforementioned promoter elements, using the Mahalanobis distance (D²), which we calculated based on the approach described by Afifi and Azen. Based on these distances, we selected the final set of features for use in the predictive model, as described in Section 3.
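For a single feature, the Mahalanobis distance between the promoter and non-promoter groups reduces to the squared difference of the group means divided by the pooled variance. The sketch below ranks features by this univariate D². Whether the authors used exactly this per-feature form is an assumption, and the feature names and values are invented for the example.

```python
from statistics import mean, variance

def mahalanobis_d2(pos_values, neg_values):
    """Univariate D^2: squared mean difference over the pooled sample variance."""
    n1, n2 = len(pos_values), len(neg_values)
    pooled = ((n1 - 1) * variance(pos_values)
              + (n2 - 1) * variance(neg_values)) / (n1 + n2 - 2)
    return (mean(pos_values) - mean(neg_values)) ** 2 / pooled

def rank_features(pos_table, neg_table):
    """Order feature names from most to least discriminative by D^2."""
    scores = {name: mahalanobis_d2(pos_table[name], neg_table[name])
              for name in pos_table}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical feature values for promoters (pos) and non-promoters (neg).
pos = {"free_energy": [1.0, 2.0, 3.0], "entropy": [1.0, 2.0, 3.0]}
neg = {"free_energy": [4.0, 5.0, 6.0], "entropy": [1.5, 2.5, 3.5]}
ranking = rank_features(pos, neg)
```

A feature with a larger D² separates the two groups better, so a cutoff on this ranking gives a reduced feature set.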

We estimated the performance of the predictive models using: sensitivity (Sn), specificity (Sp), precision (the positive predictive value, P1), accuracy (a measure of statistical bias, Ac), negative predictive value (P2), the F1-score (the harmonic mean of precision and sensitivity, F1) and the Matthews correlation coefficient (MCC).
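These measures all derive from the four confusion-matrix counts. The sketch below uses their standard textbook definitions, with F1 computed as the harmonic mean of precision and sensitivity. The function name and the example counts are illustrative, and the code assumes no denominator other than the MCC one is zero.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard performance measures from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    p1 = tp / (tp + fp)                    # precision (positive predictive value)
    p2 = tn / (tn + fn)                    # negative predictive value
    ac = (tp + tn) / (tp + tn + fp + fn)   # accuracy
    f1 = 2 * p1 * sn / (p1 + sn)           # harmonic mean of precision and sensitivity
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "P1": p1, "P2": p2, "Ac": ac, "F1": f1, "MCC": mcc}

# Made-up counts for illustration.
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```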

These statistical measures are briefly described in the Supplementary material. The algorithm of bTSSfinder is depicted in Figure 1. More details about the algorithm are given in the Results section (Section 3). The algorithm is implemented in the bTSSfinder tool.

Figure 1. Flow-chart of the algorithm implemented in the bTSSfinder program. T is the threshold for the prediction of the box specific for every sigma class.

The largest collection of experimentally validated promoters of E.

Unfortunately, no such classification exists for cyanobacterial promoters. So far, the and boxes have been identified or predicted in a handful of promoters. Our preliminary comparison of E. Combining E. We identified over 30 prospective features that may exert specificity for the different promoter classes. To cull the feature space to those with the highest predictive power, we calculated Mahalanobis distances for each feature and reduced the number to 19 to 21 features, depending on the promoter class (Supplementary Table S3).

To the best of our knowledge, this is the first time a wide feature base has been used for this type of problem. Physico-chemical properties of the promoter sequences: four features were chosen in the feature selection process, namely free energy, base stacking, entropy and melting temperature. Using a combination of features for each promoter class (as outlined in Supplementary Table S3), we built 10 NN classifiers, one for each promoter class in E.

Then, we implemented these models in the bTSSfinder program. For each window, the position is classified as TSS or non-TSS using the appropriate NN classifier, based on a threshold that was predetermined during training. Predictions that pass the qualifying threshold are labeled as putative TSSs. Depending on user preference, bTSSfinder can report for a chosen phylum: (i) all predicted TSSs for all promoter classes, (ii) those for a user-selected promoter class or (iii) the highest-scoring TSS.
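The window-by-window classification can be pictured with a minimal sketch like the one below. It is not the bTSSfinder implementation: the classifier is passed in as a plain function, and the window length, the threshold and the offset of the TSS within the window are hypothetical parameters. The toy scoring function stands in for a trained NN classifier.

```python
def scan_for_tss(sequence, classifier, window, threshold, tss_offset):
    """Slide a window along the sequence and keep positions scoring at or above the threshold."""
    hits = []
    for start in range(len(sequence) - window + 1):
        score = classifier(sequence[start:start + window])
        if score >= threshold:
            # Report the putative TSS coordinate together with its score.
            hits.append((start + tss_offset, score))
    return hits

def top_hit(hits):
    """Highest-scoring putative TSS, or None when nothing passed the threshold."""
    return max(hits, key=lambda hit: hit[1]) if hits else None

def at_fraction(window_seq):
    """Toy stand-in for a trained classifier: fraction of A/T bases."""
    return sum(base in "AT" for base in window_seq) / len(window_seq)

hits = scan_for_tss("GGGGATATGGGG", at_fraction, window=4, threshold=1.0, tss_offset=3)
```

Reporting all hits, only the hits of one class-specific classifier, or just `top_hit(hits)` corresponds to the three reporting modes listed above.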

We tested bTSSfinder on positive and negative sets for every promoter class in E. We observed good performance for all promoter classes in E. The F1-score for the remaining cyanobacterial promoter classes ranged from 0.

Table 1. Testing results for five sigma classes of E. Test experiments for every sigma class were repeated 10 times with randomly selected negative sets, and the means were taken.

For fairness, we assessed all tools on a single testing dataset. All other promoter prediction tools that we checked were no longer available. We also tried to test BacPP, which is the first tool that attempted to predict the complete range of sigma promoters in E.

The authors reported high prediction accuracy for BacPP, but these results were obtained from small training and testing sets. Given that we would have to make an educated guess as to where the TSS locations are, as well as the sheer number of promoters it predicts, we excluded this tool from our comparison.

Our comparison clearly indicates that bTSSfinder has significantly higher prediction accuracy than the other tools.

Table 2. Comparison of available promoter prediction programs tested on E.

A prediction is counted as true if the distance between the annotated and predicted TSSs is 50 bp or less. Using short sequences to predict TSSs is not sufficient to evaluate the accuracy and efficiency (especially the real false positive rate) of a prediction tool. A tool should also be tested on longer sequences. In fact, an ideal test should be genome-wide.
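The 50 bp matching criterion can be expressed as a small evaluation routine. The sketch below is an assumption about how such a comparison might be coded. It ignores strand and treats every annotated TSS with at least one prediction in range as a true positive; the source does not spell out either detail.

```python
def evaluate_predictions(annotated, predicted, max_distance=50):
    """Count true positives, false positives and false negatives by distance."""
    # An annotated TSS is recovered if any prediction lies within max_distance.
    tp = sum(1 for a in annotated
             if any(abs(a - p) <= max_distance for p in predicted))
    fn = len(annotated) - tp
    # A prediction is a false positive if no annotated TSS lies within range.
    fp = sum(1 for p in predicted
             if all(abs(a - p) > max_distance for a in annotated))
    return tp, fp, fn

# Made-up coordinates for illustration.
result = evaluate_predictions(annotated=[100, 500], predicted=[140, 700, 900])
```

On a long sequence the `predicted` list grows with sequence length while `annotated` stays small, which is why false positives dominate genome-scale tests.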

Nonetheless, genome-wide TSS maps are scarce, which makes assessing such predictions infeasible.

Our results highlight the scale of the problem that researchers encounter when they analyze long sequences. We also investigate if models optimized for E. Results of these cross-phylum experiments are presented in Table 4. However, the opposite scenario had a significant impact on sensitivity (Table 4).

Cross-assessment of the models for the other sigma factors failed to reproduce the sensitivity achieved for their intended species. This can perhaps be explained by: (i) significant structural differences between promoters in E. In fact, we tested bTSSfinder and the other three tools on ten other bacterial species belonging to five different phyla: three Firmicutes, four Proteobacteria, one Spirochetes, one Chlamydias and one CFB group.

For details of this comparison, consult the supplementary material (Supplementary Table S4).

Table 4. Result of cross-phylum application of bTSSfinder on the positive dataset. Sensitivity values obtained with the species-specific bTSSfinder parameters (models applied to their intended species) are given in bold.

We observed that some experimentally verified promoters did not pass the prediction thresholds. This may warrant an alternative approach: searching for transcription start regions (TSRs) rather than points. The scoring landscape of experimentally validated TSSs in E. Promoter prediction in prokaryotes is an old problem that has yet to receive an adequate solution. Available tools tend to produce many false positives or have poor sensitivity, especially when applied to long sequences or whole genomes.

These limitations are probably due to the following challenges. Some promoters that are strong in vitro and are predicted computationally with a high score are in fact not used in vivo at all, perhaps due to unknown repression mechanisms (Hertz and Stormo; Huerta and Collado-Vides). Some predicted TSSs may be evaluated as false positives due to the lack of experimentally verified, comprehensive and precise TSS maps.

Scarcity of experimental data also means that training models using features extracted from the limited available data would naturally restrict their predictive power.

All methods, as far as we know, depend on promoter architecture and other physico-chemical properties in their model building. The choice of a negative dataset can be detrimental for the trained model since one cannot be certain about the total absence of TSSs in the negative dataset.

Nonetheless, promoter prediction, especially at the whole genome level, remains unresolved and this warrants further investigations in this field. The authors thank Mohamad Jaber for some of the helpful discussions and feedback.

Afifi A.
Altschul S. Nucleic Acids Res.
Barnett M.
Burden S. Bioinformatics, 21.
Campagne S.
Cardon L.
Dartigalongue C.
Djordjevic M.
Estrem S.
Feklistov A.
Gordon J. Bioinformatics, 22.
Gordon L. Bioinformatics, 19.
Gruber T.
Hertz G. Methods Enzymol.
Huerta A.
Imamura S. Gene Regul.
Jihoon Y.

In all the above-mentioned works the negative set was extracted from non-promoter regions of the genome.

This results in high classification accuracy due to the huge disparity between the positive and negative samples in terms of sequence structure. Additionally, the classification task becomes effortless: for instance, CNN models will rely only on the presence or absence of some motifs at specific positions to decide on the sequence type. It is more than the approximated maximal number of genes in the total human genome.

As an illustration of this issue, we notice that when these models are tested on non-promoter sequences that contain a TATA-box, they misclassify most of these sequences. Therefore, in order to generate a robust classifier, the negative set should be selected carefully, as it determines the features that the classifier will use to discriminate between the classes.

The importance of this idea has been demonstrated in previous works such as Wei et al. In this work, we mainly address this issue and propose an approach that integrates some of the positive class functional motifs in the negative class to reduce the model's dependency on these motifs. The datasets, which are used for training and testing the proposed promoter predictor, are collected from human and mouse.

They contain two distinctive classes of promoters, namely TATA promoters i. For each of these datasets, a negative set (non-promoter sequences) of the same size as the positive one is constructed using the proposed approach, as described in the following section. The details on the numbers of promoter sequences for each organism are given in Table 1.

As a quality control, we used 5-fold cross-validation to assess the proposed model. In each round, three folds are used for training, one fold is used for validation, and the remaining fold is used for testing. Thus, the proposed model is trained five times and the overall performance across the five folds is calculated.
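A minimal sketch of this 3/1/1 rotation is shown below. The way the validation fold is chosen in each round (here, the fold following the test fold) and the fixed shuffling seed are assumptions; the source only states the fold counts.

```python
import random

def five_fold_splits(n_samples, seed=0):
    """Yield (train, validation, test) index lists for five rotating rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for test_id in range(5):
        # Assumption: the validation fold is the one right after the test fold.
        val_id = (test_id + 1) % 5
        train = [idx for fold_id in range(5)
                 if fold_id not in (test_id, val_id)
                 for idx in folds[fold_id]]
        yield train, folds[val_id], folds[test_id]

splits = list(five_fold_splits(50))
```

Each sample appears in the test fold exactly once across the five rounds, so the per-round test metrics can be averaged into one overall score.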

To train a model that can accurately classify promoter and non-promoter sequences, we need to choose the negative set (non-promoter sequences) carefully. This point is crucial for making a model that generalizes well and can therefore maintain its precision when evaluated on more challenging datasets. Previous works, such as Qian et al.

This approach is not entirely sound: if there is no overlap between the positive and negative sets, the model will easily find basic features that separate the two classes. For instance, the TATA motif can be found in all positive sequences at a specific position (normally 28 bp upstream of the TSS, between −30 and −25 bp in our dataset).

Therefore, a randomly created negative set that does not contain this motif will yield high performance on this dataset. However, the resulting model misclassifies negative sequences that contain a TATA motif as promoters. In brief, the major flaw in this approach is that a deep learning model trained this way only learns to discriminate the positive and negative classes based on the presence or absence of some simple features at specific positions, which makes such models impractical.

In this work, we aim to solve this issue by establishing an alternative method to derive the negative set from the positive one. Our method is based on the fact that whenever the features are common between the negative and the positive class the model tends, when making the decision, to ignore or reduce its dependency on these features i. Instead, the model is forced to search for deeper and less obvious features. Deep learning models generally suffer from slow convergence while training on this type of data.

However, this method improves the robustness of the model and ensures generalization. We reconstruct the negative set as follows.

Each positive sequence generates one negative sequence. The positive sequence is divided into 20 subsequences. Then, 12 subsequences are picked randomly and substituted randomly.

The remaining 8 subsequences are conserved. This process is illustrated in Figure 1. Applying this process to the positive set results in new non-promoter sequences with conserved parts from promoter sequences (the unchanged subsequences, 8 subsequences out of 20). This ratio is found to be optimal for obtaining a robust promoter predictor, as explained in section 3.
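The construction described above (20 subsequences, 12 replaced at random, 8 kept) can be sketched as follows. The 300 bp example length, the uniform choice of replacement nucleotides and the handling of any leftover tail are assumptions made for illustration, not details given in the source.

```python
import random

def make_negative(positive, n_parts=20, n_substituted=12, rng=None):
    """Derive one negative sequence from one positive sequence."""
    rng = rng or random.Random()
    part_len = len(positive) // n_parts
    parts = [positive[i * part_len:(i + 1) * part_len] for i in range(n_parts)]
    # Assumption: bases left over when the length is not divisible stay unchanged.
    tail = positive[n_parts * part_len:]
    for idx in rng.sample(range(n_parts), n_substituted):
        # Replace the chosen subsequence with random nucleotides of the same length.
        parts[idx] = "".join(rng.choice("ACGT") for _ in range(part_len))
    return "".join(parts) + tail

toy_positive = "ACGT" * 75  # hypothetical 300 bp promoter sequence
toy_negative = make_negative(toy_positive, rng=random.Random(1))
```

Because at least 8 of the 20 blocks are copied verbatim, motifs such as a TATA-box survive in a fraction of the negatives, which is what forces a classifier to look beyond them.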

The sequence logos of the positive and negative sets for both human and mouse TATA promoter data are shown in Figures 2 and 3, respectively. Therefore, the training is more challenging, but the resulting model generalizes well.

Figure 1. Illustration of the negative set construction method. Green represents the randomly conserved subsequences while red represents the randomly chosen and substituted ones.

Figure 2. The plots show the conservation of the functional motifs between the two sets.

We propose a deep learning model that combines convolution layers with recurrent layers, as shown in Figure 4. The input is one-hot encoded and represented as a one-dimensional vector with four channels. To select the best-performing model, we used a grid search to choose the hyper-parameters.
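The one-hot input representation mentioned above can be sketched in a few lines. The channel order (A, C, G, T) and the all-zero encoding of ambiguous bases such as N are assumptions; the source does not specify either.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def one_hot_encode(sequence):
    """Encode a DNA string as a list of four-channel rows, one row per base."""
    # Unknown symbols (for example N) map to an all-zero row by assumption.
    return [list(ONE_HOT.get(base, (0, 0, 0, 0))) for base in sequence.upper()]

encoded = one_hot_encode("ACGTN")
```

A sequence of length L therefore becomes an L x 4 array, which is the shape a one-dimensional convolution layer with four input channels expects.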

The tuned hyper-parameters are the number of convolution layers, the kernel size, the number of filters in each layer, the size of the max pooling layer, the dropout probability, and the number of units in the Bi-LSTM layer. The proposed model starts with multiple convolution layers that are arranged in parallel and help in learning the important motifs of the input sequences with different window sizes.

We use three convolution layers for non-TATA promoters with window sizes of 27, 14, and 7, and two convolution layers for TATA promoters with window sizes of 27, All convolution layers are followed by the ReLU activation function (Glorot et al.). Then, the outputs of these layers are concatenated and fed into a bidirectional long short-term memory (BiLSTM; Schuster and Paliwal) layer with 32 nodes in order to capture the dependencies between the motifs learnt by the convolution layers.

Then we add two fully connected layers for classification. The first one has nodes and is followed by ReLU and dropout with a probability of 0. This is achieved through the LSTM structure, which is composed of a memory cell and three gates called the input, output, and forget gates.

These gates are responsible for regulating the information in the memory cell. In addition, utilizing the LSTM module increases the network depth while the number of the required parameters remains low. Having a deeper network enables extracting more complex features and this is the main objective of our models as the negative set contains hard samples.

The Keras framework is used for constructing and training the proposed models (Chollet). The Adam optimizer (Kingma and Ba) is used for updating the parameters with a learning rate of 0. The batch size is set to 32 and the number of epochs is set to Early stopping is applied based on validation loss. In this work, we use widely adopted metrics to evaluate the performance of the proposed models.

These metrics are precision, recall, and the Matthews correlation coefficient (MCC), and they are defined as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP (true positive) is the number of correctly identified promoter sequences, TN (true negative) is the number of correctly rejected non-promoter sequences, FP (false positive) is the number of non-promoter sequences incorrectly identified as promoters, and FN (false negative) is the number of incorrectly rejected promoter sequences.

When analyzing previously published works on promoter sequence identification, we noticed that their performance depends greatly on how the negative dataset was prepared.

They performed very well on the datasets that their authors prepared; however, they have a high false positive rate when evaluated on a more challenging dataset that includes non-promoter sequences sharing common motifs with promoter sequences.

For instance, in the case of the TATA promoter dataset, randomly generated sequences will not have a TATA motif between −30 and −25 bp, which in turn makes the classification task easier. In other words, their classifiers depended on the presence of the TATA motif to identify promoter sequences and, as a result, it was easy to achieve high performance on the datasets they had prepared.

However, their models failed dramatically when dealing with negative sequences that contained a TATA motif (hard examples). The precision dropped as the false positive rate increased.

Put simply, they classified these sequences as promoter sequences. A similar analysis is valid for the other promoter motifs. Therefore, the main purpose of our work is not only to achieve high performance on a specific dataset but also to improve the model's ability to generalize by training on a challenging dataset. To illustrate this point further, we train and test our model on the human and mouse TATA promoter datasets with different methods of negative set preparation.

The first experiment is performed using randomly sampled negative sequences from non-coding regions of the genome i. These high results are expected, but the question is whether this model can maintain the same performance when evaluated on a dataset that has hard examples. The answer, based on analyzing the prior models, is no. The second experiment is performed using our proposed method for preparing the dataset as explained in section 2.

This ensures that our model learns more complex features rather than learning only the presence or absence of the TATA-box.

Over the past years, many promoter region prediction tools have been proposed (Hutchinson; Scherf et al.). However, some of these tools are not publicly available for testing, and some of them require more information than the raw genomic sequences.

In this study, we compare the performance of our proposed models with the current state-of-the-art work, CNNProm, which was proposed by Umarov and Solovyev as shown in Table 2.

On the other hand, our models are able to deal with these cases more successfully, and their false positive rate is lower than that of CNNProm. For further analysis, we study the effect of altering the nucleotide at each position on the output score. We focus on the region between −40 and 10 bp, as it hosts the most important part of the promoter sequence. Blue represents a drop in the output score due to a mutation, while red represents an increase in the score due to a mutation.

We notice that altering the nucleotides to C or G in the region between −30 and −25 bp reduces the output score significantly. This region contains the TATA-box, which is a very important functional motif in the promoter sequence. Thus, our model successfully identifies the importance of this region. At the remaining positions, C and G nucleotides are preferred over A and T, especially in the case of the mouse.
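This kind of position-by-position mutation analysis can be sketched generically as below. The scoring function is passed in, so a trained model's prediction function could be plugged in. The toy scorer used here, which simply checks for a TATA substring, is a stand-in invented for the example and is not the DeePromoter model.

```python
def mutation_effects(sequence, score_fn):
    """Score change caused by every single-nucleotide substitution."""
    base_score = score_fn(sequence)
    effects = []
    for pos, original in enumerate(sequence):
        row = {}
        for base in "ACGT":
            if base == original:
                row[base] = 0.0  # no change for the reference nucleotide
            else:
                mutant = sequence[:pos] + base + sequence[pos + 1:]
                row[base] = score_fn(mutant) - base_score
        effects.append(row)
    return effects

def toy_score(seq):
    """Hypothetical scorer: 1.0 when a TATA motif is present, else 0.0."""
    return 1.0 if "TATA" in seq else 0.0

effects = mutation_effects("GGTATAGG", toy_score)
```

Negative entries correspond to the blue cells of the heat maps (the score drops after mutation) and positive entries to the red ones.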

This can be explained by the fact that the promoter region has more C and G nucleotides than A and T (Shi and Zhou).

Accurate prediction of promoter sequences is essential for understanding the underlying mechanism of the gene regulation process.

In this work, we were particularly interested in constructing a hard negative set that drives the models toward exploring the sequence for deep and relevant features, instead of only distinguishing promoter from non-promoter sequences based on the existence of some functional motifs. The main benefit of using DeePromoter is that it significantly reduces the number of false positive predictions while achieving high accuracy on challenging datasets.

DeePromoter outperformed the previous method not only in the performance but also in overcoming the issue of high false positive predictions. It is projected that this framework might be helpful in drug-related applications and academia. MO and ZL prepared the dataset, conceived the algorithm, and carried out the experiment and analysis.

All authors discussed the results and contributed to the final manuscript. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Alipanahi, B. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.
Angermueller, C. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol.
Baker, T. Benjamin-Cummings Publishing Company.
Behjati, S. What is next generation sequencing? Childhood Educ.
Bharanikumar, R. PeerJ 6:e
Chollet, F. Astrophysics Source Code Library.
Dahl, J. A rapid micro chromatin immunoprecipitation assay ChIP.
Davuluri, R. Computational identification of promoters and first exons in the human genome.
Down, T. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res.
Dreos, R. EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res.
Glorot, X.
Hutchinson, G. The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Bioinformatics 12.
Ioshikhes, I. Large-scale human promoter mapping using CpG islands.
Juven-Gershon, T. The RNA polymerase II core promoter - the gateway to transcription.


