Abstract
Motivation: Computational RNA structure prediction is a mature importantproblem that has received a new wave of attention with thediscovery of regulatory non-coding RNAs and the advent of highthroughputtranscriptome sequencing. Despite nearly two scoreyears of research on RNA secondary structure and RNA–RNA interactionprediction, the accuracy of the state-of-the-art algorithms arestill far from satisfactory. So far, researchers have proposed increasinglycomplex energy models and improved parameter estimationmethods, experimental and/or computational, in anticipation ofendowing their methods with enough power to solve the problem.The output has disappointingly been only modest improvements, notmatching the expectations. Even recent massively featured machinelearning approaches were not able to break the barrier. Why is that?Approach: The first step toward high-accuracy structure prediction isto pick an energy model that is inherently capable of predicting eachand every one of known structures to date. In this article, we introducethe notion of learnability of the parameters of an energy model as ameasure of such an inherent capability. We say that the parameters ofan energy model are learnable iff there exists at least one set of suchparameters that renders every known RNA structure to date the minimumfree energy structure. We derive a necessary condition for thelearnability and give a dynamic programming algorithm to assess it.Our algorithm computes the convex hull of the feature vectors of allfeasible structures in the ensemble of a given input sequence.Interestingly, that convex hull coincides with the Newton polytope ofthe partition function as a polynomial in energy parameters. To thebest of our knowledge, this is the first approach toward computing theRNA Newton polytope and a systematic assessment of the inherentcapabilities of an energy model. The worst case complexity of ouralgorithm is exponential in the number of features. However, dimensionalityreduction techniques can provide approximate solutions toavoid the curse of dimensionality.Results: We demonstrated the application of our theory to a simpleenergy model consisting of a weighted count of A-U, C-G and G-Ubase pairs. Our results show that this simple energy model satisfiesthe necessary condition for more than half of the input unpseudoknottedsequence–structure pairs (55%) chosen from the RNASTRAND v2.0 database and severely violates the condition for~13%, which provide a set of hard cases that require further investigation.From 1350 RNA strands, the observed 3D feature vector for749 strands is on the surface of the computed polytope. For 289 RNAstrands, the observed feature vector is not on the boundary of thepolytope but its distance from the boundary is not more than one.A distance of one essentially means one base pair difference betweenthe observed structure and the closest point on the boundary of thepolytope, which need not be the feature vector of a structure. For171 sequences, this distance is larger than two, and for only 11sequences, this distance is larger than five.
[Less...]