| Title: | Are grammatical representations useful for learning from biological sequence data? - a case study |
| Authors: | C.H. Bryant, S.H. Muggleton, A. Srinivasan, A. Whittaker, S. Topp, and C. Rawlings |
| Series: | Linköping Electronic Articles in Computer and Information Science ISSN 1401-9841 |
| Issue: | Vol. 6 (2001), No. 013 |
| URL: | http://www.ep.liu.se/ea/cis/2001/013/ |
| Abstract: | This paper investigates whether Chomsky-like grammar representations
are useful for learning cost-effective, comprehensible predictors of members
of biological sequence families. The Inductive Logic Programming (ILP) Bayesian
approach to learning from positive examples is used to generate a grammar
for recognising a class of proteins known as human neuropeptide precursors
(NPPs). Collectively, five of the co-authors of this paper have extensive
expertise in NPPs and general bioinformatics methods. Their motivation for
generating an NPP grammar was that none of the existing bioinformatics methods
could provide sufficient cost savings during the search for new NPPs. Prior
to this project, experienced specialists at SmithKline Beecham had tried
for many months to hand-code such a grammar, but without success. Our best
predictor makes the search for novel NPPs more than 100 times more efficient
than randomly selecting proteins for synthesis and testing them for biological
activity. As far as the authors are aware, this is both the first biological
grammar learnt using ILP and the first real-world scientific application
of the ILP Bayesian approach to learning from positive examples.
A group of features is derived from this grammar. Other groups of features
of NPPs are derived using other learning strategies. Amalgams of these
groups are formed. A recognition model is generated for each amalgam using
C4.5 and C4.5rules and its performance is measured using both predictive
accuracy and a new cost function, Relative Advantage ( |
|---|
| Original publication 2001-08-30 |
|---|