An oft-repeated phrase in both natural and machine learning is that we learn at the boundaries of what we already know. A corollary is that we need to know a lot to learn something truly novel. Our research deals with incorporating prior knowledge into search algorithms to find interesting and novel rules in medicine and science.
Starting from the premise that causal influence is the "natural" way to interpret rules, we introduce some useful syntactic and semantic constraints on rule discovery. We show how increasingly stronger forms of background knowledge might be used to increase the chance that discovered rules are understandable, interesting and novel. We show this for both individual rules and sets of rules in the domains of medicine, biochemistry and geophysics.
Differential equations are among the formalisms most commonly used to represent knowledge about, and to reason about, dynamic systems. We consider the task of discovering ordinary and partial differential equations from behavior traces. The basic ideas behind a number of algorithms for these tasks will be outlined, and a representative sample of (re)discovered models will be presented.
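To make the task concrete, here is a minimal sketch of one way such discovery can be framed (it is not one of the algorithms covered in the talk): derivatives are estimated from a behavior trace by finite differences, and a small, illustrative library of candidate terms is fitted by least squares.

\begin{verbatim}
# Sketch of equation discovery from a behavior trace (illustrative, not the
# algorithms covered in the talk): estimate dx/dt by finite differences and
# fit a library of candidate terms by least squares.
import numpy as np

def discover_ode(t, x, candidate_terms):
    """t, x: arrays sampled from a behavior trace;
    candidate_terms: list of (label, function of x) pairs."""
    dxdt = np.gradient(x, t)                              # numerical derivative
    basis = np.column_stack([f(x) for _, f in candidate_terms])
    coeffs, *_ = np.linalg.lstsq(basis, dxdt, rcond=None)
    return {label: c for (label, _), c in zip(candidate_terms, coeffs)}

# Logistic trajectory satisfying dx/dt = x - 0.5*x^2.
t = np.linspace(0.0, 10.0, 200)
x = 2.0 / (1.0 + np.exp(-t))
terms = [("x", lambda v: v),
         ("x^2", lambda v: v ** 2),
         ("1", lambda v: np.ones_like(v))]
print(discover_ode(t, x, terms))   # roughly {'x': 1.0, 'x^2': -0.5, '1': 0.0}
\end{verbatim}

Run on this logistic trajectory, the fit recovers coefficients close to 1.0 and -0.5, i.e. the structure of the generating equation.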
Any artificial intelligence tool that takes an input-output approach to modeling nonlinear dynamic physical systems must represent, apply, and reason with many heterogeneous kinds of knowledge about the real world. The program PRET automates this process by leveraging several common engineering formalisms and techniques. This talk will address two of these. The first, input-output modeling, applies a test input to a system, analyzes the resulting data, and learns something useful from the cause-effect pair. This approach requires encoding knowledge not only of intelligent data analysis but also of autonomous actuator control. The second formalism, incorporating domain knowledge about a target system, is also critical, as it reduces the search space of possible models. However, encoding and applying this knowledge is difficult, as its form and scope range from general mathematics that applies in all situations to very specific knowledge that is only useful in limited circumstances. The challenges in automating this process are to smoothly incorporate these varying levels and types of knowledge and to apply the appropriate reasoning techniques at the right place and time. At all stages and levels of the modeling process, PRET's formalisms reflect the language and conceptual structures used in engineering practice and are therefore naturally communicable to practicing engineers.
In this work, we focus on the role of completeness in theory development in particle physics. In the early 1930s, physicists were puzzled by certain observations about elementary particles, which they could not explain. They had an incomplete theory of particle physics at hand, and this led them to introduce new hypotheses into the theory so that they could explain the observations. This completeness-driven process, together with other processes driven by consistency and symmetry, resulted in the development of the theory of particle physics. Based on our discovery model BR-4, we dwell here on two episodes in this process: the discovery of the elementary particle called the neutrino, and the discovery of the quantum property called baryon number. Our exposition covers the general problem the physicists faced prior to the discoveries; the knowledge available to them at the time, in the form of general principles and scientific data about the elementary particles and their reactions; and the way the theory developed after the discoveries, in terms of the knowledge gained, all expressed in forms readily understandable by the scientists.
Problems of empirical discovery entail the automatic creation of a mathematical structure that represents real-world processes and the observed data that they produce. Graphs (directed or undirected) are often useful structures for representing such real-world processes and data. Typically, the problem of empirical discovery entails discovering the size of the graphical structure, the topological arrangement of its elements, and the numerical values associated with those elements. Recent work has demonstrated that genetic programming is capable of automatically creating the overall size, topology, and numerical parameter values of complex networks whose behavior is modeled by continuous-time differential equations (both linear and non-linear) and whose behavior matches prespecified output values. The talk will present examples of automatically synthesizing both the topology and numerical parameter values for analog electrical circuits, controllers, and networks of chemical reactions from observed time-domain data or other requirements. In these examples, we start with observed time-domain concentrations of input substances and automatically create both the topology of the network of chemical reactions and the rates of each reaction within the network such that the concentration of the final product of the automatically created network matches the observed time-domain data. This talk describes the automatic creation of a metabolic pathway involving four chemical reactions that takes in glycerol and fatty acid as input, uses ATP as a cofactor, and produces diacyl-glycerol as its final product. In addition, this talk also describes the automatic creation of a metabolic pathway involving three chemical reactions for the synthesis and degradation of ketone bodies. Both automatically created metabolic pathways contain at least one instance of three noteworthy topological features, namely an internal feedback loop, a bifurcation point where one substance is distributed to two different reactions, and an accumulation point where one substance is accumulated from two sources.
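To give a flavor of how a candidate network can be scored during such a run, the sketch below (a simplification, not the system described in the talk) simulates a hypothetical two-step chain (A to B to C) under mass-action kinetics and measures how well the product concentration matches an observed time series; a genetic-programming run would minimize this error while also varying the network topology. The species and rate names are illustrative.

\begin{verbatim}
# Sketch of a fitness evaluation inside such a genetic-programming loop
# (illustrative, not the system described in the talk): a candidate two-step
# chain A -> B -> C with mass-action rates k1, k2 is simulated by Euler
# integration and scored against an observed concentration series for C.
import numpy as np

def simulate_product(k1, k2, a0=1.0, dt=0.01, steps=1000):
    a, b, c = a0, 0.0, 0.0
    trace = []
    for _ in range(steps):
        da = -k1 * a                   # A is consumed by the first reaction
        db = k1 * a - k2 * b           # B is produced, then consumed
        dc = k2 * b                    # C accumulates
        a, b, c = a + dt * da, b + dt * db, c + dt * dc
        trace.append(c)
    return np.array(trace)

def fitness(candidate, observed):
    """Sum of squared errors; a GP run would minimize this while also
    varying the topology of the candidate reaction network."""
    return float(np.sum((simulate_product(*candidate) - observed) ** 2))

# Hypothetical data generated with k1=0.8, k2=0.3: the matching candidate
# scores (near) zero, a mismatched one scores much worse.
observed = simulate_product(0.8, 0.3)
print(fitness((0.8, 0.3), observed), fitness((0.2, 0.9), observed))
\end{verbatim}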
Most research in bioinformatics relies on knowledge-lean methods like clustering to analyze the growing amounts of biological data, but the results seldom make direct contact with existing theories of biological processes. In this talk, I describe an approach to computational discovery that represents partial models of known organisms, utilizes these models to detect anomalies and suggest revisions, and combines data from DNA microarray studies with background knowledge to evaluate alternative hypotheses. I report some initial results in constructing metabolic models for a simple organism in this manner and I outline plans to adapt the approach to inferring processes of gene regulation. I also discuss the need for both concrete and abstract terms in such models in order to make them communicable to biologists.
This talk describes joint work with Jennifer Cross, Andrew Pohorille, Jeff Shrager, and Nancy Zhang.
The pharmaceutical industry is increasingly overwhelmed by large volumes of data. This is generated internally as a side-effect of screening tests and combinatorial chemistry, as well as externally from sources such as the human genome project. The industry is also becoming predominantly knowledge-driven. For instance, knowledge is required within computational chemistry for pharmacophore identification, as well as for determining biological function using sequence analysis. From a computer science point of view, the knowledge requirements within the industry give higher emphasis to ``knowing that'' (declarative or descriptive knowledge) than to ``knowing how'' (procedural or prescriptive knowledge). Within the philosophy and computer science literature, mathematical logic has always been the preferred representation for declarative knowledge, and thus knowledge discovery techniques are required which generate logical formulae from data. Inductive Logic Programming (ILP) is such a technique. This talk will review the results of the last few years' academic pilot studies involving the application of ILP to the prediction of protein secondary structure, mutagenicity, and structure-activity relationships. While predictive accuracy is the central performance measure of data analysis techniques which generate procedural knowledge (neural nets, decision trees, etc.), the performance of an ILP system is determined by both accuracy and the degree of stereochemical insight provided. ILP hypotheses can be easily stated in English and exemplified diagrammatically. This allows cross-checking with the relevant biological and chemical literature. In several of the comparative trials presented, ILP systems provided significant chemical and biological insights where other data analysis techniques did not.
CASA is a complex computational model of the Earth's ecosystem that makes predictions about quantitative variables like net primary production from variables like temperature, precipitation, and vegetation type. This talk reports progress on the task of improving CASA's equations given data about these variables. After explaining the general problem, we outline an approach to equation revision that involves transforming the equations into a neural network, revising weights in that network, and transforming the network back into equations. We also report the results of initial experiments that used this method to find improved values for the intrinsic property associated with vegetation type, which plays a central role in the model.
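The sketch below shows the revision idea in miniature; the equation and its parameter are illustrative placeholders rather than CASA's actual model. A single multiplicative parameter is treated as a trainable weight, revised by gradient descent on observed data, and read back into the equation.

\begin{verbatim}
# Sketch of the revision step (the equation and its parameter are illustrative
# placeholders, not CASA's actual model): the equation y = e_max * f(T, P) is
# viewed as a one-weight network, e_max is revised by gradient descent on the
# data, and the revised value is read back into the equation.
import numpy as np

def revise_parameter(f_values, y_obs, e_max=1.0, lr=0.1, epochs=200):
    """f_values: precomputed f(T, P) per observation; y_obs: observed outputs."""
    for _ in range(epochs):
        residual = e_max * f_values - y_obs
        grad = 2.0 * np.mean(residual * f_values)   # d(MSE)/d(e_max)
        e_max -= lr * grad
    return e_max

# Synthetic check: data generated with e_max = 0.56 is recovered from a poor start.
rng = np.random.default_rng(0)
f_values = rng.uniform(0.2, 1.0, size=100)
y_obs = 0.56 * f_values
print(revise_parameter(f_values, y_obs))   # approximately 0.56
\end{verbatim}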
This talk describes how we used regression rules to improve upon a result previously published in the Earth science literature. In such a scientific application of machine learning, it is crucially important for the learned models to be understandable and communicable. We recount how we selected a learning algorithm to maximize communicability, and then describe two visualization techniques that we developed to aid in understanding the model by exploiting the spatial nature of the data. We also report how evaluating the learned models across time let us discover an error in the data.
I will describe PRET, a program that automates system identification, the process of finding a mathematical model of a dynamical black-box system. PRET performs both structural identification and parameter estimation by integrating qualitative reasoning, numerical simulation, geometric reasoning, constraint reasoning, backward chaining, reasoning with abstraction levels, declarative meta-level control, and truth maintenance. PRET builds models in the form of ordinary differential equations, one of the common engineering formalisms for describing dynamical systems. Furthermore, PRET uses a communicable vocabulary to represent properties of models, dynamical systems, and their behavior. PRET's concepts range from qualitative properties such as ``the system's behavior is a damped oscillation'' to less-abstract properties such as numeric time series.
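As a small illustration of bridging these two levels of description (a sketch only, not PRET's implementation), a numeric time series can be tested for the qualitative property ``damped oscillation'' by checking that its successive peak amplitudes decrease.

\begin{verbatim}
# Sketch (not PRET's implementation) of linking numeric and qualitative
# descriptions: a time series is labelled a damped oscillation if it has
# several local maxima whose amplitudes strictly decrease.
import numpy as np

def is_damped_oscillation(x, min_peaks=3):
    peaks = [x[i] for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1]]
    return len(peaks) >= min_peaks and all(a > b for a, b in zip(peaks, peaks[1:]))

t = np.linspace(0.0, 20.0, 2000)
print(is_damped_oscillation(np.exp(-0.2 * t) * np.cos(3.0 * t)))  # True
print(is_damped_oscillation(np.exp(0.1 * t) * np.cos(3.0 * t)))   # False: growing
\end{verbatim}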
The talk will focus on using background knowledge to define the space of possible structures of differential equations in equation discovery. The use of context-free grammars will be considered first. We then describe a recent development that uses domain knowledge in population dynamics modeling: the domain knowledge consists of domain-specific webs of interactions between populations together with general knowledge about population dynamics modeling. This knowledge is used to generate context-dependent grammars, which in turn guide equation discovery.
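As an illustration of the first idea, the sketch below enumerates candidate right-hand-side structures for an equation such as dN1/dt in a two-population model; the grammar itself is illustrative rather than the one used in the talk.

\begin{verbatim}
# Sketch of defining the structure space with a context-free grammar (the
# grammar below is illustrative, not the one from the talk): enumerate
# candidate right-hand sides for a two-population model.
import itertools

GRAMMAR = {
    "Expr": [["Term"], ["Term", "+", "Expr"]],
    "Term": [["const", "*", "Var"], ["const", "*", "Var", "*", "Var"]],
    "Var":  [["N1"], ["N2"]],            # the two interacting populations
}

def expand(symbol, depth):
    """All token strings derivable from `symbol` within `depth` rule applications."""
    if symbol not in GRAMMAR:
        return [[symbol]]                 # terminal symbol
    if depth == 0:
        return []
    results = []
    for production in GRAMMAR[symbol]:
        parts = [expand(s, depth - 1) for s in production]
        if all(parts):
            for combo in itertools.product(*parts):
                results.append([tok for part in combo for tok in part])
    return results

# Prints structures such as "const * N1 + const * N1 * N2".
for structure in expand("Expr", 4):
    print(" ".join(structure))
\end{verbatim}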
Experience in both psychology and AI has shown that heuristic search in problem spaces is the best current framework for making general statements about scientific discovery by people and machines. I will lay out the recipe my collaborators and I have followed in building a number of AI programs - both data-driven and expertise-driven - that discover knowledge of a form that humans find conventional, or at least highly understandable.
This talk discusses the conditions for law equations to be communicable knowledge. First, we divide the conditions into two types: generic conditions on law equations and domain-dependent conditions for communicable law equations. We enumerate the necessary conditions we have identified so far and mathematically formalize some of the important ones. Through this discussion, the characteristics of law equations become clearer to some extent. Finally, a model of communicable knowledge discovery is proposed based on this discussion. The task of discovery is known to be a matter not only of learning and data mining, but also of model composition, belief revision, consistency checking, model diagnosis, knowledge representation, reasoning about background and empirical knowledge, and computer-human collaboration.
Spatial Aggregation is an architecture for abstracting global structures from a distributed, point-wise data representation. Examples include weather data analysis, tracking the evolution of diffusion-reaction patterns in the study of biological and chemical systems, and distributed control optimization in engineering. I will describe the general idea of Spatial Aggregation using these three domain examples. Finally, I will describe a programming language called SAL that embodies these ideas; SAL/C++ is a freely available software package supporting rapid prototyping of applications in the style of Spatial Aggregation.