This book reviews some of the underlying technologies and also some recent applications in a number of fields. In a world increasingly overloaded with data of varying quality, not least via the Internet, computerised tools are becoming essential to "mine" useful information from the mass of data available.
Inspec keywords: neural nets; electricity supply industry; scientific information systems; medical computing; data mining; geophysics computing
Other keywords: data mining; technical issues; medical diagnosis; neural networks; electricity supply industry; organic chemistry; knowledge discovery; weather forecasting
Subjects: Public utility administration; Neural computing techniques; Knowledge engineering techniques; Biology and medical computing; Database management systems (DBMS); Geophysics computing; Business applications of IT; Public utilities
In concept learning, the features used to describe examples can have a drastic effect on the learning algorithm's ability to acquire the target concept. In many poorly understood domains, the representation can be described as low level: examples are described in terms of a large number of small measurements. No single measurement is strongly correlated with the target concept, yet all the information needed for classification is believed to be present; patterns are harder to identify because they are conditional. This is in contrast to problems where a small number of attributes are highly predictive of the concept. A. Clark and C. Thornton (1997) call these type-2 and type-1 problems, respectively. Many current approaches perform very poorly on type-2 problems because the biases they employ are poorly tuned to the underlying concept. We discuss why it is desirable to estimate concept difficulty before any learning takes place. We then describe some current approaches to learning type-2 problems, together with several measures used to estimate particular sources of difficulty and their advantages and disadvantages, and present an estimate based on the Δj measure (K. Nazar and M.A. Bramer, 1997) which addresses many of the shortcomings of previous approaches.
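As a rough illustration of the distinction (a minimal sketch of ours, not the Δj measure itself), the following Python fragment applies a crude per-attribute dependence score to a parity concept and to a concept that simply copies one attribute; all names and data are illustrative:

    from itertools import product

    def single_attribute_score(examples, attr):
        # Crude dependence score: |P(class=1 | attr=1) - P(class=1 | attr=0)|.
        def p(cond):
            rows = [c for x, c in examples if x[attr] == cond]
            return sum(rows) / len(rows)
        return abs(p(1) - p(0))

    bits = list(product([0, 1], repeat=4))
    type1 = [(x, x[0]) for x in bits]        # class copies attribute 0
    type2 = [(x, sum(x) % 2) for x in bits]  # parity: purely conditional patterns

    for name, data in [("type-1", type1), ("type-2", type2)]:
        print(name, [round(single_attribute_score(data, a), 2) for a in range(4)])
    # type-1: attribute 0 scores 1.0; type-2: every attribute scores 0.0

On the parity data every attribute scores zero even though the class is fully determined by the attributes jointly, which is why biases tuned to per-attribute predictiveness fail on type-2 concepts.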
The handling of anomalous or outlying observations in a dataset is one of the most important tasks in data preprocessing, for three reasons. First, outlying observations can have a considerable influence on the results of an analysis. Secondly, although outliers are often measurement or recording errors, some of them can represent phenomena of interest, significant from the viewpoint of the application domain. Thirdly, for many applications, the exceptions identified can lead to the discovery of unexpected knowledge. We propose an algorithm for outlier analysis in which outliers detected by statistical methods, and the context in which they occur, are examined by acquiring and applying relevant domain knowledge. In particular, we try to establish domain-specific hypotheses which may explain the outlying data points.
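The two-stage idea can be sketched as follows (illustrative Python with placeholder domain rules, not the knowledge base used in the chapter): statistical detection flags candidates, then domain knowledge is applied to explain them:

    from statistics import mean, stdev

    def statistical_outliers(values, k=2.0):
        m, s = mean(values), stdev(values)
        return [i for i, v in enumerate(values) if abs(v - m) > k * s]

    def explain(record, rules):
        # Return the first domain hypothesis that accounts for the record.
        for hypothesis, applies in rules:
            if applies(record):
                return hypothesis
        return None  # unexplained: a candidate error, or a genuine discovery

    readings = [{"temp": t, "sensor": "A"} for t in [20, 21, 19, 20, 22, 55]]
    rules = [("sensor A drifts when hot",
              lambda r: r["sensor"] == "A" and r["temp"] > 50)]
    for i in statistical_outliers([r["temp"] for r in readings]):
        print(readings[i], "->", explain(readings[i], rules))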
This chapter presents an approach to rule discovery in which the strategy of targeting a restricted class of rules is combined with a technique for their efficient discovery. Attribute values in the dataset are distributed among the outcome classes in such a way that the attribute values associated with an outcome class are more likely, on a heuristic basis, to appear as conditions of a discovered rule with that outcome class on the right-hand side (RHS). The discovered rules are exact rules, with additional properties depending on the heuristic used to distribute attribute values among the outcome classes.
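A minimal sketch of the distribution idea, with illustrative data and a frequency-based heuristic of our own choosing, might look as follows:

    from collections import Counter, defaultdict

    def distribute_values(rows, classes):
        # Assign each (attribute, value) pair to its most frequent outcome class.
        counts = defaultdict(Counter)
        for row, c in zip(rows, classes):
            for a, v in row.items():
                counts[(a, v)][c] += 1
        return {av: cnt.most_common(1)[0][0] for av, cnt in counts.items()}

    def is_exact(rows, classes, conds, target):
        # An exact rule: every row satisfying the conditions has the target class.
        covered = [c for row, c in zip(rows, classes)
                   if all(row.get(a) == v for a, v in conds)]
        return bool(covered) and all(c == target for c in covered)

    rows = [{"outlook": "sunny", "windy": "no"},
            {"outlook": "rain", "windy": "yes"},
            {"outlook": "sunny", "windy": "yes"}]
    classes = ["play", "stay", "play"]
    assignment = distribute_values(rows, classes)
    candidates = [av for av, c in assignment.items() if c == "play"]
    print(candidates, is_exact(rows, classes, [("outlook", "sunny")], "play"))

Only values assigned to a class are tried as conditions for rules with that class on the RHS, which narrows the search while keeping the discovered rules exact.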
This chapter describes two major themes: partial values and database background knowledge. First, we show that the partial-value data model is a useful extension to the relational-database model. As well as increasing the expressivity of the data model, partial values allow the data miner to deal with concept hierarchies rigorously. We show how an iterative procedure allows us to calculate aggregate proportions for a database table; the procedure is well founded in statistical theory, being a maximum-likelihood estimator. It is possible that the Newton-Raphson procedure described by J.M. Jamshidian and R.I. Jennrich (1997) will provide a means of speeding up the solution of the maximum-likelihood equations. Secondly, we demonstrate how background knowledge about the database can be used, indicating how to reengineer the database using logic programming and integrity constraints. The aggregation algorithms are extended to the multiattribute case, and we show how they are computed where integrity constraints limit the allowed combinations of attribute values.
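The flavour of the iterative maximum-likelihood procedure can be conveyed by an EM-style sketch (ours, not the chapter's algorithm verbatim) in which a partial value is a set of possible categories:

    def estimate_proportions(records, categories, iters=50):
        # EM-style iteration: each record is a set of possible categories.
        p = {c: 1.0 / len(categories) for c in categories}  # uniform start
        for _ in range(iters):
            expected = {c: 0.0 for c in categories}
            for possible in records:
                z = sum(p[c] for c in possible)
                for c in possible:
                    expected[c] += p[c] / z        # E-step: share the record out
            p = {c: expected[c] / len(records) for c in categories}  # M-step
        return p

    # Three records are known exactly; two are partial (either category possible).
    records = [{"car"}, {"car"}, {"bike"}, {"car", "bike"}, {"car", "bike"}]
    print(estimate_proportions(records, ["car", "bike"]))  # car -> 2/3, bike -> 1/3

Each partial record is shared among its possible categories in proportion to the current estimates, and the proportions converge to the maximum-likelihood values.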
Data mining is a field which potentially offers knowledge that is not explicitly stored for a particular application domain. In most application areas that have been studied for data mining, the time at which something happened is also known and recorded (e.g. the date and time when a point-of-sale transaction took place, or when a patient's temperature was taken). Most existing approaches, however, take a static view of an application domain, so that the discovered knowledge is considered to be valid indefinitely on the time line. If data mining is to be used as a vehicle for better decision making, the existing approaches will in most cases lead to results that are neither significant nor interesting. Consider, for example, a possible association between butter and bread (i.e. people who buy butter also buy bread) among the transactions of a supermarket. If someone looks at all transactions available, say for the past ten years, that association might, with a certain confidence, be true. If, however, the highest concentration of people who bought butter and bread is found up to five years ago, then the association is not significant for the supermarket's present and future. To address this, a temporal framework for data mining is proposed; within this framework, an SQL-like mining language is also proposed, with which any temporal data-mining task can easily be expressed.
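The temporal point can be made concrete with a small sketch (the transaction layout and names are assumptions of ours, not the proposed language): computing an association's support per time window shows whether it still holds:

    from datetime import date

    def windowed_support(transactions, itemset, windows):
        # Support of `itemset` inside each (start, end) window.
        out = {}
        for start, end in windows:
            in_window = [items for d, items in transactions if start <= d < end]
            hits = sum(itemset <= items for items in in_window)
            out[(start, end)] = hits / len(in_window) if in_window else 0.0
        return out

    transactions = [(date(2014, 3, 1), {"bread", "butter"}),
                    (date(2014, 5, 2), {"bread", "butter"}),
                    (date(2019, 6, 3), {"bread"}),
                    (date(2019, 7, 4), {"milk"})]
    windows = [(date(2014, 1, 1), date(2015, 1, 1)),
               (date(2019, 1, 1), date(2020, 1, 1))]
    print(windowed_support(transactions, {"bread", "butter"}, windows))
    # full support in the old window, zero in the recent one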
Data mining and online analytical processing (OLAP) are two complementary techniques for analysing large amounts of data in data-warehouse environments to support decision-support queries. In this chapter we examine the gap between these two techniques, propose a feedback sandwich model to combine OLAP and data mining, and propose an integrated architecture. Our model and architecture differ from other proposals in three ways: they address the overall process of OLAP and data mining; they offer flexibility for both loosely coupled and tightly coupled combinations of the two (since no particular structure is imposed on the extended OLAP/data-mining engine); and they allow discovered knowledge to be fed back to enhance future OLAP and data mining. We hope that the opinions presented in this chapter will stimulate further research on the integration of OLAP and data mining.
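A deliberately generic sketch of the feedback loop, with stand-in functions of our own rather than the chapter's engine, might look like this:

    def olap_aggregate(cube, dims, focus=None):
        # Group-by style rollup over `dims`, optionally restricted by feedback.
        rows = cube if focus is None else [r for r in cube if focus(r)]
        agg = {}
        for r in rows:
            key = tuple(r[d] for d in dims)
            agg[key] = agg.get(key, 0) + r["sales"]
        return agg

    def mine(agg, threshold):
        # Toy miner: flag cells whose total exceeds a threshold.
        return [key for key, total in agg.items() if total > threshold]

    cube = [{"region": "N", "year": 2023, "sales": 120},
            {"region": "S", "year": 2023, "sales": 30},
            {"region": "N", "year": 2024, "sales": 150}]
    agg = olap_aggregate(cube, ["region"])              # OLAP pass
    hot = mine(agg, threshold=100)                      # mining pass
    focused = olap_aggregate(cube, ["region", "year"],  # feedback pass
                             focus=lambda r: (r["region"],) in hot)
    print(hot, focused)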
This chapter describes a number of empirical studies in the use of the data mining approach to the analysis of health information. The examples serve to highlight the factors perceived as influencing the success or otherwise of the data mining approach in each case and to illustrate the difficulties which may be encountered during the data mining process and how they may be overcome.
At some time in their lives, 60-80 percent of the population will experience an episode of low-back pain (LBP), of whom 90 percent will get better within six to eight weeks without the need for treatment or investigation. The remaining ten percent incur 70-90 percent of the medical costs arising from low-back pain and represent a challenge to health practitioners in providing an accurate diagnosis and successful management. The aim of this chapter is to discover, using a knowledge discovery method, the key inputs which a low-back pain multilayer perceptron (MLP) network uses to classify selected training-case examples, and to show how a rule can then be directly induced from each training example. Preliminary results are presented of the top-ranked key inputs which the LBP MLP uses to classify all training cases for each diagnostic class. It is shown how validation of the top-ranked key inputs by the domain experts can lead to the validation of the LBP MLP network during both training and testing.
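As an illustration only (a simple weight-based saliency ranking, not necessarily the chapter's knowledge discovery method), key inputs for a single case can be ranked from a trained network's weights and turned into a rule; the input names below are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = 6, 4
    W1 = rng.normal(size=(n_hid, n_in))  # input-to-hidden weights (trained; random here)
    W2 = rng.normal(size=(n_hid,))       # hidden-to-output weights

    def rank_inputs(x):
        # Saliency per input: sum over hidden units of |w_ih| * |w_ho|, scaled by |x|.
        saliency = np.abs(W1 * x).T @ np.abs(W2)
        return np.argsort(saliency)[::-1]

    names = ["pain_radiates", "age_over_50", "numbness",
             "night_pain", "lifting_injury", "smoker"]   # hypothetical inputs
    case = np.array([1, 0, 1, 0, 1, 0], dtype=float)
    top = rank_inputs(case)[:2]
    rule = " AND ".join(f"{names[i]}={int(case[i])}" for i in top)
    print(f"IF {rule} THEN class=simple_LBP")            # illustrative induced rule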
Meteorological societies and universities worldwide routinely collect vast amounts of data from satellites and weather stations. We were asked to examine a sample of such data and look for patterns which may exist between certain geographical locations over time. The overall aim of the work is to generate a set of rules which can be used to predict conditions in certain grid squares a number of months in advance; these predictions can be used by meteorologists to make medium- to long-term forecasts.
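The kind of rule sought can be sketched as follows, under an assumed data layout of one monthly series per grid square; the series here are synthetic:

    import numpy as np

    def lagged_corr(a, b, lag):
        # Correlation between series a and series b shifted `lag` months later.
        return np.corrcoef(a[:-lag], b[lag:])[0, 1]

    months = np.arange(48)
    square_a = np.sin(months / 6.0)            # monthly series for grid square A
    square_b = np.sin((months - 3) / 6.0)      # square B follows A by three months

    best = max(range(1, 13), key=lambda lag: lagged_corr(square_a, square_b, lag))
    print(f"RULE: square A predicts square B {best} months ahead "
          f"(r={lagged_corr(square_a, square_b, best):.2f})")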
Geophysical data is the most important material that meteorologists use to model the behaviour of the Earth's atmosphere and oceans. While most research dedicated to explanation and prediction has been based on the application of a specific statistical or artificial-intelligence technique, few endeavours have tackled the holistic nature of the subject. One possible way of approaching this target is the application of knowledge discovery techniques. The authors designed the meteorology and data mining environment (MADAME), which has proved a promising platform for further research in this area. The work was motivated by a project in which the authors were involved, whose objective was to establish the feasibility of forecasting high-intensity rainfall over different areas of the territory of Hong Kong using data mining techniques, with a view to improving the existing landslide warning system.
The main aim of this work was to conduct a feasibility trial to determine whether data mining could provide an enabling technology for the pharmaceutical industry, giving researchers the capability to determine the common key characteristics of compounds that determine their functionality, irrespective of compound size. A secondary aim was to investigate whether the powerful lazy evaluation offered by the functional programming language Gofer could be applied to this task, to which it would appear to be ideally suited.
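The chapter's implementation used Gofer; as a loose analogue of the lazy style, the following Python sketch uses generators so that candidate patterns are produced on demand and enumeration stops at the first characteristic common to all compounds (the feature encoding of compounds as sets of strings is an assumption):

    from itertools import combinations

    def candidate_patterns(features, max_size=3):
        # Lazily yield feature combinations, smallest first.
        for size in range(1, max_size + 1):
            yield from combinations(sorted(features), size)

    def first_common_pattern(compounds):
        # Enumerate patterns on demand; stop at the first one shared by all.
        for pattern in candidate_patterns(compounds[0]):
            if all(set(pattern) <= compound for compound in compounds):
                return pattern

    compounds = [{"ring", "OH", "N"}, {"ring", "OH", "Cl"}, {"ring", "OH"}]
    print(first_common_pattern(compounds))  # ('OH',)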
Data mining is the process of analysing data in order to extract useful information. Many techniques can be used in the analysis of data, one of which is the artificial neural network. We explain what a neural network is and how one was used to analyse electricity consumption data from a utility in the UK.
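A minimal sketch, using synthetic data rather than the utility's, of the kind of model described: a small feed-forward network trained to predict the next period's consumption from the preceding periods:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    hours = np.arange(0, 24 * 14, 0.5)               # two weeks, half-hourly
    load = 50 + 20 * np.sin(2 * np.pi * hours / 24)  # a synthetic daily demand cycle

    # Windows of four past readings predict the next reading.
    X = np.array([load[i:i + 4] for i in range(len(load) - 4)])
    y = load[4:]

    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    model.fit(X[:-48], y[:-48])                      # hold out the final day
    print("held-out R^2:", round(model.score(X[-48:], y[-48:]), 3))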