n classifying the data according to the model. Some well-known classification algorithms include Bayesian classification (based on Bayes' theorem), decision trees, neural networks and backpropagation (based on neural networks), k-nearest neighbor classifiers (based on learning by analogy), and genetic algorithms (Benoit, 2002; Dunham, 2003). Decision trees are a popular top-down approach to classification that divides the data into leaf and node divisions until the entire set has been analyzed. Neural networks are nonlinear predictive tools that learn from a prepared data set and are then applied to new, larger sets. Genetic algorithms are like neural networks but incorporate natural selection and mutation. Nearest neighbor classifiers use a training set of data to measure the similarity of a group and then use the resulting information to analyze the test data (Benoit, 2002, pp. 279-280).
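As an illustration of the nearest neighbor idea described above, the following is a minimal sketch, not taken from the cited sources. The function name `knn_classify`, the sample data, and the choice of Euclidean distance with majority voting are all assumptions made for the example.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `training` is a list of (feature_tuple, label) pairs (hypothetical format).
    """
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda item: math.dist(item[0], query))
    # Count the labels of the k closest examples and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up training set: two well-separated groups of labeled points.
training = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
    ((4.8, 5.1), "high"),
]

print(knn_classify(training, (1.1, 1.0)))
print(knn_classify(training, (5.1, 5.0)))
```

The training set plays the role of the prepared, labeled data, and each query point stands in for a new record to be classified.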
6.2 Regression
Regression analysis is used to make predictions based on existing data by applying formulas. Using linear or logistic regression techniques from statistics, a function is learned from the existing data. New data is then mapped to the function in order to make predictions (Dunham, 2003, p. 6). Regression trees, which are decision trees with averaged values at the leaves, are a common regression technique (Witten & Frank, 2005, p. 76).
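The two steps described above, learning a function from existing data and then mapping new data to it, can be sketched with ordinary least squares for a single input variable. This is one illustrative form of linear regression, and the names `fit_line` and `predict` and the sample values are assumptions for the example.

```python
def fit_line(xs, ys):
    """Learn slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Map a new x value to the learned function."""
    slope, intercept = model
    return slope * x + intercept


# Made-up existing data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print(model)
print(predict(model, 5.0))
```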
6.3 Clustering
Clustering involves identifying a finite set of categories (clusters) to describe the data. The clusters can be mutually exclusive, hierarchical or overlapping (Fayyad et al., 1996, p. 44). Each member of a cluster should be very similar to other members in its cluster and dissimilar to members of other clusters. Techniques for creating clusters include partitioning (often using the k-means algorithm) and hierarchical methods (which group objects into a tree of clusters), as well as grid, model, and density-based methods (Han & Kamber, 2001, pp. 346-348).
Outlier analysis is a form of cluster analysis that focuses on the items that don't fit neatly into other clusters (Han & Kamber, 2001). Sometimes these objects represent errors in the data, and other times they represent the most interesting pattern of all. Freitas (1999) focuses on outliers in his discussion of attribute surprisingness and suggests that surprisingness should be another criterion for interestingness measures.
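The partitioning approach mentioned above can be illustrated with a minimal k-means sketch. This is not the formulation from the cited texts. It assumes Euclidean distance, a fixed number of iterations, and a naive initialization that uses the first k points as starting centroids, and the function name and data are made up for the example.

```python
import math


def kmeans(points, k, iterations=10):
    """Partition `points` into k mutually exclusive clusters (basic k-means)."""
    # Naive initialization (an assumption): use the first k points as centroids.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


# Made-up data: three points near (1, 1) and three near (9, 9).
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]

centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

A point that ends up far from every centroid would be a candidate for the outlier analysis discussed above.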
6.4 Summarization
Summarization maps data into subsets and then applies a compact description to each subset. Also called characterization or generalization, it derives summary data from the data or extracts actual portions of the data which "succinctly characterize the contents" (Dunham, 2003, p. 8).
Dependency modeling, or association rule mining, involves searching for interesting relationships between items in a data set. Market basket analysis is a good example of this model. An example of an association rule is "customers who buy computers tend to also buy financial software" (Han & Kamber, 2001, pp. 226-117). Since association rules are not always interesting or useful, constraints are applied which specify the type of knowledge to be mined, such as specific dates of interest, thresholds on statistical measures (rule interestingness, support, confidence), or other rules applied by end users (Han & Kamber, 2001, p. 262).
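The support and confidence measures named above can be made concrete with a small sketch. It uses the standard definitions: support is the fraction of transactions containing an itemset, and confidence is the support of the whole rule divided by the support of its antecedent. The transactions are invented for illustration and the function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up market baskets, each a set of purchased items.
transactions = [
    {"computer", "financial software"},
    {"computer", "financial software", "printer"},
    {"computer", "printer"},
    {"printer"},
    {"computer", "financial software"},
]

rule_support = support(transactions, {"computer", "financial software"})
rule_confidence = confidence(transactions, {"computer"}, {"financial software"})
print(rule_support, rule_confidence)
```

In this toy data the rule "computer implies financial software" holds in 3 of 5 baskets and in 3 of the 4 baskets that contain a computer. A minimum threshold on either measure would act as one of the constraints described above.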
6.5 Change and Deviation Detection
Also called sequential analysis and sequence discovery (Dunham, 2003, p. 9), change and deviation detection focuses on discovering the most significant changes in data. This involves establishing normative values and then evaluating new data against the baseline (Fayyad et al., 1996, p. 45). Relationships based on time are discovered in the data.
The above methods form the basis for most data mining activities. Many variations on these basic approaches can be found in the literature, including algorithms specifically modified to apply to spatial data, temporal data mining, multi-dimensional databases, text databases and the Web (Dunham, 2003; Han & Kamber, 2001).
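The baseline-and-compare idea behind change and deviation detection can be sketched as follows. This is only one simple way to do it, assuming the normative values are summarized by a mean and sample standard deviation and that a deviation is any new value more than a chosen number of standard deviations from the mean. The function name, the threshold of 3, and the data are assumptions for the example.

```python
import statistics


def find_deviations(baseline, new_values, threshold=3.0):
    """Return the new values that lie far from the baseline mean."""
    # Establish normative values from the historical (baseline) data.
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    # Flag new observations more than `threshold` standard deviations away.
    return [v for v in new_values if abs(v - mean) > threshold * stdev]


# Made-up readings: a stable baseline, then new data with two large changes.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
new_values = [101, 99, 130, 100, 72]

deviations = find_deviations(baseline, new_values)
print(deviations)
```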
7. Related Disciplines: Information Retrieval and Text Mining
Two disciplines closely related to data mining are information retrieval and text mining. The relationship between information retrieval and data mining techniques has been complementary. Text mining, however, represents a new discipline arising from the combination of information retrieval and data mining.
7.1 Information Retrieval (IR)
Many of the techniques used in data mining come from Information Retrieval (IR), but data mining goes beyond information retrieval. IR is concerned with the process of searching and retrieving i...