ry process or be generalized to refer to the larger process of knowledge discovery.
5. Steps in Knowledge Discovery
5.1 Step 1: Task Discovery
The goals of the data mining operation must be well understood before the process begins: the analyst must know what problem is to be solved and which questions need answers. Typically, a subject specialist works with the data analyst to refine the problem to be solved as part of the task discovery step (Benoit, 2002).
5.2 Step 2: Data Discovery
In this stage, the analyst and the end user determine what data they need to analyze in order to answer their questions, and then they explore the available data to see if what they need is available (Benoit, 2002).
5.3 Step 3: Data Selection and Cleaning
Once data has been selected, it will need to be cleaned up: missing values must be handled in a consistent way, such as eliminating incomplete records, manually filling them in, entering a constant for each missing value, or estimating a value. Other data records may be complete but wrong (noisy). These noisy elements must also be handled in a consistent way (Benoit, 2002; Fayyad, et al., 1996).
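The missing-value strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names, and the choice of the attribute mean as the "estimated" value are hypothetical and do not come from the cited sources.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]


def drop_incomplete(rows):
    """Eliminate every record that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def fill_constant(rows, constant):
    """Enter the same constant for each missing value."""
    return [{k: (constant if v is None else v) for k, v in r.items()} for r in rows]


def fill_estimate(rows):
    """Estimate each missing value as the mean of the observed values of
    that attribute (one possible estimator, chosen here as an assumption)."""
    keys = rows[0].keys()
    means = {k: mean(r[k] for r in rows if r[k] is not None) for k in keys}
    return [{k: (means[k] if v is None else v) for k, v in r.items()} for r in rows]
```

Whichever strategy is chosen, applying the same function to every record is what keeps the handling consistent.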
5.4 Step 4: Data Transformation
Next, the data will be transformed into a form appropriate for mining. Per Weiss, Indurkhya, Zhang & Damerau (2005), "data mining methods expect a highly structured format for data, necessitating extensive data preparation. Either we have to transform the original data, or the data are supplied in a highly structured format" (p. 1). The process of data transformation might include smoothing (e.g., using bin means to replace data errors), aggregation (e.g., viewing monthly data rather than daily), generalization (e.g., defining people as young, middle-aged, or old instead of by their exact age), normalization (scaling the data inside a fixed range), and attribute construction (adding new attributes to the data set) (Han & Kamber, 2001, p. 114).
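As a rough sketch of four of these transformations, the Python functions below show smoothing by bin means, aggregation of daily figures into monthly totals, generalization of exact ages, and min-max normalization. The function names, the equal-size binning, the date-string format, and the age cut-offs (30 and 60) are assumptions made for illustration, not definitions taken from the cited sources.

```python
def smooth_by_bin_means(values, bin_size):
    """Smoothing: sort the values, split them into bins of bin_size,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


def aggregate_monthly(daily_totals):
    """Aggregation: sum daily figures keyed by 'YYYY-MM-DD' into monthly totals."""
    monthly = {}
    for day, amount in daily_totals.items():
        month = day[:7]  # 'YYYY-MM'
        monthly[month] = monthly.get(month, 0) + amount
    return monthly


def generalize_age(age, young_below=30, old_from=60):
    """Generalization: replace an exact age with a coarse label.
    The cut-offs are hypothetical defaults."""
    if age < young_below:
        return "young"
    if age >= old_from:
        return "old"
    return "middle-aged"


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalization: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]
```

For example, `smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)` replaces the first three values with 9.0, the next three with 22.0, and the last three with 29.0.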
5.5 Step 5: Data Reduction
The data will probably need to be reduced in order to make the analysis process manageable and cost-efficient. Data reduction techniques include data cube aggregation, dimension reduction (irrelevant or redundant attributes are removed), data compression (data is encoded to reduce the size), numerosity reduction (models or samples are used instead of the actual data), and discretization and concept hierarchy generation (attributes are replaced by some kind of higher level construct) (Han & Kamber, 2001, pp. 116-117).
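Two of these techniques are easy to illustrate in Python. The sketch below is deliberately simplified and rests on assumptions: "dimension reduction" is reduced to dropping attributes whose value never varies (a stand-in for real relevance analysis), and "numerosity reduction" is shown as a simple random sample without replacement. The function names and the sample data are hypothetical.

```python
import random


def drop_constant_attributes(rows):
    """Dimension reduction (simplified): remove attributes that take the
    same value in every record, since they cannot distinguish records."""
    keys = list(rows[0].keys())
    keep = [k for k in keys if len({r[k] for r in rows}) > 1]
    return [{k: r[k] for k in keep} for r in rows]


def sample_rows(rows, n, seed=0):
    """Numerosity reduction: work on a simple random sample of at most n
    records instead of the full data set. The seed makes runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


# Hypothetical data: 'country' is constant and therefore carries no information.
rows = [
    {"id": i, "country": "US", "spend": i * 10}
    for i in range(100)
]

reduced = sample_rows(drop_constant_attributes(rows), n=10)
```

After both steps, `reduced` holds 10 records with only the `id` and `spend` attributes.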
5.6 Step 6: Discovering Patterns (aka Data Mining)
In this stage, the data is iteratively run through the data mining algorithms (see Data Mining Methods below) in an effort to find interesting and useful patterns or relationships. Often, classification and clustering algorithms are used first so that association rules can be applied (Benoit, 2002, p. 278). Some rules yield patterns that are more interesting than others. This "interestingness" is one of the measures used to determine the effectiveness of the particular algorithm (Fayyad, et al., 1996; Freitas, 1999; Han & Kamber, 2001). Fayyad, et al. (1996) state that interestingness is "usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity" (p. 41). A pattern can be considered knowledge if it exceeds an interestingness threshold. That threshold is defined by the user, is domain specific, and "is determined by whatever functions and thresholds the user chooses" (p. 41).
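The thresholding idea can be sketched as follows. Note the assumptions: the sources say only that interestingness combines validity, novelty, usefulness, and simplicity and that the user chooses the functions and thresholds. The weighted sum, the 0-to-1 scoring scale, the equal default weights, and all names below are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    description: str
    validity: float    # each component scored in [0, 1] (assumed scale)
    novelty: float
    usefulness: float
    simplicity: float


def interestingness(pattern, weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four components into one score.
    A weighted sum is only one possible user-chosen function."""
    w_valid, w_novel, w_useful, w_simple = weights
    return (
        w_valid * pattern.validity
        + w_novel * pattern.novelty
        + w_useful * pattern.usefulness
        + w_simple * pattern.simplicity
    )


def select_knowledge(patterns, threshold, weights=(0.25, 0.25, 0.25, 0.25)):
    """Keep only the patterns whose score exceeds the user-defined threshold."""
    return [p for p in patterns if interestingness(p, weights) > threshold]


# Hypothetical mined patterns.
patterns = [
    Pattern("buyers of A also buy B", 0.9, 0.7, 0.8, 0.8),
    Pattern("customers have a postal code", 0.9, 0.0, 0.1, 0.6),
]

knowledge = select_knowledge(patterns, threshold=0.5)
```

Because both the scoring function and the threshold are parameters, a different domain can plug in its own weights or cut-off without changing the selection logic.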
5.7 Step 7: Result Interpretation and Visualization
It is important that the output from the data mining step can be "readily absorbed and accepted by the people who will use the results" (Benoit, 2002, p. 272). Tools from computer graphics and graphic design are used to present and visualize the mined output.
5.8 Step 8: Putting the Knowledge to Use
Finally, the end user must make use of the output. In addition to solving the original problem, the new knowledge can also be incorporated into new models, and the entire knowledge discovery or data mining cycle can begin again.
6. Data Mining Methods
Common data mining methods include classification, regression, clustering, summarization, dependency modeling, and change and deviation detection (Fayyad, et al., 1996, pp. 44-45).
6.1 Classification
Classification is composed of two steps: supervised learning of a training set of data to create a model, and the...