information that exists in text-based collections (Dunham, 2003, p. 26). Data mining, on the other hand, is not concerned with retrieving data that exists in the repository. Instead, data mining is concerned with finding patterns that tell us something new, something that isn't explicitly in the data (Han & Kamber, 2001).

IR techniques are applied to text-based collections (Baeza-Yates & Ribeiro-Neto, 1999). Data mining techniques can be applied to text documents as well as databases (KDD), Web-based content and metadata, and complex data such as GIS data and temporal data.

In terms of evaluating effectiveness, IR and data mining systems differ markedly. Per Dunham (2003, p. 26), the effectiveness of an IR system is based on precision and recall, which can be represented by the following formulas:

$$\text{Precision} = \frac{|\text{Relevant and Retrieved}|}{|\text{Retrieved}|}$$

$$\text{Recall} = \frac{|\text{Relevant and Retrieved}|}{|\text{Relevant}|}$$

The effectiveness of any knowledge discovery system depends on whether any useful or interesting information (knowledge) has been discovered. Usefulness and interestingness measures are much more subjective than IR measures (precision and recall).
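As a minimal sketch of the precision and recall measures discussed in this section, the snippet below computes both from two sets of document identifiers. The function name `precision_recall` and the document ids are hypothetical and chosen only for illustration; the convention of returning 0.0 for an empty set is likewise an assumption, not something the cited sources prescribe.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    # Documents that are both relevant and retrieved.
    hits = relevant & retrieved
    # Assumed convention: an empty denominator yields 0.0 instead of an error.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 relevant documents, 5 retrieved, 3 in common.
relevant_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d4", "d7", "d9"}
p, r = precision_recall(relevant_docs, retrieved_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

No comparably simple function exists for usefulness or interestingness, which is one way to see why those measures are the more subjective ones.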
7.2 IR Contributions to Data Mining
Many of the techniques developed in IR have been incorporated into data mining methods, including Vector Space Models, Term Discrimination Values, Inverse Document Frequency, Term Frequency-Inverse Document Frequency, and Latent Semantic Indexing.

Vector Space Models, or vector space information retrieval systems, represent documents as vectors in a vector space (Howland & Park, 2003, p. 3; Kobayashi & Aono, 2003, p. 105). Term Discrimination Value posits that a good discriminating term is one that, when added to the vector space, increases the distances between documents (vectors). Terms that appear in 1% to 10% of documents tend to be good discriminators (Senellart & Blondel, 2003, p. 28). Inverse Document Frequency (IDF) is used to measure similarity. IDF is used in data mining methods including clustering and classification (Dunham, 2003, pp. 26-27). Term Frequency-Inverse Document Frequency (TF-IDF) is an IR weighting scheme based on the idea that terms that appear often in a document but do not appear in many documents are more important and should be weighted accordingly (Senellart & Blondel, 2003, p. 28). Latent Semantic Indexing (LSI) is a dimensionality reduction process based on Singular Value Decomposition (SVD). It can be used to reduce noise in the database and help overcome synonymy and polysemy problems (Kobayashi & Aono, 2003, p. 107).
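The vector space and TF-IDF ideas above can be illustrated with a short sketch. It assumes one common TF-IDF variant (term frequency normalized by document length, IDF as the natural log of the number of documents divided by the document frequency); the cited authors may use a different weighting. The function names `tf_idf_vectors` and `cosine` and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} TF-IDF vector."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Assumed variant: (count / doc length) * ln(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "data mining finds patterns in data".split(),
    "information retrieval finds documents".split(),
    "text mining converts text to data".split(),
]
vectors = tf_idf_vectors(docs)
print(cosine(vectors[0], vectors[2]))  # shares "data" and "mining", so > 0
print(cosine(vectors[1], vectors[2]))  # no shared terms, so 0.0
```

Under this variant, a term that occurs in every document receives a weight of zero, which matches the intuition that such a term does not help discriminate between documents.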
7.3 Data Mining Contributions to IR
Although IR cannot utilize all the tools developed for data mining because IR is generally limited to unstructured documents, it has nonetheless benefited from advances in data mining. Han and Kamber (2001) describe Document Classification Analysis, which involves developing models that are then applied to other documents to classify them automatically. The process includes creating keywords and terms using standard information retrieval techniques such as TF-IDF and then applying association techniques from data mining disciplines to build concept hierarchies and classes of documents, which can be used to automatically classify subsequent documents (p. 434).

The data mining idea of creating a model instead of directly searching the original data can be applied to IR. Kobayashi and Aono (2003) describe using Principal Component Analysis (PCA) and Covariance Matrix Analysis (COV) to map an IR problem to a "subspace spanned by a subset of the principal components" (p. 108).
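To make the idea of mapping documents onto a subspace of principal components more concrete, here is a rough sketch. It is not the algorithm of Kobayashi and Aono (2003): it builds a sample covariance matrix from a small document-term matrix and approximates only the leading principal component by power iteration, a simple technique substituted here for a full eigendecomposition. All function names and the toy matrix are hypothetical, and the sketch assumes at least two documents.

```python
import math


def covariance_matrix(rows):
    """Return (sample covariance matrix, column means) for a list of rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, means


def top_component(cov, iterations=200):
    """Approximate the leading eigenvector of `cov` by power iteration.

    Caveat: the uniform starting vector fails if it happens to be
    orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def project(row, means, component):
    """Coordinate of one document along the given principal component."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Hypothetical document-term counts: 4 documents (rows) by 3 terms (columns).
matrix = [[1, 1, 0], [2, 2, 0], [3, 3, 0], [4, 4, 0]]
cov, means = covariance_matrix(matrix)
pc1 = top_component(cov)
coords = [project(row, means, pc1) for row in matrix]
print(pc1)     # direction of greatest variance in term space
print(coords)  # each document reduced to a single coordinate
```

Queries could then be compared against the one-dimensional `coords` instead of the original term vectors, which is the sense in which a model stands in for the raw data.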
8. Text Mining
Text mining (TM) is related to information retrieval insofar as it is limited to text. Yet it is related to data mining in that it goes beyond search and retrieval. Witten and Frank (2005) explain that the information to be extracted in text mining is not hidden; however, it is unknown because in its text form it is not amenable to automatic processing. Some of the methods used in text mining are essentially the same methods used in data mining. However, one of the first steps in text mining is to convert text documents to numerical representations, which then allows for the use of standard data mining methods (Weiss, Indurkhya, Zhang & Damerau, 2005).

According to Weiss et al. (2005), "one of the main themes supporting text mining is the transformation of text into numerical data, so although the initial presentation is different, at some intermediate stage, the data move into a classical data-mining encoding. The unstructured data becomes structured" (pp. 3-4).

Weiss et al. (2005) use the spreadsheet analogy as the classical data mining model for structured data. Each cell contains a numerical value that is one of two types: ordered numerical or categorical. Income and cost are examples of ordered numerical attributes. Categorical attributes are c...