2.4.4.4 Decision Trees (DT)
They are like the trees used in decision analysis, where each non-terminal node represents a test or decision on the data item considered. Depending on the outcome of the test, one chooses a certain branch. To classify a particular data item, one starts at the root node and follows the assertions down until a terminal node (or leaf) is reached; at that point, a decision is made. DT can also be interpreted as a special form of a rule set, characterized by their hierarchical organization of rules. A disadvantage of DT is that trees use up data very rapidly in the training process, so they should never be used with small data sets. They are also highly sensitive to noise in the data, and they try to fit the data exactly, which is referred to as “overfitting”. Overfitting means that the model depends too strongly on the details of the particular dataset used to create it. When a model suffers from overfitting, it is unlikely to be externally valid (i.e., it will not hold up when applied to a new data set) (Peacock et al., 1998).
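To make the overfitting point concrete, the following minimal sketch (the data and parameter choices are illustrative assumptions, not taken from the cited text) fits an unconstrained and a depth-limited decision tree with scikit-learn and compares their training and test accuracy:

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Noisy labels: the sign of the first feature, flipped about 20% of the time.
y = (X[:, 0] > 0).astype(int) ^ (rng.random(200) < 0.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree grows until its leaves fit the training data exactly.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A depth-limited tree is forced to generalize.
shallow = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)

print("deep tree    train/test accuracy:", deep.score(X_tr, y_tr), deep.score(X_te, y_te))
print("shallow tree train/test accuracy:", shallow.score(X_tr, y_tr), shallow.score(X_te, y_te))

The unconstrained tree will typically reproduce its training data almost perfectly while scoring worse on the held-out data, which is precisely the overfitting behaviour described above.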
2.4.4.5 Association Rules (AR)
They are statements about relationships between the attributes of a known group of entities and one or more aspects of those entities, which enable predictions to be made about the same aspects of other entities that are not in the group but possess the same attributes. More generally, AR state a statistical correlation between the occurrences of certain attributes in a data item, or between certain data items in a data set. The general form of an AR is X1, …, Xn => Y [C, S], which means that the attributes X1, …, Xn predict Y with a confidence C and a significance S (Peacock et al., 1998).
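The confidence and support of such a rule can be computed directly by counting. The sketch below is illustrative (the transactions and the rule are assumptions, not from the cited text); note that S is computed here as the usual support measure, whereas the "significance" of Peacock et al. may be defined differently:

# Computing confidence and support for an association rule X => Y
# over a small, made-up transaction set.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

X = {"bread"}   # antecedent attributes X1, ..., Xn
Y = {"milk"}    # consequent attribute

n = len(transactions)
n_x = sum(1 for t in transactions if X <= t)          # items containing X
n_xy = sum(1 for t in transactions if (X | Y) <= t)   # items containing X and Y

support = n_xy / n        # fraction of all items containing both X and Y
confidence = n_xy / n_x   # fraction of X-items that also contain Y

print(f"support S = {support:.2f}, confidence C = {confidence:.2f}")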
2.4.4.6 Rough Set Theory
A rough set is a formal approximation of a crisp set (i.e., a conventional set) in terms of a pair of sets which give the lower and the upper approximation of the original set. In the standard version of rough set theory (Pawlak, 1991), the lower- and upper-approximation sets are crisp sets, but in other variations the approximating sets may be fuzzy sets.
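A minimal sketch of the two approximations, assuming a small universe and an indiscernibility partition chosen purely for illustration:

# Lower and upper approximations of a target set under an
# equivalence partition of the universe (standard rough set theory).
universe = {1, 2, 3, 4, 5, 6}
partition = [{1, 2}, {3, 4}, {5, 6}]   # indiscernibility classes
target = {1, 2, 3}                     # the crisp set to approximate

# Lower approximation: union of classes fully contained in the target.
lower = set().union(*(c for c in partition if c <= target))
# Upper approximation: union of classes that intersect the target.
upper = set().union(*(c for c in partition if c & target))

print("lower:", lower)   # {1, 2}
print("upper:", upper)   # {1, 2, 3, 4}

Elements in the lower approximation certainly belong to the target set, elements outside the upper approximation certainly do not, and the boundary region between the two captures the set's "roughness".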
While these so-called first-generation algorithms are widely used, they have significant limitations. They typically assume that the data contains only numeric and textual symbols and does not contain images, and that the data was carefully collected into a single database with a specific data mining task in mind. Furthermore, these algorithms tend to be fully automatic and therefore fail to allow guidance from knowledgeable users at key stages in the search for data regularities (Jackson, 2002).
2.4.5 Data Mining and Statistics
The disciplines of statistics and data mining both aim to discover structure in data. So much do their aims overlap that some people regard data mining as a subset of statistics. But that is not a realistic assessment, as data mining also makes use of ideas, tools, and methods from other areas, particularly database technology and machine learning, and is not heavily concerned with some areas in which statisticians are interested (Hand, 1999). Statistical procedures do, however, play a major role in data mining, particularly in the processes of developing and assessing models. Most learning algorithms use statistical tests when constructing rules or trees and also for correcting models that are overfitted. Statistical tests are also used to validate machine learning models and to evaluate machine learning algorithms (Jackson, 2002). In this section, some of the commonly used statistical analysis techniques are described briefly.
2.4.5.1 Cluster Analysis
This seeks to organize information about variables so that relatively homogeneous groups, or "clusters," can be formed. The clusters formed with this family of methods should be highly internally homogeneous (members are similar to one another) and highly externally heterogeneous (members are not like members of other clusters) (Jackson, 2002).
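As a minimal illustration (the data are synthetic assumptions, not from the cited text), k-means clustering with scikit-learn groups points so that within-cluster distances are small relative to between-cluster distances:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose groups of two-dimensional points.
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(20, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster labels:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)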
2.4.5.2 Correlation Analysis
This measures the relationship between two variables. The resulting correlation coefficient shows whether changes in one variable will result in changes in the other. When comparing the correlation between two variables, the goal is to see if a change in the independent variable will result in a change in the dependent variable. This information helps in understanding an independent variable's predictive abilities. Correlation findings, just as regression findings, can be useful in analysing causal relationships, but they do not by themselves establish causal patterns.

2.4.5.3 Discriminant Analysis

Discriminant analysis is used to predict membership in two or more mutually exclusive groups from a set of predictors, when there is no natural ordering on the groups. It can be seen as the inverse of a one-way multivariate analysis of variance (MANOVA), in that the levels of the independent variable (or factor) for MANOVA become the categories of the dependent variable for discriminant analysis, and the dependent variables of the MANOVA become the predictors for discriminant analysis (Jackson, 2002).
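A minimal sketch of both techniques (the data and variable names are illustrative assumptions): a Pearson correlation coefficient computed with NumPy, followed by a linear discriminant classifier fitted with scikit-learn to predict membership in two unordered groups:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Correlation: how strongly do changes in x track changes in y?
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")

# Discriminant analysis: two unordered groups, two predictors.
group_a = rng.normal(loc=0.0, size=(25, 2))
group_b = rng.normal(loc=2.0, size=(25, 2))
X = np.vstack([group_a, group_b])
labels = np.array([0] * 25 + [1] * 25)

lda = LinearDiscriminantAnalysis().fit(X, labels)
print("predicted groups:", lda.predict(X[:5]))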
2.4.5.4 Factor Analysis
This is useful for understanding the underlying reasons for the correlations among a group of variables. The main applications of factor analytic techniques are to reduce the number of variables and to detect structure in the relationships among variables, that is, to classify variables. Therefore, factor analysis can be applied as a data reduction or structure detection method. In an exploratory factor analysis, the goal is to explore or search for a factor structure. Confirmatory factor analysis, on the other hand, assumes the factor structure is known a priori, and the objective is to empirically verify or confirm that the assumed factor structure is correct (Jackson, 2002).
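A minimal exploratory sketch (the synthetic data and component count are assumptions, not from the cited text): six observed variables driven by two latent factors are fitted with scikit-learn's FactorAnalysis to recover the loading structure:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
# Two latent factors drive six observed variables, plus noise.
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, 6))
observed = latent @ loadings + 0.3 * rng.normal(size=(n, 6))

fa = FactorAnalysis(n_components=2).fit(observed)
print("estimated loadings:\n", fa.components_)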