2.6 Mixed Clustering
Clustering algorithms are designed for either
categorical data or numeric data. However, in the real world, a
majority of datasets are described by a combination of continuous and
categorical features. A general method is to transform one data type to
another. In most cases, nominal attributes are encoded by simple
matching or binarymapping, and then clustering is performed on the
new-computed numeric proximity. Binary encoding transforms each
categorical attribute to a set of binary attributes and then encodes a
categorical value to this set of binary values.
Simple matching
generates distance measurement in such a way that yields a difference of
zero when comparing two identical categorical values, and a difference
of one while comparing two distinct values. However, the coding methods
have the disadvantages of losing information derivable from the ordering
of different values, losing the structure of categorical value with
different levels of similarity, requiring more space and time when the
domain of the categorical attribute is large, ignoring the context of a
pair of values, e.g., the co- occurrence with other attributes, and
giving different weight to the attributes according to the number of
different values they may take. Moreover, if quantitative and binary
attributes are included in the same index, these procedures will
generally give the latter excessive weight.
An alternative approach
is to discrete numeric values and then apply symbolic clustering
algorithms. The discretization process often loses the important
information especially the relative difference of two values for numeric
features. In addition, it causes boundary problem when two close values
near the discretization boundary may be assigned to two different
ranges. Another difficult problem is to estimate the optimal intervals
during discretization.
Huang (1998) extended k-modes to mixed
datasets and developed k-prototype algorithm. The distances of two types
of features are separately calculated. The numerical distances are
measured by Euclidean distances, while the categorical distances are
measured by simple matching. The centers of categorical attributes are
defined as the modes in the cluster. (Ahmad and Dey, 2007) proposed a
fuzzy prototype k-means algorithm. Similar to k-prototype,the cost
function is made up of two components. The difference is that the
categorical distances are measured by the co-occurrence of two
attributes and the categorical cluster centers are the lists of values
in every attribute with their frequencies in the cluster.