Class Separation Metric

Summary

We have theoretically designed a new class separation measurement.

It is versatile: it can be applied to datasets with any number of classes and any number of dimensions.
It is particularly suitable for overlapping classes of data.
It is computationally advantageous for low dimensionality (typically 2D, 3D or 4D).
It is correlated with the accuracy of density-based classifiers.
It can be applied to unknown multivariate density distributions, as it makes minimal assumptions about the probability distributions of the data (e.g. features do not have to follow a normal distribution).

Description

Interpretation

The class separation is a distance. It evaluates the separation of K classes in a D-dimensional space and returns a value in the interval [0, 1]. A value of 1 indicates that the classes are fully separated (i.e. the K classes do not overlap at all). A value close to 0 indicates low separation of classes (i.e. high overlap).

Typical machine learning applications include classification and cluster validation. In those cases, maximum separation is preferred, as it leads to higher classification accuracy and better cluster definition, respectively. Figure 1 shows the Class Separation Metric (CSM) concept in 1D. Three probability density functions (PDFs) partially overlap. The portions of the PDFs that do not overlap define the CSM value.

Figure 1. Visualization of the CSM in the simple 1D case. (Left) 1D probability density functions of three color-coded overlapping classes. (Right) The integral contributions to a class separation distance measurement.
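The idea that the non-overlapping portions of the class PDFs drive the separation value can be illustrated with a short Python sketch. It does not reproduce the CSM formula. It uses a stand-in score chosen only for this illustration: the integral of max_k p_k(x), minus 1, divided by K - 1, estimated from histograms on a shared grid. The function name overlap_free_score_1d, the bins parameter and the score itself are assumptions, not part of the published method.

```python
import numpy as np


def overlap_free_score_1d(samples_per_class, bins=50):
    """Illustrative separation score in [0, 1]; NOT the published CSM formula."""
    k = len(samples_per_class)
    if k < 2:
        raise ValueError("need at least two classes")
    all_values = np.concatenate(samples_per_class)
    # Shared grid so that every class density is estimated on the same cells.
    edges = np.linspace(all_values.min(), all_values.max(), bins + 1)
    widths = np.diff(edges)
    # One density histogram per class; each integrates to 1 over the grid.
    densities = np.array(
        [np.histogram(s, bins=edges, density=True)[0] for s in samples_per_class]
    )
    # Integral of the upper envelope max_k p_k(x):
    # equals 1 when all class histograms coincide, K when no bin is shared.
    envelope = float(np.sum(densities.max(axis=0) * widths))
    return (envelope - 1.0) / (k - 1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three partially overlapping 1D classes, in the spirit of Figure 1.
    classes = [rng.normal(mu, 1.0, 5000) for mu in (0.0, 1.5, 3.0)]
    print(overlap_free_score_1d(classes))
```

Under these assumptions the score is 0 when the class histograms are identical and 1 when no bin contains samples from more than one class, which mirrors the [0, 1] interpretation given above.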

Automated feature selection and dimensionality reduction for subcellular segmentation

We used the class separation metric to automate feature selection by searching for the subset of features that maximizes class separation. It was used to segment epifluorescence microscopy images of cells at a sub-cellular level into 4 meaningful regions (see Figure 2). Out of the initial 15 features, 2 were identified as the best subset; they are shown in Figure 3 after normalization. Each color corresponds to one of the 4 classes (cell regions).

Figure 2. The left image is an example of the cell images to partition, the center image is the result of the partitioning into 4 regions, and the right image overlays the 4 partitions on the raw image.

Figure 3. Feature selection using the Class Separation Metric (CSM). Each color corresponds to one of the 4 classes (cell regions).
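The search for a feature subset that maximizes class separation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the source does not say how the search was carried out, so the exhaustive search over all subsets of a fixed size is an assumption, and separation_score is a histogram-based stand-in rather than the CSM formula. All function and parameter names are hypothetical.

```python
from itertools import combinations

import numpy as np


def separation_score(X, y, bins=8):
    """Stand-in separation score in [0, 1]; NOT the published CSM formula."""
    classes = np.unique(y)
    # Shared grid over the bounding box of all points, one edge array per feature.
    edges = [np.linspace(col.min(), col.max(), bins + 1) for col in X.T]
    masses = []
    for c in classes:
        counts, _ = np.histogramdd(X[y == c], bins=edges)
        masses.append(counts / counts.sum())  # probability mass per cell
    # Sum over cells of the largest class mass: 1 if identical, K if disjoint.
    envelope = np.max(masses, axis=0).sum()
    return float((envelope - 1.0) / (len(classes) - 1))


def best_feature_subset(X, y, subset_size=2, bins=8):
    """Exhaustive search (an assumption) for the highest-scoring feature subset."""
    best_subset, best_score = None, -np.inf
    for subset in combinations(range(X.shape[1]), subset_size):
        score = separation_score(X[:, list(subset)], y, bins=bins)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.repeat(np.arange(4), 500)   # 4 synthetic classes
    X = rng.normal(size=(y.size, 5))   # 5 synthetic features, mostly noise
    X[:, 1] += 10.0 * (y % 2)          # feature 1 splits the classes in two
    X[:, 3] += 10.0 * (y // 2)         # feature 3 splits them the other way
    print(best_feature_subset(X, y))
```

With 15 features and subsets of size 2 there are only 105 candidate pairs, so an exhaustive search of this kind stays cheap; larger subset sizes would call for a greedy or otherwise pruned search.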

How is the CSM computed?

In the more general case with multiple classes and multiple dimensions, here is our definition of the CSM:

[Image: the CSM formula]

where

K is the number of classes, and k is the index of a class
M is the number of hyper-space cells enclosing N points in D dimensions, and i is the index of a hyper-space cell
V_i is the volume of the i-th cell in the hyper-space partition
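The quantities M and V_i can be made concrete with a small Python sketch. The source does not specify how the hyper-space partition is built, so the uniform grid used here, the choice to count only occupied cells, and the name occupied_cells are all assumptions made for illustration.

```python
import numpy as np


def occupied_cells(X, bins=4):
    """Uniform-grid partition (an assumption) of the bounding box of X.

    X has shape (N, D). Returns the point count and the volume of every
    grid cell that contains at least one point.
    """
    edges = [np.linspace(col.min(), col.max(), bins + 1) for col in X.T]
    counts, _ = np.histogramdd(X, bins=edges)
    # Cell volume = product of the bin widths along each of the D axes.
    widths = [np.diff(e) for e in edges]
    volumes = widths[0]
    for w in widths[1:]:
        volumes = np.multiply.outer(volumes, w)
    occupied = counts > 0
    return counts[occupied], volumes[occupied]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))      # N = 1000 points, D = 3
    counts, volumes = occupied_cells(X)
    print(len(counts), counts.sum())    # number of occupied cells, N
```

On a uniform grid every V_i is identical; a partition that adapts to the data would produce cells of differing volume, which is why the volume is tracked per cell here.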

Lead Organizational Unit:

ITL

Staff:

ITL-Software and Systems Division

Information Systems Group

Publications:

Julien Amelot, Peter Bajcsy and Mary Brady, “Class Separation Measurements For Multi-dimensional Points from Highly Overlapping Classes”, Journal of Machine Learning Research, Under review, March 2014

Contact:

Peter Bajcsy
peter.bajcsy@nist.gov
Phone: 301.975.2958