National Institute of Standards and Technology

Feature Selection

Summary

Image feature selection is one part of machine learning from images. The search for solutions to image feature selection is typically driven by the accuracy of feature-based prediction or segmentation, and by the ability to gain insights from the selected features and the corresponding prediction models. We explore a methodology for selecting transformed features to predict image regions in subcellular actin images.

Relationships among cell structures during cell spreading on a stiff extracellular matrix (ECM) are complex, so gaining insights into them is non-trivial. Our work is motivated by seeking insights into cell spreading by relating manually established subcellular actin regions to the corresponding extracted image features in the most parsimonious linear prediction model.

Figure 1 illustrates how a subcellular actin image is manually segmented into four biologically meaningful regions and how image features are extracted per region. The goal is to select image features that could be used for automated segmentation with the highest segmentation accuracy.


Figure 1. Subcellular microscope image of fluorescent actin (left) that is manually segmented into four regions. Automation of the segmentation step starts with a question about image features (construction of new features or reuse of existing features, number of features, correlation of features, weighting of features). Feature selection involves finding solutions to this question.

Description

This work addresses the problem of building a linear model with power-transformed image features. Our goal is to discover relationships between cell structures related to cellular contractility (f-actin assembly) and a subset of intensity, texture and geometrical image features of fluorescently labelled cell structures. The 15 features extracted from the images represent these three classes of image features (intensity, texture, geometry). The results of our work are used to gain insights into which transformed image features can predict cell regions by their linear combination, and to automate the segmentation of cell images into semantically (biologically) meaningful regions (cell structures).
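To make the form of such a model concrete, the following minimal sketch evaluates a linear combination of power-transformed features for a single image location. It assumes a Tukey-style transform (x raised to the power lambda, with the natural log standing in for lambda = 0); this summary does not state the exact transform family, and the feature names, lambda values, weights and measurements below are all hypothetical.

```python
import math


def power_transform(x, lam):
    """Tukey-style power transform of a positive value (assumed form).

    lam == 0 is mapped to the natural log, a common convention that
    this summary does not confirm.
    """
    if x <= 0:
        raise ValueError("this sketch assumes strictly positive feature values")
    return math.log(x) if lam == 0 else x ** lam


def predict(features, lambdas, weights, intercept=0.0):
    """Linear combination of power-transformed features.

    All three dicts are keyed by feature name; only the features listed
    in `weights` take part in the model.
    """
    return intercept + sum(
        w * power_transform(features[name], lambdas[name])
        for name, w in weights.items()
    )


# Hypothetical measurements, exponents and weights (made up for illustration).
sample = {"intensity_mean": 4.0, "intensity_mode": 9.0, "distance": 2.0}
lambdas = {"intensity_mean": 0.5, "intensity_mode": 0.5, "distance": -1.0}
weights = {"intensity_mean": 1.0, "intensity_mode": 1.0, "distance": 2.0}

score = predict(sample, lambdas, weights)
print(score)
```

In a real setting the weights and intercept would be fitted to the manually labelled regions, and the resulting score would be mapped to one of the four region labels; neither step is shown here.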

The challenge of building a linear model with power-transformed image features lies in the size of the search space of all possible pairs consisting of subsets of features and their power transformations. The size of this search space is prohibitive for today’s computational resources. Furthermore, the search space configurations have to be evaluated on hundreds of thousands of training data points. The computational requirements increase even more when the robustness of the obtained model to multiple methods for selecting subsets of features and to random data subsets is explored.
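A rough count shows why an exhaustive search is impractical. The sketch below assumes that every selected feature is independently paired with one of m candidate power transformations, which is one plausible reading of the search space and not a statement of the authors' exact formulation. Under that assumption, a pool of n features yields the sum over k of C(n, k) * m^k configurations, which equals (1 + m)^n - 1 by the binomial theorem. With the 15 features and 13 transformations reported in this summary, that is on the order of 10^17 configurations.

```python
from math import comb


def search_space_size(n_features, n_lambdas):
    """Count non-empty feature subsets with one lambda chosen per selected feature."""
    return sum(
        comb(n_features, k) * n_lambdas ** k  # pick k features, then k lambdas
        for k in range(1, n_features + 1)
    )


size = search_space_size(15, 13)
print(f"{size:.3e} configurations")
```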

Major Accomplishments

In this work, we leveraged 180 manually segmented cells and performed ranking computations on around 6.5 million data records (30 datasets x 13 transformations x 166,500 data points with 15 features) per method. The classification computations per method were executed on about 7.5 million data points (30 datasets x 166,500 data points x 15 evaluations of the sorted pairs of features and lambdas). These numbers illustrate the extent of the data sizes and computations behind the results presented for this particular experimental design (cell conditions).
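The ranking step can be illustrated with a small sketch that scores each (feature, lambda) pair and sorts the features by their best score. The criterion used here, the variance of the class means divided by the mean within-class variance, is an assumed stand-in; this summary does not name the ranking methods that were actually used. The transform convention, the candidate lambdas and the toy data are likewise hypothetical.

```python
import math
from statistics import fmean, pvariance


def power_transform(x, lam):
    """Assumed Tukey-style transform: log for lam == 0, otherwise x ** lam."""
    return math.log(x) if lam == 0 else x ** lam


def separation_score(values, labels):
    """Variance of the class means divided by the mean within-class variance."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    class_means = [fmean(g) for g in groups.values()]
    within = fmean([pvariance(g) for g in groups.values()])
    between = pvariance(class_means)
    return between / within if within > 0 else math.inf


def rank_pairs(columns, labels, lambdas):
    """Keep the best-scoring lambda per feature, then sort features by that score."""
    best = []
    for name, col in columns.items():
        score, lam = max(
            (separation_score([power_transform(x, l) for x in col], labels), l)
            for l in lambdas
        )
        best.append((score, name, lam))
    return sorted(best, reverse=True)


# Toy data: two region labels, one informative feature and one uninformative one.
labels = [1, 1, 2, 2]
columns = {
    "distance": [1.0, 2.0, 4.0, 5.0],
    "noise": [2.0, 3.0, 2.0, 3.0],
}
ranking = rank_pairs(columns, labels, lambdas=[-1.0, 0.0, 1.0])
for score, name, lam in ranking:
    print(f"{name}: lambda={lam}, score={score:.3f}")
```

A sorted list of this kind would then feed the classification step, in which models built from the top-ranked pairs are evaluated in turn; that evaluation is omitted from the sketch.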

The most consistent conclusion is the establishment of a classification relationship between the four subcellular regions and three features: the fluorescent intensity mean, the intensity mode, and the geometrical distance between the cell border and its center. The inclusion of intensity mean and mode suggests that the spatially varying presence of actin in subcellular regions is well represented by these two intensity features. The biological meaning of selecting the distance feature lies in confirming the radial spatial distribution of actin (and of the region labels). It was also observed that the distance feature is inversely related to predicting the subcellular region label. This is a consequence of the feature ranking methods maximizing the separation of the four region labels given the distance feature values. The biological insight lies in understanding that the four concentric regions with unequal widths can be transformed, by placing hexagons according to the inverse value of the distance feature, into four better-separated regions with respect to a feature ranking criterion.

Date created: April 10, 2014