National Institute of Standards and Technology

Image Computations on Computer Clusters

Summary

Executing terabyte-sized (TB) image computations on a single computer is at best very time consuming and at worst impossible when the images do not fit into RAM. Using a cluster for such computations is therefore unavoidable. However, the existing frameworks for big data computations (e.g., Hadoop, Spark) are primarily designed for text processing, and the suitability of cluster computing for image processing computations has not been documented. We explore the problem of efficiently executing image processing computations on computer cluster platforms. Figure 1 illustrates the speedup achieved with Hadoop for the Deep Zoom pyramid building computation, one of many image processing computations that can take advantage of distributed computing resources (the horizontal axis shows the number of computational nodes in the cluster); a sketch of such a MapReduce job appears after the figure. Each bar shows the time contributions of the phases of the Hadoop-based execution. The left and right graphs document how the execution time depends on the Hadoop configuration, such as the number of Map tasks per computational node.

Figure 1: Example of the speedup achieved by using Hadoop for the image pyramid computation.
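
For concreteness, the sketch below shows how one coarser level of a Deep Zoom pyramid could be expressed as a Hadoop MapReduce job. This is a minimal illustration under stated assumptions, not the NIST implementation: the SequenceFile layout with "level,x,y" text keys and PNG tile bytes, the 256-pixel tile size, and all class names are assumptions made for this example.

```java
// Hypothetical sketch (not the NIST code): building one coarser Deep Zoom
// pyramid level as a Hadoop MapReduce job. Assumed input: SequenceFiles of
// <Text "level,x,y", BytesWritable PNG tile>; assumed tile size: 256x256.
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class PyramidLevelJob {

  static final int TILE = 256;  // assumed Deep Zoom tile size

  /** Re-keys every tile by its parent tile at the next coarser level,
   *  prefixing the PNG bytes with the child's quadrant index (0..3). */
  public static class TileMapper
      extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(Text key, BytesWritable tile, Context ctx)
        throws IOException, InterruptedException {
      String[] p = key.toString().split(",");          // "level,x,y"
      int level = Integer.parseInt(p[0]);
      int x = Integer.parseInt(p[1]);
      int y = Integer.parseInt(p[2]);
      byte[] tagged = new byte[tile.getLength() + 1];
      tagged[0] = (byte) ((y % 2) * 2 + (x % 2));      // quadrant tag
      System.arraycopy(tile.getBytes(), 0, tagged, 1, tile.getLength());
      Text parent = new Text((level - 1) + "," + (x / 2) + "," + (y / 2));
      ctx.write(parent, new BytesWritable(tagged));
    }
  }

  /** Stitches up to four child tiles into their parent, downsampling 2x. */
  public static class StitchReducer
      extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> children,
        Context ctx) throws IOException, InterruptedException {
      BufferedImage parent =
          new BufferedImage(TILE, TILE, BufferedImage.TYPE_INT_RGB);
      Graphics2D g = parent.createGraphics();
      for (BytesWritable child : children) {
        byte[] b = child.getBytes();
        int quadrant = b[0];
        BufferedImage img = ImageIO.read(
            new ByteArrayInputStream(b, 1, child.getLength() - 1));
        // Each child covers one quadrant of the parent at half resolution.
        g.drawImage(img, (quadrant % 2) * (TILE / 2),
            (quadrant / 2) * (TILE / 2), TILE / 2, TILE / 2, null);
      }
      g.dispose();
      ByteArrayOutputStream png = new ByteArrayOutputStream();
      ImageIO.write(parent, "png", png);
      ctx.write(key, new BytesWritable(png.toByteArray()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pyramid-level");
    job.setJarByClass(PyramidLevelJob.class);
    job.setMapperClass(TileMapper.class);
    job.setReducerClass(StitchReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileInputFormat.addInputPath(job, new Path(args[0]));
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Iterating such a job once per level, from the base level up to the 1x1 tile, would build the full pyramid; the number of concurrent Map tasks per node is the kind of Hadoop configuration knob varied in the graphs above.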

Description

We present a characterization of four basic TB-sized image computations on a Hadoop cluster in terms of their relative efficiency according to a modified Amdahl's law. The work is motivated by the lack of standard benchmarks and stress tests for big image processing operations on Hadoop computer cluster platforms.
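
The exact modification is given in the publication below; for orientation, the classical Amdahl's law is shown first, followed by one plausible modified form (our illustrative notation, not necessarily the paper's) in which an additive overhead term captures cluster costs such as data loading, task scheduling, and shuffling.

```latex
% Classical Amdahl's law: speedup on N nodes when a fraction p of the
% single-node execution time T(1) is perfectly parallelizable.
\[
  S(N) = \frac{T(1)}{T(N)} = \frac{1}{(1 - p) + \frac{p}{N}}
\]
% A plausible modified form (illustrative, not necessarily the paper's):
% an additive overhead o(N) accounts for data loading and shuffling costs.
\[
  S(N) = \frac{T(1)}{(1 - p)\,T(1) + \frac{p}{N}\,T(1) + o(N)}
\]
```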

Our benchmark design and evaluations were performed on one of three microscopy image sets, each consisting of more than one-half TB of data. All image processing benchmarks executed with Hadoop on the NIST Raritan cluster were compared against baseline measurements: the TeraSort/TeraGen benchmarks previously designed for Hadoop testing, image processing executions on a multiprocessor desktop, and executions on the NIST Raritan cluster using Java Remote Method Invocation (RMI) with multiple configurations. By applying our methodology to assess the efficiency of computations across cluster configurations, we can rank configurations and help scientists measure the benefits of running image processing on a Hadoop cluster.
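
As a point of reference, a minimal sketch of the kind of Java RMI baseline described above might look as follows. The interface, names, port, and the identity "processing" step are assumptions made for illustration, not the NIST code; TeraGen/TeraSort, by contrast, ship with the standard Hadoop examples.

```java
// Hypothetical sketch of an RMI baseline: a remote service that processes
// one image tile, invoked from a driver that could fan work out to nodes.
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

/** Remote interface each worker node implements. */
interface TileProcessor extends Remote {
  byte[] process(byte[] tilePng) throws RemoteException;
}

/** Worker side: exports the service in an RMI registry. */
class TileWorker extends UnicastRemoteObject implements TileProcessor {
  TileWorker() throws RemoteException {}

  @Override
  public byte[] process(byte[] tilePng) throws RemoteException {
    return tilePng;  // a real image operation would go here
  }

  public static void main(String[] args) throws Exception {
    Registry registry = LocateRegistry.createRegistry(1099);
    registry.rebind("tiles", new TileWorker());
    System.out.println("tile worker ready");
  }
}

/** Driver side: looks up a worker and times one remote tile call. */
class TileDriver {
  public static void main(String[] args) throws Exception {
    Registry registry = LocateRegistry.getRegistry(args[0], 1099);
    TileProcessor worker = (TileProcessor) registry.lookup("tiles");
    long start = System.nanoTime();
    worker.process(new byte[256 * 256 * 3]);  // dummy tile payload
    System.out.printf("round trip: %.1f ms%n",
        (System.nanoTime() - start) / 1e6);
  }
}
```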

Major Accomplishments

Peter Bajcsy, Antoine Vandecreme, Julien Amelot, Phuong Nguyen, Joe Chalfoun, Mary Brady, "Terabyte-sized Image Computations on Hadoop Cluster Platforms," IEEE International Conference on Big Data, Santa Clara, CA, USA, October 6-9, 2013.

Lead Organizational Unit:

ITL

Staff:

ITL-Software and Systems Division
Information Systems Group


Contact:

Peter Bajcsy
peter.bajcsy@nist.gov
Phone: 301.975.2958

Date created: April 10, 2014