Our objective is to lower the barrier to executing spatial image computations in a computer cluster/cloud environment. We investigate two related problems encountered when executing spatial computations over terabyte-sized (TB) images using Apache Hadoop on distributed computing resources: (1) detecting whether an image processing library operation is a spatial computation, and (2) partitioning TB-sized images for efficient distributed execution.
The first problem is solved by designing an iterative estimation methodology. The second problem is formulated as an optimization over three partitioning schemas (physical, logical without overlap, and logical with overlap), evaluated over several system configuration parameters. Our experimental results demonstrate 100% accuracy in detecting spatial computations in the Java Advanced Imaging and ImageJ libraries, a 5.36x speed-up of the developed logical image partitioning with overlap over the default Hadoop physical partitioning, and 3.14x faster execution of logical partitioning with overlap than without overlap. The novelty of our work lies in designing an extension to Apache Hadoop that runs a class of spatial image processing operations efficiently on distributed computing resources.
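To illustrate the partitioning schemas, the sketch below shows one plausible way to compute logical tile extents with overlap: each tile is extended by a halo of pixels on every side (clipped to the image bounds) so that a spatial operation applied to a tile has access to the neighboring pixels it needs. The class and method names are illustrative, not taken from the paper's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of logical image partitioning with overlap:
// the image is divided into a grid of tiles, and each tile is grown
// by 'overlap' pixels on all sides, clipped to the image boundary.
public class LogicalPartitioner {

    /** A tile's pixel extent within the full image. */
    public static final class Tile {
        public final int x, y, width, height;
        Tile(int x, int y, int width, int height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /**
     * Partition an imageW x imageH image into logical tiles of size
     * tileSize x tileSize, extending each tile by 'overlap' pixels on
     * every side (clipped to the image bounds). overlap = 0 gives the
     * logical-without-overlap schema.
     */
    public static List<Tile> partition(int imageW, int imageH,
                                       int tileSize, int overlap) {
        List<Tile> tiles = new ArrayList<>();
        for (int y = 0; y < imageH; y += tileSize) {
            for (int x = 0; x < imageW; x += tileSize) {
                int x0 = Math.max(0, x - overlap);
                int y0 = Math.max(0, y - overlap);
                int x1 = Math.min(imageW, x + tileSize + overlap);
                int y1 = Math.min(imageH, y + tileSize + overlap);
                tiles.add(new Tile(x0, y0, x1 - x0, y1 - y0));
            }
        }
        return tiles;
    }

    public static void main(String[] args) {
        // e.g. a 1000x1000 image, 512-pixel tiles, 16-pixel overlap
        for (Tile t : partition(1000, 1000, 512, 16)) {
            System.out.println(t.x + "," + t.y + " " + t.width + "x" + t.height);
        }
    }
}
```

In a Hadoop setting, each such tile would become one map task's input; the overlap ensures a spatial operation (e.g. a convolution) computes correct values at tile borders without cross-task communication.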
The paper: Peter Bajcsy, Phuong Nguyen, Antoine Vandecreme, and Mary Brady, "Spatial Computations over Terabyte-Sized Images on Hadoop Platforms," 2014 IEEE International Conference on Big Data (IEEE BigData 2014), October 27-30, 2014, Washington DC, USA (submitted).