Problem Definition
Input: A collection of New York City Yellow Cab taxi trip records spanning January 2009 to June 2015. The source data may be clipped to an envelope encompassing the five New York City boroughs in order to remove some of the noisy, erroneous records (e.g., latitude 40.5N to 40.9N, longitude 73.7W to 74.25W).
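For illustration, a minimal Scala sketch of such an envelope filter; the Trip record and its field names are assumptions, not part of the specification. Note that western longitudes are negative in decimal degrees.

```scala
// Hypothetical trip record; the real field names depend on the input schema.
case class Trip(pickupLat: Double, pickupLon: Double, pickupTime: Long)

// Envelope roughly covering the five boroughs, per the bounds above.
// 74.25W and 73.7W correspond to -74.25 and -73.7 in decimal degrees.
def insideEnvelope(t: Trip): Boolean =
  t.pickupLat >= 40.5 && t.pickupLat <= 40.9 &&
  t.pickupLon >= -74.25 && t.pickupLon <= -73.7
```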
Output: A list of the fifty most significant hot spot cells in time and space as identified using the Getis-Ord statistic.
The Getis-Ord statistic for a cell $i$ is defined as

$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j}\, x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n-1}}}$$

where $x_j$ is the attribute value for cell $j$, $w_{i,j}$ is the spatial weight between cells $i$ and $j$, $n$ is equal to the total number of cells, and:

$$\bar{X} = \frac{\sum_{j=1}^{n} x_j}{n}, \qquad S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2}$$

The $G_i^*$ statistic is a z-score, so no further calculations are required.
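To make the computation concrete, here is a plain-Scala transcription of the formula above. This is a sketch only: the dense array layout, with one entry per cell, is illustrative rather than prescribed (in practice the weights are nonzero only for a cell's neighbors).

```scala
// Plain transcription of the Gi* formula above (not optimized).
// xs(j) is the attribute value of cell j; ws(j) is the weight w_{i,j}
// between the focal cell i and cell j; n is the total number of cells.
def getisOrd(xs: Array[Double], ws: Array[Double]): Double = {
  val n    = xs.length
  val xBar = xs.sum / n                                       // X-bar
  val s    = math.sqrt(xs.map(x => x * x).sum / n - xBar * xBar)
  val wSum = ws.sum
  val num  = xs.zip(ws).map { case (x, w) => x * w }.sum - xBar * wSum
  val den  = s * math.sqrt((n * ws.map(w => w * w).sum - wSum * wSum) / (n - 1.0))
  num / den                                                   // a z-score
}
```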
The neighborhood of each cell in the space-time cube consists of its adjacent cells in a grid formed by subdividing latitude and longitude uniformly. This spatial neighborhood is created for the preceding, current, and following time periods (i.e., each cell has 26 neighbors), as sketched below. For simplicity of computation, the weight of each neighbor cell is presumed to be equal.
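For example, the 26 neighbors could be enumerated as the surrounding 3 x 3 x 3 block minus the cell itself; the Cell index type below is an assumption.

```scala
// A cell is indexed by (x, y, t): longitude bin, latitude bin, time step.
case class Cell(x: Int, y: Int, t: Int)

// The 3x3x3 block around a cell, minus the cell itself: 26 neighbors.
// Cells on the boundary of the cube have fewer valid neighbors; bounds
// checking against the cube extent is omitted in this sketch.
def neighbors(c: Cell): Seq[Cell] =
  for {
    dx <- -1 to 1
    dy <- -1 to 1
    dt <- -1 to 1
    if !(dx == 0 && dy == 0 && dt == 0)
  } yield Cell(c.x + dx, c.y + dy, c.t + dt)
```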
Further details can be found in the original paper by Ord and Getis:
Ord, J.K., and A. Getis. 1995. "Local Spatial Autocorrelation Statistics: Distributional Issues and an Application." Geographical Analysis 27(4).
Constraints: Time and space should be aggregated into cube cells of the sizes specified on the command line. Together, these cells form a space-time cube.
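A sketch of how a trip might be assigned to its cube cell, assuming the cell extent in degrees and the time-step length in days are the command-line parameters; the grid origin and parameter names here are illustrative, not part of the specification.

```scala
import java.time.{Duration, Instant}

// Cell is the (x, y, t) index type from the sketch above.
// Map a pickup (lon, lat, time) to its space-time cube cell, given cell
// sizes supplied on the command line (degrees per cell, days per step).
// The grid origin (envelope corner, start instant) is an assumption.
def toCell(lon: Double, lat: Double, time: Instant,
           cellDeg: Double, stepDays: Long, origin: Instant): Cell = {
  val x = math.floor((lon - (-74.25)) / cellDeg).toInt
  val y = math.floor((lat - 40.5) / cellDeg).toInt
  val t = (Duration.between(origin, time).toDays / stepDays).toInt
  Cell(x, y, t)
}
```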
Description: Using spatial statistics to identify statistically significant clusters or outliers in spatial data is a well-understood analytic approach employed by GIS professionals. Spatial statistics are required when making decisions; e.g., if we are 95% sure that the levels measured in this area exceed a regulatory threshold, then we must respond accordingly. When identifying statistically significant clusters (often termed Hot Spot Analysis), a very commonly used statistic is the Getis-Ord statistic. It provides z-scores and p-values that allow users to determine where features with either high or low values cluster spatially. It is important to note that this statistic can be applied to both the spatial and spatio-temporal domains; in this competition, we are focused on the spatio-temporal Getis-Ord statistic.
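For reference, a Gi* z-score can be converted to a two-tailed p-value via the standard normal CDF. Since the Scala standard library lacks an error function, the sketch below uses the Abramowitz and Stegun 7.1.26 approximation; a library such as Apache Commons Math could be used instead.

```scala
// Approximate erf (Abramowitz & Stegun 7.1.26, ~1e-7 absolute error).
def erf(x: Double): Double = {
  val t = 1.0 / (1.0 + 0.3275911 * math.abs(x))
  val poly = t * (0.254829592 + t * (-0.284496736 +
             t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))))
  val y = 1.0 - poly * math.exp(-x * x)
  if (x >= 0) y else -y
}

// Standard normal CDF, then the two-tailed p-value for a z-score.
def normalCdf(z: Double): Double = 0.5 * (1.0 + erf(z / math.sqrt(2.0)))
def twoTailedP(z: Double): Double = 2.0 * (1.0 - normalCdf(math.abs(z)))
```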
Compounding the problem of identifying hot spots, we now live in a world where observational data is being collected at an ever-increasing rate. For example, large numbers of businesses and organizations are actively tracking their mobile assets. This data is often considerably valuable to those organizations, both from a real-time analytic perspective (e.g., geofencing) and for after-the-fact batch analysis. In order to handle these large collections of observational data (e.g., numbering in the billions of records), distributed processing techniques are required. Over the past few years, interest in the Apache Spark framework has exploded in government, industry, and academia alike. Spark can readily be installed on clusters of commodity hardware, and it offers a sophisticated software platform for creating scalable distributed algorithms using functional programming techniques and languages.
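As a rough sketch of how the counting stage of this problem might look on Spark: parseTrip is a hypothetical CSV parser returning Option[Trip], and insideEnvelope and toCell are the helpers sketched earlier; none of this structure is prescribed by the challenge.

```scala
import java.time.Instant
import org.apache.spark.{SparkConf, SparkContext}

object HotSpots {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HotSpots"))
    // Cell sizes taken from the command line, per the constraints above.
    val Array(input, cellDeg, stepDays) = args.take(3)
    val origin = Instant.parse("2009-01-01T00:00:00Z")  // assumed grid origin

    val cellCounts = sc.textFile(input)          // raw trip records
      .flatMap(parseTrip)                        // Option[Trip]; skips bad rows
      .filter(insideEnvelope)                    // clip to the NYC envelope
      .map(t => (toCell(t.pickupLon, t.pickupLat,
                        Instant.ofEpochMilli(t.pickupTime),
                        cellDeg.toDouble, stepDays.toLong, origin), 1L))
      .reduceByKey(_ + _)                        // trips per space-time cell

    // Gi* is then computed per cell from these counts (each cell needs its
    // 26 neighbors' counts plus the global mean and standard deviation),
    // and the fifty largest z-scores are collected, e.g. with takeOrdered.
    sc.stop()
  }
}
```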
Assumptions: The programs will be tested using a cluster of twenty-five commodity-level PCs (3 GHz, quad core), each equipped with 24 GB of RAM and 3 TB of disk storage. The program will run on top of the Apache Spark open-source framework. The program will be implemented in Java, Python, or Scala, applying functional programming techniques; Scala is the most natural and preferred language for this challenge.
Evaluation Rules: Evaluation will be done using two metrics: (a) correctness of the result (whether the identified spatio-temporal hot spots match those returned by the reference implementations), and (b) computation time when running against the specified input dataset across the cluster of twenty-five PCs.
Representing a hot spot as a single spatio-temporal cell in the space-time cube, we evaluate correctness using the Jaccard similarity coefficient of the reference result and the candidate's result, restricted to the top fifty cells. This in effect ignores small perturbations in the ordering of the top results. Additionally, we evaluate speed relative to the fastest submission and calculate a final score consisting of two thirds correctness and one third performance. In case of unclear score situations (e.g., very precise but slow, or very fast but imprecise), the Cup committee performs a manual review to determine the final ranking.
$$\text{score} = \frac{2}{3}\, J(R, S) + \frac{1}{3}\, \frac{t_{\min}}{t}, \qquad J(R, S) = \frac{|R \cap S|}{|R \cup S|}$$

where $R$ is the reference result set, $S$ is the submission result set, $J$ is the Jaccard similarity coefficient, $t_{\min}$ is the execution time of the fastest submission, and $t$ is the execution time of the submission.
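A direct transcription of this scoring scheme, as a sketch; the set elements stand for spatio-temporal cells.

```scala
// Jaccard similarity of two result sets of space-time cells.
def jaccard[A](r: Set[A], s: Set[A]): Double =
  if (r.isEmpty && s.isEmpty) 1.0
  else (r intersect s).size.toDouble / (r union s).size

// Final score: two thirds correctness, one third relative speed,
// where tMin is the fastest submission's time and t is this one's.
def score[A](r: Set[A], s: Set[A], tMin: Double, t: Double): Double =
  (2.0 / 3.0) * jaccard(r, s) + (1.0 / 3.0) * (tMin / t)
```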