Implementing the Pseudo Priority R-Tree (PR-Tree)

A Toy Implementation for Calculating Nearest Neighbour on Points in the X-Y Plane

Julius Davies, MSc student, University of Victoria, BC
Final project report for UVic's CSC 520 graduate course, "Analysis of Algorithms."
April 18th, 2011

Abstract

In this report we develop a 2-dimensional implementation of Arge et Al.'s Priority R-Tree to index a collection of points in the X-Y plane. We query the index with a randomly chosen new point to discover its nearest neighbour. We consider this a toy implementation since the PR-Tree (and R-Tree's in general) are designed to index axis-aligned rectangles (with potential overlap) in higher-dimension. However, we believe the implementation presented here provides a good starting point for programmers interested in understanding and exploring this intriguing data structure.

1. Introduction and Background

Typical algorithms presented in undergraduate computer science curriculums often assume all data fits in memory. Of course this assumption fails in many real world systems. Very large data sets exist that are orders of magnitude larger than a single computer's memory, and the majority of data in these cases must reside on slower storage media, such as solid-state-flash, hard disk, optical disk, tape drive, or network drives. The performance disparities between main-memory and secondary media, and in some cases, even physical characteristics (e.g. platter rotation and geometry in a hard-disk; sequential nature of tape) motivate the need for specialized algorithms and data structures that best suit each medium.



Figure 1: The research provenance of Arge et Al's PR-Tree. Two results from the 1970's (B-Tree, kd-tree) converge in 2008 to provide the PR-Tree, a worst-case disk-efficient spatial index for d-dimensional rectangles:
O((n/B)1-1/d + k/B)

B=block-size. k=number of answers.

In this report we focus exclusively on hard-disk for secondary storage. The B-Tree provides 1-dimensional indices for many production file systems and database engines, including PostgreSQL, MySQL, Oracle, Microsoft SQL Server. The key insight of the B-Tree is that since disk access is prohibitively expensive compared to CPU operations, efficient disk utilization is of critical importance, while efficient CPU utilization is of little concern. Seemingly slow or wasteful algorithms in memory to sort keys, find medians, split or merge nodes have next no performance impact as long as the number of disk reads and writes is minimized. Unfortunately the B-Tree cannot find nearest neighbour in a spatial context, but this insight guides us as we implement our 2-dimensional PR-Tree.

Why is a B-Tree (or any 1-dimensional index) unsuitable for nearest-neighbour? A simple example suffices. Suppose our index, in lexicographic order, contains the following points in the x-y plane:

(1,10), (1,11), (1,98), (2,99), (3,3)
And suppose (1,1) is our query point. What is the nearest neighbour to (1,1)? Unfortunately, a lexicographic index, being 1-dimensional, considers entries that share an x-coordinate to be closer to each other. The index does not effectively capture any notion of Euclidean distance, and thus only a full scan can locate the closest entry, (3,3), in this example.

Just as the B-Tree stores proximate values within the same block on disk, the R-Tree stores proximate shapes together. The R-Tree encodes spatial "nearness" via a single concept: the axis-aligned bounding rectangle.

  • Axis-aligned: by forcing all bounding rectangles to be axis-aligned (e.g. the sides of the rectangles are parallel to the x-axis or the y-axis), distance calculations become trivial.
  • Bounding rectangle: rectangles are suitable for capturing "closeness" of shapes since shapes inside an arbitrary rectangle have a closeness bound (the diagonal of the rectangle).

The R-Tree makes no guarantees about the efficacy of the bounding. The only guarantee is that each node in the tree will contain no more than B rectangles, where B is the block-size, (number of rectangles stored in each leaf on disk). This overflow constraint is again similar to the B-Tree, where each node can contain no more than B values, but an important difference between the two data structures also emerges. With a B-Tree, the remedy for an overfull node is evident: split the node into two nodes through the median value. Whereas with the R-Tree, there is no obvious remedy. How should one best partition B+1 smaller rectangles into two larger bounding rectangles such that the partition minimizes future disk accesses?

The PR-Tree, by considering each rectangle as a 4-dimensional point (xmin,ymin,xmax,ymax) in a kd-tree, is able to achieve a splitting optimum that guarantees good asymptotic performance (similar to the kd-tree) for subsequent accesses. Unfortunately this kd-tree technique only works during the bulk-loading phase, and is unsuitable for maintaining a dynamic index. Thus there is no support for adhoc additions and deletions after the initial construction of the PR-Tree.

2. Problem Definition



Figure 2: Given n original points in the x-y plane, and an additional query point q = (x,y), which of the n original points is closest to the query point?
(n = 64 in Figure 2).

While the R-Tree in general, and the PR-Tree specifically, can deal with more complex scenarios, for our toy implementation we considered only the nearest neighbor problem on points in the x-y plane. As shown in Figure 2, as input we have a data set of n points, and a query q at location (x,y). Our goal is to locate the single point among the n points that lies closest to the query point. A human can easily solve this problem in 2-dimensions with a quick glance toward the query point and its immediate surrounding region. For the computer the problem requires a spatial index, unless we are satisfied with a sequential scan.

One benefit of choosing a toy problem where the human is naturally proficient at it --- even the author's two-year-old son can solve it --- is the ease with which a programmer or student can evaluate their implementation. Should a proposed implementation identify an invalid nearest-neighbour in the x-y plane, the programmer will likely notice the error. This gives the programmer increased confidence should they attempt to implement a related solution including and beyond 4-dimensions, where most humans have no natural spatial intuition. Our own evaluation relies on the human's innate ability to solve the toy problem.

2.1 Nearest Rectangle

We employ an heuristic for finding the closest point: the closest rectangle to our query q = (x,y) probably contains the closest point. Since an R-Tree can store only axis-aligned rectangles, calculating the distance between the query and all rectangles at the current level of the tree is straight-forward, as shown in Figures 3 and 4. Initially we consider the axis-aligned possibilities {(a,y), (c,y), (x,b), (x,d)} since these will be closer if they lie on the rectangle. If none of these points lay on the rectangle, as shown in Figure 3, we look at the corners. The code listing in Figure 5 shows the approach implemented in Java.



Figure 3: The distance is measured from a corner. We report the elementary pythagorean result, leaving it squared: (x-a)*(x-a) + (y-b)*(y-b).
         

Figure 4: The distance is measured along an axis-aligned path. In this example the distance reported is (a-x)2. We square the answer to maintain compatibility with the corner distances. Since we never take square-roots, we can retain integer precision in our algorithm, although some data sets may be better served by floating point.
public class Rectangle {
 int a, b, c, d;

 public boolean containsPoint(long x, long y) { 
            return a <= x && x <= b && c <= y && y <= d; 
   } 

   public final long squaredDistanceTo(long x, long y) { 
            if (containsPoint(x,y)) { return 0; } 
            long aSquared = (x-a)*(x-a); 
            long bSquared = (y-b)*(y-b); 
            long cSquared = (x-c)*(x-c); 
            long dSquared = (y-d)*(y-d); 
 
            // Try perpendicular edges: (x,d), (x,b), (a,y), (c,y). 
            long min = Long.MAX_VALUE; 
            if (containsPoint(x,d)) { min = Math.min(min, dSquared); } 
            if (containsPoint(x,b)) { min = Math.min(min, bSquared); } 
            if (containsPoint(a,y)) { min = Math.min(min, aSquared); } 
            if (containsPoint(c,y)) { min = Math.min(min, cSquared); } 
            if (min != Long.MAX_VALUE) { return min; } // short circuit 
 
            // Try rectangle corners. 
            min = Math.min(min, aSquared+bSquared); 
            min = Math.min(min, aSquared+dSquared); 
            min = Math.min(min, cSquared+bSquared); 
            min = Math.min(min, cSquared+dSquared); 
            return min; 
        }
}
Figure 5: A Java implementation to calculate the squared distance between a query point q = (x,y) and a rectangle.

2.2 R-Tree Best-First Search

Our search algorithm, for traversing our PR-Tree, explores neighbouring rectangles from closest to farthest. This is a standard "best-first" greedy search based on the heuristic that the closest rectangle probably contains the closest point. For each rectangle explored, its contained points are loaded from disk, and a sequential scan is performed on these to determine the nearest point to the query point inside the rectangle. Throughout the traversal a 'best candidate' is maintained. We use the distance between the query point and the 'best known' candidate to prune our search tree at each level: should any unexplored rectangles lie outside this distance, we omit them from our search.

Recall that in an R-Tree each leaf resides on disk and contains at most B points. Since B is relatively small compared to our data set, the sequential scans applied inside each leaf node have negligible cost compared to the cost of a disk access. More efficient algorithms to scan the points inside the leaves are unnecessary; in this way our R-Tree search is similar to a standard B-Tree search (B-Tree leaves usually store sorted lists of values, but this is unnecessary). Figures 6.1 through 6.4, below, illustrate a best-first search on the final level of an example R-Tree.



Figure 6.1: The closest rectangle to the query q = (x,y) is the rectangle that happens to contain it. A sequential scan of the 8 points inside the rectangle finds the nearest, and a prune-radius is created from this distance. The two rectangles that lie outside the prune radius are eliminated from consideration.
         

Figure 6.2: The 2nd closest rectangle lies directly to the right of the query point. A sequential scan of the 8 points inside this rectangle locates a new 'best candidate.'


Figure 6.3: The prune radius is updated based on the latest 'best candidate.' This in turn eliminates two more rectangles from consideration. Now it is time to look at the points in the 3rd closest rectangle. Unfortunately, this rectangle fails to provide a new 'best candidate.'
         

Figure 6.4: The 4th closest rectangle is considered. Since all other rectangles have either been pruned or explored, the best candidate found by now is the answer.

3. Implementation

3.1 Bulk-Load

To build the pseudo PR-Tree in the first place, the initial algorithm presented in [6] is straight-forward. As in [6], we employ a top-down approach. The lowest levels of the tree may contain sub-trees with fewer than 4 leaf-nodes; the search procedure should guard against this possibility.

Algorithm to build Pseudo PR-Tree on n points in the x-y plane:

  1. Extract B left-extreme points. Store them in a leaf-node.
  2. Extract B bottom-extreme points. Store them in a leaf-node.
  3. Extract B right-extreme points. Store them in a leaf-node.
  4. Extract B top-extreme points. Store them in a leaf-node.
  5. Partition the remaining n - 4B nodes into two halves based on our current tree depth, using the same scheme as found in the kd-tree data structure:
    1. If ((depth % 4) == 0) partition by the x-coordinate (left-extreme).
    2. If ((depth % 4) == 1) partition by the y-coordinate (bottom-extreme).
    3. If ((depth % 4) == 2) partition by the x-coordinate in reverse (right-extreme).
    4. If ((depth % 4) == 3) partition by the y-coordinate in reverse (top-extreme).
  6. Recursivly apply this algorithm to create two sub-trees on the partitioned halves.
  7. Stop when no points remain to be indexed (e.g. stop when n - 4B ≤ 0).

3.2 Query

Our toy implementation, a pseudo PR-Tree to index points on the x-y plane, differs slightly from the general R-Tree presented earlier in this report. In a proper R-Tree all leaves reside on the final level, whereas in the pseudo PR-Tree, each level contains 2d leaf nodes (d=2 in our case), and two sub-trees. Since the sub-trees are tightly enclosed by bounding rectangles, the best-first greedy search outlined before is still applicable, with one complication: only the leaf nodes perform sequential scans; the sub-trees employ recursive scans. At each level of the tree, a query may fall into one of two cases:

  • Case 1: The query lands in one of the six rectangles for the current level (see Figure 7). We start are best-first search by scanning the rectangle we landed in, either recursively or sequentially, depending on the rectangle's underlying type (sub-tree or leaf-node).
  • Case 2: The query does not land in any of the six rectangles (see Figure 8). We start are best-first search by scanning the closest rectangle.


Figure 9. A query happens to land inside a current-level rectangle. We consider the points inside that rectangle first, since it is the closest one (distance=0).
         

Figure 10. A query lands outside all current-level rectangles. We first explore the points in the nearest rectangle, which is "Left Sub PR-Tree" in this case.

A Java implementation of our PR-Tree search algorithm is given, in Figure 9.

public class PRTree implements Rectangle { 

    /**
     * Returns a Comparator object that orders rectangles
     * based on their distance from the supplied (x,y) point.
     *
     * @param x coordinate to measure nearness from.
     * @param y coordinate to measure nearness from.
     * @return A Comparator that returns 1, 0, or -1 based on how two supplied
     *         rectangles compare to each other.
     */ 
    private static Comparator<Rectangle> closenessComparator( 
            final int x, final int y 
    ) { 
        return new Comparator<Rectangle>() { 
            public int compare(Rectangle r1, Rectangle r2) { 
                if (r1 == r2) { return 0; } 
 
                // null is always far away (bigger) 
                if (r1 == null) { return 1; } 
                if (r2 == null) { return -1; } 
 
                long a = r1.squaredDistanceTo(x,y); 
                long b = r2.squaredDistanceTo(x,y); 
                return a > b ? 1 : a == b ? 0 : -1; 
            } 
        }; 
    } 


     /*
     Here is the PR-Tree implementation:
     Two sub-trees and four priority leaves. 
     */ 
 
    private Leaf[] priorityLeaves = new Leaf[4]; 
    private PRTree2D prTree1; 
    private PRTree2D prTree2; 
 
    public Rectangle queryNearest(int x, int y) { 
        Comparator<Rectangle> c = closenessComparator(x, y);
        return queryNearest(c, x, y); 
    } 
 
    public Rectangle queryNearest(Comparator c, int x, int y) { 

        Rectangle[] currentLevel = {
                priorityLeaves[0],
                priorityLeaves[1],
                priorityLeaves[2],
                priorityLeaves[3],
                prTree1,
                prTree2
        };
 
        // Apply our heuristic:  the closest rectangle probably 
        // contains the closest point. 
        Arrays.sort(currentLevel, c);
 
        Rectangle bestCandidate = null;
        long bestSquaredDistance = Long.MAX_VALUE; 
        for (Rectangle r : currentLevel) { 
 
            // Notice the pruning based on bestSquaredDistance. 
            if (r != null && r.squaredDistanceTo(x,y) < bestSquaredDistance) { 
 
                // This is a recursive call when (r instanceof PRTree),
                // otherwise it's a sequential scan of the leaf points.  
                Rectangle candidate = r.queryNearest(c, x, y);
 
                // Should we update our bestCandidate? 
                long d = candidate.squaredDistanceTo(x,y); 
                if (d < bestSquaredDistance) { 
                    bestCandidate = candidate;
                    bestSquaredDistance = d;
                }
            }
        }
        return bestCandidate; 
    }
 
Figure 9: A Java implementation of a best-first greedy search to find the closest point to a query point q = (x,y) in a PR-Tree.

4. Evaluation

4.1 Correctness



Figure 10.1.
q = center. 50 random points. B = 4.
 
         

Figure 10.2.
q = center. 500 random points. B = 4.
 


Figure 10.3.
q = random. 50 random points. B = 4. 

 
         

Figure 10.4.
q = random. 500 random points. B = 4.

 

4.2 Performance

Performance numbers come from running a pure-memory implementation (no disk accesses) on an Intel Core i3 2.26ghz laptop with 8GB of ram. The geographic space was limited to a 300,000 by 300,000 2-d array of possible locations for each run. The times reported represent the best observed of 5 runs. We ran the Oracle Java 6 profiler to locate any performance bottlenecks in our implementation, and the majority of execution time was spent partitioning the data, especially the partitioning into halves. We used a partitioning algorithm based on heap-sort (averaging over 8 comparisons per item in our largest run), whereas an algorithm based on Floyd-Rivest's SELECT would likely speed up our implementation significantly (literature suggests it usually requires less than 2-comparisons per item).

# of pointstime-to-buildtime-to-queryquery-recursionslinear-scan
5,00010ms0msbetween 7 to 90ms
50,000170ms0msbetween 11 to 121ms
500,0003950ms0msbetween 14 to 1510ms
5,000,00077500ms0msbetween 16 to 18350ms

6. References

  1. R. Bayer and E. M. McCreight, "Organization and maintenance of large ordered indexes," in SIGFIDET Workshop. ACM, 1970, pp. 107-141.
  2. J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, 1975.
  3. A. Guttman, "R-trees: A dynamic index structure for spatial searching," in SIGMOD Conference, B. Yormark, Ed. ACM Press, 1984, pp. 47-57.
  4. N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, "The r*-tree: An efficient and robust access method for points and rectangles," in SIGMOD Conference, H. Garcia-Molina and H. V. Jagadish, Eds. ACM Press, 1990, pp. 322-331.
  5. I. Kamel and C. Faloutsos, "Hilbert r-tree: An improved r-tree using fractals," in VLDB, J. B. Bocca, M. Jarke, and C. Zaniolo, Eds. Morgan Kaufmann, 1994, pp. 500-509.
  6. L. Arge, M. de Berg, H. J. Haverkort, and K. Yi, "The priority r-tree: A practically efficient and worst-case optimal r-tree," ACM Transactions on Algorithms, vol. 4, no. 1, 2008.