Implementing the Pseudo Priority R-Tree (PRTree)
A Toy Implementation for Calculating Nearest Neighbour on Points in the XY Plane

Final project report for UVic's CSC 520 graduate course, "Analysis of Algorithms." April 18th, 2011

Abstract

In this report we develop a 2-dimensional implementation of Arge et al.'s Priority R-Tree to index a collection of points in the XY plane. We query the index with a randomly chosen new point to discover its nearest neighbour. We consider this a toy implementation since the PR-tree (and R-trees in general) are designed to index axis-aligned rectangles (with potential overlap) in higher dimensions. However, we believe the implementation presented here provides a good starting point for programmers interested in understanding and exploring this intriguing data structure.

1. Introduction and Background

Typical algorithms presented in undergraduate computer science curricula often assume all data fits in memory. Of course this assumption fails in many real-world systems. Very large data sets exist that are orders of magnitude larger than a single computer's memory, and the majority of data in these cases must reside on slower storage media, such as solid-state flash, hard disk, optical disk, tape drives, or network drives. The performance disparities between main memory and secondary media, and in some cases even physical characteristics (e.g. platter rotation and geometry in a hard disk; the sequential nature of tape), motivate the need for specialized algorithms and data structures that best suit each medium.
Just as the B-tree stores proximate values within the same block on disk, the R-tree stores proximate shapes together. The R-tree encodes spatial "nearness" via a single concept: the axis-aligned bounding rectangle.
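Computing an axis-aligned bounding rectangle is a one-pass min/max scan over the contained points; a minimal sketch (the class and method names here are ours, not from the report):

```java
import java.util.List;

public class Mbr {
    // Computes the minimum bounding rectangle (x_min, y_min, x_max, y_max)
    // of a set of points, where each point is an {x, y} pair.
    public static int[] mbr(List<int[]> points) {
        int xMin = Integer.MAX_VALUE, yMin = Integer.MAX_VALUE;
        int xMax = Integer.MIN_VALUE, yMax = Integer.MIN_VALUE;
        for (int[] p : points) {
            xMin = Math.min(xMin, p[0]);
            yMin = Math.min(yMin, p[1]);
            xMax = Math.max(xMax, p[0]);
            yMax = Math.max(yMax, p[1]);
        }
        return new int[] { xMin, yMin, xMax, yMax };
    }
}
```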
The R-tree makes no guarantees about the efficacy of the bounding. The only guarantee is that each node in the tree will contain no more than B rectangles, where B is the block size (the number of rectangles stored in each leaf on disk). This capacity constraint is again similar to the B-tree, where each node can contain no more than B values, but an important difference between the two data structures also emerges. With a B-tree, the remedy for an overfull node is evident: split the node into two nodes at the median value. With the R-tree, there is no obvious remedy. How should one best partition B+1 smaller rectangles into two larger bounding rectangles such that the partition minimizes future disk accesses? The PR-tree, by considering each rectangle as a 4-dimensional point (x_{min}, y_{min}, x_{max}, y_{max}) in a kd-tree, achieves a split that guarantees good asymptotic performance (similar to the kd-tree) for subsequent accesses. Unfortunately this kd-tree technique only works during the bulk-loading phase and is unsuitable for maintaining a dynamic index. Thus there is no support for ad-hoc additions and deletions after the initial construction of the PR-tree.

2. Problem Definition
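The kd-tree trick can be sketched concretely: treat each rectangle as the 4-dimensional point (x_{min}, y_{min}, x_{max}, y_{max}) and split an overfull set at the median of one coordinate, cycling through the four coordinates by depth just as a kd-tree cycles through its axes. A hedged illustration (names and layout are ours, not the report's code):

```java
import java.util.Arrays;
import java.util.Comparator;

public class KdSplit {
    // Each rectangle is viewed as the 4-d point
    //   (r[0], r[1], r[2], r[3]) = (x_min, y_min, x_max, y_max).
    // Splits rects at the median of coordinate (depth % 4): the lower
    // half goes to one child, the upper half to the other.
    public static int[][][] splitAtMedian(int[][] rects, int depth) {
        int axis = depth % 4;
        int[][] sorted = rects.clone();
        Arrays.sort(sorted, Comparator.comparingInt((int[] r) -> r[axis]));
        int mid = sorted.length / 2;
        return new int[][][] {
            Arrays.copyOfRange(sorted, 0, mid),
            Arrays.copyOfRange(sorted, mid, sorted.length)
        };
    }
}
```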
2.1 Nearest Rectangle

We employ a heuristic for finding the closest point: the closest rectangle to our query q = (x,y) probably contains the closest point. Since an R-tree can store only axis-aligned rectangles, calculating the distance between the query and all rectangles at the current level of the tree is straightforward, as shown in Figures 3 and 4. Initially we consider the axis-aligned possibilities {(a,y), (c,y), (x,b), (x,d)}, since one of these will be closest if it lies on the rectangle. If none of these points lies on the rectangle, as shown in Figure 3, we look at the corners. The code listing in Figure 5 shows the approach implemented in Java.
public class Rectangle {
    int a, b, c, d;  // a = x_min, b = y_min, c = x_max, d = y_max

    public Rectangle(int a, int b, int c, int d) {
        this.a = a; this.b = b; this.c = c; this.d = d;
    }

    public boolean containsPoint(long x, long y) {
        return a <= x && x <= c && b <= y && y <= d;
    }

    public final long squaredDistanceTo(long x, long y) {
        if (containsPoint(x, y)) {
            return 0;
        }
        long aSquared = (x - a) * (x - a);
        long bSquared = (y - b) * (y - b);
        long cSquared = (x - c) * (x - c);
        long dSquared = (y - d) * (y - d);

        // Try the perpendicular projections onto the four edges:
        // (x, d), (x, b), (a, y), (c, y).
        long min = Long.MAX_VALUE;
        if (containsPoint(x, d)) { min = Math.min(min, dSquared); }
        if (containsPoint(x, b)) { min = Math.min(min, bSquared); }
        if (containsPoint(a, y)) { min = Math.min(min, aSquared); }
        if (containsPoint(c, y)) { min = Math.min(min, cSquared); }
        if (min != Long.MAX_VALUE) {
            return min; // short circuit
        }

        // Try the four rectangle corners.
        min = Math.min(min, aSquared + bSquared); // corner (a, b)
        min = Math.min(min, aSquared + dSquared); // corner (a, d)
        min = Math.min(min, cSquared + bSquared); // corner (c, b)
        min = Math.min(min, cSquared + dSquared); // corner (c, d)
        return min;
    }
}
Figure 5: A Java implementation to calculate the squared distance between a query point q = (x,y) and a rectangle.
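The three cases in Figure 5 (interior, edge projection, corner) can also be expressed as a single clamp: the nearest point of an axis-aligned rectangle [a,c] x [b,d] to q is q clamped into the rectangle. A small self-contained check of this equivalent formulation (ours, not the report's):

```java
public class DistanceDemo {
    // Squared distance from (x, y) to the axis-aligned rectangle
    // [a, c] x [b, d]; covers the same three cases as Figure 5.
    static long squaredDistance(long a, long b, long c, long d,
                                long x, long y) {
        // Clamp the query onto the rectangle; the clamped point is the
        // nearest point of the rectangle (interior, edge, or corner).
        long cx = Math.max(a, Math.min(x, c));
        long cy = Math.max(b, Math.min(y, d));
        return (x - cx) * (x - cx) + (y - cy) * (y - cy);
    }
}
```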
2.2 R-Tree Best-First Search

Our search algorithm for traversing the PR-tree explores neighbouring rectangles from closest to farthest. This is a standard "best-first" greedy search based on the heuristic that the closest rectangle probably contains the closest point. For each rectangle explored, its contained points are loaded from disk, and a sequential scan is performed on these to determine the nearest point to the query inside the rectangle. Throughout the traversal a 'best candidate' is maintained. We use the distance between the query point and the best known candidate to prune our search tree at each level: should any unexplored rectangles lie outside this distance, we omit them from our search.

Recall that in an R-tree each leaf resides on disk and contains at most B points. Since B is relatively small compared to our data set, the sequential scans applied inside each leaf node have negligible cost compared to the cost of a disk access. More efficient algorithms to scan the points inside the leaves are unnecessary; in this way our R-tree search is similar to a standard B-tree search (B-tree leaves usually store sorted lists of values, but even this is unnecessary). Figures 6.1 through 6.4, below, illustrate a best-first search on the final level of an example R-tree.
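The Leaf class itself is not listed in this report; a minimal sketch of the sequential scan a leaf might perform over its at-most-B points (the method name and the long[][] point layout are our assumptions):

```java
public class LeafScan {
    // Sequential scan over the points stored in a leaf: returns the
    // index of the point closest to (x, y), or -1 if the leaf is empty.
    // With at most B points per leaf, this linear scan is negligible
    // next to the disk access that loaded the leaf.
    public static int nearestIndex(long[][] points, long x, long y) {
        int best = -1;
        long bestD = Long.MAX_VALUE;
        for (int i = 0; i < points.length; i++) {
            long dx = points[i][0] - x, dy = points[i][1] - y;
            long d = dx * dx + dy * dy;
            if (d < bestD) { bestD = d; best = i; }
        }
        return best;
    }
}
```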
3. Implementation

3.1 Bulk-Load

To build the pseudo PR-tree in the first place, the algorithm presented in [6] is straightforward. As in [6], we employ a top-down approach. The lowest levels of the tree may contain subtrees with fewer than 4 leaf nodes; the search procedure should guard against this possibility.

Algorithm to build a pseudo PR-tree on n points in the xy plane:
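As a hedged reconstruction of the build from [6] and the description in Section 3.2 (four priority leaves of extreme points plus two subtrees split at the median coordinate), the bulk-load might look like the following. All names, the Node layout, and the toy block size B are our assumptions, not the report's listing:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BulkLoad {
    static final int B = 2; // tiny block size, for illustration only

    // A node holds up to four priority leaves and two subtrees.
    static class Node {
        List<List<long[]>> priorityLeaves = new ArrayList<>();
        Node left, right;
    }

    // Removes and returns the (up to) B points extreme in the order
    // given by the comparator; these form one priority leaf.
    static List<long[]> extract(List<long[]> pts, Comparator<long[]> c) {
        pts.sort(c);
        List<long[]> leaf =
            new ArrayList<>(pts.subList(0, Math.min(B, pts.size())));
        pts.subList(0, leaf.size()).clear();
        return leaf;
    }

    static Node build(List<long[]> pts, int depth) {
        if (pts.isEmpty()) return null;
        Node node = new Node();
        // Four priority leaves: the B points extreme in each of the
        // four directions (min-x, min-y, max-x, max-y).
        node.priorityLeaves.add(extract(pts, Comparator.comparingLong((long[] p) -> p[0])));
        node.priorityLeaves.add(extract(pts, Comparator.comparingLong((long[] p) -> p[1])));
        node.priorityLeaves.add(extract(pts, Comparator.comparingLong((long[] p) -> p[0]).reversed()));
        node.priorityLeaves.add(extract(pts, Comparator.comparingLong((long[] p) -> p[1]).reversed()));
        if (pts.isEmpty()) return node;
        // Split the remainder at the median of coordinate (depth % 2),
        // kd-tree style, and recurse into the two subtrees.
        int axis = depth % 2;
        pts.sort(Comparator.comparingLong((long[] p) -> p[axis]));
        int mid = pts.size() / 2;
        node.left  = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
        node.right = build(new ArrayList<>(pts.subList(mid, pts.size())), depth + 1);
        return node;
    }
}
```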
3.2 Query

Our toy implementation, a pseudo PR-tree to index points on the xy plane, differs slightly from the general R-tree presented earlier in this report. In a proper R-tree all leaves reside on the final level, whereas in the pseudo PR-tree each level contains 2^{d} leaf nodes (d = 2 in our case) and two subtrees. Since the subtrees are tightly enclosed by bounding rectangles, the best-first greedy search outlined before is still applicable, with one complication: only the leaf nodes perform sequential scans; the subtrees employ recursive scans. At each level of the tree, a query may fall into one of two cases:
A Java implementation of our PR-tree search algorithm is given in Figure 9.

import java.util.Arrays;
import java.util.Comparator;

// Note: here Rectangle is treated as an interface exposing
// squaredDistanceTo(x, y) and queryNearest(c, x, y), implemented by
// both Leaf and PRTree.
public class PRTree implements Rectangle {

    /**
     * Returns a Comparator that orders rectangles by their distance
     * from the supplied (x, y) point, closest first.
     *
     * @param x coordinate to measure nearness from.
     * @param y coordinate to measure nearness from.
     * @return a Comparator that returns -1, 0, or 1 based on how two
     *         supplied rectangles compare to each other.
     */
    private static Comparator<Rectangle> closenessComparator(
            final int x, final int y) {
        return new Comparator<Rectangle>() {
            public int compare(Rectangle r1, Rectangle r2) {
                if (r1 == r2) { return 0; }
                // null is always far away (bigger).
                if (r1 == null) { return 1; }
                if (r2 == null) { return -1; }
                long a = r1.squaredDistanceTo(x, y);
                long b = r2.squaredDistanceTo(x, y);
                return a > b ? 1 : a == b ? 0 : -1;
            }
        };
    }

    // The PRTree implementation: two subtrees and four priority leaves.
    private Leaf[] priorityLeaves = new Leaf[4];
    private PRTree prTree1;
    private PRTree prTree2;

    public Rectangle queryNearest(int x, int y) {
        Comparator<Rectangle> c = closenessComparator(x, y);
        return queryNearest(c, x, y);
    }

    public Rectangle queryNearest(Comparator<Rectangle> c, int x, int y) {
        Rectangle[] currentLevel = {
            priorityLeaves[0], priorityLeaves[1],
            priorityLeaves[2], priorityLeaves[3],
            prTree1, prTree2
        };
        // Apply our heuristic: the closest rectangle probably
        // contains the closest point.
        Arrays.sort(currentLevel, c);

        Rectangle bestCandidate = null;
        long bestSquaredDistance = Long.MAX_VALUE;
        for (Rectangle r : currentLevel) {
            // Notice the pruning based on bestSquaredDistance.
            if (r != null && r.squaredDistanceTo(x, y) < bestSquaredDistance) {
                // This is a recursive call when (r instanceof PRTree),
                // otherwise it is a sequential scan of the leaf points.
                Rectangle candidate = r.queryNearest(c, x, y);
                if (candidate == null) { continue; } // empty subtree
                // Should we update our bestCandidate?
                long d = candidate.squaredDistanceTo(x, y);
                if (d < bestSquaredDistance) {
                    bestCandidate = candidate;
                    bestSquaredDistance = d;
                }
            }
        }
        return bestCandidate;
    }
}
Figure 9: A Java implementation of a best-first greedy search to find the closest point to a query point q = (x,y) in a PR-tree.
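The null-ordering convention in Figure 9's comparator can be exercised on its own; the following re-states it over boxed squared distances rather than rectangles (a standalone illustration, not the report's class):

```java
import java.util.Comparator;

public class NullFarDemo {
    // Orders boxed squared distances closest-first, with null sorting
    // last, mirroring the null handling in closenessComparator.
    static final Comparator<Long> CLOSEST_FIRST = (d1, d2) -> {
        if (d1 == d2) return 0;
        if (d1 == null) return 1;   // null is always far away (bigger)
        if (d2 == null) return -1;
        return d1 > d2 ? 1 : d1.equals(d2) ? 0 : -1;
    };
}
```

The same effect could be had with `Comparator.nullsLast(Comparator.naturalOrder())`; the explicit form above matches the structure of the report's code.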
4. Evaluation

4.1 Correctness
4.2 Performance

Performance numbers come from running a pure-memory implementation (no disk accesses) on an Intel Core i3 2.26 GHz laptop with 8 GB of RAM. The geographic space was limited to a 300,000 by 300,000 grid of possible locations for each run. The times reported represent the best observed of 5 runs. We ran the Oracle Java 6 profiler to locate any performance bottlenecks in our implementation, and the majority of execution time was spent partitioning the data, especially the partitioning into halves. We used a partitioning algorithm based on heapsort (averaging over 8 comparisons per item in our largest run), whereas an algorithm based on Floyd-Rivest's SELECT would likely speed up our implementation significantly (the literature suggests it usually requires fewer than 2 comparisons per item).
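To make the comparison concrete: an ordinary randomized quickselect already partitions around the median in expected linear time, and Floyd-Rivest's SELECT improves on it mainly by sampling to choose a pivot very close to the k-th element. A hedged sketch of the quickselect baseline (ours, not the report's partitioning code):

```java
import java.util.Random;

public class MedianPartition {
    // Partially orders a[] so that a[k] holds the k-th smallest element,
    // everything before it is <= a[k], and everything after is >= a[k].
    // Expected O(n) comparisons, versus O(n log n) for a heapsort-based
    // partition; Floyd-Rivest's SELECT tightens the constant by sampling
    // to choose a better pivot.
    public static void select(long[] a, int k) {
        Random rnd = new Random(42); // fixed seed, for reproducibility
        int lo = 0, hi = a.length - 1;
        while (lo < hi) {
            int p = lo + rnd.nextInt(hi - lo + 1); // random pivot
            swap(a, p, hi);
            int store = lo;
            for (int i = lo; i < hi; i++) {
                if (a[i] < a[hi]) { swap(a, i, store++); }
            }
            swap(a, store, hi); // pivot now at its final position
            if (store == k) return;
            if (store < k) { lo = store + 1; } else { hi = store - 1; }
        }
    }

    private static void swap(long[] a, int i, int j) {
        long t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```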
