[Calculation method] Orthogonal region query: the KD-Tree concept

1. Description

The KD tree is a binary tree data structure that supports efficient kNN (k-nearest-neighbor) queries. Because the full algorithm is rather involved, this article first introduces the idea of recording and indexing space in the form of a binary tree, so that the KD tree itself becomes easier to understand.

2. Orthogonal region search

2.1 Definition

Given a table of points in k-dimensional space, an orthogonal region search finds all the data inside an axis-aligned hyperrectangle. It is called an orthogonal region search because the query ranges along the k attribute dimensions are independent of one another.
Many queries can be converted into an orthogonal region search. For example: among a group of employees, how many have an age in [a, b], a salary in [l, r], and a family size in [n, m]? There are many ways to answer this, such as nested trees (a tree of trees). One approach is to map each employee's age x, salary y, and family size z to a point (x, y, z) in three-dimensional space; the query then becomes an orthogonal region search, that is, counting the points that fall inside an axis-aligned box.
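Before introducing the tree, here is a minimal baseline sketch in Python: every point is simply checked against the box (the employee data is made up for illustration):

# Each employee becomes a 3D point (age, salary, family size); the query
# counts the points inside an axis-aligned box. Hypothetical data.
employees = [(34, 5200, 3), (28, 4100, 2), (45, 8800, 4), (39, 6100, 5)]

def count_in_box(points, lo, hi):
    """Count the points p with lo[i] <= p[i] <= hi[i] in every dimension."""
    return sum(all(lo[i] <= p[i] <= hi[i] for i in range(len(p)))
               for p in points)

# Employees aged [30, 50], earning [5000, 9000], with family size [3, 5].
print(count_in_box(employees, (30, 5000, 3), (50, 9000, 5)))  # prints 3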

For higher-dimensional queries, we need a data structure that works in any number of dimensions. Note: nested tree-in-tree queries do not treat the dimensions symmetrically, so iterating nested binary-tree queries dimension by dimension is not advisable.

2.2 Introduction to the KD tree

First, the name: K is the number of dimensions (D stands for Dimension), and "tree" indicates that it is a tree structure. A node in a KD tree basically stores:

  • a K-dimensional space domain (e.g. a cuboid in three dimensions),
  • the coordinates of a K-dimensional point,
  • the indices of its two children.

In a balanced tree, we know that each node can maintain the minimum and maximum of the weights in the subtree rooted at it.
The K-dimensional space domain is maintained in the same way: each node keeps the coordinate range of the points in its subtree.

const int K = 3;        // number of dimensions
struct KD_Tree
{
    int p[K];           // coordinates of the point stored at this node
    int son[2];         // indices of the two children
    int range[K][2];    // per-dimension [min, max] over the subtree's points
} tr[N];                // N: maximum number of nodes

In the code above, p holds the coordinates of the original point stored at the node, son holds the indices of its two children, and range stores the K-dimensional space domain, i.e. the bounding box of the subtree's points.
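How the space domain is kept up to date is analogous to the min/max push-up in a balanced tree. A minimal Python sketch of the idea (the name push_up and its interface are assumptions for illustration, not part of the code above):

# Hedged sketch: the subtree's bounding box is the union of the node's own
# point and its children's boxes, merged dimension by dimension.
def push_up(point, child_ranges):
    # point: the K coordinates stored at this node
    # child_ranges: the [min, max] boxes of the children that exist
    k = len(point)
    box = [[point[i], point[i]] for i in range(k)]
    for rng in child_ranges:
        for i in range(k):
            box[i][0] = min(box[i][0], rng[i][0])
            box[i][1] = max(box[i][1], rng[i][1])
    return box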

2.3 Constructing the KD tree

Basic idea:

  • The KD tree is a balanced binary tree. Each non-leaf node can be imagined as a hyperplane, perpendicular to one coordinate axis, that divides the space domain stored at the node.
  • The tree should be as balanced as possible, so the two half-spaces cut by the hyperplane should contain point counts as equal as possible.
  • For generality, the axis perpendicular to the splitting hyperplane is taken in turn at each level of the tree: the first level splits perpendicular to the x-axis, the second to the y-axis, the third to the z-axis, and so on.

Splitting perpendicular to an axis means using the coordinate along that axis as the sort key.
For example, to split perpendicular to the x-axis, take the point whose x-coordinate is the median of the current point set as the split point; it becomes the parent, i.e. the point stored at the new KD-tree node, and the points on either side of the cut form the point sets of the left and right subtrees respectively.
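A minimal Python sketch of this alternating-axis median split (the dict-based node layout is just for illustration; the full implementation in section 3.2 follows the same pattern):

def build(points, depth=0, k=2):
    if not points:
        return None
    axis = depth % k                        # cycle through the axes per level
    points = sorted(points, key=lambda p: p[axis])
    m = len(points) // 2                    # index of the median point
    return {
        'point': points[m],                 # split point stored at this node
        'left':  build(points[:m], depth + 1, k),
        'right': build(points[m + 1:], depth + 1, k),
    }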

2.4 Two-dimensional example to illustrate the principle

1) Consider the two-dimensional points shown in the figure below:

2) Build the x-axis node of the 2D balanced tree

Find the bisecting line l1 along the x-axis.

3) Build the y-axis nodes of the 2D balanced tree

Proceeding depth-first:

  • Find the bisecting line l2 along the y-axis, within the region to the left of l1.

  • Find the bisecting line l4 along the x-axis, within the region enclosed by l1 and l2.

Continuing in this way completes the partition of the plane shown in the figure.
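Since the original figures are not reproduced here, a hypothetical 2D trace using the build() sketch from section 2.3 stands in for them (the points are made up):

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts, k=2)
print(tree['point'])            # root splits on x: the median point (7, 2)
print(tree['left']['point'])    # its left child splits on y: (5, 4)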

3. A three-dimensional case study

3.1 An example

Consider three indicators: blood type, platelet count, and blood pressure. Map them to x, y, and z and select the splitting axis alternately, as above, to construct the binary tree.

3.2 Sample build code (Python)

The construction code is given below (from the repository cited in the reference at the end):

import heapq

class KDTree(object):
    
    """
    A super short KD-Tree for points...
    so concise that you can copypasta into your homework
    without arousing suspicion.
    This implementation only supports Euclidean distance.
    The points can be any array-like type, e.g.:
        lists, tuples, numpy arrays.
    Usage:
    1. Make the KD-Tree:
        `kd_tree = KDTree(points, dim)`
    2. You can then use `get_knn` for k nearest neighbors or
       `get_nearest` for the nearest neighbor
    points should be a list of points: [[0, 1, 2], [12.3, 4.5, 2.3], ...]
    """
    def __init__(self, points, dim, dist_sq_func=None):
        """Makes the KD-Tree for fast lookup.
        Parameters
        ----------
        points : list<point>
            A list of points.
        dim: int
            The dimension of the points.
        dist_sq_func : function(point, point), optional
            A function that returns the squared Euclidean distance
            between the two points.
            If omitted, it uses the default implementation.
        """

        if dist_sq_func is None:
            dist_sq_func = lambda a, b: sum((x - b[i]) ** 2
                for i, x in enumerate(a))
                
        def make(points, i=0):
            if len(points) > 1:
                points.sort(key=lambda x: x[i])
                i = (i + 1) % dim
                m = len(points) >> 1
                return [make(points[:m], i), make(points[m + 1:], i),
                    points[m]]
            if len(points) == 1:
                return [None, None, points[0]]
        
        def add_point(node, point, i=0):
            if node is not None:
                dx = node[2][i] - point[i]
                for j, c in ((0, dx >= 0), (1, dx < 0)):
                    if c and node[j] is None:
                        node[j] = [None, None, point]
                    elif c:
                        add_point(node[j], point, (i + 1) % dim)

        def get_knn(node, point, k, return_dist_sq, heap, i=0, tiebreaker=1):
            if node is not None:
                dist_sq = dist_sq_func(point, node[2])
                dx = node[2][i] - point[i]
                if len(heap) < k:
                    heapq.heappush(heap, (-dist_sq, tiebreaker, node[2]))
                elif dist_sq < -heap[0][0]:
                    heapq.heappushpop(heap, (-dist_sq, tiebreaker, node[2]))
                i = (i + 1) % dim
                # Goes into the left branch, then the right branch if needed
                for b in (dx < 0, dx >= 0)[:1 + (dx * dx < -heap[0][0])]:
                    get_knn(node[b], point, k, return_dist_sq,
                        heap, i, (tiebreaker << 1) | b)
            if tiebreaker == 1:
                return [(-h[0], h[2]) if return_dist_sq else h[2]
                    for h in sorted(heap)][::-1]

        def walk(node):
            if node is not None:
                for j in 0, 1:
                    for x in walk(node[j]):
                        yield x
                yield node[2]

        self._add_point = add_point
        self._get_knn = get_knn
        self._root = make(points)
        self._walk = walk

    def __iter__(self):
        return self._walk(self._root)
        
    def add_point(self, point):
        """Adds a point to the kd-tree.
        
        Parameters
        ----------
        point : array-like
            The point.
        """
        if self._root is None:
            self._root = [None, None, point]
        else:
            self._add_point(self._root, point)

    def get_knn(self, point, k, return_dist_sq=True):
        """Returns k nearest neighbors.
        Parameters
        ----------
        point : array-like
            The point.
        k: int
            The number of nearest neighbors.
        return_dist_sq : boolean
            Whether to return the squared Euclidean distances.
        Returns
        -------
        list<array-like>
            The nearest neighbors.
            If `return_dist_sq` is true, the return will be:
                [(dist_sq, point), ...]
            else:
                [point, ...]
        """
        return self._get_knn(self._root, point, k, return_dist_sq, [])

    def get_nearest(self, point, return_dist_sq=True):
        """Returns the nearest neighbor.
        Parameters
        ----------
        point : array-like
            The point.
        return_dist_sq : boolean
            Whether to return the squared Euclidean distance.
        Returns
        -------
        array-like
            The nearest neighbor.
            If the tree is empty, returns `None`.
            If `return_dist_sq` is true, the return will be:
                (dist_sq, point)
            else:
                point
        """
        l = self._get_knn(self._root, point, 1, return_dist_sq, [])
        return l[0] if len(l) else None
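A hypothetical usage sketch for the class above (the points are made up; three numeric indicators per point are assumed):

# Hypothetical usage: index three numeric indicators, then query neighbors.
points = [[1, 250, 120], [2, 180, 135], [1, 320, 110], [3, 210, 140]]
kd_tree = KDTree(points, dim=3)
kd_tree.add_point([2, 300, 125])
print(kd_tree.get_nearest([1, 260, 118]))     # -> (dist_sq, point)
print(kd_tree.get_knn([1, 260, 118], k=2))    # -> [(dist_sq, point), ...]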

The test code is given below:

import unittest
import random
import cProfile
from kd_tree import *

class KDTreeUnitTest(unittest.TestCase):

    def test_all(self):

        dim = 3

        def dist_sq_func(a, b):
            return sum((x - b[i]) ** 2 for i, x in enumerate(a))

        def get_knn_naive(points, point, k, return_dist_sq=True):
            neighbors = []
            for i, pp in enumerate(points):
                dist_sq = dist_sq_func(point, pp)
                neighbors.append((dist_sq, pp))
            neighbors = sorted(neighbors)[:k]
            return neighbors if return_dist_sq else [n[1] for n in neighbors]

        def get_nearest_naive(points, point, return_dist_sq=True):
            nearest = min(points, key=lambda p:dist_sq_func(p, point))
            if return_dist_sq:
                return (dist_sq_func(nearest, point), nearest)
            return nearest

        def rand_point(dim):
            return [random.uniform(-1, 1) for d in range(dim)]

        points = [rand_point(dim) for x in range(10000)]
        additional_points = [rand_point(dim) for x in range(100)]
        query_points = [rand_point(dim) for x in range(100)]

        kd_tree_results = []
        naive_results = []
        
        global test_and_bench_kd_tree
        global test_and_bench_naive

        def test_and_bench_kd_tree():
            global kd_tree
            kd_tree = KDTree(points, dim)
            for point in additional_points:
                kd_tree.add_point(point)
            kd_tree_results.append(tuple(kd_tree.get_knn([0] * dim, 8)))
            for t in query_points:
                kd_tree_results.append(tuple(kd_tree.get_knn(t, 8)))
            for t in query_points:
                kd_tree_results.append(tuple(kd_tree.get_nearest(t)))

        def test_and_bench_naive():
            all_points = points + additional_points
            naive_results.append(tuple(get_knn_naive(all_points, [0] * dim, 8)))
            for t in query_points:
                naive_results.append(tuple(get_knn_naive(all_points, t, 8)))
            for t in query_points:
                naive_results.append(tuple(get_nearest_naive(all_points, t)))

        print("Running KDTree...")
        cProfile.run("test_and_bench_kd_tree()")
        
        print("Running naive version...")
        cProfile.run("test_and_bench_naive()")

        print("Query results same as naive version?: {}"
            .format(kd_tree_results == naive_results))
        
        self.assertEqual(kd_tree_results, naive_results,
            "Query results mismatch")
        
        self.assertEqual(len(list(kd_tree)), len(points) + len(additional_points),
            "Number of points from iterator mismatch")

if __name__ == '__main__':
    unittest.main()

Reference article:

GitHub – Vectorized/Python-KD-Tree: A simple and fast KD-tree for points in Python for kNN or nearest points (damn short at just ~60 lines). No libraries needed.
