# Predicting image data with the SRS-12 algorithm. Applying random projection-based c-approximate nearest-neighbor queries to image data

Bachelor Thesis 2015 41 Pages

## Excerpt

## Contents

1 Introduction

1.1 Motivation

1.2 Problem Statement

1.3 Objectives

1.4 Structure of this Thesis

2 A brief history and overview of Approximate Nearest-Neighbor Queries

2.1 Improving Index Structures

2.2 Softening Requirements for Accuracy

2.3 Further Related Work

3 An Introduction to SRS-12

3.1 Why SRS-12 is worth a consideration

3.2 SRS-12 in a Nutshell

3.3 Variants of SRS-12

3.4 Indexing

3.5 Normal Stopping Condition of the SRS-12 Algorithm

4 Implementation & Experimental Evaluation

4.1 Implementation Details for the Computation of Parameter Values .

4.2 Datasets

4.3 Veri cation on SIFT1M Dataset

4.4 Block Matching

4.4.1 A Brief Introduction to Block Matching

4.4.2 Methodology

4.4.3 Parameter Settings

4.4.4 Computation Time

4.4.5 Conclusions

4.4.6 Summary - Recommended Parameter Settings

4.5 Computation Time of SRS-12 vs Exact Nearest-Neighbor Search

4.5.1 Possible Optimizations

5 Future Work

5.1 Image Prediction

5.2 Reducing Data through Rapid Object Detection and Background Removal . . .

5.3 Optical Flow

5.4 Increasing the Accuracy of the SRS-12 Algorithm

6 Discussion

Bibliography

## 1 Introduction

### 1.1 Motivation

Finding similar points, or points with a given set of attributes, is of fundamental necessity in the wide variety of sciences and economic areas. To name but a few examples it is applied in astronomy where stars are examined for similarities, life sciences where cells and RNA strings are matched and even in cars to recognize objects that are within its trajectory. Especially in recent years, with the amounts of processable data growing at an unprecedented pace in many economic and scienti c areas, the need for e cient methods and algorithms to process this data is greater than ever. In many cases the processing consists of retrieving points similar to an input data point or clustering data in order to search for patterns. In both cases some sort of similarity search is necessary. After decades of research in this eld there are plenty of methods available, however the classical straightforward nearest-neighbor search is still of great importance because it outperforms many other methods once point dimensions are su ciently high.

At its core, the problem as well as its solution propose not much of a challenge in terms of intricacy. In the pre-processing, the given attributes are being transformed into vector form and also a distance metric is established to enable a comparison of distances between vectors. The actual nearest-neighbor search is rather simple: The vector entries of a query point are being compared with the respective vector entries of all data points according to the distance metric. The data point with the closest distance to the query point is the nearest neighbor as per de nition.

While the naive approach for nding the nearest neighbor to a given query point, a simple linear scan of all points in the dataset, is almost trivial, its complexity is O(d · n) with d being the dimension of the point and n being the number of points in the dataset. Once we go one step further and expand the problem to nding all pairs of nearest-neighbor points in a given dataset, the complexity rises to O(d · n^{2} ). On a sidenote, it is worthwhile to notice that while there is a signi cant di erence between the complexities of nding a given query points neighbor and nding two similar points, the growth in complexity for nding a set of k similar points is linear in k if some straightforward extensions and optimizations are applied. This is why most of the papers published in this eld focus on nding the nearest or approximate nearest-neighbor and leave the extension to the k nearest neighbors up to subsequent work. We proceed in this fashion as well.

With the complexity from above in mind, it is usually a problem when point dimensions are high. The reason is twofold: On the one hand, for large datasets that have sizes of several gigabytes or even terabytes and therefore millions or even billions of points, brute force scans are not longer a feasible method for nding similar points in most applications because large numbers of points need to be queried. On the other hand when indexing structures are used, their size usually increases rapidly with increasing dimension as we will show in the following. The concept of an index structure is to create some sort of data structure that allows for faster retrieval of points in the original dataset. Over the past decades, a variety of indexing ap- proaches were presented. While structures like the R-tree presented by [Guttman, 1984] work well for low-dimensional points, due to the overlap in partitioned space the required space for higher-dimensional points grows rapidly. As shown in figure 2 in [Berchtold et al., 2001], the overlap in space exceeds 90% already in dimension 5 on real data. While [Berchtold et al., 2001] were able to achieve a speed-up of over factor 20 for 14-dimensional points that are partitioned with an R-tree, as shown infigure 11 the slope is attening signi cantly between dimension 10 and dimension 14 and it is close to constant for dimensions greater than 20. So for higher- dimensional points a speedup by factor 20 is still achieved, however in many use cases the dimension is in the hundreds, e ectively canceling out the speedup-e ect since the dimension increases further and further, but the speedup stays nearly constant for all dimensions greater than 20.

Nowadays most current indexing methods that are suitable for c-approximate nearest-neighbor queries have index sizes that are at best linear in n. For example in [Sun et al., 2014] the com- parison index of the C2LSH, a state-of-the-art method for locality-sensitive hashing presented by [Gan et al., 2012], still by far exceeds the capacities of the computer being used for the experiments. The SIFT1B dataset used for testing has a size of 92GB, yet its C2LSH-index would have had a size of over 800GB (stated as a conservative estimate by [Sun et al., 2014]).

So at this point the current potential for faster search with better indexing structures is mostly exhausted and we introduce a second approach: trading accuracy for speed. The basic idea is that instead of nding the exact nearest-neighbor of a given point, a neighbor in the proximity of said point is found. The most widely adopted method for this is locality-sensitive hashing, or LSH. It was introduced by [Indyk and Motwani, 1998] and spread quickly due to its good performance and theoretical guarantees. Conceptually, several hashing functions are chosen in a way such that points in proximity to each other will be hashed into the same bucket. Upon querying, the query point is hashed and subsequently only points hashed in the same bucket as the query point need to be checked for their distance. The problem is that with increasing point dimension, a phenomenon often referred to as the "curse of dimensionality" a icts the index sizes. Intuitively speaking, if the bucket size is held constant, the number of buckets, and therefore the index size, increases exponentially with the dimension d. This can be counteracted by simply increasing the bucket size. However, obviously this imposes a loss in precision. Also for high dimensions, even for fairly inaccurate approximation ratios, index sizes quickly become infeasible large as shown in [Sun et al., 2014], table 5. Even the most recent im- provements in LSH-based methods such as [Gan et al., 2012] still require very large index sizes.

If a reduction of memory requirement down to what is typically available on a regular com- modity PC could be achieved while maintaining good query speed and theoretical guarantees, it would not only imply much lower hardware costs due to reduced hardware requirements and therefore make the application much more available to a large audience of researchers as well as businesses, but also enable approximate nearest-neighbor queries on datasets that were too large to be processed until now. Subsequently this would represent another step towards the rise of a wide range of applications in real-world scenarios such as real-time image prediction.

### 1.2 Problem Statement

We will follow [Sun et al., 2014] in their formal de nition of the nearest-neighbor problem and expand it by some formalities: For a given query point q ∈ R^{d} and a dataset D ∈ R^{d}, we define o ∈ D as the nearest-neighbor of q if it satisfies:

illustration not visible in this excerpt

with dist(p1, p2) being the Euclidean distance between p_{1} ∈ R^{d} and p_{2} ∈ R^{d}. A point oc ∈ D is a c-approximate nearest-neighbor of q if it satis es:

illustration not visible in this excerpt

with o ∈ D being the nearest neighbor of q.

The SRS-12 algorithm nds a c-approximate nearest neighbor as described above with proba- bility pτ . The problem is that it is unknown how well the algorithm performs on image data in terms of the frequency of returning the actual nearest-neighbor. Similarly, it is also not known with what ratio it approximates the exact nearest-neighbor in case it returns approxi- mate nearest-neighbors.

### 1.3 Objectives

The work presented in this thesis is aimed towards verifying the claims of [Sun et al., 2014] and applying them to a real-world dataset in order to gain insights into how well the SRS-12 algorithm works for queries on image data. We strive to lay the ground for future research that is aimed towards performing image prediction using the SRS-12 algorithm. In particular the objectives were:

1. Implementing the SRS-12 algorithm presented by [Sun et al., 2014].

2. Verifying that the implementation is correct by reproducing the results of [Sun et al., 2014].

3. Finding out whether processing image data with SRS-12 yields meaningful results.

4. In case SRS-12 returns useful results for image data: determine the quality of the output.

In order to reach these goals we implement the SRS-12 algorithm in C++, verify the implementation on the SIFT1M dataset and subsequently perform block matching in order to establish a baseline for the quality of image predictions made by SRS-12.

Speci ally, our achievements in this thesis are:

- Implementing the SRS-12 algorithm presented by [Sun et al., 2014] as well as an exact nearest-neighbor search that supports comparison of image indexes.

- Running multiple veri cation experiments to establish a guarantee that the implementa- tion works.

- Performing an extensive experimental evaluation to determine the best parameters for SRS-12.

- Comparing the computation time of SRS-12 with a simple exhaustive nearest-neighbor search both on the experimental as well as on the theoretical side.

- Offering promising approaches for future research.

### 1.4 Structure of this Thesis

In this thesis we rst give an introduction to the work presented by [Sun et al., 2014]. More speci cally, in section 3 we present the SRS-12 algorithm, its variants and the underlying theoretical basis in terms of its functioning and index. Finally, we introduce the core of the algorithm - its stopping condition. Section 4 introduces the details of the implementation that are not self-explanatory given the contents of [Sun et al., 2014] and describe the experimental setup for the veri cation as well as quality measurements of the implementation of the SRS- 12 algorithm. Also we give a detailed overview and interpretation of the results we obtained in the experiments and provide speci c recommendations on how to achieve a good trade-o between accuracy and speed when searching for approximate nearest-neighbors in image data with SRS-12. Further we suggest possible enhancements for the block matching procedure using the SRS-12 algorithm. Section 5 presents an outlook on future research prospects that at at improving results when applying the SRS-12 algorithm to image predictions. Eventually section 6 concludes with a discussion of the main findings.

## 2 A brief history and overview of Approximate Nearest-Neighbor Queries

In terms of work related to nearest-neighbor queries it is hard to give a complete overview over all the research that has been conducted on the topic due to its profoundness and the wide variety of approaches presented over the years. We therefore limit ourselves to related work in the eld of approximate nearest-neighbor searches and give but a few examples and recommendations for further approaches. This way we provide what we consider an adequately broad overview over the eld of nearest-neighbor searches while providing su cient depth of knowledge about approximate nearest-neighbor searches in order for this thesis to be put in context.

### 2.1 Improving Index Structures

The topic of e ciently performing similarity searches has been an active eld of research for decades. A major breakthrough in terms of increasing query e ciency for nearest-neighbor search was presented in 1975 when the k -d -tree was introduced by [Bentley, 1975]. The au- thors observed an empirical average running time of O(log n) for a nearest-neighbor query. This was proven to only be true for very low-dimensional data later on since the k -d -tree merely divides space and overlaps of the space partitions occur oftentimes. An improvement was introduced by [Guttman, 1984]. They demonstrate that their R-tree data structure o ers numerous advantages over a simple k -d -tree, but as demonstrated in section 1.1, even in ve dimensions the overlap is still over 90%. There were further improvements such as the R*-tree introduced in 1990 by [Beckmann et al., 1990], the SS-tree proposed by [White and Jain, 1996] in 1996 and the SR-tree published in 1997 by [Katayama and Satoh, 1997]. The R*-tree was mainly an enhanced version of the R-tree. The SS-tree is noteworthy, since it applies spherical bounding boxes rather than rectangular ones, which increases performance in nearest-neighbor searches. And nally in 1997 the SR-tree combined spherical and rectangular bounding boxes which led to further volume e ciency. More recent work included the introduction of the afore- mentioned X-tree by [Berchtold et al., 2001]. It provides a substantial reduction of overlapping space in high dimensions, however as discussed in section 1.1 this only holds true for dimensions up to roughly 14.

An common aspect of the approaches above is that they provide performance improvements for low-dimensional point queries, but their performance declines rapidly as the dimension is rising into the tens and twenties. The reason is that they all nd different trade-offs between query time and preprocessing time. Most of the approaches o er a query time that is exponential in d, hence generally the complexity is O(n^{f(n)},), with f (n) being at least linear. In [Meiser, 1993], a query time of O(d^{5} log n) is obtained, but only at a cost of O(n^{d+δ} ) for preprocessing. To put this in perspective: for a dataset such as the SIFT1M with one million points and dimension d = 128, preprocessing cost is of the order of 10^{134} and thus far from feasible.

### 2.2 Softening Requirements for Accuracy

In 1998 the aforementioned LSH, short for locality-sensitive hashing, was presented by [Indyk and Motwani, 1998] as a novel approach for approximate nearest-neighbor search that inherently sacri ced some precision on account of speed. The idea of trading accuracy for speed was not new at the time, but the methodology was. A thorough overview and analysis of contemporary methods was presented by [Weber et al., 1998]. The seminal pa- per by [Kushilevitz et al., 2000] employed random projections as hashing functions and ex- ploited their distance-preserving nature - if two points are close to each other, there is a high probability they will collide upon hashing. Various other LSH-based approaches follows, but for the sake of brevity we only present three more recent publications. The rst two, [Bahmani et al., 2012] and [Gan et al., 2012] are noteworthy steps towards decreasing the in- dexing size as well as computation time. The former builds upon an entropy-based approach presented by [Panigrahy, 2006], the latter uses collision counting as a probabilistic approach to determining good candidates for c-ANN search.

[Panigrahy, 2006] exploits the fact that the entropy of the hash value of a given point conveys information about the number of randomly chosen points in its neighborhood. This results in a reduced number of required hash tables. In [Bahmani et al., 2012] this entropy-based approach is picked up and combined with a MapReduce approach resulting in lower network cost due to the smaller amount of hash tables used.

In [Gan et al., 2012] a so-called LSB-forest is built by using multiple locality sensitive B-trees, or LSB trees. The LSB-forest enables fast querying, however this comes at the cost under which most of the previous work su ers: an increased index size that is superlinear in n. One of the most recent enhancements was presented by [Andoni et al., 2014]. The authors im- prove the LSH-scheme by augmenting the LSH functions with a center point around which the hashing is performed. This approach leads to a query time polynomial in n and superlinear in d. However, this is achieved by building multiple index sizes with di erent distance thresholds, thereby e ectively resulting in very large index sizes.

### 2.3 Further Related Work

Due to the profoundness of the nearest-neighbor search problem and its various applications there has been a variety of di erent approaches over the decades. In order to give an idea about how broad the spectrum of potential approaches and applications of the nearest-neighbor search is, we are going to give a brief insight into clustering approaches and additionally present two pa- pers we deem interesting. However, we omit the details due to the little relevance the presented approaches have for our work. Nevertheless, the relevance of alternative, seemingly unreward- ing methods should not be underestimated. For instance, the very idea that the works in this thesis build upon, namely using random projections to reduce dimensionality by exploiting the similarity-preserving nature of projections with a know distribution, was already proposed by [Indyk and Motwani, 1998]. However, this approach was not pursued much further at the time.

One of the main approaches besides locality-sensitive hashing is vector quantization through clustering. [Jain et al., 1999] provide a good summary over clustering approaches. While the seminal paper has been published fteen years ago and there has been more recent work in the eld of clustering related to approximate nearest-neighbor queries, due to its comprehen- siveness it has been cited in a remarkable amount of subsequent related work. To give but one example for a typical approach for complexity reduction through clustering,figure 12 and

figure 13 in [Jain et al., 1999] visualize what 2-dimensional clusters might look like. Tradi- tionally, during pre-processing all points are clustered with for instance a k-means clustering algorithm. While recent approaches are more advanced, an introduction to the basic idea is given by [Hartigan, 1975]. The conceptual idea behind k-means clustering is to start with a set of points that subsequently become centroids of clusters by assigning more and more data points to each initial point and moving the initial points after each assignment, so that they represent the centroid of their assigned data points and hence, the cluster. When performing a nearest-neighbor search for a query point, one option is to then look for the closest centroid and perform a nearest-neighbor search only for points belonging to the respective cluster.

[Kuncheva, 1995] proposes an interesting approach that uses a genetic algorithm in order to edit the reference set for the nearest-neighbor search which results in signi cant increases in prediction accuracy. In a fairly recent paper, [Zhang et al., 2006] present a hybrid approach that combines Visual Category Recognition with Support Vector Machines and achieves good results on various datasets such as the MNIST digits^{1}

There are further approaches such as min-hashing, neural networks and self-organizing maps, to name but a few. However, for the sake of brevity these concepts will not be explored in more depth.

Summarizing, there has been active research on how to nd similar points more e ciently for decades. Generally most approaches try to establish trade-o s between accuracy and speed by varying the similarity radius, success probability, I/O cost, indexing size, search complexity, or a combination of these.

## 3 An Introduction to SRS-12

### 3.1 Why SRS-12 is worth a consideration

The novelty of the SRS algorithm lies in its three strengths that have not been combined in a single algorithm up until the time of its publishing:

- Theoretical guarantees

- Linear-sized index

- Arbitrary approximation ratio

In practice, the advantages of SRS-12 are straightforward: it enables the search for approximate nearest-neighbors with the same, or in some instances even better performance as LSH, but without the same, heavy memory usage. The results demonstrate that SRS-12 is actually able to outperform state-of-the-art LSH-based methods while at the same time radically decreasing memory requirements such that it is possible to perform nearest-neighbor searches on a dataset containing one billion high-dimensional points on only a single commodity PC.

### 3.2 SRS-12 in a Nutshell

The SRS-12 algorithm projects a high-dimensional data set and a given query point into a randomly chosen low-dimensional space, searches for the k exact nearest-neighbors of the query point in the low-dimensional space and checks the corresponding points in the high-dimensional space whether they satisfy the early-termination condition. If none of the k examined points satisfy the condition, the point with the smallest distance is returned. The distribution of the projected points, which is implied by the distribution of the matrix entries of the random projection matrix, is then exploited in order to estimate how likely a neighbor of a given query point in the projected space is a c-approximate nearest-neighbor of the query point in the original space.

The algorithm takes the following parameters:

- T - the maximum number of points to be accessed in the projected dataset

- c - the approximation ratio

- pτ - the desired success probability for nding a c-approximate nearest-neighbor

The experiments presented in this paper allow for an examination of the in uence of these parameters on the quality of the output.

### 3.3 Variants of SRS-12

There are three variants of the SRS-12 algorithm that allow for more precise control of parameter alteration. Ultimately they are not distinct algorithms, but rather variants of the SRS-12 algorithm that specify which parameters are used for computations and to what extent. We will briefly present them for the sake of completeness, but not run any experiments with them because our focus is on how well SRS-12 performs for image data in general. Since the scope of this thesis is to evaluate the performance of the SRS-12 algorithm when applied to image data in general, we run extensive tests for a variety of parameter settings and leave special cases such as the variants for future work. For a detailed explanation of the three variants we refer to [Sun et al., 2014] and only present them briefly:

- SRS-1 does not apply the early termination condition and therefore accesses all of the points passed to it.

- SRS-2 examines not only maxP ts, but all points in the low-dimensional dataset and can additionally take any success probability p_{τ} and pass it on to the threshold parameter which e ectively overwrite the computed success probability p_{τ}.

- SRS-12+ takes an additional input value c′ ∈ [1, c) which is passed to the incSearch function in order to potentially increase the success probability to p′_{τ}in case the algorithm stops due to the early termination condition.

As mentioned above, the SRS-12 algorithm creates the index by projecting the dataset onto a randomly chosen lower-dimensional subspace. The projection dimension is automatically determined by the algorithm depending on the other parameters. However, it is also possible to set it manually. In the following we will briefly discuss the index size and the theoretical foundation of the algorithm that builds upon p-stable projections and knowledge of the distri- bution of the projected points.

### 3.4 Indexing

Index Size

While the index size is of fundamental importance for most real-world applications due to the large amounts of data being processed, we chose not to verify the claims of the authors regarding the index size because it is apparent that the size of the projected dataset is just [Abbildung in dieser Leseprobe nicht enthalten]with mbeingtheprojectiondimensionplusthesizeoftheindexingstructure.Sincemis usually smaller than 10 and the dim ension is 50,100or even higher, the projected dataset is likely to t in the main memory even for large datasets such as the SIFT1B dataset with one billion points. It has a size of 92 GB and dimension 128, so by reducing the dimension down to e.g. m = 10, the size of the projected dataset would be roughly [Abbildung in dieser Leseprobe nicht enthalten]*92 GB = 0.72 GB.

For this dimension the indexing structure is reasonably small as shown by [Guttman, 1984]. Infigures 4-10 they report the space requirements of an R-tree for up to 5,000 data points. The difference between the sizes an index for 5,000 points and 1,000 points with dimension 10

**[...]**

^{1} http://yann.lecun.com/exdb/mnist/

## Details

- Pages
- 41
- Year
- 2015
- ISBN (eBook)
- 9783668033542
- ISBN (Book)
- 9783668033559
- File size
- 2.2 MB
- Language
- English
- Catalog Number
- v305214
- Institution / College
- University of Heidelberg – Computer Science
- Grade
- 1.0
- Tags
- Machine Learning Big Data Image Prediction Predictive Analytics Nearest Neighbors Image Recognition SRS-12