
Motion Tracking in Video using the Best Feature Extraction Technique

Master's Thesis 2009 92 Pages

Computer Science - Applied

Excerpt

Table of Contents

1 Chapter: Introduction

2 Chapter: Background
2.1.1 Principal Issues

3 Chapter: Literature Review

4 Chapter: Algorithms for Feature Extraction
4.1.1 SIFT
4.1.2 SURF
4.2 Matching Algorithms
4.2.1 Nearest Neighbor Algorithm
4.2.2 RANSAC
4.2.3 Hough Transform
4.3 Testing Methods

5 Chapter: Performance Evaluation
5.1 Evaluation Data
5.1.1 Code Implementation
5.1.2 RANSAC vs Hough Transform
5.1.3 SIFT vs. SURF
5.1.4 SIFT vs SURF Conclusion

6 Chapter: Application on real world video sequences
6.1.1 Proposed Method for Direction Estimation of a Moving Vehicle
6.1.2 Real World Videos for Experiment:
6.1.3 Experimental Setup
6.1.4 Verification of Results
6.1.5 Results and discussion
6.1.6 Application’s appraisal

7 Chapter: Conclusion and Future Work

References

Appendix A: Graphs Comparing RANSAC and Hough Transform

List of Tables

Table 1: Feature detectors

Table 2: Local descriptors

Table 3: Z-Score table

Table 4: McNemar's Test for Hough and RANSAC using SIFT features

Table 5: McNemar's Test for Hough and RANSAC using SURF features

Table 6: Algorithm’s parameter values

Table 7: McNemar's Test for Blurred images

Table 8: McNemar's Test for Compressed images

Table 9: McNemar's Test for Images with change in Illumination

Table 10: McNemar's Test for Images with change in view point

Table 11: McNemar's Test for Images with change in view point

Table 12: McNemar's Test for Zoomed and Rotated Images

Table 13: McNemar's Test for Zoomed and Rotated Images

Table 14: Parameter values used for application

Table 15: Real world video data

Table 16: Camera Specifications

Table 17: Video # 1 Results verification

Table 18: Video # 2 Results verification

Table 19: Video # 3 Results verification

Table 20: Video # 4 Results verification

Table 21: Video # 5 Results verification

Table 22: Video # 6 Results verification

Table 23: Video # 7 Results verification

List of Figures

Figure 1: Scale Space (Lowe, 2004)

Figure 2: Extrema Detection (Lowe, 2004)

Figure 3: SIFT Key Point Descriptor (Lowe, 2004)

Figure 4: Original image

Figure 5: Integral image

Figure 6: Haar Wavelet Filter for x and y direction

Figure 7: SURF Descriptor (Bay, Ess, Tuytelaars, & Luc, 2006)

Figure 8: ROC Curves

Figure 9: Datasets and their description (a)

Figure 10: Datasets and their description (b)

Figure 11: Graphs comparing Number of Matches by Hough Transform and RANSAC

Figure 12: ROC Curves: Hough Transform and RANSAC matching Graffiti, Boat and UBC Images

Figure 13: ROC Curves: Hough Transform and RANSAC matching Bike and Leuven Images

Figure 14: Graphs comparing number of matched feature points detected by SIFT and SURF

Figure 15: Graphs comparing SIFT and SURF features for Bikes dataset

Figure 16: Graphs comparing SIFT and SURF features for UBC dataset

Figure 17: Graphs comparing SIFT and SURF features for Leuven dataset

Figure 18: Graphs comparing SIFT and SURF features for Graffiti dataset

Figure 19: Graphs comparing SIFT and SURF features for Wall dataset

Figure 20: Graphs comparing SIFT and SURF features for Boat dataset

Figure 21: Graphs comparing SIFT and SURF features for Bark dataset

Figure 22: Direction detection in two sample frames

Figure 23: Frames for video # 1 (Car moving on straight road)

Figure 24: Graphs indicating Left and Right motion of a moving vehicle

Figure 25: Frames from video # 2 (Car Moving Up and Down the Hill)

Figure 26: Graphs showing Vehicle's motion in Left, Right, Up and Down direction

Figure 27: Sample frames from indoor video

Figure 28: Graph showing left motion in Indoor video sequence

Figure 29: Sample frames from outdoor sequence

Figure 30: Graph showing left movement in outdoor sequence

Figure 31: Aircraft Taking-off Video Sequence

Figure 32: Graphs presenting motion detection of an aircraft taking-off

Figure 33: Aircraft Flying over Trees

Figure 34: Graphs presenting Upward and Straight Motion of an aerial vehicle

Figure 35: Air to Air Video Sequence

Figure 36: Graphs presenting Left, right, up and down movement of an aircraft

Abstract

This project explores the use of local features for motion tracking to estimate the direction of a moving object. The availability of a number of feature extraction and matching algorithms in the literature makes it difficult to select any one of them. Therefore, it was considered appropriate to assess the suitability of a technique for a particular application before actually putting it into use. The project begins with a comparative study that analyzes two state-of-the-art techniques for feature extraction (SIFT and SURF) along with two of the best known matching algorithms (RANSAC and Hough Transform). The performance evaluation focuses on measuring the correctness of the algorithms for tracking applications. The statistical McNemar's test has been applied to find the more efficient method for feature extraction and matching. Using the results obtained from this analysis, a combination of feature extractor and matching technique has been employed to estimate the direction of a moving object from videos captured using a handheld camera as well as a camera fixed on a moving vehicle. The proposed method is capable of detecting left, right, up, and down movements with reasonable accuracy in real world videos.

The results are not one hundred percent accurate but are encouraging enough for further investigation. The system is capable of identifying the direction of the moving object with more than 90% accuracy if the object changes its direction independently of the surroundings, and with less than 30% accuracy otherwise.

This Project dissertation is in accordance with Examination Regulations 6.12 and 6.13.

Acknowledgements

I am pleased to thank Dr. Adrain Clark, whose endless support and guidance were available to me throughout this project. I acknowledge his personal efforts in collecting real world videos for the application part of this project. I would like to thank Mr. Shoaib Ehsan for valuable discussions on the topic. I would like to express my heartfelt gratitude to my husband and my immediate family, who have been a constant source of love and support. I am also indebted to my colleagues at Lahore College for Women University for their well wishes.

1 Chapter: Introduction

Today, the focus of research is not only to develop sophisticated hardware to capture the best quality images and videos but also to process this information efficiently and use it purposefully. Recent advances in object detection, recognition and tracking have supported many vision applications such as “Human Behavior Analysis through videos”, “Security Surveillance Systems”, “Recognition Systems to facilitate people with disabilities”, “Image Mosaicing”, “Image Rendering” and many more. The performance of these applications depends on accurately characterizing the image data into classes representing specific properties to facilitate matching and recognition tasks.

Image points that have some property distinguishing them from their neighborhood can be collected as interest points or features of an image. This property can be information about colour, gradient, contrast etc. An interest point is only useful if it can be matched between images under occlusion, clutter and varying imaging conditions. Therefore the goal of any vision algorithm is to find, efficiently and accurately, those features which are insensitive to noise, illumination and other transformations.

Different approaches for detecting and matching objects and images with a set of other images have been proposed in the literature, such as optical flow, image segmentation and region based methods, but techniques based on local features are the most popular for detection, recognition and tracking applications. Feature extraction is mainly a two step process: detection and description of an interest point. Detection means locating image points with some distinguishable property, while description captures information (such as derivatives) about the neighborhood of each point, which provides a means of establishing point to point correspondences and improves matching results. Both of these steps play a vital role in the performance of the algorithm. The efficiency and accuracy of a technique lie in accurately detecting feature points and efficiently generating their descriptors.

There are a number of techniques developed for feature detection and description; some well known ones are Harris-Laplace, Hessian-Laplace, Steerable filters, Spin Images, SIFT and SURF. These techniques can be characterized as good or bad based on how capable they are of detecting stable features in images with noise, illumination changes and affine transformations. All of these describe features differently and, as reported in (Mikolajczyk & Schmid, October 2005), region based descriptors appear to give better performance. Therefore, scale and rotation invariant feature extraction algorithms like SIFT (Scale Invariant Feature Transform), SURF (Speeded-Up Robust Features), Harris-Laplace and Hessian-Laplace are considered good for applications dealing with real images/videos. However, SIFT and SURF are considered state-of-the-art and are the most frequently cited in the literature so far. The features extracted by these techniques are robust to a significant range of affine distortion, changes in illumination, 3-D viewpoint changes and noise (Lowe, 2004). Therefore, I have selected them to carry out a detailed comparative study.

Feature detection and description without matching is incomplete for any vision application; therefore, we also need an efficient method to associate related points in a pair of images. In the literature, there are two widely used matching algorithms, known as RANSAC and Hough Transform. Just like other vision techniques, the performance of matching algorithms varies with the application and, therefore, no technique can be characterized as best (in terms of accuracy and efficiency). The purpose of analyzing the performance of the detector and descriptor along with the matching algorithms is to find a complete set of techniques with good performance for recognition or tracking applications.

The remaining part of this report explains the background of the project (chapter 2) and gives an overview of the work already done in this field (chapter 3), followed by an introduction to the techniques (for feature extraction and matching) and the testing methods used for evaluation (chapter 4). The next part presents the performance evaluation results and their discussion (chapter 5), and the application of the best feature extraction and matching technique for tracking the motion of a moving object (chapter 6). The report ends with conclusions and future work (chapter 7).

2 Chapter: Background

Image and video analysis has had a great influence on most of the research work going on in computer vision these days. This is due to its application in a vast variety of areas, such as entertainment (development of software to mimic human actions on a virtual actor) and sports (development of robotic players for giving practice sessions to human players, such as a football player, tennis ball tracker, badminton player etc). Different kinds of automatic vehicles can be developed which can move autonomously in hazardous situations and difficult areas. In medical science, imaging can be very helpful for locating specific cells associated with a particular disease in the human body, and it helps doctors with their diagnoses and surgeries. Similarly, one cannot deny the importance of efficient analysis of human behavior for security surveillance systems that monitor high risk areas, in virtual reality to allow the user to interact with the computer-simulated environment, in robotics to allow machines to interact with humans and assist them in their day to day life (for example as attendants for elderly people and patients), and last but not least in intelligent machines for people with disabilities, such as wearable glasses that track objects for blind people. All these applications require efficient object detection and tracking methods which can provide real time results with great accuracy.

The application developed for this project is not new; however, it is unique in the sense that no sensors other than a camera have been used to get real time data, and the camera is allowed to move along with the object/vehicle instead of having a fixed position. The basic idea is to use local image information from frame to frame to identify the direction of a moving object carrying a camera. It is, therefore, different from object detection or tracking using a fixed camera. The purpose of this activity is to assess the suitability of local feature matching methods for applications such as a control system for an unmanned vehicle, which can be an automobile or an airplane.

2.1.1 Principal Issues

Matching two images and finding exact correspondences between their feature points is the most challenging task, but also the most fundamental one, for any recognition and tracking application. The prime objective of this application is to extract local information from images that is sufficient to give information regarding a moving object’s pose.

Researchers in the robotics and vision community are presenting solutions to similar problems. Lowe, the creator of the SIFT technique, along with his colleagues used SIFT features to identify natural landmarks to build SLAM. This method appeared to be a better solution than finding artificial landmarks (Stephen, Lowe, & little, October, 2002). The use of local gradient information of video frames to estimate the global position of an object has been a great motivation. Sim and Dudek have also suggested the use of landmarks for pose estimation (Sim & Dudek, September 1999.). These two applications showed that local feature based methods are a better approach to finding the global position and pose of a mobile robot. Therefore, estimating the direction of a moving object using the same technique seems quite feasible.

Another issue is the selection of the feature detector and descriptor. A number of local feature descriptors have been proposed in the literature. An evaluation of these is given in (Mikolajczyk & Schmid, October 2005). The problem with this analysis is that it cannot tell in unequivocal terms whether any algorithm is superior to the others for matching features between images with affine transformations. The study also lacks error bars that would clearly show how reliable the reported results are. The number of correct matches does not make an algorithm the best unless it is accompanied by the number of false matches as well as the true and false negatives. The referenced paper mentions that it is difficult to identify false matches and hence also difficult to find the number of true and false negatives. Therefore, this issue has been taken up in this study. Another reason for revisiting this kind of evaluation is to include the more recent technique SURF, which claims to perform as well as or better than SIFT or any other distribution based descriptor. The SURF descriptor stores the Haar wavelet responses of the region surrounding the pixel in 64 or 128 bins. Bay et al. themselves analyzed the performance of SURF-128 and SURF-64 and found SURF-64 to give efficient and accurate results.

After SIFT was acknowledged as the best performer, the focus of research in the vision community shifted to comparing SIFT with SURF. This is because the claim that SURF is better as well as more time efficient has grabbed the attention of many scientists. The work done by (Danny, Shane, & Enrico) on comparing local descriptors for image registration and the comparison of different implementations of local descriptors by (Bauer, Sunderhauf, & Protzel, 2007) are a good starting point for the study to be undertaken. Valgren and Lilienthal have also evaluated the performance of the SIFT and SURF algorithms in outdoor scenes for robot localization (Valgren & Lilienthal). According to their study, SURF gives results comparable to SIFT but with lower computational time. None of this comparison and evaluation work carries substantial statistical proof of the results claimed, and therefore a detailed investigation is needed of the behavior of the two state-of-the-art algorithms under different imaging conditions and when matching between images with affine transformations.

Most of the research papers referred to above used precision-recall curves to compare the performance of the local descriptors. The approach is not wrong, but other statistical tools can also be applied for an in-depth analysis of the behavior of the techniques. For example, ROC curves can be used to compare two or more algorithms and establish the superiority of one technique over the other. A statistical test such as McNemar’s test can also be applied to accurately compare the behavior of two techniques under similar conditions. The details of this test are provided in chapter 4.
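As a hedged illustration of the kind of paired comparison meant here, the sketch below computes the Z-score of McNemar's test in its commonly used continuity-corrected form. The function name and the example counts are hypothetical, and the exact formulation used in chapter 4 of this thesis may differ.

```python
import math

def mcnemar_z(n_sf, n_fs):
    """Z-score of McNemar's test (continuity-corrected form, assumed here).

    n_sf: number of test cases where algorithm A succeeded and B failed.
    n_fs: number of test cases where algorithm A failed and B succeeded.
    Cases where both succeed or both fail do not enter the statistic.
    """
    discordant = n_sf + n_fs
    if discordant == 0:
        return 0.0  # the two algorithms never disagree
    return (abs(n_sf - n_fs) - 1) / math.sqrt(discordant)

# Hypothetical counts, for illustration only.
print(mcnemar_z(n_sf=30, n_fs=10))
```

A larger Z-score indicates stronger evidence that the two algorithms really differ; the score is then looked up in a table of Z values to obtain a confidence level.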

3 Chapter: Literature Review

Detection and recognition tasks have been the focus of research for more than two decades. It would not be out of place to say that it is getting mature now and researchers have produced impressive algorithms to solve these issues, some of which are considered landmark in computer vision and image processing. The main approaches for object detection and tracking can broadly be divided into three categories i.e. Top-down approaches, bottom-up approaches and combination of the two (Wang, Jianbo, Gang, & I-fan, 2007).

In top-down approaches, the image as a whole is taken as one region and then segmented into sub regions to find specific object. This model is inspired by pattern matching technique in human beings to identify and locate objects (Oliva, Antonio, & Monica, Sept, 2003) but it is important to say that humans also use their knowledge of object classes which helps them to identify different types of objects belonging to the same class such as different types of cups, pens, humans and animals, which machines lack. To achieve this intelligence, software needs two types of information, the image and the object itself. The object can be modeled using some representation method and image information can be collected using methods like Colour Histogram, Wavelet Histograms etc. Then segmentation of the image into sub-regions based on these colours or wavelet histograms helps to identify the one that holds the object. The difficulty with this approach is that it does not give importance to object’s appearance or its local features and therefore fails to differentiate between different objects of the same class and so can result in increased false matches.

Alternatively, the bottom-up approach focuses on finding object features like edges, corners, segments and colours. Algorithms then use these features, or groups of them, to correspond to the object of interest, which are then matched by evaluating certain mathematical functions such as contour matching (Vittorio, Tuytelaars, & Gool, July, 2006), corner detection, colour histograms etc. The use of features for object detection has been quite popular, but the limitation of bottom-up approaches is that they have to spend more time finding local features and then grouping them to correspond to one object. Therefore, many researchers have combined these two approaches to take advantage of both. A number of algorithms have been developed using this concept.

Before going into the discussion of these algorithms, it is important to discuss the basic techniques developed to identify an image point as feature. An image point can qualify as feature only if it holds the property of uniqueness and repeatability. Identification of features having these two properties can increase the accuracy of detection algorithm but it is a difficult task at the same time especially due to possible change in imaging conditions and viewpoints.

Simple features (‘interest points’) can be extracted from images by applying ‘Binary Thresholding’ to find difference in pixel values which indicates the presence of some interest points or by using complex mathematical functions like 1st / 2nd order derivatives expressed in terms of Gaussian derivative operators or in other words as local maxima or zero-crossings of image pixels in 2nd order derivatives. Using these mathematical methods, edges, corners and blobs can be computed. A brief Summary of features and their detection methods is given in (Table-1). These detectors have been commonly used in all algorithms for object detection and tracking.

Table 1: Feature detectors

illustration not visible in this excerpt

Moravec’s operator was the first known operator for finding ‘interest points’; it marks a point where there is a large intensity variation in every direction. Moravec used square overlapping windows of 4x4 to 8x8 pixels and calculated the sums of squares of differences of adjacent pixels in each of four directions (horizontal, vertical, diagonal and anti-diagonal). He selected the minimum of the four variations as the representative value for the selected window (Moravec, 1977). The ‘interest points’ defined by Moravec were afterwards classified as corner points in an image. Moravec’s intention was to find image points having different properties from their neighboring pixels. The idea was then used by others to find edges and blobs. Edges form the boundary of an object and are therefore the most basic characteristic for defining its shape. In computer vision and image processing, the most common object representation is the contour (based on edges). An edge in an image is a pixel location where there is a strong contrast in intensity. Many researchers have defined ways of finding edges, such as Sobel, Laplace and Canny (Canny, 1986). However, Canny’s technique has been considered the most efficient one. According to his approach, the image is first smoothed to eliminate noise and then the image gradient is used to highlight regions with high spatial derivatives. The algorithm then tracks along these regions and suppresses any pixel that is not at a local maximum (non-maximum suppression). Finally, hysteresis with two threshold values is used to suppress the remaining non-edge pixels: pixels with a value higher than the high threshold are considered edge pixels, pixels below the low threshold are rejected, and pixels in between are kept only if they are connected to an edge pixel.
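The interest measure described for Moravec's operator can be sketched as follows. This is a minimal illustration, not Moravec's original implementation: the window half-size, the toy image and the function name are assumptions, and the caller is expected to keep the window at least one pixel away from the image border.

```python
def moravec_response(img, r, c, half=1):
    """Moravec interest measure at pixel (r, c) of a 2-D list of intensities.

    For each of four shift directions, sum the squared differences between
    the (2*half+1) x (2*half+1) window and the same window shifted by one
    pixel, then return the minimum of the four sums.
    """
    shifts = [(0, 1), (1, 0), (1, 1), (1, -1)]  # horizontal, vertical, two diagonals
    sums = []
    for dr, dc in shifts:
        ssd = 0
        for i in range(r - half, r + half + 1):
            for j in range(c - half, c + half + 1):
                diff = img[i + dr][j + dc] - img[i][j]
                ssd += diff * diff
        sums.append(ssd)
    return min(sums)

# Toy 9x9 image: a bright square occupying the lower-right quadrant.
img = [[10 if i >= 4 and j >= 4 else 0 for j in range(9)] for i in range(9)]
print(moravec_response(img, 4, 4))  # at the corner of the square
print(moravec_response(img, 6, 4))  # on a straight edge
print(moravec_response(img, 2, 2))  # in a flat region
```

Taking the minimum is what makes the measure respond to corners only: along a straight edge there is always one shift direction in which the window does not change.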

The Harris and Stephens corner detector (Harris, Chris, & Stephens, 1988) is an improvement of Moravec’s detector. It finds interest points using derivatives in the ‘x’ and ‘y’ directions to highlight the intensity variations around a pixel in a small neighborhood. This variation is encoded in a matrix of summed products of the derivatives, known as the second moment matrix. An interest point is then identified using the determinant and trace of this matrix, and non-maxima suppression is applied. In brief, it uses gradient information to find corners. The technique is more sound than Moravec’s operator and provides good results, but it has some shortcomings as well. Firstly, it is sensitive to noise due to its dependence on gradient information. Secondly, it performs poorly on junction points and ideally identifies corner points on L junctions only. At the same time, its computational cost is higher than that of other corner detectors.
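A minimal sketch of the Harris and Stephens corner response is given below. It uses central differences for the derivatives and an unweighted window; the window size, the constant k = 0.04 and the toy image are assumptions made for illustration and are not taken from this thesis.

```python
def harris_response(img, r, c, half=1, k=0.04):
    """Corner response R = det(M) - k * trace(M)^2 at pixel (r, c).

    M is the 2x2 matrix of summed derivative products over the window:
    [[sum Ix*Ix, sum Ix*Iy], [sum Ix*Iy, sum Iy*Iy]].
    """
    sxx = syy = sxy = 0.0
    for i in range(r - half, r + half + 1):
        for j in range(c - half, c + half + 1):
            ix = (img[i][j + 1] - img[i][j - 1]) / 2.0  # derivative in x
            iy = (img[i + 1][j] - img[i - 1][j]) / 2.0  # derivative in y
            sxx += ix * ix
            syy += iy * iy
            sxy += ix * iy
    det = sxx * syy - sxy * sxy
    trace = sxx + syy
    return det - k * trace * trace

# Toy 9x9 image: a bright square occupying the lower-right quadrant.
img = [[10 if i >= 4 and j >= 4 else 0 for j in range(9)] for i in range(9)]
print(harris_response(img, 4, 4))  # corner: positive response
print(harris_response(img, 6, 4))  # straight edge: negative response
print(harris_response(img, 2, 2))  # flat region: zero response
```

The sign of the response separates the three cases without computing eigenvalues explicitly: positive for corners, negative for edges and close to zero for flat regions.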

The SUSAN corner detector (Smith & Brady, 1997) uses a circular mask over the pixel to be tested. The approach is to sum the differences in intensity between all pixels in the mask and the pixel under consideration; if that sum is less than the threshold, the pixel value is replaced by the difference between the threshold and the sum, otherwise it is set to ‘0’. The technique can be adapted to find edges as well as corners. If the threshold is large it will detect all edges, and if the threshold is small it will find corners.

FAST (Trajkovic & Hedley, 1998) is another approach to finding local features. It considers the 16 pixels on a circle of radius 3 around the candidate pixel. If ‘n’ contiguous pixels on the circle are all brighter than the central pixel by at least a threshold ‘t’, or all darker than it by at least ‘t’, then the pixel is considered to be a feature point. The best value of ‘n’ can vary; most researchers found a value of ‘9’ to produce good results in terms of corner detection. FAST is more efficient than other corner detectors, but it is quite sensitive to noise along edges and readily identifies diagonal edges as corners.
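The segment test behind FAST can be sketched as below. The circle offsets are the usual 16-pixel discretisation of a radius-3 circle; the threshold t = 20, the run length n = 9 and the function names are assumptions for illustration, and the sketch omits the speed-up tricks that make the real detector fast.

```python
# (row, column) offsets of the 16 pixels on a circle of radius 3, in order.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]

def longest_run(flags):
    """Length of the longest run of True values in a circular list."""
    if all(flags):
        return len(flags)
    best = run = 0
    for f in flags + flags:  # doubling the list handles wrap-around
        run = run + 1 if f else 0
        best = max(best, run)
    return best

def is_fast_corner(img, r, c, t=20, n=9):
    """Segment test: n contiguous circle pixels all brighter or all darker by t."""
    centre = img[r][c]
    ring = [img[r + dr][c + dc] for dr, dc in CIRCLE]
    brighter = [p >= centre + t for p in ring]
    darker = [p <= centre - t for p in ring]
    return longest_run(brighter) >= n or longest_run(darker) >= n

# Toy 9x9 images: a bright quadrant (corner) and a bright half-plane (edge).
corner = [[100 if i >= 4 and j >= 4 else 0 for j in range(9)] for i in range(9)]
edge = [[100 if j <= 4 else 0 for j in range(9)] for i in range(9)]
print(is_fast_corner(corner, 4, 4), is_fast_corner(edge, 4, 4))
```

On a straight edge only about half of the circle differs from the centre, so a run of 9 is not reached, whereas at a corner the differing arc is longer.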

Difference of Gaussian (DoG) is a greyscale image enhancement algorithm in which one blurred image is subtracted from another, less blurred image to find edges and remove noise. The image is blurred with two different blurring factors (Gaussian filters with low and high blurring values) and the two resulting blurred images are then subtracted. The resulting image retains only the points with higher greyscale values than the others, and these represent edges. DoG is also used as a smoothing step by some other feature detectors. The Laplacian operator uses the second derivative to find zero crossings for edge detection (Morse, 1998-2000). Laplacian of Gaussian was the first method used to find blobs in an image. It convolves the image with a Gaussian kernel at a certain scale and then applies a Laplacian operator to identify regions known as blobs: a result less than zero shows the presence of bright blobs and a result greater than zero shows the presence of dark blobs. The problem with second derivative methods is that they are very sensitive to noise.

The Hessian matrix is the matrix of 2nd order partial derivatives of a function and is useful for finding blobs and edges. First, the 2nd order partial derivatives of a window of pixel values are calculated. Then the determinant of this matrix is checked at each point to decide whether to categorize it as a feature point. If the determinant of the Hessian at point ‘x’ is positive, then ‘x’ is a local extremum: a local minimum when the second derivatives are positive and a local maximum when they are negative (both useful for finding blobs). If the determinant is negative, ‘x’ is a saddle point, and a zero value means that the test is indeterminate. These critical points indicate the presence of blobs. The Hessian matrix has been used to develop algorithms for feature point extraction. The determinant of the Hessian matrix has been used as an interest point locator in ‘SURF’ (Bay, Ess, Tuytelaars, & Luc, 2006).
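The standard second derivative test based on the Hessian can be sketched with finite differences as follows. The function names and the 3x3 toy patches are hypothetical; a practical detector would first smooth the image and work at several scales.

```python
def hessian_at(img, r, c):
    """Finite-difference second derivatives (dxx, dyy, dxy) at pixel (r, c)."""
    dxx = img[r][c + 1] - 2 * img[r][c] + img[r][c - 1]
    dyy = img[r + 1][c] - 2 * img[r][c] + img[r - 1][c]
    dxy = (img[r + 1][c + 1] - img[r + 1][c - 1]
           - img[r - 1][c + 1] + img[r - 1][c - 1]) / 4.0
    return dxx, dyy, dxy

def classify(dxx, dyy, dxy):
    """Classify a critical point from the determinant of its Hessian."""
    det = dxx * dyy - dxy * dxy
    if det > 0:
        return "minimum" if dxx > 0 else "maximum"
    if det < 0:
        return "saddle"
    return "indeterminate"

bright_blob = [[0, 1, 0], [1, 4, 1], [0, 1, 0]]
saddle = [[0, 1, 0], [-1, 0, -1], [0, 1, 0]]
print(classify(*hessian_at(bright_blob, 1, 1)))
print(classify(*hessian_at(saddle, 1, 1)))
```

A bright blob is a local maximum of intensity and a dark blob is a local minimum, which is why a positive determinant is the useful case for blob detection.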

Table 2: Local descriptors

illustration not visible in this excerpt

All of the above mentioned detectors are capable of detecting feature points based on some characteristic, but in order to match these points in other images there is still a need for distinguishing data about each feature point which can make it unique and robust to varying imaging conditions. Therefore many descriptors have been developed to store information about the region around a feature point. A descriptor plays an important role in matching because of the amount of information it contains about the interest point and its neighborhood. (Table-2) presents different descriptors proposed in the literature along with the type of data they store.

A detailed performance evaluation of these descriptors has been done by (Mikolajczyk & Schmid, October 2005), which not only explores their capability of producing correct matches but also provides a detailed analysis of the behavior of these descriptors under different image transformations such as scale, rotation, JPEG compression and illumination. It is interesting to see that differential based descriptors had the worst performance for affine transformations, while distribution based descriptors, such as those based on histograms and shape context, performed very well for almost all kinds of transformations.

4 Chapter: Algorithms for Feature Extraction

The goal of a good recognition system is to find image features which are repeatable, distinctive, invariant and efficient to compute and match. The techniques selected for this study follow an almost similar concept of identifying features and describing their neighborhood information. A brief description of the two techniques used in this project, SIFT and SURF, is given below.

4.1.1 SIFT

This technique combines a scale invariant region detector and a descriptor based on a histogram of the gradient distribution around each detected point (Lowe, 2004). The main steps defined for detection and description are as follows.

4.1.1.1 Scale Space Extrema Detection

The purpose of this step is to find interest points which are stable across all possible scales using scale space. The scale space is generated by convolving the input image with a variable-scale Gaussian function.

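$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

where L is the scale space, I is the input image, * is the convolution operation in x and y, and G is the Gaussian function

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^{2}} e^{-(x^{2} + y^{2}) / (2\sigma^{2})}$$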

The Laplacian of Gaussian is a computationally expensive operation; a more efficient alternative is to compute the Difference-of-Gaussian of smoothed images, which reduces the operation to a subtraction of two images.

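$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma)$$

where L(x, y, σ) denotes the image smoothed with a Gaussian of scale σ and k is the constant multiplicative factor between two adjacent scales.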

A scale space contains at least four octaves with 3 intervals (3 samples per octave), and further octaves can be generated by down-sampling the Gaussian image by a factor of 2.
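The construction of one octave of Difference-of-Gaussian images can be sketched as follows. This is a slow, pure Python illustration under stated assumptions: the base scale of 1.6 and the use of intervals + 3 Gaussian images per octave follow Lowe (2004) rather than this thesis, borders are handled by replication, and all function names are hypothetical.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1-D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    width = len(img[0])
    out = []
    for row in img:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma):
    """Separable 2-D Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def dog_octave(img, sigma=1.6, intervals=3):
    """Difference-of-Gaussian images for one octave."""
    k = 2 ** (1.0 / intervals)  # multiplicative step between adjacent scales
    blurred = [gaussian_blur(img, sigma * k ** i) for i in range(intervals + 3)]
    dogs = []
    for lower, upper in zip(blurred, blurred[1:]):
        dogs.append([[u - l for u, l in zip(row_u, row_l)]
                     for row_u, row_l in zip(upper, lower)])
    return dogs

# A single bright pixel in a 15x15 image.
img = [[0.0] * 15 for _ in range(15)]
img[7][7] = 1.0
dogs = dog_octave(img)
print(len(dogs), dogs[0][7][7])
```

The next octave would be obtained by keeping every second pixel of the Gaussian image that has twice the base scale and repeating the process.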

illustration not visible in this excerpt

Figure 1: Scale Space (Lowe, 2004)

Next is to find the interest point from this scale space. In every octave each image point is compared with its 26 neighborhood points in three dimensional space (x,y and scale) and it qualifies as distinctive pixel if it has maximum or minimum value with respect to all 26 neighbours.

illustration not visible in this excerpt

Figure 2: Extrema Detection (Lowe, 2004)
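The comparison of a sample point with its 26 neighbours can be sketched as follows. The data layout (a list of DoG images indexed as dogs[scale][row][column]) and the function name are assumptions for illustration.

```python
def is_extremum(dogs, s, r, c):
    """True if dogs[s][r][c] is larger or smaller than all 26 neighbours.

    The neighbours are the 8 surrounding pixels at the same scale plus the
    9 pixels at the scale above and the 9 pixels at the scale below.
    """
    value = dogs[s][r][c]
    neighbours = [
        dogs[s + ds][r + dr][c + dc]
        for ds in (-1, 0, 1)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (ds, dr, dc) != (0, 0, 0)
    ]
    return value > max(neighbours) or value < min(neighbours)

# Three 3x3 DoG slices that are zero everywhere except the centre sample.
stack = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
stack[1][1][1] = 1.0
print(is_extremum(stack, 1, 1, 1))
```

Only the middle scales of an octave can be tested this way, because the first and last DoG images lack a neighbouring scale on one side.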

The question which arises here is how many octaves and how many samples per octave are enough to generate stable key points. Lowe’s paper as well as Bay et al. found that taking 3 samples per octave with a multiplicative difference gives the maximum number of key points, and with a larger number of samples the number of stable key points starts to decrease. This point is strengthened by the fact that blurring the image with big filters causes image points to lose their characteristics. The same problem occurs if we increase the number of octaves. Therefore 4 octaves with 3 samples each are considered to be the optimal choice.

4.1.1.2 Accurate Key point Localization

After identifying a keypoint, the next step is to find its local information such as location, scale and ratio of principal curvatures. At this point it is important to discard the points which have very low contrast (selected as keypoints due to noise). Here Lowe suggests fitting a 3D quadratic function to determine the interpolated location of the selected points. To keep only points with sufficient contrast, Lowe discards the points whose interpolated value is less than 0.03 in magnitude, assuming pixel values in the range 0 to 1.

The keypoints selected so far are the ones with considerable contrast compared to their neighboring pixels. But we still have to deal with noise and false edge responses, and for this Lowe suggested the use of the Hessian matrix instead of Harris and Stephens’ corner detector to avoid unnecessary calculation of eigenvalues. The method is to find the principal curvature in two directions: if the point is an edge point, one principal curvature will be much higher than the other, but if it is a corner the difference will be small and the point will therefore be selected. The trace and determinant of the Hessian matrix give the sum and product of the principal curvatures, so their ratio can be checked without computing the eigenvalues.

illustration not visible in this excerpt

Or equivalently

illustration not visible in this excerpt

The keypoints are rejected with principal curvatures greater than 1.
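The curvature check can be sketched with the trace and determinant of the 2x2 Hessian, in the form given by Lowe (2004): a point is kept when Tr(H)^2 / Det(H) is below (r + 1)^2 / r, where r is the largest allowed ratio between the two principal curvatures. The function name is hypothetical and the value r = 10 used in the example is Lowe's suggestion, stated here as an assumption rather than the setting used in this project.

```python
def passes_edge_test(dxx, dyy, dxy, r):
    """Keep a key point only if its principal curvature ratio is below r.

    dxx, dyy and dxy are the second derivatives of the DoG image at the key
    point. The test uses only the trace and determinant of the Hessian, so
    no eigenvalues need to be computed.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        return False  # curvatures of opposite sign: not a stable extremum
    return trace * trace / det < (r + 1) ** 2 / r

print(passes_edge_test(1.0, 1.0, 0.0, r=10))    # similar curvatures
print(passes_edge_test(100.0, 1.0, 0.0, r=10))  # edge-like point
```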

4.1.1.3 Orientation Assignment

Another important property that a feature point should possess is invariance to orientation. An orientation histogram is formed by recording the gradient orientations of sample points within a region around the key point. Peaks in the histogram reflect the dominant directions of the local gradients. The highest peak in the histogram, along with any other local peak within 80% of the highest peak, is used to create a key point with that orientation. Therefore multiple key points may be created at the same location and scale but with different orientations. Finally, to improve accuracy, the peak position is interpolated from the three histogram values closest to each peak to obtain a stable gradient orientation.
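The selection of dominant orientations can be sketched as follows. The 36-bin histogram follows Lowe (2004) and is an assumption here, as are the function name and the example gradients; the final interpolation between neighbouring bins is left out for brevity, so the sketch returns bin centres.

```python
def dominant_orientations(gradients, bins=36, peak_ratio=0.8):
    """Return the orientations (bin centres, in degrees) assigned to a key point.

    gradients: iterable of (magnitude, angle_in_degrees) pairs sampled
    around the key point. Every local peak that reaches peak_ratio of the
    highest peak yields one orientation.
    """
    hist = [0.0] * bins
    width = 360.0 / bins
    for magnitude, angle in gradients:
        hist[int((angle % 360.0) // width) % bins] += magnitude
    top = max(hist)
    peaks = []
    for i, h in enumerate(hist):
        left, right = hist[i - 1], hist[(i + 1) % bins]  # circular neighbours
        if h > 0 and h >= peak_ratio * top and h > left and h > right:
            peaks.append(i * width + width / 2)
    return peaks

# Hypothetical gradients: one strong direction, one nearly as strong, one weak.
samples = [(2.0, 5.0), (1.7, 95.0), (0.5, 200.0)]
print(dominant_orientations(samples))
```

In this example two orientations pass the 80% rule, so two key points would be created at the same location and scale.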

4.1.1.4 Descriptor

The final step is to define a descriptor that also adds invariance to illumination and 3D viewpoint. The SIFT descriptor starts by calculating the gradient magnitude and orientation at each image sample point in a region around the keypoint, as shown in Figure 3. The implementation uses a 4x4 descriptor computed from a 16x16 sample array. The results are therefore stored as a 4x4 array of histograms with 8 orientation bins each, giving a 128-element feature vector (4 x 4 x 8) for each keypoint.

illustration not visible in this excerpt

Figure 3: SIFT Key Point Descriptor (Lowe, 2004)

A Gaussian weighting function is used to assign more weight to the gradient magnitudes of sample points near the center of the window and less weight to points far from the center. This is done to avoid false registration of keypoints. To make the descriptor invariant to brightness changes, the feature vector is normalized to unit length. The output of the SIFT detector and descriptor is now invariant to scale, rotation, illumination and 3D viewpoint, and thus can be used for recognition, tracking, registration or 3D reconstruction applications.

4.1.2 SURF

SURF is a technique based on the use of local information for feature extraction. It differs from SIFT both in how it detects feature points and in how it calculates their descriptors. In SURF, Bay et al. proposed alternative methods to compute information similar to that computed by SIFT. The details are given below.

4.1.2.1 Integral Image

Bay et al. proposed that the computational load of applying filters can be reduced by using an integral image instead of the original image, a representation originally proposed by Viola and Jones for face detection (Viola & Jones, 2001). Integral images allow very fast computation of any box-type convolution filter. An integral image is simple to calculate: the value at each pixel location (x, y) is the sum of all pixels in the rectangular region formed by the origin and (x, y).

illustration not visible in this excerpt

Once the integral image is available, it takes only four lookups to calculate the sum of pixel values over a rectangular region of any size.

illustration not visible in this excerpt

(Fig. 4) and (Fig. 5) show example pixel values of an original image and its integral image.

illustration not visible in this excerpt

Figure 4: Original image

illustration not visible in this excerpt

Figure 5: Integral image
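The construction and the four-lookup box sum can be illustrated with a short sketch. The function names are hypothetical, and the table is padded with an extra row and column of zeros (an implementation convenience assumed here, not something stated in the text) so that no boundary checks are needed.

```python
def integral_image(img):
    """img is a list of rows of pixel values. Returns a padded table in which
    ii[y][x] is the sum of img over rows 0..y-1 and columns 0..x-1."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of pixels in rows top..bottom and columns left..right (inclusive),
    obtained from four lookups in the integral image."""
    return (ii[bottom + 1][right + 1] - ii[top][right + 1]
            - ii[bottom + 1][left] + ii[top][left])
```

For the 3 x 3 image `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, the sum over the lower-right 2 x 2 block is 5 + 6 + 8 + 9 = 28, and `box_sum` returns the same value regardless of how large the block is.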

Once an integral image is computed, the remaining steps of SURF, such as the Haar wavelet responses and the Hessian matrix calculation used to find interest points, become computationally cheap.

4.1.2.2 Interest Point Detection

To detect interest points in an image, a popular choice of detector has been used, i.e. the Hessian matrix, because of its good performance and accuracy when computed over integral images. A point in the image is selected as an interest point where the determinant of the matrix is maximal.

For a point ‘x’ in an image ‘I’, the Hessian matrix H(x, σ) at ‘x’ and scale σ is defined as below:

illustration not visible in this excerpt

where L_xx(x, σ) is the convolution of the Gaussian second-order derivative with the image at point x, and similarly for the other entries of the matrix. The authors followed the argument made by Lindeberg (Lindeberg, 1988) that the Laplacian-of-Gaussian is optimal for scale space as compared to the Difference-of-Gaussian, which causes a loss in repeatability under image rotation (Herbert Bay et al.). An interest point is selected if a pixel has the maximum or minimum value among its 26 neighboring pixels (a 3 x 3 x 3 neighborhood in scale space). To find points in scale space, the determinant of the Hessian matrix is used. The main focus of this technique is to optimize the performance of the algorithm; the authors therefore explored every possibility to reduce the computational cost, and the use of the Fast Hessian matrix is one of them, proposed by (Neubeck & Gool, 2006).

4.1.2.3 Localization of Interest Point

As the scale space is built using different sizes of Gaussian filter, and the difference between two successive smoothed images is substantial, it is important to refine each point's position by interpolation.

4.1.2.4 Orientation Assignment

To find features that are invariant to rotation, Bay et al. suggested the use of Haar wavelet responses in the ‘x’ and ‘y’ directions in a circular neighborhood of radius 6s around the interest point, where ‘s’ is the scale at which the interest point was detected. Again, the integral image speeds up the computation, because the Haar wavelets are applied as box filters to compute the responses in the ‘x’ and ‘y’ directions.

illustration not visible in this excerpt

Figure 6: Haar Wavelet Filter for x and y direction

Here the dark region has weight -1 and the white region has weight +1, and the result can be obtained with only six operations. The dominant orientation is calculated by summing all responses within a sliding orientation window of size π/3.

4.1.2.5 SURF Descriptor

After identifying the points with different scales and orientations, what remains is to calculate the descriptor. The technique divides the region around the interest point into 4 x 4 sub-regions, calculates the Haar wavelet responses of each sub-region in the horizontal and vertical directions, and then sums both the responses and their absolute values.

illustration not visible in this excerpt

Figure 7: SURF Descriptor(Bay, Ess, Tuytelaars, & Luc, 2006)

The results are then stored in a four-dimensional descriptor vector for each sub-region, so the total length comes to 64 (4 x 4 sub-regions with 4 values each). Similar to SIFT, the SURF descriptor is also normalized so that it becomes illumination invariant.
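The assembly of the 64-element vector can be sketched as below. This is a simplified illustration under stated assumptions: the function name is hypothetical, the Haar responses `dx` and `dy` are assumed to be already computed on a square sample grid (for example 20 x 20, which gives 5 x 5 samples per sub-region), and the Gaussian weighting of the responses is omitted. The four values per sub-region follow Bay et al.: the sums of the responses and the sums of their absolute values.

```python
import math


def surf_descriptor(dx, dy, grid=4):
    """dx, dy: square 2-D lists of Haar responses over the interest region
    (side length divisible by grid). Returns a unit-length vector holding
    sum(dx), sum(dy), sum(|dx|), sum(|dy|) for each of grid x grid sub-regions."""
    step = len(dx) // grid
    vec = []
    for gy in range(grid):
        for gx in range(grid):
            sx = sy = ax = ay = 0.0
            for y in range(gy * step, (gy + 1) * step):
                for x in range(gx * step, (gx + 1) * step):
                    sx += dx[y][x]
                    sy += dy[y][x]
                    ax += abs(dx[y][x])
                    ay += abs(dy[y][x])
            vec.extend([sx, sy, ax, ay])
    # Normalizing to unit length removes the effect of a contrast change.
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec
```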

4.2 Matching Algorithms

Matching two images requires finding point-to-point correspondences between the pair of images. The similarity between features can be found using their descriptors, which hold the content information of the feature point. If the distance between the descriptors of two features is less than some threshold, they are considered matched features. Thresholding only the Euclidean distance between two descriptors does not always give true matches, because some features are far more distinctive than others (Lowe, 2004). Lowe suggested that the distance to the nearest neighbor should be compared with the distance to the second nearest neighbor, and that the match should be accepted only if the nearest neighbor is sufficiently closer than the second one. In SURF, Bay et al. have followed the same approach to find point-to-point correspondences.
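The comparison of nearest and second nearest neighbor can be sketched as follows. The function name is hypothetical, the search is brute force rather than tree based, and the ratio of 0.8 is the value reported by Lowe (2004); it is used here as an assumption, not as a parameter taken from this study.

```python
import math


def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Return (i, j) index pairs where desc_b[j] is the nearest neighbour of
    desc_a[i] and is clearly closer than the second nearest neighbour."""
    matches = []
    for i, a in enumerate(desc_a):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(desc_b))
        if len(dists) < 2:
            continue  # the test needs a second neighbour to compare against
        (d1, j1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:
            matches.append((i, j1))
    return matches
```

A descriptor with two almost equally close candidates is rejected as ambiguous, even when both candidates are very near.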

4.2.1 Nearest Neighbor Algorithm

A nearest neighbor can be found simply by using the Euclidean distance between two descriptors; if the distance between them is less than some threshold, they qualify as nearest neighbors. To apply this technique in practice, however, some data structure is required, and its choice plays a critical role in efficiently finding corresponding features between two images. A data structure is also necessary because the descriptor is normally a multidimensional array of data, and matching all of it exhaustively is a time-consuming process. The original implementations of SIFT and SURF use similar data structures. The SIFT implementation uses the Best Bin First (BBF) algorithm, which is a modified search ordering for the k-d tree algorithm. An added benefit of this method is its priority-queue-based search, which can be stopped after checking a certain number of bins (for SIFT the search is stopped after checking the first 200 nearest-neighbor candidates, to reduce the computational load of comparing unimportant matches).

SURF also uses k-d trees to find nearest neighbors, but here again SURF claims efficiency over SIFT because of information collected during the detection phase, i.e. the sign of the Laplacian (the trace of the Hessian matrix), which distinguishes between types of blob. Here blobs are regions around feature points, and a feature point is detected on the basis of its blob response, which is either a dark spot on a light background or a light spot on a dark background. Only points having the same sign of Laplacian are considered for matching.

Similarity based on descriptor distance is not sufficient to obtain correct matches; therefore, an additional robust algorithm is needed that can also estimate the transformation between matched points and identify the false matches. Two well-suited techniques for this are RANSAC (Bolles, Martin, & Robert, 1981) and the Hough Transform (Ballard & Brown, 1982).

4.2.2 RANSAC

RANSAC (Random Sample Consensus) is an iterative method that partitions a dataset into inliers (the consensus set) and outliers (the remaining data), and also delivers an estimate of the model computed from the minimal set with maximum support. The concept behind RANSAC is to select four point correspondences at random; these define the projective transformation (homography matrix), and the support for this transformation is measured by the number of correspondences that lie within a certain distance of it. The random selection is then repeated a number of times, and the transformation with the greatest support is considered the best model fit (Richard & Andrew).

The algorithm consists of the following steps:

1. Compute interest points in each image

2. Compute a set of interest point matches based on the similarity of their descriptors

3. For ‘N’ samples (where ‘N’ is the chosen number of iterations)

a. Select a random sample of 4 correspondences and compute the homography matrix ‘H’
b. Calculate the distance ‘d’ for each pair of corresponding feature points under ‘H’
c. Compute the number of inliers consistent with ‘H’, i.e. those for which ‘d’ < t (where t is a threshold distance)

4. Choose the ‘H’ with the largest number of inliers.

5. For an optimal solution, re-estimate ‘H’ from all correspondences classified as inliers by minimizing the cost function.
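Steps 3 and 4 above can be sketched in plain Python. This is a minimal illustration, not the implementation developed in this study: all function names, the iteration count, the threshold t = 3 pixels and the fixed random seed are assumptions. The homography is solved from four correspondences by fixing its last entry to 1, and the final re-estimation over all inliers (step 5) is omitted.

```python
import random


def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting; None if singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-9:
            return None
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def homography(pairs):
    """Homography (9 values, last fixed to 1) from four ((x, y), (u, v)) pairs."""
    A, b = [], []
    for (x, y), (u, v) in pairs:
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve(A, b)
    return None if h is None else h + [1.0]


def project(h, x, y):
    """Map (x, y) through h; None when the point maps to infinity."""
    w = h[6] * x + h[7] * y + h[8]
    if abs(w) < 1e-12:
        return None
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


def ransac(matches, n_iter=200, t=3.0, seed=0):
    """matches: list of ((x, y), (u, v)). Returns (best_h, inlier_matches)."""
    rng = random.Random(seed)
    best_h, best_inliers = None, []
    for _ in range(n_iter):
        h = homography(rng.sample(matches, 4))
        if h is None:
            continue  # degenerate sample, e.g. three collinear points
        inliers = []
        for (x, y), (u, v) in matches:
            p = project(h, x, y)
            if p and ((p[0] - u) ** 2 + (p[1] - v) ** 2) ** 0.5 < t:
                inliers.append(((x, y), (u, v)))
        if len(inliers) > len(best_inliers):
            best_h, best_inliers = h, inliers
    return best_h, best_inliers
```

The matches returned as inliers are the ones that would be kept as true matches; the rest are treated as outliers.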

4.2.3 Hough Transform

The Hough Transform was developed to identify lines by representing each candidate pixel as a line in Hough space; the cell in Hough space that receives the most votes gives the position of the line in the image. This idea of using a parametric space to find geometric shapes was then extended to general shapes and to matching. For matching applications, the Hough Transform is used to find clusters of features that vote for a particular model pose. It allows each feature to vote for all object/model poses that are consistent with the feature's location, scale and orientation. The peak in the Hough parametric space identifies the feature points that share a consistent transformation, and these are then marked as matched feature points.

The advantage of using the Hough Transform is that it searches all possible poses of the object simultaneously by accumulating votes in the parametric space. The problem with this implementation is that as the number of parameters increases, the number of dimensions of the accumulator also increases. This not only increases the time needed to fill in the votes and then scan the whole space for the highest peak, but also uses a lot of memory. Despite its high time and space complexity, its accuracy is satisfactory.

Hough Transform for Feature Matching

1. Generate a four-dimensional accumulator array A(x, y, scale, θ)

2. Initialize the array to zero

3. For all matched feature points in the pair of images

a. Calculate the transformation between the pair of matched feature points in terms of x-axis position (x), y-axis position (y), scale (s) and orientation (θ)
b. Increment A(x, y, scale, θ)

4. Find the maximum value in the accumulator array; its x, y, scale and θ give the image transformation

5. For all matched feature points

a. Mark a feature point as a true match if it follows the image transformation identified by the Hough peak.
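The voting procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the code developed in this study: the function name and the bin sizes (20 pixels in position, a factor of 2 in scale, 30 degrees in orientation) are hypothetical, and a dictionary of occupied bins is used in place of a dense four-dimensional array, which avoids the memory cost discussed above at the price of a hash lookup per vote.

```python
import math
from collections import Counter


def hough_matches(matches, pos_bin=20.0, scale_base=2.0, angle_bin=30.0):
    """matches: list of (kp_a, kp_b); each keypoint is (x, y, scale, angle_deg).
    Returns (peak_bin, matches that voted for the peak)."""
    def bin_of(a, b):
        dx, dy = b[0] - a[0], b[1] - a[1]            # translation
        ds = math.log(b[2] / a[2], scale_base)       # scale change (log units)
        dth = (b[3] - a[3]) % 360.0                  # rotation
        return (math.floor(dx / pos_bin), math.floor(dy / pos_bin),
                round(ds), int(dth // angle_bin))

    votes = Counter(bin_of(a, b) for a, b in matches)  # steps 1 to 3
    peak, _ = votes.most_common(1)[0]                  # step 4
    # Step 5: keep only the matches that agree with the peak transformation.
    return peak, [(a, b) for a, b in matches if bin_of(a, b) == peak]
```

Because each match votes for a single bin, matches that fall just across a bin boundary are lost; voting for neighbouring bins as well would be one way to reduce this effect.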

4.3 Testing Methods

Algorithms developed for image processing and vision problems may work very well for one application but fail to give impressive results for another. Some object detection techniques, for example, may work well for identifying rigid objects (with some geometric shape) but fail to detect non-rigid or deformable objects. It is, therefore, fundamental to assess the performance of any algorithm before applying it to a certain problem.

To characterize the performance of algorithms, many methods have been introduced. The most common and logical of these is to compare the outcome of a method with known results, and then to calculate the counts (TP, FP, FN, TN)

illustration not visible in this excerpt

Where

TP is the number of test predictions that correctly report a match

TN is the number of test predictions that correctly report a non-match

FN is the number of test predictions that show a false non-match

FP is the number of test predictions that show a false match

Because these figures, taken separately, do not convey the information needed about the performance of an algorithm, different methods have been developed for combining them and presenting them graphically. Some of these are described and utilized in this project.

4.3.1.1 ROC Curve (Receiver Operating Characteristic Curve)

An ROC curve (obtained by varying one or more parameters) is a plot of the true positive rate against the false positive rate. The curve shows the correct predictions as well as the false predictions. The closer the curve is to the top-left corner of the graph, the better the algorithm, as shown in (Fig. 8 (a)). Each of the two plots in Fig. 8 shows ROC curves for two matching algorithms, drawn in two different colours, i.e. red and green.
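As a small illustration of how the points of such a curve are obtained, the sketch below sweeps an error threshold over a set of candidate matches. The function name and inputs are hypothetical: `errors` is assumed to hold one error value per candidate match (for example a scale error), and `is_correct` marks which candidates are true matches. The rates use the standard definitions, true positive rate = TP / (TP + FN) and false positive rate = FP / (FP + TN).

```python
def roc_points(errors, is_correct):
    """Return a list of (false_positive_rate, true_positive_rate) points, one
    per threshold; a candidate is accepted when its error is <= the threshold."""
    pos = sum(1 for c in is_correct if c)   # TP + FN
    neg = len(is_correct) - pos             # FP + TN
    points = [(0.0, 0.0)]
    for t in sorted(set(errors)):
        tp = sum(1 for e, c in zip(errors, is_correct) if e <= t and c)
        fp = sum(1 for e, c in zip(errors, is_correct) if e <= t and not c)
        points.append((fp / neg if neg else 0.0, tp / pos if pos else 0.0))
    return points
```

Plotting the second value of each pair against the first gives the curve; a curve that rises quickly towards the top-left corner corresponds to many true positives at a low false positive rate.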

illustration not visible in this excerpt

Figure 8: ROC Curves

(Fig. 8 (a)) clearly shows that the algorithm represented by the red line performed much better than the algorithm represented by the green line, as evidenced by its higher true positive rate. The graph on the right side, however, gives unclear information about the performance of the two algorithms, which makes it impossible to say which algorithm is better. In such a situation, a more sophisticated method needs to be applied. McNemar's Test is one such method.

4.3.1.2 McNemar’s Test

It is a form of chi-square test for matched-pair data. In this test, the results of the two algorithms are stored as follows:

illustration not visible in this excerpt

McNemar's Test is calculated in the following way:

illustration not visible in this excerpt

(-1 is the continuity correction)

“If the number of tests is greater than 30 then central limit theorem applies. This states that if the sample size is moderately large and the sampling fraction is small to moderate, then the distribution is approximately Normal” (Clark & Clark, 1999). In this case, we need to calculate the ‘Z’ score of the algorithms using the following equation:

illustration not visible in this excerpt

The Z-score should be interpreted using a Z-score table, so that Algorithms ‘A’ and ‘B’ can be categorized as similar/different and better/worse with a confidence limit. If Algorithms ‘A’ and ‘B’ are similar, the Z-score will tend to zero; if they are different, a two-tailed prediction can be used with a confidence limit to show that the two algorithms differ from each other. A one-tailed prediction can be used to establish the superiority of one algorithm over the other, with the following confidence limits.

Table 3: Z-Score table

illustration not visible in this excerpt

If the Z-score is 11.314, for example, then by using the above table the following results will be deduced:

We are 99% confident that Algorithms ‘A’ and ‘B’ give different results.

We are 99.5% confident that Algorithm ‘A’ is superior to Algorithm ‘B’.
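The Z-score calculation can be sketched as below, using the commonly stated form of McNemar's test with continuity correction, Z = (|N_sf - N_fs| - 1) / sqrt(N_sf + N_fs). The function and argument names are hypothetical: `n_sf` is assumed to count the cases where Algorithm ‘A’ passes and ‘B’ fails, and `n_fs` the reverse. Cases where both pass or both fail do not enter the statistic.

```python
import math


def mcnemar_z(n_sf, n_fs):
    """Z-score of McNemar's test with continuity correction.

    n_sf: number of cases where A passes and B fails (assumed naming).
    n_fs: number of cases where A fails and B passes.
    """
    n = n_sf + n_fs
    if n == 0:
        return 0.0  # the algorithms never disagree
    return (abs(n_sf - n_fs) - 1) / math.sqrt(n)
```

With purely illustrative counts of `n_sf = 130` and `n_fs = 0`, the function returns approximately 11.314, a value far beyond the largest entry of a typical Z-score table.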

5 Chapter: Performance Evaluation

5.1 Evaluation Data

The data selected to evaluate the performance of the matching and feature extraction algorithms are standard datasets that are widely used for comparing vision algorithms. They are provided on the University of Oxford website[1].

There are five datasets with different imaging conditions and synthetic transformations which are listed below:

- Increase in Blur
- JPEG Compression
- Change in Illumination
- Change in View Point
- Change in Scale and Orientation

Each dataset contains 6 images with varying geometric or photometric transformations. This means that the effect of a change in imaging conditions can be separated from the effect of changing the scene type. Each scene type contains homogeneous regions with distinctive edge boundaries (e.g. graffiti, buildings), while only some contain repeated textures of different forms. All images are of medium resolution, approximately 800 x 640 pixels.

illustration not visible in this excerpt

Trees and Bikes datasets: sequences of images with increasing blur from left to right. The blur has been introduced by varying the camera focus.

illustration not visible in this excerpt

The images are of the UBC (University of British Columbia) building, with increasing JPEG compression. The JPEG sequence is generated using the standard xv image browser with the image quality parameter varying from 40% to 2%.

illustration not visible in this excerpt

A sequence of six images with a decrease in brightness. The light changes are introduced by varying the camera aperture.

Figure 9: Datasets and their description (a)

illustration not visible in this excerpt

The first dataset is known as the Graffiti dataset, while the second contains images of a wall. Both are sequences of images taken from different viewpoints. In the viewpoint change test, the camera varies from a fronto-parallel view to one with significant foreshortening at approximately 60 degrees to the camera.

illustration not visible in this excerpt

The Boat and Bark datasets are images with differences in scale and orientation. The scale change sequences are acquired by varying the camera zoom. The scale changes by a factor of 4, and the images were taken by rotating the camera up to a 60 degree angle. Each image in a sequence is at a 10 degree rotation from the previous one.

illustration not visible in this excerpt

Figure 10: Datasets and their description (b)

5.1.1 Code Implementation

There are a few open source implementations available for both the SIFT and SURF algorithms. Recently, Bauer et al. (2007) have compared some of the source codes implementing SIFT and SURF (Bauer, Sunderhauf, & Protzel, 2007). These codes include David Lowe's[2] original SIFT implementation, SIFT++[3], LTI-lib[4] SIFT, and the SURF codes SURF[5] and SURF-d. According to this study, Lowe's implementation and SIFT++ give the best results, while SURF and SURF-d give results very close to those of the former two. The original SIFT implementation (by Lowe) is available only in binary form and therefore could not be used.

There are some other implementations of both algorithms available, such as SIFT[6] by Rob Hess and the open source SURF[7] feature extraction library (under the GNU License). These two are written in C++ with OpenCV for Microsoft Visual Studio, and are therefore selected for use in this project. These implementations do not come with matching code such as RANSAC and the Hough Transform (except for SIFT by Rob Hess, which already has a RANSAC implementation). For that reason, the rest of the code has been developed in this study.

5.1.2 RANSAC vs Hough Transform

Matching features between a pair of images is a complex problem. It involves comparing the features of a reference or model image with all the features of the test image in order to find the relevant matches. The complexity of the problem increases with the number of features in both images. Hough space has been utilized to accumulate the votes of image pixels to find lines, circles, or other shapes. This concept (the accumulation of votes) has been extended to match feature points by collecting the votes for all sets of poses in an image to match an object. There are other techniques, such as alignment techniques, in which a subset of the test and reference image features is used to determine the perspective transformation and orientation (Lowe, 2004). Most alignment-based methods use RANSAC to find the number of features which have a similar model or transformation. Tree-based methods are also used for different applications, such as k-d trees, which arrange points in a k-dimensional space to facilitate key search or nearest neighbor search. Because of the high complexity of these algorithms, the most popular approaches are computational methods like RANSAC and the Hough Transform. Both give good performance under similar conditions, which is why researchers have also proposed combining the two techniques (Brown, Szeliski, & Winder, 2006) to get better results.

Although the Hough Transform and RANSAC are different approaches, as described in sections 4.2.2 and 4.2.3, they produce comparable results for different types of image data. It is therefore essential to analyze the performance of these two matching strategies before conducting a comparative study of the two outstanding feature extraction algorithms. A good evaluation of the two algorithms in terms of computational time has been given by (Stephen, Lowe, & little, October, 2002). That comparison mainly focuses on the efficiency of the two techniques. Although it showed the Hough Transform to be computationally more expensive than RANSAC, it is worth exploring whether the quality of the results it generates is also good.

In order to carry out this comparison, many experiments have been performed on test images taken from the test datasets with different imaging conditions and affine transformations. SIFT and SURF features have been matched using RANSAC and the Hough Transform in a pair of images. The results were compared using true positive matches against a scale error threshold. The step-by-step procedure went as follows:

1. Feature detection using SIFT and SURF
2. Finding point to point correspondence between feature points in a pair of images based on a similarity measure in their descriptor (using Nearest Neighbor Algorithm)
3. Identifying true positive and false positive matches based on the difference from the actual transformation (peak value in Hough space)

illustration not visible in this excerpt

Figure 11: Graphs comparing Number of Matches by Hough Transform and RANSAC

The analysis starts by comparing the number of matches identified by both algorithms. The results are presented in the form of bar charts in (Fig. 11). It is obvious from these graphs that, because of different imaging conditions and differences in scale and orientation, the Hough Transform was sometimes unable to give a sufficient number of matches, for example in the ‘Graffiti’, ‘Wall’, ‘Bark’ and ‘Boat’ datasets. The Graffiti and Wall datasets contain images with a change in viewpoint, while ‘Bark’ and ‘Boat’ are the datasets with different zoom and rotation. The Hough Transform tries to collect the votes of features to identify similar transformations in position, scale and orientation, giving four degrees of freedom, i.e. a position on the x-axis and y-axis, a scale and an orientation. It seems, however, that when the image is rotated or the viewpoint changes, the features lose consistency and therefore fail to create a single significant peak in Hough space. Hough found 4 or more matches, enough to satisfy the matching criterion, but did not prove to be the best option. In the remaining datasets, Hough and RANSAC give comparable numbers of matches with which to analyze their performance. The number of matches alone cannot be the criterion for judging the performance of an algorithm; therefore, the true positive matches and error bars have also been calculated in order to complete the evaluation process.

Before investigating the quality of the results obtained using both algorithms, it is important to describe the method and purpose of calculating the actual transformations between pairs of images, and the method used to calculate error bars.

The verification of matching results is the most difficult task without human consultation: the number of local features identified by a good detector or descriptor lies in the thousands, and even if only 50% of them match, at least 500 matches need to be verified, which is impractical to do visually. This demands a robust method for deciding the correctness of the matched points. Mikolajczyk and Schmid have suggested a mathematical method to find the precision and recall values using the number of matches and correspondences (Mikolajczyk & Schmid, October 2005). However, it is unclear how they categorize a match point as a correct or a false match point.

illustration not visible in this excerpt

Figure 12: ROC Curves: Hough Transform and RANSAC matching Graffiti, Boat and UBC Images

To solve this issue, it is suggested that the Hough Transform be used to find the transformation between feature points: the transformation followed by the maximum number of feature points is taken as the true transformation between the images. It is then used to separate true positive from false positive matches; features that do not follow the true transformation are considered false positive matches.

5.1.2.1 ROC Curves

To plot ROC curves we need a parameter which, when changed, shows the varying performance of the techniques. Different values of the scale error threshold produce different numbers of matched features. I have selected scale as the varying parameter because almost all datasets contain images with differences in scale (except for the datasets with rotated images, ‘Boat’ and ‘Bark’, where angle has been selected). Therefore, as the scale error threshold changes from lower to higher, the number of “True Positive Matches” also changes. The algorithm whose ROC curve comes closest to the top-left corner is considered best, as it has the most true positive matches relative to false positive matches.

5.1.2.2 Error Bar Graphs

The method used to find the error bars is as follows:

- Find the number of matches in a pair of images using RANSAC and the Hough Transform
- Extract the true matches and repeat the process a number of times
- Calculate the mean and standard deviation
- Calculate the standard error by dividing the standard deviation by the square root of the number of iterations, and plot the mean with the standard error
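The last two steps can be sketched as below. The function name is hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which form was used.

```python
import math
import statistics


def mean_and_standard_error(true_match_counts):
    """true_match_counts: number of true matches found in each repetition.
    Returns (mean, standard error of the mean)."""
    n = len(true_match_counts)
    mean = statistics.mean(true_match_counts)
    sd = statistics.stdev(true_match_counts) if n > 1 else 0.0
    return mean, sd / math.sqrt(n)
```

A deterministic method returns the same count in every repetition, which gives a standard error of zero, while a randomized method such as RANSAC can give slightly different counts from run to run.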

Hough and RANSAC have been compared for all sets of images, and the results are discussed in the following paragraphs. Only one image per dataset has been selected for discussion here; the results for the rest of the images are attached as Appendix A.

illustration not visible in this excerpt

Figure 13: ROC Curves: Hough Transform and RANSAC matching Bike and Leuven Images

Table 4: Mc Nemar's Test for Hough and RANSAC using SIFT features

illustration not visible in this excerpt

Table 5: Mc Nemar's Test for Hough and RANSAC using SURF features

illustration not visible in this excerpt

For almost all images from the datasets presented here, RANSAC outperformed Hough in the case of SIFT features, not only in finding a good number of matched points but also in obtaining the maximum number of true positive matches. However, for SURF features, Hough has given performance almost equal to that of RANSAC, as is obvious from the graphs in (Fig. 13) (for the ‘Bike’ and ‘Leuven’ datasets), so it is hard to select the best performer. In the next step, error bars have been calculated, and the results are presented in the third column of (Fig. 12 & 13). These clearly show that Hough, despite having fewer matches, has no standard error and therefore produces consistent results. At the same time, RANSAC shows negligible error bars, and therefore cannot be categorized as a bad matching technique when compared with the Hough Transform.

In the case of SURF features, the ROC curves (shown in column 2, Fig. 12 & 13) overlap each other, so it becomes impossible to tell which method gives better results, and even the error bars fail to distinguish the superior method. Therefore, another test (Mc Nemar's Test) is applied to resolve the issue.

RANSAC and the Hough Transform give contradictory results for the Bikes dataset. If the features are extracted using the SIFT technique, RANSAC obtains more true positive matches than the Hough Transform, but with SURF features the Hough Transform produces better results. Because of this contradiction, Mc Nemar's test is applied to SIFT features first and then to SURF features. The data for Mc Nemar's test has been collected for all six images of the dataset and is, therefore, much more reliable than the ROC curves, which present matching results between only two images.

To sort the data in terms of ‘fail’ or ‘pass’ for an algorithm, I have selected a pass limit of 15%, meaning that if an algorithm produces more than 15% of the total matches for a specific scale error threshold, then it is counted as a pass; otherwise it is counted as a failure. As the Z-score has been calculated for both features, the Z-score table given as (Table-3) has been used to interpret the results shown on the right side of Mc Nemar's (Table-4 & 5). Both Mc Nemar's tests show that RANSAC is a better algorithm than the Hough Transform. Further, the images selected as test data are either of planar scenes or were acquired with a fixed camera position (Oxford). Therefore, in all cases the images are related by a projective transformation. RANSAC works out the inliers using the homography matrix and is, therefore, best suited to removing outliers from a matching pair of images. The Hough Transform, by contrast, uses all four parameters (position, scale and theta) to identify the correct transformation, which sometimes results in a considerably smaller number of votes from the feature points.

In view of the above, RANSAC has been selected as the more efficient algorithm and is used for the further evaluation of SIFT and SURF features. In addition, RANSAC has also been selected as the feature matching technique for the application developed.

illustration not visible in this excerpt

Figure 14: Graphs comparing number of matched feature points detected by SIFT and SURF

5.1.3 SIFT vs. SURF

After selecting the best matching algorithm, the next step is to put the SIFT and SURF algorithms on trial, to select the one with better performance on images with varying imaging conditions and different transformations, such as changes in brightness, compression, viewing conditions, scale and rotation. The algorithms are first compared on the basis of the number of corresponding feature points identified between pairs of images, and then more sophisticated tests are applied to determine the better suited algorithm.

The graphs in (Fig. 14) clearly indicate the superiority of SIFT over the SURF technique in finding more consistent feature points. It is important to mention that the number of features extracted by the two techniques depends greatly on the selection of parameter values such as the number of octaves, samples per octave, blob threshold and non-maxima suppression threshold. However, this does not significantly influence the number of correct matches. For this particular comparison, the values of these parameters are kept as prescribed in the implementations, as tuning the parameter values is beyond the scope of this project. The following table summarizes these parameters and their values.

Table 6: Algorithm’s parameter values

illustration not visible in this excerpt

It has been observed by (Bay, Ess, Tuytelaars, & Luc, 2006) that the maximum number of feature points can be extracted using up to the 4th octave; increasing the number of octaves further only increases processing time and does not contribute to performance. Keeping the parameter values fixed for all types of images gives us the chance to monitor the behavior of the algorithms under varying imaging conditions.

The performance evaluation of feature extraction is divided into four different testing areas. First, the number of true matches identified by the two algorithms is compared for similar images. This highlights the method which gives the best performance under all imaging conditions. Then the percentage of correct matches is analyzed to check the strength of the feature detector and descriptor (Fig. 14). If the descriptor is good, then the matches found by the algorithm are mostly true matches. The performance of both techniques has been evaluated for changes in blur, scale, rotation, viewpoint and JPEG compression.

Results for Blurred Images

illustration not visible in this excerpt

Figure 15: Graphs comparing SIFT and SURF features for Bikes dataset

5.1.3.1 Blurring

The Bikes dataset presents a sequence of images with increasing blur, introduced by changing the camera focus. The amount of blurring is in the range of 2 to 6% of the original image. It is interesting to observe that although the percentage of correct matches found by SURF is less than that of SIFT, as shown in (Fig. 15(a)), all of these feature points still have a consistent transformation in this case and are comparable to the SIFT features. The results of matching images 2 and 3 in (Fig. 15(c)) and images 4 and 5 in (Fig. 15(e)) indicate that SURF features are more consistent than SIFT features, considering that these images are more blurred than the previous ones.

Furthermore, there is no geometric transformation in these images, but the pixel intensities change unpredictably in some regions. Therefore, both descriptors find most of the true positive matches. The overall results of matching all six images of the dataset do not help in selecting the more suitable algorithm. Therefore, Mc Nemar’s test is performed (Table-7) on the empirical data to identify the more appropriate algorithm.

Table 7: Mc Nemar's Test for Blurred images

illustration not visible in this excerpt

The Z-score approaches zero, which indicates that both algorithms have similar performance. For images with more blurred regions, SIFT attained the maximum score in the evaluation by (Mikolajczyk & Schmid, October 2005). Therefore, it is appropriate to say that SIFT and SURF are equally suitable for matching and tracking applications on blurred images.
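As an illustration of how such a Z-score can be obtained, the following minimal sketch uses the commonly stated form of Mc Nemar's test with continuity correction, z = (|Nsf - Nfs| - 1) / sqrt(Nsf + Nfs), where Nsf counts the cases in which the first algorithm succeeds and the second fails, and Nfs the reverse. The function name and the sample counts are hypothetical and are not taken from Table 7; the exact form used in the thesis is not visible in this excerpt.

```python
import math


def mcnemar_z(n_sf, n_fs):
    """Z-score of McNemar's test with continuity correction.

    n_sf: cases where algorithm A succeeded and algorithm B failed.
    n_fs: cases where algorithm A failed and algorithm B succeeded.
    Cases where both agree do not enter the statistic.
    """
    discordant = n_sf + n_fs
    if discordant == 0:
        # No disagreement at all: the test cannot separate the algorithms.
        return 0.0
    # Clamp at zero so that equal counts give z = 0 rather than a negative value.
    return max(abs(n_sf - n_fs) - 1, 0) / math.sqrt(discordant)


# Hypothetical counts, for illustration only.
print(round(mcnemar_z(30, 10), 3))  # clear disagreement in favour of A
print(mcnemar_z(12, 12))            # balanced disagreement
```

A value close to zero means the two algorithms disagree about equally often in each direction, which is the situation described for the blurred images.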

Results for JPEG Compression

illustration not visible in this excerpt

Figure 16: Graphs comparing SIFT and SURF features for UBC dataset

5.1.3.2 JPEG Compression

In the UBC (University of British Columbia’s Building) dataset, the images have been compressed with JPEG compression, and the last image is almost 98% compressed. Matching compressed images is important to analyze because most hardware (such as still and video cameras) nowadays compresses image data. Vision algorithms must be able to process compressed images despite the loss of information during compression and decompression. Therefore, feature extraction needs to be effective for these images in order to develop any real time recognition or tracking application.

According to the percentage of correct matches, both algorithms compete closely, with SURF having an edge over SIFT. However, the ROC curves (Fig. 16) show that the SIFT features, though fewer in number, are more consistent than the SURF features. The results depict the performance of the two techniques as nearly equal on compressed images, as both have a high true positive rate for a very small scale error threshold.

Table 8: Mc Nemar's Test for Compressed images

illustration not visible in this excerpt

The Z-score for the SIFT and SURF techniques, as shown in the table above (Table-8), gives us more than 90% confidence to select SIFT over SURF for images with JPEG compression.

illustration not visible in this excerpt

Figure 17: Graphs comparing SIFT and SURF features for Leuven dataset

5.1.3.3 Illumination

The Leuven dataset is a sequence of images with decreasing brightness. Most vision algorithms work on greyscale images, i.e. they need to convert a coloured image into greyscale before their operations can be performed. Changes in brightness can affect the performance of these algorithms, as there may be less contrast between pixels in some parts of the images and more contrast in other parts. However, SIFT has managed to find more consistent features (as shown in Fig. 14(d)), which truly matched in images with varying pixel intensities.

Table 9: Mc Nemar's Test for Images with change in Illumination

illustration not visible in this excerpt

In order to make features illumination invariant, both algorithms follow the same method of converting the descriptor into a unit vector (intensity normalization), which accounts for overall brightness change. Therefore, it is not unexpected that both produce almost similar results.

The ROC curves (Fig. 17) indicate that there is no significant difference between the two algorithms, although SIFT features have a slight edge over SURF features. Mc Nemar’s test confirms this assessment with a Z-score of more than 2, meaning that SIFT is better than SURF for images with differences in illumination.
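The effect of converting a descriptor into a unit vector can be seen in a small sketch. It assumes that an overall change in illumination scales every descriptor entry by the same factor; the four-element vector is a hypothetical stand-in for a real SIFT or SURF descriptor, which is much longer.

```python
import math


def normalize(descriptor):
    """Scale a descriptor to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm == 0:
        # A zero descriptor carries no direction and is returned unchanged.
        return list(descriptor)
    return [v / norm for v in descriptor]


# Hypothetical descriptor and the same descriptor under a uniform 1.5x scaling.
base = [4.0, 3.0, 0.0, 12.0]
brighter = [1.5 * v for v in base]

# After normalization the two descriptors coincide (up to rounding),
# so the distance used for matching is unaffected by the scaling.
print(normalize(base))
print(normalize(brighter))
```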

Results for Images with change in View Point

illustration not visible in this excerpt

Figure 18: Graphs comparing SIFT and SURF features for Graffiti dataset

5.1.3.4 View Point (a)

The Graffiti and Wall datasets contain sequences of images with changes in viewpoint. The angle between viewpoints lies in the range of 20 to 60 degrees. Because the viewing angle changes, the features are expected to undergo transformations in scale and orientation. The number of matches shown in (Fig. 14(e)) indicates that fewer SURF features are matched compared to SIFT, and so, on the basis of the percentage of correct matches, SIFT produces more accurate results.

The ‘true positive matches’ results for the whole dataset in (Fig. 18) indicate that SURF and SIFT both are equally good to detect angular transformations. Therefore, Mc Nemar’s test has been applied to select the one with better confidence limit.

Table 10: Mc Nemar's Test for Images with change in view point

illustration not visible in this excerpt

Mc Nemar’s test result shows the supremacy of SIFT technique over SURF in finding features which are robust to change in view point. Therefore, we can select SIFT with 95% confidence to match images with varying view points.

Result for Images with change in View Point

illustration not visible in this excerpt

Figure 19: Graphs comparing SIFT and SURF features for Wall dataset

5.1.3.5 View Point (b)

Another dataset with a similar transformation has been used to test the behavior of the two algorithms. The Wall dataset contains a change in viewpoint similar to that in the Graffiti images.

Given the same type of transformation, it is no surprise that the results are also similar. Again both algorithms produce overlapping ROC curves (Fig. 19) and have been put under Mc Nemar’s test.

Table 11: Mc Nemar's Test for Images with change in view point

illustration not visible in this excerpt

Contrary to the Graffiti dataset, both algorithms appear to perform similarly in this case. It shows that both techniques are able to give good results for textured images as well as for planar surfaces.

Results for Images with difference in zoom and rotation

illustration not visible in this excerpt

Figure 20: Graphs comparing SIFT and SURF features for Boat dataset

5.1.3.6 Zoom and Rotation (a)

The images tested in this section contain two synthetic transformations, introduced by changing the camera angle and zoom. This is a complex task for both algorithms, and it is evident from (Fig. 14(g & h)) that SURF struggles to find consistent features and mostly finds fewer than 500 matches, while SIFT performs better by keeping the number of matches above 1000. Irrespective of the number of matches, the percentage of correctness (Fig. 20(a)) shows that the performance of both algorithms is almost on par. Matching only the first two images from the ‘Boat’ dataset gives a different result, with SURF performing better than SIFT; the rest of the images have been better matched with SIFT features, as shown in (Fig. 20(c), (d), (e) and (f)).

To draw the ROC curves for these datasets, the varying parameter has been changed from scale to angle. This is because these images have a rotational transformation, and it is therefore better to assess performance against a change in the angle error threshold. Each image has 10° more rotation than the previous one.

To further increase the degree of confidence in these results, Mc Nemar’s test has been applied on both datasets and the result is as given hereunder.

Table 12: Mc Nemar's Test for Zoomed and Rotated Images

illustration not visible in this excerpt

Once again SIFT features appear to be more consistent in case of complex transformation as proved by greater Z-score.

Results for Zoomed and Rotated Images

illustration not visible in this excerpt

Figure 21: Graphs comparing SIFT and SURF features for Bark dataset

5.1.3.7 Zoom and Rotation (b)

The sequence of images in the ‘Bark’ dataset also contains zoom and rotation transformations. The ROC curves in (Fig. 21) show the poor performance of both algorithms. (Fig. 21(a)) indicates that SIFT and SURF both manage to find a good percentage of correct feature points, but very few of them are true positive matches. Although SIFT features appear to gain more true positive matches, the ROC curve for SIFT (Fig. 21) is far from the top left corner. Neither algorithm can be regarded as best in this scenario, but if a selection needs to be made then SIFT is the better option. The Mc Nemar’s test result given below strengthens this claim.

Table 13: Mc Nemar's Test for Zoomed and Rotated Images

illustration not visible in this excerpt

Z-score above 2 for SIFT shows its supremacy over SURF features.

5.1.4 SIFT vs SURF Conclusion

The above comparative analysis shows that both techniques are equally good. However, SIFT features are more consistent for complex transformations such as scale and rotation. So if an application demands efficiency more than accuracy, then SURF is the better option; if the accuracy of the results is more important, then SIFT is the better choice. The speed of the SURF algorithm has been a consideration since its development, and researchers in the vision community seem convinced of its use for real time applications. However, the implementation of SURF used in this study has not been found fast enough for real time applications, so this study only suggests the choice of technique according to the type of images. This comparative study differs from previous studies and is more reliable, as more sophisticated statistical tests (ROC curves and Mc Nemar’s test) have been used to analyze the performance of the algorithms on similar test data. The outcome is in line with the majority of the previous studies, but it is relatively more reliable.

6 Chapter: Application on real world video sequences

The whole exercise of evaluating the performance of SIFT and SURF along with the matching algorithms (RANSAC and Hough Transform) was for the purpose of choosing the best algorithm for a recognition or tracking application. The conclusion which emerges from this study is that RANSAC is the best available option for local feature matching, and that SIFT and SURF can be used depending on the requirements of the application. However, SIFT features appeared to be more robust to all kinds of imaging conditions and transformations.

To verify the results, it is important to use these techniques in real time applications instead of on simulated data. Therefore, it is suggested to use them for two different real time applications. First, use the local feature extraction and matching method to track the motion of a moving object, such as a moving car or an aerial vehicle, and identify its direction. Second, identify the movement of an object carrying a handheld camera. The directions to be estimated are left, right, up and down. The proposed applications can help in developing much more sophisticated systems, such as a control system for an ‘Autonomous Vehicle’ to control its movement and collect information about the surrounding environment, and a control system for a UAV (Unmanned Aerial Vehicle).

A control system for autonomous vehicles is very important and needs a great deal of accuracy and efficiency. The shortcoming of current autonomous vehicles is that they do not carry a high speed processor and memory and are therefore unable to process data on their own to make decisions. Instead, the data is sent back to a control system, which analyzes it and then sends a command for the next action. In the case of image data, the problem becomes worse because of the large amount of image/video information. It takes much time to send and receive data, and efficiency and accuracy are lost in the process. Therefore, a lot of work is being carried out to develop efficient and economical algorithms for these kinds of applications. The use of local features for global localization of a robot has given promising results (Stephen, Lowe, & Little, October, 2002).

6.1.1 Proposed Method for Direction Estimation of a Moving Vehicle

SIFT has been used to identify the local features in video data captured using a handheld camera on automotive vehicles. As described previously, the threshold values for different parameters play a vital role in improving the quality of the result. For the use of the algorithms on real data, the contrast threshold has been adjusted to get better results, and the parameter values used are as follows.

Table 14: Parameter values used for application

illustration not visible in this excerpt

The algorithm has been modified to calculate the direction of the vehicle. Basic method and modifications are as follows.

1. Set of matched feature points are obtained by applying SIFT along with RANSAC on consecutive frames of the video

2. For all matched feature points

a. Calculate the accumulative difference in feature’s position between the two frames

illustration not visible in this excerpt

Where ‘x’ and ‘y’ are the feature point’s position in Frame-1, the second pair of coordinates is the corresponding feature’s position in Frame-2, and ‘n’ is the total number of matched features.

3. Calculate average difference in all features positions

4. Estimate direction of the vehicle under the following conditions

a. If the change in X-axis is greater than 1 then the car is moving in the left direction.

b. If the change in X-axis is less than -1 then the car is moving in the right direction.

c. If the change in Y-axis is less than -1 then the vehicle is moving downward.

d. If the change in Y-axis is greater than 1 then the vehicle is moving upward.

e. For any change between -1 and 1 the vehicle is considered to be moving straight.
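The steps above can be sketched as follows. This is a minimal illustration rather than the thesis implementation: it assumes the matched point pairs (the RANSAC inliers) are already available, and it assumes the drift is measured as the Frame-2 position minus the Frame-1 position, since the equation itself is not visible in this excerpt. The function names and the sample coordinates are hypothetical.

```python
def average_drift(matches):
    """Mean displacement of matched features between two frames.

    matches: list of ((x1, y1), (x2, y2)) pairs, where the first point is in
    Frame-1 and the second is the corresponding point in Frame-2.
    Assumption: drift = Frame-2 position minus Frame-1 position.
    """
    n = len(matches)
    if n == 0:
        raise ValueError("at least one matched feature is required")
    dx = sum(p2[0] - p1[0] for p1, p2 in matches) / n
    dy = sum(p2[1] - p1[1] for p1, p2 in matches) / n
    return dx, dy


def classify_direction(dx, dy, threshold=1.0):
    """Map the average drift to (horizontal, vertical) labels.

    'L', 'R', 'U', 'D' follow the conditions in the list above;
    'S' (straight) is used when the drift lies within [-threshold, threshold].
    """
    horizontal = "L" if dx > threshold else "R" if dx < -threshold else "S"
    vertical = "U" if dy > threshold else "D" if dy < -threshold else "S"
    return horizontal, vertical


# Hypothetical matches between two consecutive frames.
matches = [((10.0, 10.0), (13.0, 10.5)), ((20.0, 5.0), (22.0, 5.5))]
dx, dy = average_drift(matches)
print(dx, dy, classify_direction(dx, dy))
```

Keeping the threshold as a parameter makes it easy to try larger values for handheld sequences, where small drifts are caused by camera shake rather than by a turn.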

Using this data, further information, such as the time for which the vehicle moved in one direction, can also be calculated as follows.

Count number of consecutive frames for any movement (N). Calculate time of downward movement as

illustration not visible in this excerpt

If the speed of the car is known then length can also be calculated as

Unfortunately the data collected for this project lacks the speed of the vehicle in videos as well as other information in aerial videos due to which it is not possible to calculate the information proposed above for ‘time’ and ‘length’ for specific movement.
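The equations for time and length are not reproduced in this excerpt. The sketch below assumes the usual relations, time = N / frame rate and length = speed × time. The frame rate of 30 frames per second and the speed value are illustrative assumptions, not values taken from the thesis, and the function names are hypothetical.

```python
def movement_duration_seconds(frame_count, frames_per_second):
    """Duration of a movement that spans frame_count consecutive frames."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return frame_count / frames_per_second


def distance_travelled(speed, duration_seconds):
    """Distance covered at constant speed (speed in units per second)."""
    return speed * duration_seconds


# Assumed frame rate of 30 fps: 150 consecutive 'left' frames last 5 seconds.
duration = movement_duration_seconds(150, 30.0)
# Assumed constant speed of 10 metres per second.
print(duration, distance_travelled(10.0, duration))
```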

6.1.2 Real World Videos for Experiment:

Two types of videos have been used to test the algorithm: first, a video made by fixing a camera in an automotive vehicle, and second, a video made using a handheld camera. Both types of videos have been recorded by Dr. Adrian Clark. Some aerial videos have also been used to check the system. As no aerial video data was available for experimentation, these sequences have been taken from the “Proaerialvideo[8]” website, where royalty free videos are available to download. The data for the following videos have been recorded and discussed.

Table 15: Real world video data

illustration not visible in this excerpt

OpenCV with Visual Studio (Express Edition) has the limitation of being unable to read compressed video files; therefore, the video frames were stored as JPG images to allow the program to read the image data. Hence the data as well as the results are stored in JPG image format.

6.1.3 Experimental Setup

A video sequence has been recorded using a camera mounted at the front of the car. The camera used for this purpose has following specifications.

Camera specifications:

Table 16: Camera Specifications

illustration not visible in this excerpt

Camera Positioning:

The camera was fixed at the front dash board of the car and was zoomed out to take the video of the road and surroundings. To minimize the bumping effect (due to uneven roads) and to keep the camera still it was fixed to the metal using glue.

illustration not visible in this excerpt

Figure 22: Direction detection in two sample frames

6.1.3.1 Data Recorded

The output of the program is stored in a text file in the following format

illustration not visible in this excerpt

‘Matches found’ (red circles in (Fig. 22)) are the corresponding features identified by the nearest neighbor algorithm, whereas ‘# of inliers’ (blue circles encircling red circles in (Fig. 22)) are the features whose transformation is consistent with the homography matrix calculated by RANSAC. The program calculates the drift along the ‘x’ and ‘y’ axes to predict the movement as ‘L’ (left), ‘R’ (right) or ‘S’ (straight) in the last two columns (as shown by the red line and a circle in the middle of the images in (Fig. 22)).

6.1.4 Verification of Results

The movement of the vehicle is identified by the change in the position of feature points while tracking them from frame to frame. If the features have an average drift of less than -1 along the x-axis then the vehicle is moving right, and if the drift is greater than 1 it is moving left. Similarly, if the average drift of feature points along the y-axis is less than -1 then the object is moving downward, and if it is greater than 1 it is moving upward. For both axes, if the average drift is between ‘-1’ and ‘1’ the motion is considered straight.

In order to verify the results and calculate the correctness, the number of frames has been counted for a particular direction of motion and then the results generated by the system are matched.
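The verification step can be illustrated with a short sketch that compares manually labelled directions against the labels produced by the system, one label per frame, and reports the fraction of correctly classified frames for each direction. The label sequences below are hypothetical and the function name is an assumption; the thesis tables may have been compiled differently.

```python
from collections import Counter


def per_direction_accuracy(actual, predicted):
    """Fraction of frames classified correctly, for each actual direction."""
    if len(actual) != len(predicted):
        raise ValueError("both label sequences must cover the same frames")
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {direction: correct[direction] / totals[direction] for direction in totals}


# Hypothetical per-frame labels: L = left, S = straight, R = right.
actual = list("LLLLSSSSRR")
predicted = list("LLLSSSSRRR")
print(per_direction_accuracy(actual, predicted))
```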

6.1.5 Results and discussion

The following section presents and discusses the results of tracking in predicting an object’s direction in real world videos.

illustration not visible in this excerpt

Figure 23: Frames for video # 1 (Car moving on straight road)

illustration not visible in this excerpt

Figure 24: Graphs indicating Left and Right motion of a moving vehicle

6.1.5.1 Video 1: Car moving on Straight Road with left/right turns

The video has been captured using an on-board camera. It is approximately five minutes long with 9794 frames. Fig. 23 shows two sample frames from the video. The road is almost flat and most of the time the car is moving straight. The significant left movement appears in the first 100 frames and the right movement appears in the last set of frames; therefore, the result verification for these two movements has been done on those frames.

Table 17: Video # 1 Results verification

illustration not visible in this excerpt

The results obtained are quite encouraging, as shown in the graphs (Fig. 24(a & b)). The accuracy of the results lies in the range of 60 to 90% for right and left turns. The system is unable to precisely predict the left and right motion on some occasions in this particular video. There may be two reasons for this: first, the camera position (the camera may be tilted towards the right); second, the lack of overlap between adjacent frames. It is visible from the video that when the car is taking a left turn, the view is not broad enough to capture the scene beyond the turn, and therefore the frames have very little overlap. Due to the small ratio of overlapping regions between frames, feature tracking becomes difficult and results in false predictions.

illustration not visible in this excerpt

Figure 25: Frames from video # 2 (Car Moving Up and Down the Hill)

illustration not visible in this excerpt

Figure 26: Graphs showing Vehicle's motion in Left, Right, Up and Down direction

6.1.5.2 Video 2: Car moving up and down the hill

The video has been captured using an on-board camera. It is approximately three minutes long with 5257 frames.

Table 18: Video # 2 Results verification

illustration not visible in this excerpt

The system successfully identified the right movement with the highest accuracy (Table-18), whereas it is 40% reliable in detecting straight and downward movement. The system almost failed (with less than 20% accuracy) to find left and upward movement. This is because during left movement the road itself is turning and so is the vehicle, which causes all feature points to be identified as moving to the left (Fig. 26). Therefore, the system ends up classifying it as the vehicle’s right movement instead of left. In the case of upward movement, the system does find the upward motion, but the difference in feature locations is so small that it cannot differentiate between a straight and a steep path.

Another important factor in the poor performance for left and upward movement is the reduction in matching percentage. In the case of right and downward motion, the matching percentage remains above 60%, but when the vehicle is moving left or going upward, the background keeps changing and the matching percentage drops below 50%, which causes close features to be matched and increases the count of straight motion.

Video # 3

illustration not visible in this excerpt

Figure 28: Graph showing left motion in Indoor video sequence

6.1.5.3 Video 3: Indoor motion sequence

This is an indoor video captured using a handheld camera. The sequence is a few seconds long with 421 frames. The subject carrying the camera is moving in an ‘L’ shaped corridor and therefore takes a sharp left turn during the sequence.

Table 19: Video # 3 Results verification

illustration not visible in this excerpt

The system successfully detected the direction, as shown in (Fig. 28), where two sharp peaks represent the left motion of the subject. As regards straight movement, the result shows that the system is only 24% reliable. Here the problem is the use of a handheld camera, which cannot remain fixed during the motion; this can be handled by using a higher threshold. However, the drawback of a high threshold is that slight left or right turns will be ignored by the system.

Video # 4

illustration not visible in this excerpt

Figure 29: Sample frames from outdoor sequence

illustration not visible in this excerpt

Figure 30: Graph showing left movement in outdoor sequence

6.1.5.4 Video 4: Outdoor motion sequence

The video has been captured using a handheld camera in an outdoor environment. The object takes a sharp left turn during the motion.

Table 20: Video # 4 Results verification

illustration not visible in this excerpt

The results in (Table-20) show the accuracy of tracking SIFT features to detect a change in the direction of a moving object in an outdoor environment. They indicate that SIFT features are robust to changes in illumination and scale, which are the two varying conditions in this outdoor video. The system provides 80% reliability in estimating a change in the right direction. Straight movement is again an issue because the camera is held in the hands while moving, which causes some uneven movements, as shown in the graph (Fig. 30). The main peak represents the left turn of the object, whereas the remaining small peaks represent straight movement. Changing the threshold for straight motion from (1 and -1) to (5 and -5) will greatly improve the result.
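The effect of widening the straight-motion band can be illustrated with a few hypothetical drift values. The numbers below are invented to mimic handheld camera shake followed by a real turn; they are not measurements from this video.

```python
def count_straight(drifts, threshold):
    """Number of frames whose drift lies within [-threshold, threshold]."""
    return sum(1 for d in drifts if -threshold <= d <= threshold)


# Hypothetical x-axis drifts: small jitter from a handheld camera,
# plus two large values produced by an actual left turn.
drifts = [0.4, -2.1, 3.0, -0.8, 1.7, 14.2, 16.5, -4.9]

print(count_straight(drifts, 1))  # only the smallest jitter counts as straight
print(count_straight(drifts, 5))  # all jitter counts as straight, the turn does not
```

With the wider band, the jitter frames are labelled straight while the two large drifts are still detected as a turn, at the cost of ignoring genuine turns whose drift stays below 5.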

Video # 5

illustration not visible in this excerpt

Figure 31: Aircraft Taking-off Video Sequence

illustration not visible in this excerpt

Figure 32: Graphs presenting motion detection of an aircraft taking-off

6.1.5.5 Video 5: An Aircraft taking-off

The video has been captured by fixing the camera at a specific position near runway. It is a small sequence of 218 frames. The camera is focused on an aircraft, which is taking-off, and changes its direction with the vehicle in left and upward direction.

For direction estimation, the thresholds are kept the same. This video has camera movement in two directions, left in x-axis and upward in y-axis. Therefore, the results for only these two motions have been calculated as shown in table below.

Table 21: Video # 5 Results verification

illustration not visible in this excerpt

The video shows that initially the aircraft is moving from right to left on the runway, and so is the camera. So there is only left motion until the 80th frame, after which the upward movement starts, and the system has identified this movement with reasonable accuracy. The graphs in (Fig. 32) clearly show that most of the results are consistent with the actual data, with all positive values representing left (Fig. 32(a)) and upward movement (Fig. 32(b)).

In this video clip, the movement of both the camera and the object in focus is in the same direction, and so the features identified by SIFT in the background help in predicting the correct movement. Another important factor is that the background has a lot of overlapped regions between frames, and hence the matched features have a sufficiently large translational transformation.

Video # 6

illustration not visible in this excerpt

Figure 33: Aircraft Flying over Trees

illustration not visible in this excerpt

Figure 34: Graphs presenting Upward and Straight Motion of an aerial vehicle

6.1.5.6 Video 6: Video of an aircraft flying over trees

This video has been taken from an aerial vehicle, which flies over some trees and passes a monument. The scene contains dense regions in which a large number of local features are found by SIFT, which further helps to predict the direction.

Table 22: Video # 6 Results verification

illustration not visible in this excerpt

The system reflects 100% reliability for this particular sequence. The likely reason is the presence of dense regions and a broader view, which cause frames to have a higher ratio of overlapped regions. The large number of features results in a high matching rate, enabling the system to detect the direction easily along both axes (as shown in Fig. 34(a & b)).

Video # 7

illustration not visible in this excerpt

Figure 36: Graphs presenting Left, right, up and down movement of an aircraft

6.1.5.7 Video 7: Air to Air Video of an aircraft

An on-board camera has been used to make the video of another aircraft flying in front. The system is unable to detect the motion properly. Because both the subject carrying the camera and the object in front of it are moving, it is difficult for the system to resolve the relative motion. If the object is moving upward and the subject remains at its original position, the system considers the motion of the subject as downward because the features identified on the object are moving upward (as shown in Fig. 36). The same problem occurs in the left or right direction. The correctness of the algorithm can be seen in the following table.

Table 23: Video # 7 Results verification

illustration not visible in this excerpt

6.1.6 Application’s appraisal

For tracking a moving object, the use of local features is quite helpful. The feature extractor (SIFT) finds features in two consecutive frames and then matches these points to find the transformation in pixels. The problem arises when the frames have a small percentage of overlapped regions, as happened when the car moved in the upward direction. The features in one frame often do not exist in the next frame, which results in less accurate localization. However, this was not the case for the aerial vehicle, as can be seen in video # 4 (Fig. 36). The reason is that the camera on this aircraft has a broader view, and the overlapping ratio of regions in the frames is much higher than on a road. Therefore, it can be concluded that, in order to make the system more reliable, only the features which occur in overlapped regions should be considered for direction estimation.

7 Chapter: Conclusion and Future Work

In this project, the performance of two state-of-the-art algorithms for feature extraction, along with the two most popular algorithms for identifying clusters of features with similar transformations, has been evaluated. The evaluation shows that SIFT along with RANSAC is the best combination to find and match local features in images with varying imaging conditions and some affine transformations. At the same time, SURF is no worse than SIFT and hence can be used for applications where time efficiency is essential and accuracy is secondary.

The proposed solution of using local features for direction estimation during motion tracking proved successful. In situations where the moving object carrying the camera changes its direction independently of the surroundings, such as a car taking a U-turn or an object turning left or right in a corridor, the system captures its direction very easily. The results show that the prediction is more than 90% accurate. However, if the change in the object’s direction occurs along with a change in the direction of the surroundings, such as when a car travels on a road with soft turns, the system receives unclear information. This happens because the frames captured during this kind of motion have few overlapping regions, due to which the features in the overlapped region are outnumbered by new features. Therefore, by taking the average drift in feature positions, the system wrongly predicts the direction of the motion, and the accuracy lies in the range of 30 to 50%.

Future work will focus on increasing the system’s reliability, for which it is suggested that only the features in the overlapped regions should contribute towards the direction estimation of the moving object. Further, by recording other information, the system will be upgraded to calculate the length, depth, steepness and time duration of the movement of the object.

The system so developed may be able to suggest solutions to various vision problems. One option is to develop a control system for aerial vehicles which can be used for aerial surveillance in high risk areas. Similarly, an automotive warning system can be developed to help drivers in adverse weather and poor visibility conditions. There is also scope for developing intelligent devices which can recognize the environment for blind people.

References

Ballard, D., & Brown, C. (1982). Computer Vision. In Computer Vision (p. Chapter 8). Prentice-Hall.

Bauer, J., Sunderhauf, N., & Protzel, P. (2007). Comparing Several Implementations of Two Recently Published Feature Detectors. International Conference on Intelligent and Autonomous Systems, IAV. Toulouse, France.

Bay, H., Ess, A., Tuytelaars, T., & Luc, G. V. (2006). Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding, (pp. vol 110, No 3).

Bolles, Martin, A. F., & Robert, C. (1981). Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Comm. of ACM.

Brown, M., Szeliski, R., & Winder, S. (2006). Multi-Image Matching using Multi-Scale Oriented Patches. IEEE Conference on Computer Vision.

Canny, J. (1986). A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8, no.6.

Clark, A., & Clark, C. (1999). Performance Characterization in Computer Vision: A Tutorial.

Danny, C., Shane, X., & Enrico, H. Comparison of Local Descriptors for Image Registration of Geometrically-complex 3D Scenes.

Harris, Chris, & Stephens, M. (1988). A Combined Corner and Edge Detector . Proceedings of 4th Alvey Vision Conference .

Lindeberg, T. (1988). Feature Detection with Automatic Scale Selection . IJCV, 30(2): 79-116.

Lowe, D. G. (2004). Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision.

Mikolajczyk, K., & Schmid, C. (October 2005). A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10).

Moravec, H. P. (1977). Towards Automatic Visual Obstacle Avoidance. 5th International Conference on Artificial Intelligence.

Morse, B. S. (1998-2000). Lecture Notes: Edge Detection and Gaussian Related Mathematics. Edinburgh.

Neubeck, A., & Van Gool, L. (2006). Efficient Non-Maximum Suppression. ICPR (pp. 2161-2168).

Oliva, A., Antonio, T., & Monica, S. C. (September 2003). Top Down Control of Visual Attention in Object Detection. IEEE Proceedings of International Conference on Image Processing.

Oxford, U. (n.d.). Affine Covariant Features.

Richard, H., & Andrew, Z. Multiple View Geometry in Computer Vision.

Sim, R., & Dudek, G. (September 1999). Learning and Evaluating Visual Features for Pose Estimation. Proceedings of the Seventh International Conference on Computer Vision (ICCV'99). Kerkyra, Greece.

Smith, S. M., & Brady, J. M. (1997). SUSAN-A New Approach to Low Level Image Processing. International Journal of Computer Vision, 23(1).

Stephen, S., Lowe, D., & Little, J. (October 2002). Global Localization using Distinctive Visual Features. Proceedings of the 2002 IEEE/RSJ Conference on Intelligent Robots and Systems. Lausanne, Switzerland.

Trajkovic, M., & Hedley, M. (1998). Fast Corner Detection. Image and Vision Computing, 16(2).

Valgren, C., & Lilienthal, A. SIFT, SURF and Seasons: Long-term Outdoor Localization Using Local Features.

Viola, P., & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features. IEEE Computer Vision and Pattern Recognition, 1:511-518.

Vittorio, F., Tuytelaars, T., & Gool, L. V. (July 2006). Object Detection by Contour Segment Networks. In Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

Wang, L., Jianbo, S., Gang, S., & I-fan, S. (2007). Object Detection Combining Recognition and Segmentation. Computer Vision, ACCV.

Appendix A: Graphs Comparing RANSAC and Hough Transform

illustration not visible in this excerpt

[...]


[1] http://www.robots.ox.ac.uk/~km/

[2] http://www.cs.ubc.ca/~lowe/keypoints/

[3] http://vision.ucla.edu/~vedaldi/code/siftpp/siftpp.html

[4] http://ltilib.sourceforge.net

[5] http://www.vision.ee.ethz.ch/~surf/

[6] http://web.engr.oregonstate.edu/~hess/

[7] http://code.google.com/p/opensurf1/

[8] http://www.proaerialvideo.com/categories.php

Details

Pages: 92
Year: 2009
ISBN (Book): 9783656969051
File size: 6.6 MB
Language: English
Catalog Number: v299263
Institution / College: University of Essex
Grade: Distinction
Tags: motion tracking video best feature extraction technique
