Analysis of Feature Point Distributions for Fast Image Mosaicking Algorithms

 Composition of image mosaics  Fast image mosaicking algorithms are usually based on feature-based methods processing sparse sets of interest points.  Distinctive image features are extracted and matched using similarity measurements.  Many different feature detectors, showing robust matching, high repeatability, and precise alignment, have been proposed: SIFT, SURF, GLOH, MOPs, ...  Problem: Despite of high detector performance, feature point clusters with non-uniform spatial distribution, resulting from local contrast variability, lead to large global image registration errors.  Medical application: Endoscopic images of the internal urinary bladder wall show often only sparse located structures with high contrast, like vasculature or lesions.  Image mosaicking algorithms can only process a limited number of feature points in real-time.  Detector must distribute sparse data set of feature points uniformly across whole image to ensure low registration error.  We analyzed different feature distribution algorithms and evaluated their average image registration errors.


Introduction
Alignment and stitching of images into seamless photomosaics is most widely used in computer vision [1].Besides the approach to minimize pixel-to-pixel dissimilarities directly, a common way to combine image pairs into one image composition is performed using only sparse sets of interest points.Here, distinctive feature points are first extracted and described from the images, and then matched using a similarity measurement.Many different feature-based algorithms, like SIFT [2], SURF [3], GLOH [4], MOPs [5] or [6] have been proposed for extracting distinctive image features.Their robustness, repeatability and invariance to different illumination changes and image transformations have also been widely evaluated by [4,7,8,9].
On the one hand, a high level of distinctiveness of the features, expressed by their descriptors and the filter response values of the detector, leads to robust matching and precise image alignment.On the other hand, these characteristics alone do not always result in a low registration error, since the spatial feature distribution in the images is also relevant.Images which show only a high contrast in local image regions (see Fig. 1) usually create only clusters of strong features and provide an overall non-uniform spatial feature distribution.Since the estimation of a global image transformation between a given image pair is based on matched feature points, image regions with many interest points will create only small registration errors, whereas the error in regions with feature clusters becomes larger.Thus, only the combination of distinctive and also spatially well uniformed distributed feature points leads to good global image transformation with low registration errors.While in many natural image scenes feature points can usually be extracted across the whole image, spatial distribution is often a relevant consideration for medical images, e.g.endoscopic images of the internal urinary bladder wall.These images often show only sparsely located structures with high contrast, e.g.vasculature or lesions (see Fig. 1), and thus impede robust image mosaicking.During cystoscopy, where a rigid endoscope is introduced into the bladder, an inspection and analysis of bladder cancer, the 4th most common disease with about 52810 new cases among males in the United States estimated in 2009 [10], is performed by a physician based on endoscopic images captured by a video endoscope system.Since sufficient illumination of the bladder wall is only provided if the endoscope is guided close to the bladder wall, the field of view is very small.This results in difficulties in orientation and navigation for the physician.To provide navigation assistance during cystoscopy, a feature-based mosaicking algorithm composing local or global panoramic overview images from single video images has been proposed by [11,12,13,14].For this application, a fixed number of feature points is required for real-time processing [14].It is therefore necessary to generate an adequate sample set of the extracted features list for robust image mosaicking.
While some registration algorithms have been developed that create feature distributions based on local non-maximum suppression [5] or spacepartitioning [15], these different techniques have not been evaluated.Thus, in this paper we analyze these different feature distribution algorithms and evaluate their average image registration errors.
The paper is organized as follows: In section 2 the mosaicking algorithm is described.The distribution algorithms for selecting a feature sample set are described in section 3. Section 4 presents the evaluated data, and the conclusions and future work are given in section 5.

Mosaicking algorithm
In the following sections only a brief overview is given of the mosaicking algorithm.A more detailed description can be found in [12,14].During the mosaicking process, image pairs of a captured video sequence are sequentially stitched and blended to compose a successively growing panoramic overview image.When endoscopic images are used, the lens distortions of the fish-eye optics are compensated in a preprocessing step.Then, distinctive feature points are extracted and described by their local neighborhood region.Based on these sparse feature sets, point correspondences are matched between two subsequent images.Thus, a global image-to-image transformation, a so-called homography, is estimated.The two images are registered by applying the transformation.Finally, the overlap regions are blended by an interpolation algorithm.Iteratively, all images of the video sequence are then sequentially processed to build the final panoramic overview image.

Feature detection
Feature extraction is performed by the Speeded Up Robust Features (SURF) detector [3].Based on a very basic approximation of the Hessian matrix, the feature strength is calculated by where L xx represents the convolution of the Gaussian second order derivative ∂ 2 ∂x 2 G with the input image.Instead of iteratively reducing the image size, integral images and scalable box filters D xx , D yy , D xy are used to speed up the blob filter response.The distinctive feature points are localized in the space domain and over scales by non-maximum suppression.Feature descriptors are then extracted in a square region centered around the point of interest.Split up into smaller 4 × 4 square sub-regions, the Haar wavelet responses in horizontal h x and vertical direction h y are calculated and composed to a four-dimensional descriptor vector The concatenation of the sixteen sub-regions then results in a descriptor vector d of length 64.

Matching and registration
The point correspondences between two images are determined by matching the SURF feature descriptors.Based on the least similarity measurement a feature descriptor d i of the first image is compared to all feature descriptors d j of the second image within the 64 dimensional feature space, and vice versa.The minimum squared error Δ d ij then leads to the best point match d i ↔ d j .To increase the distinctiveness and robustness of the point correspondences, only matches with the ratio between the best and the second best match are considered.
After single points of an image pair are matched, the homography H is estimated.The applied affine transformation model provides six degrees of freedom, parameterized by a translation vector t T = (t x , t y ), rotation angle α, scales s x and s y , and skew a.In homogeneous coordinates, the homography matrix H can be written as with To ensure robust homography estimation, unavoidable false point correspondences are rejected by the RANSAC (RANdom SAmple Consensus) model fitting algorithm [16].Based on iteratively and randomly selected point correspondences p, a homography Ĥ is estimated and the number of inliers is determined by the threshold operation If point correspondences p i ↔ p j satisfy eq.6 with estimation Ĥ and the given pixel distance d, they are marked as inliers.Finally, the estimated homography Ĥ with the highest number of inliers is selected as the final image transform and is used for registration by warping the second image into the reference system of the first image.

Feature selection
Since the computational complexity of the SURF detector increases linearly with number of feature points, the limitation to a fixed number is often desired for real-time mosaicking algorithms [14].Also calculating of the feature descriptors is computationally more intensive than locating the feature and the filter response value itself (eq.1).Thus, feature extraction is separated into two steps.First, the SURF feature detector is applied to the whole image to detect the locations and the strengths of potential interest points based on the filter response values.Then a sample set with a fixed number of the points is chosen from all the point candidates.To assure constant computational complexity, only these features are then characterized by the 64-dimensional SURF descriptor and passed to the matching process.Since the usage of fewer feature points leads to less robust homography estimations and registration results, the selection of an adequate and representative sample set is very important.

Top N selection
A straightforward approach is to select the first N -th strongest feature points from the whole set of interest points.For each feature point, the location and the response value of the blob filter is calculated according to eq. 1, and is used for comparison.After the extracted interest points have been sorted in descending order of their filter response values, the first N -th feature points are selected.Since this algorithm generates only a sample set based on the filter response, no additional information about the spatial distribution of the feature points is exploited.

Adaptive non-maximal suppression
To select a fixed number of interest points from images which are local maxima and whose response values are also significantly greater than those of all of their neighbors within radius r, Brown et al. [5] developed an adaptive non-maximal suppression (ANMS) strategy.Conceptually, this can be performed by initializing the suppression radius r = 0 and then increasing it until the desired number of N feature points is obtained.In practice, this operation can be done by first sorting all local maxima by their response values, and then creating a second list sorted by decreasing suppression radius.The first entry in the list represents the global maximum, which is not suppressed at any radius.As the radius decreases from infinity, feature points are added to the list.To increase the robustness, a second constraint is defined, which requires that a neighbor feature has sufficiently greater strength.Thus, the minimum suppression radius r i is determined by where x i describes the position (x, y) of the feature point, f ( x i ) its strength, and I the set of all feature point positions.The parameter c = 0.9 represents the robustness factor, which adjusts how significantly greater the strength of a neighbor feature must be for suppression to take place.For each feature point its radius is then determined by eq. 7, and the feature list is sorted by the radius in descending order.Then, the first N entries of the list are selected.This sample set now provides features which are spatially distributed across the whole image (cf.Fig. 4 (right column)).

k-d tree partitioning
Another selection method that considers the spatial data distribution is based on space-partitioning of high-dimensional data sets.Cheng et al. [15] developed an algorithm using a 2-dimensional k-d tree.
Here, the feature points are separated into M rectangular image regions and cells, respectively.In a recursive manner, each partition cuts the region with the current highest variance in two, using the median data value as the dividing line.The position of the dividing line is stored as the node's data.For each nonterminal node of the binary tree, the feature points of the related region are then handed over to the left and right child node, respectively, until the given number of M cells is obtained.Finally, each leaf node contains a list of the features which are located within the related image partition.An example is given in Fig. 2. From each cell N/M , the strongest features are selected as the output sample set, cf.Fig. 4 (middle column).While Cheng [15] uses a balanced tree, which limits the number of leaf nodes to be a power of two, we have extended the algorithm to handle any integer number of points.Since only one feature is taken from each cell, the partitions are divided until the desired number of features N is obtained.Thus, a direct comparison between k-d tree partitioning, ANMS, and top N selection becomes more feasible.

Evaluation and results
The three feature distribution algorithms (top N , ANMS, and k-d tree) are evaluated using three data sets (Fig. 3) which show varying contrast across the whole image.Each image pair represents one scene from two slightly different point of views, but still providing a large overlap region.Since the images are not synthesized, no ground truth data of the correct image composition is given.Thus, for each image pair control points p i ↔ p j are manually set and matched.Based on these points, a homography is also estimated according to eq. 4, and the registration error is calculated.These error values are used as references in later evaluations.
To evaluate different feature distributions, SURF features are first extracted and listed for each data set.Each algorithm of section 3 is then applied to the point lists to generate a sample set of only N feature points, resulting in different point distributions.Representative examples for each data set are shown in Fig. 4. The figure shows, that the selection of the top N strongest features leads to single point clusters in all three data sets.In example, many SURF features show a high filter response value in the tree region of the tree-clinic sequence, since the dark branches have a high contrast against the bright background.By contrast, the k-d tree and also the ANMS algorithm provide a more spatially uniform distributed set of feature points across the whole image and the relevant overlap regions, respectively.As shown in Fig. 4(c), more feature points are also located in the vasculature in the image center in the case of k-d tree partitioning and ANMS.Since this region is also present in the overlap region of the bladder sequence (cf.Fig. 3), consideration of these features during the homography estimation should provide a small registration error.
After the different feature lists have been generated, they are passed to the matching process, and a global image-to-image transformation for each sample set is estimated, as described in section 2.2.For a quantitative evaluation, the homographies are then applied to the ground truth control points p i ↔ p j and the final registration errors are calculated using eq.8. Fig. 5 shows the reproduction error characteristics for each distribution method over the selected number of feature points.Since the RANSAC algorithm provides robust point matches and rejects outliers on the basis of a random process, mean error values are calculated.Thus, Fig. 5 shows the averaged registration errors and the number of inliers over the number of feature points for each data set.The results of the tree-clinic and houses sequence are averaged by 15 measurements, and the graphs of the bladder sequence are based on 20 measurements.
As can be seen from the characteristics, the mean registration errors calculated from feature points distributed by k-d tree partitioning and ANMS are usually smaller than with the top N method.Especially for a small number of points, the error difference increases.At higher numbers, the different algorithms provide almost the same registration errors.This is obvious, since a larger point set usually leads to a spatial distribution with a higher variance.The results of the tree-clinic sequence also show that robust matching of the first N strongest features is not feasible until at least a minimum number of N = 210 feature points tree-clinic sequence with 200 features points houses sequence with 50 features points bladder sequence with 80 features points is obtained, but still resulting in high error.By contrast, the variance of the mean errors of the k-d tree and ANMS sample sets is much smaller.Although the registration errors of k-d tree partitioning and ANMS are almost equal in all image sequences, the number of inliers used for robust homography estimation is higher in the case of ANMS.Against this, the complexity of ANMS with up to O((N − 1)!) is much higher than O(log N )) of the k-d tree.Thus, the integration of the k-d tree partitioning in fast and real-time image mosaicking algorithms like [14] should be preferred.

Conclusions
An evaluation of the ANMS and k-d tree algorithms, which provide a spatially well uniformed feature distribution, has shown that the mean error of an image pair registration can be greatly reduced, compared to a sample set based on strongest feature selection.This difference becomes significant in images showing regions of different varying contrast.For appli-cations that require a fixed or limited number of interest points, e.g.real-time image mosaicking algorithms for medical computer assistance systems, the usage of feature distribution algorithms is also beneficial.In future work, further evaluations will be made on more medical image data sets and time measurements of the distribution algorithms integrated in a real-time image mosaicking algorithm for fluorescence endoscopy.The generation of feature point sets based on an adaptive algorithm considering the feature strengths and their distribution according to the current image information will also be analyzed.

Fig. 1 :
Fig. 1: Left: Landscape with a high contrast region in the lower right corner.Right: A noisy and low contrast endoscopic bladder image showing a lesion

Fig. 4 :
Fig. 4: For each data set top N selection (left), k-d tree partitioning (middle), and ANMS (right) is performed

Fig. 5 :
Fig. 5: Mean registration errors (black) and numbers of inliers (red) of each distribution method (Top N, ANMS, k-d Tree) over the number of feature points.The registration errors using the ground truth homography are highlighted in green