Application of Computer Vision Methods and Algorithms in Documentation of Cultural Heritage

The main task of this paper is to describe methods and algorithms used in computer vision for fully automatic reconstruction of exterior orientation in ordered and unordered sets of images captured by digital calibrated cameras without prior informations about camera positions or scene structure. Attention will be paid to the SIFT interest operator for finding key points clearly describing the image areas with respect to scale and rotation, so that these areas could be compared to the regions in other images. There will also be discussed methods of matching key points, calculation of the relative orientation and strategy of linking sub-models to estimate the parameters entering complex bundle adjustment. The paper also compares the results achieved with above system with the results obtained by standard photogrammetric methods in processing of project documentation for reconstruction of the Žinkovy castle.


Introduction
Images captured by digital cameras are one of the most important form of information in documentation of cultural heritage.Effective assignment of camera pose in space is necessary for consequential usage for measuring purposes.The automatic process of finding exterior orientation can be divided to three main tasks: Key point finding and matching, relative orientation and bundle adjustment.Our paper presents practical experiment of such procedure.

SIFT -scale invariant feature transform
The initial phase during comparing and relative orientation of two images is to choose characteristic or key points in images.The key point should by no mean characterize the image area so that this area could be reliably found and compared with the same area in different image.By finding corresponding points in both images a correspondent couple (correspondence) is defined.For detection and comparison of significant points in the image SIFT operator was Káňa, D., Hanzl V.: Application of Computer Vision Methods . . .chosen as the most convenient detector.This detector is unlike simple correlation between two areas in the images with the size of for example 21x 21 pixels partly invariant toward view geometry change therefore rotation (circa 15 degrees), scale change and is partly invariant to noise as well.SIFT detector is based on searching for extremities in the images by finding differences among images incurred by the convolution of image function I(x, y) and Gauss filter G(x, y, δ) with variable values of sigma.
Exact procedure is described for example in [2].
Figure 1: Blurred images and their differences

Finding extremities
Single images with a different degree of blurring are subtracted each other and difference imagess would come into being.These differences are evidently approximation of the second derivation of the image function I(x, y) and serve to detect local extremities.After creation of differential images (Fig. 1) each pixel value is compared with six neighbouring pixels in the image and nine neighbouring pixels in the image with blurring partly a level higher and partly a level lower.If the value of the tested pixel is the lowest or eventually the highest out of all the neighbouring pixels, this pixel is chosen as possible key point.Once the candidate for the key point is found by comparison with its neighbours it is necessary to decide about its stability and therefore about possible denial on the bases of information about its location, scale and a rate of main curvatures.This information enables effective removal of nondesired points in low contrast areas.For each point is its surrounding in the range of 3x3 pixels approximated by the three dimensional quadratic function.Consequently is found maximum or eventually minimum of this function that defines the exact location of the key point with subpixel accuracy.The points along the edges are according to curvature diameters rate in two perpendicular directions removed as well.

Key point descriptor
The parameters defining the location, scale and orientation of the key points also define auxiliary coordinate system, to which descriptor can be expressed clearly describing the area around key point location.Descriptor is based on image gradients and their orientation in a region surrounding key point, where area 16 x 16 pixels is divided into 16 blocks of 4 x 4 pixels.For every block is created histogram of eight meaningful orientations weighted by gradient magnitude using Gaussian function.Subsequently by ordering and normalizing those 16 partial histograms containing 8 intervals descriptor, vector of dimension 128 is formed.
For key points detection an implementation [1] was used, for another acceleration of computing an implementation with usage of hardware acceleration on CUDA [4] platform would be tested.

Finding Correspondences
If there are detected key points in each image counting descriptors, we can approach to pairing and finding of corresponding point couples -correspondences, which came into being by projection of point in three dimensional space to both images.The rate of agreement of the two key points is determined on the basis of their SIFT descriptor vector's Euclidean distance in two ways.Because we do not have any information about image sequence it is necessary to compare them by each to each method.

Symmetric pairing
To every key point in the first image we can find a point with the nearest descriptor distance in the second image.In the second image is to each key point found the nearest key point in the first image as well.As potential pair of corresponding points is labeled the one where mutual agreement is in both comparisons.

Distance ratio test
Lowe [2] recommends testifying the agreement of the key points by rate of descriptor Euclidean distance to the first and second nearest point.As limiting he offers distance ratio of 0.6.It is obvious that this criterion can fulfill more key points.For correspondence's stability and reliability reasons it is convenient not to count on these multiple correspondences in farther calculations.This test proofed as not much convenient with markedly repeating texture (for example the facade of panel house images).

Fundamental and Essential Matrix
The correspondences set obtained by above mentioned procedures is however loaded with errors and false correspondences, which arouse from a change of camera location, change of lighting, digital image noise and so on.These false correspondences could be eliminated by usage of geometric criterion -epipolar condition.
The X point together with projection centers C and C' forms epipolar plane in 3D space.By epipolar plane intesection with projection planes are formed epipolar lines -epipolars.And points x and x', which are projections of X point into projection plane, lies on this epipolar lines.This lines also pass thru epipoles e and e' where epipole is projection of first camera projection center to the projection plane of the second camera.Algebraic formulation of the epipolar condition is equation ( 2 Where F is fundamental matrix size is 3 x 3 and rank of this matrix is 2.This fundamental matrix defines relation between two cameras without dependency on scene structure.For calculation it is not necessary to know cameras interior orientation parameters.

Fundamental matrix computation using correspondences
Matrix F has seven degrees of freedom so minimal number of correspondences is seven.Number of solutions can vary from one to three.Due to the numeric stability is more optimal to use eight-point algorithm and find the best solution using SVD decomposition.Formula (2) defines relation between two corresponding points.Providing F: We can obtain one linear equation for each correspondence: For n correspondences we obtain homogeneous system of linear equations in following form: Solution which minimizes distances of projected points to epipolar lines is such a vector f where Ax is minimal and f = 1.

If UDV
T is SVD decomposition of matrix A, the solution is vector corresponding to the smallest A matrix singular value which is the last column of V matrix.To meet real fundamentality, the F matrix should have rank of 2. Due to noise and small inaccuracies, computed matrix has usually rank of 3.For conversion to rank of 2 SVD decomposition is used again, where last eigenvalue is set to zero and matrixes formed by decomposition are multiplied.
Out of numeric stability reason it is good to reduce input pixel coordinates towards the center of gravity and normalize.For successful result it is necessary that 3D points which projections are used for computation of the fundamental matrix must not lie in one plane.Described eight point algorithm is linear, nonlinear solution can be found for example in [5].

RANSAC -RAndom SAample Consensus
For key points selection meeting epipolar condition the RANSAC algorithm was used.This algorithm enables to find the best solution suiting given model iteratively.In our case x' F x = 0. Comparing to results obtained by using least mean squares (LMS) method which doesn't take into account blunders and mistakes, results obtained using RANSAC are more consistent.Model (fundamental matrix) is computed in every iteration from randomly selected sample formed by 8 pairs of corresponding key points.Consequently the rest of this points not contained in selected sample is tested against computed model.If pre-specified percentage of key points meets the model parameters and total model error is at the same time smaller than error obtained in previous iteration, given selection is marked as the best obtained selection.In other case is computed model rejected.The computation can be terminated after a specified number of iterations or after reaching the minimal threshold value of error from the statistically significant part of the correspondences.The last step is to determine the fundamental matrix using the SVD method from the best selection (5,6).

Essential Matrix and Relative Orientation
Provided the fundamental matrix F is computed and calibration matrixes of both cameras are known it is possible to formulate equation for essential matrix computation which defines relative position of cameras regardless of scale.
If only relative orientation is solved it is possible to set up first camera projection center into coordinate system origin and its rotation matrix as identity matrix.Projection matrix of the first camera can be formulated: P = [I | 0] And for the second camera: where R is rotation matrix and t is translation vector regardless the scale.
Rotation and translation we can determine by SVD decomposition of the essential matrix E in following way: Mentioned decomposition has four possible solutions.Valid solution is selected by projection depth testing, where projection depth is computed for any point and checked whether it is positive for both cameras.Relative orientation is computed for any image pair combination containing specified minimal number of correspondences.As sufficient number regarding noise appears to be 16 correspondences.In the case of unordered collection of images where no prior parameters are known, the images are compared each to each and the total number of comparisons is O(n 2 ), where n means number of images.In our practical experiment we selected 25 images and 300 possible combinations were tested.For larger images sets is convenient to use parallel computations.In case of image sequence is suitable due to computation complexity limit the number of compared images to 3 -5.

Sparse Bundle Adjustment
Based on the computed relative orientations between single images we can build an approximate scene model in relative coordinates that serve as an estimate of input parameters entering into complex bundle adjustment.As initializing pair is selected a pair of images with the largest number of correspondences (inliers).The objective of the bundle adjustment is the reprojection error minimalization, when the corrections are assigned both to three dimensional points and parameters of outer or possibly inner orientation of single cameras. min Where d(P j (X i ), x ij ) 2 marks the square of Euclidean distance between the predicted projection P j of the point X i in three dimensional global coordinate system and real projection x ij to image j.
The base of SBA algorithm (Sparse Bundle Adjustment) is implementation of nonlinear Levenberg-Marquart algorithm where local function minimum is sought for point X projection in global coordinates into the image coordinates using a combination of Gauss-Newton method and the method of the steepest descent.If the input estimate of parameters is too far from the f function local minimum, the algorithm behavior is similar to the steepest descent method which guarantees at least slow convergence.In the case of moving towards local minimum Gauss-Newton method is used to guarantee fast convergence.
The main advantage of the SBA software package [7] is the usage of optimized memory structures for sparse matrices storing arising from the normal form equations.Without the use of sparse matrices the solution would be (when having hundreds of thousands unknown images) out of extreme memory and computation cost almost impossible.

Experimental Results
The input selection contained 25 images of Žinkovy castle facade (Fig. 5) shot by digital camera Olympus, the computations were performed on a standard PC with Intel Core 2 Duo processor.Algorithms were implemented in C/C++ language.Based on the results of the lens calibration the distorsions were neglected.
The original images of size 3648 x 2736 pixels were modified due memory cost of key points computation to 1204 x 903 size.The results of the bundle adjustment were the absolute orientations of images in a relative coordinate system, which was defined by the first pair of images.The coordinates of projection centers were also independently determined in Intergraph ISAT software using ground control points in geodetic system.To compare the accuracy of automatic orientation, transformation

Summary
By above mentioned methods the absolute orientations of projection centers in a relative coordinate system using image descriptors were fully automatically determined.Parameters obtained in this way can be easily transformed into geodetic system using ground control points and ideally these parameters can directly serve as input to other tasks such as generating of orthogonalized mosaic of facade images.In less ideal case we can obtain the input estimates of parameters for another advanced calculations and decrease time and work difficulty of the absolute orientation procedure.
Inaccuracies can be attributed to both numerical instability of algorithms and the insufficient calibration of the lens.
Other options for achieving better results: 1. Using of GPU for reducing time complexity 2. Selection of key points only in a particular part of images (Gruber areas).

Figure 2 :
Figure 2: Key points and their orientations

Table 1 :
Káňa, D., Hanzl V.: Application of Computer Vision Methods . . .Differences between coordinates of projection centers computed in Intergraph ISAT and coordinates obtained by automatic process