Projective 3D-reconstruction of uncalibrated endoscopic images

The most common medical diagnostic method for urinary bladder cancer is cystoscopy. This inspection of the bladder is performed with a rigid endoscope, which is usually guided close to the bladder wall. This results in a very limited field of view; the difficulty of navigation is aggravated by the usage of angled endoscopes. These factors cause difficulties in orientation and visual control. To overcome this problem, this paper presents a method for extracting 3D information from uncalibrated endoscopic image sequences and for reconstructing the scene content. The method uses the SURF algorithm to extract features from the images and relates the images by an advanced matching scheme. To stabilize the matching, the epipolar geometry is extracted for each image pair using a modified RANSAC algorithm. Afterwards, these matched point pairs are used to generate point triplets over three images and to describe the trifocal geometry. The 3D scene points are determined by applying triangulation to the matched image points. These points are then used to generate a projective 3D reconstruction of the scene, and provide the first step towards a metric reconstruction.


Introduction
With about 68,810 new cases in the United States in 2008 [1], bladder cancer is a common disease of the urinary system. Tumors are usually inspected and treated by endoscopic interventions. A urological intervention of the bladder and urethra is also called cystoscopy. The cystoscope is inserted into the bladder through the urethra, which allows an inspection of the bladder wall. The inspection is usually performed close to the bladder wall, which is why the field of view is very limited. Possible ways to improve the difficult orientation include image mosaicking algorithms [2], which provide a panoramic overview image, or the generation of a 3D model of the bladder. This paper presents a method for extracting 3D information from an uncalibrated endoscopic image sequence, which is then used for a projective 3D reconstruction of the bladder. In further steps, this information can be used for an auto-calibration of the camera, which leads to the desired metric reconstruction.
The paper is organized as follows: Section 2 describes the image preprocessing, the mathematical background and the reconstruction algorithms. Section 3 presents the evaluated image sequences and the results. Finally, section 4 summarizes the results and gives prospects for future work.

Imaging
The image sequences are acquired by a rigid video endoscope system, in this case an Olympus EVIS EXERA II platform. A CCD camera is attached to the ocular of the cystoscope and delivers the data to a workstation. To illuminate the organ, a light source is coupled into the cystoscope. To increase the field of view, endoscopes usually have fish-eye optics. A typical setup is shown in fig. 1. The RealTimeFrame software framework [3] is used for real-time processing of the endoscopic data. This software allows rapid prototyping of algorithms.

Distortion correction
Cystoscope optics produce a strongly radially distorted image, which has to be corrected to extract valid 3D information. To compensate for this distortion, the method of Hartley and Kang [4] is applied to each image in a preprocessing step. The radial distortion is modeled by the function

x_u = z + λ(r) (x_d − z),  r = ‖x_d − z‖   (1)

with distorted point x_d, center of distortion z, a correction function λ(r) depending on the radius, and corrected point x_u. λ(r) is not based on a fixed model function but is dynamically determined from the data, resulting in a very precise distortion correction. An example using the implementation from [8] is shown in fig. 2.
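As an illustration of how eq. 1 would be applied per frame, the following minimal numpy sketch undistorts a set of points. The correction function lambda_of_r is a hypothetical callable assumed to have been estimated beforehand (in [4] it is determined from the data rather than from a fixed model):

```python
import numpy as np

def undistort_points(x_d, z, lambda_of_r):
    """Apply the radial model of eq. 1: x_u = z + lambda(r) * (x_d - z).

    x_d: (N, 2) distorted points, z: (2,) center of distortion,
    lambda_of_r: callable returning the correction factor for each radius
    (assumed to be estimated beforehand, e.g. as an interpolation table).
    """
    d = x_d - z                    # offsets from the center of distortion
    r = np.linalg.norm(d, axis=1)  # radius of each distorted point
    return z + lambda_of_r(r)[:, None] * d
```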

Feature detection
Feature detection is accomplished by the SURF algorithm [5], which extracts and describes distinctive points in each image independent of their scale, position and rotation. To detect points of interest, a Hessian matrix containing the approximated second-order partial derivatives of a Gaussian function (see fig. 3) is used. The extracted features are described by an analysis of the surrounding area via Haar wavelets. The results are stored in the descriptor vectors of the features. A simple comparison of two features can be made by summing the absolute differences of these vectors.
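For illustration, SURF extraction might look as follows in Python. This is a sketch, not the authors' implementation; it assumes an opencv-contrib build in which the patented xfeatures2d module is available, and the Hessian threshold of 400 is an arbitrary example value:

```python
import cv2

# assumes opencv-contrib-python with the (patented) xfeatures2d module
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
# keypoints carry position, scale and orientation; the descriptors are
# the Haar-wavelet response vectors used later for matching
keypoints, descriptors = surf.detectAndCompute(img, None)
```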

2-View correspondences
A simple method for generating correspondences over two images is brute-force matching. During this process, each feature f_1 from the first image is compared with every feature f_2 from the second image, and the f_2 that minimizes the difference between the feature vectors is chosen. A correspondence is identified if the difference is less than a given threshold. This process usually results in a high number of wrong correspondences. Thus, an advanced matching scheme is applied.
In addition to forward matching, where the best f_2 for each f_1 is chosen, backward matching is applied by choosing the best f_1 for each f_2. A correspondence is identified only if these two matchings agree. To improve robustness, a slight restriction of the scale and orientation of the features to a factor of two and 45°, respectively, is also applied. This assumption is valid since the position of the endoscope does not change much between two sequential images in a real bladder inspection. The last check tests whether the detected best correspondence is reliable, by comparing its distance d_best with the distance d_2ndbest of the second-best candidate: the match is accepted only if the ratio d_best / d_2ndbest is below a threshold τ.
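A minimal sketch of this advanced matching, assuming SURF descriptors and keypoints as produced above; the value τ = 0.7 is an assumed example, not taken from the paper:

```python
import numpy as np

def advanced_match(desc1, desc2, kp1, kp2, tau=0.7):
    """Forward-backward matching with ratio test and scale/orientation gating.

    desc1, desc2: descriptor arrays; kp1, kp2: lists of cv2.KeyPoint.
    """
    # brute-force pairwise L2 distances between all descriptor vectors
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    fwd = d.argmin(axis=1)            # best f2 for each f1
    bwd = d.argmin(axis=0)            # best f1 for each f2
    matches = []
    for i, j in enumerate(fwd):
        if bwd[j] != i:               # forward and backward must agree
            continue
        row = np.sort(d[i])
        if row[0] / row[1] > tau:     # best match not distinctive enough
            continue
        # restrict scale change to a factor of two and rotation to 45 deg
        if not (0.5 <= kp1[i].size / kp2[j].size <= 2.0):
            continue
        if abs((kp1[i].angle - kp2[j].angle + 180) % 360 - 180) > 45:
            continue
        matches.append((i, j))
    return matches
```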

Epipolar geometry
Epipolar geometry describes the setup of two cameras looking at the same scene from different points of view. While this section only covers the basic fundamentals, more details can be found in [14]. An exemplary camera setup showing the camera centers C and C′ is given in fig. 4. The 3D point X is projected onto the image planes, resulting in the points x and x′. The points e and e′ are called epipoles; they represent the projected camera centers on the image planes. The position, orientation and properties of the two cameras are described by the fundamental matrix F. It has seven degrees of freedom and rank 2. Only the intrinsic geometry is described, which is why the fundamental matrix is independent of the scene content. A central equation for understanding epipolar geometry is the epipolar relation

x′ᵀ F x = 0,   (2)

which connects an image point on one image plane with its corresponding point on the other image plane. Epipolar lines for each image point can be calculated using the fundamental matrix. Such a line passes through the position of the corresponding point on the other image plane, and all epipolar lines intersect in the epipoles. They can be calculated by

l′ = F x,  l = Fᵀ x′.   (3)

A geometric interpretation of eq. 3 is visualized in fig. 5. The different frames of the sequence are interpreted as different views. This is valid if there is a camera movement between the frames and if the scene is stationary.
The RANSAC algorithm [6] is applied to estimate the fundamental matrix. This algorithm draws a random sample of correspondences and calculates a fundamental matrix from it. Using backprojection, each corresponding point pair is classified as an inlier or an outlier. To be classified as an inlier, the reprojection error of a point pair has to be smaller than a threshold. For an acceptable computation time, the Sampson approximation [9] is used to determine the error. This process is repeated iteratively for other random samples. Finally, the inliers of the fundamental matrix that yields the largest number of inliers are chosen to calculate the final fundamental matrix, and all outliers are eliminated. The RANSAC algorithm uses the 7-point algorithm to calculate the sample fundamental matrices. The final matrix is estimated using the 8-point algorithm [10], which can handle eight or more points and provides a least-squares solution. Both algorithms solve a system of equations constructed from eq. 2.
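The two-stage scheme can be sketched with OpenCV's built-in estimator standing in for the authors' own implementation; the 1-pixel threshold and 0.99 confidence are assumed values:

```python
import cv2
import numpy as np

def estimate_fundamental(pts1, pts2, thresh=1.0):
    """RANSAC estimation followed by an 8-point least-squares refit.

    pts1, pts2: (N, 2) float arrays of matched points."""
    # sample-and-verify loop, analogous to the scheme described above
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, thresh, 0.99)
    inl = mask.ravel() == 1
    # final matrix: least-squares 8-point solution on all inliers
    F, _ = cv2.findFundamentalMat(pts1[inl], pts2[inl], cv2.FM_8POINT)
    return F, inl
```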

2-View camera matrices
To perform a reconstruction, at least two camera matrices are required. If the first camera is chosen as P = [I | 0], the second camera matrix is defined by

P′ = [[e′]× F + e′ vᵀ | λ e′].   (4)

The fundamental matrix F and the epipole e′ are known, but the scalar λ and the 3-vector v are unknown. Correspondingly, there are four degrees of freedom in the choice of the second camera. Therefore, a scene reconstruction based on eq. 4 is subject to a projective transformation compared to the original scene, as shown in [11]. Fig. 6 shows an example. Without any camera calibration or additional scene information, no metric reconstruction from two views is feasible.
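A sketch of such a canonical camera pair, using the common choice v = 0 and λ = 1 in eq. 4 (the particular choice only fixes the projective frame); [e′]× denotes the skew-symmetric cross-product matrix:

```python
import numpy as np

def skew(a):
    """Skew-symmetric cross-product matrix [a]_x."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def canonical_cameras(F):
    """Camera pair from F via eq. 4 with v = 0 and lambda = 1."""
    # the epipole e' spans the left null space of F: e'^T F = 0
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([skew(e2) @ F, e2[:, None]])
    return P1, P2
```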

3-View correspondences
It seems straightforward to connect two matched image pairs sharing one common image into an image triplet using the two-view correspondences. In practice, however, the SURF algorithm does not always detect identical features in all three images, because of view changes and noisy image data. Additionally, not all correspondences can be identified. This results in situations where not every feature can be retrieved in all images, as visualized in fig. 7. As can easily be seen, only a small number of correspondences share the same corresponding point in image i + 1, indicated by surrounding circles. To increase the number of correspondences over three images, an additional matching process from the first to the third view is performed. This step may introduce new incorrect matches, which have to be considered. Thus, an advanced RANSAC algorithm is used to join the set of tracked correspondences and the set of directly matched correspondences, and to detect outliers. The set of tracked correspondences contains a high proportion of valid ones. To benefit from this fact, the RANSAC algorithm fills its samples with a higher probability from the tracked set than from the directly matched set of correspondences. But even though the tracked correspondences are verified by epipolar matching, they should not be trusted unconditionally, because they can still be wrong, as fig. 8 shows. Both x′₁ and x′₂ are located on the epipolar line corresponding to the point x₁. This means the epipolar relation is fulfilled, and a correspondence between x₁ and x′₁ or between x₁ and x′₂ appears correct in the second view. Only in the third view is it possible to identify the wrong correspondence between x₁ and x′₂.
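The tracking step that produces the initial triplets can be sketched as a simple chaining of two-view match lists; matches are assumed to be (index, index) pairs as returned by the matching sketch above:

```python
def chain_triplets(matches12, matches23):
    """Chain two-view matches (i, j) and (j, k) into triplets (i, j, k)
    that share the same feature j in the common middle view."""
    lookup = dict(matches23)  # feature index in view 2 -> index in view 3
    return [(i, j, lookup[j]) for i, j in matches12 if j in lookup]
```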

Trifocal geometry
The mathematical description of trifocal geometry uses tensor notation. A good introduction to this topic can be found in [7] and [14]. In this paper, tensors are written in calligraphic letters, and the Einstein summation convention is used.
Trifocal geometry describes the setup of three different views of the same scene. Like epipolar geometry, trifocal geometry is intrinsic and thus independent of the scene content. A sample configuration is shown in fig. 8 with the camera centers C, C′ and C″. The 3D point X₁ is projected to the image points x₁, x′₁ and x″₁. The epipoles e′ and e″ represent the projected camera center of the first camera on the image planes of the second and third view. The first camera is defined by P = [I | 0], whereby the second camera is P′ = [A | e′] and the third camera is P″ = [B | e″]. The properties and the relations of these cameras are described by the trifocal tensor T. This is a 3 × 3 × 3 third-order tensor with 18 degrees of freedom. The reduction from 27 parameters to 18 degrees of freedom is caused by the internal constraint

T_i^jk = a_i^j e″^k − e′^j b_i^k   (5)

with the camera matrices P′ and P″ written in tensor notation as a_i^j and b_i^k. By analogy with the epipolar relation x′ᵀ F x = 0 of two-view geometry, trifocal geometry yields

x^i x′^j x″^k ε_jpr ε_kqs T_i^pq = 0_rs.   (6)

The tensor ε in eq. 6, called the Levi-Civita symbol, represents the constant third-order 3 × 3 × 3 tensor

ε_rst = 0 if two of r, s, t are equal; +1 if (r, s, t) is an even permutation of (1, 2, 3); −1 if (r, s, t) is an odd permutation of (1, 2, 3),   (7)

with r, s, t ∈ {1, 2, 3}.
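The trilinear relation of eq. 6 is a plain tensor contraction and can be written directly with numpy's einsum. This sketch evaluates the 3 × 3 residual, which should vanish for a correct point triplet and an exact tensor; it assumes homogeneous 3-vectors and the (0-based) index convention T[i, p, q] = T_i^pq:

```python
import numpy as np

# Levi-Civita symbol of eq. 7, built from the even permutations of (0, 1, 2)
eps = np.zeros((3, 3, 3))
for r, s, t in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[r, s, t] = 1.0   # even permutation
    eps[r, t, s] = -1.0  # swapping two indices gives an odd permutation

def trilinear_residual(T, x1, x2, x3):
    """Evaluate eq. 6: x^i x'^j x''^k eps_jpr eps_kqs T_i^pq = 0_rs."""
    return np.einsum("i,j,k,jpr,kqs,ipq->rs", x1, x2, x3, eps, eps, T)
```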
The trifocal tensor can also be written in matrix notation using the three 3 × 3 matrices

T_i = [T_i^jk],  i = 1, 2, 3.   (8)

This notation can be used to extract the fundamental matrices between two different views from the trifocal tensor using the equations

F₂₁ = [e′]× [T₁, T₂, T₃] e″   (9)

and

F₃₁ = [e″]× [T₁ᵀ, T₂ᵀ, T₃ᵀ] e′,   (10)

where [T₁, T₂, T₃] e″ denotes the 3 × 3 matrix whose i-th column is T_i e″. The notation [x]× denotes the skew-symmetric matrix, which for a vector a = (a₁, a₂, a₃)ᵀ is given by

[a]× = [  0   −a₃   a₂ ]
       [  a₃   0   −a₁ ]
       [ −a₂   a₁   0  ]   (11)

The trifocal tensor is calculated by the normalized linear algorithm [14], including an algebraic minimization.
The basic idea is to solve a system of equations generated from eq. 6 and to enforce the internal constraint given by eq. 5.
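Once the tensor is available as a 3 × 3 × 3 array, eqs. 9 and 10 translate almost literally into numpy. A minimal sketch, assuming the epipoles e′ and e″ have already been recovered:

```python
import numpy as np

def skew(a):
    """Skew-symmetric matrix [a]_x of eq. 11."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def fundamentals_from_trifocal(T, e2, e3):
    """Eqs. 9 and 10: F_21 and F_31 from the tensor slices T_i.

    T: (3, 3, 3) array with T[i] = T_i; e2, e3: epipoles e', e''."""
    # [T_1, T_2, T_3] e'' is the matrix whose i-th column is T_i e''
    M21 = np.column_stack([T[i] @ e3 for i in range(3)])
    M31 = np.column_stack([T[i].T @ e2 for i in range(3)])
    return skew(e2) @ M21, skew(e3) @ M31
```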

Triangulation
After calculating the camera matrices, the 3D points can be reconstructed using triangulation. The concept is that the projection lines through the camera centers C and C′ and the image points x and x′ intersect in the 3D point X, as shown in fig. 5. To calculate X, the homogeneous system

A X = 0 with A = [ x p³ᵀ − p¹ᵀ
                   y p³ᵀ − p²ᵀ
                   x′ p′³ᵀ − p′¹ᵀ
                   y′ p′³ᵀ − p′²ᵀ ]   (12)

is solved, where pⁱᵀ and p′ⁱᵀ are the i-th row vectors of P and P′, and x = (x, y, 1)ᵀ, x′ = (x′, y′, 1)ᵀ. Since subpixel positions can only be determined by interpolation and additional distortion is induced by the camera system, the detected image points are noisy. As a result, the two projection lines generally do not meet in space but form two skew lines. To overcome this problem, the image points x and x′ are adjusted so that they satisfy the epipolar relation exactly; the adjusted points are called x̂ and x̂′. Simultaneously, the sum of squared Euclidean distances d(x, x̂)² + d(x′, x̂′)² is minimized.
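The homogeneous system of eq. 12 is usually solved via SVD; a minimal numpy sketch:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation: solve A X = 0 (eq. 12) via SVD.

    P1, P2: 3x4 camera matrices; x1, x2: inhomogeneous image points (2,)."""
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                 # right null vector = homogeneous 3D point
    return X[:3] / X[3]
```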

Results
The four different endoscopic video sequences from fig. 9 are used for the evaluation. Fig. 10 analyzes the matching process for three views, as described in section 2.7. The number of correspondences is shown on the left side of the boxplots; on the right side, the related reprojection error is shown. The first row shows the results of the two-view process using the advanced matching from section 2.5, compared to the three-view process. Analyzing the "Wooden House" sequence, it can be observed that the number of tracked correspondences (1-3) of about 60 is significantly lower than the number of directly identified correspondences (1-2) of about 125. The direct advanced matching between the first and third view (1-3) is only slightly inferior to the matching from the first view to the second view (1-2); the loss can be explained by the larger temporal distance between the images, which leads to a higher variation of the image data. In the fourth row, the tracked and newly matched correspondences are joined using the advanced RANSAC algorithm (1-3), which leads to a higher number of about 150 correspondences compared to the matching between the first and second view. Finally, only correspondences present in all three views (1-2-3), called triplets, are selected, which reduces the total number to about 90. The mean error of about 0.3 pixels is consistently low. Only the tracked correspondences (1-3) show a slightly higher error and variation before the RANSAC algorithm is applied. This is caused by sporadic wrong correspondences, as described in section 2.7.
Finally, the reprojection error of the estimated trifocal tensor is analyzed. For this step, two fundamental matrices are calculated from the trifocal tensor via eq. 9 (1-2) and eq. 10 (1-3). Fig. 11 shows nearly the same subpixel error for all sequences: about 0.3 pixels (1-2) and about 0.6 pixels (1-3), respectively. Compared to the epipolar geometry, the temporal distance has a stronger impact here. Since no RANSAC algorithm has yet been applied for estimating the trifocal tensor, outliers have a direct impact on the error value.
An exemplary reconstruction of the "Pisa" sequence is shown in fig. 12. The left image shows all detected features on the image plane; the right image shows a 3D reconstruction from these points. Corresponding points have the same color. The reconstruction is compressed in the x-direction, caused by the projective transformation described in section 2.6.

Conclusion
This paper presents a method for reconstructing 3D scene content from uncalibrated endoscopic sequences based on SURF features. The individual steps yield robust results by using the RANSAC algorithm in adapted forms. It has been shown that the epipolar geometry and the trifocal geometry can be extracted with high precision, whereby a subpixel-precise reconstruction is possible. An important application of the trifocal geometry is the extraction of consistent camera matrices for the whole sequence by a linear method, as in [12] or [13]. Subsequently, these cameras can be used for an auto-calibration [14], which allows a metric reconstruction.

Fig. 9: The four evaluated sequences: a) Pisa, b) Rome, c) Train Station, d) Wooden House. Correspondences over three images from these sequences are used to analyze the different steps of the algorithm.

Fig. 11: Trifocal error.