Stick Based Speckle Reduction for Real-Time Processing of OCT Images on an FPGA

This paper presents an FPGA based real-time implementation of an adaptive speckle reduction algorithm. Applied to the log-compressed image of a high-resolution optical coherence tomography (OCT) system, all related signal processing steps from envelope detection to VGA video signal generation are executed on a single chip. Images from measured OCT data show that the chosen algorithm produces a smooth, detailed image with fewer image artifacts than comparable approaches. An estimation of the hardware effort, the possible throughput rate and the resulting image frame rate is given for different window sizes used here in speckle reduction.


Introduction
Optical coherence tomography (OCT) is a rather new imaging technique based on broadband interferometry.Like ultrasonic imaging, OCT generates a cross-sectional image of the scanned tissue.The penetration depth of OCT ranges from about 1-2 mm in opaque tissue (e.g.skin) to 2 cm in transparent tissue (e.g.eye).The achievable lateral resolution mainly depends on the focusing optics, while the axial resolution is based on the light source of the OCT scanner.Currently, resolutions down to a few micrometers are possible, allowing OCT images to compete with histological examination of tissue.The main drawback of histology is that for each observation a tissue sample has to be taken surgically.Thus, the main advantages of OCT are its non-invasive nature and real-time feasibility.
Similar to ultrasonic images, OCT speckle arises from constructive and destructive interference of light backscattered from reflectors smaller than a wavelength.Speckle noise degrades contrast, and makes it difficult to identify small reflectors.Additionally, post-processing algorithms (such as deconvolution or segmentation) that are used to enhance image resolution and quality may suffer from the presence of speckle in OCT images.Thus, prior to image enhancement algorithms, a speckle reduced image is available that can be blended with the original image to enhance viewing convenience.
This paper is organized as follows: In section 2 the basic principles of OCT are described and the high resolution OCT system is presented in detail.Section 3 deals with the theory of speckle reduction algorithms and gives an overview of the algorithmic complexity.Based on these investigations, section 4 evaluates real time implementation feasibility on an FPGA.

High resolution OCT
This section gives details about the OCT system used throughout this work, and a flexible digital post processing realized on an FPGA is presented.

OCT principle
A general OCT system is depicted in Fig. 1.The light from a broadband light source is split into a reference beam and a probe beam directed onto the object of interest.By combining the backscattered light with the reference beam, an interference pattern is created and detected by a photodetector.The strength of the interference is dependent on the optical path difference and on the reflection.The envelope of the interference pattern shows the reflectivity in a certain depth of the probe [1].
Hence, by altering the length of the reference beam, a reflectivity profile of the probe can be recorded, called A Scan.The photodetector signal is amplified and analog-to-digital converted to enable digital filtering, demodulation and advanced post processing.By deflecting successively the probe beam in lateral direction, multiple scans can be combined to a 2-dimensional B-Scan, showing a high resolution cross-sectional view of the probe.

OCT system
Interference of the detected light depends on the difference of the optical length of the reference arm and probe arm.Unlike narrowband light sources, broadband light sources show interference only for small differences up to the coherence length.The applied source emits light with a spectral bandwidth of 255 nm at a center wavelength of 800 nm, which leads to a coherence length of 1.6 mm.The coherence length is the limit for the axial resolution of the system.Resolution in the lateral direction of the OCT-System depends basically on the focusing optics and lenses.The system considered here provides up to 0.01 degrees of lateral resolution.
The length of the reference arm can be varied on a micrometer scale with a piezoelectric crystal.Adaptive control enables a linear movement of the reference mirror at constant speed.Therefore the recorded signal shows the interference pattern as a sine-wave carrier signal.The depth-dependent reflectivity of the probe shapes a superimposed envelope.The carrier frequency depends on the mirror velocity and center frequency of the light source, and is about 350 KHz for the considered system, mainly limited by the mechanical set-up of the reference mirror.

Digital post-processing
After amplifying the photodiode signal, it is sampled and converted to the digital domain.A standard analog-to-digital converter is used, providing precision of 12 Bits at a sampling frequency of 1 MHz.This data is fed to an Altera Stratix II FPGA, which serves as a test-bed to evaluate different speckle reduction algorithms and the real-time abilities of the system.Fig. 3 shows a block diagram of the architecture.
Standard signal processing for an OCT system consists of filtering and amplitude demodulation to determine the signal's envelope.The implemented system starts with a bandpass filter.Subsequently a Hilbert transformation is chosen to demodulate the signal by adding the imaginary part to the OCT signal and taking the absolute value.The bandpass and Hilbert transform can be performed in one step, in Fig. 3 denoted as complex bandpass.To compute the absolute value, i.e. the squared sum of the imaginary and real part of the signal, an efficient coordinate rotation (CORDIC) algorithm has been implemented as proposed in [5].
Subsequent to envelope detection, the signal is compressed to 8 Bit by computing the logarithm.To adapt the sampling rate to the axial resolution given by the coherence length, down-sampling is done after low-pass filtering to avoid aliasing.
The demodulated and compressed OCT data of each scan line is written to a memory buffer, and the adjacent scans are combined in lateral direction to a B-Scan.A VGA protocol converter allows visualization of the OCT picture on a standard VGA display with a resolution of 1024×768 pixels at 75 Hz.The architecture makes use of a 2-stage memory structure (Table 1).For advanced processing and speckle reduction, up to 63 adjacent A Scans can be stored in a buffer and can be processed in both the axial and the lateral direction.A 32 MByte SDRAM is used as frame buffer by the video controller.
The required Arithmetical Logical Units (ALUTs) for an FPGA implementation of this post processing scheme are summarized in Table 2.
As can be seen, only 9 % of the overall FPGA (48,352 ALUTs) are used in this post processing approach, leaving sufficient space for extension and enhancements.

Speckle reduction algorithm
This section details with the speckle reduction technique considered here.
Speckle reduction aims to reduce noise-like speckle distortion in B-Scans, and it should also generally preserve structures like edges, borders and small features of interest.A large variety of speckle reducing filters has been published, particularly from research in ultrasonic image processing.Approaches using region growing, diffusion and wavelet decomposition have been proposed.
This paper considers the adaptive speckle filter introduced in [3] using asymmetric sticks, called ASF.Similar to other speckle reduction techniques, the basic concept is to compute the output from windows which are adapted to the local content of the source image.In contrast to algorithms working with rectangular windows and/or weights [6,7,8]   which depend on the distance to the center of the window, ASF resolves the window in angular steps.This method shows superior results at the expense of high computational load and is impracticable to implement on a general purpose processor for real-time processing.The algorithm will now be explained in detail.
The filter uses directional features (called sticks, Fig. 5) and allows iterative application, since it behaves like a nonlinear diffusion model.The method is based on a weighted sum of means, which are calculated in the stick direction.The weights are given by the reciprocal of the local variance in each direction.
According to [3], the filter algorithm is given by and , where Î denotes the output for each pixel.V i and I i are the variance and mean of the original image data I along the i th stick pointing from the center of the N×N filter window in one of 4N-4 different directions.The sticks can be constructed by hardware, making efficient use of Bresenham's line algorithm [9].Given the i th stick as a filter kernel H i (x, y), the mean and variance are computed by with * denoting a 2-dimensional correlation.As variation function g( ) × this paper applies Sticks across an edge show high variance and therefore the mean in this direction is weighted to have less influence on the computed output.Hence, smoothing only happens in homogeneous directions whereas it is suppressed across edges.In homogeneous speckle regions, all sticks have approximately the same variance and weight, and as a result all directions are smoothed evenly.
Because of the high quantization of angular steps, ASF is able to adapt its kernel very flexibly to the image content.Inhomogeneities with various shapes can not only be retained, but can be smoothed without blurring edges.Fig. 4 shows a measured OCT image of a spheroid placed on an agar substrate and the enhanced images for two different stick lengths.The speckle pattern and additive noise are significantly reduced.
In [3], it is stated that due to the overlapping sticks the algorithm works like a Gaussian smoothing filter in homogeneous regions.However, a closer look at the filter kernel in these regions shows that the weights are (Assuming V i equal for all directions i, with r denoting the number of stick pixels from the window center and a constant c) The kernel is not Gaussian, but decreases with 1/(8r).For better smoothing, the influence of the center pixel can be reduced.A straightforward way to do this is to subtract a fraction of the original pixel value from the output a .
An empirically found value of a is 0.125.This adjustment is especially important if only one iteration is applied.
To compare the image quality of ASF to an existing FPGA implementation of the ARGF-based [8] algorithm by Mazumdar et al. [4], both filters have been implemented in C++ and Matlab and adjusted to OCT-image data.Fig. 6 shows the resulting images.The right image is slightly degraded by artifacts (marked with arrows).These are caused by hard switching between square filter kernels and outliers found in speckle statistics.They do not appear in the left image, because of the soft thresholding and high angular quantization of ASF.Furthermore, ASF needs only one parameter k, which can be adjusted very easily.The filter introduced in [4] needs 4 parameters, which are difficult to adjust empirically.

Real-time implementation
This section describes the real-time FPGA implementation of the ASF algorithm introduced in section 3.This work focuses on meeting real-time requirements.Therefore, the timing requirements for the ASF algorithm are derived based on the A-Scan rate and pixels per A-Scan.Subsequently the architecture of the ASF algorithm will be shown and the throughput rate for a serial and a parallel implementation will be estimated.
In the second part, implementation results are presented for a Stratix II FPGA for different sets of parameters of the algorithm.Here, the dependency of different stick-lengths corresponding to varying window sizes as well as the number of sticks on the required hardware will be shown, measured in Logic Elements (LEs).A possible parallel execution of the ASF algorithm to achieve highest throughput rate will be described.

Signal processing concept
Fig. 7 shows the signal processing concept of the ASF algorithm.As can be seen, the algorithm is divided into three blocks that realize different steps of the algorithm at different sample rates.This architecture allows pipelining of the algo-rithm to increase the throughput rate.In Fig. 7, the first block receives a window of N×N pixels and calculates the sum along each stick and the sum of the squared elements for N S sticks.Because the N×N window pixels are prefetched and stored in an array, they are available without latency.
In each clock cycle, a serial implementation performs one of the N S summations over N L stick elements.This leads to a required number of N S processing cycles for the first block and thus one summation value is calculated each clock cycle.In a parallel implementation, all these summations are calculated simultaneously leading to a calculation time of one clock cycle for all summation results.
The processing cycles of the second block are mainly affected by the summation over N S sticks, leading also to a number of about N S cycles, and hence one calculated value per clock cycle.Here, the parallel implementation calculates all results in one clock cycle.
Finally, block III realizes a divider which does not affect the throughput rate of this pipelined architecture.
The number of processing cycles required to perform the ASF algorithm realized by the three-block approach considered here is equal to the maximum number of processing cycles of block I and II.Thus, a serial implementation requires about N S processing cycles if the pipeline has been filled up.In contrast to this, a fully parallel implementation requires three clock cycles to determine a speckle reduced pixel value.
Based on the A-Scan rate and the number of pixels per A-Scan, the upper limit of processing cycles to calculate a new pixel value can be derived from where l is the A-Scan length (in pixels), f clk and f A-Scan are the clock frequency and the A-Scan rate in Hz respectively.With an A-Scan rate of 50 Hz and an A-Scan length of 768 pixels, approximately 2600 cycles are acceptable for the calculation of each speckle reduced pixel.As can be seen from the processing cycle calculation above, serial as well as parallel ASF implementations easily meet this requirement for reasonable window sizes.

FPGA implementation
Fig. 8 shows the number of required ALUTs for a serial implementation of the ASF algorithm.The increasing number of ALUTs is mainly caused by the first block.As can be seen in Fig. 8, the bit widths of the second and third block's input Furthermore, the maximum clock frequency for an implementation on the Stratix II FPGA (EP2S60F1020C4) is given in Fig. 8.As can be seen for reasonable values of N, the maximum clock frequency is greater than 50 MHz.

Performance evaluation
To determine the maximum window size for real time calculation, the equation for the upper limit of available cycles per pixel in section 4.1 can be rewritten.For a stick length of This form allows a calculation of the window size depending on A-Scan length (l), image rate (f img ) and width (w) as well as clock frequency (f clk ).
For a second available OCT system based on Fourier-domain image reconstruction, an A-Scan length of 1000 pixels, an image rate of 5 Hz and a width of 250 lines is required.In this case, a window size of N = 11 can be processed in real-time.

Conclusion
A generic design approach allows for a flexible implementation of the ASF algorithm.Thereby, algorithm parameters like window size, number of sticks, and weighting function can be configured at compile time.The performance of the serial implementation meets the requirements of the broadband OCT system used here.
Even for FFT-based fast OCT systems, the presented hardware implementation can apply real-time speckle reduction to the full image size.For even higher demands, parallelization of the algorithm is straightforward using the proposed generic design approach.Thus, increased throughput rate can be traded for hardware effort regarding the number of required ALUTs.
The introduction of a slight modification of the algorithm shows that a single run provides sufficient smoothing of the image.

Fig. 2 :
Fig. 2: Spectrum of the broadband light source and interference pattern of an ideal mirror

2 ,
using the minimum number of required cy-

Fig. 8 :
Fig. 8: Resource usage and maximum clock frequency of ASF as a function of window size

Table 1 :
Memory usage of OCT post-processing

Table 2 :
Resource requirements of basic OCT post-processing blocks