|Title:||A Case Study of Parallel Bilateral Filtering on the GPU|
Smoothing and noise reduction are often an important first step in image processing applications. Simple smoothing algorithms such as the Gaussian filter have the unfortunate side effect of blurring the image, which can obscure important information and negatively affect the applications that follow. The bilateral filter is a widely used non-linear smoothing algorithm that seeks to preserve edges and contours while removing noise.
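For reference, the bilateral filter is commonly written in the following Gaussian form. The notation is an assumption made here for illustration and is not taken from the report: I is the input image, p the pixel being filtered, \Omega the window of neighbouring pixels q, and \sigma_s and \sigma_r the spatial and range (intensity) standard deviations.

\[
I'(p) = \frac{1}{W_p} \sum_{q \in \Omega} G_{\sigma_s}\!\left(\lVert p - q \rVert\right) \, G_{\sigma_r}\!\left(\lvert I(p) - I(q) \rvert\right) \, I(q),
\qquad
W_p = \sum_{q \in \Omega} G_{\sigma_s}\!\left(\lVert p - q \rVert\right) \, G_{\sigma_r}\!\left(\lvert I(p) - I(q) \rvert\right),
\]
\[
G_{\sigma}(x) = \exp\!\left(-\frac{x^2}{2\sigma^2}\right).
\]

The range term G_{\sigma_r} is what separates this from a plain Gaussian filter: neighbours whose intensity differs strongly from I(p) receive almost no weight, so averaging does not cross an edge.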
The bilateral filter is computationally expensive, especially on larger images, because it does more work per pixel than simpler smoothing algorithms. In time-critical applications, this may be enough to push developers toward a simpler filter at the cost of quality. However, the running time of the bilateral filter can be greatly reduced through parallelization, since the work for each pixel is independent and can in principle be done simultaneously.
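The per-pixel structure described above can be illustrated with a minimal sequential sketch in C++. This is not the implementation evaluated in the report; the function names bilateralPixel and bilateralFilter, the row-major grayscale layout, and the choice to skip neighbours outside the image are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: filters the single pixel (x, y) of a row-major
// grayscale image. It only reads from img, so calls for different
// pixels do not depend on each other.
float bilateralPixel(const std::vector<float>& img, int width, int height,
                     int x, int y, int radius, float sigmaS, float sigmaR) {
    const float center = img[y * width + x];
    float sum = 0.0f;   // weighted sum of neighbour intensities
    float norm = 0.0f;  // sum of weights, used for normalization
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            const int nx = x + dx;
            const int ny = y + dy;
            // Assumption: neighbours outside the image are skipped.
            if (nx < 0 || nx >= width || ny < 0 || ny >= height) continue;
            const float v = img[ny * width + nx];
            const float diff = v - center;
            // Spatial weight falls off with distance from the centre pixel.
            const float spatial =
                std::exp(-(dx * dx + dy * dy) / (2.0f * sigmaS * sigmaS));
            // Range weight falls off with intensity difference.
            const float range =
                std::exp(-(diff * diff) / (2.0f * sigmaR * sigmaR));
            const float w = spatial * range;
            sum += w * v;
            norm += w;
        }
    }
    // norm is at least 1 because the centre pixel always has weight 1.
    return sum / norm;
}

// Sequential driver: one call to bilateralPixel per output pixel.
std::vector<float> bilateralFilter(const std::vector<float>& img,
                                   int width, int height, int radius,
                                   float sigmaS, float sigmaR) {
    assert(radius >= 0 && sigmaS > 0.0f && sigmaR > 0.0f);
    std::vector<float> out(img.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out[y * width + x] =
                bilateralPixel(img, width, height, x, y, radius, sigmaS, sigmaR);
        }
    }
    return out;
}
```

Each output pixel costs on the order of (2 * radius + 1)^2 weight evaluations, which is why the filter becomes slow on large images. Because every iteration of the two outer loops writes a different output element and reads only the input, a GPU version could in principle assign one thread to each (x, y) pair.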
This work uses Nvidia's Compute Unified Device Architecture (CUDA) to implement and evaluate some of the most common and effective methods for parallelizing the bilateral filter on a graphics processing unit (GPU). These include the use of constant and shared memory, and a technique called 1 x N tiling. The techniques are evaluated on newer hardware, and the results are compared to a sequential version and to a naive parallel version that does not use advanced techniques. This report also aims to give a detailed and comprehensible explanation of these techniques, in the hope that readers will be able to use the information to implement them on their own.
The greatest speedup is achieved in the initial parallelization step, where the algorithm is simply converted to run in parallel on a GPU. Storing some data in constant memory provides a slight but reliable speedup for a small amount of work. Additional time can be saved by using shared memory. However, memory transactions did not account for as much of the execution time as expected, so the memory optimizations yielded only small improvements. Test results showed 1 x N tiling to provide little or no benefit on the hardware used in this work, although this may be due to problems with the implementation.
|Prel. end date:||2015-12-31|
|Student:||Jonas Larsson email@example.com|