While visiting Swinburne in Melbourne, I have spent some time examining to what extent GPUs could help us speed up the fitting of scattered light from ideal Moon images to the observed images.

Ben Barsdell (Swinburne) had some C++ code to convolve two FITS files using a CUDA-capable NVIDIA GPU (a GTX 480). On the CPU side the FFT library is FFTW, the same one I have been using for CPU-based modeling of the scattered light.


Upper left: ideal lunar image (intensity on a log scale).
Upper right: convolved with our best PSF using the CPU-based FFT.
Lower left: convolved with the PSF using the GPU.
Lower right: ratio of the two methods (shown in more detail below).

On a desktop, my Fortran code calling FFTW performs the three FFTs needed
(forward FFTs of the ideal Moon image and of the PSF image, plus the
inverse FFT of their pointwise product in the Fourier domain) in about
1100 milliseconds (ms). (This excludes the time needed to embed the
512×512 images into 1536×1536 arrays to sufficiently reduce edge-wrapping
effects, which brings the total runtime to about 3000 ms.) So each FFT
takes roughly 350 ms.
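The pipeline described above can be sketched as follows. This is a NumPy illustration of the method, not the actual Fortran/FFTW code; the array sizes follow the text, and the function name is mine:

```python
import numpy as np

def fft_convolve(moon, psf, padded=1536):
    """Convolve an ideal Moon image with a PSF via three FFTs:
    forward FFTs of both zero-padded images, a pointwise product in
    the Fourier domain, and one inverse FFT for the final result.

    Embedding the 512x512 inputs into a larger (here 1536x1536) array
    reduces the wrap-around artefacts of circular convolution at the
    frame edges, at the cost of roughly tripling the runtime.
    """
    n = moon.shape[0]                       # e.g. 512
    big_moon = np.zeros((padded, padded))   # e.g. 1536x1536
    big_psf = np.zeros((padded, padded))
    big_moon[:n, :n] = moon
    big_psf[:n, :n] = psf
    # Two forward FFTs, one product, one inverse FFT.
    conv = np.fft.ifft2(np.fft.fft2(big_moon) * np.fft.fft2(big_psf))
    return conv.real
```

The GPU version replaces the three transforms with CUFFT calls; the structure of the computation is the same.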

Using the GPU, we attained FFT
speeds of about 40 ms each, a speedup of a factor of 10 or so. This is
typical of what Ben expects for such applications (he is conducting a
careful study of the types of astronomical problems to which GPUs can be
profitably applied, and where the bottlenecks typically lie).

Ostensibly
this means we can speed up our light-modeling code by about a factor of
10, and possibly quite a bit more: since we want to explore a large
parameter space but do not need to repeat the same overheads on every
run, careful programming can reduce the per-image overhead considerably.
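One concrete way the per-image overhead can be amortized: the PSF is fixed while the Moon-image parameters vary, so its padded transform can be computed once and reused for every trial image, leaving two FFTs instead of three per evaluation. A rough NumPy sketch (function names are mine, not from the actual code):

```python
import numpy as np

def precompute_psf_ft(psf, padded=1536):
    """Pad the PSF once and return its Fourier transform for reuse
    across the whole parameter-space scan."""
    big = np.zeros((padded, padded))
    big[:psf.shape[0], :psf.shape[1]] = psf
    return np.fft.fft2(big)

def convolve_with_cached_psf(moon, psf_ft):
    """Per trial image: one forward FFT, one product, one inverse FFT.
    The PSF transform is taken as given."""
    padded = psf_ft.shape[0]
    big = np.zeros((padded, padded))
    big[:moon.shape[0], :moon.shape[1]] = moon
    return np.fft.ifft2(np.fft.fft2(big) * psf_ft).real
```

With FFTW or CUFFT the transform plans can likewise be created once and reused, which is where much of the remaining per-image overhead sits.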

VERY IMPORTANT: the CPU code performed the FFTs in double-precision complex arithmetic, whereas the GPU code used single-precision complex arithmetic.
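A quick way to gauge how much of a discrepancy the precision difference alone can introduce is to run the same convolution twice and inspect the ratio image. A NumPy sketch (NumPy's FFT always computes internally in double precision, so single precision is emulated here by rounding the inputs and every intermediate to 32 bits; a genuine single-precision library such as CUFFT in float mode may show somewhat larger errors):

```python
import numpy as np

def precision_ratio(moon, psf):
    """Convolve the same image pair in double and in emulated single
    precision and return the ratio image double/single; its deviation
    from 1.0 measures the precision-induced error."""
    def convolve(a, b, single):
        cast = (lambda x: x.astype(np.complex64)) if single else (lambda x: x)
        ftype = np.float32 if single else np.float64
        fa = cast(np.fft.fft2(a.astype(ftype)))
        fb = cast(np.fft.fft2(b.astype(ftype)))
        # Circular convolution, rounded to 32 bits at every stage
        # when emulating single precision.
        return cast(np.fft.ifft2(cast(fa * fb))).real

    return convolve(moon, psf, single=False) / convolve(moon, psf, single=True)
```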


We
compared the output images of the two methods for the case of scattered
light from an ideal Moon, using a power-law PSF falling off with a slope
of about r^-2.8.

VERY IMPORTANT: there is significant structure left in the ratio of the two methods if we divide the CPU output by the GPU output, as shown in the image above.

The scatter about the mean of 1.0000 is 2.028E-4, with a frame minimum of 0.9936 and a frame maximum of 1.003, so the deviation from unity in the ratio of the two methods is no worse than ~0.7% anywhere on the frame. There is certainly structure, and it looks like it might be too much for our purposes! (We would like this to be better than 0.1% at the very worst.) We are checking whether this is due to the single precision used on the GPU.
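The acceptance test quoted above (scatter about the mean, frame extrema, and the 0.1% worst-case requirement) can be written as a short helper; this is a hypothetical function of mine, not part of the existing code:

```python
import numpy as np

def ratio_stats(ratio, tolerance=1e-3):
    """Summarize a CPU/GPU ratio image: scatter about the mean, frame
    extrema, and whether the worst-case deviation from unity meets the
    0.1% (tolerance=1e-3) requirement."""
    stats = {
        "scatter": float(np.std(ratio)),
        "min": float(ratio.min()),
        "max": float(ratio.max()),
    }
    stats["worst_deviation"] = max(abs(stats["min"] - 1.0),
                                   abs(stats["max"] - 1.0))
    stats["acceptable"] = stats["worst_deviation"] < tolerance
    return stats
```

With the numbers quoted above (minimum 0.9936), the worst-case deviation is ~0.64%, which fails the 0.1% requirement.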

Unless this problem can be solved, it is not clear that the speed gain from the GPUs
is worth having!