“Eureqa” is software that searches for regression formulae that fit given data in the most parsimonious way – it not only tests which regressors are the best, but also searches a large space of possible functional forms. We have used Eureqa to hunt for solutions for SSA (i.e. albedo) in sets of data extracted from a large number of synthetic images.
A training set was generated from the averages in 64 equal-size boxes laid out over synthetic lunar images.
Each synthetic image was generated for the same JD, but each had its own realization of Poisson noise. Either a single frame or the average of 30 such frames was used, and a set of 1000 such images was generated with random values of alfa, pedestal and albedo sampled from uniform distributions.
The distribution limits are:
alfa 1.5 to 1.85
ped 0.0 to 50.0
albedo 0.1 to 0.9
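As an illustration of the training-set construction described above, here is a minimal Python sketch. It only covers the parts stated in the text: the uniform parameter draws, the averaging of Poisson-noise realizations, and the 64 box means. The synthetic lunar image model itself is not shown, the function names are hypothetical, and the 8 x 8 layout of the 64 boxes is an assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility
N_IMAGES = 1000

# Uniform limits as quoted in the text.
alfa = rng.uniform(1.5, 1.85, N_IMAGES)
ped = rng.uniform(0.0, 50.0, N_IMAGES)
albedo = rng.uniform(0.1, 0.9, N_IMAGES)


def average_noisy_frames(ideal, n_frames, rng):
    """Average n_frames independent Poisson realizations of a noise-free image."""
    frames = rng.poisson(lam=ideal, size=(n_frames,) + ideal.shape)
    return frames.mean(axis=0)


def box_means(image, n_side=8):
    """Mean of each box in an n_side x n_side grid, returned in row-major order.

    Assumes (not stated in the text) that the 64 boxes form an 8 x 8 grid
    and that the image dimensions are divisible by n_side.
    """
    h, w = image.shape
    if h % n_side or w % n_side:
        raise ValueError("image dimensions must be divisible by n_side")
    blocks = image.reshape(n_side, h // n_side, n_side, w // n_side)
    return blocks.mean(axis=(1, 3)).ravel()
```

Each training row would then consist of the 64 box means of one image together with the albedo used to generate it.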
Eureqa was run until no further progress seemed possible.
Formula from 30-frame averages:
“SSA = (199.9*V29 - 199.3*V48)/
(V53 + V47*V47 + 399.8*V29*V47*V47*V47 - V1*V29 - V1*V47 - 398.5*V47*V47*V47*V48)”
Formula from single frames:
“SSA = 60.55*V28 + 171.8*V11*V28 - 0.006181 - 60.62*V32 - 159.8*V40*V40”
The VXX refer to the mean of the XXth box on the lunar image – counting from the corner and along rows, then columns.
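To show how cheap the formulae are to evaluate, the following Python sketch transcribes both of them. The coefficients are copied from the Eureqa output quoted above, written with decimal points. The function names are hypothetical, and the input is assumed to be a length-64 sequence of box means ordered so that element 0 is V1. Which corner V1 sits in is not specified here and must match the training set.

```python
def _v(box_means, k):
    """Return VXX for 1-based box number k (assumes box_means[0] is V1)."""
    return box_means[k - 1]


def ssa_30_frames(box_means):
    """SSA from the formula trained on 30-frame averages."""
    v1, v29 = _v(box_means, 1), _v(box_means, 29)
    v47, v48, v53 = _v(box_means, 47), _v(box_means, 48), _v(box_means, 53)
    numerator = 199.9 * v29 - 199.3 * v48
    denominator = (v53 + v47**2 + 399.8 * v29 * v47**3
                   - v1 * v29 - v1 * v47 - 398.5 * v47**3 * v48)
    return numerator / denominator


def ssa_single_frame(box_means):
    """SSA from the formula trained on single frames."""
    v11, v28 = _v(box_means, 11), _v(box_means, 28)
    v32, v40 = _v(box_means, 32), _v(box_means, 40)
    return (60.55 * v28 + 171.8 * v11 * v28 - 0.006181
            - 60.62 * v32 - 159.8 * v40**2)
```

The formulae are only meaningful for box means computed in the same way, and in the same units, as those in the training set.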
Statistics                   30 frames        1 frame
“R^2 Goodness of Fit”        0.9999703        0.99935872
“Correlation Coefficient”    0.99998515       0.99967931
“Maximum Error”              0.0055585436     0.027577193
“Mean Squared Error”         1.6123362e-6     3.5070792e-5
“Mean Absolute Error”        0.00093508038    0.0043065976
“Albedo error”               0.32%            1.4%
Averaging 30 frames gives better results than using single frames, as expected: the maximum error improves by a factor of 5, the mean squared error by a factor of 22, and the mean absolute error by a factor of about 5. Thus, the error in the fit scales about as you would expect – by the
The albedo errors are 0.3% for 30-frame averages and 1.4% for single frames. They are calculated from the Mean Absolute Error, which is the best-case choice rather than a conservative one. Both are higher than our science goal of 0.1%, so we must use more frames.
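The improvement factors quoted above can be checked directly from the statistics table. The short Python sketch below does only that arithmetic; the variable names are ours.

```python
# (30-frame value, 1-frame value), copied from the statistics table.
stats = {
    "Maximum Error": (0.0055585436, 0.027577193),
    "Mean Squared Error": (1.6123362e-6, 3.5070792e-5),
    "Mean Absolute Error": (0.00093508038, 0.0043065976),
}

# Ratio of single-frame error to 30-frame error for each statistic.
improvement = {name: one / thirty for name, (thirty, one) in stats.items()}

for name, factor in improvement.items():
    print(f"{name}: {factor:.2f}")
```

This gives factors of roughly 4.96, 21.75 and 4.61 respectively.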
How would one use a system based on the above formulae?
Each formula is trained on centered images from a given point in time. It can be used only on images that are also centered, have the same image scale, and are from the same time. A training set can be generated for any point in time, and it is possible to ensure that the image scale is the same as for the observed image. Observed images may not be centered, however. It is probably more difficult to shift the observed image, because of edge effects, than it is to determine the shift, shift all synthetic images equivalently, and train the formula on the shifted synthetic images.
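One possible way to determine the shift and apply it to the synthetic images is sketched below in Python. This is not necessarily how we would do it in practice: it assumes whole-pixel shifts, estimates them by FFT cross-correlation against a centered synthetic reference, and uses np.roll, which wraps pixels around the image edge. The function names are hypothetical.

```python
import numpy as np


def estimate_shift(observed, reference):
    """Integer (dy, dx) such that rolling `reference` by it best matches `observed`.

    Uses circular cross-correlation computed with FFTs; both images must
    have the same shape.
    """
    cross = np.fft.ifft2(np.fft.fft2(observed) * np.conj(np.fft.fft2(reference))).real
    dy, dx = np.unravel_index(np.argmax(cross), cross.shape)
    h, w = cross.shape
    # Map shifts larger than half the image size to negative shifts.
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)


def shift_synthetic(image, dy, dx):
    """Shift a synthetic image by whole pixels (wraps around at the edges)."""
    return np.roll(image, (dy, dx), axis=(0, 1))
```

The shifted synthetic images would then be reduced to box means and used to train a new formula. For sub-pixel shifts, or where wrap-around matters, a padded interpolating shift would be needed instead.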
While the use of a simple formula such as the one above would be extremely fast, it does not appear to offer an advantage in terms of error over our other methods.
The method is somewhat similar to Chris’ principal components method, but it selects the optimum boxes, i.e. the areas on the lunar image to use, by itself.
The method quickly finds solutions of similar quality if started from a fresh set of images, but the formulae may differ in which variables are selected. This implies that the space of possible solutions has not been sufficiently sampled by the 1000-image training set.
Generating larger training sets is possible, but generating 1000 images takes about 20 minutes, so it is not realistic to generate sets that are, e.g., 100 times larger (about 33 hours at that rate).
Training the formula takes 10-20 minutes using 1000 images and 2 CPUs on an ordinary PC workstation. Perhaps there is scope for much larger sets using the CRAY and multiple processors?