Even NASA have trouble prising raw data out of friend researchers


Image courtesy NASA, discovery of a new class of star

This is a tale where I am going to keep the embarrassment over wasting the time of others under a cloak.

The original version of this story, which I have read, is not as kind. Rather annoyed folks were writing.

In many cases raw data is not available, or is only pseudo-raw, i.e. cooked in various unspecified or even unknown ways, when raw should mean what it says: original instrumentation data with full ancillary information. Too often claims of raw are plain nonsense. This has wasted my time, prevented proper data processing, and so on; there are a lot of tales I could recount.

6.4 Case Studies

6.4.1 Infrared Astronomical Satellite (IRAS) Data.

The first major test of AutoClass on a large scale real-world database was the application of AutoClass to the IRAS Low Resolution Spectral Atlas. This atlas consisted of 5425 mean spectra of IRAS point sources. Each spectrum consists of 100 “blue” channels in the range 7 to 14 microns, and another 100 “red” channels in the range from 10 to 24 microns. Of these 200 channels, only 100 contain usable data. These point source spectra covered a wide range of intensities, and showed many different spectral distributions. We applied AutoClass to this spectral database by treating each of the 100 spectral channels (intensities) as an independent normally distributed single real value. The log-normal model is preferable for such scalar data, but several percent of the reported intensity values were negative. Also, adjacent spectral values are expected to be highly correlated, but it was not obvious how to incorporate neighbor correlation information. Thus we knew from the beginning that we were missing important information, but we were curious how well AutoClass would do despite this handicap.
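[ed: For readers who want to see what "independent normally distributed" channels amounts to, here is a minimal sketch in Python. It is not AutoClass code; the function names and the per-channel means and sigmas are my own assumptions for illustration. It also shows why negative values rule out a log-normal model: the logarithm of a non-positive number does not exist.]

```python
import math

def normal_log_likelihood(spectrum, means, sigmas):
    """Log-likelihood of one spectrum, each channel an independent normal.

    means and sigmas are hypothetical per-channel class parameters.
    """
    total = 0.0
    for x, mu, sigma in zip(spectrum, means, sigmas):
        total += -0.5 * math.log(2 * math.pi * sigma ** 2)
        total -= (x - mu) ** 2 / (2 * sigma ** 2)
    return total

def lognormal_usable(spectrum):
    """A log-normal model needs log(x), so every value must be strictly positive."""
    return all(x > 0 for x in spectrum)
```

Because the channels are treated as independent, the per-channel terms simply add up; any correlation between neighbouring channels is ignored, which is the handicap the authors mention.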
Our very first attempts to apply AutoClass to the spectral data did not produce very good results, as was immediately apparent from visual inspection. Fortunately, inspection also exposed the cause of the problem. The spectra we were given had been “normalized”. In this case normalization meant scaling the spectra so that all had the same peak height. This normalization meant that noisy spectra were artificially scaled up (or down) depending on whether the noise at the peak was higher or lower than the average. Since all values in a single spectrum were scaled by the same constant, an incorrect scaling constant distorted all spectral values. Also, spectra with a single strong peak were scaled so that the rest of the spectrum was close to the noise level. We solved the “normalization problem” by renormalizing the data ourselves so that the area under all the curves is the same. This method of normalization is much less sensitive to noise than the peak normalization method.
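[ed: The difference between the two normalizations can be seen in a toy example. This is only a sketch under my own assumptions: the five-channel spectra are made up, the channels are taken to be equally wide so that the sum stands in for the area, and the function names are hypothetical.]

```python
def peak_normalize(spectrum):
    """Scale so the highest channel equals 1."""
    peak = max(spectrum)
    return [x / peak for x in spectrum]

def area_normalize(spectrum):
    """Scale so the channels sum to 1 (area, assuming equal channel widths)."""
    area = sum(spectrum)
    return [x / area for x in spectrum]

def off_peak_shift(normalize, clean, noisy, channel=0):
    """Relative change in one off-peak channel caused by the noise."""
    a = normalize(clean)[channel]
    b = normalize(noisy)[channel]
    return abs(b - a) / a

# Made-up spectra: identical except for noise of +1 on the peak channel.
clean = [1.0, 2.0, 4.0, 2.0, 1.0]
noisy = [1.0, 2.0, 5.0, 2.0, 1.0]

peak_shift = off_peak_shift(peak_normalize, clean, noisy)
area_shift = off_peak_shift(area_normalize, clean, noisy)
print(peak_shift, area_shift)
```

With these numbers the noise on the peak moves an untouched off-peak channel by about 20% under peak normalization, but only by about 9% under area normalization, because the error in the scaling constant is diluted over all channels.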
The experts who provided us with this data tried to make life easy for us by only giving us the brightest spectra from 1/4 of the sky (without telling us about this sampling bias). When we found this out, we requested all the spectra in the atlas to work with. Because this larger atlas included much noisier spectra, we found a new problem: some spectral intensities were negative. A negative intensity, or measurement, is physically impossible, so these values were a mystery. After much investigation [ed: might be more here], we finally found out that the processing software had subtracted a “background” value from all spectra. This pre-processing, of course, violates the basic maxim that analysis should be performed on the data actually measured, and all “corrections” should be done in the statistical modeling step.
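[ed: How a background subtraction turns physically sensible measurements into negative "intensities" is easy to reproduce. The sketch below uses made-up numbers and a constant background, both my assumptions; the actual IRAS processing is not described in the passage.]

```python
def subtract_background(measured, background):
    """Pre-processing step: remove a constant background estimate from every channel."""
    return [x - background for x in measured]

def negative_channels(spectrum):
    """Indices of channels whose value is below zero."""
    return [i for i, x in enumerate(spectrum) if x < 0]

# Hypothetical faint source: most channels hover around the background level.
measured = [0.8, 1.1, 0.9, 3.0]
background = 1.0  # assumed constant estimate, for illustration only

corrected = subtract_background(measured, background)
print(negative_channels(measured))   # raw values: none below zero
print(negative_channels(corrected))  # after subtraction: faint channels go negative
```

The raw values are all positive; it is only the "corrected" values that dip below zero, wherever noise left a channel under the background estimate. Anyone handed only the corrected numbers has no way to tell what was actually measured.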


Thank you for being candid.

As I wrote, the original version revealed rather more. None of this looked like intentional withholding so much as not really being aware themselves, possibly through not involving the true experts, those who support them.
… and who might say things off-message. Techies do.

For anyone wondering, autoclass-c is freely available and cross-platform, but you had better be familiar with the console/command line and with handling textual information.


  1. April 7, 2014 at 08:39

    “This pre-processing, of course, violates the basic maxim that analysis should be performed on the data actually measured, and all “corrections” should be done in the statistical modeling step.”

    Don’t tell a climatologist anything silly like that, they’d have to start from scratch.
