Dataset filtering with end correction
Some time ago I became annoyed at the suggestions that frequency filtering of short datasets can’t or won’t work, and so on. Was it really such an impossible problem?
Part of my background is in professional audio equipment design, including digital, so I am very familiar with signal processing and PCM (pulse code modulation). Just about all climatic data is actually PCM coded, so of course all the laws surrounding that apply. Not that you would know this from the way most in science behave: wrong maths. The two huge ones are Nyquist and Shannon; it is not rocket science.
Handling, for example, audio data is one thing; doing the same for climatic data is a whole different pit of vipers. One of the massive problems is short data: it has two ends, something which is avoided like the plague in audio, often by hiding it and muting the sound.
It didn’t take me long to figure out one, then another, method of end correcting short data. There is no theoretically correct solution, but I am pragmatic, engineer style: it works well enough. The classic question: do you want a good answer, or none?
In this case I dubbed it first order end correction. All it does is disconnect the Y (vertical) offset from zero.
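A minimal sketch of the idea, assuming the simplest possible reading — that the offset being disconnected from zero is just the data’s mean level, removed before filtering and restored afterwards. This is my guess at the principle, not the actual routine, which is more involved:

```c
#include <stddef.h>

/* Hypothetical first-order end correction sketch: remove the data's
 * mean before filtering so the filter's end transients decay towards
 * the data's own level rather than towards zero, then put the mean
 * back afterwards.  (An assumed reading of "disconnect the Y offset
 * from zero" -- the real method may differ.) */
static double remove_offset(double *x, size_t n)
{
    double mean = 0.0;
    for (size_t i = 0; i < n; i++) mean += x[i];
    mean /= (double)n;
    for (size_t i = 0; i < n; i++) x[i] -= mean;
    return mean;
}

static void restore_offset(double *x, size_t n, double mean)
{
    for (size_t i = 0; i < n; i++) x[i] += mean;
}
```

Filtering would happen between the two calls; the point is that zero-padding at the ends then pads with the data’s own level, not an arbitrary zero.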
This was added to a filtering package in C I was writing (partly about getting back into programming after a break of ten years or so).
The filtering used direct convolution, a practical matter of being fast enough for the size of data and filters commonly used (in most cases immediate). This lent itself perfectly to one of the end correction methods.
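For anyone unfamiliar, direct convolution is just this — each output sample is the dot product of the filter taps with the data under them. A self-contained sketch (not my package’s code), which also shows exactly where the end problem comes from: samples beyond the ends are taken as zero.

```c
#include <stddef.h>

/* Direct (time-domain) FIR convolution producing the full
 * nx + nh - 1 output.  Input samples beyond the data's two ends are
 * treated as zero -- this zero-padding is precisely what the end
 * correction has to fight. */
void fir_direct(const double *x, size_t nx,
                const double *h, size_t nh,
                double *y /* nx + nh - 1 samples */)
{
    for (size_t i = 0; i < nx + nh - 1; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < nh; j++) {
            if (i >= j && i - j < nx)      /* inside the data? */
                acc += h[j] * x[i - j];
        }
        y[i] = acc;
    }
}
```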
Roll forward a couple of years and, for various reasons, further progress really needed a switch to FFT based convolution, generally done segmented. In contrast to direct convolution this is, errm… somewhat complicated. The actual FFT is not a problem: I researched and use an excellent Japanese-authored library which has permissive licencing, in line with my intent to one day release software under a similar scheme (talking here BSD/MIT style).
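The segmentation bookkeeping is the part no-one spells out. A sketch of the overlap-add flavour: chop the input into blocks, convolve each block with the filter, and sum the overlapping tails. Here each per-block convolution is done directly, standing in for the FFT multiply — the arithmetic is identical, only the speed differs. This is a generic illustration, not my implementation:

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Full linear convolution of x (nx samples) with h (nh taps). */
static void conv_full(const double *x, size_t nx,
                      const double *h, size_t nh, double *y)
{
    memset(y, 0, (nx + nh - 1) * sizeof *y);
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j < nh; j++)
            y[i + j] += x[i] * h[j];
}

/* Overlap-add segmented convolution: process the input in blocks of
 * length L, convolve each block, and add the tails that spill past
 * each block boundary back in.  In the FFT method conv_full would be
 * replaced by FFT / multiply / inverse FFT per block. */
void conv_overlap_add(const double *x, size_t nx,
                      const double *h, size_t nh,
                      size_t L, double *y /* nx + nh - 1 samples */)
{
    double *tmp = malloc((L + nh - 1) * sizeof *tmp);
    memset(y, 0, (nx + nh - 1) * sizeof *y);
    for (size_t start = 0; start < nx; start += L) {
        size_t blk = (nx - start < L) ? nx - start : L;
        conv_full(x + start, blk, h, nh, tmp);
        for (size_t k = 0; k < blk + nh - 1; k++)
            y[start + k] += tmp[k];   /* tails overlap and add */
    }
    free(tmp);
}
```

The segmented result is bit-identical to one long convolution; that equality is the whole trick.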
Implementing the new convolution method was fun, of the hair-tearing-out variety. Obviously this is just a case of discovering how it is done, then writing it. The snag, as usual, is that no-one actually describes it accurately.
That worked, and on test I could go past a million-tap filter, perfect. I like to find the limits before anything bites me. At 10 million taps I run out of disk spool and RAM; the program exits gracefully with a rude message, no problem. That size of complex data etc. is awfully large — if that limit seems low, this ain’t bytes.
The next can of trouble is end correction. How, for goodness sake, can that be done?
This took weeks of hair loss. A simple answer turned out to be to preprocess using direct convolution, then use the new fancy stuff. This worked more or less immediately once I guessed some bizarre properties, such as what to do with the complex data. Counter-intuitive seems to be everywhere.
At that point I had what I thought had cracked the whole problem. It seemed to pass basic tests.
Over a month or two of using the program it dawned on me there were subtle problems. Some parts were wrong.
Just had a go at finally cracking it, which took several periods of detailed work.
The answer was to figure out a way of detecting what kind of filter is being requested and then, as necessary, use spectral inversion on the end correction. (Spectral inversion is a trivial mathematical process which reverses the frequency sense of a filter, such as transforming a low pass into a high pass, or the reverse.)
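Spectral inversion really is trivial: negate every tap, then add one to the centre tap. For a symmetric, odd-length filter this yields the exact complementary response (a low pass becomes the matching high pass). A sketch:

```c
#include <stddef.h>

/* Spectral inversion: flip the frequency sense of an FIR filter.
 * Negate all taps, then add 1 at the centre tap.  Exact for
 * symmetric (linear phase) filters of odd length nh. */
void spectral_invert(double *h, size_t nh)
{
    for (size_t i = 0; i < nh; i++)
        h[i] = -h[i];
    h[nh / 2] += 1.0;   /* delta at the centre of symmetry */
}
```

Note the DC gain goes to exactly zero when the original passed DC with unity gain — the signature of the complementary high pass.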
The exact filter characteristic is unimportant here; it is just a quick hack for testing.
The data is a straight diagonal line. Depending on the filter, it will turn into the same line or a horizontal line.
The correction is not perfect, and it is quite involved.
Those of you who have come this far might be in for a surprise. There is one of those dreadful “everyone knows” in science about the length of a transversal filter (or, to use another name, FIR), with incantations that the length of the filter must be cut off the output.
That lore is totally wrong. Above is hard proof: the filter is about 3x longer than the data.
As far as I can tell, the truth is about the impulse response of the filter and the Gibbs phenomenon. The case could be that the filter exactly mimics a much shorter filter — see the point? It is not about length at all.
I would add that the ghastly moving average is the worst possible filter, not even a good filter other than for specialist usage. The only redeeming feature for climatic work is that it is the shortest possible. Perhaps the most serious problem is the severe artefacts it creates, which are not in the data; these have led some people astray, thinking artefacts are actual data.
I could post a demo of the dire effect.
So, there is still more to do. I have not even looked at how arbitrary filters behave, and Hilbert is also still a problem. May as well try — it looks as though a sane arbitrary definition is on disk. And yay, it worketh! 🙂
Really go mad: a million points. Uh huh, 12 seconds with disk activity; needs more RAM. And with correction, 8 seconds, the space already being allocated. I haven’t optimised anything, nor space usage. No point in showing anything — it looks almost the same as the above plots (different filter characteristic, actually a pair of sharp bandpasses).
Real test time! What shall I choose? Okay, a work in progress: gravitational force computed from ephemeris, daily data from 1950, and try a bandpass extraction of a 60 year term.
That is actually a very interesting result which deserves its own post.
Further investigation throws up problems I do not understand about getting the same result from other Solex 100 runs, suggesting I made a mistake. After fiddling for some time trying to discover what I had done wrong, I decided to leave the problem for the time being and come back to it when the time seems right. Is the filtering working? I have not used it a great deal, but no problems have appeared.