Related
I am trying to look at astronomical spectra using Python, and I'm using numpy.correlate to try and find a radial velocity shift. I'm comparing each spectrum I have to one template spectrum. The problem that I'm encountering is that, no matter which spectra I use, numpy.correlate states that the maximal value of the correlation function occurs with a shift of zero pixels, i.e. the spectra already line up, which is very clearly not true. Here is some of the relevant code:
corr = np.correlate(temp_data, imag_data, mode='same')
ax1.plot(delta_data, corr, c='g')
ax1.plot(delta_data, 100*temp_data, c='b')
ax1.plot(delta_data, 100*imag_data, c='r')
The output of this code is shown here:
What I Have
Note that the cross-correlation function peaks at an offset of zero pixels despite the template (blue) and observed (red) spectra clearly showing an offset. What I would expect to see would be something a bit like (albeit not exactly like; this is merely the closest representation I could produce):
What I Want
Here I have introduced an artificial offset of 50 pixels in the template data, and they more or less line up now. What I would like is, for a case like this, for a peak to appear at an offset of 50 pixels rather than at zero (I don't care if the spectra at the bottom appear lined up; that is merely for visual representation). However, despite several hours of work and research online, I can't find someone who even describes this problem, let alone a solution. I've attempted to use ScyPy's correlate and MatLib's xcorr, and bot show this same thing (although I'm led to believe that they are essentially the same function).
Why is the cross-correlation not acting the way I expect, and how to do I get it to act in a useful way?
The issue you're experiencing is probably because your spectra are not zero-centered; their RMS value looks to be about 100 in whichever units you're plotting, instead of 0. The reason this is an issue is because numpy.correlate works by "sliding" imag_data over temp_data to get their dot product at each possible offset between the two series. (See the wikipedia on cross-correlation to understand the operation itself.) When using mode='same' to produce an output that is the same length as your first input (temp_data), NumPy has to "pad" a bunch of dummy values--zeroes--to the ends of imag_data in order to be able to calculate the dot products of all the shifted versions of the imag_data. When we have any non-zero offset between the spectra, some of the values in temp_data are being multiplied by those dummy zero-padding values instead of the values in image_data. If the values in the spectra were centered around zero (RMS=0), then this zero-padding would not impact our expectation of the dot product, but because these spectra have RMS values around 100 units, that dot product (our correlation) is largest when we lay the two spectra on top of one another with no offset.
Notice that your cross-correlation result looks like a triangular pulse, which is what you might expect from the cross-correlation of two square pulses (c.f. Convolution of a Rectangular "Pulse" With Itself. That's because your spectra, once padded, look like a step function from zero up to a pulse of slightly noisy values around 100. You can try convolving with mode='full' to see the entire response of the two spectra you're correlating, or, notice that with mode='valid' that you should only get one value in return, since your two spectra are the exact same length, so there is only one offset (zero!) where you can entirely line them up.
To sidestep this issue, you can try either subtracting away the RMS value of the spectra so that they are zero-centered, or manually padding the beginning and end of imag_data with (len(temp_data)/2-1) dummy values equal to np.sqrt(np.mean(imag_data**2))
Edit:
In response to your questions in the comments, I thought I'd include a graphic to make the point I'm trying to describe a little clearer.
Say we have two vectors of values, not entirely unlike your spectra, each with some large non-zero mean.
# Generate two noisy, but correlated series
t = np.linspace(0,250,250) # time domain from 0 to 250 steps
# signal_model = narrow_peak + gaussian_noise + constant
f = 10*np.exp(-((t-90)**2)/8) + np.random.randn(250) + 40
g = 10*np.exp(-((t-180)**2)/8) + np.random.randn(250) + 40
f has a spike around t=90, and g has a spike around t=180. So we expect the correlation of g and f to have a spike around a lag of 90 timesteps (in the case of spectra, frequency bins instead of timesteps.)
But in order to get an output that is the same shape as our inputs, as in np.correlate(g,f,mode='same'), we have to "pad" f on either side with half its length in dummy values: np.correlate pads with zeroes. If we don't pad f (as in np.correlate(g,f,mode='valid')), we will only get one value in return (the correlation with zero offset), because f and g are the same length, and there is no room to shift one of the signals relative to the other.
When you calculate the correlation of g and f after that padding, you find that it peaks when the non-zero portion of signals aligns completely, that is, when there is no offset between the original f and g. This is because the RMS value of the signals is so much higher than zero--the size of the overlap of f and g depends much more strongly on the number of elements overlapping at this high RMS level than on the relatively small fluctuations each function has around it. We can remove this large contribution to the correlation by subtracting the RMS level from each series. In the graph below, the gray line on the right shows the cross-correlation the two series before zero-centering, and the teal line shows the cross-correlation after. The gray line is, like your first attempt, triangular with the overlap of the two non-zero signals. The teal line better reflects the correlation between the fluctuation of the two signals, as we desired.
xcorr = np.correlate(g,f,'same')
xcorr_rms = np.correlate(g-40,f-40,'same')
fig, axes = plt.subplots(5,2,figsize=(18,18),gridspec_kw={'width_ratios':[5,2]})
for n, axis in enumerate(axes):
offset = (0,75,125,215,250)[n]
fp = np.pad(f,[offset,250-offset],mode='constant',constant_values=0.)
gp = np.pad(g,[125,125],mode='constant',constant_values=0.)
axis[0].plot(fp,color='purple',lw=1.65)
axis[0].plot(gp,color='orange',lw=lw)
axis[0].axvspan(max(125,offset),min(375,offset+250),color='blue',alpha=0.06)
axis[0].axvspan(0,max(125,offset),color='brown',alpha=0.03)
axis[0].axvspan(min(375,offset+250),500,color='brown',alpha=0.03)
if n==0:
axis[0].legend(['f','g'])
axis[0].set_title('offset={}'.format(offset-125))
axis[1].plot(xcorr/(40*40),color='gray')
axis[1].plot(xcorr_rms,color='teal')
axis[1].axvline(offset,-100,350,color='maroon',lw=5,alpha=0.5)
if n == 0:
axis[1].legend(["$g \star f$","$g' \star f'$","offset"],loc='upper left')
plt.show()
I am writing a Python program for finding areas of interest on a page. The positions on the page of all values of interest are given to me, but some values (typically only one or two) are far away from the others and I'd like to remove these. The data set is not huge, less than 100 data points but I will need to do this many times.
I have a cartesian coordinate system on two axes (x and y) in the first quadrant, so only positive values.
My data points represent boxes drawn on this coordinate system, which I have stored as a set of two coordinate pairs in a tuple. A box can be drawn by two coordinate pairs since all lines are straight. Example: (8, 2, 15, 10) would draw a box with indices (x,y) = (8,2), (8,10), (15,10) and (15,2).
I am trying to remove the outliers in this set but am having a hard time trying to figure out a good approach. I have thought about removing the outliers by finding the IQR and removing all points which fulfill these criteria:
Q1 - 1.5 * IQR or
Q3 + 1.5 * IQR
The problem here is that I am having a hard time figuring out how because the values are not just coordinates but areas if you will. However they are overlapping so they don't fit well in a histogram either.
First I thought I might add a point for each whole value that the box spans, the example box would in that case create 56 points. It seems to me as if this solution is quite bad. Does anyone have any alternative solutions?
Mainly there are two approaches: either you fixe the threshold value or you let machine learning infer it for you.
For machine learning, you can use Isolation Forest.
If you don't want ML then you have to fix yourself the threshold. So you can use a norm. There is no.linalg.norm(p1 - p2) or if you want more control on the metric there is cdist:
scipy.spatial.distance.cdist(p1, p2)
I am currently using Python to compare two different datasets (xDAT and yDAT) that are composed of 240 distance measurements taken over a certain amount of time. However, dataset xDAT is offset by a non-linear amount. This non-linear amount is equal to the width of a time-dependent, dynamic medium, which I call level-A. More specifically xDAT measures from the origin to the top of level-A, whereas yDAT measures from the origin to the bottom of level-A. See following diagram:
In order to compare both curves, I must fist apply a correction to xDAT to make up for its offset (the width of level-A).
As of yet, I have played around with different degrees of numpy.polyfit. I.E:
coefs = np.polynomial.polynomial.polyfit(xDAT, yDAT, 5)
polyEST=[]
for i in range(0,len(x-DAT)):
polyEST.append(coefs[0] + coefs[1]*xDAT[i] + coefs[2]*pow(xDAT[i],2) + coefs[3]*pow(xDAT[i],3) + coefs[4]*pow(xDAT[i],4) + coefs[5]*pow(xDAT[i],5))
The problem with using this method, is that when I plot polyEST (which is the corrected version of xDAT), the plot still does not match the trend of yDAT and remains offset. Please see the figure below, where xDAT= blue, corrected xDAT=red, and yDAT=green:
Ideally, the corrected xDAT should still remain noisier than the yDAT, but the general oscillation and trend of the curves should match.
I would greatly appreciate help on implementing a different curve-fitting and parameter estimation technique in order to correct for the non-linear offset caused by level-A.
Thank you.
The answer depends on what Level A is. If it is independent, your first line should be something like
coefs = np.polynomial.polynomial.polyfit(numpy.arange(xDAT.size), yDAT-xDAT, 5)
This will give a polyfit of an independent A as drawn, and then the corrected x should be
xDAT+np.polynomial.polynomial.polyval(numpy.arange(xDAT.size),coefs)
If A is dependent on the variables (as it looks to be), you don't want to polyfit, as that only regresses the real part of the oscillation (the "spring" part of a spring-damper system), which is why your corrected_xDat is in phase with xDat instead of yDat. To regress something like that you'll need to use Fourier transforms (which is not my specialty).
I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).
The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.
The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.
So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)
The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).
My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped aces, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).
The nature of the times series data makes Python's numpy.correlate() unusable. I've also looked at R's Zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.
Anyone have any clues for what I can do, or approaches I should research?
Update 28 Feb 2011: Added some plots here showing examples of the data.
My interpretation of your question: Given two very long, noisy time series, find a shift of one that matches large 'bumps' in one signal to large bumps in the other signal.
My suggestion: interpolate the data so it's uniformly spaced, rectify and smooth the data (assuming the phase of the fast oscillations is uninteresting), and do a one-point-at-a-time cross correlation (assuming a small shift will line up the data).
import numpy
from scipy.ndimage import gaussian_filter
"""
sig1 and sig 2 are assumed to be large, 1D numpy arrays
sig1 is sampled at times t1, sig2 is sampled at times t2
t_start, t_end, is your desired sampling interval
t_len is your desired number of measurements
"""
t = numpy.linspace(t_start, t_end, t_len)
sig1 = numpy.interp(t, t1, sig1)
sig2 = numpy.interp(t, t2, sig2)
#Now sig1 and sig2 are sampled at the same points.
"""
Rectify and smooth, so 'peaks' will stand out.
This makes big assumptions about your data;
these assumptions seem true-ish based on your plots.
"""
sigma = 10 #Tune this parameter to get the right smoothing
sig1, sig2 = abs(sig1), abs(sig2)
sig1, sig2 = gaussian_filter(sig1, sigma), gaussian_filter(sig2, sigma)
"""
Now sig1 and sig2 should look smoothly varying, with humps at each 'event'.
Hopefully we can search a small range of shifts to find the maximum of the
cross-correlation. This assumes your data are *nearly* lined up already.
"""
max_xc = 0
best_shift = 0
for shift in range(-10, 10): #Tune this search range
xc = (numpy.roll(sig1, shift) * sig2).sum()
if xc > max_xc:
max_xc = xc
best_shift = shift
print 'Best shift:', best_shift
"""
If best_shift is at the edges of your search range,
you should expand the search range.
"""
If the data contains gaps of unknown sizes that are different in each time series, then I would give up on trying to correlate entire sequences, and instead try cross correlating pairs of short windows on each time series, say overlapping windows twice the length of a typical event (300 samples long). Find potential high cross correlation matches across all possibilities, and then impose a sequential ordering constraint on the potential matches to get sequences of matched windows.
From there you have smaller problems that are easier to analyze.
This isn't a technical answer, but it might help you come up with one:
Convert the plot to an image, and stick it into a decent image program like gimp or photoshop
break the plots into discrete images whenever there's a gap
put the first series of plots in a horizontal line
put the second series in a horizontal line right underneath it
visually identify the first correlated event
if the two events are not lined up vertically:
select whichever instance is further to the left and everything to the right of it on that row
drag those things to the right until they line up
This is pretty much how an audio editor works, so you if you converted it into a simple audio format like an uncompressed WAV file, you could manipulate it directly in something like Audacity. (It'll sound horrible, of course, but you'll be able to move the data plots around pretty easily.)
Actually, audacity has a scripting language called nyquist, too, so if you don't need the program to detect the correlations (or you're at least willing to defer that step for the time being) you could probably use some combination of audacity's markers and nyquist to automate the alignment and export the clean data in your format of choice once you tag the correlation points.
My guess is, you'll have to manually build an offset table that aligns the "matches" between the series. Below is an example of a way to get those matches. The idea is to shift the data left-right until it lines up and then adjust the scale until it "matches". Give it a try.
library(rpanel)
#Generate the x1 and x2 data
n1 <- rnorm(500)
n2 <- rnorm(200)
x1 <- c(n1, rep(0,100), n2, rep(0,150))
x2 <- c(rep(0,50), 2*n1, rep(0,150), 3*n2, rep(0,50))
#Build the panel function that will draw/update the graph
lvm.draw <- function(panel) {
plot(x=(1:length(panel$dat3))+panel$off, y=panel$dat3, ylim=panel$dat1, xlab="", ylab="y", main=paste("Alignment Graph Offset = ", panel$off, " Scale = ", panel$sca, sep=""), typ="l")
lines(x=1:length(panel$dat3), y=panel$sca*panel$dat4, col="red")
grid()
panel
}
#Build the panel
xlimdat <- c(1, length(x1))
ylimdat <- c(-5, 5)
panel <- rp.control(title = "Eye-Ball-It", dat1=ylimdat, dat2=xlimdat, dat3=x1, dat4=x2, off=100, sca=1.0, size=c(300, 160))
rp.slider(panel, var=off, from=-500, to=500, action=lvm.draw, title="Offset", pos=c(5, 5, 290, 70), showvalue=TRUE)
rp.slider(panel, var=sca, from=0, to=2, action=lvm.draw, title="Scale", pos=c(5, 70, 290, 90), showvalue=TRUE)
It sounds like you want to minimize the function (Ax'+By) + (Az'+Bx) + (Ay'+Bz) for a pair of values: Namely, the time-offset: t0 and a time scale factor: tr. where Ax' = tr*(Ax + t0), etc..
I would look into SciPy's bivariate optimize functions. And I would use a mask or temporarily zero the data (both Ax' and By for example) over the "gaps" (assuming the gaps can be programmatically determined).
To make the process more efficient, start with a coarse sampling of A and B, but set the precision in fmin (or whatever optimizer you've selected) that is commensurate with your sampling. Then proceed with progressively finer-sampled windows of the full dataset until your windows are narrow and are not down-sampled.
Edit - matching axes
Regarding the issue of trying to identify which axis is co-linear with a given axis, and not knowing at thing about the characteristics of your data, i can point towards a similar question. Look into pHash or any of the other methods outlined in this post to help identify similar waveforms.
I have two dimensional discrete spatial data. I would like to make an approximation of the spatial boundaries of this data so that I can produce a plot with another dataset on top of it.
Ideally, this would be an ordered set of (x,y) points that matplotlib can plot with the plt.Polygon() patch.
My initial attempt is very inelegant: I place a fine grid over the data, and where data is found in a cell, a square matplotlib patch is created of that cell. The resolution of the boundary thus depends on the sampling frequency of the grid. Here is an example, where the grey region are the cells containing data, black where no data exists.
1st attempt http://astro.dur.ac.uk/~dmurphy/data_limits.png
OK, problem solved - why am I still here? Well.... I'd like a more "elegant" solution, or at least one that is faster (ie. I don't want to get on with "real" work, I'd like to have some fun with this!). The best way I can think of is a ray-tracing approach - eg:
from xmin to xmax, at y=ymin, check if data boundary crossed in intervals dx
y=ymin+dy, do 1
do 1-2, but now sample in y
An alternative is defining a centre, and sampling in r-theta space - ie radial spokes in dtheta increments.
Both would produce a set of (x,y) points, but then how do I order/link neighbouring points them to create the boundary?
A nearest neighbour approach is not appropriate as, for example (to borrow from Geography), an isthmus (think of Panama connecting N&S America) could then close off and isolate regions. This also might not deal very well with the holes seen in the data, which I would like to represent as a different plt.Polygon.
The solution perhaps comes from solving an area maximisation problem. For a set of points defining the data limits, what is the maximum contiguous area contained within those points To form the enclosed area, what are the neighbouring points for the nth point? How will the holes be treated in this scheme - is this erring into topology now?
Apologies, much of this is me thinking out loud. I'd be grateful for some hints, suggestions or solutions. I suspect this is an oft-studied problem with many solution techniques, but I'm looking for something simple to code and quick to run... I guess everyone is, really!
~~~~~~~~~~~~~~~~~~~~~~~~~
OK, here's attempt #2 using Mark's idea of convex hulls:
alt text http://astro.dur.ac.uk/~dmurphy/data_limitsv2.png
For this I used qconvex from the qhull package, getting it to return the extreme vertices. For those interested:
cat [data] | qconvex Fx > out
The sampling of the perimeter seems quite low, and although I haven't played much with the settings, I'm not convinced I can improve the fidelity.
I think what you are looking for is the Convex Hull of the data That will give a set of points that if connected will mean that all your points are on or inside the connected points
I may have mixed something, but what's the motivation for simply not determining the maximum and minimum x and y level? Unless you have an enormous amount of data you could simply iterate through your points determining minimum and maximum levels fairly quickly.
This isn't the most efficient example, but if your data set is small this won't be particularly slow:
import random
data = [(random.randint(-100, 100), random.randint(-100, 100)) for i in range(1000)]
x_min = min([point[0] for point in data])
x_max = max([point[0] for point in data])
y_min = min([point[1] for point in data])
y_max = max([point[1] for point in data])