I have a series of signals, sample data looks like this:
We can see that there are 5 peaks there. I can assume that there won't be more than one peak every 10 samples; usually there is one peak every 20 to 40 samples.
I was trying to fit a smoothing spline and then use scipy.signal.find_peaks, and it kind of works, but I have to choose a different number of spline knots to approximate each series correctly, and the number of knots correlates with the number of peaks, so I sort of ended up where I began - except that now I'd only need a rough idea of the number of peaks.
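Roughly, the attempt looked like this (a sketch, not my exact code; the smoothing factor s is illustrative and is precisely what I had to hand-tune per series):
import numpy as np
from scipy.interpolate import UnivariateSpline
from scipy.signal import find_peaks

x = np.arange(len(data))
# s bounds the residual sum of squares and thus (indirectly) the number
# of knots; a different value is needed for each series
spline = UnivariateSpline(x, data, s=1e7)
peaks, _ = find_peaks(spline(x))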
Then I tried it by dividing the signal into parts:
window = 10  # the smallest range potentially containing a whole peak
parts = np.array_split(data, len(data)//window)  # divide the data set into parts
lengths = []
for i in parts:
    d = abs(i.max() - i.min())
    lengths.append(d)  # differences between max and min values in each part
av = sum(lengths)/len(lengths)
for i in lengths:
    if i < some_tolerance_fraction*av:
        window = window + 1  # make the parts for the next check bigger
        break
The idea was that the difference between the min and max values in these parts should be smaller than the height of an actual peak I'm looking for, unless the parts are large enough to contain a whole peak - then the differences should be similar in each part and the average should also be similar to the actual height of the peak.
But this doesn't work at all and possibly doesn't even make sense - depending on the tolerance, it either keeps enlarging the window indefinitely or never enlarges it at all.
This is the array from the image:
array([254256., 254390., 251546., 250561., 250603., 250128., 251000.,
252612., 253552., 253776., 252843., 251800., 250808., 250569.,
249804., 247755., 247685., 247111., 242320., 242580., 243462.,
240383., 239689., 240730., 239508., 239604., 238544., 240174.,
240806., 240218., 239956., 241325., 241343., 241532., 240696.,
242064., 241830., 237569., 237392., 236353., 234819., 234430.,
233890., 233215., 233745., 232159., 231778., 230307., 228754.,
225823., 225139., 223737., 222078., 221188., 220669., 221944.,
223928., 224996., 223405., 223018., 224966., 226590., 226166.,
226012., 226192., 224900., 224439., 223179., 222375., 221509.,
220734., 219686., 218656., 217792., 215934., 214829., 213673.,
212837., 211604., 210748., 210216., 209974., 209659., 209707.,
210131., 210663., 212113., 213078., 214476., 215087., 216220.,
216831., 217286., 217373., 217030., 216491., 215642., 214249.,
213273., 212148., 210846., 209570., 208202., 207165., 206677.,
205703., 203837., 202620., 201530., 198812., 197654., 196506.,
194163., 193736., 193945., 193785., 193417., 193044., 193768.,
194690., 195739., 198592., 199237., 199932., 200142., 199859.,
199593., 199337., 198403., 197500., 195988., 195114., 194278.,
193837., 193861.])
I would use scipy's find_peaks, but first filter the signal with a moving average:
import numpy as np
import matplotlib.pyplot as plt
arr = np.array([254256., 254390., 251546., 250561., 250603., 250128., 251000.,
252612., 253552., 253776., 252843., 251800., 250808., 250569.,
249804., 247755., 247685., 247111., 242320., 242580., 243462.,
240383., 239689., 240730., 239508., 239604., 238544., 240174.,
240806., 240218., 239956., 241325., 241343., 241532., 240696.,
242064., 241830., 237569., 237392., 236353., 234819., 234430.,
233890., 233215., 233745., 232159., 231778., 230307., 228754.,
225823., 225139., 223737., 222078., 221188., 220669., 221944.,
223928., 224996., 223405., 223018., 224966., 226590., 226166.,
226012., 226192., 224900., 224439., 223179., 222375., 221509.,
220734., 219686., 218656., 217792., 215934., 214829., 213673.,
212837., 211604., 210748., 210216., 209974., 209659., 209707.,
210131., 210663., 212113., 213078., 214476., 215087., 216220.,
216831., 217286., 217373., 217030., 216491., 215642., 214249.,
213273., 212148., 210846., 209570., 208202., 207165., 206677.,
205703., 203837., 202620., 201530., 198812., 197654., 196506.,
194163., 193736., 193945., 193785., 193417., 193044., 193768.,
194690., 195739., 198592., 199237., 199932., 200142., 199859.,
199593., 199337., 198403., 197500., 195988., 195114., 194278.,
193837., 193861.])
def moving_average(x, w):
    """calculate the moving average with window size w"""
    return np.convolve(x, np.ones(w), 'valid') / w
# moving average with window size 5
n = 5
arr_f = moving_average(arr, n)
# extend to show in the same plot
arr_f_ext = np.hstack([np.ones(n//2)*arr_f[0], arr_f])
plt.figure()
plt.plot(arr,'o')
plt.plot(arr_f_ext)
This will show:
Then find peaks:
from scipy.signal import find_peaks
# n//2 is the offset of the averaged signal (2 in this example)
peaks = find_peaks(arr_f)[0] + n//2
plt.plot(peaks, arr[peaks], 'xr', ms=10)
which will show:
Note that:
1) the filtered signal has a delay of n/2 samples (rounding down), so add n//2 to the peaks found in the filtered signal;
2) the filtered signal does not have the same values as the original, only the same behaviour, so to extract the peak values, use the original signal (see the sketch below).
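Putting it together, a minimal sketch (the distance value is an assumption taken from the question's "no more than one peak every 10 samples"):
n = 5
arr_f = moving_average(arr, n)
# distance=10 encodes the minimum spacing between peaks
peaks, _ = find_peaks(arr_f, distance=10)
peaks += n//2             # compensate the filter delay
peak_values = arr[peaks]  # read the heights from the original signal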
My informal definition of a peak is a point surrounded by two segments, one ascending and one descending. It's pretty easy to implement by iterating over the array and comparing two neighbouring segments.
If they are both in the same direction, we merge the two segments by deleting the middle point.
To determine whether they are in the same direction, I used multiplication: the product is positive if the two segments are in the same direction.
At the end, every remaining point will be a peak (we cannot decide this for the first and last points).
i = 0  # position the cursor at the beginning
while i <= (len(t) - 3):
    if (t[i] - t[i+1]) * (t[i+1] - t[i+2]) >= 0:
        # Same direction: join the 2 segments by removing the middle point.
        # This test also covers a horizontal segment formed by the
        # first 2 points; we remove the second.
        del t[i+1]
    else:
        # Different directions: delete nothing, move the cursor by 1.
        i += 1
See the plot below: you can see the reduction from 135 to 34 points.
Each blue mark is a peak.
Some of these peaks are non-significant and more filtering is required. The best method depends on your application: you may filter on the vertical distance between two adjacent peaks, or on the horizontal distance between two adjacent peaks. For the latter we need the x value of each peak, so I rewrote the program using x-y data points (a sketch of the vertical-distance filter follows the program below).
t0 = [254256, 254390, 251546, 250561, 250603, 250128, 251000,
252612, 253552, 253776, 252843, 251800, 250808, 250569,
249804, 247755, 247685, 247111, 242320, 242580, 243462,
240383, 239689, 240730, 239508, 239604, 238544, 240174,
240806, 240218, 239956, 241325, 241343, 241532, 240696,
242064, 241830, 237569, 237392, 236353, 234819, 234430,
233890, 233215, 233745, 232159, 231778, 230307, 228754,
225823, 225139, 223737, 222078, 221188, 220669, 221944,
223928, 224996, 223405, 223018, 224966, 226590, 226166,
226012, 226192, 224900, 224439, 223179, 222375, 221509,
220734, 219686, 218656, 217792, 215934, 214829, 213673,
212837, 211604, 210748, 210216, 209974, 209659, 209707,
210131, 210663, 212113, 213078, 214476, 215087, 216220,
216831, 217286, 217373, 217030, 216491, 215642, 214249,
213273, 212148, 210846, 209570, 208202, 207165, 206677,
205703, 203837, 202620, 201530, 198812, 197654, 196506,
194163, 193736, 193945, 193785, 193417, 193044, 193768,
194690, 195739, 198592, 199237, 199932, 200142, 199859,
199593, 199337, 198403, 197500, 195988, 195114, 194278,
193837, 193861]
def graph(t1, t2):
    import matplotlib.pyplot as plt
    fig = plt.figure()
    plt.plot([p[0] for p in t1], [p[1] for p in t1], color='r', label="raw data")
    plt.plot([p[0] for p in t2], [p[1] for p in t2], marker='.', color='b', label="reduced data")
    plt.title('Peak identification')
    plt.legend()
    plt.show()
def reduce(t):
    i = 0  # position the cursor at the beginning
    while i < (len(t) - 2):
        if (t[i][1] - t[i+1][1]) * (t[i+1][1] - t[i+2][1]) >= 0:
            # Same direction: join the 2 segments by removing the middle point.
            # This test also covers a horizontal segment formed by the
            # first 2 points; we remove the second.
            del t[i+1]
        else:
            # Different directions: delete nothing, move the cursor by 1.
            i += 1
t1 = [(i, y) for i, y in enumerate(t0)]  # add x to every data point
t = t1.copy()
reduce(t)
graph(t1, t)
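As an example of the extra filtering mentioned above, a minimal sketch that keeps only peaks whose vertical distance to the previously kept point exceeds a threshold (min_height is a hypothetical parameter you would tune for your data):
def filter_peaks(t, min_height):
    """Drop points that differ from the previously kept point
    by less than min_height in y."""
    kept = [t[0]]
    for point in t[1:]:
        if abs(point[1] - kept[-1][1]) >= min_height:
            kept.append(point)
    return kept

significant = filter_peaks(t, min_height=2000)
graph(t1, significant)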
Have fun!
I use matplotlib's method hexbin to compute 2d histograms on my data.
But I would like to get the coordinates of the centers of the hexagons in order to further process the results.
I got the values using the get_array() method on the result, but I cannot figure out how to get the bin coordinates.
I tried to compute them from the number of bins and the extent of my data, but I don't know the exact number of bins in each direction. gridsize=(10,2) should do the trick, but it does not seem to work.
Any idea?
I think this works.
import numpy as np
import matplotlib.pyplot as plt

def generate_data(n):
    """Make random, correlated x & y arrays"""
    points = np.random.multivariate_normal(mean=(0, 0),
                                           cov=[[0.4, 9], [9, 10]], size=int(n))
    return points

if __name__ == '__main__':
    color_map = plt.cm.Spectral_r
    n = 1e4
    points = generate_data(n)
    xbnds = np.array([-20.0, 20.0])
    ybnds = np.array([-20.0, 20.0])
    extent = [xbnds[0], xbnds[1], ybnds[0], ybnds[1]]
    fig = plt.figure(figsize=(10, 9))
    ax = fig.add_subplot(111)
    x, y = points.T
    # Set gridsize just to make the hexagons visually large
    image = plt.hexbin(x, y, cmap=color_map, gridsize=20, extent=extent,
                       mincnt=1, bins='log')
    # Note that mincnt=1 drops the cells with no points
    counts = image.get_array()
    ncnts = np.count_nonzero(np.power(10, counts))
    verts = image.get_offsets()
    for offc in range(verts.shape[0]):
        binx, biny = verts[offc][0], verts[offc][1]
        if counts[offc]:
            plt.plot(binx, biny, 'k.', zorder=100)
    ax.set_xlim(xbnds)
    ax.set_ylim(ybnds)
    plt.grid(True)
    cb = plt.colorbar(image, spacing='uniform', extend='max')
    plt.show()
I would love to confirm that the code by Hooked using get_offsets() works, but I tried several iterations of the code mentioned above to retrieve the center positions and, as Dave mentioned, get_offsets() remained empty. The workaround I found is to use the non-empty image.get_paths() option. My code takes the mean of the path vertices to find the centers, which makes it just a smidge longer, but it does work.
The get_paths() option returns a set of embedded x,y coordinates that can be looped over and then averaged to give the center position of each hexagon.
The code that I have is as follows:
counts = image.get_array()   # counts in each hexagon, works great
verts = image.get_offsets()  # empty, don't use this
b = image.get_paths()        # this does work, gives a list of Path objects
for x in range(len(b)):
    xav = np.mean(b[x].vertices[0:6, 0])  # center in x (RA)
    yav = np.mean(b[x].vertices[0:6, 1])  # center in y (DEC)
    plt.plot(xav, yav, 'k.', zorder=100)
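The loop can also be condensed into a single comprehension doing the same vertex averaging (assuming the same image object as above):
centers = np.array([p.vertices[:6].mean(axis=0) for p in image.get_paths()])
plt.plot(centers[:, 0], centers[:, 1], 'k.', zorder=100)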
I had this same problem. I think what needs to be developed is a framework with a HexagonalGrid object which can then be applied to many different data sets (and it would be awesome to do it for N dimensions). This is possible, and it surprises me that neither SciPy nor NumPy has anything for it (furthermore, there seems to be nothing else like it, except perhaps binify).
That said, I assume you want to use hexbinning to compare multiple binned data sets. This requires some common base. I got this to work using matplotlib's hexbin the following way:
import numpy as np
import matplotlib.pyplot as plt

def get_data(mean, cov, n=1e3):
    """
    Quick fake data builder
    """
    np.random.seed(101)
    points = np.random.multivariate_normal(mean=mean, cov=cov, size=int(n))
    x, y = points.T
    return x, y

def get_centers(hexbin_output):
    """
    About 40% faster than the previous post, only because you're not
    calculating the min/max every time
    """
    paths = hexbin_output.get_paths()
    v = paths[0].vertices[:-1]  # the path has an extra closing vertex at the end
    vx, vy = v.T
    idx = [3, 0, 5, 2]  # index for [xmin, xmax, ymin, ymax]
    xmin, xmax, ymin, ymax = vx[idx[0]], vx[idx[1]], vy[idx[2]], vy[idx[3]]
    half_width_x = abs(xmax - xmin) / 2.0
    half_width_y = abs(ymax - ymin) / 2.0
    centers = []
    for i in range(len(paths)):
        cx = paths[i].vertices[idx[0], 0] + half_width_x
        cy = paths[i].vertices[idx[2], 1] + half_width_y
        centers.append((cx, cy))
    return np.asarray(centers)

# important parts ==>
class Hexagonal2DGrid(object):
    """
    Used to fix the gridsize, extent, and bins
    """
    def __init__(self, gridsize, extent, bins=None):
        self.gridsize = gridsize
        self.extent = extent
        self.bins = bins

def hexbin(x, y, hexgrid):
    """
    To hexagonally bin the data in 2 dimensions
    """
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Note mincnt=0 so that it will return a value for every cell in the
    # hexgrid, not just those with count > mincnt.
    # Basically you fix the gridsize, extent, and bins to keep them the same;
    # then the resulting count array is the same.
    hexbin = plt.hexbin(x, y, mincnt=0,
                        gridsize=hexgrid.gridsize,
                        extent=hexgrid.extent,
                        bins=hexgrid.bins)
    # you could close the figure if you don't want it
    # plt.close(fig.number)
    counts = hexbin.get_array().copy()
    return counts, hexbin

# Example ===>
if __name__ == "__main__":
    hexgrid = Hexagonal2DGrid((21, 5), [-70, 70, -20, 20])
    x_data, y_data = get_data((0, 0), [[-40, 95], [90, 10]])
    x_model, y_model = get_data((0, 10), [[100, 30], [3, 30]])
    counts_data, hexbin_data = hexbin(x_data, y_data, hexgrid)
    counts_model, hexbin_model = hexbin(x_model, y_model, hexgrid)
    # if you want the centers, they will be the same for both
    centers = get_centers(hexbin_data)
    # if you want to ignore the cells with zeros then use the following mask.
    # But if you want zeros for some bins and not others, I'm not sure of an
    # elegant way to do this without using the centers
    nonzero = counts_data != 0
    # now you can compare the two data sets
    variance_data = counts_data[nonzero]
    square_diffs = (counts_data[nonzero] - counts_model[nonzero])**2
    chi2 = np.sum(square_diffs / variance_data)
    print(" chi2={}".format(chi2))
I need to compare some theoretical data with real data in python.
The theoretical data comes from solving an equation.
To improve the comparison I would like to remove the data points that fall far from the theoretical curve. I mean, I want to remove the points below and above the red dashed lines in the figure (made with matplotlib).
Both the theoretical curves and the data points are arrays of different length.
I can try to remove the points roughly, by eye; for example, the first upper point can be detected using:
data2[(data2.redshift<0.4) & (data2.dmodulus>1)]
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])
But I would like to use a less by-eye, more systematic way.
So, can anyone help me finding an easy way of removing the problematic points?
Thank you!
This might be overkill and is based on your comment
Both the theoretical curves and the data points are arrays of
different length.
I would do the following:
1. Truncate the data set so that its x values lie within the max and min values of the theoretical set.
2. Interpolate the theoretical curve using scipy.interpolate.interp1d and the truncated data x values from step 1. The reason for step 1 is to satisfy the constraints of interp1d.
3. Use numpy.where to find the data y values that are outside the range of acceptable theory values.
4. DON'T discard these values, as was suggested in comments and other answers. If you want clarity, point them out by plotting the 'inliers' in one color and the 'outliers' in another.
Here's a script that is close to what you are looking for, I think; hopefully it will help you accomplish what you want:
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt

# make up data
def makeUpData():
    '''Make many more data points (x,y,yerr) than theory (x,y),
    with theory yerr corresponding to a constant "sigma" in y,
    about the x,y value'''
    NX = 150
    dataX = (np.random.rand(NX)*1.1)**2
    dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
    dataErr = np.random.rand(NX)*dataX*1.3
    theoryX = np.arange(0, 1, 0.1)
    theoryY = theoryX*theoryX*1.5
    theoryErr = 0.5
    return dataX, dataY, dataErr, theoryX, theoryY, theoryErr

def makeSameXrange(theoryX, dataX, dataY):
    '''
    Truncate the dataX and dataY ranges so that dataX min and max are within
    the max and min of theoryX.
    '''
    minT, maxT = theoryX.min(), theoryX.max()
    goodIdxMax = np.where(dataX < maxT)
    goodIdxMin = np.where(dataX[goodIdxMax] > minT)
    return (dataX[goodIdxMax])[goodIdxMin], (dataY[goodIdxMax])[goodIdxMin]

# take 'theory' and get values at every 'data' x point
def theoryYatDataX(theoryX, theoryY, dataX):
    '''For every dataX point, find the interpolated theoryY value. theoryX
    is needed for the interpolation.'''
    f = interpolate.interp1d(theoryX, theoryY)
    return f(dataX[np.where(dataX < np.max(theoryX))])

# collect valid points
def findInlierSet(dataX, dataY, interpTheoryY, theoryErr):
    '''Find where theoryY-theoryErr < dataY < theoryY+theoryErr and return
    the valid points.'''
    withinUpper = np.where(dataY < (interpTheoryY+theoryErr))
    withinLower = np.where(dataY[withinUpper]
                           > (interpTheoryY[withinUpper]-theoryErr))
    return (dataX[withinUpper])[withinLower], (dataY[withinUpper])[withinLower]

def findOutlierSet(dataX, dataY, interpTheoryY, theoryErr):
    '''Find where dataY is outside theoryY +- theoryErr and return
    the outlying points.'''
    withinUpper = np.where(dataY > (interpTheoryY+theoryErr))
    withinLower = np.where(dataY < (interpTheoryY-theoryErr))
    return (dataX[withinUpper], dataY[withinUpper],
            dataX[withinLower], dataY[withinLower])

if __name__ == "__main__":
    dataX, dataY, dataErr, theoryX, theoryY, theoryErr = makeUpData()
    TruncDataX, TruncDataY = makeSameXrange(theoryX, dataX, dataY)
    interpTheoryY = theoryYatDataX(theoryX, theoryY, TruncDataX)
    inDataX, inDataY = findInlierSet(TruncDataX, TruncDataY, interpTheoryY,
                                     theoryErr)
    outUpX, outUpY, outDownX, outDownY = findOutlierSet(TruncDataX,
                                                        TruncDataY,
                                                        interpTheoryY,
                                                        theoryErr)
    fig = plt.figure()
    ax = fig.add_subplot(211)
    ax.errorbar(dataX, dataY, dataErr, fmt='.', color='k')
    ax.plot(theoryX, theoryY, 'r-')
    ax.plot(theoryX, theoryY+theoryErr, 'r--')
    ax.plot(theoryX, theoryY-theoryErr, 'r--')
    ax.set_xlim(0, 1.4)
    ax.set_ylim(-.5, 3)
    ax = fig.add_subplot(212)
    ax.plot(inDataX, inDataY, 'ko')
    ax.plot(outUpX, outUpY, 'bo')
    ax.plot(outDownX, outDownY, 'ro')
    ax.plot(theoryX, theoryY, 'r-')
    ax.plot(theoryX, theoryY+theoryErr, 'r--')
    ax.plot(theoryX, theoryY-theoryErr, 'r--')
    ax.set_xlim(0, 1.4)
    ax.set_ylim(-.5, 3)
    fig.savefig('findInliers.png')
This figure is the result:
In the end I used some of Yann's code:
def theoryYatDataX(theoryX, theoryY, dataX):
    '''For every dataX point, find the interpolated theoryY value. theoryX
    is needed for the interpolation.'''
    f = interpolate.interp1d(theoryX, theoryY)
    return f(dataX[np.where(dataX < np.max(theoryX))])

def findOutlierSet(data, interpTheoryY, theoryErr):
    '''Find where data.dmodulus is outside interpTheoryY +- theoryErr and
    split the data into inliers and outliers.'''
    up = np.where(data.dmodulus > (interpTheoryY+theoryErr))
    low = np.where(data.dmodulus < (interpTheoryY-theoryErr))
    # join all the indices together in a flat array
    out = np.hstack([up, low]).ravel()
    index = np.array(np.ones(len(data), dtype=bool))
    index[out] = False
    datain = data[index]
    dataout = data[out]
    return datain, dataout

def selectdata(data, theoryX, theoryY):
    """
    Data selection: z<1 and +-0.5 LFLRW separation
    """
    # Select data with redshift z<1
    data1 = data[data.redshift < 1]
    # From modulus to light distance (modulus2distance is a helper
    # defined elsewhere in my code):
    data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,
                                                            data1.dmodulus_error)
    # order the data by redshift
    data1.sort(order='redshift')
    # Outliers: distance to the LFLRW curve bigger than +-0.5
    theoryErr = 0.5
    # Interpolate the theory curve to get the same points as the data
    interpy = theoryYatDataX(theoryX, theoryY, data1.redshift)
    datain, dataout = findOutlierSet(data1, interpy, theoryErr)
    return datain, dataout
Using those functions I can finally obtain:
Thank you all for your help.
Just look at the difference between the red curve and the points; if it is bigger than the difference between the red curve and the dashed red curve, remove the point.
diff = np.abs(points - red_curve)
tolerance = np.abs(dashed_curve - red_curve)
filtered = points[diff <= tolerance]  # keep only the points within the dashed lines
But please take the comment from NickLH seriously: your data looks pretty good without any filtering, and your "outliers" all have a very big error and won't affect the fit much.
You could either use numpy.where() to identify which x-y pairs meet your plotting criteria, or use enumerate to do pretty much the same thing. Example:
x_list = [ 1, 2, 3, 4, 5, 6 ]
y_list = ['f','o','o','b','a','r']
result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5]
print(result)
I'm sure you could change the conditions so that the '2' and '5' in the above example are functions of your curves.
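For completeness, a sketch of the numpy.where() route on the same toy data (arrays instead of lists):
import numpy as np

x_arr = np.array(x_list)
y_arr = np.array(y_list)
# indices where the condition on x holds
idx = np.where((x_arr >= 2) & (x_arr < 5))
result = y_arr[idx]
print(result)  # ['o' 'o' 'b']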