Considering Highest Peak Curve From Two Sets of Data Points - python

I have two columns which would correspond to x and y-axis in which I will be eventually graphing that sets of data points to a curve like graph.
The problem is that based on the nature of the datapoints, when graphing it, I end up having two peaks however, I want to pick only the highest peak when graphing and discard the lowest peak(s) (not the highest point but the entire the highest peak graphed).
Is there away to do that in Python? I don't show the codes here because I am not sure how to do the coding at all.
Here is the datapoints (input) as well as the graph!

You can use scipy argrelextrema to get all the peaks, work out the maximum and then build up a mask array for the peak you want to plot. This will give you full control based on your data, using things like mincutoff to work out what determines a separate peak,
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
#Setup and plot data
fig, ax = plt.subplots(1,2)
y = np.array([0,0,0,0,0,6.14,7.04,5.6,0,0,0,0,0,0,0,0,0,0,0,16.58,60.06,99.58,100,50,0.,0.,0.])
x = np.linspace(3.92,161,y.size)
ax[0].plot(x,y)
#get peaks
peaks_indx = argrelextrema(y, np.greater)[0]
peaks = y[peaks_indx]
ax[0].plot(x[peaks_indx],y[peaks_indx],'o')
#Get maxpeak
maxpeak = 0.
for p in peaks_indx:
print(p)
if y[p] > maxpeak:
maxpeak = y[p]
maxpeak_indx = p
#Get mask of data around maxpeak to plot
mincutoff = 0.
indx_to_plot = np.zeros(y.size, dtype=bool)
for i in range(maxpeak_indx):
if y[maxpeak_indx-i] > mincutoff:
indx_to_plot[maxpeak_indx-i] = True
else:
indx_to_plot[maxpeak_indx-i] = True
break
for i in range(y.size-maxpeak_indx):
if y[maxpeak_indx+i] > mincutoff:
indx_to_plot[maxpeak_indx+i] = True
else:
indx_to_plot[maxpeak_indx+i] = True
break
ax[1].plot(x[indx_to_plot],y[indx_to_plot])
plt.show()
The result is then,
UPDATE: Code to plot only the largest peak.
import numpy as np
from scipy.signal import argrelextrema
import matplotlib.pyplot as plt
#Setup and plot data
y = np.array([0,0,0,0,0,6.14,7.04,5.6,0,0,0,0,0,0,
0,0,0,0,0,16.58,60.06,99.58,100,50,0.,0.,0.])
x = np.linspace(3.92,161,y.size)
#get peaks
peaks_indx = argrelextrema(y, np.greater)[0]
peaks = y[peaks_indx]
#Get maxpeak
maxpeak = 0.
for p in peaks_indx:
print(p)
if y[p] > maxpeak:
maxpeak = y[p]
maxpeak_indx = p
#Get mask of data around maxpeak to plot
mincutoff = 0.
indx_to_plot = np.zeros(y.size, dtype=bool)
for i in range(maxpeak_indx):
if y[maxpeak_indx-i] > mincutoff:
indx_to_plot[maxpeak_indx-i] = True
else:
indx_to_plot[maxpeak_indx-i] = True
break
for i in range(y.size-maxpeak_indx):
if y[maxpeak_indx+i] > mincutoff:
indx_to_plot[maxpeak_indx+i] = True
else:
indx_to_plot[maxpeak_indx+i] = True
break
#Plot just the highest peak
plt.plot(x[indx_to_plot],y[indx_to_plot])
plt.show()
I would still suggest plotting both peaks to ensure the algorithm is working correctly... I think you will find that identifying an arbitrary peak is probably not always trivial with messy data.

Related

Issues with scipy find_peaks function when used on an inverted dataset

The script below is a mixture of stackoverflow answers on different topics, but closely related to finding peaks on signals. Finding peaks based on prominence, as noted here works incredibly well, but my issue is that I need to find the lowest point immediately after the peak. The dataset is a fluorescence signal of a plant captured during 14 continuous hours, and the peaks are saturating pulses used to determined saturation under light conditions. A picture of the dataset (a 68MB CSV file) bellow:
This is my python script:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
# A parser is required to translate the timestamp
custom_date_parser = lambda x: datetime.strptime(x, "%d-%m-%Y %H:%M_%S.%f")
df = pd.read_csv('15-01-2022_05_00.csv', parse_dates=[ 'Timestamp'], date_parser=custom_date_parser)
x = df['Timestamp']
y = df['Mean_values']
# As per accepted answer here:
#https://stackoverflow.com/questions/1713335/peak-finding-algorithm-for-python-scipy
peaks, _ = find_peaks(y, prominence=1)
# Invert the data to find the lowest points of peaks as per answer here:
#https://stackoverflow.com/questions/61365881/is-there-an-opposite-version-of-scipy-find-peaks
valleys, _ = find_peaks(-y, prominence=1)
print(y[peaks])
print(y[valleys])
plt.subplot(2, 1, 1)
plt.plot(peaks, y[peaks], "ob"); plt.plot(y); plt.legend(['Prominence'])
plt.subplot(2, 1, 2)
plt.plot(valleys, y[valleys], "vg"); plt.plot(y); plt.legend(['Prominence Inverted'])
plt.show()
As you can see on the picture, not all the 'prominence inverted' points are below the respective peak. The prominence inverted function comes from this post here, and it simply inverts the dataset. Some are adjacent to the previous peak (difficult to see in the picture). Peaks and valleys below:
Peaks
1817 109.587178
3674 89.191393
56783 72.779385
111593 77.868118
166403 83.288949
221213 84.955026
276023 84.340550
330833 83.186605
385643 81.134827
440453 79.060960
495264 77.457803
550074 76.292243
604884 75.867575
659694 75.511924
714504 74.221657
769314 73.830941
824125 76.977637
878935 78.826169
933745 77.819844
988555 77.298089
1043365 77.188105
1098175 75.340765
1152985 74.311185
1207796 73.163844
1262606 72.613317
1317416 73.460068
1372226 70.388324
1427036 70.835355
1481845 70.154085
Valleys
2521 4.669368
56629 26.551585
56998 26.184984
111791 26.288734
166620 27.717165
221434 28.312708
330432 28.235397
385617 27.535091
440341 26.886589
495174 26.379043
549353 26.040947
550239 25.760023
605051 25.594147
714352 25.354300
714653 25.008184
769472 24.883584
824284 25.135316
879075 25.477464
933907 25.265173
988711 25.160046
1097917 25.058851
1098333 24.626667
1153134 24.357835
1207943 23.982878
1262750 23.938298
1371013 23.766077
1372381 23.351263
1427187 23.368314
Any ideas about this awkward result on the valleys?
You complicate your task by trying to find all valleys. This will always be difficult because they do not stand out as well as your peaks in comparison to the surrounding data. Whatever your parameters for find_peaks, sometimes it will identify two valleys after a peak, sometimes none. Instead, just identify the local minimum after each peak:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import find_peaks
#sample data
from scipy.misc import electrocardiogram
x = electrocardiogram()[2000:4000]
date_range = pd.date_range("20210116", periods=x.size, freq="10ms")
df = pd.DataFrame({"Timestamp": date_range, "Mean_values": x})
x = df['Timestamp']
y = df['Mean_values']
fig, (ax1, ax2, ax3) = plt.subplots(3, figsize=(12, 8))
#peak finding
peaks, _ = find_peaks(y, prominence=1)
ax1.plot(x[peaks], y[peaks], "ob")
ax1.plot(x, y)
ax1.legend(['Prominence'])
#valley finder general
valleys, _ = find_peaks(-y, prominence=1)
ax2.plot(x[valleys], y[valleys], "vg")
ax2.plot(x, y)
ax2.legend(['Valleys without filtering'])
#valley finding restricted to a short time period after a peak
#set time window, e.g., for 200 ms
time_window_size = pd.Timedelta(200, unit="ms")
time_of_peaks = x[peaks]
peak_end = x.searchsorted(time_of_peaks + time_window_size)
#in case of evenly spaced data points, this can be simplified
#and you just add n data points to your peak index array
#peak_end = peaks + n
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
true_valleys[i] = start + y[start:stop].argmin()
ax3.plot(x[true_valleys], y[true_valleys], "sr")
ax3.plot(x, y)
ax3.legend(['Valleys after events'])
plt.show()
Sample output:
I am not sure what you intend to do with these minima, but if you are only interested in baseline shifts, you can directly calculate the peak-wise baseline values like
baseline_per_peak = peaks.copy().astype(float)
for i, (start, stop) in enumerate(zip(peaks, peak_end)):
baseline_per_peak[i] = y[start:stop].mean()
print(baseline_per_peak)
Sample output:
[-0.71125 -0.203 0.29225 0.72825 0.6835 0.79125 0.51225 0.23
0.0345 -0.3945 -0.48125 -0.4675 ]
This can, of course, also easily be adapted to the period before the peak:
#valley in the short time period before a peak
#set time window, e.g., for 200 ms
time_window_size = pd.Timedelta(200, unit="ms")
time_of_peaks = x[peaks]
peak_start = x.searchsorted(time_of_peaks - time_window_size)
#in case of evenly spaced data points, this can be simplified
#and you just add n data points to your peak index array
#peak_start = peaks - n
true_valleys = peaks.copy()
for i, (start, stop) in enumerate(zip(peak_start, peaks)):
true_valleys[i] = start + y[start:stop].argmin()

Is there a way I can find the range of local maxima of histogram?

I'm wondering if there's a way I can find the range of local maxima of a histogram. For instance, suppose I have the following histogram (just ignore the orange curve):
The histogram is actually obtained from a dictionary. I'm hoping to find the range of local maxima of this histogram (on the horizontal axis), which are, say, 1.3-1.6, and 2.1-2.4 in this case. I have no idea which tools would be helpful or which techniques I may want to use. I know there's a tool to find local maxima of a 1-D array:
from scipy.signal import argrelextrema
x = np.random.random(12)
argrelextrema(x, np.greater)
but I don't think it would work here since I'm looking for a range, and there're some 'wiggles' on the histogram. Can anyone give me some suggestions/examples about how I can obtain the range I'm looking for? Thanks a lot for the help
PS: I trying to not just search for the ranges of x whose y values are above a certain limit:)
I don't know if I correctly understand what you want to do, but you can treat the histogram as a Probability Density Function (PDF) of a bimodal distribution, then find the modes and the Highest Density Intervals (HDIs) around the two modes.
So, I create some sample data
import numpy as np
import pandas as pd
import scipy.stats as sps
from scipy.signal import find_peaks, argrelextrema
import matplotlib.pyplot as plt
d1 = sps.norm(loc=1.3, scale=.2)
d2 = sps.norm(loc=2.2, scale=.3)
r1 = d1.rvs(size=5000, random_state=1)
r2 = d2.rvs(size=5000, random_state=1)
r = np.concatenate((r1, r2))
h = plt.hist(r, bins=100, density=True);
We have only h, the result of the hist function that will contains the density (100) and the ranges of the bins (101).
print(h[0].size)
100
print(h[1].size)
101
So we first need to choose the mean of each bin
density = h[0]
values = h[1][:-1] + np.diff(h[1])[0] / 2
plt.hist(r, bins=100, density=True, alpha=.25)
plt.plot(values, density);
Now we can normalize the PDF (to sum to 1) and smooth the data with moving average that we'll use only to get the peaks (maxima) and minima
norm_density = density / density.sum()
norm_density_ma = pd.Series(norm_density).rolling(7, center=True).mean().values
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density);
Now we can obtain indexes of maxima
peaks = find_peaks(norm_density_ma)[0]
peaks
array([24, 57])
and minima
minima = argrelextrema(norm_density_ma, np.less)[0]
minima
array([40])
and check they're correct
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
for peak in peaks:
plt.axvline(values[peak], color='r')
plt.axvline(values[minima], color='k', ls='--');
Finally, we have to find out the HDIs around the two modes (peaks) from the normalized h histogram data. We can use a simple function to get the HDI of grid (see HDI_of_grid for details and Doing Bayesian Data Analysis by John K. Kruschke)
def HDI_of_grid(probMassVec, credMass=0.95):
sortedProbMass = np.sort(probMassVec, axis=None)[::-1]
HDIheightIdx = np.min(np.where(np.cumsum(sortedProbMass) >= credMass))
HDIheight = sortedProbMass[HDIheightIdx]
HDImass = np.sum(probMassVec[probMassVec >= HDIheight])
idx = np.where(probMassVec >= HDIheight)[0]
return {'indexes':idx, 'mass':HDImass, 'height':HDIheight}
Let's say we want the HDIs to contain a mass of 0.3
# HDI around the 1st mode
hdi1 = HDI_of_grid(norm_density, credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
plt.fill_between(
values[hdi1['indexes']],
0, norm_density[hdi1['indexes']],
alpha=.25
)
for peak in peaks:
plt.axvline(values[peak], color='r')
for the 2nd mode, we'll get HDI from minima to avoid the 1st mode
# HDI around the 2nd mode
hdi2 = HDI_of_grid(norm_density[minima[0]:], credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
plt.fill_between(
values[hdi1['indexes']],
0, norm_density[hdi1['indexes']],
alpha=.25
)
plt.fill_between(
values[hdi2['indexes']+minima],
0, norm_density[hdi2['indexes']+minima],
alpha=.25
)
for peak in peaks:
plt.axvline(values[peak], color='r')
And we have the values of the two HDIs
# 1st mode
values[peaks[0]]
1.320249129265321
# 0.3 HDI
values[hdi1['indexes']].take([0, -1])
array([1.12857599, 1.45715851])
# 2nd mode
values[peaks[1]]
2.2238510564735363
# 0.3 HDI
values[hdi2['indexes']+minima].take([0, -1])
array([1.95003229, 2.47028795])

Filling a 3D Array and Plotting the Values

I would like to write a code in Python that evaluates the time evolution of a density distribution, p(x,y). The initial conditions is p(t=0,x,y)=exp[-((x-500)^2)/500] and the formula for the solution is in the code below: t-time index, i-space index (x-direction), j-space index (y-direction), and v=0.8
My goal is to run the scheme for 10 iterations and plot the results at the final time step (t=9). What I'm getting is a big array just filled with zeros. I think it's because I am not using the 3D arrays correctly, does anyone have any suggestions? Thank you.
My attempt:
import numpy as np
import matplotlib.pyplot as plt
#Input Parameters
Nx = 1000 #number of grid points in x-direction
Ny = 500 #number of grid points in y-direction
T = 10 #number of time steps
v = 0.8
p = np.zeros((T,Nx,Ny))
P = np.zeros((T,Nx,Ny))
for t in range(0,T-1):
for i in range(0,Nx-1):
for j in range(0,Ny-1):
P[t,i,j] = p[t,i,j]-((v/2)*(p[t,i+1,j]-p[t,i,j]))
p[0,i,j] = np.exp(((-1*(i-500))**2)/500)
x = P[9,i]
y = P[9,j]
print(x)
plt.plot(x,y)
plt.xlim([0,1000])
plt.ylim([0,500])
plt.xlabel('x-direction')
plt.ylabel('y-direction')
plt.title("Density Distribution After 10 Iterations")
Looks like you only fill the values for t in range(0,T-1) which stops at T=8, and you are trying to get x = P[9,i]. They never get filled so obviously they are all 0.
Try to use range(0, T), it will loop over 0,1,2,...,T-1. Also change range(0,Nx), range(0,Ny)

get bins coordinates with hexbin in matplotlib

I use matplotlib's method hexbin to compute 2d histograms on my data.
But I would like to get the coordinates of the centers of the hexagons in order to further process the results.
I got the values using get_array() method on the result, but I cannot figure out how to get the bins coordinates.
I tried to compute them given number of bins and the extent of my data but i don't know the exact number of bins in each direction. gridsize=(10,2) should do the trick but it does not seem to work.
Any idea?
I think this works.
from __future__ import division
import numpy as np
import math
import matplotlib.pyplot as plt
def generate_data(n):
"""Make random, correlated x & y arrays"""
points = np.random.multivariate_normal(mean=(0,0),
cov=[[0.4,9],[9,10]],size=int(n))
return points
if __name__ =='__main__':
color_map = plt.cm.Spectral_r
n = 1e4
points = generate_data(n)
xbnds = np.array([-20.0,20.0])
ybnds = np.array([-20.0,20.0])
extent = [xbnds[0],xbnds[1],ybnds[0],ybnds[1]]
fig=plt.figure(figsize=(10,9))
ax = fig.add_subplot(111)
x, y = points.T
# Set gridsize just to make them visually large
image = plt.hexbin(x,y,cmap=color_map,gridsize=20,extent=extent,mincnt=1,bins='log')
# Note that mincnt=1 adds 1 to each count
counts = image.get_array()
ncnts = np.count_nonzero(np.power(10,counts))
verts = image.get_offsets()
for offc in xrange(verts.shape[0]):
binx,biny = verts[offc][0],verts[offc][1]
if counts[offc]:
plt.plot(binx,biny,'k.',zorder=100)
ax.set_xlim(xbnds)
ax.set_ylim(ybnds)
plt.grid(True)
cb = plt.colorbar(image,spacing='uniform',extend='max')
plt.show()
I would love to confirm that the code by Hooked using get_offsets() works, but I tried several iterations of the code mentioned above to retrieve center positions and, as Dave mentioned, get_offsets() remains empty. The workaround that I found is to use the non-empty 'image.get_paths()' option. My code takes the mean to find centers but which means it is just a smidge longer, but it does work.
The get_paths() option returns a set of x,y coordinates embedded that can be looped over and then averaged to return the center position for each hexagram.
The code that I have is as follows:
counts=image.get_array() #counts in each hexagon, works great
verts=image.get_offsets() #empty, don't use this
b=image.get_paths() #this does work, gives Path([[]][]) which can be plotted
for x in xrange(len(b)):
xav=np.mean(b[x].vertices[0:6,0]) #center in x (RA)
yav=np.mean(b[x].vertices[0:6,1]) #center in y (DEC)
plt.plot(xav,yav,'k.',zorder=100)
I had this same problem. I think what needs to be developed is a framework to have a HexagonalGrid object which can then be applied to many different data sets (and it would be awesome to do it for N dimensions). This is possible and it surprises me that neither Scipy or Numpy has anything for it (furthermore there seems to be nothing else like it except perhaps binify)
That said, I assume you want to use hexbinning to compare multiple binned data sets. This requires some common base. I got this to work using matplotlib's hexbin the following way:
import numpy as np
import matplotlib.pyplot as plt
def get_data (mean,cov,n=1e3):
"""
Quick fake data builder
"""
np.random.seed(101)
points = np.random.multivariate_normal(mean=mean,cov=cov,size=int(n))
x, y = points.T
return x,y
def get_centers (hexbin_output):
"""
about 40% faster than previous post only cause you're not calculating the
min/max every time
"""
paths = hexbin_output.get_paths()
v = paths[0].vertices[:-1] # adds a value [0,0] to the end
vx,vy = v.T
idx = [3,0,5,2] # index for [xmin,xmax,ymin,ymax]
xmin,xmax,ymin,ymax = vx[idx[0]],vx[idx[1]],vy[idx[2]],vy[idx[3]]
half_width_x = abs(xmax-xmin)/2.0
half_width_y = abs(ymax-ymin)/2.0
centers = []
for i in xrange(len(paths)):
cx = paths[i].vertices[idx[0],0]+half_width_x
cy = paths[i].vertices[idx[2],1]+half_width_y
centers.append((cx,cy))
return np.asarray(centers)
# important parts ==>
class Hexagonal2DGrid (object):
"""
Used to fix the gridsize, extent, and bins
"""
def __init__ (self,gridsize,extent,bins=None):
self.gridsize = gridsize
self.extent = extent
self.bins = bins
def hexbin (x,y,hexgrid):
"""
To hexagonally bin the data in 2 dimensions
"""
fig = plt.figure()
ax = fig.add_subplot(111)
# Note mincnt=0 so that it will return a value for every point in the
# hexgrid, not just those with count>mincnt
# Basically you fix the gridsize, extent, and bins to keep them the same
# then the resulting count array is the same
hexbin = plt.hexbin(x,y, mincnt=0,
gridsize=hexgrid.gridsize,
extent=hexgrid.extent,
bins=hexgrid.bins)
# you could close the figure if you don't want it
# plt.close(fig.number)
counts = hexbin.get_array().copy()
return counts, hexbin
# Example ===>
if __name__ == "__main__":
hexgrid = Hexagonal2DGrid((21,5),[-70,70,-20,20])
x_data,y_data = get_data((0,0),[[-40,95],[90,10]])
x_model,y_model = get_data((0,10),[[100,30],[3,30]])
counts_data, hexbin_data = hexbin(x_data,y_data,hexgrid)
counts_model, hexbin_model = hexbin(x_model,y_model,hexgrid)
# if you want the centers, they will be the same for both
centers = get_centers(hexbin_data)
# if you want to ignore the cells with zeros then use the following mask.
# But if want zeros for some bins and not others I'm not sure an elegant way
# to do this without using the centers
nonzero = counts_data != 0
# now you can compare the two data sets
variance_data = counts_data[nonzero]
square_diffs = (counts_data[nonzero]-counts_model[nonzero])**2
chi2 = np.sum(square_diffs/variance_data)
print(" chi2={}".format(chi2))

Remove data points below a curve with python

I need to compare some theoretical data with real data in python.
The theoretical data comes from resolving an equation.
To improve the comparative I would like to remove data points that fall far from the theoretical curve. I mean, I want to remove the points below and above red dashed lines in the figure (made with matplotlib).
Both the theoretical curves and the data points are arrays of different length.
I can try to remove the points in a roughly-eye way, for example: the first upper point can be detected using:
data2[(data2.redshift<0.4)&data2.dmodulus>1]
rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')])
But I would like to use a less roughly-eye way.
So, can anyone help me finding an easy way of removing the problematic points?
Thank you!
This might be overkill and is based on your comment
Both the theoretical curves and the data points are arrays of
different length.
I would do the following:
Truncate the data set so that its x values lie within the max and min values of the theoretical set.
Interpolate the theoretical curve using scipy.interpolate.interp1d and the above truncated data x values. The reason for step (1) is to satisfy the constraints of interp1d.
Use numpy.where to find data y values that are out side the range of acceptable theory values.
DONT discard these values, as was suggested in comments and other answers. If you want for clarity, point them out by plotting the 'inliners' one color and the 'outliers' an other color.
Here's a script that is close to what you are looking for, I think. It hopefully will help you accomplish what you want:
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
# make up data
def makeUpData():
'''Make many more data points (x,y,yerr) than theory (x,y),
with theory yerr corresponding to a constant "sigma" in y,
about x,y value'''
NX= 150
dataX = (np.random.rand(NX)*1.1)**2
dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX
dataErr = np.random.rand(NX)*dataX*1.3
theoryX = np.arange(0,1,0.1)
theoryY = theoryX*theoryX*1.5
theoryErr = 0.5
return dataX,dataY,dataErr,theoryX,theoryY,theoryErr
def makeSameXrange(theoryX,dataX,dataY):
'''
Truncate the dataX and dataY ranges so that dataX min and max are with in
the max and min of theoryX.
'''
minT,maxT = theoryX.min(),theoryX.max()
goodIdxMax = np.where(dataX<maxT)
goodIdxMin = np.where(dataX[goodIdxMax]>minT)
return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin]
# take 'theory' and get values at every 'data' x point
def theoryYatDataX(theoryX,theoryY,dataX):
'''For every dataX point, find interpolated thoeryY value. theoryx needed
for interpolation.'''
f = interpolate.interp1d(theoryX,theoryY)
return f(dataX[np.where(dataX<np.max(theoryX))])
# collect valid points
def findInlierSet(dataX,dataY,interpTheoryY,thoeryErr):
'''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
valid indicies.'''
withinUpper = np.where(dataY<(interpTheoryY+theoryErr))
withinLower = np.where(dataY[withinUpper]
>(interpTheoryY[withinUpper]-theoryErr))
return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower]
def findOutlierSet(dataX,dataY,interpTheoryY,thoeryErr):
'''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
valid indicies.'''
withinUpper = np.where(dataY>(interpTheoryY+theoryErr))
withinLower = np.where(dataY<(interpTheoryY-theoryErr))
return (dataX[withinUpper],dataY[withinUpper],
dataX[withinLower],dataY[withinLower])
if __name__ == "__main__":
dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData()
TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY)
interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX)
inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY,
theoryErr)
outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX,
TruncDataY,
interpTheoryY,
theoryErr)
#print inlierIndex
fig = plt.figure()
ax = fig.add_subplot(211)
ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k')
ax.plot(theoryX,theoryY,'r-')
ax.plot(theoryX,theoryY+theoryErr,'r--')
ax.plot(theoryX,theoryY-theoryErr,'r--')
ax.set_xlim(0,1.4)
ax.set_ylim(-.5,3)
ax = fig.add_subplot(212)
ax.plot(inDataX,inDataY,'ko')
ax.plot(outUpX,outUpY,'bo')
ax.plot(outDownX,outDownY,'ro')
ax.plot(theoryX,theoryY,'r-')
ax.plot(theoryX,theoryY+theoryErr,'r--')
ax.plot(theoryX,theoryY-theoryErr,'r--')
ax.set_xlim(0,1.4)
ax.set_ylim(-.5,3)
fig.savefig('findInliers.png')
This figure is the result:
At the end I use some of the Yann code:
def theoryYatDataX(theoryX,theoryY,dataX):
'''For every dataX point, find interpolated theoryY value. theoryx needed
for interpolation.'''
f = interpolate.interp1d(theoryX,theoryY)
return f(dataX[np.where(dataX<np.max(theoryX))])
def findOutlierSet(data,interpTheoryY,theoryErr):
'''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return
valid indicies.'''
up = np.where(data.dmodulus > (interpTheoryY+theoryErr))
low = np.where(data.dmodulus < (interpTheoryY-theoryErr))
# join all the index together in a flat array
out = np.hstack([up,low]).ravel()
index = np.array(np.ones(len(data),dtype=bool))
index[out]=False
datain = data[index]
dataout = data[out]
return datain, dataout
def selectdata(data,theoryX,theoryY):
"""
Data selection: z<1 and +-0.5 LFLRW separation
"""
# Select data with redshift z<1
data1 = data[data.redshift < 1]
# From modulus to light distance:
data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error)
# redshift data order
data1.sort(order='redshift')
# Outliers: distance to LFLRW curve bigger than +-0.5
theoryErr = 0.5
# Theory curve Interpolation to get the same points as data
interpy = theoryYatDataX(theoryX,theoryY,data1.redshift)
datain, dataout = findOutlierSet(data1,interpy,theoryErr)
return datain, dataout
Using those functions I can finally obtain:
Thank you all for your help.
Just look at the difference between the red curve and the points, if it is bigger than the difference between the red curve and the dashed red curve remove it.
diff=np.abs(points-red_curve)
index= (diff>(dashed_curve-redcurve))
filtered=points[index]
But please take the comment from NickLH serious. Your Data looks pretty good without any filtering, your "outlieres" all have a very big error and won't affect the fit much.
Either you could use the numpy.where() to identify which xy pairs meet your plotting criteria, or perhaps enumerate to do pretty much the same thing. Example:
x_list = [ 1, 2, 3, 4, 5, 6 ]
y_list = ['f','o','o','b','a','r']
result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5]
print result
I'm sure you could change the conditions so that '2' and '5' in the above example are the functions of your curves

Categories

Resources