Finding two linear fits on different domains in the same data - python
I'm trying to plot a 3rd-order polynomial and two linear fits on the same set of data. My data looks like this:
,Frequency,Flux Density,log_freq,log_flux
0,1.25e+18,1.86e-07,18.096910013008056,-6.730487055782084
1,699000000000000.0,1.07e-06,14.84447717574568,-5.97061622231479
2,541000000000000.0,1.1e-06,14.73319726510657,-5.958607314841775
3,468000000000000.0,1e-06,14.670245853074125,-6.0
4,458000000000000.0,1.77e-06,14.660865478003869,-5.752026733638194
5,89400000000000.0,3.01e-05,13.951337518795917,-4.521433504406157
6,89400000000000.0,9.3e-05,13.951337518795917,-4.031517051446065
7,89400000000000.0,0.00187,13.951337518795917,-2.728158393463501
8,65100000000000.0,2.44e-05,13.813580988568193,-4.61261017366127
9,65100000000000.0,6.28e-05,13.813580988568193,-4.2020403562628035
10,65100000000000.0,0.00108,13.813580988568193,-2.96657624451305
11,25900000000000.0,0.000785,13.413299764081252,-3.1051303432547472
12,25900000000000.0,0.00106,13.413299764081252,-2.9746941347352296
13,25900000000000.0,0.000796,13.413299764081252,-3.099086932262331
14,13600000000000.0,0.00339,13.133538908370218,-2.469800301796918
15,13600000000000.0,0.00372,13.133538908370218,-2.4294570601181023
16,13600000000000.0,0.00308,13.133538908370218,-2.5114492834995557
17,12700000000000.0,0.00222,13.103803720955957,-2.653647025549361
18,12700000000000.0,0.00204,13.103803720955957,-2.6903698325741012
19,230000000000.0,0.133,11.361727836017593,-0.8761483590329142
22,90000000000.0,0.518,10.954242509439325,-0.28567024025476695
23,61000000000.0,1.0,10.785329835010767,0.0
24,61000000000.0,0.1,10.785329835010767,-1.0
25,61000000000.0,0.4,10.785329835010767,-0.3979400086720376
26,42400000000.0,0.8,10.627365856592732,-0.09691001300805639
27,41000000000.0,0.9,10.612783856719735,-0.045757490560675115
28,41000000000.0,0.7,10.612783856719735,-0.1549019599857432
29,41000000000.0,0.8,10.612783856719735,-0.09691001300805639
30,41000000000.0,0.6,10.612783856719735,-0.2218487496163564
31,41000000000.0,0.7,10.612783856719735,-0.1549019599857432
32,37000000000.0,1.0,10.568201724066995,0.0
33,36800000000.0,1.0,10.565847818673518,0.0
34,36800000000.0,0.98,10.565847818673518,-0.00877392430750515
35,33000000000.0,0.8,10.518513939877888,-0.09691001300805639
36,33000000000.0,1.0,10.518513939877888,0.0
37,31400000000.0,0.92,10.496929648073214,-0.036212172654444715
38,23000000000.0,1.4,10.361727836017593,0.146128035678238
39,23000000000.0,1.1,10.361727836017593,0.04139268515822508
40,23000000000.0,1.11,10.361727836017593,0.045322978786657475
41,23000000000.0,1.1,10.361727836017593,0.04139268515822508
42,22200000000.0,1.23,10.346352974450639,0.08990511143939793
43,22200000000.0,1.24,10.346352974450639,0.09342168516223506
44,21700000000.0,0.98,10.33645973384853,-0.00877392430750515
45,21700000000.0,1.07,10.33645973384853,0.029383777685209667
46,20000000000.0,1.44,10.301029995663981,0.15836249209524964
47,15400000000.0,1.32,10.187520720836464,0.12057393120584989
48,15000000000.0,1.5,10.176091259055681,0.17609125905568124
49,15000000000.0,1.5,10.176091259055681,0.17609125905568124
50,15000000000.0,1.42,10.176091259055681,0.15228834438305647
51,15000000000.0,1.43,10.176091259055681,0.1553360374650618
52,15000000000.0,1.42,10.176091259055681,0.15228834438305647
53,15000000000.0,1.47,10.176091259055681,0.1673173347481761
54,15000000000.0,1.38,10.176091259055681,0.13987908640123647
55,10700000000.0,2.59,10.02938377768521,0.4132997640812518
56,8870000000.0,2.79,9.947923619831727,0.44560420327359757
57,8460000000.0,2.69,9.927370363039023,0.42975228000240795
58,8400000000.0,2.8,9.924279286061882,0.4471580313422192
59,8400000000.0,2.53,9.924279286061882,0.40312052117581787
60,8400000000.0,2.06,9.924279286061882,0.31386722036915343
61,8300000000.0,2.58,9.919078092376074,0.41161970596323016
62,8080000000.0,2.76,9.907411360774587,0.4409090820652177
63,5010000000.0,3.68,9.699837725867246,0.5658478186735176
64,5000000000.0,0.81,9.698970004336019,-0.09151498112135022
65,5000000000.0,3.5,9.698970004336019,0.5440680443502757
66,5000000000.0,3.57,9.698970004336019,0.5526682161121932
67,4980000000.0,3.46,9.697229342759718,0.5390760987927766
68,4900000000.0,2.95,9.690196080028514,0.46982201597816303
69,4850000000.0,3.46,9.685741738602264,0.5390760987927766
70,4850000000.0,3.45,9.685741738602264,0.5378190950732742
71,4780000000.0,2.16,9.679427896612118,0.3344537511509309
72,4540000000.0,3.61,9.657055852857104,0.557507201905658
73,2700000000.0,3.5,9.431363764158988,0.5440680443502757
74,2700000000.0,3.7,9.431363764158988,0.568201724066995
75,2700000000.0,3.92,9.431363764158988,0.5932860670204573
76,2700000000.0,3.92,9.431363764158988,0.5932860670204573
77,2250000000.0,4.21,9.352182518111363,0.6242820958356683
78,1660000000.0,3.69,9.220108088040055,0.5670263661590603
79,1660000000.0,3.8,9.220108088040055,0.5797835966168101
80,1410000000.0,3.5,9.14921911265538,0.5440680443502757
81,1400000000.0,3.45,9.146128035678238,0.5378190950732742
82,1400000000.0,3.28,9.146128035678238,0.5158738437116791
83,1400000000.0,3.19,9.146128035678238,0.5037906830571811
84,1400000000.0,3.51,9.146128035678238,0.5453071164658241
85,1340000000.0,3.31,9.127104798364808,0.5198279937757188
86,1340000000.0,3.31,9.127104798364808,0.5198279937757188
87,750000000.0,3.14,8.8750612633917,0.49692964807321494
88,408000000.0,1.46,8.61066016308988,0.1643528557844371
89,408000000.0,1.46,8.61066016308988,0.1643528557844371
90,365000000.0,1.62,8.562292864456476,0.20951501454263097
91,365000000.0,1.56,8.562292864456476,0.1931245983544616
92,333000000.0,1.32,8.52244423350632,0.12057393120584989
93,302000000.0,1.23,8.48000694295715,0.08990511143939793
94,151000000.0,2.13,8.178976947293169,0.3283796034387377
95,73800000.0,3.58,7.868056361823042,0.5538830266438743
and my code is
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import numpy.polynomial.polynomial as poly
def find_extrema(poly, bounds):
    '''
    Finds the extrema of the polynomial; ensures they are real.
    https://stackoverflow.com/questions/72932816/python-finding-local-maxima-minima-for-multiple-polynomials-efficiently
    '''
    deriv = poly.deriv()
    extrema = deriv.roots()
    # Filter out complex roots
    extrema = extrema[np.isreal(extrema)]
    # Get real part of root
    extrema = np.real(extrema)
    # Apply bounds check
    lb, ub = bounds
    extrema = extrema[(lb <= extrema) & (extrema <= ub)]
    return extrema

def find_maximum(poly, bounds):
    '''
    Find the maximum point; returns the value of the turnover frequency.
    https://stackoverflow.com/questions/72932816/python-finding-local-maxima-minima-for-multiple-polynomials-efficiently
    '''
    extrema = find_extrema(poly, bounds)
    # Either bound could end up being the maximum. Check those too.
    extrema = np.concatenate((extrema, bounds))
    value_at_extrema = poly(extrema)
    maximum_index = np.argmax(value_at_extrema)
    return extrema[maximum_index]
# LOAD THE DATA FROM FILE HERE
# CARRY ON...
xvar = 'log_freq'
yvar = 'log_flux'
x, y = pks[xvar], pks[yvar]
lower = min(x)
upper = max(x)
# Find the 3rd-order polynomial which fits the SED
coefs = poly.polyfit(x, y, 3) # find the coeffs
x_new = np.linspace(lower, upper, num=len(x)*10) # space to plot the fit
ffit = poly.Polynomial(coefs) # find the polynomial
# Find turnover frequency and peak flux
nu_to = find_maximum(ffit, (lower, upper))
F_p = ffit(nu_to)
# HERE'S THE TRICKY BIT
# Find the straight line to fit to the left of nu_to
left_linefit = poly.polyfit(x, y, 1)
x_left = np.linspace(lower, nu_to, num=len(x)*10) # space to plot the fit
ffit_left = poly.Polynomial(left_linefit,
                            domain = (lower, nu_to)
                            )
# PLOTS THE POLYNOMIAL WELL
ax1 = plt.subplot(1, 1, 1)
ax1.scatter(pks[xvar], pks[yvar], label = 'PKS 0742+10', c = 'b')
ax1.plot(x_new, ffit(x_new), color = 'r')
ax1.plot(x_left, ffit_left(x_left), color = 'gold')
ax1.set_yscale('linear')
ax1.set_xscale('linear')
ax1.legend()
ax1.set_xlabel(r'$\log\nu$ ($\nu$ in Hz)')
ax1.set_ylabel(r'$\log F_{\nu}$ ($F_{\nu}$ in Jy)')
ax1.grid(axis = 'both', which = 'major')
The code produces the poly fit well:
I'm trying to plot the straight-line fits for the points on either side of the maximum, as shown schematically below:
I thought I could do it with
ffit_left = poly.Polynomial(left_linefit,
domain = (lower, nu_to)
)
and similar for ffit_right, but that produces
which is actually the straight-line fit for the whole dataset, plotted only for that domain. I don't want to manipulate the dataset, because eventually I'll have to do it on a lot of datasets.
The fitting part of the code comes from an answer to this question.
How can I fit a straight line to just a subset of the points, without manipulating the dataset?
My guess is that I have to make left_linefit = poly.polyfit(x, y, 1) recognise a domain, but I can't see anything in the numpy polyfit docs.
Sorry for the long question!
I am not sure I fully understand your request. If you want to fit a piecewise function made of three linear segments, a method is described in https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf with theory and numerical examples.
Several cases are considered. Among them, the case below might be convenient for you.
H(*) is the Heaviside step function.
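To make this concrete, here is a minimal sketch (my own, not taken from the linked document) of fitting a continuous two-segment linear model with a free breakpoint using scipy.optimize.curve_fit; np.where plays the role of the Heaviside step H(*), and a third segment can be added with a second breakpoint term. The function name, parameter names, and initial guesses below are illustrative assumptions, not part of the question or the document.

import numpy as np
from scipy.optimize import curve_fit

def two_segment(x, x0, y0, k1, k2):
    # Slope k1 below the breakpoint x0, slope k2 above it; continuous at x0.
    x = np.asarray(x, dtype=float)
    return np.where(x < x0, y0 + k1 * (x - x0), y0 + k2 * (x - x0))

# x, y as defined in the question: x, y = pks[xvar], pks[yvar]
p0 = [10.0, 0.5, 0.5, -1.5]  # rough initial guesses for this particular dataset
params, cov = curve_fit(two_segment, x, y, p0=p0)

x_plot = np.linspace(min(x), max(x), 500)
# e.g. ax1.plot(x_plot, two_segment(x_plot, *params), color='gold')

With the breakpoint x0 left free, the fit itself decides where the turnover sits, so there is no need to split or otherwise manipulate the dataset by hand.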
Related
Python VTK "normalize" a point cloud
I have done a lot of searching but have yet to find an answer. I am currently working on some data of a crop field. I have PLY files for multiple fields which I have successfully read in, filtered, and visualised using Python and VTK. My main goal is to eventually segment and run analysis on individual crop plots. However, to make that task easier I first want to "normalize" my point cloud so that all plots are essentially "on the same level". From the image I have attached you can see that the point cloud slopes from one corner to its opposite. So I want to flatten out the image so that the ground points are all on the same plane/level, with the rest of the points adjusted accordingly. Point Cloud I've also included my code to show how I got to this point. If anyone has any advice on how I can achieve the normalising to one plane I would be very appreciative. Sadly I cannot include my data as it is work related. Thanks. Josh import vtk from vtk.util import numpy_support import numpy as np filename = 'File.ply' # Reader r = vtk.vtkPLYReader() r.SetFileName(filename) # Filters vgf = vtk.vtkVertexGlyphFilter() vgf.SetInputConnection(r.GetOutputPort()) # Elevation pc = r.GetOutput() bounds = pc.GetBounds() #print(bounds) minz = bounds[4] maxz = bounds[5] #print(bounds[4], bounds[5]) evgf = vtk.vtkElevationFilter() evgf.SetInputConnection(vgf.GetOutputPort()) evgf.SetLowPoint(0, 0, minz) evgf.SetHighPoint(0, 0, maxz) #pc.GetNumberOfPoints() # Look up table lut = vtk.vtkLookupTable() lut.SetHueRange(0.667, 0) lut.SetSaturationRange(1, 1) lut.SetValueRange(1, 1) lut.Build() # Renderer mapper = vtk.vtkPolyDataMapper() mapper.SetInputConnection(evgf.GetOutputPort()) mapper.SetLookupTable(lut) actor = vtk.vtkActor() actor.SetMapper(mapper) renderer = vtk.vtkRenderer() renWin = vtk.vtkRenderWindow() renWin.AddRenderer(renderer) iren = vtk.vtkRenderWindowInteractor() iren.SetRenderWindow(renWin) renderer.AddActor(actor) renderer.SetBackground(0, 0, 0) renWin.Render() iren.Start()
I once solved a similar problem. Find below some code that I used back then. It uses two functions fitPlane and findTransformFromVectors that you could replace with your own implementations. Note that there are many ways to fit a plane through a set of points. This SO post discusses compares scipy.optimize.minimize with scipy.linalg.lstsq. In another SO post, the use of PCA or RANSAC and other methods are suggested. You probably want to use methods provided by sklearn, numpy or other modules. My solution simply (and non-robustly) computes ordinary least squares regression. import vtk import numpy as np # Convert vtk to numpy arrays from vtk.util.numpy_support import vtk_to_numpy as vtk2np # Create a random point cloud. center = [3.0, 2.0, 1.0] source = vtk.vtkPointSource() source.SetCenter(center) source.SetNumberOfPoints(50) source.SetRadius(1.) source.Update() source = source.GetOutput() # Extract the points from the point cloud. points = vtk2np(source.GetPoints().GetData()) points = points.transpose() # Fit a plane. nRegression contains the normal vector of the # regression surface. nRegression = fitPlane(points) # Compute a transform that maps the source center to the origin and # plane normal to the z-axis. trafo = findTransformFromVectors(originFrom=center, axisFrom=nRegression.transpose(), originTo=(0,0,0), axisTo=(0.,0.,1.)) # Apply transform to source. sourceTransformed = vtk.vtkTransformFilter() sourceTransformed.SetInputData(source) sourceTransformed.SetTransform(trafo) sourceTransformed.Update() # Visualize output... Here my implementations of fitPlane and findTransformFromVectors: # The following code has been written by normanius under the CC BY-SA 4.0 # license. # License: https://creativecommons.org/licenses/by-sa/4.0/ # Author: normanius: https://stackoverflow.com/users/3388962/normanius # Date: October 2018 # Reference: https://stackoverflow.com/questions/52716438 def fitPlane(X, tolerance=1e-10): ''' Estimate the plane normal by means of ordinary least dsquares. Requirement: points X span the full column rank. If the points lie in a perfect plane, the regression problem is ill-conditioned! Formulas: a = (XX^T)^(-1)*X*z Surface normal: n = [a[0], a[1], -1] n = n/norm(n) Plane intercept: c = a[2]/norm(n) NOTE: The condition number for the pseudo-inverse improves if the formulation is changed to homogenous notation. Formulas (homogenous): a = (XX^T)^(-1)*[1,1,1]^T n = a[:-1] n = n/norm(n) c = a[-1]/norm(n) Arguments: X: A matrix with n rows and 3 columns tolerance: Minimal condition number accepted. If the condition number is lower, the algorithm returns None. Returns: If the computation was successful, a numpy array of length three is returned that represents the estimated plane normal. On failure, None is returned. ''' X = np.asarray(X) d,N = X.shape X = np.vstack([X,np.ones([1,N])]) z = np.ones([d+1,1]) XXT = np.dot(X, np.transpose(X)) # XXT=X*X^T if np.linalg.det(XXT) < 1e-10: # The test covers the case where n<3 return None n = np.dot(np.linalg.inv(XXT), z) intercept = n[-1] n = n[:-1] scale = np.linalg.norm(n) n /= scale intercept /= scale return n def findTransformFromVectors(originFrom=None, axisFrom=None, originTo=None, axisTo=None, origin=None, scale=1): ''' Compute a transformation that maps originFrom and axisFrom to originTo and axisTo respectively. 
If scale is set to 'auto', the scale will be determined such that the axes will also match in length: scale = norm(axisTo)/norm(axisFrom) Arguments: originFrom: sequences with 3 elements, or None axisFrom: sequences with 3 elements, or None originTo: sequences with 3 elements, or None axisTo: sequences with 3 elements, or None origin: sequences with 3 elements, or None, overrides originFrom and originTo if set scale: - scalar (isotropic scaling) - sequence with 3 elements (anisotropic scaling), - 'auto' (sets scale such that input axes match in length after transforming axisFrom) - None (no scaling) Align two axes alone, assuming that we sit on (0,0,0) findTransformFromVectors(axisFrom=a0, axisTo=a1) Align two axes in one point (all calls are equivalent): findTransformFromVectors(origin=o, axisFrom=a0, axisTo=a1) findTransformFromVectors(originFrom=o, axisFrom=a0, axisTo=a1) findTransformFromVectors(axisFrom=a0, originTo=o, axisTo=a1) Move between two points: findTransformFromVectors(orgin=o0, originTo=o1) Move from one position to the other and align axes: findTransformFromVectors(orgin=o0, axisFrom=a0, originTo=o1, axisTo=a1) ''' # Prelude with trickle-down logic. # Infer the origins if an information is not set. if origin is not None: # Check for ambiguous input. assert(originFrom is None and originTo is None) originFrom = origin originTo = origin if originFrom is None: originFrom = originTo if originTo is None: originTo = originFrom if originTo is None: # We arrive here only if no origin information was set. originTo = [0.,0.,0.] originFrom = [0.,0.,0.] originFrom = np.asarray(originFrom) originTo = np.asarray(originTo) # Check if any rotation will be involved. axisFrom = np.asarray(axisFrom) axisTo = np.asarray(axisTo) axisFromL2 = np.linalg.norm(axisFrom) axisToL2 = np.linalg.norm(axisTo) if axisFrom is None or axisTo is None or axisFromL2==0 or axisToL2==0: rotate = False else: rotate = not np.array_equal(axisFrom, axisTo) # Scale. if scale is None: scale = 1. if scale == 'auto': scale = axisToL2/axisFromL2 if axisFromL2!=0. else 1. if np.isscalar(scale): scale = scale*np.ones(3) if rotate: rAxis = np.cross(axisFrom.ravel(), axisTo.ravel()) # rotation axis angle = np.dot(axisFrom, axisTo) / axisFromL2 / axisToL2 angle = np.arccos(angle) # Here we finally compute the transform. trafo = vtk.vtkTransform() trafo.Translate(originTo) if rotate: trafo.RotateWXYZ(angle / np.pi * 180, rAxis[0], rAxis[1], rAxis[2]) trafo.Scale(scale[0],scale[1],scale[2]) trafo.Translate(-originFrom) return trafo
find peaks location in a spectrum numpy
I have a TOF spectrum and I would like to implement an algorithm using python (numpy) that finds all the maxima of the spectrum and returns the corresponding x values. I have looked up online and I found the algorithm reported below. The assumption here is that near the maximum the difference between the value before and the value at the maximum is bigger than a number DELTA. The problem is that my spectrum is composed of points equally distributed, even near the maximum, so that DELTA is never exceeded and the function peakdet returns an empty array. Do you have any idea how to overcome this problem? I would really appreciate comments to understand better the code since I am quite new in python. Thanks! import sys from numpy import NaN, Inf, arange, isscalar, asarray, array def peakdet(v, delta, x = None): maxtab = [] mintab = [] if x is None: x = arange(len(v)) v = asarray(v) if len(v) != len(x): sys.exit('Input vectors v and x must have same length') if not isscalar(delta): sys.exit('Input argument delta must be a scalar') if delta <= 0: sys.exit('Input argument delta must be positive') mn, mx = Inf, -Inf mnpos, mxpos = NaN, NaN lookformax = True for i in arange(len(v)): this = v[i] if this > mx: mx = this mxpos = x[i] if this < mn: mn = this mnpos = x[i] if lookformax: if this < mx-delta: maxtab.append((mxpos, mx)) mn = this mnpos = x[i] lookformax = False else: if this > mn+delta: mintab.append((mnpos, mn)) mx = this mxpos = x[i] lookformax = True return array(maxtab), array(mintab) Below is shown part of the spectrum. I actually have more peaks than those shown here.
This, I think could work as a starting point. I'm not a signal-processing expert, but I tried this on a generated signal Y that looks quite like yours and one with much more noise: from scipy.signal import convolve import numpy as np from matplotlib import pyplot as plt #Obtaining derivative kernel = [1, 0, -1] dY = convolve(Y, kernel, 'valid') #Checking for sign-flipping S = np.sign(dY) ddS = convolve(S, kernel, 'valid') #These candidates are basically all negative slope positions #Add one since using 'valid' shrinks the arrays candidates = np.where(dY < 0)[0] + (len(kernel) - 1) #Here they are filtered on actually being the final such position in a run of #negative slopes peaks = sorted(set(candidates).intersection(np.where(ddS == 2)[0] + 1)) plt.plot(Y) #If you need a simple filter on peak size you could use: alpha = -0.0025 peaks = np.array(peaks)[Y[peaks] < alpha] plt.scatter(peaks, Y[peaks], marker='x', color='g', s=40) The sample outcomes: For the noisy one, I filtered peaks with alpha: If the alpha needs more sophistication you could try dynamically setting alpha from the peaks discovered using e.g. assumptions about them being a mixed gaussian (my favourite being the Otsu threshold, exists in cv and skimage) or some sort of clustering (k-means could work). And for reference, this I used to generate the signal: Y = np.zeros(1000) def peaker(Y, alpha=0.01, df=2, loc=-0.005, size=-.0015, threshold=0.001, decay=0.5): peaking = False for i, v in enumerate(Y): if not peaking: peaking = np.random.random() < alpha if peaking: Y[i] = loc + size * np.random.chisquare(df=2) continue elif Y[i - 1] < threshold: peaking = False if i > 0: Y[i] = Y[i - 1] * decay peaker(Y) EDIT: Support for degrading base-line I simulated a slanting base-line by doing this: Z = np.log2(np.arange(Y.size) + 100) * 0.001 Y = Y + Z[::-1] - Z[-1] Then to detect with a fixed alpha (note that I changed sign on alpha): from scipy.signal import medfilt alpha = 0.0025 Ybase = medfilt(Y, 51) # 51 should be large in comparison to your peak X-axis lengths and an odd number. peaks = np.array(peaks)[Ybase[peaks] - Y[peaks] > alpha] Resulting in the following outcome (the base-line is plotted as dashed black line): EDIT 2: Simplification and a comment I simplified the code to use one kernel for both convolves as #skymandr commented. This also removed the magic number in adjusting the shrinkage so that any size of the kernel should do. For the choice of "valid" as option to convolve. It would probably have worked just as well with "same", but I choose "valid" so I didn't have to think about the edge-conditions and if the algorithm could detect spurios peaks there.
As of SciPy version 1.1, you can also use find_peaks: import numpy as np import matplotlib.pyplot as plt from scipy.signal import find_peaks np.random.seed(0) Y = np.zeros(1000) # insert #deinonychusaur's peaker function here peaker(Y) # make data noisy Y = Y + 10e-4 * np.random.randn(len(Y)) # find_peaks gets the maxima, so we multiply our signal by -1 Y *= -1 # get the actual peaks peaks, _ = find_peaks(Y, height=0.002) # multiply back for plotting purposes Y *= -1 plt.plot(Y) plt.plot(peaks, Y[peaks], "x") plt.show() This will plot (note that we use height=0.002 which will only find peaks higher than 0.002): In addition to height, we can also set the minimal distance between two peaks. If you use distance=100, the plot then looks as follows: You can use peaks, _ = find_peaks(Y, height=0.002, distance=100) in the code above.
After looking at the answers and suggestions I decided to offer a solution I often use because it is straightforward and easier to tweak. It uses a sliding window and counts how many times a local peak appears as a maximum as window shifts along the x-axis. As #DrV suggested, no universal definition of "local maximum" exists, meaning that some tuning parameters are unavoidable. This function uses "window size" and "frequency" to fine tune the outcome. Window size is measured in number of data points of independent variable (x) and frequency counts how sensitive should peak detection be (also expressed as a number of data points; lower values of frequency produce more peaks and vice versa). The main function is here: def peak_finder(x0, y0, window_size, peak_threshold): # extend x, y using window size y = numpy.concatenate([y0, numpy.repeat(y0[-1], window_size)]) x = numpy.concatenate([x0, numpy.arange(x0[-1], x0[-1]+window_size)]) local_max = numpy.zeros(len(x0)) for ii in range(len(x0)): local_max[ii] = x[y[ii:(ii + window_size)].argmax() + ii] u, c = numpy.unique(local_max, return_counts=True) i_return = numpy.where(c>=peak_threshold)[0] return(list(zip(u[i_return], c[i_return]))) along with a snippet used to produce the figure shown below: import numpy from matplotlib import pyplot def plot_case(axx, w_f): p = peak_finder(numpy.arange(0, len(Y)), -Y, w_f[0], w_f[1]) r = .9*min(Y)/10 axx.plot(Y) for ip in p: axx.text(ip[0], r + Y[int(ip[0])], int(ip[0]), rotation=90, horizontalalignment='center') yL = pyplot.gca().get_ylim() axx.set_ylim([1.15*min(Y), yL[1]]) axx.set_xlim([-50, 1100]) axx.set_title(f'window: {w_f[0]}, count: {w_f[1]}', loc='left', fontsize=10) return(None) window_frequency = {1:(15, 15), 2:(100, 100), 3:(100, 5)} f, ax = pyplot.subplots(1, 3, sharey='row', figsize=(9, 4), gridspec_kw = {'hspace':0, 'wspace':0, 'left':.08, 'right':.99, 'top':.93, 'bottom':.06}) for k, v in window_frequency.items(): plot_case(ax[k-1], v) pyplot.show() Three cases show parameter values that render (from left to right panel): (1) too many, (2) too few, and (3) an intermediate amount of peaks. To generate Y data, I used the function #deinonychusaur gave above, and added some noise to it from #Cleb's answer. I hope some might find this useful, but it's efficiency primarily depends on actual peak shapes and distances.
Finding a minimum or a maximum is not that simple, because there is no universal definition for "local maximum". Your code seems to look for a maximum and then accepts it as a maximum once the signal falls below the maximum minus some delta value. After that it starts to look for a minimum with similar criteria. It does not really matter if your data falls or rises slowly, as the maximum is recorded when it is reached and appended to the list of maxima once the level falls below the hysteresis threshold. This is a possible way to find local minima and maxima, but it has several shortcomings. One of them is that the method is not symmetric, i.e. if the same data is run backwards, the results are not necessarily the same. Unfortunately, I cannot help much more, because the correct method really depends on the data you are looking at, its shape and its noisiness. If you have some samples, then we might be able to come up with some suggestions.
get bins coordinates with hexbin in matplotlib
I use matplotlib's method hexbin to compute 2d histograms on my data. But I would like to get the coordinates of the centers of the hexagons in order to further process the results. I got the values using get_array() method on the result, but I cannot figure out how to get the bins coordinates. I tried to compute them given number of bins and the extent of my data but i don't know the exact number of bins in each direction. gridsize=(10,2) should do the trick but it does not seem to work. Any idea?
I think this works. from __future__ import division import numpy as np import math import matplotlib.pyplot as plt def generate_data(n): """Make random, correlated x & y arrays""" points = np.random.multivariate_normal(mean=(0,0), cov=[[0.4,9],[9,10]],size=int(n)) return points if __name__ =='__main__': color_map = plt.cm.Spectral_r n = 1e4 points = generate_data(n) xbnds = np.array([-20.0,20.0]) ybnds = np.array([-20.0,20.0]) extent = [xbnds[0],xbnds[1],ybnds[0],ybnds[1]] fig=plt.figure(figsize=(10,9)) ax = fig.add_subplot(111) x, y = points.T # Set gridsize just to make them visually large image = plt.hexbin(x,y,cmap=color_map,gridsize=20,extent=extent,mincnt=1,bins='log') # Note that mincnt=1 adds 1 to each count counts = image.get_array() ncnts = np.count_nonzero(np.power(10,counts)) verts = image.get_offsets() for offc in xrange(verts.shape[0]): binx,biny = verts[offc][0],verts[offc][1] if counts[offc]: plt.plot(binx,biny,'k.',zorder=100) ax.set_xlim(xbnds) ax.set_ylim(ybnds) plt.grid(True) cb = plt.colorbar(image,spacing='uniform',extend='max') plt.show()
I would love to confirm that the code by Hooked using get_offsets() works, but I tried several iterations of the code mentioned above to retrieve center positions and, as Dave mentioned, get_offsets() remains empty. The workaround that I found is to use the non-empty 'image.get_paths()' option. My code takes the mean to find the centers, which makes it just a smidge longer, but it does work. The get_paths() option returns a set of embedded x,y coordinates that can be looped over and then averaged to return the center position of each hexagon. The code that I have is as follows: counts=image.get_array() #counts in each hexagon, works great verts=image.get_offsets() #empty, don't use this b=image.get_paths() #this does work, gives Path([[]][]) which can be plotted for x in xrange(len(b)): xav=np.mean(b[x].vertices[0:6,0]) #center in x (RA) yav=np.mean(b[x].vertices[0:6,1]) #center in y (DEC) plt.plot(xav,yav,'k.',zorder=100)
I had this same problem. I think what needs to be developed is a framework to have a HexagonalGrid object which can then be applied to many different data sets (and it would be awesome to do it for N dimensions). This is possible and it surprises me that neither Scipy or Numpy has anything for it (furthermore there seems to be nothing else like it except perhaps binify) That said, I assume you want to use hexbinning to compare multiple binned data sets. This requires some common base. I got this to work using matplotlib's hexbin the following way: import numpy as np import matplotlib.pyplot as plt def get_data (mean,cov,n=1e3): """ Quick fake data builder """ np.random.seed(101) points = np.random.multivariate_normal(mean=mean,cov=cov,size=int(n)) x, y = points.T return x,y def get_centers (hexbin_output): """ about 40% faster than previous post only cause you're not calculating the min/max every time """ paths = hexbin_output.get_paths() v = paths[0].vertices[:-1] # adds a value [0,0] to the end vx,vy = v.T idx = [3,0,5,2] # index for [xmin,xmax,ymin,ymax] xmin,xmax,ymin,ymax = vx[idx[0]],vx[idx[1]],vy[idx[2]],vy[idx[3]] half_width_x = abs(xmax-xmin)/2.0 half_width_y = abs(ymax-ymin)/2.0 centers = [] for i in xrange(len(paths)): cx = paths[i].vertices[idx[0],0]+half_width_x cy = paths[i].vertices[idx[2],1]+half_width_y centers.append((cx,cy)) return np.asarray(centers) # important parts ==> class Hexagonal2DGrid (object): """ Used to fix the gridsize, extent, and bins """ def __init__ (self,gridsize,extent,bins=None): self.gridsize = gridsize self.extent = extent self.bins = bins def hexbin (x,y,hexgrid): """ To hexagonally bin the data in 2 dimensions """ fig = plt.figure() ax = fig.add_subplot(111) # Note mincnt=0 so that it will return a value for every point in the # hexgrid, not just those with count>mincnt # Basically you fix the gridsize, extent, and bins to keep them the same # then the resulting count array is the same hexbin = plt.hexbin(x,y, mincnt=0, gridsize=hexgrid.gridsize, extent=hexgrid.extent, bins=hexgrid.bins) # you could close the figure if you don't want it # plt.close(fig.number) counts = hexbin.get_array().copy() return counts, hexbin # Example ===> if __name__ == "__main__": hexgrid = Hexagonal2DGrid((21,5),[-70,70,-20,20]) x_data,y_data = get_data((0,0),[[-40,95],[90,10]]) x_model,y_model = get_data((0,10),[[100,30],[3,30]]) counts_data, hexbin_data = hexbin(x_data,y_data,hexgrid) counts_model, hexbin_model = hexbin(x_model,y_model,hexgrid) # if you want the centers, they will be the same for both centers = get_centers(hexbin_data) # if you want to ignore the cells with zeros then use the following mask. # But if want zeros for some bins and not others I'm not sure an elegant way # to do this without using the centers nonzero = counts_data != 0 # now you can compare the two data sets variance_data = counts_data[nonzero] square_diffs = (counts_data[nonzero]-counts_model[nonzero])**2 chi2 = np.sum(square_diffs/variance_data) print(" chi2={}".format(chi2))
scipy.interpolate.UnivariateSpline not smoothing regardless of parameters
I'm having trouble getting scipy.interpolate.UnivariateSpline to use any smoothing when interpolating. Based on the function's page as well as some previous posts, I believe it should provide smoothing with the s parameter. Here is my code: # Imports import scipy import pylab # Set up and plot actual data x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193] y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598] pylab.plot(x, y, "o", label="Actual") # Plot estimates using splines with a range of degrees for k in range(1, 4): mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=k, s=2) xi = range(0, 15100, 20) yi = mySpline(xi) pylab.plot(xi, yi, label="Predicted k=%d" % k) # Show the plot pylab.grid(True) pylab.xticks(rotation=45) pylab.legend( loc="lower right" ) pylab.show() Here is the result: I have tried this with a range of s values (0.01, 0.1, 1, 2, 5, 50), as well as explicit weights, set to either the same thing (1.0) or randomized. I still can't get any smoothing, and the number of knots is always the same as the number of data points. In particular, I'm looking for outliers like that 4th point (7990.4664106277542, 5851.6866463790966) to be smoothed over. Is it because I don't have enough data? If so, is there a similar spline function or cluster technique I can apply to achieve smoothing with this few datapoints?
Short answer: you need to choose the value for s more carefully. The documentation for UnivariateSpline states that: Positive smoothing factor used to choose the number of knots. Number of knots will be increased until the smoothing condition is satisfied: sum((w[i]*(y[i]-s(x[i])))**2,axis=0) <= s From this one can deduce that "reasonable" values for smoothing, if you don't pass in explicit weights, are around s = m * v where m is the number of data points and v the variance of the data. In this case, s_good ~ 5e7. EDIT: sensible values for s depend of course also on the noise level in the data. The docs seem to recommend choosing s in the range (m - sqrt(2*m)) * std**2 <= s <= (m + sqrt(2*m)) * std**2 where std is the standard deviation associated with the "noise" you want to smooth over.
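To make that rule of thumb concrete, here is a minimal sketch (using the data from the question) of computing a candidate s before building the spline. Taking the full standard deviation of y as the noise level overstates the noise, so the resulting values are only illustrative:

import numpy as np
import scipy.interpolate

x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542,
     9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966,
     6056.3852496014106, 7895.2332350173638, 9154.2956175610598]

m = len(x)
std = np.std(y)                     # crude stand-in for the noise level
s_lo = (m - np.sqrt(2 * m)) * std**2
s_hi = (m + np.sqrt(2 * m)) * std**2

mySpline = scipy.interpolate.UnivariateSpline(x=x, y=y, k=3, s=s_hi)
print(len(mySpline.get_knots()))    # far fewer knots than data points once s is this large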
#Zhenya's answer of manually setting knots in between datapoints was too rough to deliver good results in noisy data without being selective about how this technique is applied. However, inspired by his/her suggestion, I have had success with Mean-Shift clustering from the scikit-learn package. It performs auto-determination of the cluster count and seems to do a fairly good smoothing job (very smooth in fact). # Imports import numpy import pylab import scipy import sklearn.cluster # Set up original data - note that it's monotonically increasing by X value! data = {} data['original'] = {} data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193] data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598] # Cluster data, sort it and and save inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))]) meanShift = sklearn.cluster.MeanShift() meanShift.fit(inputNumpy) clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_] clusteredData.sort(lambda pair1, pair2: cmp(pair1[0],pair2[0])) data['clustered'] = {} data['clustered']['x'] = [pair[0] for pair in clusteredData] data['clustered']['y'] = [pair[1] for pair in clusteredData] # Build a spline using the clustered data and predict mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1) xi = range(0, round(max(data['original']['x']), -3) + 3000, 20) yi = mySpline(xi) # Plot the datapoints pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered') pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered') pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original') # Show the plot pylab.grid(True) pylab.xticks(rotation=45) pylab.legend( loc="lower right" ) pylab.show()
While I'm not aware of any library which will do it for you off-hand, I'd try a slightly more DIY approach: I'd start by making a spline with knots in between the raw data points, in both x and y. In your particular example, having a single knot in between the 4th and 5th points should do the trick, since it'd remove the huge derivative at around x=8000.
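For reference, here is a minimal sketch of that DIY approach using scipy.interpolate.LSQUnivariateSpline, which takes the interior knots explicitly. The knot position 8500 is a hand-picked illustrative value between the 4th and 5th data points, not something prescribed above:

import numpy as np
import scipy.interpolate
import pylab

x = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542,
     9879.9717114947653, 13738.60563208926, 15113.277958924193]
y = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966,
     6056.3852496014106, 7895.2332350173638, 9154.2956175610598]

# A single interior knot between the 4th and 5th data points
mySpline = scipy.interpolate.LSQUnivariateSpline(x, y, t=[8500], k=3)

xi = np.linspace(min(x), max(x), 500)
pylab.plot(x, y, "o", label="Actual")
pylab.plot(xi, mySpline(xi), label="LSQ spline, knot at 8500")
pylab.legend(loc="lower right")
pylab.show()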
I had trouble getting BigChef's answer running, here is a variation that works on python 3.6: # Imports import pylab import scipy import sklearn.cluster # Set up original data - note that it's monotonically increasing by X value! data = {} data['original'] = {} data['original']['x'] = [0, 5024.2059124920379, 7933.1645067836089, 7990.4664106277542, 9879.9717114947653, 13738.60563208926, 15113.277958924193] data['original']['y'] = [0.0, 3072.5653360000988, 5477.2689107965398, 5851.6866463790966, 6056.3852496014106, 7895.2332350173638, 9154.2956175610598] # Cluster data, sort it and and save import numpy inputNumpy = numpy.array([[data['original']['x'][i], data['original']['y'][i]] for i in range(0, len(data['original']['x']))]) meanShift = sklearn.cluster.MeanShift() meanShift.fit(inputNumpy) clusteredData = [[pair[0], pair[1]] for pair in meanShift.cluster_centers_] clusteredData.sort(key=lambda li: li[0]) data['clustered'] = {} data['clustered']['x'] = [pair[0] for pair in clusteredData] data['clustered']['y'] = [pair[1] for pair in clusteredData] # Build a spline using the clustered data and predict mySpline = scipy.interpolate.UnivariateSpline(x=data['clustered']['x'], y=data['clustered']['y'], k=1) xi = range(0, int(round(max(data['original']['x']), -3)) + 3000, 20) yi = mySpline(xi) # Plot the datapoints pylab.plot(data['clustered']['x'], data['clustered']['y'], "D", label="Datapoints (%s)" % 'clustered') pylab.plot(xi, yi, label="Predicted (%s)" % 'clustered') pylab.plot(data['original']['x'], data['original']['y'], "o", label="Datapoints (%s)" % 'original') # Show the plot pylab.grid(True) pylab.xticks(rotation=45) pylab.show()
Remove data points below a curve with python
I need to compare some theoretical data with real data in python. The theoretical data comes from solving an equation. To improve the comparison I would like to remove data points that fall far from the theoretical curve. I mean, I want to remove the points below and above the red dashed lines in the figure (made with matplotlib). Both the theoretical curves and the data points are arrays of different length. I can try to remove the points in a rough, by-eye way; for example, the first upper point can be detected using: data2[(data2.redshift<0.4)&(data2.dmodulus>1)] rec.array([('1997o', 0.374, 1.0203223485103787, 0.44354759972859786)], dtype=[('SN_name', '|S10'), ('redshift', '<f8'), ('dmodulus', '<f8'), ('dmodulus_error', '<f8')]) But I would like to use a less rough, more automatic way. So, can anyone help me find an easy way of removing the problematic points? Thank you!
This might be overkill and is based on your comment Both the theoretical curves and the data points are arrays of different length. I would do the following: Truncate the data set so that its x values lie within the max and min values of the theoretical set. Interpolate the theoretical curve using scipy.interpolate.interp1d and the above truncated data x values. The reason for step (1) is to satisfy the constraints of interp1d. Use numpy.where to find data y values that are out side the range of acceptable theory values. DONT discard these values, as was suggested in comments and other answers. If you want for clarity, point them out by plotting the 'inliners' one color and the 'outliers' an other color. Here's a script that is close to what you are looking for, I think. It hopefully will help you accomplish what you want: import numpy as np import scipy.interpolate as interpolate import matplotlib.pyplot as plt # make up data def makeUpData(): '''Make many more data points (x,y,yerr) than theory (x,y), with theory yerr corresponding to a constant "sigma" in y, about x,y value''' NX= 150 dataX = (np.random.rand(NX)*1.1)**2 dataY = (1.5*dataX+np.random.rand(NX)**2)*dataX dataErr = np.random.rand(NX)*dataX*1.3 theoryX = np.arange(0,1,0.1) theoryY = theoryX*theoryX*1.5 theoryErr = 0.5 return dataX,dataY,dataErr,theoryX,theoryY,theoryErr def makeSameXrange(theoryX,dataX,dataY): ''' Truncate the dataX and dataY ranges so that dataX min and max are with in the max and min of theoryX. ''' minT,maxT = theoryX.min(),theoryX.max() goodIdxMax = np.where(dataX<maxT) goodIdxMin = np.where(dataX[goodIdxMax]>minT) return (dataX[goodIdxMax])[goodIdxMin],(dataY[goodIdxMax])[goodIdxMin] # take 'theory' and get values at every 'data' x point def theoryYatDataX(theoryX,theoryY,dataX): '''For every dataX point, find interpolated thoeryY value. 
theoryx needed for interpolation.''' f = interpolate.interp1d(theoryX,theoryY) return f(dataX[np.where(dataX<np.max(theoryX))]) # collect valid points def findInlierSet(dataX,dataY,interpTheoryY,thoeryErr): '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return valid indicies.''' withinUpper = np.where(dataY<(interpTheoryY+theoryErr)) withinLower = np.where(dataY[withinUpper] >(interpTheoryY[withinUpper]-theoryErr)) return (dataX[withinUpper])[withinLower],(dataY[withinUpper])[withinLower] def findOutlierSet(dataX,dataY,interpTheoryY,thoeryErr): '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return valid indicies.''' withinUpper = np.where(dataY>(interpTheoryY+theoryErr)) withinLower = np.where(dataY<(interpTheoryY-theoryErr)) return (dataX[withinUpper],dataY[withinUpper], dataX[withinLower],dataY[withinLower]) if __name__ == "__main__": dataX,dataY,dataErr,theoryX,theoryY,theoryErr = makeUpData() TruncDataX,TruncDataY = makeSameXrange(theoryX,dataX,dataY) interpTheoryY = theoryYatDataX(theoryX,theoryY,TruncDataX) inDataX,inDataY = findInlierSet(TruncDataX,TruncDataY,interpTheoryY, theoryErr) outUpX,outUpY,outDownX,outDownY = findOutlierSet(TruncDataX, TruncDataY, interpTheoryY, theoryErr) #print inlierIndex fig = plt.figure() ax = fig.add_subplot(211) ax.errorbar(dataX,dataY,dataErr,fmt='.',color='k') ax.plot(theoryX,theoryY,'r-') ax.plot(theoryX,theoryY+theoryErr,'r--') ax.plot(theoryX,theoryY-theoryErr,'r--') ax.set_xlim(0,1.4) ax.set_ylim(-.5,3) ax = fig.add_subplot(212) ax.plot(inDataX,inDataY,'ko') ax.plot(outUpX,outUpY,'bo') ax.plot(outDownX,outDownY,'ro') ax.plot(theoryX,theoryY,'r-') ax.plot(theoryX,theoryY+theoryErr,'r--') ax.plot(theoryX,theoryY-theoryErr,'r--') ax.set_xlim(0,1.4) ax.set_ylim(-.5,3) fig.savefig('findInliers.png') This figure is the result:
At the end I use some of the Yann code: def theoryYatDataX(theoryX,theoryY,dataX): '''For every dataX point, find interpolated theoryY value. theoryx needed for interpolation.''' f = interpolate.interp1d(theoryX,theoryY) return f(dataX[np.where(dataX<np.max(theoryX))]) def findOutlierSet(data,interpTheoryY,theoryErr): '''Find where theoryY-theoryErr < dataY theoryY+theoryErr and return valid indicies.''' up = np.where(data.dmodulus > (interpTheoryY+theoryErr)) low = np.where(data.dmodulus < (interpTheoryY-theoryErr)) # join all the index together in a flat array out = np.hstack([up,low]).ravel() index = np.array(np.ones(len(data),dtype=bool)) index[out]=False datain = data[index] dataout = data[out] return datain, dataout def selectdata(data,theoryX,theoryY): """ Data selection: z<1 and +-0.5 LFLRW separation """ # Select data with redshift z<1 data1 = data[data.redshift < 1] # From modulus to light distance: data1.dmodulus, data1.dmodulus_error = modulus2distance(data1.dmodulus,data1.dmodulus_error) # redshift data order data1.sort(order='redshift') # Outliers: distance to LFLRW curve bigger than +-0.5 theoryErr = 0.5 # Theory curve Interpolation to get the same points as data interpy = theoryYatDataX(theoryX,theoryY,data1.redshift) datain, dataout = findOutlierSet(data1,interpy,theoryErr) return datain, dataout Using those functions I can finally obtain: Thank you all for your help.
Just look at the difference between the red curve and the points: if it is bigger than the difference between the red curve and the dashed red curve, remove the point. diff=np.abs(points-red_curve) index=(diff<=(dashed_curve-red_curve)) filtered=points[index] But please take the comment from NickLH seriously. Your data looks pretty good without any filtering; your "outliers" all have a very big error and won't affect the fit much.
Either you could use numpy.where() to identify which xy pairs meet your plotting criteria, or perhaps enumerate to do pretty much the same thing. Example: x_list = [ 1, 2, 3, 4, 5, 6 ] y_list = ['f','o','o','b','a','r'] result = [y_list[i] for i, x in enumerate(x_list) if 2 <= x < 5] print result I'm sure you could change the conditions so that '2' and '5' in the above example are functions of your curves.
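For completeness, here is a rough sketch of what the numpy.where() variant of the same toy example might look like (the array names are mine):

import numpy as np

x_arr = np.array([1, 2, 3, 4, 5, 6])
y_arr = np.array(['f', 'o', 'o', 'b', 'a', 'r'])

# Same criterion as the list comprehension above, expressed as a boolean mask
idx = np.where((x_arr >= 2) & (x_arr < 5))[0]
print(y_arr[idx])   # -> ['o' 'o' 'b']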