Python findpeaks implementation difference with Matlab's

I'm interested in finding positive and negative peaks in a data sample with Python.
To give you a better idea, I'm considering Matlab's findpeaks function (https://it.mathworks.com/help/signal/ref/findpeaks.html).
For example, when in Matlab I do
[a1,peak_loc1] = findpeaks(data,"DoubleSided",'MinPeakHeight',peak_height)
I get the peaks and the indices at which the peaks occur.
Now, I found a good candidate in the scipy signal find_peaks function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html).
This seems very similar. But as is, it does not provide the results I want. In fact it returns the peaks and a dictionary containing properties of the returned peaks.
So to return the a1 and peak_loc1 values I did a little extra operation:
import numpy as np
from scipy.signal import find_peaks
def findpeaks(arr, h, w=1, d=1):
    pos = find_peaks(arr, height=h, width=w, distance=d)
    pos_list = dict(zip(pos[0], pos[1]['peak_heights']))
    neg = find_peaks(arr * -1, height=h, width=w, distance=d)
    neg_list = dict(zip(neg[0], neg[1]['peak_heights'] * -1))
    full_list = {**pos_list, **neg_list}
    full_list = dict(sorted(full_list.items()))
    heights = list(full_list.values())
    indexes = list(full_list.keys())
    return heights, indexes
However the results I get do not match 1:1 the results in Matlab.
To make the difference easier to see, please take a look at this screenshot where I put the values side by side: Matlab's values on the left, Python's on the right.
[screenshot: values comparison]
Values in white are identical. However, at some point there is some weirdness going on where Matlab's corresponding values are shifted one position later. I highlighted the matching values with colors. And then there are two values (in red) which have no correspondence in the ones calculated by Python.
This is my Python code:
Please note data is a simple np.array but it has around 12K values in it, so you can look at it here: https://pastebin.pl/view/d3d5dba4
data = *
peak_height = 0.4455
a1, peak_loc1 = findpeaks(data, peak_height)
What can be the reason? The only thing that came to my mind is that some parameter is missing... i.e. what would correspond to "DoubleSided"? 'MinPeakHeight' should correspond to the peak_height parameter I'm already passing to Python.

Based on the parameter DoubleSided and your screenshot, I'm assuming that you are actually using Octave's findpeaks rather than Matlab's! If you read that parameter's description:
"DoubleSided" - Tells the function that data takes positive and negative values. The base-line for the peaks is taken as the mean value of the function. This is equivalent as passing the absolute value of the data after removing the mean.
So using DoubleSided will actually manipulate absolute values in data before checking for MinPeakHeight and will therefore change which peaks fall below or above the height 0.4455. We can imitate this in Python (I took the liberty of simplifying and speeding up your function):
def findpeaks(arr, h, w=1, d=1):
    # Adjust baseline to emulate 'DoubleSided'
    adjusted = arr - arr.mean()
    # Only consider absolute values to find local maxima AND minima
    adjusted = abs(adjusted)
    indexes, _ = find_peaks(adjusted, height=h)
    # Retrieve extrema heights from the original signal rather
    # than from the properties dict with adjusted heights
    heights = arr[indexes]
    return heights, indexes
a1, peak_loc1 = findpeaks(data, h=0.4455)
for h, i in zip(a1, peak_loc1):
    # For display purposes, adjust index by +1
    # to match Octave/Matlab's 1-based indexing
    print(f"{h:>9.5f} | {i + 1:>5}")
Running this with your data will print values that look identical to Octave's output (at least I didn't find a difference).

Related

Integration of a function with discrete values

I want to do an integration without knowing the functional equation f(x). I only have discrete values, which Python has connected in a plot. It looks like this:
This is the code with the calculation for it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld

# Loading of the values
dataListStride = ld.loadData("../Projektpraktikum Binder/Data/1 Fabienne/Test1/left foot/50cm")
indexStrideData = 0
strideData = dataListStride[indexStrideData]

#%% Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
    a = (m.cos(m.radians(yAngle)) * yAcceleration) - (m.sin(m.radians(yAngle)) * xAcceleration)
    return a

resultsHorizontal = list()
for i in range(len(strideData)):
    strideData_yAngle = strideData.to_numpy()[i, 2]
    strideData_xAcceleration = strideData.to_numpy()[i, 4]
    strideData_yAcceleration = strideData.to_numpy()[i, 5]
    resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)

print("The values are: " + str(resultsHorizontal))
print("There are " + str(len(resultsHorizontal)) + " values.")

# x-axis "converted" into time: 100 Hz makes 0.01 seconds per sample
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor
plt.plot(x_values, resultsHorizontal)
After the calculation I get a list of values (shown and plotted in the diagram above).
Note about the code:
The code works as follows: using loaddataa.py, a csv file is read in. Then the formula for calculating the horizontal acceleration is defined, represented by def horizontal(yAngle, yAcceleration, xAcceleration). In the for loop, the previously loaded data is run through line by line; columns 2, 4 and 5 of the CSV file are used here. Then a 0 is added to the beginning of the resulting list of values. This is important to perform the integration from 0 to the end.
Now I want to integrate this function (which is represented in the plot at the top) with these values (which can be seen in the image after the code) after the calculation.
Is there a way to implement this? If so, how, and what would the plot look like? Maybe there is the opportunity to do this with trapezoidal integration? Thanks for helping me!
At the end of my task I want to do a double integration of the acceleration values to get the path length. The first (trapezoidal) integration of the acceleration should give the velocity, and the second (trapezoidal) integration the path length (position). The x-axis should remain as it is.
What I just noticed are the negative values. Theoretically the integration should always result in positive values, right? Because there are no negative areas.
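A minimal sketch of that double trapezoidal integration, assuming resultsHorizontal and x_values from the code above; scipy's cumulative_trapezoid is one way to do it:
from scipy.integrate import cumulative_trapezoid
import matplotlib.pyplot as plt

# Cumulative trapezoidal integration: acceleration -> velocity -> position (path).
# initial=0 keeps each output the same length as its input, so everything
# can be plotted against the same time axis (x_values).
velocity = cumulative_trapezoid(resultsHorizontal, x_values, initial=0)
position = cumulative_trapezoid(velocity, x_values, initial=0)

plt.plot(x_values, velocity, label="velocity")
plt.plot(x_values, position, label="position")
plt.legend()
plt.show()
Note that the trapezoidal integral of a signal that dips below zero can itself go negative, since segments below the x-axis contribute negative area.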

Number density distribution of an 1D-array - 2 different attempts

I have a large array of elements that I call RelDist (which dimensionally is a unit of distance) in a simulated volume. I am attempting to determine the distribution of the "number of values per unit volume", which is also the number density. It should be similar to this diagram:
The axes are scaled log base 10; the plot of the set should definitely drop off.
Mathematically, I set it up as two equivalent equations:
n(r) = (1/V) * dN/d(ln r),
where N is the number of elements in the array, differentiated with respect to the natural log of the distances. It can also be equivalently re-written in the form of a regular derivative by introducing another factor of r:
n(r) = (r/V) * dN/dr.
So for ever increasing r, I want to count the change in N of elements per logarithmic bin of r.
As of now, I have trouble setting up the frequency counting in the histogram while accommodating the volume alongside it.
Attempt 1
This is using the dN/dlnr/volume equations
def n(dist, numbins):
    logdist = np.log(dist)
    hist, r_array = np.histogram(logdist, numbins)
    dlogR = r_array[1] - r_array[0]
    x_array = r_array[1:] - dlogR/2
    ## I am confident the above part of this code is correct.
    ## The succeeding portion does not work.
    dR = r_array[1:] - r_array[0:numbins]
    dN_dlogR = hist * x_array/dR
    volume = 4*np.pi*dist*dist*dist
    ## The included volume is incorrect
    return [x_array, dN_dlogR/volume]
Plotting this does not even properly show a distribution like the first plot I posted above, and it only works when I choose the bin number to match the shape of my input array. The bin number should be arbitrary, should it not?
Attempt 2
This is using the equivalent dN/dr/volume equation.
numbins = np.linspace(min(RelDist),max(RelDist), 100)
hist, r_array = np.histogram(RelDist, numbins)
volume = 4*np.float(1000**2)
dR = r_array[1]-r_array[0]
x_array = r_array[1:] - dR/2
y = hist/dR
A little bit easier, but without including the volume term, I get a sort of histogram distribution, which is at least a start.
With this attempt, how would I include the volume term with the array?
Example
Start at a distance R of something like 10, count the change in number with respect to R, then increase R to 20, count the change, increase to 30, count the change, and so on and so forth.
Here is a txt file of my array if you are interested in re-creating it
https://www.dropbox.com/s/g40gp88k2p6pp6y/RelDist.txt?dl=0
Since no one was able to answer, I will provide my result in case someone wants to use it in the future:
def n_ln(dist, numbins):
    log_dist = np.log10(dist)
    bins = np.linspace(min(log_dist), max(log_dist), numbins)
    hist, r_array = np.histogram(log_dist, bins)
    dR = r_array[1] - r_array[0]
    x_array = r_array[1:] - dR/2
    volume = [4.*np.pi*i**3. for i in 10**x_array[:]]
    return [10**x_array, hist/dR/volume]
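For completeness, a small usage sketch (the file name and bin count here are assumptions), loading the linked array and plotting the result on log-log axes:
import numpy as np
import matplotlib.pyplot as plt

# Assumes RelDist.txt is the file linked above, one distance per line
RelDist = np.loadtxt("RelDist.txt")

x, number_density = n_ln(RelDist, numbins=50)

plt.loglog(x, number_density)
plt.xlabel("r")
plt.ylabel("number density")
plt.show()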

Given a subinterval of a value over time find "similar instances" of that pattern along the function

Recently I've been asked to find instances of a given pattern over a function (a value over time), but I'm not sure how to approach the problem.
For example, if the following case were given and the time interval selected was [0,1], I would like to find all the instances of that shape, even if they are not exactly equal (emulating the human eye's behaviour):
Preferably I would like to code it in Python, so any suggestions about helpful libraries and/or frameworks (and of course known methods and algorithms) will be very much appreciated.
Thanks
A rather trivial approach could be to take the given pattern and slide it across the data as a window, finding the difference between the pattern and the data under it. This would only be accurate if the shapes were always the same size as well as the same shape.
Demo:
set up the data:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 200, 200)
y = np.zeros_like(x)

def addpeak(pos, y):  # clipped triangular peak centered at pos (10 high, 20 wide)
    y += np.array([np.clip(10 - abs(pos - i), 0, 5) for i in range(len(y))])
    return y

y = addpeak(15, y)
y = addpeak(40, y)
y = addpeak(125, y)
y = addpeak(100, y)
y = addpeak(180, y)
plt.plot(x, y)  # visualize data
Then take the sliding window difference:
window = y[5:25]  # first peak is the sliding window
# you could use different difference formulas than simply linear
difference = np.array([sum(window - y[i:i+20]) for i in range(len(y) - 20)])
plt.plot(x[:-20], difference)  # note: minimum difference might be offset based on window indexing
# pick your favorite way to find local minima
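As one possible variation on the comment about other difference formulas (an assumption, not part of the original demo), a sum of squared differences avoids positive and negative deviations cancelling, and scipy's find_peaks can pick out the local minima; this reuses window, x and y from above:
from scipy.signal import find_peaks

# Sum of squared differences between the window and each 20-sample slice of y;
# minima mark positions where the data most resembles the pattern
ssd = np.array([np.sum((window - y[i:i+20])**2) for i in range(len(y) - 20)])
plt.plot(x[:-20], ssd)

# Local minima of the SSD curve are candidate matches
match_idx, _ = find_peaks(-ssd)
print(match_idx)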
You could use something like numpy (python numpy/scipy curve fitting) to fit a curve to the points over the interval [0,1]. From that, you could apply an offset along the x-axis to see if the curve 'fits' any other parts of the data.
For example, from [1,2] it would be Offset: -1. Without an example of the code above, it's hard to go into precisely how to do it, but I hope this is helpful.
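A rough sketch of that fit-and-offset idea on synthetic data (the signal, the polynomial degree and the one-second offsets are all assumptions for illustration):
import numpy as np

# Synthetic periodic signal sampled over [0, 5]
t = np.linspace(0, 5, 500)
signal = np.sin(2 * np.pi * t)

# Fit a low-order polynomial to the pattern on the interval [0, 1]
mask = (t >= 0) & (t <= 1)
pattern = np.poly1d(np.polyfit(t[mask], signal[mask], deg=5))

# Slide the fitted pattern across the signal at whole-second offsets
# and measure how well it matches each segment
for offset in range(5):
    seg = (t >= offset) & (t <= offset + 1)
    err = np.mean((signal[seg] - pattern(t[seg] - offset)) ** 2)
    print(f"offset {offset}: mean squared error {err:.4f}")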

Python fastKDE beyond limits of data points

I'm trying to use the fastKDE package (https://pypi.python.org/pypi/fastkde/1.0.8) to find the KDE of a point in a 2D plot. However, I want to know the KDE beyond the limits of the data points, and cannot figure out how to do this.
Using the code listed on the site linked above:
import numpy as np
from fastkde import fastKDE
import pylab as PP

# Generate two random variables dataset (representing 100000 pairs of datapoints)
N = int(2e5)  # size must be an integer for np.random.normal
var1 = 50*np.random.normal(size=N) + 0.1
var2 = 0.01*np.random.normal(size=N) - 300

# Do the self-consistent density estimate
myPDF, axes = fastKDE.pdf(var1, var2)

# Extract the axes from the axis list
v1, v2 = axes

# Plot contours of the PDF; should be a set of concentric ellipsoids centered on
# (0.1, -300). Comparatively, the y axis range should be tiny and the x axis range
# should be large
PP.contour(v1, v2, myPDF)
PP.show()
I'm able to find the KDE for any point within the limits of the data, but how do I find the KDE for, say, the point (0, 300), without having to include it in var1 and var2? I don't want the KDE to be calculated with this data point; I want to know the KDE at that point.
I guess what I really want to be able to do is give fastKDE a histogram of the data, so that I can set its axes myself. I just don't know if this is possible?
Cheers
I, too, have been experimenting with this code and have run into the same issues. What I've done (in lieu of a good N-D extrapolator) is to build a KDTree (with scipy.spatial) from the grid points that fastKDE returns and find the nearest grid point to the point I want to evaluate. I then look up the corresponding pdf value at that grid point (it should be small near the edge of the pdf grid, if not identically zero) and assign that value accordingly.
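A minimal sketch of that nearest-grid-point lookup, assuming myPDF, v1 and v2 come from fastKDE.pdf as in the code above:
import numpy as np
from scipy.spatial import cKDTree

# Build a KD-tree over the (v1, v2) grid points returned by fastKDE;
# the contour call above implies myPDF has shape (len(v2), len(v1)),
# which matches the default meshgrid layout used here
grid_v1, grid_v2 = np.meshgrid(v1, v2)
tree = cKDTree(np.column_stack([grid_v1.ravel(), grid_v2.ravel()]))

def pdf_at(point):
    # Look up the pdf value at the grid point nearest to `point`;
    # near or beyond the grid edge this will be small, if not zero
    _, idx = tree.query(point)
    return myPDF.ravel()[idx]

print(pdf_at((0.0, 300.0)))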
I came across this post while searching for a solution to this problem. Similar to building a KDTree, you could just calculate your step size in every grid dimension, and then get the index of your query point by subtracting the start of the axis from the point value, dividing by the step size of that dimension, and finally rounding it off and turning it into an integer. Voila. So, for example, in 1D:
def fastkde_test(test_x):
    kde, axes = fastKDE.pdf(test_x, numPoints=num_p)
    x_step = (max(axes) - min(axes)) / len(axes)
    x_ind = np.int32(np.round((test_x - min(axes)) / x_step))
    return kde[x_ind]
where test_x in this case is both the set used to define the KDE and the query set. Doing it this way is faster (by a factor of about 10 in my case, at least in 1D; higher dimensions not yet tested) and does basically the same thing as the KDTree query.
I hope this helps anyone coming across this problem in the future, as I just did.
Edit: if you're querying points outside of the range over which the KDE was calculated, this method can of course only give you the same result as the KDTree query, namely the corresponding border of your KDE grid. You would, however, have to hardcode this by clipping the resulting x_ind at the highest index, i.e. len(axes) - 1.
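A one-line sketch of that clamp, applied to x_ind before indexing kde (np.clip also covers queries below the start of the axis):
# Clamp indices so queries outside the KDE grid map to the nearest border point
x_ind = np.clip(x_ind, 0, len(axes) - 1)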

Python-Integrate then save text file

To all:
This question pertains to some MS research I am doing.
What I want to do is seemingly simple. I have a text file of values (time, ..... values). I then want to integrate from 0 to those values for all times, then save that value to a text file.
from numpy import *
from pylab import *
import os, sys, shutil
import math

#######################
# Load Data
data = loadtxt('wh.txt')

# Open file to save plots to
shutil.rmtree("wh_files")
os.makedirs("wh_files")
os.chdir("wh_files")

for i in range(0, 100, 1):
    int = trapz(data[i,:], axis=0)
    print int
    savetxt('int.txt', int)
When I run this I get the following error:
File "integral.py", line 19, in
savetxt('int.txt', int)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 960, in savetxt
ncol = X.shape[1]
IndexError: tuple index out of range
I have been trying to solve this problem on and off for a few days now, but have yet to find a solution. I appreciate any help/comments you can give.
The value of int is a float but savetxt requires an array. You want to create a numpy array for all the integration results, then save it at the end using savetxt. I think something like this should do the trick:
int_array = apply_along_axis(trapz, 0, data)
savetxt('int.txt', int_array)
Keep in mind that this (along with your original example) will sum the time fields, rather than integrating across time. This will only yield valid results if the spacing along the x-axis is 1. Otherwise you'll want to do something like this:
t = data[0,:]
int_array = apply_along_axis(trapz, 0, data[1:,:], t)
savetxt('int.txt', int_array)
Assuming the time fields are numeric.
Edit: further explanation of the 2nd code section follows.
You're using the trapezoidal rule to integrate a variety of values. It is an integration approximation technique that works by summing the average of successive y-values on a curve multiplied by the change in x between the two y-values. This amounts to calculating the area of a trapezoid that connects the two y-values and the x-axis.
It's not completely clear from your question, but it seemed that you were integrating the values across time, so that the x-axis would represent time. The x-values need to be incorporated to get the correct area of each trapezoid (the area of each trapezoid is (x2 - x1) * (y2 + y1) / 2 and the final integration result is the sum of all such areas).
The simplest way to incorporate these x-axis values is to pass them into the trapz function as the x parameter (see the docstring). In the example above, I used t = data[0,:] as that array of x values.
Another caveat: if the spacing between all x values is the same (so that x2 - x1 is a constant), you can save yourself some calculation by pulling it out of the summation and simply multiplying at the end. This functionality is available through the trapz function's dx parameter. So, if your time measurements were taken every 30 seconds, for example, you could replace the 2nd line in my 2nd example with:
int_array = apply_along_axis(trapz, 0, data[1:,:], None, 30)
Hope that helps.
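A small numeric check of the trapezoid formula above against numpy's trapz, showing both the x and the dx form (the numbers are just for illustration):
import numpy as np

# Two points: (x1, y1) = (0, 2) and (x2, y2) = (30, 4)
x = np.array([0.0, 30.0])
y = np.array([2.0, 4.0])

# Trapezoid area: (x2 - x1) * (y2 + y1) / 2 = 30 * 6 / 2 = 90
print(np.trapz(y, x))        # 90.0, x values passed explicitly
print(np.trapz(y, dx=30.0))  # 90.0, constant spacing given via dx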
