To all:
This question pertains to some MS research I am doing.
What I want to do is seemingly simple. I have a text file of values (time, .....values). I then want to integrate from 0 to those values for all times, then save that value to a text file.
from numpy import *
from pylab import *
import os, sys, shutil
import math
#######################
#Load Data
data = loadtxt('wh.txt')
#Open file to save plots to
shutil.rmtree("wh_files")
os.makedirs("wh_files")
os.chdir("wh_files")
for i in range(0,100,1):
    int = trapz(data[i,:],axis=0)
    print int
    savetxt('int.txt', int)
When I run this I get the following error:
File "integral.py", line 19, in
savetxt('int.txt', int)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/lib/npyio.py", line 960, in savetxt
ncol = X.shape[1]
IndexError: tuple index out of range
I have been trying to solve this problem on and off for a few days now, but have yet to find a solution. I appreciate any help/comments you can give.
The value of int is a float but savetxt requires an array. You want to create a numpy array for all the integration results, then save it at the end using savetxt. I think something like this should do the trick:
int_array = apply_along_axis(trapz, 0, data)
savetxt('int.txt', int_array)
Keep in mind that this (along with your original example) will sum the time fields, rather than integrating across time. This will only yield valid results if the spacing along the x-axis is 1. Otherwise you'll want to do something like this:
t = data[0,:]
int_array = apply_along_axis(trapz, 0, data[1:,:], t)
savetxt('int.txt', int_array)
Assuming the time fields are numeric.
edit: Further explanation of 2nd code section follows.
You're using the trapezoidal rule to integrate a variety of values, which is an integration approximation technique that works by summing the average of successive y-values on a curve multiplied by the change in x between the two y-values. This amounts to calculating the area of a trapezoid that connects the two y-values and the x-axis, like so:
It's not completely clear from your question, but it seemed that you were integrating the values across time, so that the x-axis would represent time. The x-values need to be incorporated to get the correct area of each trapezoid (the area of each trapezoid is (x2 - x1) * (y2 + y1) / 2 and the final integration result is the sum of all such areas).
The simplest way to incorporate these x-axis values is to pass them into the trapz function as the x parameter (see the docstring). In the example above, I used t = data[0,:] as that array of x values.
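As a quick sanity check (not from the original answer; the sample values below are made up), summing the individual trapezoid areas gives the same result as numpy's trapz with the x parameter:

import numpy as np

t = np.array([0.0, 0.5, 1.5, 2.0])   # hypothetical time values (uneven spacing)
y = np.array([1.0, 3.0, 2.0, 4.0])   # hypothetical y-values at those times

# Sum of individual trapezoid areas: (x2 - x1) * (y2 + y1) / 2
manual = sum((t[i + 1] - t[i]) * (y[i + 1] + y[i]) / 2 for i in range(len(t) - 1))

print(manual, np.trapz(y, x=t))   # both print 5.0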
Another caveat: if the spacing between x values is always the same (so that x2 - x1 is a constant), you can save yourself some calculation by pulling it out of the summation and simply multiplying it at the end. This functionality is available through the trapz function with the dx parameter. So, if your time measurements were taken every 30 seconds, for example, you could replace the 2nd line in my 2nd example with:
int_array = apply_along_axis(trapz, 0, data[1:,:], None, 30)
Hope that helps.
I'm interested in finding positive and negative peaks in a data sample with Python.
To give you a better idea, I'm considering Matlab's findpeaks function (https://it.mathworks.com/help/signal/ref/findpeaks.html).
For example, when in Matlab I do
[a1,peak_loc1] = findpeaks(data,"DoubleSided",'MinPeakHeight',peak_height)
I get the peaks and the indices at which the peaks occur.
Now, I found a good candidate in the scipy signal find_peaks function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html).
This seems very similar. But as is, it does not provide the results I want. In fact it returns the peaks and a dictionary containing properties of the returned peaks.
So to return the a1 and peak_loc1 values I did a little extra operation:
import numpy as np
from scipy.signal import find_peaks
def findpeaks(arr, h, w=1, d=1):
    pos = find_peaks(arr, height=h, width=w, distance=d)
    pos_list = dict(zip(pos[0], pos[1]['peak_heights']))
    neg = find_peaks(arr * -1, height=h, width=w, distance=d)
    neg_list = dict(zip(neg[0], neg[1]['peak_heights'] * -1))
    full_list = {**pos_list, **neg_list}
    full_list = dict(sorted(full_list.items()))
    heights = list(full_list.values())
    indexes = list(full_list.keys())
    return heights, indexes
However the results I get do not match 1:1 the results in Matlab.
To make the difference easier to see please take a look at this screenshot where I put the values side by side. On the left you have matlab's values and on the right Python's values.
values comparison
Values in white are identical. However, at some point there is weirdness going on where Matlab's corresponding values are shifted one position later. I highlighted the matching values with colors. And then there are two values (in red) which do not have any correspondence in the ones calculated by Python.
This is my Python code:
Please note data is a simple np.array but it has around 12K values in it, so you can look at it here: https://pastebin.pl/view/d3d5dba4
data = *
peak_height = 0.4455
a1, peak_loc1 = findpeaks(data, peak_height)
What could be the reason? The only thing that came to my mind is that there is some parameter missing, i.e. what would correspond to "DoubleSided"? 'MinPeakHeight' should correspond to the peak_height parameter I'm already passing in Python.
Based on the used parameter DoubleSided and your screenshot I'm assuming that you are actually using Octave's findpeaks rather than Matlab's! If you read that parameter's description:
"DoubleSided" - Tells the function that data takes positive and negative values. The base-line for the peaks is taken as the mean value of the function. This is equivalent as passing the absolute value of the data after removing the mean.
So using DoubleSided will actually manipulate absolute values in data before checking for MinPeakHeight and will therefore change which peaks fall below or above the height 0.4455. We can imitate this in Python (I took the liberty of simplifying/speeding up your function):
def findpeaks(arr, h, w=1, d=1):
    # Adjust baseline to emulate 'DoubleSided'
    adjusted = arr - arr.mean()
    # Only consider absolute value to find local maxima AND minima
    adjusted = abs(adjusted)
    indexes, _ = find_peaks(adjusted, height=h)
    # Retrieve extrema heights from original signal rather
    # than from the properties dict with adjusted heights
    heights = arr[indexes]
    return heights, indexes
a1, peak_loc1 = findpeaks(data, h=0.4455)
for h, i in zip(a1, peak_loc1):
    # For display purpose, adjust index by +1
    # to match Octave/Matlab's indexing
    print(f"{h:>9.5f} | {i + 1:>5}")
Running this with your data will print values that look identical to Octave's output (at least I didn't find a difference).
I want to do an integration without knowing the functional equation f(x). I only have discrete values, which Python has connected in a plot. It looks like this:
This is the code with the calculation for it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld
# Loading of the values
dataListStride = ld.loadData("../Projektpraktikum Binder/Data/1 Fabienne/Test1/left foot/50cm")
indexStrideData = 0
strideData = dataListStride[indexStrideData]
#%%Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
    a = ((m.cos(m.radians(yAngle)))*yAcceleration)-((m.sin(m.radians(yAngle)))*xAcceleration)
    return a
resultsHorizontal = list()
for i in range (len(strideData)):
    strideData_yAngle = strideData.to_numpy()[i, 2]
    strideData_xAcceleration = strideData.to_numpy()[i, 4]
    strideData_yAcceleration = strideData.to_numpy()[i, 5]
    resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)
print("The values are: " +str(resultsHorizontal))
print("There are " +str(len(resultsHorizontal)) + " values.")
#x-axis "convert" into time: 100 Hertz makes 0.01 seconds
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor
plt.plot(x_values, resultsHorizontal)
After the calculation I get a list of these values (which were shown and plotted in the diagram above):
Note about the code:
The code works as follows: using loaddataa.py, a CSV file is read in. Then the formula for the calculation of the horizontal acceleration is defined, which is represented in def horizontal(yAngle, yAcceleration, xAcceleration). In the for loop, the previously determined list is run through line by line. Columns 2, 4 and 5 of the CSV file are used here. Then a 0 is added to the beginning of the resulting list of values. This is important to perform the integration from 0 to the end.
Now I want to integrate this function (which is represented in the plot at the top) with these values (which can be seen in the image after the code) after the calculation.
Is there a way to implement this? If so, how, and what would the plot look like? Maybe there is a way to do this with a trapezoidal integration? Thanks for helping me!
At the end of my task I want to do a double integration with the acceleration values to get the path length. The first (trapezoidal) integration of the acceleration should represent the velocity and the second (trapezoidal) integration the path length (location). The x-axis should remain as it is.
What I just noticed are the negative values. Theoretically the integration should always result in positive values, right? Because there are no negative areas.
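Not part of the original post, but a minimal sketch of how the double (cumulative) trapezoidal integration could look with SciPy, reusing resultsHorizontal and x_values from the code above; scipy.integrate.cumulative_trapezoid is named cumtrapz in older SciPy versions:

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import cumulative_trapezoid  # cumtrapz in older SciPy

# Cumulative trapezoidal integration: acceleration -> velocity -> position.
# initial=0 keeps the output the same length as the input and starts it at 0.
acceleration = np.asarray(resultsHorizontal)
velocity = cumulative_trapezoid(acceleration, x_values, initial=0)
position = cumulative_trapezoid(velocity, x_values, initial=0)

plt.plot(x_values, velocity, label="velocity (1st integration)")
plt.plot(x_values, position, label="position (2nd integration)")
plt.legend()
plt.show()

# Note: negative acceleration values are expected; the cumulative integral can
# decrease wherever the signal is negative (it is a signed area, not an absolute one).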
The Problem:
Using NumPy, I have created an array of random points within a range.
import numpy as np
min_square = 5
positions = (np.random.random(size=(100, 2)) - 0.5) * 2 * container_radius
Where container_radius is an integer and min_square is an integer.
Following that, using matplotlib, I plot the points on a graph.
import matplotlib.pyplot as plt
plt.plot(positions[:, 0], positions[:, 1], 'r.')
plt.show()
This graph shows me the distribution of the points in relation to each other.
What I am looking for is a method to implement something similar to or exactly a k-d tree to draw a rectangle over the densest area of the scatter plot with a defined minimum for the size.
This would be done using plt.gca().add_patch(plt.Rectangle((x, y), width=square_side, height=square_side, fill=None)), where square_side is defined by the density function and is at least a minimum size of min_square.
Attempts to Solve the Problem:
So far, I have created my own sort of density function that is within my understanding of Python and easy enough to code without lagging my computer too hard.
The solution comes in the form of creating an additional predefined variable, intervals, which is an integer.
Using what I had so far, I define a function to calculate the densities by checking if the points are within a range of floats.
# clb stands for calculate_lower_bound
def clb(x):
    return -1 * container_radius + (x * 2 * container_radius - min_square) / (intervals - 1)
# crd stands for calculate_regional_density
def crd(x, y):
    return np.where(np.logical_and(
        np.logical_and(positions[:, 0] >= clb(x), positions[:, 0] < clb(x) + min_square),
        np.logical_and(positions[:, 1] >= clb(y), positions[:, 1] < clb(y) + min_square)))[0].shape[0]
Then, I create a NumPy array of size (intervals, intervals) and pass the indices of the array (I have another question about this as I am currently using a quite inefficient method) as inputs into crd(x,y), storing the values in another array called densities. Then, using some method, I calculate the maximum value in my densities array and draw the rectangle using some pretty straightforward code that I do not think is necessary to include here as it is not the problem.
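For reference, a minimal sketch of that grid approach (not the final, optimized version; it assumes intervals, container_radius, positions, min_square and the clb/crd functions above are already defined):

import numpy as np
import matplotlib.pyplot as plt

# Evaluate the density of every candidate square on the (intervals x intervals) grid
densities = np.zeros((intervals, intervals), dtype=int)
for ix in range(intervals):
    for iy in range(intervals):
        densities[ix, iy] = crd(ix, iy)  # number of points inside that candidate square

# Grid cell with the highest count and its lower-left corner in data coordinates
ix_max, iy_max = np.unravel_index(densities.argmax(), densities.shape)
x0, y0 = clb(ix_max), clb(iy_max)

# Draw the (minimum-size) square over the densest region
plt.gca().add_patch(plt.Rectangle((x0, y0), min_square, min_square, fill=None))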
What I Am Looking For:
I am looking for some function, f(x), that computes the dimensions and coordinates of a square encompassing the densest region on a scatterplot graph. The function would have access to all the variables it needs such as positions, min_square, etc. If you could use informative variable names or explain what each variable means, that would be a great help as well.
Other (Potentially) Important Notes:
I am looking for something that gets the job done in a reasonable time. In most scenarios, I am going to be working with around 10000 points, and I need to calculate the densest region around 100 times, so the function needs to be efficient enough that the task completes within around 10-20 seconds.
As such, approximations using formulas like the example I have shown are completely valid, as long as they can be implemented well and are able to grow the dimensions of the square if necessary.
Thanks!
TL;DR: Question: Is there a fast way to interpolate a scattered 2D-dataset at specific coordinates?
And if so, could someone provide an example with the provided sample data and the variables used in "Current Solution" (as I'm apparently too stupid to implement it myself)?
Problem:
I need to interpolate (and if possible also extrapolate) a DataFrame (size = (34, 18)) of scattered data at specific coordinate points. The DataFrame always stays the same.
The interpolation needs to be fast, as it is done more than 10,000 times in a loop.
The coordinates at which it will be interpolated are not known in advance, as they change every loop.
Current Solution:
def Interpolation(a, b):
    #import external modules
    import pandas as pd
    from scipy import interpolate

    #reading .xlsx file into DataFrame
    file = pd.ExcelFile(file_path)
    mr_df = file.parse('Model_References')
    matrix = mr_df.set_index(mr_df.columns[0])

    #interpolation at specific coordinates
    matrix = matrix.stack().reset_index().values
    value = interpolate.griddata(matrix[:, 0:2], matrix[:, 2], (a, b), method='cubic')
    return value
This method is not acceptable for long-term use, as the two lines of code under #interpolation at specific coordinates alone account for more than 95% of the execution time.
My Ideas:
scipy.interpolate.Rbf seems like the best solution if the data needs to be interpolated and extrapolated, but to my understanding it only creates a finer mesh of the existing data and cannot output an interpolated value at specific coordinates (see the sketch after this list).
creating a smaller 4x4 matrix of the area around the specific coordinates (a,b) would maybe decrease the execution time per loop, but I struggle with how to use griddata with the smaller matrix. I created a 5x5 matrix with the first row and column being the indexes and the other 4x4 entries being the data, with the specific coordinates in the middle.
But I get a TypeError: list indices must be integers or slices, not tuple which I do not understand as I did not change anything else.
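Regarding the first idea above: as far as I know, Rbf can in fact be evaluated at arbitrary coordinates. A rough sketch, reusing the stacked matrix array from the current solution (whether it is fast enough here is a separate question):

from scipy.interpolate import Rbf

# matrix is the stacked (x, y, value) array from the current solution
rbf = Rbf(matrix[:, 0], matrix[:, 1], matrix[:, 2], function='cubic')
value = rbf(a, b)   # interpolated (or extrapolated) value at point (a, b)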
Sample Data:
          0.0   0.1   0.2   0.3
0.0      -407  -351  -294  -235
0.0001   -333  -285  -236  -185
0.0002   -293  -251  -206  -161
0.00021  -280  -239  -196  -151
Thanks to #Jdog's comment I was able to figure it out:
The creation of a spline once before the loop with scipy.interpolate.RectBivariateSpline and the read out of specific coordinates with scipy.interpolate.RectBivariateSpline.ev decreased the execution time of the interpolation from 255s to 289ms.
def Interpolation(mesh, a, b):
    #interpolation at specific coordinates
    value = mesh.ev(a, b)
    return value

#%%
#import external modules
import pandas as pd
from scipy import interpolate

#reading .xlsx file into DataFrame
file = pd.ExcelFile(file_path)
mr_df = file.parse('Model_References')
matrix = mr_df.set_index(mr_df.columns[0])

mesh = interpolate.RectBivariateSpline(a_index, b_index, matrix)

for iterations in loop:
    value = Interpolation(mesh, a, b)
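Note (not part of the original answer): a_index and b_index are not defined above; presumably they are the strictly increasing row and column coordinates of the DataFrame, for example something like:

# Assumption: the DataFrame's index holds the first coordinate and its column
# labels the second; RectBivariateSpline needs both to be strictly increasing.
a_index = matrix.index.to_numpy(dtype=float)
b_index = matrix.columns.to_numpy(dtype=float)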
Recently I've been asked to find instances of a given pattern over a function (value over time), but I'm not sure about how to face the problem.
For example if the following case was given, and the time interval selected was [0,1], I would like to find all the instances of that shape, even if it's not exactly equal (emulating the human's eye behaviour):
Preferably I would like to code it in Python so any suggestions about libraries and/or frameworks that can be helpful, (of course also known methods and algorithms) will be very much appreciated.
Thanks
A rather trivial approach could be to take the given pattern and slide it across the data as a window, finding the difference between the pattern and the data under it. This would only be accurate if the shapes were always the same size as well as the same shape.
demo..
set up the data:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0,200,200)
y = np.zeros_like(x)
def addpeak(pos, y): #clipped triangular peak centered at pos (clipped at height 5, 20 wide at the base)
    y += np.array([np.clip(10 - abs(pos - xi), 0, 5) for xi in range(len(y))])
    return y
y = addpeak(15,y)
y = addpeak(40,y)
y = addpeak(125, y)
y = addpeak(100, y)
y = addpeak(180, y)
plt.plot(x,y) #visualize data
then take the sliding window difference
window = y[5:25] #first peak is sliding window
#you could take different difference formulas than simply linear
difference = np.array([sum(window - y[i:i+20]) for i in range(len(y) - 20)])
plt.plot(x[:-20], difference) #note minimum difference might be offset based on window indexing
#pick your favorite way to find local minima
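One possible way to do that last step (a sketch, not part of the original answer) is scipy's argrelmin:

from scipy.signal import argrelmin

# Local minima of the difference curve are the candidate match positions
match_idx = argrelmin(difference, order=10)[0]
print(match_idx)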
You could use something like numpy (python numpy/scipy curve fitting) to inspect the points and fit a curve over the interval [0,1]. From that, you could apply an offset along the x-axis to see if the curve 'fits' any other parts of the data.
For example, from [1,2] it would be Offset: -1. Without having an example of code above, it's hard to go into precisely how to do it, but hope this is helpful.
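A rough sketch of that idea (not from the original answer; the quadratic model and the interval masks below are made-up placeholders, and x, y are assumed to hold the sampled times and values):

import numpy as np
from scipy.optimize import curve_fit

# Fit a simple template to the segment over [0, 1], then compare that fitted
# shape against the data at an x-axis offset.
def model(x, a, b, c):
    return a * x**2 + b * x + c   # hypothetical quadratic template

mask = (x >= 0) & (x <= 1)                  # the reference interval [0, 1]
params, _ = curve_fit(model, x[mask], y[mask])

# Compare the template against the data on [1, 2] using Offset: -1
offset = -1
shifted = (x >= 1) & (x <= 2)
residual = np.mean((y[shifted] - model(x[shifted] + offset, *params)) ** 2)
print(residual)   # a small residual suggests the shape repeats on [1, 2]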