I am working on some code optimization. Currently I use the pandas rolling_mean and rolling_std to compute normalized cross-correlations of time series data from seismic instruments. For non-pertinent technical reasons I am only interested in every Nth value of the output of these pandas rolling mean and rolling std calls, so I am looking for a way to only compute every Nth value. I may have to write Cython code to do this but I would prefer not to. Here is an example:
import pandas as pd
import numpy as np
As = 5000  # array size
ws = 150   # moving window size ("as" is a Python keyword, so it can't be used as a name)
N = 3      # only interested in every Nth value of the output array
ar = np.random.rand(As)  # generate a generic random array
RSTD = pd.rolling_std(ar, ws)[ws-1:]  # don't return the NaNs before the windows fully overlap
foo = RSTD[::N]  # use array indexing to decimate RSTD, returning only every Nth value
Is there a good pandas way to only calculate every Nth value of RSTD rather than calculate all the values and decimate?
Thanks
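A sketch of one way to do this (not from the original thread): build a zero-copy strided view of the array and compute the std over only the selected windows, so nothing is calculated for the skipped positions. This assumes numpy >= 1.20 for sliding_window_view, and uses ddof=1 to match pandas' rolling std.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def every_nth_rolling_std(ar, ws, N, ddof=1):
    """Rolling std over windows of length ws, evaluated only at every Nth window.

    sliding_window_view returns a view (no copy), so slicing it with [::N]
    means the std is actually computed only for the windows we keep.
    """
    windows = sliding_window_view(ar, ws)[::N]
    return windows.std(axis=1, ddof=ddof)

ar = np.random.rand(5000)
decimated = every_nth_rolling_std(ar, 150, 3)
```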
Related
I have the following looking correlation function.
I want to extract only the main peak of the function into a separate array. The central peak has the form of a Gaussian. I want to separate the peak with a width around it of approximately four times the FWHM of the Gaussian. I have the correlation function stored in a numpy array. Any tips/ideas on how to approach this?
Numpy's argmax (Docs) function returns the index of the maximum value of a numpy array. With that index you can then slice out the values around the peak.
Example:
import numpy
m = numpy.argmax(arr)  # index of the peak
values = arr[max(m - width, 0):m + width]  # width samples on either side (start clamped so it can't go negative)
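The width above is left undefined; since the question asks for roughly four times the FWHM, one possible sketch (an assumption, not from the original answer: it presumes a single positive peak on a baseline near zero) is to estimate the FWHM from the half-maximum crossings around the argmax:

```python
import numpy as np

def extract_main_peak(arr, factor=4):
    """Slice out the central peak: estimate its FWHM from the half-maximum
    crossings around the argmax, then keep factor * FWHM samples around it."""
    m = int(np.argmax(arr))
    half = arr[m] / 2.0
    # walk outwards from the peak until the signal drops below half maximum
    left = m
    while left > 0 and arr[left] > half:
        left -= 1
    right = m
    while right < len(arr) - 1 and arr[right] > half:
        right += 1
    fwhm = right - left
    w = (factor * fwhm) // 2
    return arr[max(m - w, 0):m + w + 1]
```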
I want to calculate the rolling weighted mean of a time series and the average to be calculated over a specific time interval. For example, this calculated the rolling mean with a 90-day window (not weighted):
import numpy as np
import pandas as pd
data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)
df = df.rolling("90D").mean()
However, when I apply a weighting function (line below) I get an error: "ValueError: Invalid window 90D"
df = df.rolling("90D", win_type="gaussian").mean(std=60)
On the other hand, the weighted average works if I make the window an integer instead of an offset:
df = df.rolling(90, win_type="gaussian").mean(std=60)
Using an integer does not work for my application since the observations are not evenly spaced in time.
Two questions:
Can I do a weighted rolling mean with an offset window (e.g. "90D" or "3M")?
If I can, what does std refer to when I specify window="90D" and win_type="gaussian"? Does it mean the std is 60 days?
Okay, I discovered that it's not implemented in pandas yet.
Look here:
https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/core/window.py
If you follow line 2844 you see that when win_type is not None a Window object is returned:
if win_type is not None:
    return Window(obj, win_type=win_type, **kwds)
Then check the validate method of the Window object at line 630: it only allows integer or list-like windows.
I think this is because pandas uses the scipy.signal library, which receives a plain array, so it cannot take into account how your data is distributed over time.
You could implement your own weighting function and use apply, but its performance won't be great.
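Not from the original answer, but a minimal sketch of that apply route: with raw=False the function receives each window as a Series carrying its DatetimeIndex, so the weights can be computed from the actual timestamps. The 60-day std here is an assumed parameter, as is the choice to weight by distance from the newest sample in the window.

```python
import numpy as np
import pandas as pd

def time_gaussian_mean(window, std_days=60.0):
    """Gaussian-weighted mean of one rolling window, weighted by each
    sample's age in days relative to the window's newest sample."""
    age = (window.index[-1] - window.index).total_seconds() / 86400.0
    weights = np.exp(-0.5 * (age / std_days) ** 2)
    return np.average(window.to_numpy(), weights=weights)

index = pd.date_range("20190101", periods=100, freq="18h")
s = pd.Series(np.random.randint(0, 1000, 100).astype(float), index=index)
# raw=False hands each window to the function as a Series (slow, but time-aware)
weighted = s.rolling("90D").apply(time_gaussian_mean, raw=False)
```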
It is not clear to me what you want the weights in your weighted average to be, but is the weight a measure of the time for which an observation is 'in effect'?
If so, I believe you can re-index the dataframe so it has regularly-spaced observations. Then fill NAs appropriately - see method in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
That will allow rolling to work, and it will also help you think explicitly about how missing observations are treated; for instance, should a missing sample take its value from the last valid sample or from the nearest sample?
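A minimal sketch of that reindexing idea (not from the original answer; the daily grid, the nearest-fill choice, and the 30-sample window with std=10 are all assumed values):

```python
import numpy as np
import pandas as pd

# irregularly spaced observations (18h steps, so not aligned to days)
index = pd.date_range("20190101", periods=100, freq="18h")
df = pd.DataFrame(np.random.randint(0, 1000, (100, 3)), index=index)

# re-index onto a regular daily grid, taking each day's nearest observation
regular = pd.date_range(df.index.min().normalize(), df.index.max(), freq="D")
df_regular = df.reindex(regular, method="nearest")

# on a regular grid an integer window is a fixed time span (30 days here),
# so the Gaussian-weighted rolling mean works; std is in window samples
smoothed = df_regular.rolling(30, win_type="gaussian").mean(std=10)
```

Note that win_type="gaussian" requires scipy to be installed, since pandas delegates the window shape to scipy.signal.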
I am reading a CSV file in Python and preparing a dataframe out of it. I have a Microsoft Kinect which records an arm-abduction exercise and generates this CSV file.
I have an array of the Y-coordinates of the ElbowLeft joint. You can visualize it here. Now, I want to come up with a solution that can count the number of peaks or local maxima in this array.
Can someone please help me to solve this problem?
You can use the find_peaks_cwt function from the scipy.signal module to find peaks within 1-D arrays:
from scipy import signal
import numpy as np
y_coordinates = np.array(y_coordinates)  # convert your 1-D array to a numpy array, if it isn't one already
max_peak_width = 10  # widest peak (in samples) you expect; tune this for your data
peak_widths = np.arange(1, max_peak_width)
peak_indices = signal.find_peaks_cwt(y_coordinates, peak_widths)
peak_count = len(peak_indices)  # the number of peaks in the array
More information here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks_cwt.html
It's easy: put the data in a 1-D array and compare each value with its neighbors; a value at index n is a peak if the values at n-1 and n+1 are both smaller.
Read the data as Robert Valencia suggests, then:
max_local = 0
for u in range(1, len(data) - 1):
    if data[u] > data[u - 1] and data[u] > data[u + 1]:
        max_local = max_local + 1
You could try to smooth the data with a smoothing filter and then find all values where the values before and after are less than the current value. This assumes you want all peaks in the sequence. The reason you need the smoothing filter is to avoid counting spurious local maxima caused by noise. The level of smoothing required will depend on the noise present in your data.
A simple smoothing filter sets the current value to the average of the N values before and N values after the current value in your sequence along with the current value being analyzed.
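A rough sketch of that approach (not from the original answer; the half-width n=5 is an arbitrary assumption to be tuned to the noise level):

```python
import numpy as np

def count_peaks(data, n=5):
    """Smooth with a centered moving average of 2*n+1 samples, then count
    samples strictly greater than both of their neighbors."""
    kernel = np.ones(2 * n + 1) / (2 * n + 1)
    smoothed = np.convolve(data, kernel, mode="same")
    return int(np.sum((smoothed[1:-1] > smoothed[:-2]) &
                      (smoothed[1:-1] > smoothed[2:])))

t = np.linspace(0, 4 * np.pi, 400)
print(count_peaks(np.sin(t)))  # -> 2 (one peak per period)
```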
Hi, I have an example data frame below. I want to add up all the off-diagonal values to calculate a misclassification score, i.e. 1 minus the misclassification rate. How do I add up all the off-diagonal values?
I have tried this code.
1-(my_LINEARSVC_cross.ix[0,4]+my_LINEARSVC_cross.ix[4,0])/np.sum(my_LINEARSVC_cross.values)
How can I amend this to add all the off-diagonal values?
Simply compute the sum of the whole matrix minus its trace (the sum of the diagonal), which in numpy would be
m.sum() - m.trace()
so if it's a pandas frame you can convert it:
import numpy as np
m = np.array(my_LINEARSVC_cross)
print(m.sum() - m.trace())
or go through .values:
print(my_LINEARSVC_cross.values.sum() - my_LINEARSVC_cross.values.trace())
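Not from the original answer, but a small self-contained example of the full calculation on a made-up confusion matrix (the numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical 3-class confusion matrix: rows = true class, cols = predicted
cm = pd.DataFrame([[50, 2, 3],
                   [4, 40, 1],
                   [0, 5, 45]])

m = cm.to_numpy()
off_diagonal = m.sum() - m.trace()           # all misclassified samples
misclassification_rate = off_diagonal / m.sum()
accuracy = 1 - misclassification_rate        # fraction classified correctly
print(off_diagonal, accuracy)  # -> 15 0.9
```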
I have an array of some arbitrary data x and associated timestamps t that correspond to the data in x (they are the same length N).
I want to downsample my data x to a smaller length M < N, such that the new data is roughly equally spaced in time (by using the timestamp information). This would be instead of simply decimating the data by taking every nth datapoint. Using the closest time-neighbor is fine.
scipy has some resampling code, but it actually tries to interpolate between data points, which I cannot do for my data. Does numpy or scipy have code that does this?
For example, suppose I want to downsample the letters of the alphabet according to some logarithmic time:
import string
import numpy as np
x = list(string.ascii_lowercase)  # string.lowercase is Python 2 only
t = np.logspace(1, 10, num=26)
y = downsample(x, t, 8)
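Not from the original question: a sketch of such a downsample function using np.searchsorted, picking for each of M evenly spaced target times the sample whose timestamp is nearest (ties go to the earlier sample). Note that picks can repeat if several targets share a nearest sample.

```python
import numpy as np

def downsample(x, t, M):
    """Pick M samples of x whose timestamps t (sorted ascending) are
    closest to M evenly spaced target times. No interpolation."""
    x = np.asarray(x)
    t = np.asarray(t, dtype=float)
    targets = np.linspace(t[0], t[-1], M)
    right = np.searchsorted(t, targets)       # first timestamp >= target
    right = np.clip(right, 1, len(t) - 1)
    left = right - 1
    idx = np.where(targets - t[left] <= t[right] - targets, left, right)
    return x[idx]

import string
x = np.array(list(string.ascii_lowercase))
t = np.logspace(1, 10, num=26)
print("".join(downsample(x, t, 8)))
```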
I'd suggest using pandas, specifically the resample function:
Convenience method for frequency conversion and resampling of regular time-series data.
Note the how parameter in particular.
You can convert your numpy array to a DataFrame:
import pandas as pd
YourPandasDF = pd.DataFrame(YourNumpyArray)
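A brief sketch of how that could look (not from the original answer; in current pandas the old how= argument is replaced by calling an aggregation such as .mean() or .nearest() on the resampler):

```python
import numpy as np
import pandas as pd

# hypothetical irregularly spaced samples, timestamps in seconds
rng = np.random.default_rng(0)
t = np.cumsum(rng.uniform(0.5, 2.0, 100))
x = rng.standard_normal(100)

df = pd.DataFrame({"x": x}, index=pd.to_datetime(t, unit="s"))

# downsample to one row per 10 seconds, taking the nearest original sample
down = df.resample("10s").nearest()
```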