I have a vibration signal that I want to smooth using a rolling root mean square (RMS) with a window of 21 days. The data is at minute resolution, so a rolling window of 21 days corresponds to 21*1440 samples [21*24*60].
Is there any approach like:
# Dummy approach
df['Rolling_rms'] = df['signal'].rolling(21*1440).rms()
I am currently using a for loop, which is far too time consuming:
# Function for calculating RMS
def rms_calc(ser):
    return np.sqrt(np.mean(ser**2))

for i in range(0, len(signal)):
    j = 21*1440 + i
    print(rms_calc(df[signal][i:j]))
You can use the method apply with a custom function:
df['signal'].pow(2).rolling(21*24*60).apply(lambda x: np.sqrt(x.mean()))
Further to #mykola-zotko's answer:
there is a mean method for the rolling object, which would speed this up considerably.
I'd still use the apply method to get the square root; however, passing the raw=True parameter should also speed up the calculation.
In full:
df['signal'].pow(2).rolling(21*24*60).mean().apply(np.sqrt, raw=True)
Alternate method
If you want to zero-mean your data windows before calculating the RMS (which I believe is common in vibration analysis), then the calculation will be mathematically equivalent to calculating the rolling standard deviation. In that case, you can also just use the std method for the rolling object:
df['signal'].rolling(21*24*60).std(ddof=0)
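As a quick sanity check of that equivalence on toy data (the series and the 100-sample window below are made up; the question's window would be 21*1440):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=10_000))
w = 100  # small window for a quick check

rolling_rms = s.pow(2).rolling(w).mean().apply(np.sqrt, raw=True)
zero_mean_rms = s.rolling(w).apply(lambda x: np.sqrt(np.mean((x - x.mean()) ** 2)), raw=True)
rolling_std = s.rolling(w).std(ddof=0)

# zero_mean_rms matches rolling_std; rolling_rms only matches when each window
# already has (close to) zero mean
print(np.allclose(zero_mean_rms.dropna(), rolling_std.dropna()))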
Related
I want to calculate the rolling weighted mean of a time series, with the average calculated over a specific time interval. For example, this calculates the rolling mean with a 90-day window (not weighted):
import numpy as np
import pandas as pd
data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)
df = df.rolling("90D").mean()
However, when I apply a weighting function (line below) I get an error: "ValueError: Invalid window 90D"
df = df.rolling("90D", win_type="gaussian").mean(std=60)
On the other hand, the weighted average works if I make the window an integer instead of an offset:
df = df.rolling(90, win_type="gaussian").mean(std=60)
Using an integer does not work for my application since the observations are not evenly spaced in time.
Two questions:
Can I do a weighted rolling mean with an offset (e.g. "90D" or "3M")?
If I can, what does std refer to when I specify window="90D" and win_type="gaussian"? Does it mean the std is 60D?
Okay, I discovered that it's not implemented yet in pandas.
Look here:
https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/core/window.py
If you follow line 2844 you see that when win_type is not None a Window object is returned:
if win_type is not None:
    return Window(obj, win_type=win_type, **kwds)
Then check the validate method of the Window object at line 630: it only allows integer or list-like windows.
I think this is because pandas uses the scipy.signal library, which receives an array, so it cannot take into account how your data is distributed over time.
You could implement your own weighting function and use apply, but its performance won't be great.
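For illustration, a rough sketch of that idea: with raw=False, rolling apply passes each window as a Series carrying its DatetimeIndex, so the weights can be computed from the actual timestamps. The function name, the 60-day std and the column name ser are placeholders of mine, not pandas API, and this will be slow on large data:

import numpy as np
import pandas as pd

def gaussian_time_weighted_mean(window, std_days=60):
    # window arrives as a Series with a DatetimeIndex because raw=False is used
    age_days = (window.index[-1] - window.index).total_seconds() / 86400.0
    weights = np.exp(-0.5 * (age_days / std_days) ** 2)
    return np.average(window.to_numpy(), weights=weights)

# ser is assumed to be one column of the question's df (irregular DatetimeIndex)
# smoothed = ser.rolling("90D").apply(gaussian_time_weighted_mean, raw=False)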
It is not clear to me what you want the weights in your weighted average to be, but is the weight a measure of the time for which an observation is 'in effect'?
If so, I believe you can re-index the dataframe so it has regularly-spaced observations, then fill NAs appropriately; see the method argument in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
That will allow rolling to work, and it also helps you think explicitly about how missing observations are treated: for instance, should a missing sample take its value from the last valid sample or from the nearest sample?
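A minimal sketch of that reindexing idea, assuming df has a DatetimeIndex as in the question; the 6-hour grid, the forward-fill choice and the Gaussian std are illustrative only (std here is in samples, so 240 samples = 60 days):

import pandas as pd

regular_index = pd.date_range(df.index.min(), df.index.max(), freq="6H")
regular = df.reindex(regular_index, method="ffill")   # or method="nearest"

# 90 days at 6-hour spacing is 360 samples; win_type="gaussian" now works
smoothed = regular.rolling(360, win_type="gaussian").mean(std=240)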
Let's say I have a set of data points called signal and I want to integrate it twice with respect to time (i.e., if signal was acceleration, I'd like to integrate it twice w.r.t. time to get the position). I can integrate it once using simps but the output here is a scalar. How can you numerically integrate a (random) data set twice? I'd imagine it would look something like this, but obviously the inputs are not compatible after the first integration.
import numpy as np
from scipy.integrate import simps

n_samples = 5000
t_range = np.arange(float(n_samples))
signal = np.random.normal(0., 1., n_samples)
signal_integration = simps(signal, t_range)
signal_integration_double = simps(simps(signal, t_range), t_range)  # fails: the inner simps returns a scalar
Any help would be appreciated.
Sorry, I answered too fast. scipy.integrate.simps gives the value of the integration over the whole range you give it, similar to np.sum(signal).
What you want is the integration between the start and each data point, which is what cumsum does. A better method is scipy.integrate.cumtrapz. You can apply either method twice to get the result you want.
See:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.cumtrapz.html
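For example, a minimal sketch of applying cumtrapz twice, reusing the arrays from the question (initial=0 keeps the output the same length as the input):

import numpy as np
from scipy.integrate import cumtrapz

n_samples = 5000
t_range = np.arange(float(n_samples))
signal = np.random.normal(0., 1., n_samples)

velocity = cumtrapz(signal, t_range, initial=0)   # first integral at every sample
position = cumtrapz(velocity, t_range, initial=0) # second integral at every sample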
Original answer:
I think you want np.cumsum. Integration of discrete data is just a sum. You have to multiply the result by the step value to get the correct scale.
See https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.cumsum.html
Integrating by parts, you get from y'' = f to
y(t) = y(0) + y'(0)*t + integral from 0 to t of (t-s)*f(s) ds
As you seem to assume that y(0) = 0 and also y'(0) = 0, you can thus get the desired integral value in one integration as
simps((t-t_range)*signal, t_range)
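Here t is presumably the evaluation time, e.g. t = t_range[-1] for the value at the end of the record. A quick numerical cross-check against the cumulative approach, done on a smooth test signal of my own (for very noisy data the two quadrature rules agree less closely):

import numpy as np
from scipy.integrate import simps, cumtrapz

t_range = np.arange(5000.0)
smooth = np.sin(0.01 * t_range)          # smooth test signal instead of noise

direct = simps((t_range[-1] - t_range) * smooth, t_range)
twice = cumtrapz(cumtrapz(smooth, t_range, initial=0), t_range, initial=0)[-1]
print(direct, twice)                     # close for smooth data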
I have a long pandas time series like this:
2017-11-27 16:19:00 120.0
2017-11-30 02:40:35 373.4
2017-11-30 02:40:42 624.5
2017-12-01 14:15:31 871.8
2017-12-01 14:15:33 1120.0
2017-12-07 21:07:04 1372.2
2017-12-08 06:11:50 1660.0
2017-12-08 06:11:53 1946.7
2017-12-08 06:11:57 2235.3
2017-12-08 06:12:00 2521.3
....
dtype: float64
and I want to plot it together with its derivative. Following the definition, I calculate the derivative in this manner:
numer=myTimeSeries.diff()
denominat=myTimeSeries.index.to_series().diff().dt.total_seconds()/3600
derivative=numer/denominat
Because some values of the time delta (that is, denominat) are very close to zero (or sometimes equal to zero), I get some inf values in my derivative. Practically I got this:
[Plot: time series in blue (left scale), derivative in green (right scale)]
Now I would like to smooth the derivative to make it more readable. I tried different approaches:
calculating the differences over longer periods, i.e. setting periods=5 for both numer and denominat;
using a moving average: smotDeriv = derivative.rolling(window=10, min_periods=3, center=True, win_type='boxcar').mean(); I also tried different window types without any useful change;
clipping the values, but I don't know which values to use as min and max; I tried the 25% and 75% quantiles without any great advantage.
I also tried to use a Kalman filter, using pykalman:
import pandas as pd
from pykalman import KalmanFilter

derivative.fillna(0, inplace=True)
kf = KalmanFilter(initial_state_mean=0)
state_means, _ = kf.filter(derivative.values)
state_means = state_means.flatten()
indexDate = derivative.index
derivativeKalman = pd.Series(state_means, index=indexDate)
to get this: [plot of the Kalman-filtered derivative]
Practically I cannot find any useful improvement. What can you suggest to improve the readability of the derivative plot, if that is possible? Obviously I would cut some peaks of the derivative to obtain a smoothed curve that approximates the true values. I have tried different combinations of window types, periods, etc., without any result. About the Kalman filter, I'm not an expert (let's say a newbie), so I just used default values following this. I have also found the filterpy library, which implements the Kalman filter, but I have not found how to use it without setting the starting parameters.
If your goal is to remove "outlier" spikes in the derivative series, I would first try a rolling median instead of a rolling mean, since the median is generally less sensitive to outliers.
For example:
smotDeriv = derivative.rolling(window=10, min_periods=3, center=True).median()
Then, if you want to smooth it out further, one possible option is to apply a rolling mean on top of the rolling median, as sketched below.
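For example, reusing the derivative series and the window sizes from the question (both are only illustrative):

smoothed = (derivative
            .rolling(window=10, min_periods=3, center=True).median()
            .rolling(window=10, min_periods=3, center=True).mean())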
Note: since I don't have your data at hand to play around with, I'm not sure about the optimal values for window and min_periods; it depends on how far you want to smooth things out. Also, it seems to me that smoothing the derivative becomes much like smoothing the original time series, so if there is a known way to smooth your original time series, that may be more straightforward.
Hope this helps.
We know that the derivative of a function can be defined as below:
f'(x) = lim_{h -> 0} (f(x + h) - f(x - h)) / (2h)
Let's assume that the derivative of your function is defined everywhere. When h is very small you get a better approximation of the derivative, and when h is very large you get a worse one.
There is a problem with applying this approach to your dataset. Sometimes h becomes so small that it gives an absurdly high gradient value; sometimes h is so large that the gradient estimate is very poor. To overcome this, let's define two time thresholds, t1 and t2. If the time difference between successive samples lies between t1 and t2, we use that point to estimate the gradient with the formula for f'(x) above; if it falls outside these thresholds, we ignore that point.
How do we compute the gradient for the rest of the points?
We can fit a polynomial based on the points that we found in the previous step.
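A rough sketch of that idea in code, using myTimeSeries from the question; the thresholds t1, t2 and the polynomial degree are placeholders of mine to tune, not values from the answer:

import numpy as np
import pandas as pd

t1, t2 = 1.0, 3600.0                        # thresholds in seconds (illustrative)

t_sec = (myTimeSeries.index - myTimeSeries.index[0]).total_seconds()
dt = myTimeSeries.index.to_series().diff().dt.total_seconds()
grad = myTimeSeries.diff() / (dt / 3600.0)  # units per hour, as in the question

# keep only gradients whose time step lies within the thresholds
mask = (dt >= t1) & (dt <= t2)

# fit a low-order polynomial to the reliable gradients and evaluate it everywhere
coeffs = np.polyfit(t_sec[mask.to_numpy()], grad[mask].to_numpy(), deg=3)
grad_smooth = pd.Series(np.polyval(coeffs, t_sec), index=myTimeSeries.index)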
I have two arrays. x is the independent variable, and counts is the number of counts of x occurring, like a histogram. I know I can calculate the mean by defining a function:
def mean(x, counts):
    return np.sum(x*counts) / np.sum(counts)
Is there a general function I can use to calculate each moment from the distribution defined by x and counts? I would also like to compute the variance.
You could use the moment function from scipy. It calculates the n-th central moment of your data.
You could also define your own function, which could look something like this:
def nmoment(x, counts, c, n):
    return np.sum(counts*(x-c)**n) / np.sum(counts)
In that function, c is meant to be the point around which the moment is taken, and n is the order. So to get the variance you could do nmoment(x, counts, np.average(x, weights=counts), 2).
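A tiny worked example of those calls (the x and counts values are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
counts = np.array([5, 9, 4, 2])

mean = nmoment(x, counts, 0, 1)                              # first raw moment = mean
var = nmoment(x, counts, np.average(x, weights=counts), 2)   # second central moment = variance
print(mean, var)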
import scipy as sp
from scipy import stats
stats.moment(counts, moment = 2) #variance
stats.moment returns nth central moment.
NumPy now supports these statistics directly:
https://numpy.org/doc/stable/reference/routines.statistics.html
np.average
np.std
np.var
etc
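Note that np.std and np.var have no weights argument, so for histogram-style data the weighted calls go through np.average. A short sketch (the arrays below are illustrative stand-ins for the question's x and counts):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
counts = np.array([5, 9, 4, 2])

mean = np.average(x, weights=counts)                 # weighted mean
var = np.average((x - mean) ** 2, weights=counts)    # second central moment (variance)
std = np.sqrt(var)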
What is the efficient equivalent of R's scale function in pandas? E.g.
newdf <- scale(df)
written in pandas? Is there an elegant way using transform?
Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function.
The only "problem" is that the returned object is no longer a DataFrame but a numpy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. an SVM or logistic regression). If you want to keep the DataFrame, it requires a small workaround:
from sklearn.preprocessing import scale
from pandas import DataFrame
newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
See also here.
I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way)
def scale(y, c=True, sc=True):
    x = y.copy()

    if c:
        x -= x.mean()
    if sc and c:
        x /= x.std()
    elif sc:
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))

    return x
For the more general version you'd probably need to do some type/length checking.
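A quick usage sketch with made-up data, showing that the default behaviour mirrors R's scale(df) (centered columns with unit standard deviation):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list("abc"))
newdf = scale(df)

print(newdf.mean())   # ~0 for each column
print(newdf.std())    # ~1 for each column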
EDIT: added an explanation of the denominator in the elif sc: clause.
From the R docs:
... If ‘scale’ is
‘TRUE’ then scaling is done by dividing the (centered) columns of
‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
root mean square otherwise. If ‘scale’ is ‘FALSE’, no scaling is
done.
The root-mean-square for a (possibly centered) column is defined
as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
values and n is the number of non-missing values. In the case
‘center = TRUE’, this is the same as the standard deviation, but
in general it is not.
The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square from the definition: first squaring x (the pow method), then summing over the rows of each column, and then dividing by the non-NaN count in each column (the count method).
As a side note, the reason I didn't simply compute the RMS after centering is that the std method calls bottleneck for faster computation of that expression in the special case where you want the standard deviation rather than the more general RMS.
You could instead compute the RMS after centering; it might be worth a benchmark, since now that I'm writing this I'm not actually sure which is faster, and I haven't benchmarked it.
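If anyone wants to run that benchmark, a rough sketch (the data size is arbitrary and timings will vary by machine and pandas version):

import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame(np.random.randn(500_000, 10))

def scale_via_std(x):
    x = x - x.mean()
    return x / x.std()

def scale_via_rms_after_centering(x):
    x = x - x.mean()
    return x / np.sqrt(x.pow(2).sum().div(x.count() - 1))

print(timeit(lambda: scale_via_std(df), number=10))
print(timeit(lambda: scale_via_rms_after_centering(df), number=10))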