Rolling weighted mean in pandas using date range - python

I want to calculate the rolling weighted mean of a time series, with the average computed over a specific time interval. For example, this calculates the (unweighted) rolling mean with a 90-day window:
import numpy as np
import pandas as pd
data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)
df = df.rolling("90D").mean()
However, when I apply a weighting function (line below) I get an error: "ValueError: Invalid window 90D"
df = df.rolling("90D", win_type="gaussian").mean(std=60)
On the other hand, the weighted average works if I make the window an integer instead of an offset:
df = df.rolling(90, win_type="gaussian").mean(std=60)
Using an integer does not work for my application since the observations are not evenly spaced in time.
Two questions:
Can I do a weighted rolling mean with an offset (e.g. "90D" or "3M")?
If I can, what does std refer to when I specify window="90D" and win_type="gaussian"; does it mean the std is 60D?

Okay, I discovered that it's not implemented yet in pandas.
Look here:
https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/core/window.py
If you follow line 2844 you see that when win_type is not None a Window object is returned:
if win_type is not None:
    return Window(obj, win_type=win_type, **kwds)
Then check the validate method of the Window object at line 630: it only allows integer or list-like windows.
I think this is because pandas uses the scipy.signal library, which receives a plain array, so it cannot take the distribution of your data over time into account.
You could implement your own weighting function and use apply, but its performance won't be great.
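For example, something along these lines (a rough sketch, not tuned for speed; it weights each observation by a Gaussian of its age in days relative to the newest observation in the window, with the std given in days):
import numpy as np
import pandas as pd

def gaussian_time_weighted_mean(window, std_days=60):
    # window is a Series; with raw=False its index carries the timestamps
    # of the observations that fall inside the rolling window
    age_days = (window.index[-1] - window.index).total_seconds() / 86400.0
    weights = np.exp(-0.5 * (age_days / std_days) ** 2)
    return np.average(window.values, weights=weights)

data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)

weighted = df.rolling("90D").apply(gaussian_time_weighted_mean, raw=False)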

It is not clear to me what you want the weights in your weighted average to be, but is the weight a measure of the time for which an observation is 'in effect'?
If so, I believe you can re-index the dataframe so it has regularly-spaced observations and then fill NAs appropriately - see the method parameter in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
That will allow rolling to work, and it also makes you think explicitly about how missing observations are treated - for instance, should a missing sample take its value from the last valid sample or from the nearest sample?
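A minimal sketch of that idea, assuming a 6-hourly grid and forward-filling from the last valid sample (reusing the example data from the question):
import numpy as np
import pandas as pd

data = np.random.randint(0, 1000, (1000, 10))
index = pd.date_range("20190101", periods=1000, freq="18H")
df = pd.DataFrame(index=index, data=data)

# re-index onto a regular 6-hourly grid; each missing slot takes the last valid observation
regular = df.reindex(pd.date_range(index[0], index[-1], freq="6H"), method="ffill")

# on a regular grid an integer window corresponds to a fixed time span:
# 90 days = 90 * 4 six-hour steps, so win_type now works
smoothed = regular.rolling(90 * 4, win_type="gaussian").mean(std=60)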

Related

TSFRESH - features extracted by a symmetric sliding window

As raw data we have measurements m_{i,j}, measured every 30 seconds (i=0, 30, 60, 90,...720,..) for every subject j in the dataset.
I wish to use the TSFRESH package to extract time-series features, such that for a point of interest at time i, features are calculated based on a symmetric rolling window.
We wish to calculate the feature vector of time point i,j based on measurements of 3 hours of context before i and 3 hours after i.
Thus, each point of interest is surrounded by 6 hours of "context", i.e. 360 measurements before and 360 measurements after the point of interest.
For every point of interest, features should therefore be extracted from a window of 721 measurements of m_{i,j}.
I've tried using the rolling_direction parameter in roll_time_series(), but the only options are to roll either backwards or forwards in "time" - I'm looking for a way to include both "past" and "future" data in the feature calculation.
If I understand your idea correctly, it is even possible to do this with only one-sided rolling. Let's try with one example:
You want to predict for the time 8:00 - and you need for this the data from 5:00 until 11:00.
If you roll through the data with a window size of 6h and positive rolling direction, you will end up with a dataset which also includes exactly this part of the data (5:00 to 11:00). Normally it would be used to train for the value at 11:00 (or 12:00) - but nothing prevents you from using it to predict the value at 8:00.
Basically, it is just a matter of re-indexing.
(Same is actually true for negative rolling direction)
A "workaround" solution:
Use the "roll_time_series" function twice; one for "backward" rolling (setting rolling_direction=1) and the second for "forward" (rolling_direction=-1), and then combine them into one.
This will provide, for each time point in the original dataset m_{i,j}$, a time series rolling object with 360 values "from the past" and 360 values "from the future" (i.e., the time point is at the center of the window and max_timeshift=360)
Note to the use of pandas functions below: concat(), sort_values(), drop_duplicates() - which are mandatory for this solution to work.
import numpy as np
import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters

rolled_backward = roll_time_series(activity_data,
                                   column_id=id_column,
                                   column_sort=sort_column,
                                   column_kind=None,
                                   rolling_direction=1,
                                   max_timeshift=360)

rolled_forward = roll_time_series(activity_data,
                                  column_id=id_column,
                                  column_sort=sort_column,
                                  column_kind=None,
                                  rolling_direction=-1,
                                  max_timeshift=360)

# merge into one dataframe, with a backward and a forward window for every time point (sample)
df = pd.concat([rolled_backward, rolled_forward])

# important! - sort and drop duplicates
df.sort_values(by=[id_column, sort_column], inplace=True)
df.drop_duplicates(subset=[id_column, sort_column, 'activity'], inplace=True, keep='first')
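From there, feature extraction proceeds as usual on the combined dataframe; a sketch, assuming the same column names as above and the MinimalFCParameters import already shown:
from tsfresh import extract_features

# each rolled (id, time) pair now identifies a window centred on that time point
features = extract_features(df,
                            column_id=id_column,
                            column_sort=sort_column,
                            default_fc_parameters=MinimalFCParameters())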

Question about numpy correlate: not giving expected result

I want to make sure I am using numpy's correlate correctly, since it is not giving me the answer I expect. Perhaps I am misunderstanding the correlate function. Here is a code snippet with comments:
import numpy as np
ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000)) # make some data
fragment = ref[2149:7022] # create a fragment of data from ref
corr = np.correlate(ref, fragment) # Find the correlation between the two
maxLag = np.argmax(corr) # find the maximum lag, this should be the offset that we chose above, 2149
print(maxLag)
2167 # I expected this to be 2149.
Isn't the index in the corr array where the correlation is maximum the lag between these two datasets? I would think the starting index I chose for the smaller dataset would be the offset with the greatest correlation.
Why is there a discrepancy between what I expect, 2149, and the result, 2167?
Thanks
That looks like a precision issue to me. Cross-correlation is an integral, and it will always have problems when represented in discrete space; I guess the problem arises when the values are close to 0. Maybe if you increase the magnitudes or the precision that difference will disappear, but I don't think it is really necessary, since you are already dealing with an approximation when using the discrete cross-correlation. Below is a graph of the correlation so you can see that the values near the peak are indeed close.
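The graph the answer refers to is not reproduced here; a minimal sketch to regenerate it from the question's snippet (matplotlib assumed):
import numpy as np
import matplotlib.pyplot as plt

ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000))
fragment = ref[2149:7022]
corr = np.correlate(ref, fragment)  # default 'valid' mode: one value per candidate lag

plt.plot(corr, label="cross-correlation")
plt.axvline(2149, color="green", linestyle=":", label="expected lag (2149)")
plt.axvline(int(np.argmax(corr)), color="red", linestyle="--",
            label="argmax ({})".format(int(np.argmax(corr))))
plt.xlabel("lag")
plt.ylabel("correlation")
plt.legend()
plt.show()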

Using ttest while increasing samplesize

I have a df with different features. I will focus on one feature here, called 'x':
count 2152.000000
mean 95.162587
std 0.758480
min 92.882304
25% 94.648659
50% 95.172078
75% 95.648485
max 97.407068
I want to perform a t-test on my df while I sample data out of the df, to see the effect of the sample size; I expect it to saturate after a certain number of samples. Therefore I loop over the sample size for a specific random_state:
ttest_pull = []
for N in np.arange(1, 2153, 1):
    pull = helioPosition.sample(N, random_state=140)
    ttest_pull.append(stats.ttest_ind(df['x'], pull['x'])[1])
The distribution of 'x' is a normal distribution.
When I plot the p-value of the t-test over the sample size I get the following plot:
Is there a mistake in my code or method? I would expect a better p-value with a higher sample size, but this is not true for every sample size. How can a sample size of ~1500 be worse than a sample size of ~450?
pull is sampled from the same data, i.e. the second sample is a random sample from the same population and the two samples have the same mean (expected value).
p-values are uniformly distributed on the interval [0, 1] when the null hypothesis is true, which is the case here. This is independent of the sample size, so we expect to see fluctuations, i.e. randomness, in the p-values of the tests.
However, in this case you do not have two independent samples, which is an underlying assumption of the t-test. As far as I understand your code, in the limit as N becomes large the second sample will include the entire "population" and be identical to the first sample. In that case the p-value will go to one, because you are comparing two essentially identical samples.
If sample draws with replacement, then you are essentially comparing a bootstrap sample with the "population", which would be two samples with the same expected value and very high correlation. So the p-value of a standard t-test should be high, but still a random number.
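To illustrate the first point with synthetic data (not the poster's df): when the two samples really are independent draws from the same population, the p-values scatter uniformly between 0 and 1 regardless of the sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 2000 repeated t-tests on two independent samples from the same distribution
pvals = np.array([
    stats.ttest_ind(rng.normal(size=500), rng.normal(size=500)).pvalue
    for _ in range(2000)
])

# roughly 10% of the p-values land in each decile, i.e. Uniform(0, 1)
print(np.histogram(pvals, bins=10, range=(0, 1))[0])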
Just to add to the answer above: what you are referring to is power - basically, how many false negatives you get for a given effect and sample size. In your case the effect is zero, since both samples come from the same distribution, and note that you only ran one test per sample size, which means your p-values are basically draws from a uniform distribution.
What you need is, first, a difference between the two distributions and, second, to perform the test repeatedly and count the number of rejections. See the example below:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
import seaborn as sns
df = pd.DataFrame({'x': np.random.normal(0, 2, 150),
                   'y': np.random.normal(1, 2, 150)})
Now we have two columns with different means. We loop over the sampling with different sample sizes:
def subsampletest(da, N):
    pull = da.sample(N)
    return ttest_ind(pull['x'], pull['y'])[1]

sampleSize = np.arange(5, 150, step=5)
results = np.array([[subsampletest(df, x) for x in sampleSize] for B in range(100)])
The number of rejections at an alpha of 0.05 (out of 100 repetitions) per sample size is then simply:
rejections = np.mean(results<0.05,axis=0)
sns.lineplot(x=sampleSize,y=rejections)

Calculating the rolling root mean squared

I have a vibration signal and I want to smooth it using the root mean square (RMS) with a rolling window of 21 days. The data is minute-wise, so a rolling window of 21 days means 21*1440 (21*24*60) samples.
Is there any approach like:
# Dummy approach
df['Rolling_rms'] = df['signal'].rolling(21*1440).rms()
I have tried an approach using a for loop, which is way too time-consuming:
# Function for calculating RMS
def rms_calc(ser):
    return np.sqrt(np.mean(ser**2))

for i in range(0, len(signal)):
    j = 21*1440 + i
    print(rms_calc(df[signal][i:j]))
You can use the method apply with a custom function:
df['signal'].pow(2).rolling(21*24*60).apply(lambda x: np.sqrt(x.mean()))
Further to #mykola-zotko's answer:
there is a mean method for the rolling object, which would speed this up considerably.
You then only need the square root of the result; np.sqrt works directly on the resulting Series, so no per-window apply is needed.
In full:
np.sqrt(df['signal'].pow(2).rolling(21*24*60).mean())
Alternate method
If you want to zero-mean your data windows before calculating the RMS (which I believe is common in vibration analysis), then the calculation is mathematically equivalent to a rolling standard deviation. In that case, you can simply use the std method of the rolling object:
df['signal'].rolling(21*24*60).std(ddof=0)
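A quick check of that equivalence on synthetic data (a sketch with a shortened window to keep the comparison fast):
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.normal(size=10000))
win = 1440  # one day of minute-wise samples, just for the comparison

# RMS of each zero-meaned window...
rms_zero_mean = s.rolling(win).apply(lambda x: np.sqrt(np.mean((x - x.mean())**2)), raw=False)
# ...equals the rolling population standard deviation
rolling_std = s.rolling(win).std(ddof=0)

print(np.allclose(rms_zero_mean.dropna(), rolling_std.dropna()))  # True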

pandas rolling_std only perform every Nth calculation

I am working on some code optimization. Currently I use the pandas rolling_mean and rolling_std to compute normalized cross-correlations of time series data from seismic instruments. For non-pertinent technical reasons I am only interested in every Nth value of the output of these rolling mean and rolling std calls, so I am looking for a way to compute only every Nth value. I may have to write Cython code to do this, but I would prefer not to. Here is an example:
import pandas as pd
import numpy as np

As = 5000  # array size
ws = 150   # moving window size
N = 3      # only interested in every Nth value of the output array

ar = np.random.rand(As)               # generate a generic random array
RSTD = pd.rolling_std(ar, ws)[ws-1:]  # don't return the NaNs from before the windows overlap
foo = RSTD[::N]                       # use array indexing to decimate RSTD, keeping every Nth value
Is there a good pandas way to calculate only every Nth value of RSTD, rather than calculating all the values and then decimating?
Thanks
