I need to calculate some rolling forward averages in a dataframe and really don't know where to start.
I know if I wanted to select a cell 10 days ahead say I would do df.shift(-10), but what I'm looking to do is calculate the average between 10 and 15 days ahead say.
So what I'm kind of thinking is df.rolling(-10,-15).mean(), if I was trying to calculate just a moving average going backing in time df.rolling(15, 10).mean() would work perfectly and I did think about just calculating the averages like that, and then somehow shifting the data.
Any help would be great
Many thanks
You could calculate the rolling mean 5 days ahead, and then shift that for 10 more periods. Since negative values in rolling are not allowed, you can invert the axis, calculate backwards, and then invert again (see How to use Pandas rolling_* functions on a forward-looking basis):
df = pd.DataFrame(np.random.rand(100, 2))
df[::-1].rolling(5).mean()[::-1].shift(-10)
The above answer doesn't look right. IMHO you musn't reverse and shift.
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100, 2))) # int easier to interpret
df[::-1].rolling(window=5, min_periods=1).mean()[::-1]
this also works but you lose the last 4 values:
df.rolling(window=5, min_periods=1).mean().shift(-5)
The more difficult problem of a rolling window that is arbitrarily shifted (offset) probably needs to use .shift() in some way.
There is a new method to deal with this. That said includes current row.
https://pandas.pydata.org/docs/reference/api/pandas.api.indexers.FixedForwardWindowIndexer.html
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
df.rolling(window=indexer, min_periods=1).sum()
Related
I have a small toy dataset of 23 hours of irregular time series data (financial tick data) with millisecond granularity, roughly 1M rows. By irregular I mean that the timestamps are not evenly spaced. I also have a column 'mid' with some values too.
I am trying to group by e.g. 2 minute buckets to calculate the absolute difference of 'mid', and then taking the median, in the following manner:
df.groupby(["RIC", pd.Grouper(freq='2min')]).mid.apply(
lambda x: np.abs(x[-1] - x[0]) if len(x) != 0 else 0).median()
Note: 'RIC' is just another layer of grouping I am applying before the time bucket grouping.
Basically, I am telling pandas to group by every [ith minute : ith + 2 minute] intervals, and in each interval, take the last (x[-1]) and the first (x[0]) 'mid' element, and take its absolute difference. I am doing this over a range of 'freqs' as well, e.g. 2min, 4min, ..., up to 30min intervals.
This approach works completely fine, but it is awfully slow because of the usage of pandas' .apply function. I am aware that .apply doesn't take advantage of the built in vectorization of pandas and numpy, as it is computationally no different to a for loop, and am trying to figure out how to achieve the same without having to use apply so I can speed it up by several orders of magnitude.
Does anyone know how to rewrite the above code to ditch .apply? Any tips will be appreciated!
On the pandas groupby.apply webpage:
"While apply is a very flexible method, its downside is that using it
can be quite a bit slower than using more specific methods like agg or
transform. Pandas offers a wide range of method that will be much
faster than using apply for their specific purposes, so try to use
them before reaching for apply."
Therefore, using transform should be a lot faster.
grouped = df.groupby(["RIC", pd.Grouper(freq='2min')])
abs(grouped.mid.transform("last") - grouped.mid.transform("first")).median()
I would like to add a column at the end of a dataframe containing the moving average (EWM) for a specific value.
Currently, I am using 2 for loops:
for country in Country_Names:
for i in i_Codes:
EMA = df[(df['COUNTRY_NAME']==country) & (df['I_CODE']==i)].KRI_VALUE.ewm(span=6, adjust=False).mean()
df.loc[(df['COUNTRY_NAME']==country) & (df['I_CODE']==i), 'EMA'] = EMA
This is really quite slow (takes a few minutes - I have more than 50,000 rows...): does anyone have a better idea?
Many thanks!
ODO22
I'm gonna guess how it might work without seeing the data,
df['EMA'] = (df.groupby([Country_Names,i_Codes])
.transform(lambda x:x.KRI_VALUE.ewm(span=6, adjust=False).mean())
So far I've read in 2 CSV's and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated I want to generate a daily count line and a two week rolling average from the current date going backward. I cannot index based of the 'Date Opened' field but I still need my outputs organized by this with the most recent first. Once these are sorted by date my daily count plotting issue will be rectified. My remaining task would be to compute a two week rolling average for count within the week. I've looked into the Pandas documentation and I think the rolling_mean will work but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesnt seem to work. I know there is an easier way to do this but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isnt sorted(even though I think I've instructed it to) and is unusable in line form.
def data_sort():
data_merge = data_extract()
domains = data_merge.groupby('PWx Domain')
for domain in domains.groups.items():
dsort = (data_merge.loc[domain[1]])
print (dsort.head())
open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
#open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
open_ct = open_dt.value_counts(sort= False)
biwk_avg = pd.rolling_mean(open_ct, 28)
plt.plot(open_ct,'bo')
plt.show()
data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like thins:
for _,domain in data_merge.groupby('PWx Domain'):
# Convert date to the index
domain.index = pd.to_datetime(domain['Date Opened'])
# Sort dy dates
domain.sort_index(inplace=True)
# Do the averaging
rolling = pd.rolling_mean(domain.resample('1D').mean(), 14)
plt.plot(rolling,'bo')
plt.show()
I will be shocked if there isn't some standard library function for this especially in numpy or scipy but no amount of Googling is providing a decent answer.
I am getting data from the Poloniex exchange - cryptocurrency. Think of it like getting stock prices - buy and sell orders - pushed to your computer. So what I have is timeseries of prices for any given market. One market might get an update 10 times a day while another gets updated 10 times a minute - it all depends on how many people are buying and selling on the market.
So my timeseries data will end up being something like:
[1 0.0003234,
1.01 0.0003233,
10.0004 0.00033,
124.23 0.0003334,
...]
Where the 1st column is the time value (I use Unix timestamps to the microsecond but didn't think that was necessary in the example. The 2nd column would be one of the prices - either the buy or sell prices.
What I want is to convert it into a matrix where the data is "sampled" at a regular time frame. So the interpolated (zero-order hold) matrix would be:
[1 0.0003234,
2 0.0003233,
3 0.0003233,
...
10 0.0003233,
11 0.00033,
12 0.00033,
13 0.00033,
...
120 0.00033,
125 0.0003334,
...]
I want to do this with any reasonable time step. Right now I use np.linspace(start_time, end_time, time_step) to create the new time vector.
Writing my own, admittedly crude, zero-order hold interpolator won't be that hard. I'll loop through the original time vector and use np.nonzero to find all the indices in the new time vector which fit between one timestamp (t0) and the next (t1) then fill in those indices with the value from time t0.
For now, the crude method will work. The matrix of prices isn't that big. But I have to think there a faster method using one of the built-in libraries. I just can't find it.
Also, for the example above I only use a matrix of Nx2 (column 1: times, column 2: price) but ultimately the market has 6 or 8 different parameters that might get updated. A method/library function that could handled multiple prices and such in different columns would be great.
Python 3.5 via Anaconda on Windows 7 (hopefully won't matter).
TIA
For your problem you can use scipy.interpolate.interp1d. It seems to be able to do everything that you want. It is able to do a zero order hold interpolation if you specify kind="zero". It can also simultaniously interpolate multiple columns of a matrix. You will just have to specify the appropriate axis. f = interp1d(xData, yDataColumns, kind='zero', axis=0) will then return a function that you can evaluate at any point in the interpolation range. You can then get your normalized data by calling f(np.linspace(start_time, end_time, time_step).
I have a fairly sophisticate prediction code with over 20 columns and millions of data per column using wls. Now i use iterrow to loop through dates, then based on those dates and values in those dates, extract different sizes of data for calculation. it takes hours to run in my production, I simplify the code into the following:
import pandas as pd
import numpy as np
from datetime import timedelta
df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')
def calculateC(A, dte):
if A>0: #based on values has different cutoff length for trend prediction
depth=10
else:
depth=20
lastyear=(dte-timedelta(days=365))
df2=df[df.dte<lastyear].head(depth) #use last year same date data for basis of prediction
return df2.B.mean() #uses WLS in my model but for simplification replace with mean
for index, row in df.iterrows():
if index>365:
df.loc[index,'C']=calculateC(row.A, row.dte)
I read that iterrow is the main cause because it is not an effective way to use Pandas, and I should use vector methods. However, I can't seem to be able to find a way to vector based on conditions (dates, different length, and range of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where( df['A'] > 0,
df.shift(365).rolling(10).B.mean(),
df.shift(365).rolling(20).B.mean() )
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
if row.name >365: # row.name is the index of the row
if row.A >0: #based on values has different cutoff length for trend prediction
depth=10
else:
depth=20
lastyear=(row.dte-timedelta(days=365))
df2=df[df.dte<lastyear].B.head(depth) #use last year same date data for basis of prediction
print row.name,np.mean(df2) #uses WLS in my model but for simplification replace with mean
df.apply(calculateC2,axis=1)