I have a small toy dataset of 23 hours of irregular time series data (financial tick data) with millisecond granularity, roughly 1M rows. By irregular I mean that the timestamps are not evenly spaced. I also have a column 'mid' with some values too.
I am trying to group by, e.g., 2-minute buckets, compute the absolute difference of 'mid' within each bucket, and then take the median, in the following manner:
df.groupby(["RIC", pd.Grouper(freq='2min')]).mid.apply(
    lambda x: np.abs(x.iloc[-1] - x.iloc[0]) if len(x) != 0 else 0).median()
Note: 'RIC' is just another layer of grouping I am applying before the time bucket grouping.
Basically, I am telling pandas to group into [i-th minute, i-th + 2 minute) intervals and, in each interval, take the last (x.iloc[-1]) and the first (x.iloc[0]) 'mid' element and compute their absolute difference. I am doing this over a range of frequencies as well, e.g. 2min, 4min, ..., up to 30min intervals.
This approach works completely fine, but it is awfully slow because of the usage of pandas' .apply function. I am aware that .apply doesn't take advantage of the built in vectorization of pandas and numpy, as it is computationally no different to a for loop, and am trying to figure out how to achieve the same without having to use apply so I can speed it up by several orders of magnitude.
Does anyone know how to rewrite the above code to ditch .apply? Any tips will be appreciated!
On the pandas groupby.apply webpage:
"While apply is a very flexible method, its downside is that using it
can be quite a bit slower than using more specific methods like agg or
transform. Pandas offers a wide range of method that will be much
faster than using apply for their specific purposes, so try to use
them before reaching for apply."
Therefore, using transform should be a lot faster.
grouped = df.groupby(["RIC", pd.Grouper(freq='2min')])
abs(grouped.mid.transform("last") - grouped.mid.transform("first")).median()
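One caveat with the transform version: transform returns one value per original row, so the median is taken over ticks rather than over buckets, and buckets with many ticks weigh more. The sketch below uses agg instead, which returns one row per (RIC, bucket) like the original apply. The toy data and the names first_last and per_bucket are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two instruments with irregular timestamps.
idx = pd.to_datetime([
    "2024-01-01 00:00:00.100", "2024-01-01 00:00:20.000",
    "2024-01-01 00:01:00.000", "2024-01-01 00:01:30.000",
    "2024-01-01 00:02:10.500", "2024-01-01 00:03:59.000",
])
df = pd.DataFrame(
    {"RIC": ["A", "B", "B", "A", "A", "A"],
     "mid": [10.0, 20.0, 20.25, 10.5, 11.0, 10.0]},
    index=idx,
)

grouped = df.groupby(["RIC", pd.Grouper(freq="2min")])["mid"]

# One row per (RIC, bucket), with the first and last 'mid' of that bucket.
first_last = grouped.agg(["first", "last"])

# Absolute move per bucket, then the median across buckets.
per_bucket = (first_last["last"] - first_last["first"]).abs()
result = per_bucket.median()
print(per_bucket)
print(result)
```

Looping this over the list of frequencies (2min, 4min, ...) only changes the freq argument of pd.Grouper.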
I am currently working on a project that uses Pandas, with a large dataset (~42K rows x 1K columns).
My dataset has many missing values which I want to interpolate to obtain a better result when training an ML model using this data. My method of interpolating is to take the average of the previous and the next value and use that as the value for any NaN. Example:
TRANSACTION PAYED MONDAY TUESDAY WEDNESDAY
D8Q3ML42DS0 1 123.2 NaN 43.12
So in the above example the NaN would be replaced with the average of 123.2 and 43.12, which is 83.16. If the value can't be interpolated, a 0 is used instead. I was able to implement this in a number of ways, but I always run into the issue of it taking a very long time to process all of the rows in the dataset, despite running it on an Intel Core i9. The following are approaches I've tried and found to take too long:
Interpolating the data and then only replacing the elements that need to be replaced instead of replacing the entire row.
Replacing the entire row with a new pd.Series that has the old and the interpolated values. It seems like my code is able to execute reasonably well on a Numpy Array but the slowness comes from the assignment.
I'm not quite sure why the performance of my code comes nowhere close to df.interpolate() despite it being the same idea. Here is some of my code responsible for the interpolation:
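import math
import statistics
import numpy as np

def interpolate(array: np.ndarray) -> np.ndarray:
    arr_len = len(array)
    for i in range(arr_len):
        if math.isnan(array[i]):
            # No usable neighbour on one side: fall back to 0.
            if i == 0 or i == arr_len - 1 or math.isnan(array[i-1]) or math.isnan(array[i+1]):
                array[i] = 0
            else:
                array[i] = statistics.mean([array[i-1], array[i+1]])
    return array

# .loc cannot mix a label with a positional slice, so select by position.
for row_pos in range(len(df)):
    df.iloc[row_pos, 2:] = interpolate(df.iloc[row_pos, 2:].to_numpy(dtype=float))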
My understanding is that Pandas has some sort of parallel techniques and functions that it is able to use to perform that. How can I speed this process up even a little?
df.interpolate(method='linear', limit_direction='forward', axis=0)
Try this; it might help.
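For the exact rule described in the question (average of the previous and next value in the same row, 0 otherwise), a vectorized sketch using shift along the columns may be closer than a plain linear interpolate. The second row and the name day_cols are invented for illustration. One difference from the loop: here a NaN whose neighbour was also NaN in the original data always becomes 0, whereas the sequential loop can reuse a 0 it has just written.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the example in the question.
df = pd.DataFrame(
    {"TRANSACTION": ["D8Q3ML42DS0", "X0000000000"],
     "PAYED": [1, 0],
     "MONDAY": [123.2, np.nan],
     "TUESDAY": [np.nan, np.nan],
     "WEDNESDAY": [43.12, 5.0]}
).set_index("TRANSACTION")

day_cols = ["MONDAY", "TUESDAY", "WEDNESDAY"]
values = df[day_cols]

# Previous and next value within each row (axis=1 shifts across columns).
prev_vals = values.shift(1, axis=1)
next_vals = values.shift(-1, axis=1)

# NaN wherever either neighbour is missing, including the first and last column.
neighbour_mean = (prev_vals + next_vals) / 2

# Fill gaps with the neighbour mean where it exists, otherwise with 0.
df[day_cols] = values.fillna(neighbour_mean).fillna(0)
print(df)
```

Because everything operates on whole columns at once, there is no per-row assignment, which is where the loop version spends most of its time.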
Suppose I have a series, and I want to do the sort of thing pandas does with resample: say, compute the mean (or some other aggregation) of rows 0-14, 15-29, ..., etc. Of course this can be done with rolling, but this will do (in the example case) 15 times as much work as necessary.
(So, if s is the series, then s.rolling(15).mean().iloc[14::15].) One can, of course, introduce a DateTime index and then do resample, but this seems like a kludge. What's the canonical way?
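One common way to do this without a DateTime index is to group by the integer block number of each row. This is a minimal sketch on made-up data; block_means is just an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(45, dtype=float))

# Row i belongs to block i // 15, so rows 0-14 form block 0, rows 15-29 block 1, etc.
block_ids = np.arange(len(s)) // 15
block_means = s.groupby(block_ids).mean()
print(block_means)
```

Each row is visited once, and any other aggregation (sum, median, agg with several functions) can replace mean. If the length is not a multiple of 15, the last block is simply shorter.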
I need to calculate some rolling forward averages in a dataframe and really don't know where to start.
I know if I wanted to select a cell 10 days ahead say I would do df.shift(-10), but what I'm looking to do is calculate the average between 10 and 15 days ahead say.
So what I'm kind of thinking of is df.rolling(-10,-15).mean(). If I were trying to calculate just a moving average going back in time, df.rolling(15, 10).mean() would work perfectly, and I did think about just calculating the averages like that and then somehow shifting the data.
Any help would be great
Many thanks
You could calculate the rolling mean 5 days ahead, and then shift that for 10 more periods. Since negative values in rolling are not allowed, you can invert the axis, calculate backwards, and then invert again (see How to use Pandas rolling_* functions on a forward-looking basis):
df = pd.DataFrame(np.random.rand(100, 2))
df[::-1].rolling(5).mean()[::-1].shift(-10)
The above answer doesn't look right. IMHO you mustn't reverse and shift.
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(100, 2))) # int easier to interpret
df[::-1].rolling(window=5, min_periods=1).mean()[::-1]
This also works, but you lose the last 4 values:
df.rolling(window=5, min_periods=1).mean().shift(-4)
The more difficult problem of a rolling window that is arbitrarily shifted (offset) probably needs to use .shift() in some way.
There is a newer way to deal with this. Note that the window includes the current row.
https://pandas.pydata.org/docs/reference/api/pandas.api.indexers.FixedForwardWindowIndexer.html
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
df.rolling(window=indexer, min_periods=1).sum()
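To get the offset window from the original question, the forward indexer can be combined with shift. The sketch below assumes "between 10 and 15 days ahead" means rows i+10 through i+14; the data and the names forward_mean and offset_mean are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(30, dtype=float))

# Forward-looking window: at row i it covers rows i .. i+4 (current row included).
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=5)
forward_mean = s.rolling(window=indexer, min_periods=5).mean()

# Pull the result 10 rows back: at row i it now covers rows i+10 .. i+14.
offset_mean = forward_mean.shift(-10)
print(offset_mean.head())
```

With min_periods=5, rows whose window would run past the end of the data come out as NaN rather than as a mean over fewer values.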
I have fairly sophisticated prediction code using WLS, with over 20 columns and millions of rows per column. Right now I use iterrows to loop through dates and then, based on those dates and the values on those dates, extract different-sized slices of data for the calculation. It takes hours to run in production. I have simplified the code into the following:
import pandas as pd
import numpy as np
from datetime import timedelta

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')

def calculateC(A, dte):
    if A > 0:  # based on values has different cutoff length for trend prediction
        depth = 10
    else:
        depth = 20
    lastyear = dte - timedelta(days=365)
    df2 = df[df.dte < lastyear].head(depth)  # use last year same date data for basis of prediction
    return df2.B.mean()  # uses WLS in my model but for simplification replace with mean

for index, row in df.iterrows():
    if index > 365:
        df.loc[index, 'C'] = calculateC(row.A, row.dte)
I read that iterrows is the main cause, because it is not an efficient way to use pandas, and that I should use vectorized methods instead. However, I can't find a way to vectorize based on these conditions (dates, different lengths, and ranges of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where( df['A'] > 0,
df.shift(365).rolling(10).B.mean(),
df.shift(365).rolling(20).B.mean() )
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are evenly spaced at one row per day (so that 365 rows back is exactly 365 days back), you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
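As a rough sketch of the frequency-based shift mentioned above: with 'dte' as the index, shift(freq=...) moves the timestamps rather than the rows, and reindexing back onto the original dates lines up each date with the value from 365 days earlier. The deterministic 'B' column and the names last_year and roll10 are made up so the result is easy to check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"B": np.arange(1000, dtype=float)})
df["dte"] = pd.date_range("2014-09-01", periods=1000, freq="D")

s = df.set_index("dte")["B"]

# Move every timestamp forward by 365 days, then align back to the original
# dates: each date now holds the value observed 365 days before it.
last_year = s.shift(freq="365D").reindex(s.index)

# Rolling mean over the 10 most recent "one year ago" values.
roll10 = last_year.rolling(10).mean()
print(roll10.dropna().head())
```

Dates with no observation exactly 365 days earlier come out as NaN after the reindex, so irregular data may need a resample first.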
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
    if row.name > 365:  # row.name is the index of the row
        if row.A > 0:  # based on values has different cutoff length for trend prediction
            depth = 10
        else:
            depth = 20
        lastyear = row.dte - timedelta(days=365)
        df2 = df[df.dte < lastyear].B.head(depth)  # use last year same date data for basis of prediction
        print(row.name, np.mean(df2))  # uses WLS in my model but for simplification replace with mean
df.apply(calculateC2, axis=1)
I'm reading in timeseries data that contains only the available times. This leads to a Series with no missing values, but an unequally spaced index. I'd like to convert this to a Series with an equally spaced index with missing values. Since I don't know a priori what the spacing will be, I'm currently using a function like
min_dt = np.diff(series.index.values).min()
new_spacing = pandas.DateOffset(days=min_dt.days, seconds=min_dt.seconds,
microseconds=min_dt.microseconds)
series = series.asfreq(new_spacing)
to compute what the spacing should be (note that this is using Pandas 0.7.3 - the 0.8 beta code looks slightly different, since I have to use series.index.to_pydatetime() for correct behavior with Numpy 1.6).
Is there an easier way to do this operation using the pandas library?
If you want NaN's in the places where there is no data, you can just use Minute() located in datetools (as of pandas 0.7.x)
from pandas.core.datetools import day, Minute
tseries.asfreq(Minute())
That should provide an evenly spaced time series with 1 minute differences with NaNs as the series values where there is no data.
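The snippets above target pandas 0.7.x. On a recent pandas, a sketch of the same idea (derive the smallest gap, then call asfreq) might look like the following; the sample data and the name regular are invented for illustration.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Hypothetical series with a missing observation at 00:02.
idx = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:01", "2024-01-01 00:03"])
series = pd.Series([1.0, 2.0, 3.0], index=idx)

# Smallest gap between consecutive timestamps, as a Timedelta.
min_dt = series.index.to_series().diff().min()

# Reindex onto an evenly spaced grid; timestamps with no data become NaN.
regular = series.asfreq(to_offset(min_dt))
print(regular)
```

Note that asfreq only keeps observations that fall exactly on the new grid, so this works cleanly only when every gap is a whole multiple of the smallest one.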