Suppose I have a series, and I want to do the sort of thing pandas does with resample - say, compute the mean (or some other aggregation) of rows 0-14, 14-29, ..., etc. Of course this can be done with rolling, but this will do (in the example case) 15 times as much work as necessary.
(so, if s is the series, then s.rolling(15).mean().iloc[::15] One can of course, introduce a DateTime index, and then do resample, but this seems like a kludge. What's the canonical way?
Related
I have a small toy dataset of 23 hours of irregular time series data (financial tick data) with millisecond granularity, roughly 1M rows. By irregular I mean that the timestamps are not evenly spaced. I also have a column 'mid' with some values too.
I am trying to group by e.g. 2 minute buckets to calculate the absolute difference of 'mid', and then taking the median, in the following manner:
df.groupby(["RIC", pd.Grouper(freq='2min')]).mid.apply(
lambda x: np.abs(x[-1] - x[0]) if len(x) != 0 else 0).median()
Note: 'RIC' is just another layer of grouping I am applying before the time bucket grouping.
Basically, I am telling pandas to group by every [ith minute : ith + 2 minute] intervals, and in each interval, take the last (x[-1]) and the first (x[0]) 'mid' element, and take its absolute difference. I am doing this over a range of 'freqs' as well, e.g. 2min, 4min, ..., up to 30min intervals.
This approach works completely fine, but it is awfully slow because of the usage of pandas' .apply function. I am aware that .apply doesn't take advantage of the built in vectorization of pandas and numpy, as it is computationally no different to a for loop, and am trying to figure out how to achieve the same without having to use apply so I can speed it up by several orders of magnitude.
Does anyone know how to rewrite the above code to ditch .apply? Any tips will be appreciated!
On the pandas groupby.apply webpage:
"While apply is a very flexible method, its downside is that using it
can be quite a bit slower than using more specific methods like agg or
transform. Pandas offers a wide range of method that will be much
faster than using apply for their specific purposes, so try to use
them before reaching for apply."
Therefore, using transform should be a lot faster.
grouped = df.groupby(["RIC", pd.Grouper(freq='2min')])
abs(grouped.mid.transform("last") - grouped.mid.transform("first")).median()
I have a pandas dataframe that includes time intervals that overlapping at some points (figure 1). I need a data frame that has a time series that starts beginning from the first start_time to the end of the last end_time (figure 2).
I have to sum up VIS values at overlapped time intervals.
I couldn't figure it out. How can I do it?
This problem is easily solved with the python package staircase, which is built on pandas and numpy for the purposes of working with (mathematical) step functions.
Assume your original dataframe is called df and the times you want in your resulting dataframe are an array (or datetime index, or series etc) called times.
import staircase as sc
stepfunction = sc.Stairs(df, start="start_time", end="end_time", value="VIS")
result = stepfunction(times, include_index=True)
That's it, result is a pandas Series indexed by times, and has the values you want. You can convert it to a dataframe in the format you want using reset_index method on the Series.
You can generate your times data like this
import pandas as pd
times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="30min")
Why it works
Each row in your dataframe can be thought of a step function. For example the first row corresponds to a step function which starts with a value of zero, then at 2002-02-03 04:15:00 increases to a value of 10, then at 2002-02-04 04:45:00 returns to zero. When you sum all the step functions up for each row you have one step function whose value is the sum of all VIS values at any point. This is what has been assigned to the stepfunction variable above. The stepfunction variable is callable, and returns values of the step function at the points specified. This is what is happening in the last line of the example where the result variable is being assigned.
note:
I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
If you paste your data instead of the images, I'd be able to test this. But this is how you may want to think about it. Assume your dataframe is called df.
df['start_time'] = pd.to_datetime(df['start_time']) # in case it's not datetime already
df.set_index('start_time', inplace=True)
new_dates = pd.date_range(start=min(df.index), end=max(df.end_time), freq='15Min')
new_df = df.reindex(new_dates, fill_value=np.nan)
As long as there are no duplicates in start_time, this should work. If there is, that'd need to be handled in some other way.
Resample is another possibility, but without data, it's tough to say what would work.
I have a Series that looks like this:
df.index[0:10]
DatetimeIndex(['1881-12-01', '1882-01-01', '1882-02-01', '1882-12-01',
'1883-01-01', '1883-02-01', '1883-12-01', '1884-01-01',
'1884-02-01', '1884-12-01'],
dtype='datetime64[ns]', name='date', freq=None)
Now I'd like to resample it so that every December, January and February is grouped together. More generally: I'd like to resample a dataframe to contain yearly periods, ignoring NaNs, so that the first index is taken into consideration:
assert(df[0:3].mean() == df.resample(something).mean().iloc[0])
df.resample('Y') treats the first index as a separate year. How do I do that? I wrote a partition function that groups an interable into equally sized chunks but I feel like there's a more idiomatic (and potentially faster) way that I'm missing.
I could solve this by using an anchored yearly offset as described here.
I have a fairly sophisticate prediction code with over 20 columns and millions of data per column using wls. Now i use iterrow to loop through dates, then based on those dates and values in those dates, extract different sizes of data for calculation. it takes hours to run in my production, I simplify the code into the following:
import pandas as pd
import numpy as np
from datetime import timedelta
df=pd.DataFrame(np.random.randn(1000,2), columns=list('AB'))
df['dte'] = pd.date_range('9/1/2014', periods=1000, freq='D')
def calculateC(A, dte):
if A>0: #based on values has different cutoff length for trend prediction
depth=10
else:
depth=20
lastyear=(dte-timedelta(days=365))
df2=df[df.dte<lastyear].head(depth) #use last year same date data for basis of prediction
return df2.B.mean() #uses WLS in my model but for simplification replace with mean
for index, row in df.iterrows():
if index>365:
df.loc[index,'C']=calculateC(row.A, row.dte)
I read that iterrow is the main cause because it is not an effective way to use Pandas, and I should use vector methods. However, I can't seem to be able to find a way to vector based on conditions (dates, different length, and range of values). Is there a way?
I have good news and bad news. The good news is I have something vectorized that is about 300x faster but the bad news is that I can't quite replicate your results. But I think that you ought to be able to use the principles here to greatly speed up your code, even if this code does not actually replicate your results at the moment.
df['result'] = np.where( df['A'] > 0,
df.shift(365).rolling(10).B.mean(),
df.shift(365).rolling(20).B.mean() )
The tough (slow) part of your code is this:
df2=df[df.dte<lastyear].head(depth)
However, as long as your dates are all 365 days apart, you can use code like this, which is vectorized and much faster:
df.shift(365).rolling(10).B.mean()
shift(365) replaces df.dte < lastyear and the rolling().mean() replaces head().mean(). It will be much faster and use less memory.
And actually, even if your dates aren't completely regular, you can probably resample and get this way to work. Or, somewhat equivalently, if you make the date your index, the shift can be made to work based on a frequency rather than rows (e.g. shift 365 days, even if that is not 365 rows). It would probably be a good idea to make 'dte' your index here regardless.
I would try pandas.DataFrame.apply(func, axis=1)
def calculateC2(row):
if row.name >365: # row.name is the index of the row
if row.A >0: #based on values has different cutoff length for trend prediction
depth=10
else:
depth=20
lastyear=(row.dte-timedelta(days=365))
df2=df[df.dte<lastyear].B.head(depth) #use last year same date data for basis of prediction
print row.name,np.mean(df2) #uses WLS in my model but for simplification replace with mean
df.apply(calculateC2,axis=1)
I'm reading in timeseries data that contains only the available times. This leads to a Series with no missing values, but an unequally spaced index. I'd like to convert this to a Series with an equally spaced index with missing values. Since I don't know a priori what the spacing will be, I'm currently using a function like
min_dt = np.diff(series.index.values).min()
new_spacing = pandas.DateOffset(days=min_dt.days, seconds=min_dt.seconds,
microseconds=min_dt.microseconds)
series = series.asfreq(new_spacing)
to compute what the spacing should be (note that this is using Pandas 0.7.3 - the 0.8 beta code looks slightly differently since I have to use series.index.to_pydatetime() for correct behavior with Numpy 1.6).
Is there an easier way to do this operation using the pandas library?
If you want NaN's in the places where there is no data, you can just use Minute() located in datetools (as of pandas 0.7.x)
from pandas.core.datetools import day, Minute
tseries.asfreq(Minute())
That should provide an evenly spaced time series with 1 minute differences with NaNs as the series values where there is no data.