I was reading the question "Resample a DataFrame with different functions applied to each column?"
The solution was:
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
Say I want to add a non-existing column to the result that stores the value of some other function, say count(). In the example given, say I want to compute the number of rows in each 1H period.
Is it possible to do:
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
                          'new_column': count()})
Note, new_column is NOT an existing column in the original data frame.
The reason I ask is that I have a very large data frame and I don't want to resample the original df twice just to get the count in each resample period.
I'm trying the above right now and it seems to be taking a very long time (no syntax errors). I'm not sure whether Python is stuck in some sort of infinite loop.
Update:
I implemented the suggestion to use agg (thank you kindly for that).
However, I received the following error when computing the first aggregator:
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
     ...
/Users/blahblah/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in __getitem__(self, key)
521
522 def __getitem__(self, key):
--> 523 raise NotImplementedError('Not implemented: %s' % key)
524
525 def _make_wrapper(self, name):
NotImplementedError: Not implemented: True
The following works when I use grouped.apply(foo).
new_col1 = grp['col1'][grp['col1'] > 0].sum()
Resampling is similar to grouping with a TimeGrouper. While resample's how parameter only lets you specify one aggregator per column, the GroupBy object returned by df.groupby(...) has an agg method which can be passed various functions (e.g. mean, sum, or count) to aggregate the groups in various ways. You can use these results to build the desired DataFrame:
import datetime as DT
import numpy as np
import pandas as pd

np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)

resampled = df.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
print(resampled[['radiation', 'tamb']])
# radiation tamb
# 2012-04-05 08:00:00 279.432788 4.549235
# 2012-04-05 09:00:00 310.032188 4.414302
# 2012-04-05 10:00:00 257.504226 5.056613
# 2012-04-05 11:00:00 299.594032 4.652067
# 2012-04-05 12:00:00 8.109946 7.795668
def using_agg(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    return pd.DataFrame(
        {'radiation': grouped['radiation'].agg('sum'),
         'tamb': grouped['tamb'].agg('mean'),
         'new_column': grouped['tamb'].agg('count')})

print(using_agg(df))
yields
new_column radiation tamb
2012-04-05 08:00:00 60 279.432788 4.549235
2012-04-05 09:00:00 60 310.032188 4.414302
2012-04-05 10:00:00 60 257.504226 5.056613
2012-04-05 11:00:00 60 299.594032 4.652067
2012-04-05 12:00:00 1 8.109946 7.795668
Note my first answer suggested using groupby/apply:
def using_apply(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    result = grouped.apply(foo).unstack(-1)
    result = result.sortlevel(axis=1)
    return result[['radiation', 'tamb', 'new_column']]

def foo(grp):
    radiation = grp['radiation'].sum()
    tamb = grp['tamb'].mean()
    cnt = grp['tamb'].count()
    return pd.Series([radiation, tamb, cnt],
                     index=['radiation', 'tamb', 'new_column'])
It turns out that using apply here is much slower than using agg. If we benchmark using_agg versus using_apply on a 1681-row DataFrame:
np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 6, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)
then, using IPython's %timeit function, I find
In [83]: %timeit using_apply(df)
100 loops, best of 3: 16.9 ms per loop
In [84]: %timeit using_agg(df)
1000 loops, best of 3: 1.62 ms per loop
using_agg is significantly faster than using_apply and (based on additional
%timeit tests) the speed advantage in favor of using_agg grows as len(df)
grows.
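One way to check that the gap keeps widening is a quick scaling test. This is only a sketch, using the standard-library timeit module; the sizes are arbitrary, and it assumes using_agg, using_apply, and foo from above are already defined in the session:
import timeit

for n_days in (1, 7, 30):
    idx = pd.date_range(DT.datetime(2012, 4, 5), periods=n_days * 24 * 60,
                        freq='1min')
    big = pd.DataFrame({'radiation': np.random.sample(idx.size) * 10.0,
                        'tamb': np.random.sample(idx.size) * 10.0},
                       index=idx)
    t_agg = timeit.timeit(lambda: using_agg(big), number=10)
    t_apply = timeit.timeit(lambda: using_apply(big), number=10)
    # ratio > 1 means using_agg is faster; it should grow with len(big)
    print(len(big), round(t_apply / t_agg, 1))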
By the way, regarding
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
                          'new_column': count()})
besides the problem that the how dict does not accept non-existent column names, the parentheses in count() are problematic. The values in the how dict should be function objects. count is a function object, but count() is the value returned by calling count.
Since Python evaluates arguments before calling functions, count() is getting called before frame.resample(...), and the return value of count() is then associated with the key 'new_column' in the dict bound to the how parameter. That's not what you want.
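If your pandas version routes the how dict through agg (older versions did), one way to get a count without inventing a new column name is to pass a list of aggregators for an existing column. This is only a sketch, so check the resulting MultiIndex column labels on your version:
# np.mean and 'count' are passed as a function object and an aggregator name,
# never as call results like count()
result = frame.resample('1H', how={'radiation': np.sum,
                                   'tamb': [np.mean, 'count']})
# the row count would then live under a column such as ('tamb', 'count')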
Regarding the updated question: Precompute the values that you will need before calling groupby/agg:
Instead of
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
     ...
# NotImplementedError, since grouped['col1'] does not implement __getitem__
use
df['col1_pos'] = df['col1'].clip(lower=0)
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1_pos'].agg('sum')
     ...
See the bottom of this post for more on why pre-computation helps performance.
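A self-contained sketch of the precompute-then-aggregate pattern on toy data (the column and group names below are hypothetical, and the old TimeGrouper API is used to match the rest of this post):
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.date_range('2015-01-01', periods=120, freq='D')
toy = pd.DataFrame({'name1': np.random.choice(['a', 'b'], size=idx.size),
                    'col1': np.random.randn(idx.size)}, index=idx)

# precompute the conditional quantity as an ordinary column ...
toy['col1_pos'] = toy['col1'].clip(lower=0)

# ... so the groupby only needs plain aggregators
grouped = toy.groupby(['name1', pd.TimeGrouper('M')])
result = pd.DataFrame({'new_col1': grouped['col1_pos'].agg('sum'),
                       'cnt': grouped['col1'].agg('count')})
print(result)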
Related
I'm building a time series and trying to find a more efficient way to do this, ideally vectorized.
The pandas apply with a list comprehension step is very slow (on a big data set).
import datetime
import pandas as pd
# Dummy data:
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=4, freq='D')
categories = list(2*'A') + list(2*'B')
d = {'xdate': xdates, 'periods': [8]*2 + [2]*2, 'interval': [3]*2 + [12]*2}
df = pd.DataFrame(d,index=categories)
# This step is slow:
df['sdates'] = df.apply(lambda x: [x.xdate + pd.DateOffset(months=k*x.interval) for k in range(x.periods)], axis=1)
# This step is quite quick, but shown here for completeness
df = df.explode('sdates')
Maybe something like this:
df['sdates'] = [df.xdate + df.periods * [df.interval.astype('timedelta64[M]')]]
but the syntax isn't quite right.
This code
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdate'] = df.xdate + df.m_offsets * pd.DateOffset(months=1)
is, I think, similar to one of the answers, but the last step (the pd.DateOffset addition) gives a warning:
PerformanceWarning: Adding/subtracting array of DateOffsets to DatetimeArray not vectorized
I tried building something along the lines of one answer, but as mentioned the modular arithmetic needs a lot of tweaking to deal with edge cases, and I haven't figured that out yet (calendar.monthrange wasn't playing nicely).
This function doesn't run:
from calendar import monthrange

def add_months(df, date_col, n_col):
    """Adds n_col months to date_col."""
    z = df.copy()
    # calculate new year/month/day and convert to datetime
    z['year'] = (z[date_col].dt.year * 12 + (z[date_col].dt.month - 1) + z[n_col]) // 12
    z['month'] = ((z[date_col].dt.month + z[n_col] - 1) % 12) + 1
    x, x = monthrange(z.year, z.month)
    z['days_in_month'] = monthrange(z.year, z.month)
    z['target_day'] = z[date_col].dt.day
    # z['day'] = min(z.target_day, z.days_in_month)
    z['day'] = z.days_in_month
    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])
    return z['sdates']
This works, for now, but the DateOffset step is really heavy:
df = pd.DataFrame(d,index=categories)
df['m_offsets'] = df.interval.apply(lambda x: list(range(0, 72, x)))
df = df.explode('m_offsets')
df['sdates'] = df.apply(lambda x: x.xdate + pd.DateOffset(months=x.m_offsets), axis=1)
Here's one option. You're adding months, so we can actually calculate new year/month/day by only dealing with integers in a vectorized way, and then create datetime from these y/m/d combinations:
def f_proposed(df):
    z = df.copy()
    z = z.reset_index()

    # repeat xdate as many times as the number of periods
    z = z.loc[np.repeat(z.index, z['periods'])]

    # calculate k number of months to add
    z['k'] = z.groupby(level=0).cumcount() * z['interval']

    # calculate new year/month/day and convert to datetime
    z['year'] = (z['xdate'].dt.year * 12 + z['xdate'].dt.month - 1 + z['k']) // 12
    z['month'] = (z['xdate'].dt.month - 1 + z['k']) % 12 + 1

    # clip day to days_in_month
    z['days_in_month'] = pd.to_datetime(
        z['year'].astype(str) + '-' + z['month'].astype(str) + '-01').dt.days_in_month
    z['day'] = np.clip(z['xdate'].dt.day, 0, z['days_in_month'])

    z['sdates'] = pd.to_datetime(z[['year', 'month', 'day']])

    # drop temporary columns
    z = z.set_index('index').drop(columns=['k', 'year', 'month', 'day', 'days_in_month'])
    return z
To compare performance with the original, I've generated a test dataset with 10,000 rows.
Here are my timings (~23x speedup for 10K rows):
%timeit f_proposed(z)
82.7 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f_original(z)
1.92 s ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
P.S. For 170K it takes about 1.39s with f_proposed and 33.6 s with f_original on my machine
Semi-vectorized way
As I say below, I don't think there is a purely vectorized way to add a variable, general DateOffset to a Series of Timestamps. The solution by @perl works in the case where the DateOffset is an exact multiple of one month.
Now, adding a single constant DateOffset is vectorized, so we can use the following. It capitalizes on the fact that there is a limited set of distinct values for the date offset. It is also relatively fast, and it is correct for any DateOffset and dates:
n = df['periods'].values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())

z = pd.DataFrame(
    np.repeat(df.reset_index().values, repeats=n, axis=0),
    columns=df.reset_index().columns,
).set_index('index')
z = z.assign(madd=period_no * z['interval'])

z['sdates'] = z['xdate']
for madd in set(z['madd'].unique()):
    z.loc[z['madd'] == madd, 'sdates'] += pd.DateOffset(months=madd)
Timing:
# modified large dummy data:
N = 170_000
todays_date = datetime.datetime.now().date()
xdates = pd.date_range(todays_date-datetime.timedelta(10), periods=N, freq='H')
categories = np.random.choice(list('ABCDE'), N)
d = {'xdate': xdates, 'periods': np.random.randint(1,10,N), 'interval': np.random.randint(1,12,N)}
df = pd.DataFrame(d,index=categories)
%%time (the above)
CPU times: user 3.49 s, sys: 13.5 ms, total: 3.51 s
Wall time: 3.51 s
(Note: for 10K rows using the generation above, I see times of ~240ms, but of course it is dependent on how many distinct month offsets you have in your data).
Example result (for one draw of 170K rows as per above):
>>> z.tail()
xdate periods interval madd sdates
index
B 2040-08-25 06:00:00 8 8 48 2044-08-25 06:00:00
B 2040-08-25 06:00:00 8 8 56 2045-04-25 06:00:00
D 2040-08-25 07:00:00 3 2 0 2040-08-25 07:00:00
D 2040-08-25 07:00:00 3 2 2 2040-10-25 07:00:00
D 2040-08-25 07:00:00 3 2 4 2040-12-25 07:00:00
Correction on the initial answer
I stand corrected: my original answer is not vectorized either. The first part, exploding the DataFrame and building the number of months to add, is vectorized and very fast. But the second part, adding a DateOffset of a variable number of months, is not.
I hope I am wrong, but I don't think there is currently a way to do that second part in a vectorized way.
Direct date-parts manipulation (e.g. month = (month - 1 + n_months) % 12 + 1, etc.) is bound to fail for corner cases (e.g. '2021-02-31'). Short of replicating the logic used in DateOffset, it is not going to work for certain cases.
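For instance, DateOffset clips to the last valid day of the target month, which naive arithmetic on the day field would get wrong:
import pandas as pd

# DateOffset clips 2021-01-31 + 1 month down to 2021-02-28
print(pd.Timestamp('2021-01-31') + pd.DateOffset(months=1))
# naively keeping day=31 while bumping the month to 2 would produce the invalid '2021-02-31'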
Initial answer
Here is a vectorized way:
n = df.periods.values
period_no = np.repeat(n - n.cumsum(), n) + np.arange(n.sum())

z = pd.DataFrame(
    np.repeat(df.reset_index().values, repeats=n, axis=0),
    columns=df.reset_index().columns,
).set_index('index').assign(period_no=period_no)

z['sdates'] = z['period_no'] * z['interval'] * pd.DateOffset(months=1) + z['xdate']
I am using Python 2.7. I am looking to calculate compounding returns from daily returns, and my current code is pretty slow, so I was looking for areas where I could gain efficiency.
What I want to do is pass two dates and a security into a price table and calculate the compounding return between those dates for the given security.
I have a price table (prices_df):
security_id px_last asof
1 3.055 2015-01-05
1 3.360 2015-01-06
1 3.315 2015-01-07
1 3.245 2015-01-08
1 3.185 2015-01-09
I also have a table with two dates and security (events_df):
asof disclosed_on security_ref_id
2015-01-05 2015-01-09 16:31:00 1
2018-03-22 2018-03-27 16:33:00 3616
2017-08-03 2018-03-27 12:13:00 2591
2018-03-22 2018-03-27 11:33:00 3615
2018-03-22 2018-03-27 10:51:00 3615
Using the two dates in this table, I want to use the price table to calculate the returns.
The two functions I am using:
import pandas as pd

# compounds returns
def cum_rtrn(df):
    df_out = df.add(1).cumprod()
    df_out['return'].iat[0] = 1
    return df_out

# calculates compound returns from prices between two dates
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    df['return'] = df.px_last.pct_change()
    df = df[['return']]
    df = cum_rtrn(df)
    return df.iloc[-1][0]
I then iterate over events_df with .iterrows, passing each row to calc_comp_returns. However, this is a very slow process, as I have 10K+ iterations, so I am looking for improvements. The solution does not need to be based on pandas.
# example of how function is called
start = datetime.datetime.strptime('2015-01-05', '%Y-%m-%d').date()
end = datetime.datetime.strptime('2015-01-09', '%Y-%m-%d').date()
calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)
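The iteration over events_df is not shown in the question, but it presumably looks roughly like this (a sketch reconstructing the loop described above):
results = []
for _, row in events_df.iterrows():
    results.append(calc_comp_returns(prices_df,
                                     start_date=row['asof'],
                                     end_date=row['disclosed_on'],
                                     security=row['security_ref_id']))
events_df['return'] = results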
Here is a solution (about 100x faster on my computer with some dummy data).
import numpy as np

price_df = price_df.set_index('asof')

def calc_comp_returns_fast(price_df, start_date, end_date, security):
    rows = price_df[price_df.security_id == security].loc[start_date:end_date]
    changes = rows.px_last.pct_change()
    comp_rtrn = np.prod(changes + 1)
    return comp_rtrn
Or, as a one-liner:
def calc_comp_returns_fast(price_df, start_date, end_date, security):
    return np.prod(price_df[price_df.security_id == security].loc[start_date:end_date].px_last.pct_change() + 1)
Note that I call the set_index method beforehand; it only needs to be done once on the entire price_df dataframe.
It is faster because it does not recreate DataFrames at each step. In your code, df is overwritten at almost every line by a new dataframe. Both the initialization and the garbage collection (erasing unused data from memory) take a lot of time.
In my code, rows is a slice or "view" of the original data; it does not need to copy or re-initialize any object. Also, I used the numpy product function directly, which is the same as taking the last cumprod element (pandas uses np.cumprod internally anyway).
Suggestion: if you are using IPython, Jupyter, or Spyder, you can use the magic %prun calc_comp_returns(...) to see which part takes the most time. I ran it on your code, and it was the garbage collector, using more than 50% of the total running time!
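Outside IPython, the standard-library profiler gives the same breakdown (a sketch; substitute your own arguments):
import cProfile

cProfile.run(
    "calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)",
    sort="cumtime",  # show the most expensive call chains first
)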
I'm not very familiar with pandas, but I'll give this a shot.
Problem with your solution
Your solution currently does a huge amount of unnecessary calculation. This is mostly due to the line:
df['return'] = df.px_last.pct_change()
This line actually calculates the percent change for every date between start and end. Just fixing this issue should give you a huge speed-up. You should just get the start price and the end price and compare the two; the prices in between are completely irrelevant to your calculation. Again, my familiarity with pandas is nil, but you should do something like this instead:
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    # the compound return over the window reduces to end price / start price
    return 1 + (df['px_last'].iloc[-1] - df['px_last'].iloc[0]) / df['px_last'].iloc[0]
Remember that this code relies on the fact that price_df is sorted by date, so be careful to make sure you only pass calc_comp_returns a date-sorted price_df.
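If you are not sure the table is already sorted, sorting it once up front is cheap compared with the per-event work (a one-line sketch, keeping dates ordered within each security):
price_df = price_df.sort_values(['security_id', 'asof'])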
We'll use pd.merge_asof to grab prices from prices_df. However, when we do, we'll need to have relevant dataframes sorted by the date columns we are utilizing. Also, for convenience, I'll aggregate some pd.merge_asof parameters in dictionaries to be used as keyword arguments.
prices_df = prices_df.sort_values(['asof'])
aed = events_df.sort_values('asof')
ded = events_df.sort_values('disclosed_on')
aokw = dict(
left_on='asof', right_on='asof',
left_by='security_ref_id', right_by='security_id'
)
start_price = pd.merge_asof(aed, prices_df, **aokw).px_last
dokw = dict(
left_on='disclosed_on', right_on='asof',
left_by='security_ref_id', right_by='security_id'
)
end_price = pd.merge_asof(ded, prices_df, **dokw).px_last
returns = end_price.div(start_price).sub(1).rename('return')
events_df.join(returns)
asof disclosed_on security_ref_id return
0 2015-01-05 2015-01-09 16:31:00 1 0.040816
1 2018-03-22 2018-03-27 16:33:00 3616 NaN
2 2017-08-03 2018-03-27 12:13:00 2591 NaN
3 2018-03-22 2018-03-27 11:33:00 3615 NaN
4 2018-03-22 2018-03-27 10:51:00 3615 NaN
My data is organized in multi-index dataframes. I am trying to groupby the "Sweep" index and return both the min (or max) in a specific time range, along with the time at which that peak occurs.
Data looks like:
Time Primary Secondary BL LED
Sweep
Sweep1 0 0.00000 -28173.828125 -0.416565 -0.000305
1 0.00005 -27050.781250 -0.416260 0.000305
2 0.00010 -27490.234375 -0.415955 -0.002441
3 0.00015 -28222.656250 -0.416260 0.000305
4 0.00020 -28759.765625 -0.414429 -0.002136
Getting the min or max is very straightforward.
def find_groupby_peak(voltage_df, start_time, end_time, peak="min"):
    boolean_vr = (voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)
    df_subset = voltage_df[boolean_vr]
    grouped = df_subset.groupby(level="Sweep")
    if peak == "min":
        peak = grouped.Primary.min()
    elif peak == "max":
        peak = grouped.Primary.max()
    return peak
Which gives (partial output):
Sweep
Sweep1 -92333.984375
Sweep10 -86523.437500
Sweep11 -85205.078125
Sweep12 -87109.375000
Sweep13 -77929.687500
But I need the time where those peaks occur as well. I know I could iterate over the output and find where in the original dataset those values occur, but that seems like a rather brute-force way to do it. I also could write a different function to apply to the grouped object that returns both the max and the time where that max occurs (at least in theory; I haven't tried this, but I assume it's pretty straightforward).
Other than those two options, is there a simpler way to pass the outputs from grouped.Primary.min() (i.e. the peak values) to return where in Time those values occur?
You could consider using the transform function with groupby. If you had data that looked a bit like this:
import pandas as pd

sweep = ["sweep1", "sweep1", "sweep1", "sweep1",
         "sweep2", "sweep2", "sweep2", "sweep2",
         "sweep3", "sweep3", "sweep3", "sweep3",
         "sweep4", "sweep4", "sweep4", "sweep4"]
Time = [0.009845, 0.002186, 0.006001, 0.00265,
        0.003832, 0.005627, 0.002625, 0.004159,
        0.00388, 0.008107, 0.00813, 0.004813,
        0.003205, 0.003225, 0.00413, 0.001202]
Primary = [-2832.013203, -2478.839133, -2100.671551, -2057.188346,
           -2605.402055, -2030.195497, -2300.209967, -2504.817095,
           -2865.320903, -2456.0049, -2542.132906, -2405.657053,
           -2780.140743, -2351.743053, -2232.340363, -2820.27356]
s_count = [0, 1, 2, 3,
           0, 1, 2, 3,
           0, 1, 2, 3,
           0, 1, 2, 3]

df = pd.DataFrame({'Time': Time,
                   'Primary': Primary}, index=[sweep, s_count])
Then you could write a very simple transform function that will return, for each group of data (grouped by the sweep index), the row at which the minimum value of 'Primary' is located. You would do this with simple boolean slicing. It would look like this:
def trans_function(df):
    return df[df.Primary == min(df.Primary)]
Then to use this function simply call it inside the transform method:
df.groupby(level = 0).transform(trans_function)
And that gives me the following output:
Primary Time
sweep1 0 -2832.013203 0.009845
sweep2 0 -2605.402055 0.003832
sweep3 0 -2865.320903 0.003880
sweep4 3 -2820.273560 0.001202
Obviously you could incorporate that into your function that is acting on some subset of the data, if that is what you require.
As an alternative, you could index the group by using the argmin() function. I tried to do this with transform, but it was just returning the entire dataframe. I'm not sure why that should be; it does, however, work with apply:
def trans_function2(df):
    return df.loc[df['Primary'].argmin()]

df.groupby(level = 0).apply(trans_function2)
That again gives me:
Primary Time
sweep1 -2832.013203 0.009845
sweep2 -2605.402055 0.003832
sweep3 -2865.320903 0.003880
sweep4 -2820.273560 0.001202
I'm not totally sure why this function does not work with transform - perhaps someone will enlighten us.
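Tying this back to the function in the question, the apply/argmin idea could be folded into find_groupby_peak along these lines. This is a sketch that assumes the same column and index names as in the question and uses idxmin/idxmax, the label-returning equivalents:
def find_groupby_peak_with_time(voltage_df, start_time, end_time, peak="min"):
    """Return the peak Primary value per sweep together with the Time at which it occurs."""
    in_window = (voltage_df.Time >= start_time) & (voltage_df.Time <= end_time)
    df_subset = voltage_df[in_window]
    grouped = df_subset.groupby(level="Sweep")["Primary"]
    # idxmin/idxmax give the index labels of the peak rows ...
    labels = grouped.idxmin() if peak == "min" else grouped.idxmax()
    # ... which pull both the peak value and its Time in one .loc call
    return df_subset.loc[labels, ["Time", "Primary"]]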
I do not know if this will work with your multi-index frame, but it is worth a try; working with:
>>> df
tag tick val
z C 2014-09-07 32
y C 2014-09-08 67
x A 2014-09-09 49
w A 2014-09-10 80
v B 2014-09-11 51
u B 2014-09-12 25
t C 2014-09-13 22
s B 2014-09-14 8
r A 2014-09-15 76
q C 2014-09-16 4
find the indexer using idxmax and then use .loc:
>>> i = df.groupby('tag')['val'].idxmax()
>>> df.loc[i]
tag tick val
w A 2014-09-10 80
v B 2014-09-11 51
y C 2014-09-08 67
I am using Python 2.7 and keep getting the below error. Please let me know if you need the full code but it is a bit long. Thank you for your help.
Warning (from warnings module):
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 3619
FutureWarning)
FutureWarning: TimeSeries broadcasting along DataFrame index by default is deprecated.
Please use DataFrame.<op> to explicitly broadcast arithmetic operations along the index
Here is the Portfolio class:
class Portfolio(object):
    """An abstract base class representing a portfolio of
    positions (including both instruments and cash), determined
    on the basis of a set of signals provided by a Strategy."""

    __metaclass__ = abc.ABCMeta

    @abc.abstractmethod
    def generate_positions(self):
        raise NotImplementedError("Should implement generate_positions()!")

    @abc.abstractmethod
    def backtest_portfolio(self):
        raise NotImplementedError("Should implement backtest_portfolio()!")
Here is the code that is causing the issue (in the if __name__ == "__main__" block):
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from pandas.io.data import DataReader
from backtest import Strategy, Portfolio


class MovingAverageCrossStrategy(Strategy):
    def __init__(self, symbol, bars, short_window=8, long_window=50):
        self.symbol = symbol
        self.bars = bars
        self.short_window = short_window
        self.long_window = long_window

    def generate_signals(self):
        signals = pd.DataFrame(index=self.bars.index)
        signals['signal'] = 0.0

        # Create the set of short and long simple moving averages over the
        # respective periods
        signals['short_mavg'] = pd.rolling_mean(bars['Close'], self.short_window, min_periods=1)
        signals['long_mavg'] = pd.rolling_mean(bars['Close'], self.long_window, min_periods=1)

        # Create a 'signal' (invested or not invested) when the short moving average crosses the long
        # moving average, but only for the period greater than the shortest moving average window
        signals['signal'][self.short_window:] = np.where(
            signals['short_mavg'][self.short_window:]
            > signals['long_mavg'][self.short_window:], 1.0, 0.0)

        # Take the difference of the signals in order to generate actual trading orders
        signals['positions'] = signals['signal'].diff()
        return signals


class MarketOnClosePortfolio(Portfolio):
    def __init__(self, symbol, bars, signals, initial_capital=100000.0):
        self.symbol = symbol
        self.bars = bars
        self.signals = signals
        self.initial_capital = float(initial_capital)
        self.positions = self.generate_positions()

    def generate_positions(self):
        positions = pd.DataFrame(index=signals.index).fillna(0.0)
        positions[self.symbol] = 100*signals['signal']  # This strategy buys 100 shares
        return positions

    def backtest_portfolio(self):
        portfolio = self.positions*self.bars['Close']
        pos_diff = self.positions.diff()
        portfolio['holdings'] = (self.positions*self.bars['Close']).sum(axis=1)
        portfolio['cash'] = self.initial_capital - (pos_diff*self.bars['Close']).sum(axis=1).cumsum()
        portfolio['total'] = portfolio['cash'] + portfolio['holdings']
        portfolio['returns'] = portfolio['total'].pct_change()
        return portfolio


if __name__ == "__main__":
    # Obtain daily bars of stock from Yahoo Finance for the period
    # 1st Jan 1990 to 1st Jan 2014 - This is an example from ZipLine
    symbol = 'AAPL'
    bars = DataReader(symbol, "yahoo", datetime.datetime(1990,1,1), datetime.datetime(2014,1,1))

    # Create a Moving Average Cross Strategy instance with a short moving
    # average window of 8 days and a long window of 50 days
    mac = MovingAverageCrossStrategy(symbol, bars, short_window=8, long_window=50)
    signals = mac.generate_signals()

    # Create a portfolio of stock, with $100,000 initial capital
    portfolio = MarketOnClosePortfolio(symbol, bars, signals, initial_capital=100000.0)
    returns = portfolio.backtest_portfolio()
Without being able to run your code, it is difficult to point to the exact reason, but take for example this line in backtest_portfolio:
portfolio = self.positions*self.bars['Close']
Suppose self.positions is a DataFrame and self.bars['Close'] is a Series (in this case a column of a DataFrame, which is returned as a Series). I'll try to explain the issue with a toy example.
First generating a dataframe and a series (with a datetimeindex):
In [3]: idx = pd.date_range('2012-01-01', periods=3)
In [5]: df = pd.DataFrame({'A':[1,2,3], 'B':[10,20,30]}, index=idx)
In [6]: df
Out[6]:
A B
2012-01-01 1 10
2012-01-02 2 20
2012-01-03 3 30
In [7]: s = pd.Series([1,2,3], index=idx)
In [8]: s
Out[8]:
2012-01-01 1
2012-01-02 2
2012-01-03 3
Freq: D, dtype: int64
Now if we multiply both, we will get the warning you noticed:
In [10]: df * s
/home/joris/scipy/pandas-np16/pandas/core/frame.py:2920: FutureWarning: TimeSeries
broadcasting along DataFrame index by default is deprecated. Please use DataFrame.<op>
to explicitly broadcast arithmetic operations along the index
FutureWarning)
Out[10]:
A B
2012-01-01 1 10
2012-01-02 4 40
2012-01-03 9 90
This is because of what I mentioned in the comments and is explained here: http://pandas.pydata.org/pandas-docs/stable/dsintro.html?highlight=broadcasting#data-alignment-and-arithmetic. Normally, when a dataframe and a series are multiplied, the series is broadcast over the columns, while for a time series (a Series with a DatetimeIndex) it is broadcast over the rows. But this behaviour is deprecated.
So instead you should use the equivalent method, as the warning advises. In the case of a multiplication:
In [13]: df.mul(s, axis=0)
Out[13]:
A B
2012-01-01 1 10
2012-01-02 4 40
2012-01-03 9 90
So for each operator (+, *, <, >, /, etc.) there is an equivalent method. See here for the list of methods: http://pandas.pydata.org/pandas-docs/stable/api.html#id4
To show what is meant by 'broadcast over the columns', here is another example:
In [14]: s2 = pd.Series([10, 100], index=['A', 'B'])
In [15]: s2
Out[15]:
A 10
B 100
dtype: int64
In [16]: df * s2
Out[16]:
A B
2012-01-01 10 1000
2012-01-02 20 2000
2012-01-03 30 3000
So as you can see, each element of the series is matched with one column, and the whole column is then multiplied by that value, while in the time-series case each element of the series was matched with a row.
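Applied to the code in the question, the fix is to replace the bare * with the explicit method. A toy sketch mirroring self.positions * self.bars['Close']:
import pandas as pd

idx = pd.date_range('2012-01-01', periods=3)
positions = pd.DataFrame({'AAPL': [100.0, 100.0, 0.0]}, index=idx)  # toy positions
close = pd.Series([10.0, 11.0, 12.0], index=idx)                    # toy closing prices

# instead of `positions * close`, broadcast the Series along the index explicitly
portfolio = positions.mul(close, axis=0)
print(portfolio)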
Code
import pandas as pd
import numpy as np

dates = pd.date_range('20140301', periods=6)
id_col = np.array([[0, 1, 2, 0, 1, 2]])
data_col = np.random.randn(6, 4)
data = np.concatenate((id_col.T, data_col), axis=1)
df = pd.DataFrame(data, index=dates, columns=list('IABCD'))
print df

print "before groupby:"
for index in df.index:
    if not index.freq:
        print "key:%f, no freq:%s" % (key, index)

print "after groupby:"
gb = df.groupby('I')
for key, group in gb:
    #group = group.resample('1D', how='first')
    for index in group.index:
        if not index.freq:
            print "key:%f, no freq:%s" % (key, index)
The output:
I A B C D
2014-03-01 0 0.129348 1.466361 -0.372673 0.045254
2014-03-02 1 0.395884 1.001859 -0.892950 0.480944
2014-03-03 2 -0.226405 0.663029 0.355675 -0.274865
2014-03-04 0 0.634661 0.535560 1.027162 1.637099
2014-03-05 1 -0.453149 -0.479408 -1.329372 -0.574017
2014-03-06 2 0.603972 0.754232 0.692185 -1.267217
[6 rows x 5 columns]
before groupby:
after groupby:
key:0.000000, no freq:2014-03-01 00:00:00
key:0.000000, no freq:2014-03-04 00:00:00
key:1.000000, no freq:2014-03-02 00:00:00
key:1.000000, no freq:2014-03-05 00:00:00
key:2.000000, no freq:2014-03-03 00:00:00
key:2.000000, no freq:2014-03-06 00:00:00
But after I uncomment the statement:
#group = group.resample('1D', how='first')
there seems to be no problem. The thing is, when I run this on a large dataset with some operations on the timestamps, there is always an error: "cannot add integral value to timestamp without offset". Is this a bug, or did I miss something?
You are treating a groupby object as a DataFrame.
It is like a dataframe, but requires apply to generate a new structure (either reduced or an actual DataFrame).
The idiom is:
df.groupby(....).apply(some_function)
Doing something like: df.groupby(...).sum() is syntactic sugar for using apply. Functions which are naturally applicable to using this kind of sugar are enabled; otherwise they will raise an error.
In particular, you are accessing group.index, which can be, but is not guaranteed to be, a DatetimeIndex (when time grouping). The freq attribute of a DatetimeIndex is inferred when required (via inferred_freq).
Your code is very confusing: you are grouping, then resampling; resample does this for you, so you don't need the former step at all.
resample is the de facto equivalent of a groupby-apply (but has special handling for the time domain).
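For example, the commented-out resample in the question already fits the groupby/apply idiom (a sketch using the old how= API to match the question's pandas version):
daily_first = df.groupby('I').apply(lambda g: g.resample('1D', how='first'))
print daily_first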