Code
import pandas as pd
import numpy as np
dates = pd.date_range('20140301',periods=6)
id_col = np.array([[0, 1, 2, 0, 1, 2]])
data_col = np.random.randn(6,4)
data = np.concatenate((id_col.T, data_col), axis=1)
df = pd.DataFrame(data, index=dates, columns=list('IABCD'))
print df
print "before groupby:"
for index in df.index:
    if not index.freq:
        print "no freq: %s" % index
print "after groupby:"
gb = df.groupby('I')
for key, group in gb:
    #group = group.resample('1D', how='first')
    for index in group.index:
        if not index.freq:
            print "key:%f, no freq:%s" % (key, index)
The output:
I A B C D
2014-03-01 0 0.129348 1.466361 -0.372673 0.045254
2014-03-02 1 0.395884 1.001859 -0.892950 0.480944
2014-03-03 2 -0.226405 0.663029 0.355675 -0.274865
2014-03-04 0 0.634661 0.535560 1.027162 1.637099
2014-03-05 1 -0.453149 -0.479408 -1.329372 -0.574017
2014-03-06 2 0.603972 0.754232 0.692185 -1.267217
[6 rows x 5 columns]
before groupby:
after groupby:
key:0.000000, no freq:2014-03-01 00:00:00
key:0.000000, no freq:2014-03-04 00:00:00
key:1.000000, no freq:2014-03-02 00:00:00
key:1.000000, no freq:2014-03-05 00:00:00
key:2.000000, no freq:2014-03-03 00:00:00
key:2.000000, no freq:2014-03-06 00:00:00
But after I uncomment the statement:
#group = group.resample('1D', how='first')
there seems to be no problem. The thing is, when I run this on a large dataset with some operations on the timestamps, I always get the error "cannot add integral value to timestamp without offset". Is it a bug, or did I miss something?
You are treating a groupby object as a DataFrame.
It is like a dataframe, but requires apply to generate a new structure (either reduced or an actual DataFrame).
The idiom is:
df.groupby(....).apply(some_function)
Doing something like df.groupby(...).sum() is syntactic sugar for using apply. Functions which naturally lend themselves to this kind of sugar are enabled; anything else will raise an error.
In particular, you are accessing group.index, which can be, but is not guaranteed to be, a DatetimeIndex (when grouping by time). The freq attribute of a DatetimeIndex is inferred when required (via inferred_freq).
Your code is confusing: you are grouping, then resampling; resample does this for you, so you don't need the former step at all.
resample is the de facto equivalent of a groupby-apply (but has special handling for the time domain).
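For example, a minimal sketch under the question's setup (using the how= syntax from that pandas version; newer versions spell it df.resample('1D').first()):
# resample directly in the time domain instead of looping over groups by hand
daily = df.resample('1D', how='first')

# if a per-'I' aggregation is still wanted, express it as one groupby-apply
per_id = df.groupby('I').apply(lambda g: g.resample('1D', how='first'))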
Related
I have a Pandas dataframe df that looks as follows:
created_time action_time
2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
import pandas as pd
df = pd.DataFrame({'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600'],
'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600']})
I then create another column which represents the difference in minutes between these two columns:
df['elapsed_time'] = (pd.to_datetime(df['action_time']) - pd.to_datetime(df['created_time'])).dt.total_seconds() / 60
df['elapsed_time']
elapsed_time
74.114533
3.149783
219.853917
We assume that "action" can only take place during business hours (which we assume to start at 8:30am).
I would like to create another column named created_time_adjusted, which adjusts the created_time to 08:30am if the created_time is before 08:30am.
I can parse out the date and time string that I need, as follows:
df['elapsed_time'] = pd.to_datetime(df['created_time']).dt.date.astype(str) + 'T08:30:00.000-0600'
But, this doesn't deal with the conditional.
I'm aware of a few ways that I might be able to do this:
replace
clip
np.where
loc
What is the best (and least hacky) way to accomplish this?
Thanks!
First of all, I think your life would be easier if you convert the columns to datetime dtypes from the get-go. Then it's just a matter of running an apply op on the 'created_time' column.
df.created_time = pd.to_datetime(df.created_time)
df.action_time = pd.to_datetime(df.action_time)
df.elapsed_time = df.action_time-df.created_time
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted'] = df.created_time.apply(
    lambda x: x.replace(hour=8, minute=30, second=0)
              if x.time() < time_threshold else x)
Output:
>>> df
created_time action_time created_time_adjusted
0 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00
1 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00
2 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00
from datetime import timedelta

df['created_time'] = pd.to_datetime(df['created_time'])  # coerce to datetime
# isolate the rows earlier than 08:30 into their own frame
df1 = df.set_index(df['created_time']).between_time('00:00:00', '08:30:00', include_end=False)
# adjust those times to 08:30
df1['created_time'] = df1['created_time'].dt.normalize() + timedelta(hours=8, minutes=30, seconds=0)
# knit the before-08:30 and after-08:30 rows back together
df2 = df1.append(df.set_index(df['created_time']).between_time('08:30:00', '00:00:00', include_end=False)).reset_index(drop=True)
df2
I have the following analysis.py file. The function group_analysis shifts the datetime index of df_input by the Count column of df_input:
# analysis.py
import pandas as pd
def group_analysis(df_input):
    df_input.index = df_input.index - pd.to_timedelta(df_input.Count, unit='days')
    df_output = df_input.sort_index()
    return df_output

def test(df):
    df = df + 1
    return df
And I have the following dataframe:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.arange(1, 14), index=pd.date_range('2020-01-01', periods=13, freq='D'), columns=['Count'])
Count
2020-01-01 1
2020-01-02 2
2020-01-03 3
2020-01-04 4
2020-01-05 5
2020-01-06 6
2020-01-07 7
2020-01-08 8
2020-01-09 9
2020-01-10 10
2020-01-11 11
2020-01-12 12
2020-01-13 13
When I run the following code,
import analysis
y = analysis.group_analysis(x)
the datetime index of both x and y is changed (and so x.equals(y) is True). Why does group_analysis change both the input and the output datetime index? And how can I make it change only the datetime index of y (but not x)?
However, when running the following code, x does not change (so x.equals(y) is False):
import analysis
y = analysis.test(x)
EDIT: analysis.test(df) is added.
The reason for this behaviour is that when calling group_analysis you are not passing a copy of the dataframe to the function, but rather a reference to the original data in memory. Therefore, if you modify the data behind that reference, the original data (which is the same object) is also modified.
For a very good explanation refer to https://robertheaton.com/2014/02/09/pythons-pass-by-object-reference-as-explained-by-philip-k-dick/.
To prevent this, create a copy of the data when you enter the function:
...
def group_analysis(df):
df_input = df.copy()
...
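A minimal sketch of what group_analysis could look like with that copy in place (same logic as in the question, just operating on the copy):

# analysis.py -- sketch assuming the same Count column as in the question
import pandas as pd

def group_analysis(df):
    df_input = df.copy()  # work on a copy so the caller's frame is untouched
    df_input.index = df_input.index - pd.to_timedelta(df_input.Count, unit='days')
    return df_input.sort_index()

Calling y = analysis.group_analysis(x) then leaves x unchanged.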
When you pass a dataframe to a function, it passes the dataframe reference. So any in-place change you make to the dataframe inside the function will be reflected in the passed dataframe.
But in the case of your test function, the addition returns a new copy of the dataframe in memory. How do I know that? Just print the memory reference (id) of the variable before and after the operation.
>>> def test(df):
... print(id(df))
... df = df + 1
... print(id(df))
... return df
...
>>> test(df)
139994174011920
139993943207568
Notice the change? This means its reference has been changed. Hence not affecting the original dataframe.
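By contrast, here is a quick sketch of the in-place case (mirroring the index shift from group_analysis; assuming pandas is imported as pd): the id is the same before and after, which is why the caller's x is modified.

>>> def modify_inplace(df):
...     print(id(df))
...     df.index = df.index - pd.to_timedelta(df['Count'], unit='days')
...     print(id(df))   # same id as above: still the caller's object
...
>>> modify_inplace(x)   # x itself is changed after this call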
As the title suggests, I'm not even sure how to word the question. :D
But here it is, "simply put":
I) I would like to create a df on day x, and II) from the next day onwards (x+1 ... x+n) I would like to update just day x+n without touching the first part (I) of creating the df, and all that by calling only one function. So basically "just" appending the row for the day the function is called (there is no need to "recreate" the df since it is already there). Is there a possibility to do that all in one statement?
It would look something like this:
import pandas as pd
def pull_data():
    data = {'DATE': ['2020-05-01', '2020-05-02', '2020-05-03', '2020-05-04'],
            'X': [400, 300, 200, 100],
            'Y': [100, 200, 300, 400]}
    df = pd.DataFrame(data, columns=['DATE', 'X', 'Y'])
    return df
data_ = pull_data()
Let's say I call this function on 2020-05-04 --> but now on the next day I want it to automatically ONLY attach 2020-05-05 without creating the whole data frame again.
Does my whole question make any sense/is it comprehensible? I'd be happy about every input! :)
Based on the dataframe and its integer index, you can append a row at the position given by the dataframe's shape, using loc:
from datetime import datetime
data_ = pull_data()
value_X = 0
value_Y = 1
data_.loc[data_.shape[0]] = [datetime.now().date(), value_X, value_Y]
data_
# DATE X Y
# 0 2020-05-01 400 100
# 1 2020-05-02 300 200
# 2 2020-05-03 200 300
# 3 2020-05-04 100 400
# 4 2020-05-06 0 1
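If the goal is a single function you can call each day, a hedged sketch along those lines (the helper name update_data and the placeholder values are mine, not part of the answer):

from datetime import datetime

def update_data(df, value_X, value_Y):
    # append today's row to a frame built earlier by pull_data(); assumes the default RangeIndex
    today = datetime.now().date().isoformat()
    if today not in df['DATE'].values:   # only add the row once per day
        df.loc[df.shape[0]] = [today, value_X, value_Y]
    return df

data_ = update_data(data_, 0, 1)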
I am using python 2.7. I am looking to calculate compounding returns from daily returns and my current code is pretty slow at calculating returns, so I was looking for areas where I could gain efficiency.
What I want to do is pass two dates and a security into a price table and calculate the compounding returns between those dates for the given security.
I have a price table (prices_df):
security_id px_last asof
1 3.055 2015-01-05
1 3.360 2015-01-06
1 3.315 2015-01-07
1 3.245 2015-01-08
1 3.185 2015-01-09
I also have a table with two dates and security (events_df):
asof disclosed_on security_ref_id
2015-01-05 2015-01-09 16:31:00 1
2018-03-22 2018-03-27 16:33:00 3616
2017-08-03 2018-03-27 12:13:00 2591
2018-03-22 2018-03-27 11:33:00 3615
2018-03-22 2018-03-27 10:51:00 3615
Using the two dates in this table, I want to use the price table to calculate the returns.
The two functions I am using:
import pandas as pd
# compounds returns
def cum_rtrn(df):
    df_out = df.add(1).cumprod()
    df_out['return'].iat[0] = 1
    return df_out

# calculates compound returns from prices between two dates
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    df['return'] = df.px_last.pct_change()
    df = df[['return']]
    df = cum_rtrn(df)
    return df.iloc[-1][0]
I then iterate over events_df with .iterrows, calling calc_comp_returns each time. However, this is a very slow process as I have 10K+ iterations, so I am looking for improvements. The solution does not need to be based on pandas.
# example of how the function is called
import datetime

start = datetime.datetime.strptime('2015-01-05', '%Y-%m-%d').date()
end = datetime.datetime.strptime('2015-01-09', '%Y-%m-%d').date()
calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)
Here is a solution (100x faster on my computer with some dummy data).
import numpy as np
price_df = price_df.set_index('asof')
def calc_comp_returns_fast(price_df, start_date, end_date, security):
    rows = price_df[price_df.security_id == security].loc[start_date:end_date]
    changes = rows.px_last.pct_change()
    comp_rtrn = np.prod(changes + 1)
    return comp_rtrn
Or, as a one-liner:
def calc_comp_returns_fast(price_df, start_date, end_date, security):
    return np.prod(price_df[price_df.security_id == security].loc[start_date:end_date].px_last.pct_change() + 1)
Note that I call the set_index method beforehand; it only needs to be done once on the entire price_df dataframe.
It is faster because it does not recreate DataFrames at each step. In your code, df is overwritten almost at each line by a new dataframe. Both the init process and the garbage collection (erasing unused data from memory) take a lot of time.
In my code, rows is a slice or a "view" of the original data; it does not need to copy or re-initialize any object. Also, I used the NumPy product function directly, which is the same as taking the last cumprod element (pandas uses np.cumprod internally anyway).
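As a quick illustration of that equivalence (a sketch with made-up returns, not part of the original answer):

changes = pd.Series([0.01, -0.02, 0.03])    # some daily returns
a = np.prod(changes + 1)                    # product of (1 + r)
b = (changes + 1).cumprod().iloc[-1]        # last element of the cumulative product
assert np.isclose(a, b)                     # both give the same compound return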
Suggestion: if you are using IPython, Jupyter or Spyder, you can use the magic %prun calc_comp_returns(...) to see which part takes the most time. I ran it on your code, and it was the garbage collector, using more than 50% of the total running time!
I'm not very familiar with pandas, but I'll give this a shot.
Problem with your solution
Your solution currently does a huge amount of unnecessary calculation. This is mostly due to the line:
df['return'] = df.px_last.pct_change()
This line is actually calculating the percent change for every date between start and end. Just fixing this issue should give you a huge speed-up. You should just get the start price and the end price and compare the two; the prices in between are completely irrelevant to your calculation. Again, my familiarity with pandas is nil, but you should do something like this instead:
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    return 1 + (df['px_last'].iloc[-1] - df['px_last'].iloc[0]) / df['px_last'].iloc[0]
Remember that this code relies on the fact that price_df is sorted by date, so be careful to make sure you only pass calc_comp_returns a date-sorted price_df.
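A one-line way to guard against that (a sketch, assuming the asof and security_id columns from the question; sorting by security as well keeps each security's prices in date order):

prices_df = prices_df.sort_values(['security_id', 'asof'])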
We'll use pd.merge_asof to grab prices from prices_df. However, when we do, we'll need to have relevant dataframes sorted by the date columns we are utilizing. Also, for convenience, I'll aggregate some pd.merge_asof parameters in dictionaries to be used as keyword arguments.
prices_df = prices_df.sort_values(['asof'])

aed = events_df.sort_values('asof')
ded = events_df.sort_values('disclosed_on')

aokw = dict(
    left_on='asof', right_on='asof',
    left_by='security_ref_id', right_by='security_id'
)
start_price = pd.merge_asof(aed, prices_df, **aokw).px_last

dokw = dict(
    left_on='disclosed_on', right_on='asof',
    left_by='security_ref_id', right_by='security_id'
)
end_price = pd.merge_asof(ded, prices_df, **dokw).px_last

returns = end_price.div(start_price).sub(1).rename('return')
events_df.join(returns)
asof disclosed_on security_ref_id return
0 2015-01-05 2015-01-09 16:31:00 1 0.040816
1 2018-03-22 2018-03-27 16:33:00 3616 NaN
2 2017-08-03 2018-03-27 12:13:00 2591 NaN
3 2018-03-22 2018-03-27 11:33:00 3615 NaN
4 2018-03-22 2018-03-27 10:51:00 3615 NaN
I was reading the question "resample a dataframe with different functions applied to each column?".
The solution was:
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
Say I want to add a non-existing column to the result that stores the value of some other function, say count(). In the example given, say I want to compute the number of rows in each 1H period.
Is it possible to do:
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
                          'new_column': count()})
Note, new_column is NOT an existing column in the original data frame.
The reason I ask is that I have a very large data frame and I don't want to resample the original df twice just to get the count in the resample period.
I'm trying the above right now and it seems to be taking a very long time (no syntax errors). I'm not sure if Python is trapped in some sort of infinite loop.
Update:
I implemented the suggestion to use agg (thank you kindly for that).
However, I received the following error when computing the first aggregator:
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
    ...
/Users/blahblah/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in __getitem__(self, key)
521
522 def __getitem__(self, key):
--> 523 raise NotImplementedError('Not implemented: %s' % key)
524
525 def _make_wrapper(self, name):
NotImplementedError: Not implemented: True
The following works when I use grouped.apply(foo).
new_col1 = grp['col1'][grp['col1'] > 0].sum()
Resampling is similar to grouping with a TimeGrouper. While resampling's how parameter only allows you to specify one aggregator per column, the GroupBy object returned by df.groupby(...) has an agg method which can be passed various functions (e.g. mean, sum, or count) to aggregate the groups in various ways. You can use these results to build the desired DataFrame:
import datetime as DT
import numpy as np
import pandas as pd
np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 5, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)

resampled = df.resample('1H', how={'radiation': np.sum, 'tamb': np.mean})
print(resampled[['radiation', 'tamb']])
#                       radiation      tamb
# 2012-04-05 08:00:00  279.432788  4.549235
# 2012-04-05 09:00:00  310.032188  4.414302
# 2012-04-05 10:00:00  257.504226  5.056613
# 2012-04-05 11:00:00  299.594032  4.652067
# 2012-04-05 12:00:00    8.109946  7.795668

def using_agg(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    return pd.DataFrame(
        {'radiation': grouped['radiation'].agg('sum'),
         'tamb': grouped['tamb'].agg('mean'),
         'new_column': grouped['tamb'].agg('count')})
print(using_agg(df))
yields
new_column radiation tamb
2012-04-05 08:00:00 60 279.432788 4.549235
2012-04-05 09:00:00 60 310.032188 4.414302
2012-04-05 10:00:00 60 257.504226 5.056613
2012-04-05 11:00:00 60 299.594032 4.652067
2012-04-05 12:00:00 1 8.109946 7.795668
Note my first answer suggested using groupby/apply:
def using_apply(df):
    grouped = df.groupby(pd.TimeGrouper('1H'))
    result = grouped.apply(foo).unstack(-1)
    result = result.sortlevel(axis=1)
    return result[['radiation', 'tamb', 'new_column']]

def foo(grp):
    radiation = grp['radiation'].sum()
    tamb = grp['tamb'].mean()
    cnt = grp['tamb'].count()
    return pd.Series([radiation, tamb, cnt], index=['radiation', 'tamb', 'new_column'])
It turns out that using apply here is much slower than using agg. If we benchmark using_agg versus using_apply on a 1681-row DataFrame:
np.random.seed(2016)
date_times = pd.date_range(DT.datetime(2012, 4, 5, 8, 0),
                           DT.datetime(2012, 4, 6, 12, 0),
                           freq='1min')
tamb = np.random.sample(date_times.size) * 10.0
radiation = np.random.sample(date_times.size) * 10.0
df = pd.DataFrame(data={'tamb': tamb, 'radiation': radiation},
                  index=date_times)
Using IPython's %timeit function, I find:
In [83]: %timeit using_apply(df)
100 loops, best of 3: 16.9 ms per loop
In [84]: %timeit using_agg(df)
1000 loops, best of 3: 1.62 ms per loop
using_agg is significantly faster than using_apply and (based on additional
%timeit tests) the speed advantage in favor of using_agg grows as len(df)
grows.
By the way, regarding
frame.resample('1H', how={'radiation': np.sum, 'tamb': np.mean,
                          'new_column': count()})
besides the problem that the how dict does not accept non-existent column names, the parentheses in count are problematic. The values in the how dict should be function objects. count is a function object, but count() is the value returned by calling count.
Since Python evaluates arguments before calling functions, count() is getting called before frame.resample(...), and the return value of count() is then associated with the key 'new_column' in the dict bound to the how parameter. That's not what you want.
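A tiny sketch of the difference (the function f here is purely illustrative):

def f():
    return 42

d1 = {'new_column': f()}   # f is called while the dict is built; d1 == {'new_column': 42}
d2 = {'new_column': f}     # stores the function object itself, to be called later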
Regarding the updated question: Precompute the values that you will need before calling groupby/agg:
Instead of
grouped = df.groupby(['name1', pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1'][grouped['col1'] > 0].agg('sum')
    ...
# raises NotImplementedError, since boolean indexing on grouped['col1'] is not implemented
use
df['col1_pos'] = df['col1'].clip(lower=0)
grouped = df.groupby(['name1',pd.TimeGrouper('M')])
return pd.DataFrame(
    {'new_col1': grouped['col1_pos'].agg('sum')
    ...
See the bottom of this post for more on why pre-computation helps performance.