I am using Python 2.7. I am looking to calculate compounding returns from daily returns, and my current code is pretty slow, so I was looking for areas where I could gain efficiency.
What I want to do is pass two dates and a security into a price table and calculate the compounding return between those dates for the given security.
I have a price table (prices_df):
security_id px_last asof
1 3.055 2015-01-05
1 3.360 2015-01-06
1 3.315 2015-01-07
1 3.245 2015-01-08
1 3.185 2015-01-09
I also have a table with two dates and a security (events_df):
asof disclosed_on security_ref_id
2015-01-05 2015-01-09 16:31:00 1
2018-03-22 2018-03-27 16:33:00 3616
2017-08-03 2018-03-27 12:13:00 2591
2018-03-22 2018-03-27 11:33:00 3615
2018-03-22 2018-03-27 10:51:00 3615
Using the two dates in this table, I want to use the price table to calculate the returns.
The two functions I am using:
import pandas as pd
# compounds returns
def cum_rtrn(df):
    df_out = df.add(1).cumprod()
    df_out['return'].iat[0] = 1
    return df_out
# calculates compound returns from prices between two dates
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    df['return'] = df.px_last.pct_change()
    df = df[['return']]
    df = cum_rtrn(df)
    return df.iloc[-1][0]
I then iterate over events_df with .iterrows, calling calc_comp_returns for each row. However, this is a very slow process as I have 10K+ iterations, so I am looking for improvements. The solution does not need to be based in pandas.
# example of how the function is called
import datetime

start = datetime.datetime.strptime('2015-01-05', '%Y-%m-%d').date()
end = datetime.datetime.strptime('2015-01-09', '%Y-%m-%d').date()
calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)
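For context, here is a minimal sketch of the kind of per-event loop described above (a hypothetical reconstruction using the column names from events_df); this per-row iteration is what the suggestions below try to speed up or replace:
# hypothetical reconstruction of the slow per-event loop
event_returns = []
for _, row in events_df.iterrows():
    r = calc_comp_returns(prices_df,
                          start_date=row['asof'],
                          end_date=row['disclosed_on'],
                          security=row['security_ref_id'])
    event_returns.append(r)
events_df['return'] = event_returns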
Here is a solution (100x faster on my computer with some dummy data).
import numpy as np
price_df = price_df.set_index('asof')
def calc_comp_returns_fast(price_df, start_date, end_date, security):
    rows = price_df[price_df.security_id == security].loc[start_date:end_date]
    changes = rows.px_last.pct_change()
    comp_rtrn = np.prod(changes + 1)
    return comp_rtrn
Or, as a one-liner:
def calc_comp_returns_fast(price_df, start_date, end_date, security):
    return np.prod(price_df[price_df.security_id == security].loc[start_date:end_date].px_last.pct_change() + 1)
Note that I call the set_index method beforehand; it only needs to be done once on the entire price_df dataframe.
It is faster because it does not recreate DataFrames at each step. In your code, df is overwritten by a new dataframe at almost every line; both the initialization and the garbage collection (erasing unused data from memory) take a lot of time.
In my code, rows is a slice or "view" of the original data; it does not need to copy or re-initialize any object. Also, I used the numpy product function directly, which is the same as taking the last cumprod element (pandas uses np.cumprod internally anyway).
Suggestion: if you are using IPython, Jupyter or Spyder, you can use the magic %prun calc_comp_returns(...) to see which part takes the most time. I ran it on your code, and the garbage collector took more than 50% of the total running time!
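For reference, a minimal sketch of how the profiling and timing suggestion above can be run in an IPython/Jupyter session (variable names taken from the question and the answer; the exact figures will of course differ per machine):
# profile the original version to see where the time goes
%prun calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)

# time both versions for comparison (the fast version expects the 'asof'-indexed frame)
price_df = prices_df.set_index('asof')
%timeit calc_comp_returns(prices_df, start_date=start, end_date=end, security=1)
%timeit calc_comp_returns_fast(price_df, start, end, 1)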
I'm not very familiar with pandas, but I'll give this a shot.
Problem with your solution
Your solution currently does a huge amount of unnecessary calculation. This is mostly due to the line:
df['return'] = df.px_last.pct_change()
This line actually calculates the percent change for every date between start and end. Just fixing this issue should give you a huge speed-up. You should just get the start price and the end price and compare the two; the prices in between are completely irrelevant to your calculation. Again, my familiarity with pandas is nil, but you should do something like this instead:
def calc_comp_returns(price_df, start_date=None, end_date=None, security=None):
    df = price_df[price_df.security_id == security]
    df = df.set_index(['asof'])
    df = df.loc[start_date:end_date]
    return df['px_last'].iloc[-1] / df['px_last'].iloc[0]
Remember that this code relies on the fact that price_df is sorted by date, so be careful to make sure you only pass calc_comp_returns a date-sorted price_df.
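As a concrete illustration of that caveat, a one-line sketch of sorting the price table once before any calls (prices_df is the question's variable name; sorting once up front is enough):
# sort once, up front, so the date slicing and iloc[0]/iloc[-1] logic above is valid
prices_df = prices_df.sort_values('asof')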
We'll use pd.merge_asof to grab prices from prices_df. However, when we do, we'll need to have relevant dataframes sorted by the date columns we are utilizing. Also, for convenience, I'll aggregate some pd.merge_asof parameters in dictionaries to be used as keyword arguments.
prices_df = prices_df.sort_values(['asof'])
aed = events_df.sort_values('asof')
ded = events_df.sort_values('disclosed_on')
aokw = dict(
    left_on='asof', right_on='asof',
    left_by='security_ref_id', right_by='security_id'
)
start_price = pd.merge_asof(aed, prices_df, **aokw).px_last
dokw = dict(
    left_on='disclosed_on', right_on='asof',
    left_by='security_ref_id', right_by='security_id'
)
end_price = pd.merge_asof(ded, prices_df, **dokw).px_last
returns = end_price.div(start_price).sub(1).rename('return')
events_df.join(returns)
asof disclosed_on security_ref_id return
0 2015-01-05 2015-01-09 16:31:00 1 0.040816
1 2018-03-22 2018-03-27 16:33:00 3616 NaN
2 2017-08-03 2018-03-27 12:13:00 2591 NaN
3 2018-03-22 2018-03-27 11:33:00 3615 NaN
4 2018-03-22 2018-03-27 10:51:00 3615 NaN
Related
I want help verifying whether the logic of the given code is correct. My aim is to calculate the number of continuous seconds for which Value is >= 50.05. By continuous seconds I mean, for example, 1s: 50.06, 2s: 50.07, 3s: 50.06; here the Value is above 50.05 for 3 seconds. What is not considered continuous seconds is 1s: 50.06, 2s: 50.02, 3s: 50.06, because the condition is not satisfied at 2s. I wrote this code based on what I read in different questions on Stack Overflow.
The input data frame df1 is as follows:
Date                 Value
2019-12-31 23:00:00  50.10
2019-12-31 23:00:01  50.06
2019-12-31 23:00:02  50.05
The code used is this:
import pandas as pd
import numpy as np
import math
from datetime import timedelta
import example
accumulator = 0.0
reset = False
def myfunc(mask):
    global accumulator
    if mask==False:
        accumulator = 0.0
        return 0
    if mask==True:
        accumulator+=1.0
        return accumulator
df1 = example.get_pq() # to get dataframe
df1 = df1.reset_index()
df1[['Date','index']]=df1['index'].astype(str).str.split('+',expand=True)
del df1['index']
df1 = df1[['Date','Value']]
df1['Date'] = pd.to_datetime(df1['Date'])
groups = df1.groupby(df1['Date'].dt.date, as_index = False)
g = df1
g['mask 50.05'] = g['Value'] >= 50.05
g['Diff']= g['Date'].diff()
g['Diff'] = g['Diff'].dt.total_seconds()
g['Sum 50.05'] = g.apply(lambda row: myfunc(row['mask 50.05']),axis=1)
print(g)
The output dataframe g is this:
Date                 Value  mask 50.05  Diff  Sum 50.05
2019-12-31 23:00:00  50.10  True        NaN   0.0
2019-12-31 23:00:01  50.06  True        1.0   1.0
2019-12-31 23:00:02  50.05  True        1.0   2.0
My aim is to extract the rows where the condition was satisfied for 900 seconds or more using g[g['Sum 50.05'] >=900]
I asked a similar question a while ago, but the answer given there results in a different output than the code I posted here. Therefore, I need help verifying whether the code I wrote here is correct. I have run through a verification myself but wanted another pair of eyes on it.
This is a more pure-Python solution. There should be a more pandas-like way to do this; a sketch of one option is shown after the code below.
import pandas as pd

df = pd.DataFrame({'Date': ['2019-12-31 23:00:00', '2019-12-31 23:00:01', '2019-12-31 23:00:02',
                            '2019-12-31 23:00:03', '2019-12-31 23:00:04'],
                   'Value': [50.1, 50.06, 50.05, 49, 52]})
dateVal = []

def Above5005(row):
    if row['Value'] < 50.05:
        dateVal.append([row['Date'], 0])
    elif dateVal and dateVal[-1][1] > 0:
        dateVal[-1][1] += 1
    else:
        dateVal.append([row['Date'], 1])
    print(dateVal)  # this line shows how the statistics are collected in the reference table dateVal
    return dateVal[-1][0]

df['refDate'] = df.apply(Above5005, axis=1)
d = dict(dateVal)
# the line below links the data in the reference table back to the main table
df['continuousCount'] = df.apply(lambda row: d[row['refDate']], axis=1)
df
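As mentioned above, a more pandas-like approach exists; here is a hedged sketch using the common cumulative-sum grouping trick. The column names Date and Value are taken from the question, continuousCount mirrors the column produced above (the count here starts at 1 for the first row of a run, so it may differ by one from the question's Sum 50.05 column), and 1-second spacing between rows is assumed:
import pandas as pd

df = pd.DataFrame({'Date': ['2019-12-31 23:00:00', '2019-12-31 23:00:01', '2019-12-31 23:00:02',
                            '2019-12-31 23:00:03', '2019-12-31 23:00:04'],
                   'Value': [50.1, 50.06, 50.05, 49, 52]})

mask = df['Value'] >= 50.05                # True while the threshold condition holds
run_id = (~mask).cumsum()                  # a new id starts after every break in the run
# running count of consecutive True rows, resetting to 0 where the mask is False
df['continuousCount'] = mask.astype(int).groupby(run_id).cumsum()
# rows where the condition has held for 900 seconds or more, as in the question
long_runs = df[df['continuousCount'] >= 900]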
I have a Pandas dataframe df that looks as follows:
created_time action_time
2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
import pandas as pd
df = pd.DataFrame({'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600'],
'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600']})
I then create another column which represents the difference in minutes between these two columns:
df['elapsed_time'] = (pd.to_datetime(df['action_time']) - pd.to_datetime(df['created_time'])).dt.total_seconds() / 60
df['elapsed_time']
elapsed_time
74.114533
3.149783
219.853917
We assume that "action" can only take place during business hours (which we assume to start 8:30am).
I would like to create another column named created_time_adjusted, which adjusts the created_time to 08:30am if the created_time is before 08:30am).
I can parse out the date and time string that I need, as follows:
df['elapsed_time'] = pd.to_datetime(df['created_time']).dt.date.astype(str) + 'T08:30:00.000-0600'
But, this doesn't deal with the conditional.
I'm aware of a few ways that I might be able to do this:
replace
clip
np.where
loc
What is the best (and least hacky) way to accomplish this?
Thanks!
First of all, I think your life would be easier if you convert the columns to datetime dtypes from the get-go. Then it's just a matter of running an apply op on the 'created_time' column.
df.created_time = pd.to_datetime(df.created_time)
df.action_time = pd.to_datetime(df.action_time)
df.elapsed_time = df.action_time-df.created_time
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted'] = df.created_time.apply(
    lambda x: x.replace(hour=8, minute=30, second=0) if x.time() < time_threshold else x)
Output:
>>> df
created_time action_time created_time_adjusted
0 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00
1 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00
2 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00
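If the goal is then to recompute the elapsed minutes against the adjusted start time (this follow-up step is an assumption, reusing the question's own formula), a short sketch:
# elapsed minutes measured from the adjusted creation time (hypothetical follow-up)
df['elapsed_time_adjusted'] = (df.action_time - df.created_time_adjusted).dt.total_seconds() / 60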
from datetime import timedelta

df['created_time'] = pd.to_datetime(df['created_time'])  # coerce to datetime
df1 = df.set_index(df['created_time']).between_time('00:00:00', '08:30:00', include_end=False)  # isolate rows earlier than 8:30 into df1
df1['created_time'] = df1['created_time'].dt.normalize() + timedelta(hours=8, minutes=30, seconds=0)  # adjust the time to 08:30
df2 = df1.append(df.set_index(df['created_time']).between_time('08:30:00', '00:00:00', include_end=False)).reset_index(drop=True)  # knit the before-8:30 and after-8:30 rows back together
df2
I have a large dataset; below are the training and test datasets.
train_data is from 2016-01-29 to 2017-12-31:
head(train_data)
date Date_time Temp Ptot JFK AEH ART CS CP
1 2016-01-29 2016-01-29 00:00:00 30.3 1443.888 52.87707 49.36879 28.96548 6.239999 49.61212
2 2016-01-29 2016-01-29 00:15:00 30.3 1410.522 49.50248 49.58356 26.37977 5.024000 49.19649
3 2016-01-29 2016-01-29 00:30:00 30.3 1403.191 50.79809 49.04253 26.15317 5.055999 47.48126
4 2016-01-29 2016-01-29 00:45:00 30.3 1384.337 48.88359 49.14100 24.52135 5.088000 46.19261
5 2016-01-29 2016-01-29 01:00:00 30.1 1356.690 46.61842 48.80624 24.28208 5.024000 43.00352
6 2016-01-29 2016-01-29 01:15:00 30.1 1341.985 48.09687 48.87748 24.49988 4.975999 39.90505
test_data is from 2018-01-01 to 2018-07-12
tail(test_data)
date Date_time Temp Ptot JFK AEH ART CS CP
86007 2018-07-12 2018-07-12 22:30:00 64.1 1458.831 82.30099 56.93944 27.20252 2.496 54.41050
86008 2018-07-12 2018-07-12 22:45:00 64.1 1457.329 61.68535 54.28934 28.59752 3.728 54.15208
86009 2018-07-12 2018-07-12 23:00:00 63.5 1422.419 80.56367 56.40752 27.99190 3.520 53.85705
86010 2018-07-12 2018-07-12 23:15:00 63.5 1312.021 52.25757 56.40283 22.03727 2.512 53.72166
86011 2018-07-12 2018-07-12 23:30:00 63.5 1306.349 65.65347 56.20145 22.77093 3.680 52.71584
86012 2018-07-12 2018-07-12 23:45:00 63.5 1328.528 57.47283 57.73747 19.50940 2.432 52.37458
I want to make a prediction validation loop over each day (from 2018-01-01 to 2018-07-12) in test_data. Each day's prediction is 96 values (15-minute sampling). In other words, I have to select 96 values each time, pass them as the test_data shown in the code, and calculate MAPE.
Target variable: Ptot
Predictors: Temp, JFK, AEH, ...etc
I finished running the prediction as shown below
input = train_data[c("Temp","JFK","AEH","ART","CS","CP","RLF", "FH" ,"TJF" ,"GH" , "JPH","JEK", "KL",
"MH","MC","MRH", "PH","OR","RP","RC","RL","SH", "SPC","SJH","SMH","VWK","WH","Month","Day",
"Year","hour")]
target = train_data["Ptot"]
glm_model <- glm(Ptot~ ., data= c(input, target), family=gaussian)
I want to iterate through test_data (create a loop), each time taking 96 observations (96 rows) from the test table sequentially until the end of the dataset, then calculate MAPE and save all of the values. I implemented this in R.
As shown in the figure below, each time take 96 rows from test_data and put them in "test_data" in the function. It is just an illustration, not showing all 96 values :)
This is the function I have to create a loop for it
pred<- predict.glm(glm_model,test_data)
mape <- function(actual, pred){
  return(100 * mean(abs((actual - pred)/actual)))
}
I will show how to make first-day prediction validation
1- select 96 values from test_data(i.e 2018-01-01)
One_day_data <- test_data[test_data$date == "2018-01-01",]
2- Put one day values in the function
pred<- predict.glm(glm_model,One_day_data )
3- This is the prediction results after running pred (96 values =one day)
print(pred)
67489 67490 67491 67492 67493 67494 67495 67496 67497 67498
1074.164 1069.527 1063.726 1082.404 1077.569 1071.265 1070.776 1073.686 1061.720 1063.554
67499 67500 67501 67502 67503 67504 67505 67506 67507 67508
1074.264 1067.393 1071.111 1076.754 1079.700 1071.244 1097.977 1089.862 1091.817 1098.025
67509 67510 67511 67512 67513 67514 67515 67516 67517 67518
1125.495 1133.786 1136.545 1138.473 1176.555 1183.483 1184.795 1186.220 1192.328 1187.582
67519 67520 67521 67522 67523 67524 67525 67526 67527 67528
1186.513 1254.844 1262.021 1258.816 1240.280 1229.237 1237.582 1250.030 1243.189 1262.266
67529 67530 67531 67532 67533 67534 67535 67536 67537 67538
1251.563 1242.417 1259.352 1269.760 1271.318 1266.984 1260.113 1247.424 1200.905 1198.161
67539 67540 67541 67542 67543 67544 67545 67546 67547 67548
1202.372 1189.016 1193.479 1194.668 1207.064 1199.772 1189.068 1176.762 1188.671 1208.944
67549 67550 67551 67552 67553 67554 67555 67556 67557 67558
1199.216 1193.544 1215.866 1209.969 1180.115 1182.482 1177.049 1196.165 1145.335 1146.028
67559 67560 67561 67562 67563 67564 67565 67566 67567 67568
1161.821 1163.816 1114.529 1112.068 1113.113 1107.496 1073.080 1082.271 1097.888 1095.782
67569 67570 67571 67572 67573 67574 67575 67576 67577 67578
1081.863 1068.071 1061.651 1072.511 1057.184 1068.474 1062.464 1061.535 1054.550 1050.287
67579 67580 67581 67582 67583 67584
1038.086 1045.610 1038.836 1030.429 1031.563 1019.997
We can get the actual value from "Ptot"
actual<- One_day_data$Ptot
[1] 1113.398 1110.637 1111.582 1110.816 1101.921 1111.091 1108.501 1112.535 1104.631 1108.284
[11] 1110.994 1106.585 1111.397 1117.406 1106.690 1101.783 1101.605 1110.183 1104.162 1111.829
[21] 1117.093 1125.493 1118.417 1127.879 1133.574 1136.395 1139.048 1141.850 1145.630 1141.288
[31] 1141.897 1140.310 1138.026 1121.849 1122.069 1120.479 1120.970 1111.594 1109.572 1116.355
[41] 1115.454 1113.911 1115.509 1113.004 1119.440 1112.878 1117.642 1100.516 1099.672 1109.223
[51] 1105.088 1107.167 1114.355 1110.620 1110.499 1110.161 1107.868 1118.085 1108.166 1106.347
[61] 1114.036 1106.968 1109.807 1113.943 1106.869 1104.390 1102.446 1110.770 1114.684 1114.142
[71] 1118.877 1128.470 1133.922 1128.420 1134.058 1142.529 1126.432 1127.824 1124.561 1130.823
[81] 1122.907 1117.422 1116.851 1114.980 1114.543 1108.584 1120.410 1120.900 1109.226 1101.367
[91] 1098.330 1110.474 1106.010 1108.451 1095.196 1096.007
4- Run Mape function and save the results (I have the actual values)
mape <- function(actual, pred){
  return(100 * mean(abs((actual - pred)/actual)))
}
5- Do the same thing for the next 24 hours (i.e. 2018-01-02) and so on
Incomplete solution; it is not correct! (I think it should be done with something like this)
result_df = []
for (i in 1:96){
  test_data <- test_data[i,]
  pred <- predict.glm(glm_model, test_data)
  result_df$pred[i] <- pred
  result_df$Actual[i+1] <- result_df$pred[i]
  mape[i] <- function(actual, pred){
    return(100 * mean(abs((actual - pred)/actual)))
  }
}
SUMMARY: I want to store all of the values of mape by passing one day incrementally each time to pred.
NOTE: I will appreciate if you can show me the loop process in R and/or Python.
Consider building a generalized function, mape_calc, to receive a subset data frame as input, and call it with R's by. As the object-oriented wrapper to tapply, by will subset the main data frame by each distinct date, passing the subsets into the defined function for calculation.
Within the function, a new one-row data frame is built to align mape with each date. Then all rows are bound together with do.call:
mape_calc <- function(sub_df) {
  pred <- predict.glm(glm_model, sub_df)
  actual <- sub_df$Ptot
  mape <- 100 * mean(abs((actual - pred)/actual))
  new_df <- data.frame(date = sub_df$date[[1]], mape = mape)
  return(new_df)
}
# LIST OF ONE-ROW DATAFRAMES
df_list <- by(test_data, test_data$date, mape_calc)
# FINAL DATAFRAME
final_df <- do.call(rbind, df_list)
Should you have the same setup in Python with pandas and numpy (and possibly statsmodels for the glm model), use pandas DataFrame.groupby as the counterpart to R's by. Of course, adjust the pseudocode below to your actual needs.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
...
train_data = sm.add_constant(train_data)  # optional with the formula interface, which adds an intercept automatically
model_formula = 'Ptot ~ Temp + JFK + AEH + ART + CS + CP ...'
glm_model = smf.glm(formula = model_formula,
                    data = train_data.drop(columns=['date','Date_time']),
                    family = sm.families.Gaussian()).fit()
def mape_calc(dt, sub_df):
    pred = glm_model.predict(sub_df.drop(columns=['date','Date_time','Ptot']))
    actual = sub_df['Ptot']
    mape = 100 * np.mean(np.abs((actual - pred)/actual))
    new_df = pd.DataFrame({'date': dt, 'mape': mape}, index=[0])
    return new_df
# LIST OF ONE-ROW DATAFRAMES
df_list = [mape_calc(i, g) for i, g in test_data.groupby('date')]
# FINAL DATAFRAME
final_df = pd.concat(df_list, ignore_index=True)
It sounds to me like you are looking for an introduction to python. Forgive me if I have misunderstood. I realize my answer is very simple.
I am happy to answer your question about how to do a loop in python. I will
give you two examples. I am going to assume that you are using "ipython" which
would allow you to type the following and test it out. I will show you a
for loop and a while loop.
I will demonstrate summing a bunch of numbers. Note that loops must be indented to work. This is a feature of python that freaks out newbies.
So ... inside an ipython environment.
In [21]: data = [1.1, 1.3, 0.5, 0.8, 0.9]
In [22]: def sum1(data):
             summ=0
             npts=len(data)
             for i in range(npts):
                 summ+=data[i]
             return summ
In [23]: sum1(data)
Out[23]: 4.6000000000000005
In [24]: def sum2(data):
             summ=0; i=0
             npts=len(data)
             while i<npts:
                 summ+=data[i]
                 i+=1
             return summ
#Note that in a while loop you must increment "i" on your own but a for loop
#does it for you ... just like every other language!
In [25]: sum2(data)
Out[25]: 4.6000000000000005
I ignored the question of how to get your data into an array. Python supports both lists (which is what I used in the example) and actual arrays (via numpy). If this is of interest to you, we can talk about numpy next.
There are all kinds of wonderful function for reading data files as well.
OK -- I don't know how to read "R" ... but it looks kind of C-like with elements of Matlab (which means matplotlib and numpy will work great for you!)
I can make your syntax "pythonic". That does not mean I am giving you running code.
We assume that you are interested in learning python. If you are a student asking
for someone else to do your homework, then I will be irritated. Regardless, I would very much appreciate if you would accept one of my answers as I could use some reputation on this site. I just got on it tonight even though I have been coding since 1975.
Here's how to do a function:
def mape(actual, pred):
    return(100 * mean(abs((actual-pred)/actual)))
You are obviously using arrays ... you probably want numpy which will work much like I think you expect R to work.
for i in range(2,97):
    test = test_data[i]
    pred = predict.glm(glm_model, test)
    # don't know what this dollar sign thing means
    # so I didn't mess with it
    result_df$pred[i] = pred
    result_df$Actual[i+1] = result_df$pred[i]
I guess the dollar sign is some kind of appending thing. You can certainly append to an array in python. At this point if you want more help you need to break this into questions like ... "How do I create and fill an array in numpy?"
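To illustrate the appending point, a tiny sketch of collecting values in a Python list and converting it to a numpy array at the end (the names here are made up for the example):
import numpy as np

preds = []                      # plain Python list
for i in range(2, 97):
    preds.append(i * 0.5)       # placeholder for a real predicted value
preds = np.array(preds)         # convert to a numpy array once the loop is done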
Good luck!
I'm performing a Cohort analysis using python, and I am having trouble creating a new column that sums up the total months a user has stayed with us.
I know the math behind the answer, all I have to do is:
subtract the year when they canceled our service from when they started it
Multiply that by 12.
Subtract the month when they canceled our service from when they started it.
Add those two numbers together.
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
C is the date when the customer canceled the service, and B is when they started.
The problem is that I am very new to Python and Pandas, and I am having trouble translating that function into Python.
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 + \
                 df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns an error that 'Series' is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc(Plan_Start_Date, Plan_Cancel_Date):
    df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 + \
                     df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
    df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think you need to first convert with to_datetime and then use dt.year and dt.month:
df = pd.DataFrame({
'Plan_Cancel_Date': ['2018-07-07','2019-03-05','2020-10-08'],
'Plan_Start_Date': ['2016-02-07','2017-01-05','2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38
Code
import pandas as pd
import numpy as np
dates = pd.date_range('20140301',periods=6)
id_col = np.array([[0, 1, 2, 0, 1, 2]])
data_col = np.random.randn(6,4)
data = np.concatenate((id_col.T, data_col), axis=1)
df = pd.DataFrame(data, index=dates, columns=list('IABCD'))
print df
print "before groupby:"
for index in df.index:
    if not index.freq:
        print "key:%f, no freq:%s" % (key, index)
print "after groupby:"
gb = df.groupby('I')
for key, group in gb:
    #group = group.resample('1D', how='first')
    for index in group.index:
        if not index.freq:
            print "key:%f, no freq:%s" % (key, index)
The output:
I A B C D
2014-03-01 0 0.129348 1.466361 -0.372673 0.045254
2014-03-02 1 0.395884 1.001859 -0.892950 0.480944
2014-03-03 2 -0.226405 0.663029 0.355675 -0.274865
2014-03-04 0 0.634661 0.535560 1.027162 1.637099
2014-03-05 1 -0.453149 -0.479408 -1.329372 -0.574017
2014-03-06 2 0.603972 0.754232 0.692185 -1.267217
[6 rows x 5 columns]
before groupby:
after groupby:
key:0.000000, no freq:2014-03-01 00:00:00
key:0.000000, no freq:2014-03-04 00:00:00
key:1.000000, no freq:2014-03-02 00:00:00
key:1.000000, no freq:2014-03-05 00:00:00
key:2.000000, no freq:2014-03-03 00:00:00
key:2.000000, no freq:2014-03-06 00:00:00
But after I uncomment the statement:
#group = group.resample('1D', how='first')
there seems to be no problem. The thing is, when I run this on a large dataset with some operations on the timestamps, there is always an error "cannot add integral value to timestamp without offset". Is it a bug, or did I miss something?
You are treating a groupby object as a DataFrame.
It is like a dataframe, but requires apply to generate a new structure (either reduced or an actual DataFrame).
The idiom is:
df.groupby(....).apply(some_function)
Doing something like df.groupby(...).sum() is syntactic sugar for using apply. Functions that naturally fit this kind of sugar are enabled; otherwise they will raise an error.
In particular, you are accessing group.index, which can be, but is not guaranteed to be, a DatetimeIndex (when time grouping). The freq attribute of a DatetimeIndex is inferred when required (via inferred_freq).
Your code is very confusing: you are grouping, then resampling; resample does this for you, so you don't need the former step at all.
resample is de-facto equivalent of a groupby-apply (but has special handling for the time-domain).
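A brief sketch of that idiom on the question's example frame (a hedged illustration using the modern resample accessor syntax rather than the older how= argument; some_summary is a made-up stand-in for whatever reduction you need):
import numpy as np
import pandas as pd

dates = pd.date_range('20140301', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df['I'] = [0, 1, 2, 0, 1, 2]

# groupby + apply: each group arrives as a DataFrame, and apply builds the new structure
def some_summary(group):
    return group[['A', 'B', 'C', 'D']].mean()

per_group = df.groupby('I').apply(some_summary)

# resample handles the time domain directly, so no prior groupby is needed
daily_first = df.resample('1D').first()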