I've got a dataframe that looks like this:
userid date count
a 2016-12-01 4
a 2016-12-03 5
a 2016-12-05 1
b 2016-11-17 14
b 2016-11-18 15
b 2016-11-23 4
The first column is a user id, the second column is a date (resulting from a groupby(pd.TimeGrouper('d'))), and the third column is a daily count. However, I would like to ensure that, for each user, any days missing between that user's min and max date are filled in with a count of 0. So if I start with a data frame like the above, I end up with a data frame like this:
userid date count
a 2016-12-01 4
a 2016-12-02 0
a 2016-12-03 5
a 2016-12-04 0
a 2016-12-05 1
b 2016-11-17 14
b 2016-11-18 15
b 2016-11-19 0
b 2016-11-20 0
b 2016-11-21 0
b 2016-11-22 0
b 2016-11-23 4
I know that a pandas data frame has various methods to resample (with options to interpolate forwards, backwards, or by averaging), but how would I do this in the sense above, where I want a continuous time series for each userid, with each user's time series covering different dates?
Here's what I tried that hasn't worked:
grouped_users = user_daily_counts.groupby('user').set_index('timestamp').resample('d', fill_method = None)
However, this throws an error: AttributeError: Cannot access callable attribute 'set_index' of 'DataFrameGroupBy' objects, try using the 'apply' method. I'm not sure how I'd be able to use the apply method while bringing all the columns along, as I'd like to do.
Thanks for any suggestions!
You can use groupby with resample, but you first need a DatetimeIndex, created by set_index
(requires pandas 0.18.1 or higher).
Then fill the NaN values with 0 using asfreq with fillna.
Last, remove the column userid and reset_index:
df = df.set_index('date') \
       .groupby('userid') \
       .resample('D') \
       .asfreq() \
       .fillna(0) \
       .drop('userid', axis=1) \
       .reset_index()
print (df)
userid date count
0 a 2016-12-01 4.0
1 a 2016-12-02 0.0
2 a 2016-12-03 5.0
3 a 2016-12-04 0.0
4 a 2016-12-05 1.0
5 b 2016-11-17 14.0
6 b 2016-11-18 15.0
7 b 2016-11-19 0.0
8 b 2016-11-20 0.0
9 b 2016-11-21 0.0
10 b 2016-11-22 0.0
11 b 2016-11-23 4.0
If you want the count column to have integer dtype, add astype:
df = df.set_index('date') \
.groupby('userid') \
.resample('D') \
.asfreq() \
.fillna(0) \
.drop('userid', axis=1) \
.astype(int) \
.reset_index()
print (df)
userid date count
0 a 2016-12-01 4
1 a 2016-12-02 0
2 a 2016-12-03 5
3 a 2016-12-04 0
4 a 2016-12-05 1
5 b 2016-11-17 14
6 b 2016-11-18 15
7 b 2016-11-19 0
8 b 2016-11-20 0
9 b 2016-11-21 0
10 b 2016-11-22 0
11 b 2016-11-23 4
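If you prefer the apply route mentioned in the question, a hedged sketch of a per-group reindex (using the sample data from the question; assumes date is already a datetime column):
import pandas as pd

# sample data from the question
df = pd.DataFrame({'userid': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'date': pd.to_datetime(['2016-12-01', '2016-12-03', '2016-12-05',
                                           '2016-11-17', '2016-11-18', '2016-11-23']),
                   'count': [4, 5, 1, 14, 15, 4]})

def fill_missing_days(g):
    # reindex each user's counts onto that user's own continuous daily range
    full = pd.date_range(g['date'].min(), g['date'].max(), freq='D')
    return g.set_index('date')['count'].reindex(full, fill_value=0).rename_axis('date')

result = df.groupby('userid').apply(fill_missing_days).reset_index()
print(result)
A side effect of this variant is that count stays an integer, because reindex fills the missing days with 0 directly instead of introducing NaN.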
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
(image: a piece of the DataFrame)
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
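One hedged caveat about the merge above: if two holidays fall within look_back days of each other, dfHolidayLookBack will contain the same Date more than once and the left merge will duplicate rows in df. Dropping duplicate dates first avoids that (which holiday label wins is an arbitrary choice here):
# keep only one holiday label per date before merging (choice of 'first' is arbitrary)
dfHolidayLookBack = dfHolidayLookBack.drop_duplicates(subset="Date", keep="first")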
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different letters that represent the holidays, and then use groupby to find the sales for each group. An improvement would be to backfill the holiday letters onto the groups before them, e.g. groups=0.0 would become b_0, which would make it easier to see which holiday each group precedes, but I am not sure how to do that; one possible sketch is shown after the output below.
import numpy as np

df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
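For the backfill labelling mentioned above (groups=0.0 becoming something like b_0), a hedged sketch that builds on the groups column produced by the code above. It yields labels like b_0.0, and rows after the last holiday keep a nan_ prefix:
is_holiday = df['groups'].str.isalpha().fillna(False)
# carry each holiday letter backwards onto the numeric group that precedes it
codes = df['groups'].where(is_holiday).bfill()
df['groups'] = np.where(is_holiday, df['groups'],
                        codes.astype(str) + '_' + df['groups'].astype(str))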
I have two dataframes as follows:
agreement
agreement_id activation term_months total_fee
0 A 2020-12-01 24 4800
1 B 2021-01-02 6 300
2 C 2021-01-21 6 600
3 D 2021-03-04 6 300
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
I want to add another row to the payments dataframe whenever the total payments for an agreement_id are equal to that agreement's total_fee in the agreement dataframe. The new row would contain a zero value for payment, and its date would be calculated as min(date) (from payments) plus term_months (from agreement).
Here's the results I want for the payments dataframe:
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
11 2 C 2021-07-21 0
12 3 D 2021-09-04 0
The additional rows are rows 11 and 12. The total payments for agreement_id 'C' and 'D' were equal to the total_fee shown in the agreement dataframe.
import pandas as pd
import numpy as np
First, convert the 'date' column of the payments dataframe to datetime dtype using the to_datetime() method:
payments['date']=pd.to_datetime(payments['date'])
Then aggregate the payments per agreement_id by using the groupby() method:
newdf=payments.groupby('agreement_id').agg({'payment':'sum','date':'min','cust_id':'first'}).reset_index()
Now use boolean masking to get the data that meets your condition:
newdf=newdf[agreement['total_fee']==newdf['payment']].assign(payment=np.nan)
Note: in the above code, the assign() method sets the payment value of those rows to NaN.
Now make use of pd.tseries.offsets.DateOffset() and the apply() method:
newdf['date']=newdf['date']+agreement['term_months'].apply(lambda x:pd.tseries.offsets.DateOffset(months=x))
Note: the above code emits a warning, which you can safely ignore; it's a warning, not an error.
Finally, make use of the concat() and fillna() methods:
result=pd.concat((payments,newdf),ignore_index=True).fillna(0)
Now if you print result, you will get your desired output:
#output
cust_id agreement_id date payment
0 1 A 2020-12-01 200.0
1 1 A 2021-02-02 200.0
2 1 A 2021-02-03 100.0
3 1 A 2021-05-01 200.0
4 1 B 2021-01-02 50.0
5 1 B 2021-01-09 20.0
6 1 B 2021-03-01 80.0
7 1 B 2021-04-23 90.0
8 2 C 2021-01-21 600.0
9 3 D 2021-03-04 150.0
10 3 D 2021-05-03 150.0
11 2 C 2021-07-21 0.0
12 3 D 2021-09-04 0.0
Note: if you want the exact same output, use the astype() method to change the payment column dtype from float to int:
result['payment']=result['payment'].astype(int)
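The boolean mask above compares agreement and newdf by row position, so it relies on the two frames lining up. If you'd rather not rely on that, here is a hedged consolidated sketch that joins on agreement_id instead (assumes payments['date'] has already been converted with to_datetime as above, and pandas >= 0.25 for named aggregation):
import pandas as pd

# aggregate per agreement, then join on agreement_id rather than relying on row order
paid = (payments.groupby('agreement_id', as_index=False)
                .agg(cust_id=('cust_id', 'first'),
                     date=('date', 'min'),
                     payment=('payment', 'sum'))
                .merge(agreement[['agreement_id', 'term_months', 'total_fee']],
                       on='agreement_id'))

done = paid[paid['payment'] == paid['total_fee']].copy()
done['date'] = [d + pd.DateOffset(months=m) for d, m in zip(done['date'], done['term_months'])]
done['payment'] = 0

result = pd.concat([payments, done[payments.columns]], ignore_index=True)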
I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was create a new data frame c_df with the columns date and count. The date column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
All these loops, though, are not very efficient for big data sets, and the code looks ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
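If you want the result as a two-column frame, a small assumed follow-up (the column names are my own choice):
counts = (df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1)
            .explode()
            .value_counts()
            .sort_index())
result = counts.rename_axis('Date').reset_index(name='Count')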
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Work with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
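One assumed prep step: the sample START/END columns are strings, so convert them to datetime before the broadcasting comparison above:
df['START'] = pd.to_datetime(df['START'])
df['END'] = pd.to_datetime(df['END'])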
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval. (Note: I made a smaller sample to test this solution.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I have a dataframe that looks like this
customer Start_date End_date
100 2016-06-01 2018-01-01
101 2017-06-01 2019-01-01
102 2016-04-01 2017-04-01
103 2015-06-03 2016-01-01
104 2016-06-01 2018-01-01
Now I want to create a dataframe with a period index that has a column counting the number of active customers in each of its periods, like this:
Period Customers
2017-01 3
2017-02 5
2017-03 8
2017-04 9
I have written a custom for loop to do it but it is VERY inefficient. There must be a faster way that uses the pandas functionality to get this done. Any help is greatly appreciated!
First, make sure the dates are OK:
df.Start_date = pd.to_datetime(df.Start_date)
df.End_date = pd.to_datetime(df.End_date)
Create a dummy column, and use it to merge on all periods:
df['dummy'] = 1
merged = pd.merge(
df,
pd.DataFrame({'Period': pd.date_range(df.Start_date.min(), df.End_date.max(), freq='M'), 'dummy': 1}),
how='outer')
Retain all rows where the period falls between start and end dates:
merged = merged[(merged.Start_date <= merged.Period) & (merged.End_date >= merged.Period)]
Now calculate the customers per period:
>>> merged.customer.groupby(merged.Period).nunique()
Period
2015-06-30 1
2015-07-31 1
2015-08-31 1
2015-09-30 1
2015-10-31 1
2015-11-30 1
2015-12-31 1
2016-04-30 1
2016-05-31 1
2016-06-30 3
2016-07-31 3
2016-08-31 3
2016-09-30 3
2016-10-31 3
...
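Since the question asked for a Period index, a small assumed follow-up converts the month-end timestamps:
counts = merged.customer.groupby(merged.Period).nunique()
counts.index = counts.index.to_period('M')   # month-end timestamps -> monthly PeriodIndex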
You can create month periods with to_period, build all periods for each customer in a list comprehension, and finally groupby with nunique:
df['Start_date'] = pd.to_datetime(df['Start_date']).dt.to_period('m')
df['End_date'] = pd.to_datetime(df['End_date']).dt.to_period('m')
#if want exclude last periods per rows subtract 1
#df['End_date'] = pd.to_datetime(df['End_date']).dt.to_period('m') - 1
L = [(a, d) for a,b,c in df.values for d in pd.period_range(b,c, freq='m')]
Then count all unique customers per period:
df = pd.DataFrame(L, columns=['v','d']).groupby('d')['v'].nunique()
print (df.head(10))
d
2015-06 1
2015-07 1
2015-08 1
2015-09 1
2015-10 1
2015-11 1
2015-12 1
2016-01 1
2016-04 1
2016-05 1
Freq: M, dtype: int64
Sample with different data to test the solution:
print (df)
customer Start_date End_date
0 100 2016-03-01 2016-06-01
1 100 2016-08-01 2016-10-01
2 102 2016-04-01 2017-01-01
3 103 2016-06-03 2016-01-01
4 103 2016-06-01 2016-05-01
df['Start_date'] = pd.to_datetime(df['Start_date']).dt.to_period('m')
df['End_date'] = pd.to_datetime(df['End_date']).dt.to_period('m')
L = [(a, d) for a,b,c in df.values for d in pd.period_range(b,c, freq='m')]
df = pd.DataFrame(L, columns=['v','d'])
print (df)
v d
0 100 2016-03
1 100 2016-04
2 100 2016-05
3 100 2016-06
4 100 2016-08
5 100 2016-09
6 100 2016-10
7 102 2016-04
8 102 2016-05
9 102 2016-06
10 102 2016-07
11 102 2016-08
12 102 2016-09
13 102 2016-10
14 102 2016-11
15 102 2016-12
16 102 2017-01
df1 = df.groupby('d')['v'].nunique().reset_index()
print (df1)
d v
0 2016-03 1
1 2016-04 2
2 2016-05 2
3 2016-06 2
4 2016-07 1
5 2016-08 2
6 2016-09 2
7 2016-10 2
8 2016-11 1
9 2016-12 1
10 2017-01 1
# melt Start_date/End_date into a single Date column, build each customer's monthly
# date range, then resample by month-end and count unique customers per month
df.melt(id_vars='customer', var_name='Period', value_name='Date') \
  .groupby('customer') \
  .apply(lambda x: pd.Series(pd.date_range(x.Date.min(), x.Date.max(), freq='M'))) \
  .reset_index() \
  .drop('level_1', axis=1) \
  .set_index(0) \
  .resample('M') \
  .nunique()
# customer
# 0
# 2015-06-30 1
# 2015-07-31 1
# 2015-08-31 1
# 2015-09-30 1
# 2015-10-31 1
Problem
I want to calculate diff by group, and I don't know how to sort the time column so that each group's results are sorted and positive.
The original data :
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-25 16:36:04
2 A 2016-11-25 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-25 16:35:46
The result I want
Out[40]:
id time
0 A 00:35
1 A 03:12
2 B 00:22
Notice: the type of the time column in the result is timedelta64[ns].
What I tried
In [38]: df['time'].diff(1)
Out[38]:
0 NaT
1 00:03:47
2 -1 days +23:59:25
3 -1 days +23:59:55
4 00:00:22
Name: time, dtype: timedelta64[ns]
This doesn't give the desired result.
Goal
I hope for a solution that not only solves the problem but also runs fast, because there are 50 million rows.
You can use sort_values with groupby and then aggregate with diff:
df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time diff
0 A 2016-11-25 16:32:17 NaT
1 A 2016-11-25 16:36:04 00:00:35
2 A 2016-11-25 16:35:29 00:03:12
3 B 2016-11-25 16:35:24 NaT
4 B 2016-11-25 16:35:46 00:00:22
If you need to remove rows with NaT in the diff column, use dropna:
df = df.dropna(subset=['diff'])
print (df)
id time diff
2 A 2016-11-25 16:35:29 00:03:12
1 A 2016-11-25 16:36:04 00:00:35
4 B 2016-11-25 16:35:46 00:00:22
You can also overwrite the time column:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time
0 A NaT
1 A 00:00:35
2 A 00:03:12
3 B NaT
4 B 00:00:22
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
id time
1 A 00:00:35
2 A 00:03:12
4 B 00:00:22
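With 50 million rows, a hedged alternative sketch avoids the groupby entirely: sort once, take a plain diff, and blank out the boundaries where the id changes. It may be faster than the groupby version, but it's worth benchmarking:
s = df.sort_values(['id', 'time'])
d = s['time'].diff()
d[s['id'] != s['id'].shift()] = pd.NaT   # don't diff across different ids
df['diff'] = d                           # index alignment maps values back to the original rows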