I am looking to group by two columns, user_id and date; however, if the dates are close enough, I want to consider the entries part of the same group and group accordingly. Dates are in m-d-y format.
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping should be by user_id and by dates within +/- 3 days of each other, so the groupby summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Is there any way this could be done (somewhat) easily? I know there are some problematic aspects of this, for example what to do if the dates chain together endlessly, each three days apart, but the exact data I'm using only has two values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
        .groupby(['user_id', pd.TimeGrouper('3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
        .groupby(['user_id', pd.Grouper(key='date', freq='3D')])
        .sum()
        .reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper will be deprecated in future versions of pandas, so Grouper would be preferred in this scenario (thanks for the heads up, Vaishali!).
I came up with a very ugly solution, but it still works...
# 'date' must already be a datetime column (e.g. pd.to_datetime(df['date'])) for .dt.days to work
df = df.sort_values(['user_id', 'date'])
# a new Key starts whenever the gap to a user's previous date is not less than 3 days (including each user's first row)
df['Key'] = df.sort_values(['user_id', 'date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id', 'Key'], as_index=False).agg({'val': 'sum', 'date': 'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
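If you want the result in exactly the shape asked for (user_id, date, sum of val), one option is to drop the helper Key column afterwards; a small sketch, assuming the Key column built above:
out = (df.groupby(['user_id', 'Key'], as_index=False)
         .agg({'val': 'sum', 'date': 'first'})   # first date in each chained group
         .drop(columns='Key'))                   # hypothetical cleanup, not in the original answer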
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- to indicate that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do, but it would still be very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
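If you want the look-back days tagged distinctly (e.g. b- for "shortly before Easter", as described in the question), a small hedged tweak to the loop above could append a suffix to the preceding days; this variant is a sketch, not part of the original answer:
holiday_look_back = []
for dt, h in dictDate2Holiday.items():
    holiday_look_back.append((dt, h))  # the holiday itself keeps its original code
    for i in range(1, look_back + 1):
        # days shortly before the holiday get e.g. "b-" instead of "b"
        holiday_look_back.append((dt - pd.Timedelta(days=i), h + "-"))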
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the values between the different letters that represent the holidays and then use groupby to find the sales for each group. An improvement to this would be to backfill the holiday codes onto the numeric groups before them, e.g. groups=0.0 would become b_0, which would make it easier to understand which holiday each group belongs to, but I am not sure how to do that (a rough sketch of one possibility is shown after the output below).
import numpy as np  # needed for np.where below
# flag holiday rows (letters) as 1, non-holiday rows as 0
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
# number each stretch of non-holiday rows; holiday rows keep their letter in 'groups'
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
# sum sales per between-holiday group (holiday rows themselves excluded)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
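As for the improvement mentioned above, a rough sketch (hypothetical, not part of the original answer) of labelling each numeric group with the holiday that follows it could look like this:
# prefix each numeric group with the next holiday's letter, e.g. 0.0 -> "b_0.0";
# groups after the last holiday get an empty prefix
next_holiday = (df['StateHoliday']
                .where(df['StateHoliday'].str.isalpha())
                .bfill()
                .fillna(''))
is_group = ~df['groups'].astype(str).str.isalpha()
df['groups'] = df['groups'].mask(is_group, next_holiday + '_' + df['groups'].astype(str))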
I have a pandas dataframe with several columns and I would like to know, for each row, the number of date columns that are after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select columns with 'Date' in the name, which is better when we have lots of such columns and don't want to list them one by one. Then we compare them with the cutoff date and sum the True values.
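Note that these comparisons assume the Date columns are already datetime (or ISO-formatted strings, which happen to compare correctly as text). If they are plain strings in another format, a small sketch of converting them first (column names assumed as above):
date_cols = df.filter(like='Date').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)
df['Count'] = (df[date_cols] > pd.Timestamp('2016-12-31')).sum(axis=1)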
I have a timeseries dataframe that is date-agnostic and uses a period number instead of a date.
I would like, at some point, to add in dates derived from the period.
My dataframe looks like
period custid
1 1
2 1
3 1
1 2
2 2
1 3
2 3
3 3
4 3
I would like to be able to pick an arbitrary starting date, for example 1/1/2018, which would be period 1, so you would end up with:
period custid date
1 1 1/1/2018
2 1 2/1/2018
3 1 3/1/2018
1 2 1/1/2018
2 2 2/1/2018
1 3 1/1/2018
2 3 2/1/2018
3 3 3/1/2018
4 3 4/1/2018
You could create a column of timedeltas based on the period column, where each row is a timedelta of (period - 1) days, so that it starts at 0. Then, starting from your start_date, which you can define as a datetime object, add the timedelta to the start date:
start_date = pd.to_datetime('1/1/2018')
df['date'] = pd.to_timedelta(df['period'] - 1, unit='D') + start_date
>>> df
period custid date
0 1 1 2018-01-01
1 2 1 2018-01-02
2 3 1 2018-01-03
3 1 2 2018-01-01
4 2 2 2018-01-02
5 1 3 2018-01-01
6 2 3 2018-01-02
7 3 3 2018-01-03
8 4 3 2018-01-04
Edit: In your comment, you said you were trying to add months, not days. For this, you could use your method, or alternatively, the following:
from pandas.tseries.offsets import MonthBegin
df['date'] = start_date + (df['period'] -1) * MonthBegin()
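Multiplying a Series by an offset like this may emit a PerformanceWarning on recent pandas versions; an alternative sketch, assuming monthly periods anchored at start_date, builds one offset per row instead:
df['date'] = df['period'].apply(lambda p: start_date + pd.DateOffset(months=int(p) - 1))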
I have a dataframe; here is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits; how can I count the days per month?
I think you need to group by ID, month and day and aggregate with size:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month, df.used_at.dt.day]).size()
print (df1)
ID used_at used_at
574c4969b017ae6481db9a7c77328bc3 5 1 1
6 1 1
7 1 1
8 1 1
2 1
3 1
4 1
e990fae0f48b7daf52619b5ccbec61bc 5 1 2
2 1
6 1 3
dtype: int64
Or group by date, which is the same as by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID used_at
574c4969b017ae6481db9a7c77328bc3 2015-05-01 1
2015-06-01 1
2015-07-01 1
2015-08-01 1
2015-08-02 1
2015-08-03 1
2015-08-04 1
e990fae0f48b7daf52619b5ccbec61bc 2015-05-01 2
2015-05-02 1
2015-06-01 3
dtype: int64
Differences between count and size:
size counts NaN values, count does not.
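Note that used_at has to be a datetime column for the .dt accessor to work; if it comes straight from the CSV as text, convert it first. And if what you actually need is the number of distinct days per ID and month, a sketch along these lines (same column names as above) should do it:
df['used_at'] = pd.to_datetime(df['used_at'])
days_per_month = (df.groupby(['ID', df['used_at'].dt.to_period('M')])['used_at']
                    .apply(lambda s: s.dt.date.nunique()))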
I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then column bind them back to the dataframe, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
maxdates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01',
                        '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = pd.DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'date': dates})
cols = ['id', 'date']
DF = DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
Edit:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it back up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4