Pandas group by date with subcategories and sums - python

I have a dataframe such as this one:
Date Category1 Cat2 Cat3 Cat4 Value
0 2021-02-02 4310 0 1 0 1082.00
1 2021-02-03 5121 2 0 0 -210.82
2 2021-02-03 4310 0 0 0 238.41
3 2021-02-12 5121 2 2 0 -1489.11
4 2021-02-25 6412 1 0 0 -30.97
5 2021-03-03 5121 1 1 0 -189.91
6 2021-03-09 6412 0 0 0 238.41
7 2021-03-13 5121 0 0 0 -743.08
The Date column has been converted to datetime, Value is a float, and the other columns are strings.
I am trying to group the dataframe by month and by each level of category, such as:
Level 1 = filter over category 1 and sum values for each category for each month:
Date Category1 Value
0 2021-02 4310 1320.41
1 2021-02 5121 -1699.93
2 2021-02 6412 -30.97
3 2021-03 5121 -932.99
4 2021-03 6412 238.41
Level 2 = filter over category 2 alone (one output dataframe) and over the concatenation of category 1 + 2 (another output dataframe):
Date Cat2 Value
0 2021-02 0 1320.41
1 2021-02 1 -30.97
2 2021-02 2 -1699.93
3 2021-03 0 -504.67
4 2021-03 1 -189.91
Second output :
Date Cat1+2 Value
0 2021-02 43100 1320.41
1 2021-02 51212 -1699.93
2 2021-02 64121 -30.97
3 2021-03 51210 -743.08
4 2021-03 51211 -189.91
5 2021-03 64120 238.41
Level 3 : filter over category 3 alone and over category 1+2+3
etc.
I am able to do one grouping at a time (by date or by category) but I can't combine them.
Grouping by date:
df.groupby(df["Date"].dt.year)
Grouping by category:
df.groupby('Category1')['Value'].sum()

To group by month, you can group on the formatted date, for example:
df.groupby(df['Date'].dt.strftime('%B'))['Value'].sum()
Note that '%B' yields only the month name, so the same month in different years is combined; a format such as '%Y-%m' keeps them apart. See "How can I Group By Month from a Date field using Python/Pandas".
To group by multiple columns, pass a list of keys:
df.groupby(['col5', 'col2'])
You could also create a month-year column and then group by that new column together with the category column. See "Pandas DataFrame Groupby two columns and get counts" and "Extracting just Month and Year separately from Pandas Datetime column".
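Putting the two ideas together, a minimal sketch for the question's sample data could look like the following. It assumes the column names from the question; the names month, cat12, level1, level2 and level12 are introduced here for illustration only.

```python
import pandas as pd

# Sample data copied from the question (categories kept as strings).
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-02-02", "2021-02-03", "2021-02-03", "2021-02-12",
        "2021-02-25", "2021-03-03", "2021-03-09", "2021-03-13",
    ]),
    "Category1": ["4310", "5121", "4310", "5121", "6412", "5121", "6412", "5121"],
    "Cat2": ["0", "2", "0", "2", "1", "1", "0", "0"],
    "Cat3": ["1", "0", "0", "2", "0", "1", "0", "0"],
    "Cat4": ["0"] * 8,
    "Value": [1082.00, -210.82, 238.41, -1489.11, -30.97, -189.91, 238.41, -743.08],
})

# Year-month key, so the same month in different years stays separate.
month = df["Date"].dt.to_period("M").rename("Month")

# Level 1: sum per month and Category1.
level1 = df.groupby([month, "Category1"])["Value"].sum().reset_index()

# Level 2, first output: sum per month and Cat2 alone.
level2 = df.groupby([month, "Cat2"])["Value"].sum().reset_index()

# Level 2, second output: sum per month and the concatenation Category1 + Cat2.
cat12 = (df["Category1"] + df["Cat2"]).rename("Cat1+2")
level12 = df.groupby([month, cat12])["Value"].sum().reset_index()

print(level1)
print(level2)
print(level12)
```

The deeper levels follow the same pattern: concatenate one more category column into the key (for example Category1 + Cat2 + Cat3) and group by the month together with that key.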

Related

Count number of columns above a date

I have a pandas dataframe with several columns, and for each row I would like to know the number of date columns that are later than 2016-12-31. Here is an example:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2
4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8
6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17
12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35
And I would like to get this column
Count
0
2
3
Just create a boolean mask and call sum with axis=1:
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with a single column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name, which is convenient when there are many such columns and you don't want to list them one by one. Then we compare them with the lookup date and sum the True values in each row.
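For reference, here is a self-contained sketch of the same idea on the sample data. It assumes the date columns may still be strings and converts them explicitly first; the variable names dates and cutoff are illustrative only.

```python
import pandas as pd

# Sample data from the question; dates start out as plain strings.
df = pd.DataFrame({
    "ID": [4, 6, 12],
    "Bill": [6, 8, 14],
    "Date 1": ["2000-10-04", "2016-05-03", "2016-11-16"],
    "Date 2": ["2000-11-05", "2017-08-09", "2017-05-04"],
    "Date 3": ["1999-12-05", "2018-07-14", "2017-07-04"],
    "Date 4": ["2001-05-04", "2015-09-12", "2018-07-04"],
    "Bill 2": [8, 17, 35],
})

# Pick every column whose name contains "Date" and parse it as datetime.
dates = df.filter(like="Date").apply(pd.to_datetime)

# Compare against the cutoff and count the True values in each row.
cutoff = pd.Timestamp("2016-12-31")
df["Count"] = (dates > cutoff).sum(axis=1)

print(df["Count"].tolist())
```

Note that like="Date" matches any column containing that substring, so a differently named column such as "Update Date" would be included as well.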

How to filter one dataframe on the basis of other

I have one dataframe whose dates I need to filter based on the start and end dates in another dataframe.
df1 should keep only the rows whose all_date falls within the start_date and end_date range given in df2 for the same ID.
What is the best way to achieve that in pandas?
Sample dataframes and the expected result are given below.
df1
ID all_date clicks
1 2019-08-21 5
1 2019-08-22 4
1 2019-08-25 2
1 2019-08-27 2
2 2019-07-18 5
2 2019-07-21 5
2 2019-07-23 6
2 2019-07-25 6
2 2019-07-27 6
df2
ID start_date end_date
1 2019-08-21 2019-08-23
2 2019-07-18 2019-07-24
expected output:
df1
ID all_date clicks
1 2019-08-21 5
1 2019-08-22 4
2 2019-07-18 5
2 2019-07-21 5
2 2019-07-23 6
The output should contain only the dates between start_date and end_date of df2.
Use DataFrame.merge first, then filter the rows with Series.between and boolean indexing, using loc to keep only the columns of df1:
df1['all_date'] = pd.to_datetime(df1['all_date'])
df2['start_date'] = pd.to_datetime(df2['start_date'])
df2['end_date'] = pd.to_datetime(df2['end_date'])
df = df1.merge(df2, on='ID')
df = df.loc[df['all_date'].between(df['start_date'], df['end_date']), df1.columns]
print (df)
ID all_date clicks
0 1 2019-08-21 5
1 1 2019-08-22 4
4 2 2019-07-18 5
5 2 2019-07-21 5
6 2 2019-07-23 6
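As a runnable illustration, the approach above can be applied to the sample frames as follows. This is only a sketch: it assumes each ID appears once in df2, and the names merged, in_range and result are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2, 2],
    "all_date": pd.to_datetime([
        "2019-08-21", "2019-08-22", "2019-08-25", "2019-08-27",
        "2019-07-18", "2019-07-21", "2019-07-23", "2019-07-25", "2019-07-27",
    ]),
    "clicks": [5, 4, 2, 2, 5, 5, 6, 6, 6],
})
df2 = pd.DataFrame({
    "ID": [1, 2],
    "start_date": pd.to_datetime(["2019-08-21", "2019-07-18"]),
    "end_date": pd.to_datetime(["2019-08-23", "2019-07-24"]),
})

# Attach each row's date range by ID, then keep rows inside that range.
merged = df1.merge(df2, on="ID")
in_range = merged["all_date"].between(merged["start_date"], merged["end_date"])

# Keep only the original df1 columns; between() is inclusive on both ends by default.
result = merged.loc[in_range, df1.columns].reset_index(drop=True)
print(result)
```

If df2 held several ranges for the same ID, the merge would repeat the df1 rows once per range, so a date covered by two ranges would appear twice in the result.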

Grouping by date range with pandas

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Date is m-d-y
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would be by user_id and by dates within +/- 3 days of each other, so the result of summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Is there a way this could be done (somewhat) easily? I know there are some problematic aspects, for example what to do if the dates chain together endlessly, each three days apart, but the data I'm actually using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
.groupby(['user_id', pd.TimeGrouper('3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
.groupby(['user_id', pd.Grouper(key='date', freq='3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: pd.TimeGrouper was deprecated and has since been removed from pandas, so pd.Grouper is the one to use here (thanks for the heads up, Vaishali!).
I came up with a rather ugly solution, but it still works:
df=df.sort_values(['user_id','date'])
df['Key']=df.sort_values(['user_id','date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id','Key'],as_index=False).agg({'val':'sum','date':'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
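The gap-based idea can be written out step by step. The sketch below is an assumption-laden variant, not the answer's exact code: it starts a new group whenever the gap to the previous date of the same user exceeds 3 days (the answer above uses "fewer than 3 days" as the condition for staying in a group), and the names gap, new_group and Key are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 1, 2, 2, 3],
    "date": ["1-1-17", "1-1-17", "1-1-17", "1-1-17",
             "1-2-17", "1-2-17", "1-10-17", "2-1-17"],
    "val": [1, 1, 1, 1, 1, 1, 1, 1],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Sort so that consecutive rows of one user are in date order.
df = df.sort_values(["user_id", "date"]).reset_index(drop=True)

# Gap to the previous row of the same user (NaT for each user's first row).
gap = df.groupby("user_id")["date"].diff()

# A new group starts at a user's first row or after a gap of more than 3 days.
new_group = gap.isna() | (gap > pd.Timedelta(days=3))
df["Key"] = new_group.cumsum()

out = (df.groupby(["user_id", "Key"], as_index=False)
         .agg(val=("val", "sum"), first_date=("date", "first")))
print(out)
```

Unlike fixed 3-day bins, which can split two dates that are one day apart if they straddle a bin edge, this groups by distance between neighbouring dates. The trade-off is the chaining effect mentioned in the question: a long run of dates each within 3 days of the next ends up in one group.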

Merge DateTimeIndex with month-year

I have two data frames. One has a precise (daily) DateTimeIndex. I have used that index to create monthly statistics using groupby(['groupid', pd.TimeGrouper('1M', closed='left', label='left')]).
Now I would like to merge the information back into the original data frame. However, the date-time labels of the collapsed data frame do not, of course, correspond exactly to the original DateTimeIndex, so I'd like to match them on the corresponding month-year information.
How would I do that?
statistics
date groupid
2001-01-31 1 10
2001-02-31 1 11
and original data frame
date groupid foo
2001-01-25 1 1
2001-01-28 1 2
2001-02-02 1 4
With expected output
date groupid foo statistics
2001-01-25 1 1 10
2001-01-28 1 2 10
2001-02-02 1 4 11
You can add a monthly period column to both frames with to_period and then merge on it. You also need to change 2001-02-31 to 2001-02-28 in df1, because February 31 does not exist:
df1['per'] = df1.index.get_level_values('date').to_period('M')
df2['per'] = df2.date.dt.to_period('M')
print (df1)
statistics per
date groupid
2001-01-31 1 10 2001-01
2001-02-28 1 11 2001-02
print (df2)
date groupid foo per
0 2001-01-25 1 1 2001-01
1 2001-01-28 1 2 2001-01
2 2001-02-02 1 4 2001-02
print (pd.merge(df2, df1.reset_index(level=1), on=['per','groupid'], how='right')
.drop('per', axis=1))
date groupid foo statistics
0 2001-01-25 1 1 10
1 2001-01-28 1 2 10
2 2001-02-02 1 4 11
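A self-contained version of this period-based merge might look like the sketch below. The frame names stats and daily stand in for df1 and df2 and are assumptions for illustration; a left merge is used here so that every row of the daily frame is kept.

```python
import pandas as pd

# Monthly statistics indexed by (date, groupid), as produced by the groupby.
stats = pd.DataFrame(
    {"statistics": [10, 11]},
    index=pd.MultiIndex.from_arrays(
        [pd.to_datetime(["2001-01-31", "2001-02-28"]), [1, 1]],
        names=["date", "groupid"],
    ),
)

# Original daily data.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2001-01-25", "2001-01-28", "2001-02-02"]),
    "groupid": [1, 1, 1],
    "foo": [1, 2, 4],
})

# Reduce both date columns to a year-month period that can serve as a join key.
stats_m = stats.reset_index()
stats_m["per"] = stats_m["date"].dt.to_period("M")
stats_m = stats_m.drop(columns="date")
daily["per"] = daily["date"].dt.to_period("M")

# Join on month and group, then drop the helper column.
merged = daily.merge(stats_m, on=["per", "groupid"], how="left").drop(columns="per")
print(merged)
```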

Pandas: count some values in a column

I have a dataframe; here is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in each month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits. How can I count the days per month instead?
I think you need to group by ID, month and day and aggregate with size:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month,df.used_at.dt.day ]).size()
print (df1)
ID                                used_at  used_at
574c4969b017ae6481db9a7c77328bc3  5        1          1
                                  6        1          1
                                  7        1          1
                                  8        1          1
                                           2          1
                                           3          1
                                           4          1
e990fae0f48b7daf52619b5ccbec61bc  5        1          2
                                           2          1
                                  6        1          3
dtype: int64
Or group by date, which is the same as grouping by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID                                used_at
574c4969b017ae6481db9a7c77328bc3  2015-05-01    1
                                  2015-06-01    1
                                  2015-07-01    1
                                  2015-08-01    1
                                  2015-08-02    1
                                  2015-08-03    1
                                  2015-08-04    1
e990fae0f48b7daf52619b5ccbec61bc  2015-05-01    2
                                  2015-05-02    1
                                  2015-06-01    3
dtype: int64
Difference between count and size:
size counts every row in a group, including NaN values; count counts only non-NaN values.
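If the goal is the number of distinct days per month for each ID, one possible sketch is to derive month and day columns and count the unique days. The small frame below is made-up sample data, and the column names month and day are introduced for illustration only.

```python
import pandas as pd

# Hypothetical sample: several visits per day, spread over a few days.
df = pd.DataFrame({
    "ID": ["a", "a", "a", "a", "b", "b", "b"],
    "used_at": pd.to_datetime([
        "2015-05-01 09:29:11", "2015-05-01 09:33:00", "2015-05-02 09:38:40",
        "2015-06-01 09:33:07", "2015-08-01 12:23:26", "2015-08-02 12:24:52",
        "2015-08-02 12:27:39",
    ]),
})

# Helper columns: the year-month period and the calendar day of each visit.
df = df.assign(
    month=df["used_at"].dt.to_period("M"),
    day=df["used_at"].dt.normalize(),
)

# Distinct days per ID and month (nunique ignores repeated visits on one day).
days_per_month = df.groupby(["ID", "month"])["day"].nunique()

# For comparison, size() counts every visit row in the same groups.
visits_per_month = df.groupby(["ID", "month"]).size()

print(days_per_month)
print(visits_per_month)
```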
