Pandas groupby sorting

Pandas groupby sorting - python

I am trying to analyze my Netflix data with pandas. I want to summarize the time each user spent watching a specific title and print the highest value for each Profile.
df_clean.sample(4)
Profile Name
Duration
time_clean
AAA
0 days 00:20:00
Harry Potter
AAA
0 days 00:41:50
The Sinner
BBB
0 days 00:00:15
Avatar
AAA
0 days 00:15:00
Harry Potter
I want to see only the first row for each Profile
I tried to use:
df_clean.groupby(['Profile Name','title_clean'])['Duration'].sum().sort_values(ascending=False).nlargest(1)
But it's returning me only the biggest result for 1 Profile
Profile Name
title_clean
AAA
Harry Potter
0 days 00:35:00

You can chain another groupby(level = 0) and head(1) to get the result you're looking for.
df_clean.groupby(['Profile Name', 'title_clean'])['Duration'].sum().sort_values(ascending=False).groupby(level = 0).head(1)

I would use idxmax:
(df_clean.groupby(['Profile Name','title_clean'])['Duration'].sum()
.loc[lambda g: g.groupby('Profile Name').idxmax()]
.reset_index()
)
Output:
Profile Name title_clean Duration
0 AAA The Sinner 0 days 00:41:50
1 BBB Avatar 0 days 00:00:15

def function1(dd:pd.DataFrame):
dd1=dd.groupby("title_clean")['Duration'].sum().sort_values(ascending=False).head(1)
return dd1.rename("Duration")
df1.groupby('Profile Name').apply(function1)
out：
Profile Name title_clean
AAA The Sinner 0 days 00:41:50
BBB Avatar 0 days 00:00:15

Related

Calculates a standard deviation columns for timedelta elements

I have the following dataframe in Python:
ID
country_ID
visit_time
0
ESP
10 days 12:03:00
0
ESP
5 days 02:03:00
0
ENG
5 days 10:02:00
1
ENG
3 days 08:05:03
1
ESP
1 days 03:02:00
1
ENG
2 days 07:01:03
2
ENG
0 days 12:01:02
For each ID I want to calculate the standard deviation of each country_ID group.
std_visit_ESP and std_visit_ENG columns.
standard deviation of visit time with country_ID = ESP for each ID.
standard deviation of visit time with country_ID = ENG for each ID.
ID
std_visit_ESP
std_visit_ENG
0
2 days 17:00:00
0 days 00:00:00
1
0 days 00:00:00
0 days 12:32:00
2
NaT
0 days 00:00:00
With the groupby method for the mean, you can specify the parameter numeric_only = False, but the std method of groupby does not include this option.
My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:
td1 = timedelta(10,0,0,0,3,12,0).total_seconds()
td2 = timedelta(5,0,0,0,3,2,0).total_seconds()
arr = [td1,td2]
var = np.std(arr)
show_s = pd.to_timedelta(var, unit='s')
print(show_s)
I don't know how to use this with groupby to get the desired result. I am grateful for your help.

Use GroupBy.std and pd.to_timedelta
total_seconds = \
pd.to_timedelta(
df['visit_time'].dt.total_seconds()
.groupby([df['ID'], df['country_ID']]).std(),
unit='S').unstack().fillna(pd.Timedelta(days=0))
print(total_seconds)
country_ID ENG ESP
ID
0 0 days 00:00:00 3 days 19:55:25.973595304
1 0 days 17:43:29.315934274 0 days 00:00:00
2 0 days 00:00:00 0 days 00:00:00

If I understand correctly, this should work for you:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x))
Output:
>>> stddevs
country_ID
ENG 2 days 01:17:43.835702
ESP 4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]
Formatting:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x)).to_frame().T.add_prefix('std_visit_').reset_index(drop=True).rename_axis(None, axis=1)
Output:
>>> stddevs
std_visit_ENG std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773

Apply the average to a timedelta column for two group conditions

I have the following dataframe in Python:
ID
country_ID
visit_time
0
ESP
10 days 12:03:00
0
ENG
5 days 10:02:00
1
ENG
3 days 08:05:03
1
ESP
1 days 03:02:00
1
ENG
2 days 07:01:03
1
ENG
3 days 01:00:52
2
ENG
0 days 12:01:02
2
ENG
1 days 22:10:03
2
ENG
0 days 20:00:50
For each ID, I want to get:
avg_visit_ESP and avg_visit_ENG columns.
Average time visit with country_ID = ESP for each ID.
Average time visit with country_ID = ENG for each ID.
ID
avg_visit_ESP
avg_visit_ENG
0
10 days 12:03:00
5 days 10:02:00
1
1 days 03:02:00
(8 days 16:06:58) / 3
2
NaT
(3 days 06:11:55) / 3
I don't know how to specify in groupby a double grouping, first by ID and then by country_ID. If you can help me I would appreciate it.
P.S.: The date format of visit_time (timedelta), can perform addition and division without any apparent problem.
from datetime import datetime, timedelta
date1 = pd.to_datetime('2022-02-04 10:10:21', format='%Y-%m-%d %H:%M:%S')
date2 = pd.to_datetime('2022-02-05 20:15:41', format='%Y-%m-%d %H:%M:%S')
date3 = pd.to_datetime('2022-02-07 20:15:41', format='%Y-%m-%d %H:%M:%S')
sum1date = date2-date1
sum2date = date3-date2
sum3date = date3-date1
print((sum1date+sum2date+sum3date)/3)

(df.groupby(['ID', 'country_ID'])['visit_time']
.mean(numeric_only=False)
.unstack()
.add_prefix('avg_visit_')
)
should do the trick
>>> df = pd.read_clipboard(sep='\s\s+')
>>> df.columns = [s.strip() for s in df]
>>> df['visit_time'] = pd.to_timedelta(df['visit_time'])
>>> df.groupby(['ID', 'country_ID'])['visit_time'].mean(numeric_only=False).unstack().add_prefix('avg_visit_')
country_ID avg_visit_ENG avg_visit_ESP
ID
0 5 days 10:02:00 10 days 12:03:00
1 2 days 21:22:19.333333333 1 days 03:02:00
2 1 days 02:03:58.333333333 NaT

How to add days to pandas datetime until x day is met?

For example I have a pandas dataframe of names and dates:
name date
0 Tom 2021-12-05
1 Sue 2021-11-22
2 Steve 2021-10-17
I'm trying to round each date up to the 25th (not the nearest) to look like this:
name date
0 Tom 2021-12-25
1 Sue 2021-11-25
2 Steve 2021-10-25
My most recent attempts looks like this:
df['date'] = df['date'].apply(lambda x: x['date'] + pd.to_timedelta(1, unit='d') if x['date'].dt.strftime('%d') != '25' else x['date'])
I think my issue stems from being able to first check what the 'day' is and then add days until the 25th day is satisfied. Any help would be greatly appreciated!
EDIT:
jezrael's answer solves the problem while also factoring for dates beyond the 25th and is concise.
I also found this to work as well:
from pandas._libs.tslibs.timedeltas import Timedelta
def next_date(input_date):
while (input_date.day != 25):
input_date = input_date + Timedelta("1 day")
return input_date
df['date'] = pd.to_datetime(df['date'].dt.date)
df['next_start_ship'] = df['date'].map(lambda x: next_date(x))

If need replace all days to 25 use:
df['date'] = pd.to_datetime(df['date'].dt.strftime('%Y-%m-25'))
Or:
df['date'] = df['date'].apply(lambda x: x.replace(day=25))
print (df)
name date
0 Tom 2021-12-25
1 Sue 2021-11-25
2 Steve 2021-10-25
If need replace only days until 25, else is added 1 month and also set to days=25 use:
print (df)
name date
0 Tom 2021-12-05
1 Sue 2021-11-30
2 Steve 2021-10-17
mask = df['date'].dt.day < 25
s = pd.to_datetime(df['date'].dt.strftime('%Y-%m-25'))
df['date'] = np.where(mask, s, s + pd.DateOffset(months=1))
print (df)
name date
0 Tom 2021-12-25
1 Sue 2021-12-25
2 Steve 2021-10-25

Can I group by a column and resample a date?

I have some consumer purchase data that looks like
CustomerID InvoiceDate
13654.0 2011-07-17 13:29:00
14841.0 2010-12-16 10:28:00
19543.0 2011-10-18 16:58:00
12877.0 2011-06-15 13:34:00
15073.0 2011-06-06 12:33:00
I'm interested in the rate at which customers purchase. I'd like to group by each customer and then determine how many purchases were made in each quarter (let's say each quarter is every 3 months starting in January).
I could just define when each quarter starts and ends and make another column. I'm wondering if I could instead use groupby to achieve the same thing.
Presently, this is how I do it:
r = data.groupby('CustomerID')
frames = []
for name,frame in r:
f =frame.set_index('InvoiceDate').resample("QS").count()
f['CustomerID']= name
frames.append(f)
g = pd.concat(frames)

UPDATE:
In [43]: df.groupby(['CustomerID', pd.Grouper(key='InvoiceDate', freq='QS')]) \
.size() \
.reset_index(name='Count')
Out[43]:
CustomerID InvoiceDate Count
0 12877.0 2011-04-01 1
1 13654.0 2011-07-01 1
2 14841.0 2010-10-01 1
3 15073.0 2011-04-01 1
4 19543.0 2011-10-01 1
Is that what you want?
In [39]: df.groupby(pd.Grouper(key='InvoiceDate', freq='QS')).count()
Out[39]:
CustomerID
InvoiceDate
2010-10-01 1
2011-01-01 0
2011-04-01 2
2011-07-01 1
2011-10-01 1

I think this is the best I will be able to do:
data.groupby('CustomerID').apply(lambda x: x.set_index('InvoiceDate').resample('QS').count())

How To Slice, Rank & Wrangle Like A Pandas Boss

Say one has a lookup table summarizing the busy lives of a few people on this planet...
import pandas as pd
import numpy as np
import datetime as dt
from datetime import datetime as dt
t=pd.Timestamp
lu = pd.DataFrame({ 'name' : ['Bill','Elon','Larry','Jeff','Marissa'],
'feels' : ['charitable','Alcoa envy','Elon envy','like the number 7','sassy'],
'last ate' : [t('20151209'),t('20151201'),t('20151208'),t('20151208'),t('20151209')],
'boxers' : [True,True,True,False,True]})
Say one also knows where these people live and when they did certain things...
af = pd.DataFrame({ 'name' : ['Bill','Elon','Larry','Elon','Jeff','Larry','Larry'],
'address' : ['in my computer','moon','internet','mars','cardboard box','autonomous car','every where'],
'sq_ft' : [2,2135,69,84535, 1.32, 54,168],
'forks' : [7,1,2,1,0,np.nan,1]})
rand_dates=[t('20141202'),t('20130804'),t('20120508'),t('20150411'),
t('20141209'),t('20091023'),t('20130921'),t('20110102'),
t('20130728'),t('20141119'),t('20151024'),t('20130824')]
df = pd.DataFrame({ 'name' : ['Elon','Bill','Larry','Elon','Jeff','Larry','Larry','Bill','Larry','Elon','Marissa','Jeff'],
'activity' : ['slept','tripped','spoke','swam','spooked','liked','whistled','up dog','smiled','donated','grant men paternity leave','fondled'],
'date' : rand_dates})
One could rank these people according to addresses they live at as follows:
af.name.value_counts()
Larry 3
Elon 2
Jeff 1
Bill 1
Need 1: Using the ranking above, how would one create a new "ranked" dataframe composed of information from lookup table lu? Simply put, how does one make Exhibit A?
# Exhibit A
boxers feels last ate name addresses
0 True Elon envy 2015-12-08 Larry 3
1 True Alcoa envy 2015-12-01 Elon 2
2 False like the number 7 2015-12-08 Jeff 1
3 True charitable 2015-12-09 Bill 1
Need 2: Observe the output of the groupby operation that follows. How can one determine the time delta between the oldest and newest dates to rank members of lu according to such time deltas?.. Simply put, how does one get from the groupby to Exhibit D?
df.groupby(['name','date']).size()
name date
Bill 2011-01-02 1
2013-08-04 1
Elon 2014-11-19 1
2014-12-02 1
2015-04-11 1
Jeff 2013-08-24 1
2014-12-09 1
Larry 2009-10-23 1
2012-05-08 1
2013-07-28 1
2013-09-21 1
Marissa 2015-10-24 1
#Exhibit B - Calculate time deltas
name time_delta
Bill Timedelta('945 days 00:00:00')
Elon Timedelta('143 days 00:00:00')
Jeff Timedelta('472 days 00:00:00')
Larry Timedelta('1429 days 00:00:00')
Marissa Timedelta('0 days 00:00:00')
#Exhibit C - Rank time deltas (this is easy)
name time_delta
Larry Timedelta('1429 days 00:00:00')
Bill Timedelta('945 days 00:00:00')
Jeff Timedelta('472 days 00:00:00')
Elon Timedelta('143 days 00:00:00')
Marissa Timedelta('0 days 00:00:00')
#Exhibit D - Add to and re-rank the table built in Exhibit A according to time_delta
boxers feels last ate name addresses time_delta
0 True Elon envy 2015-12-08 Larry 3 1429 days 00:00:00
1 True charitable 2015-12-09 Bill 1 945 days 00:00:00
2 False like the number 7 2015-12-08 Jeff 1 472 days 00:00:00
3 True Alcoa envy 2015-12-01 Elon 2 143 days 00:00:00
4 True sassy 2015-12-09 Marissa NaN 0 days 00:00:00
Prior Research: This so post on getting max values using groupby and transform and this other so post on finding and selecting most frequent data are informative but don't work on series (the result of count_values()) or just trip me up... I've actually gotten the first part to work but the code is bugly and likely inefficient.
Easy Peasy Code Sharing
Check out this IPython Notebook that lays everything out. Otherwise, check out the Python 2.7 code here.

I think you can use join, sort_values. Aggregation in docs.
#join value count to lu dataframe, renaming ans sorting
Exhibit_A = lu.set_index('name').join(af.name.value_counts()).rename(columns={'name': 'addresses'}).sort_values('addresses', ascending=False)
#drop rows with NaN, reset index
print Exhibit_A.dropna().reset_index()
name boxers feels last ate addresses
0 Larry True Elon envy 2015-12-08 3
1 Elon True Alcoa envy 2015-12-01 2
2 Bill True charitable 2015-12-09 1
3 Jeff False like the number 7 2015-12-08 1
#aggregate to min and max date
g = df.groupby(['name']).agg({'date' : [np.max, np.min]})
#reset columns multiindex
levels = g.columns.levels
labels = g.columns.labels
g.columns = levels[1][labels[1]]
g['time_delta'] = g['amax'] - g['amin']
#drop columns
g = g.drop(['amax', 'amin'], axis=1)
#join to Exhibit_A, sort, reset index
Exhibit_D = Exhibit_A.join(g).sort_values('time_delta', ascending=False).reset_index()
#reorder columns
Exhibit_D = Exhibit_D[['boxers', 'feels', 'last ate', 'name', 'addresses' , 'time_delta' ]]
print Exhibit_D
boxers feels last ate name addresses time_delta
0 True Elon envy 2015-12-08 Larry 3 1429 days
1 True charitable 2015-12-09 Bill 1 945 days
2 False like the number 7 2015-12-08 Jeff 1 472 days
3 True Alcoa envy 2015-12-01 Elon 2 143 days
4 True sassy 2015-12-09 Marissa NaN 0 days

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas groupby sorting - python

You can chain another groupby(level = 0) and head(1) to get the result you're looking for. df_clean.groupby(['Profile Name', 'title_clean'])['Duration'].sum().sort_values(ascending=False).groupby(level = 0).head(1)

I would use idxmax: (df_clean.groupby(['Profile Name','title_clean'])['Duration'].sum() .loc[lambda g: g.groupby('Profile Name').idxmax()] .reset_index() ) Output: Profile Name title_clean Duration 0 AAA The Sinner 0 days 00:41:50 1 BBB Avatar 0 days 00:00:15

def function1(dd:pd.DataFrame): dd1=dd.groupby("title_clean")['Duration'].sum().sort_values(ascending=False).head(1) return dd1.rename("Duration") df1.groupby('Profile Name').apply(function1) out： Profile Name title_clean AAA The Sinner 0 days 00:41:50 BBB Avatar 0 days 00:00:15

Related

Calculates a standard deviation columns for timedelta elements

Apply the average to a timedelta column for two group conditions

How to add days to pandas datetime until x day is met?

Can I group by a column and resample a date?

How To Slice, Rank & Wrangle Like A Pandas Boss

Categories

Resources