Say one has a lookup table summarizing the busy lives of a few people on this planet...
import pandas as pd
import numpy as np
t = pd.Timestamp  # shorthand for building Timestamps below
lu = pd.DataFrame({ 'name' : ['Bill','Elon','Larry','Jeff','Marissa'],
'feels' : ['charitable','Alcoa envy','Elon envy','like the number 7','sassy'],
'last ate' : [t('20151209'),t('20151201'),t('20151208'),t('20151208'),t('20151209')],
'boxers' : [True,True,True,False,True]})
Say one also knows where these people live and when they did certain things...
af = pd.DataFrame({ 'name' : ['Bill','Elon','Larry','Elon','Jeff','Larry','Larry'],
'address' : ['in my computer','moon','internet','mars','cardboard box','autonomous car','every where'],
'sq_ft' : [2,2135,69,84535, 1.32, 54,168],
'forks' : [7,1,2,1,0,np.nan,1]})
rand_dates=[t('20141202'),t('20130804'),t('20120508'),t('20150411'),
t('20141209'),t('20091023'),t('20130921'),t('20110102'),
t('20130728'),t('20141119'),t('20151024'),t('20130824')]
df = pd.DataFrame({ 'name' : ['Elon','Bill','Larry','Elon','Jeff','Larry','Larry','Bill','Larry','Elon','Marissa','Jeff'],
'activity' : ['slept','tripped','spoke','swam','spooked','liked','whistled','up dog','smiled','donated','grant men paternity leave','fondled'],
'date' : rand_dates})
One could rank these people according to how many addresses they live at as follows:
af.name.value_counts()
Larry 3
Elon 2
Jeff 1
Bill 1
Need 1: Using the ranking above, how would one create a new "ranked" dataframe composed of information from lookup table lu? Simply put, how does one make Exhibit A?
# Exhibit A
boxers feels last ate name addresses
0 True Elon envy 2015-12-08 Larry 3
1 True Alcoa envy 2015-12-01 Elon 2
2 False like the number 7 2015-12-08 Jeff 1
3 True charitable 2015-12-09 Bill 1
Need 2: Observe the output of the groupby operation that follows. How can one determine the time delta between the oldest and newest dates in order to rank members of lu by those time deltas? Simply put, how does one get from the groupby to Exhibit D?
df.groupby(['name','date']).size()
name date
Bill 2011-01-02 1
2013-08-04 1
Elon 2014-11-19 1
2014-12-02 1
2015-04-11 1
Jeff 2013-08-24 1
2014-12-09 1
Larry 2009-10-23 1
2012-05-08 1
2013-07-28 1
2013-09-21 1
Marissa 2015-10-24 1
#Exhibit B - Calculate time deltas
name time_delta
Bill Timedelta('945 days 00:00:00')
Elon Timedelta('143 days 00:00:00')
Jeff Timedelta('472 days 00:00:00')
Larry Timedelta('1429 days 00:00:00')
Marissa Timedelta('0 days 00:00:00')
#Exhibit C - Rank time deltas (this is easy)
name time_delta
Larry Timedelta('1429 days 00:00:00')
Bill Timedelta('945 days 00:00:00')
Jeff Timedelta('472 days 00:00:00')
Elon Timedelta('143 days 00:00:00')
Marissa Timedelta('0 days 00:00:00')
#Exhibit D - Add to and re-rank the table built in Exhibit A according to time_delta
boxers feels last ate name addresses time_delta
0 True Elon envy 2015-12-08 Larry 3 1429 days 00:00:00
1 True charitable 2015-12-09 Bill 1 945 days 00:00:00
2 False like the number 7 2015-12-08 Jeff 1 472 days 00:00:00
3 True Alcoa envy 2015-12-01 Elon 2 143 days 00:00:00
4 True sassy 2015-12-09 Marissa NaN 0 days 00:00:00
Prior Research: This SO post on getting max values using groupby and transform and this other SO post on finding and selecting the most frequent data are informative, but they don't work on a Series (the result of value_counts()) or just trip me up... I've actually gotten the first part to work, but the code is ugly and likely inefficient.
Easy Peasy Code Sharing
Check out this IPython Notebook that lays everything out. Otherwise, check out the Python 2.7 code here.
I think you can use join and sort_values; see aggregation in the docs.
#join value counts to the lu dataframe, renaming and sorting
Exhibit_A = lu.set_index('name').join(af.name.value_counts()).rename(columns={'name': 'addresses'}).sort_values('addresses', ascending=False)
#drop rows with NaN, reset index
print(Exhibit_A.dropna().reset_index())
name boxers feels last ate addresses
0 Larry True Elon envy 2015-12-08 3
1 Elon True Alcoa envy 2015-12-01 2
2 Bill True charitable 2015-12-09 1
3 Jeff False like the number 7 2015-12-08 1
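A side note for newer pandas (2.0+): there value_counts() returns a Series named count rather than name, so the rename above would miss; renaming the counts Series itself sidesteps the version difference. A minimal sketch of that variant:
#name the counts Series directly so the joined column is 'addresses' in any pandas version
Exhibit_A = (lu.set_index('name')
               .join(af['name'].value_counts().rename('addresses'))
               .sort_values('addresses', ascending=False))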
#aggregate to min and max date per name
g = df.groupby(['name']).agg({'date': ['max', 'min']})
#flatten the columns MultiIndex down to just the aggregation names
g.columns = g.columns.get_level_values(1)
g['time_delta'] = g['max'] - g['min']
#drop the helper columns
g = g.drop(['max', 'min'], axis=1)
#join to Exhibit_A, sort, reset index
Exhibit_D = Exhibit_A.join(g).sort_values('time_delta', ascending=False).reset_index()
#reorder columns
Exhibit_D = Exhibit_D[['boxers', 'feels', 'last ate', 'name', 'addresses' , 'time_delta' ]]
print(Exhibit_D)
boxers feels last ate name addresses time_delta
0 True Elon envy 2015-12-08 Larry 3 1429 days
1 True charitable 2015-12-09 Bill 1 945 days
2 False like the number 7 2015-12-08 Jeff 1 472 days
3 True Alcoa envy 2015-12-01 Elon 2 143 days
4 True sassy 2015-12-09 Marissa NaN 0 days
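As a side note, pandas 0.25+ supports named aggregation, which avoids the columns MultiIndex entirely; a minimal sketch of the same time_delta computation on df:
#one flat column per aggregation, no MultiIndex to clean up afterwards
g = df.groupby('name').agg(newest=('date', 'max'), oldest=('date', 'min'))
g['time_delta'] = g['newest'] - g['oldest']
g = g[['time_delta']]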
Related
My dataframe (df) holds 12 months of data and consists of 5m rows. One of the columns is day_of_week, with values Monday to Sunday. This df also has a unique key, the ride_id column. I want to calculate the average number of rides per day_of_week. I have calculated the number of rides per day_of_week using
copydf.groupby(['day_of_week']).agg(number_of_rides=('day_of_week', 'count'))
However, I find it hard to calculate the mean/average for each day of week. I have tried:
copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count')).mean()
and
avg_days = copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
avg_days.groupby(['day_of_week']).agg('number_of_rides', 'mean')
They didn't work. I want the output to have three columns (day_of_week, number_of_rides, and avg_num_of_rides), or two columns (day_of_week or weekday_num, and avg_num_of_rides).
This is my df. Kindly note that the code block has wrapped some column lines because of the long column names.
ride_id rideable_type started_at ended_at start_station_name start_station_id end_station_name end_station_id start_lat start_lng end_lat end_lng member_or_casual ride_length year month day_of_week hour weekday_num
0 9DC7B962304CBFD8 electric_bike 2021-09-28 16:07:10 2021-09-28 16:09:54 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.89 -87.68 41.89 -87.67 casual 2 2021 September Tuesday 16 1
1 F930E2C6872D6B32 electric_bike 2021-09-28 14:24:51 2021-09-28 14:40:05 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.94 -87.64 41.98 -87.67 casual 15 2021 September Tuesday 14 1
2 6EF72137900BB910 electric_bike 2021-09-28 00:20:16 2021-09-28 00:23:57 Streeter Dr & Grand Ave 13022 Streeter Dr & Grand Ave 13022 41.81 -87.72 41.80 -87.72 casual 3 2021 September Tuesday 0 1
This is the output I desire
number_of_rides average_number_of_rides
day_of_week
Saturday 964079 50.4
Sunday 841919 70.9
Wednesday 840272 90.2
Thursday 836973 77.2
Friday 818205 34.4
Tuesday 814496 34.4
Monday 767002 200.3
Again, I have already calculated the number of rides per day_of_week; what I want is just to add the third column, or better still, have average_rides per weekday (Monday or 0, Tuesday or 1, Wednesday or 2) in its own output df.
Thanks
To get the average number of rides per weekday, you need the total rides on that weekday and the number of weeks.
You can compute the week number from date:
df["week_number"] = df["started_at"].dt.isocalendar().week
>> ride_id started_at day_of_week week_number
>> 0 1 2021-09-20 Monday 38
>> 1 2 2021-09-21 Tuesday 38
>> 2 3 2021-09-20 Monday 38
>> 3 4 2021-09-21 Tuesday 38
>> 4 5 2021-09-27 Monday 39
>> 5 6 2021-09-28 Tuesday 39
Then group by day_of_week and week_number to compute an aggregate dataframe:
week_number_group_df = df.groupby(["day_of_week", "week_number"]).agg(number_of_rides_on_day=("ride_id", "count"))
>> number_of_rides_on_day
>> day_of_week week_number
>> Monday 38 2
>> 39 1
>> Tuesday 38 2
>> 39 1
Use the aggregated dataframe to get the final results:
week_number_group_df.groupby("day_of_week").agg(number_of_rides=("number_of_rides_on_day", "sum"), average_number_of_rides=("number_of_rides_on_day", "mean"))
>> number_of_rides average_number_of_rides
>> day_of_week
>> Monday 3 1.5000
>> Tuesday 3 1.5000
As far as I understand, you're not trying to compute the average over a field in your grouped data (as @Azhar Khan pointed out), but an averaged count of rides per weekday over your original 12-month period.
Basically, you need two elements:
First, the count of rides per weekday you observe in your dataframe. That's exactly what you get with copydf.groupby(['day_of_week']).agg(number_of_rides=('ride_id', 'count'))
Secondly, the count of weekdays in your period. If you consider the year 2022 as an example, you can get such data with the following code snippet:
df_year = pd.DataFrame(data=pd.date_range(start=pd.to_datetime('2022-01-01'),
                                           end=pd.to_datetime('2022-12-31'),
                                           freq='1D'),
                       columns=['date'])
df_year["day_of_week"] = df_year["date"].dt.weekday
nb_weekdays_in_year = df_year.groupby('day_of_week').agg(nb_days=('date', 'count'))
This gives, for each weekday, how many times it occurs in 2022 (52 or 53).
Once you have both these dataframes, you can simply join them, with nb_weekdays_in_year.join(nb_rides_per_day) for instance, and then take the ratio of the two columns to get your average.
The difficulty here lies in the fact that you need the total number of weekdays of each type over your period, which you cannot get from your observations directly (what if there are missing values?). Also, note that you're not trying to compute an intra-group average, so you cannot use a simple agg function like 'mean' directly.
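Putting the pieces together, a minimal sketch of the join and ratio, assuming copydf from the question and using dt.day_name() in df_year so both weekday indexes hold the same labels:
#counts per weekday observed in the ride data
nb_rides_per_day = copydf.groupby('day_of_week').agg(number_of_rides=('ride_id', 'count'))
#how many times each weekday occurs in the period (here: calendar year 2022)
df_year = pd.DataFrame({'date': pd.date_range('2022-01-01', '2022-12-31', freq='1D')})
df_year['day_of_week'] = df_year['date'].dt.day_name()
nb_weekdays_in_year = df_year.groupby('day_of_week').agg(nb_days=('date', 'count'))
#join on the weekday index and take the ratio
result = nb_weekdays_in_year.join(nb_rides_per_day)
result['average_number_of_rides'] = result['number_of_rides'] / result['nb_days']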
Using pivot we can solve this.
import pandas as pd
import numpy as np
df = pd.read_csv('/content/test.csv')
df.head()
# sample df
date rides
0 2019-10-01 1
1 2019-10-02 2
2 2019-10-03 5
3 2019-10-04 3
4 2019-10-05 2
df['date'] = pd.to_datetime(df['date'])
# extract the ISO week number (Series.dt.week was removed in newer pandas)
df['weekNo'] = df['date'].dt.isocalendar().week
date rides weekNo
0 2019-10-01 1 40
1 2019-10-02 2 40
2 2019-10-03 5 40
Method 1: Use Pivot table
df.pivot_table(values='rides',index='weekNo',aggfunc='mean')
output
rides
weekNo
40 2.833333
41 2.571429
42 4.000000
Method 2: Use groupby.mean()
df.groupby('weekNo')['rides'].mean()
I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id':['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
'Date':['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
'Quality':[2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
                 Quality
Id   Date
A4G8 2016-1-1          2
     2016-1-15         4
     2016-1-30         6
P9N3 2017-2-12         1
     2017-2-28         5
     2017-3-10        10
     2019-1-1         10
C7R5 2018-6-1          2
L4U7 2019-8-6          2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below
Id    Date        Quality  Time From First  Time To Prev
A4G8  2016-1-1          2  0 days           NA days
      2016-1-15         4  14 days          14 days
      2016-1-30         6  29 days          14 days
P9N3  2017-2-12         1  0 days           NA days
      2017-2-28         5  15 days          15 days
      2017-3-10        10  24 days          9 days
The Id column is a string type, and I've converted the date column into datetime, and the Quality column into an integer.
The dataframe is rather large (>10,000 unique Ids), so for performance reasons I'm trying to avoid using for loops. I'm guessing the solution somehow uses pd.eval, but I'm stuck as to how to apply it correctly.
Apologies, I'm a Python, pandas, & Stack Overflow noob and I haven't found the answer anywhere yet, so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Date to datetime, then subtract the per-group minimum datetime (via GroupBy.transform('min')) from the Date column; for the second new column, use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
I have the following dataframe in Python:
ID  country_ID  visit_time
0   ESP         10 days 12:03:00
0   ENG         5 days 10:02:00
1   ENG         3 days 08:05:03
1   ESP         1 days 03:02:00
1   ENG         2 days 07:01:03
1   ENG         3 days 01:00:52
2   ENG         0 days 12:01:02
2   ENG         1 days 22:10:03
2   ENG         0 days 20:00:50
For each ID, I want to get avg_visit_ESP and avg_visit_ENG columns:
The average visit time with country_ID = ESP for each ID.
The average visit time with country_ID = ENG for each ID.
ID  avg_visit_ESP     avg_visit_ENG
0   10 days 12:03:00  5 days 10:02:00
1   1 days 03:02:00   (8 days 16:06:58) / 3
2   NaT               (3 days 06:11:55) / 3
I don't know how to specify a double grouping in groupby, first by ID and then by country_ID. If you can help me, I would appreciate it.
P.S.: visit_time is a timedelta, so it supports addition and division without any apparent problem:
import pandas as pd
date1 = pd.to_datetime('2022-02-04 10:10:21', format='%Y-%m-%d %H:%M:%S')
date2 = pd.to_datetime('2022-02-05 20:15:41', format='%Y-%m-%d %H:%M:%S')
date3 = pd.to_datetime('2022-02-07 20:15:41', format='%Y-%m-%d %H:%M:%S')
sum1date = date2-date1
sum2date = date3-date2
sum3date = date3-date1
print((sum1date+sum2date+sum3date)/3)
(df.groupby(['ID', 'country_ID'])['visit_time']
.mean(numeric_only=False)
.unstack()
.add_prefix('avg_visit_')
)
should do the trick
>>> df = pd.read_clipboard(sep='\s\s+')
>>> df.columns = [s.strip() for s in df]
>>> df['visit_time'] = pd.to_timedelta(df['visit_time'])
>>> df.groupby(['ID', 'country_ID'])['visit_time'].mean(numeric_only=False).unstack().add_prefix('avg_visit_')
country_ID avg_visit_ENG avg_visit_ESP
ID
0 5 days 10:02:00 10 days 12:03:00
1 2 days 21:22:19.333333333 1 days 03:02:00
2 1 days 02:03:58.333333333 NaT
I have a dataframe with one column in datetime format and the other columns in integers and floats. I would like to group the dataframe by the weekday of the first column. The other columns would be added.
print (df)
Day Butter Bread Coffee
2019-07-01 00:00:00 2 2 4
2019-07-01 00:00:00 1 2 1
2019-07-02 00:00:00 5 4 8
Basically the outcome would be something like:
print (df)
Day Butter Bread Coffee
Monday 3 4 5
Tuesday 5 4 8
I am flexible about whether it says exactly Monday, or MO, or 01 for the first day of the week, as long as it is visible which consumption was done on Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.
You should convert your "Day" to datetime type and then you can extract the day of the week and aggregate over the rest of the columns:
import pandas as pd
df['Day'] = pd.to_datetime(df['Day'])
df.groupby(df['Day'].dt.day_name()).sum()
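If a numeric day of week is preferred (the question allows 01 for Monday and so on), dt.weekday works the same way; a minimal variant on the same df:
# 0 = Monday ... 6 = Sunday; numeric_only avoids trying to sum the Day column itself
df.groupby(df['Day'].dt.weekday).sum(numeric_only=True)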
Try using .dt.day_name() with groupby() and sum():
df = pd.DataFrame(data={'day':['2019-07-01 00:00:00','2019-07-01 00:00:00','2019-07-02 00:00:00'],
'butter':[2,1,5],
'bread':[2,2,4],
'coffee':[4,1,8]})
df['day'] = pd.to_datetime(df['day']).dt.day_name()
df.groupby(['day'],as_index=False).sum()
day butter bread coffee
0 Monday 3 4 5
1 Tuesday 5 4 8
I have a Pandas DataFrame with examples of soccer games. There are two attributes, home_team_name and away_team_name. Teams play each other twice: in the first leg one team is home and the other away, then the situation is reversed. One team name can appear many times in the data set as the away or home team, but only twice (once home and once away) in combination with one specific other team. I want to split the data into two parts.
1498744800,Jun 29 2017 - 2:00pm,complete,8000,Irtysh,Dunav 2010
1498747500,Jun 29 2017 - 2:45pm,complete,15000,Kairat,Atlantas
1499360400,Jul 6 2017 - 5:00pm,complete,5100,Dunav 2010,Irtysh
1499356800,Jul 6 2017 - 4:00pm,complete,1450,Atlantas,Kairat
An example from the .csv file used to create the dataframe. I want the first example in one part and the second in the second part. The examples are not next to each other in the real .csv; this is just to illustrate what I want. In this example, the first and second rows would go into the first part and the third and fourth into the second.
In the first part there will be the games of the first leg.
In the second part there should be return legs of these games. So the ones in which home_team_name is the away_team_name from the first leg and the away_team_name is the home_team_name from the first leg.
Feel free to ask for a better explanation.
First sort the team-name values with numpy.sort and create a boolean mask with DataFrame.duplicated, then filter by boolean indexing (~ inverts the boolean mask):
m = (pd.DataFrame(np.sort(df[['home_team_name','away_team_name']], axis=1), index=df.index)
.duplicated(keep='last'))
print (m)
0 True
1 True
2 False
3 False
dtype: bool
df1 = df[m]
print (df1)
id date state val home_team_name \
0 1498744800 Jun 29 2017 - 2:00pm complete 8000 Irtysh
1 1498747500 Jun 29 2017 - 2:45pm complete 15000 Kairat
away_team_name
0 Dunav 2010
1 Atlantas
df2 = df[~m]
print (df2)
id date state val home_team_name \
2 1499360400 Jul 6 2017 - 5:00pm complete 5100 Dunav 2010
3 1499356800 Jul 6 2017 - 4:00pm complete 1450 Atlantas
away_team_name
2 Irtysh
3 Kairat
Details:
print (pd.DataFrame(np.sort(df[['home_team_name','away_team_name']], axis=1), index=df.index))
0 1
0 Dunav 2010 Irtysh
1 Atlantas Kairat
2 Dunav 2010 Irtysh
3 Atlantas Kairat
Here is a bit of an unorthodox way to do this. It involves assembling a new column, mash, which is the same for both members of a pair, then grouping by this column and selecting the first and last legs:
from io import StringIO
df = pd.read_table(StringIO("""id,date,done,attendance,home,away
1498744800,Jun 29 2017 - 2:00pm,complete,8000,Irtysh,Dunav 2010
1498747500,Jun 29 2017 - 2:45pm,complete,15000,Kairat,Atlantas
1499360400,Jul 6 2017 - 5:00pm,complete,5100,Dunav 2010,Irtysh
1499356800,Jul 6 2017 - 4:00pm,complete,1450,Atlantas,Kairat
1498744800,July 23 2017 - 2:00pm,complete,8000,Arsenal,Chelsea
1498747500,July 26 2017 - 2:45pm,complete,15000,Wolves,Liverpool
1499360400,Jul 28 2017 - 5:00pm,complete,5100,Liverpool,Wolves
1499356800,Aug 3 2017 - 4:00pm,complete,1450,Chelsea,Arsenal"""), sep=",")
df['mash'] = df.home + df.away
df.mash = df.mash.apply(sorted)
df.mash = df.mash.str.join("")
df.date = df.date.astype('datetime64[ns]')
df = df.sort_values('date')
first_leg_df = df.groupby('mash').first().reset_index(drop=True)
second_leg_df = df.groupby('mash').last().reset_index(drop=True)
First Leg Result:
id date done attendance home away
0 1498744800 2017-06-29 14:00:00 complete 8000 Irtysh Dunav 2010
1 1498744800 2017-07-23 14:00:00 complete 8000 Arsenal Chelsea
2 1498747500 2017-06-29 14:45:00 complete 15000 Kairat Atlantas
3 1498747500 2017-07-26 14:45:00 complete 15000 Wolves Liverpool
Second Leg Result:
id date done attendance home away
0 1499360400 2017-07-06 17:00:00 complete 5100 Dunav 2010 Irtysh
1 1499356800 2017-08-03 16:00:00 complete 1450 Chelsea Arsenal
2 1499356800 2017-07-06 16:00:00 complete 1450 Atlantas Kairat
3 1499360400 2017-07-28 17:00:00 complete 5100 Liverpool Wolves
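One caveat on the mash key above: sorting the characters of the concatenated names could, in principle, collide for two different pairings that happen to share the same letters. A slightly more robust key, under the same assumptions, sorts the team names themselves:
#key is the alphabetically ordered pair of team names, e.g. "Dunav 2010|Irtysh"
df['mash'] = ['|'.join(sorted(pair)) for pair in zip(df.home, df.away)]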