I have the following dataframe in Python:
ID
country_ID
visit_time
0
ESP
10 days 12:03:00
0
ENG
5 days 10:02:00
1
ENG
3 days 08:05:03
1
ESP
1 days 03:02:00
1
ENG
2 days 07:01:03
1
ENG
3 days 01:00:52
2
ENG
0 days 12:01:02
2
ENG
1 days 22:10:03
2
ENG
0 days 20:00:50
For each ID, I want to get:
avg_visit_ESP and avg_visit_ENG columns.
Average time visit with country_ID = ESP for each ID.
Average time visit with country_ID = ENG for each ID.
ID
avg_visit_ESP
avg_visit_ENG
0
10 days 12:03:00
5 days 10:02:00
1
1 days 03:02:00
(8 days 16:06:58) / 3
2
NaT
(3 days 06:11:55) / 3
I don't know how to specify in groupby a double grouping, first by ID and then by country_ID. If you can help me I would appreciate it.
P.S.: The date format of visit_time (timedelta), can perform addition and division without any apparent problem.
from datetime import datetime, timedelta
date1 = pd.to_datetime('2022-02-04 10:10:21', format='%Y-%m-%d %H:%M:%S')
date2 = pd.to_datetime('2022-02-05 20:15:41', format='%Y-%m-%d %H:%M:%S')
date3 = pd.to_datetime('2022-02-07 20:15:41', format='%Y-%m-%d %H:%M:%S')
sum1date = date2-date1
sum2date = date3-date2
sum3date = date3-date1
print((sum1date+sum2date+sum3date)/3)
(df.groupby(['ID', 'country_ID'])['visit_time']
.mean(numeric_only=False)
.unstack()
.add_prefix('avg_visit_')
)
should do the trick
>>> df = pd.read_clipboard(sep='\s\s+')
>>> df.columns = [s.strip() for s in df]
>>> df['visit_time'] = pd.to_timedelta(df['visit_time'])
>>> df.groupby(['ID', 'country_ID'])['visit_time'].mean(numeric_only=False).unstack().add_prefix('avg_visit_')
country_ID avg_visit_ENG avg_visit_ESP
ID
0 5 days 10:02:00 10 days 12:03:00
1 2 days 21:22:19.333333333 1 days 03:02:00
2 1 days 02:03:58.333333333 NaT
Related
I have a column of dates in unix timestamps and i need to convert them into relative dates from the starting activity.
The final output should be the column D, which expresses the relative time from the activity which has index = 1, in particular the relative time has always to refer to the first activity (index=1).
A index timestamp D
activity1 1 1.612946e+09 0
activity2 2 1.614255e+09 80 hours
activity3 1 1.612181e+09 0
activity4 2 1.613045e+09 50 hours
activity5 3 1.637668e+09 430 hours
Any idea?
Use to_datetime with unit='s' and then create groups starting by index equal 1 and get first value, last subtract and convert to hours:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
s = df.groupby(df['index'].eq(1).cumsum())['timestamp'].transform('first')
df['D1'] = df['timestamp'].sub(s).dt.total_seconds().div(3600)
print (df)
A index timestamp D D1
0 activity1 1 2021-02-10 08:33:20 0 0.000000
1 activity2 2 2021-02-25 12:10:00 80 hours 363.611111
2 activity3 1 2021-02-01 12:03:20 0 0.000000
3 activity4 2 2021-02-11 12:03:20 50 hours 240.000000
4 activity5 3 2021-11-23 11:46:40 430 hours 7079.722222
I have been trying to sum the hours by activity in a dataframe but it didn't work.
the code:
import pandas as pd
fileurl = r'https://docs.google.com/spreadsheets/d/1WuvvsZCfbcioYLvwwHuSunUbs4tjvv05/edit?usp=sharing&ouid=105286407332351152540&rtpof=true&sd=true'
df = pd.read_excel(fileurl, header=0)
df.groupby('Activity').sum()
excel link : https://docs.google.com/spreadsheets/d/1WuvvsZCfbcioYLvwwHuSunUbs4tjvv05/edit?usp=sharing&ouid=105286407332351152540&rtpof=true&sd=true
You have to force hours column to be strings else you will get datetime.time instance.
df = pd.read_excel(fileurl, header=0, dtype={'hours': str})
out = (df.assign(hours=pd.to_timedelta(df['hours']))
.groupby('Activity', as_index=False)['hours'].sum())
print(out)
# Output
Activity hours
0 bushwalking 0 days 04:45:00
1 cycling 0 days 11:30:00
2 football 0 days 03:42:00
3 gym 0 days 07:00:00
4 running 0 days 14:00:00
5 swimming 0 days 13:15:00
6 walking 0 days 04:00:00
I've got a dataframe in pandas that stores the Id of a person, the quality of interaction, and the date of the interaction. A person can have multiple interactions across multiple dates, so to help visualise and plot this I converted it into a pivot table grouping first by Id then by date to analyse the pattern over time.
e.g.
import pandas as pd
df = pd.DataFrame({'Id':['A4G8','A4G8','A4G8','P9N3','P9N3','P9N3','P9N3','C7R5','L4U7'],
'Date':['2016-1-1','2016-1-15','2016-1-30','2017-2-12','2017-2-28','2017-3-10','2019-1-1','2018-6-1','2019-8-6'],
'Quality':[2,3,6,1,5,10,10,2,2]})
pt = df.pivot_table(values='Quality', index=['Id','Date'])
print(pt)
Leads to this:
Id
Date
Quality
A4G8
2016-1-1
2
2016-1-15
4
2016-1-30
6
P9N3
2017-2-12
1
2017-2-28
5
2017-3-10
10
2019-1-1
10
C7R5
2018-6-1
2
L4U7
2019-8-6
2
However, I'd also like to...
Measure the time from the first interaction for each interaction per Id
Measure the time from the previous interaction with the same Id
So I'd get a table similar to the one below
Id
Date
Quality
Time From First
Time To Prev
A4G8
2016-1-1
2
0 days
NA days
2016-1-15
4
14 days
14 days
2016-1-30
6
29 days
14 days
P9N3
2017-2-12
1
0 days
NA days
2017-2-28
5
15 days
15 days
2017-3-10
10
24 days
9 days
The Id column is a string type, and I've converted the date column into datetime, and the Quality column into an integer.
The column is rather large (>10,000 unique ids) so for performance reasons I'm trying to avoid using for loops. I'm guessing the solution is somehow using pd.eval but I'm stuck as to how to apply it correctly.
Apologies I'm a python, pandas, & stack overflow) noob and I haven't found the answer anywhere yet so even some pointers on where to look would be great :-).
Many thanks in advance
Convert Dates to datetimes and then substract minimal datetimes per groups by GroupBy.transformb subtracted by column Date and for second new column use DataFrameGroupBy.diff:
df['Date'] = pd.to_datetime(df['Date'])
df['Time From First'] = df['Date'].sub(df.groupby('Id')['Date'].transform('min'))
df['Time To Prev'] = df.groupby('Id')['Date'].diff()
print (df)
Id Date Quality Time From First Time To Prev
0 A4G8 2016-01-01 2 0 days NaT
1 A4G8 2016-01-15 3 14 days 14 days
2 A4G8 2016-01-30 6 29 days 15 days
3 P9N3 2017-02-12 1 0 days NaT
4 P9N3 2017-02-28 5 16 days 16 days
5 P9N3 2017-03-10 10 26 days 10 days
6 P9N3 2019-01-01 10 688 days 662 days
7 C7R5 2018-06-01 2 0 days NaT
8 L4U7 2019-08-06 2 0 days NaT
df["Date"] = pd.to_datetime(df.Date)
df = df.merge(
df.groupby(["Id"]).Date.first(),
on="Id",
how="left",
suffixes=["", "_first"]
)
df["Time From First"] = df.Date-df.Date_first
df['Time To Prev'] = df.groupby('Id').Date.diff()
df.set_index(["Id", "Date"], inplace=True)
df
output:
I have the following dataframe in Python:
ID
country_ID
visit_time
0
ESP
10 days 12:03:00
0
ESP
5 days 02:03:00
0
ENG
5 days 10:02:00
1
ENG
3 days 08:05:03
1
ESP
1 days 03:02:00
1
ENG
2 days 07:01:03
2
ENG
0 days 12:01:02
For each ID I want to calculate the standard deviation of each country_ID group.
std_visit_ESP and std_visit_ENG columns.
standard deviation of visit time with country_ID = ESP for each ID.
standard deviation of visit time with country_ID = ENG for each ID.
ID
std_visit_ESP
std_visit_ENG
0
2 days 17:00:00
0 days 00:00:00
1
0 days 00:00:00
0 days 12:32:00
2
NaT
0 days 00:00:00
With the groupby method for the mean, you can specify the parameter numeric_only = False, but the std method of groupby does not include this option.
My idea is to convert the timedelta to seconds, calculate the standard deviation and then convert it back to timedelta. Here is an example:
td1 = timedelta(10,0,0,0,3,12,0).total_seconds()
td2 = timedelta(5,0,0,0,3,2,0).total_seconds()
arr = [td1,td2]
var = np.std(arr)
show_s = pd.to_timedelta(var, unit='s')
print(show_s)
I don't know how to use this with groupby to get the desired result. I am grateful for your help.
Use GroupBy.std and pd.to_timedelta
total_seconds = \
pd.to_timedelta(
df['visit_time'].dt.total_seconds()
.groupby([df['ID'], df['country_ID']]).std(),
unit='S').unstack().fillna(pd.Timedelta(days=0))
print(total_seconds)
country_ID ENG ESP
ID
0 0 days 00:00:00 3 days 19:55:25.973595304
1 0 days 17:43:29.315934274 0 days 00:00:00
2 0 days 00:00:00 0 days 00:00:00
If I understand correctly, this should work for you:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x))
Output:
>>> stddevs
country_ID
ENG 2 days 01:17:43.835702
ESP 4 days 16:40:16.598773
Name: visit_time, dtype: timedelta64[ns]
Formatting:
stddevs = df['visit_time'].dt.total_seconds().groupby([df['country_ID']]).std().apply(lambda x: pd.Timedelta(seconds=x)).to_frame().T.add_prefix('std_visit_').reset_index(drop=True).rename_axis(None, axis=1)
Output:
>>> stddevs
std_visit_ENG std_visit_ESP
0 2 days 01:17:43.835702 4 days 16:40:16.598773
I have a pandas dataframe data-
Round Number Date
1 7/4/2018 20:00
1 8/4/2018 16:00
1 8/4/2018 20:00
1 9/4/2018 20:00
Now I want to create a new dataframe which has two columns
['Date' ,'flag']
The Date column will have the dates of the range of dates in the data dataframe(in the actual data the dates are in the range of 7/4/2018 8:00:00 PM to 27/05/2018 19:00 so the date column in the new dataframe will have dates from 1/4/2018 to 30/05/2018 since 7/4/2018 8:00:00 PM is in the month of April so we will include the whole month of April and similarly since 27/05/2018 is in May so we include dates from 1/05/2018 t0 30/05/2018.
In the flag column we put 1 if that particular date was there in the old dataframe.
Output(partial)-
Date Flag
1/4/2018 0
2/4/2018 0
3/4/2018 0
4/4/2018 0
5/4/2018 0
6/4/2018 0
7/4/2018 1
8/4/2018 1
and so on...
I would use np.where() to address this issue. Furthermore, I'm working to improve the answer by setting the dateranges from old_df to be input of new_df
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00'],'value':[1,2]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
new_df = pd.DataFrame({'date':pd.date_range(start='4/1/2018',end='5/30/2019',freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)
print(new_df.head(10))
Output:
date flag
0 2018-04-01 0
1 2018-04-02 0
2 2018-04-03 0
3 2018-04-04 0
4 2018-04-05 0
5 2018-04-06 0
6 2018-04-07 1
7 2018-04-08 1
8 2018-04-09 0
9 2018-04-10 0
Edit:
Improved version, full code:
import pandas as pd
import numpy as np
old_df = pd.DataFrame({'date':['4/7/2018 20:00','4/8/2018 20:00','5/30/2018 20:00'],'value':[1,2,3]})
old_df['date'] = pd.to_datetime(old_df['date'],infer_datetime_format=True)
if old_df['date'].min().month < 10:
start_date = pd.to_datetime(
("01/0"+str(old_df['date'].min().month)+"/"+str(old_df['date'].min().year)))
else:
start_date = pd.to_datetime(
("01/"+str(old_df['date'].min().month)+"/"+str(old_df['date'].min().year)))
end_date = old_df['date'].max()
end_date = pd.to_datetime(old_df['date'].max())
new_df = pd.DataFrame({'date':pd.date_range(start=start_date,end=end_date,freq='d')})
new_df['flag'] = np.where(new_df['date'].dt.date.astype(str).isin(old_df['date'].dt.date.astype(str).tolist()),1,0)