I have sorted df data like below:
day_name Day_id
time
2019-05-20 19:00:00 Monday 0
2018-12-31 15:00:00 Monday 0
2019-02-25 17:00:00 Monday 0
2019-05-06 20:00:00 Monday 0
2019-03-12 12:00:00 Tuesday 1
2019-04-16 15:00:00 Tuesday 1
2019-04-02 18:00:00 Tuesday 1
2019-02-05 09:00:00 Tuesday 1
2019-05-28 21:00:00 Tuesday 1
2019-01-15 12:00:00 Tuesday 1
2019-06-04 20:00:00 Tuesday 1
2018-12-04 07:00:00 Tuesday 1
2019-01-22 11:00:00 Tuesday 1
2019-01-09 07:00:00 Wednesday 2
2019-03-06 16:00:00 Wednesday 2
2019-06-19 17:00:00 Wednesday 2
2019-04-10 20:00:00 Wednesday 2
2019-04-24 15:00:00 Wednesday 2
2019-01-31 08:00:00 Thursday 3
2019-01-03 08:00:00 Thursday 3
2019-02-28 19:00:00 Thursday 3
2019-05-23 20:00:00 Thursday 3
2018-12-20 07:00:00 Thursday 3
2019-05-09 19:00:00 Thursday 3
2019-06-28 15:00:00 Friday 4
2019-03-22 12:00:00 Friday 4
2019-03-29 14:00:00 Friday 4
2018-12-15 08:00:00 Saturday 5
2019-02-17 11:00:00 Sunday 6
2019-06-16 19:00:00 Sunday 6
2018-12-02 08:00:00 Sunday 6
Currently, with the help of this post:
df = df.groupby(df.day_name).count().plot(kind="bar")
plt.show()
my output is:
How can I plot the histogram with the days of the week in proper order (Monday, Tuesday, ...)?
I have found several approaches (1, 2, 3) to solve this, but I can't work out how to apply them to my case.
Thank you all for your hard work.
Your data is already sorted by Day_id, so you only need sort=False in groupby to keep the groups in their order of first appearance:
m = df.groupby(df.day_name,sort=False).count().plot(kind="bar")
plt.show()
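If the frame is not already sorted by weekday, sort=False alone will not give Monday-to-Sunday order. Here is a minimal sketch of one alternative, assuming a small made-up sample (a few timestamps taken from the question and deliberately shuffled) and an explicit `days` list that is not part of the original code: count per day, then reindex to calendar order.

```python
import pandas as pd

# Hypothetical sample: a few timestamps from the question, deliberately unsorted.
idx = pd.to_datetime([
    "2019-06-16 19:00",  # Sunday
    "2019-05-20 19:00",  # Monday
    "2019-03-12 12:00",  # Tuesday
    "2018-12-31 15:00",  # Monday
    "2019-01-09 07:00",  # Wednesday
])
df = pd.DataFrame({"day_name": idx.day_name()}, index=idx)

# Calendar order we want on the x-axis (an assumption of this sketch).
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Count rows per weekday, then force calendar order; missing days become 0.
counts = df.groupby("day_name").size().reindex(days, fill_value=0)
print(counts)
```

The resulting `counts` Series can then be passed to `.plot(kind="bar")` as in the question.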
I have a data table that looks like this:
ID ARRIVAL_DATE_TIME DISPOSITION_DATE
1 2021-11-07 08:35:00 2021-11-07 17:58:00
2 2021-11-07 13:16:00 2021-11-08 02:52:00
3 2021-11-07 15:12:00 2021-11-07 21:08:00
I want to be able to count the number of patients in our location by date/hour and hour. I imagine I would eventually have to transform this data into a format seen below and then create a pivot table, but I'm not sure how to first transform this data. So for example, ID 1 would have a row for each date/hour and hour between '2021-11-07 08:35:00' and '2021-11-07 17:58:00'.
ID DATE_HOUR_IN_ED HOUR_IN_ED
1 2021-11-07 08:00:00 8:00
1 2021-11-07 09:00:00 9:00
1 2021-11-07 10:00:00 10:00
1 2021-11-07 11:00:00 11:00
...
2 2021-11-07 13:00:00 13:00
2 2021-11-07 14:00:00 14:00
2 2021-11-07 15:00:00 15:00
....
Use to_datetime with Series.dt.floor to round the arrival times down to the hour, then concat one Series per row built from date_range, and finally create the DataFrame with the constructor:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
s = pd.concat([pd.Series(r.ID,pd.date_range(r.ARRIVAL_DATE_TIME,
r.DISPOSITION_DATE, freq='H'))
for r in df.itertuples()])
df1 = pd.DataFrame({'ID':s.to_numpy(),
'DATE_HOUR_IN_ED':s.index,
'HOUR_IN_ED': s.index.strftime('%H:%M')})
print (df1)
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00
30 3 2021-11-07 21:00:00 21:00
Alternative solution:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
L = [pd.date_range(s,e, freq='H')
for s, e in df[['ARRIVAL_DATE_TIME','DISPOSITION_DATE']].to_numpy()]
df['DATE_HOUR_IN_ED'] = L
df = (df.drop(['ARRIVAL_DATE_TIME','DISPOSITION_DATE'], axis=1)
.explode('DATE_HOUR_IN_ED')
.reset_index(drop=True)
.assign(HOUR_IN_ED = lambda x: x['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')))
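Once the data is in this long format, the count of patients per date/hour that the question asks about is a single groupby. This is a rough sketch under a few assumptions: the three sample rows from the question are typed in by hand, the names `long` and `census` are made up for illustration, and the frequency is spelled "h" (recent pandas versions prefer it over "H").

```python
import pandas as pd

# The three sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "ARRIVAL_DATE_TIME": pd.to_datetime(
        ["2021-11-07 08:35", "2021-11-07 13:16", "2021-11-07 15:12"]),
    "DISPOSITION_DATE": pd.to_datetime(
        ["2021-11-07 17:58", "2021-11-08 02:52", "2021-11-07 21:08"]),
})

# One (ID, hour) pair for every hour the patient was present.
rows = [
    (r.ID, hour)
    for r in df.itertuples()
    for hour in pd.date_range(r.ARRIVAL_DATE_TIME.floor("h"),
                              r.DISPOSITION_DATE, freq="h")
]
long = pd.DataFrame(rows, columns=["ID", "DATE_HOUR_IN_ED"])

# Number of distinct patients present in each date/hour.
census = long.groupby("DATE_HOUR_IN_ED")["ID"].nunique()
print(census)
```

Grouping by `long["DATE_HOUR_IN_ED"].dt.hour` instead would give the count per hour of day, regardless of date.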
Try this:
import pandas as pd
import numpy as np
df = pd.read_excel('test.xls')
df1 = (df.set_index(['ID'])
.assign(DATE_HOUR_IN_ED=lambda x: [pd.date_range(s,d, freq='H')
for s,d in zip(x.ARRIVAL_DATE_TIME, x.DISPOSITION_DATE)])
['DATE_HOUR_IN_ED'].explode()
.reset_index()
)
df1['DATE_HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.floor('H')
df1['HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')
print(df1)
Output:
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00
I have this df:
df = pd.DataFrame({"on": [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]},
index=pd.date_range(start = "2020-04-09 6:45", periods = 30, freq = '8H'))
and want to create a weekly profile for the column df['on'].
I can insert the weekdays and times as follows:
df['day_name'] = df.index.day_name()
df['time'] = df.index.time
and get this df:
on day_name time
2020-04-09 06:45:00 1 Thursday 06:45:00
2020-04-09 14:45:00 1 Thursday 14:45:00
2020-04-09 22:45:00 1 Thursday 22:45:00
2020-04-10 06:45:00 1 Friday 06:45:00
2020-04-10 14:45:00 1 Friday 14:45:00
2020-04-10 22:45:00 0 Friday 22:45:00
2020-04-11 06:45:00 0 Saturday 06:45:00
2020-04-11 14:45:00 0 Saturday 14:45:00
2020-04-11 22:45:00 1 Saturday 22:45:00
2020-04-12 06:45:00 0 Sunday 06:45:00
2020-04-12 14:45:00 0 Sunday 14:45:00
2020-04-12 22:45:00 1 Sunday 22:45:00
2020-04-13 06:45:00 1 Monday 06:45:00
2020-04-13 14:45:00 0 Monday 14:45:00
2020-04-13 22:45:00 0 Monday 22:45:00
2020-04-14 06:45:00 0 Tuesday 06:45:00
2020-04-14 14:45:00 0 Tuesday 14:45:00
2020-04-14 22:45:00 1 Tuesday 22:45:00
2020-04-15 06:45:00 0 Wednesday 06:45:00
2020-04-15 14:45:00 1 Wednesday 14:45:00
2020-04-15 22:45:00 1 Wednesday 22:45:00
2020-04-16 06:45:00 0 Thursday 06:45:00
2020-04-16 14:45:00 0 Thursday 14:45:00
2020-04-16 22:45:00 0 Thursday 22:45:00
2020-04-17 06:45:00 1 Friday 06:45:00
2020-04-17 14:45:00 1 Friday 14:45:00
2020-04-17 22:45:00 1 Friday 22:45:00
2020-04-18 06:45:00 0 Saturday 06:45:00
2020-04-18 14:45:00 0 Saturday 14:45:00
2020-04-18 22:45:00 0 Saturday 22:45:00
Can someone help me get the probability that, for a certain time (e.g. Tuesday, 22:45), the column df['on'] == 1? Ideally this would be computed as a profile over the whole week...
(In this example the probability for Thursday, 22:45 is: 1/2)
Thanks a lot :)
I considered two options in your question:
1. Ratio per week day:
I calculated the ratio (probability) that the column 'on'==1 for a given day:
#Ratio per weekday
df_2=pd.DataFrame()
df_2['Ones']=df[df['on']==1]['day_name'].value_counts()
df_2['All']=df['day_name'].value_counts()
df_2['Ratio']=df_2['Ones']/df_2['All']
df_2
Here's the output:
Ones All Ratio
Friday 5 6 0.833333
Thursday 3 6 0.500000
Wednesday 2 3 0.666667
Monday 1 3 0.333333
Saturday 1 6 0.166667
Sunday 1 3 0.333333
Tuesday 1 3 0.333333
2. Ratio per day per time:
Here I calculated the probability that on day "x" at time "y", the column "on" would be 1:
#Ratio per week day and time
df_3 = df.groupby(['day_name', 'time']).agg({'on': 'count'})
df_3['ones'] = df.groupby(['day_name', 'time']).agg({'on': 'sum'})
df_3['Ratio'] = df_3['ones']/df_3['on']
df_3
Here's the output:
                     on  ones  Ratio
day_name  time
Friday    06:45:00    2     2    1.0
          14:45:00    2     2    1.0
          22:45:00    2     1    0.5
Monday    06:45:00    1     1    1.0
          14:45:00    1     0    0.0
          22:45:00    1     0    0.0
Saturday  06:45:00    2     0    0.0
          14:45:00    2     0    0.0
          22:45:00    2     1    0.5
Sunday    06:45:00    1     0    0.0
          14:45:00    1     0    0.0
          22:45:00    1     1    1.0
Thursday  06:45:00    2     1    0.5
          14:45:00    2     1    0.5
          22:45:00    2     1    0.5
Tuesday   06:45:00    1     0    0.0
          14:45:00    1     0    0.0
          22:45:00    1     1    1.0
Wednesday 06:45:00    1     0    0.0
          14:45:00    1     1    1.0
          22:45:00    1     1    1.0
Probability Plotting
To answer your request, I had to modify the code above: I sorted the index levels into weekday order, merged them into a single index, and converted that index to strings to avoid some problems when plotting. Here's the new code:
#Ratio per week day and time
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_3 = df.groupby(['day_name', 'time']).agg({'on': 'count'})
df_3['ones'] = df.groupby(['day_name', 'time']).agg({'on': 'sum'})
df_3['Ratio'] = df_3['ones']/df_3['on']
df_3 = df_3.reindex(days, level=0)
df_3.index = [str(i) for i in (df_3.index.map('{0[0]} : {0[1]}'.format))]
df_3
Now that we have made these modifications, we can easily plot the Ratio:
#Graph for probabilities
import matplotlib.pyplot as plt
plt.figure()
plt.plot(df_3.index, df_3['Ratio'])
plt.xlabel('Date')
plt.xticks(rotation=90)
plt.title('Probability of "on"=1')
Here's the graph:
I believe you need mean only:
df1 = df.groupby('day_name', sort=False, as_index=False)['on'].mean()
print (df1)
day_name on
0 Thursday 0.500000
1 Friday 0.833333
2 Saturday 0.166667
3 Sunday 0.333333
4 Monday 0.333333
5 Tuesday 0.333333
6 Wednesday 0.666667
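The same mean-based idea extends to the per-day, per-time probability from the question. Below is a sketch that rebuilds the question's frame and lays the result out as a weekly profile with one row per weekday and one column per time. The `days` list and the names `profile` and `p_thursday_late` are assumptions of this sketch, and the frequency is written "8h" because recent pandas versions prefer the lowercase spelling.

```python
from datetime import time

import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    {"on": [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1,
            1, 0, 0, 0, 1, 1, 1, 0, 0, 0]},
    index=pd.date_range(start="2020-04-09 6:45", periods=30, freq="8h"),
)
df["day_name"] = df.index.day_name()
df["time"] = df.index.time

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# The mean of a 0/1 column is the share of ones, i.e. the empirical probability.
profile = (
    df.groupby(["day_name", "time"])["on"].mean()
      .unstack("time")   # times become columns
      .reindex(days)     # weekdays in calendar order
)

p_thursday_late = profile.loc["Thursday", time(22, 45)]
print(profile)
print(p_thursday_late)  # 0.5, matching the 1/2 stated in the question
```

Keeping the result as a day-by-time table makes single lookups easy; `profile.stack()` turns it back into one long Series if a line plot over the week is wanted.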
I have a dataframe with a datetime column called start time, which is set to a default of 12:00:00 AM. I would like to reset this column so that the first row is 00:01:00 and the second row is 00:02:00, that is, at one-minute intervals.
This is the original table.
ID State Time End Time
A001 12:00:00 12:00:00
A002 12:00:00 12:00:00
A003 12:00:00 12:00:00
A004 12:00:00 12:00:00
A005 12:00:00 12:00:00
A006 12:00:00 12:00:00
A007 12:00:00 12:00:00
I want to reset the start time column so that my output is this:
ID State Time End Time
A001 0:00:00 12:00:00
A002 0:00:01 12:00:00
A003 0:00:02 12:00:00
A004 0:00:03 12:00:00
A005 0:00:04 12:00:00
A006 0:00:05 12:00:00
A007 0:00:06 12:00:00
How do I go about this?
You could use pd.date_range:
df['Start Time'] = pd.date_range('00:00', periods=df['Start Time'].shape[0], freq='1min')
gives you
df
Out[23]:
Start Time
0 2019-09-30 00:00:00
1 2019-09-30 00:01:00
2 2019-09-30 00:02:00
3 2019-09-30 00:03:00
4 2019-09-30 00:04:00
5 2019-09-30 00:05:00
6 2019-09-30 00:06:00
7 2019-09-30 00:07:00
8 2019-09-30 00:08:00
9 2019-09-30 00:09:00
Supply a full date/time string to get another starting date.
First we convert your State Time column to datetime type. Then we use pd.date_range and use the first time as starting point with a frequency of 1 minute.
df['State Time'] = pd.to_datetime(df['State Time'])
df['State Time'] = pd.date_range(start=df['State Time'].min(),
periods=len(df),
freq='min').time
Output
ID State Time End Time
0 A001 12:00:00 12:00:00
1 A002 12:01:00 12:00:00
2 A003 12:02:00 12:00:00
3 A004 12:03:00 12:00:00
4 A005 12:04:00 12:00:00
5 A006 12:05:00 12:00:00
6 A007 12:06:00 12:00:00
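If the column should start at midnight and carry no date at all, one option is to build minute offsets and keep only the time part. This is only a sketch: the frame below is a hypothetical stand-in with just an ID column, and the anchor date 2000-01-01 is arbitrary because it is discarded by `.time`.

```python
import pandas as pd

# Hypothetical stand-in for the question's table (IDs A001..A007).
df = pd.DataFrame({"ID": [f"A{i:03d}" for i in range(1, 8)]})

# 0, 1, 2, ... minutes after midnight, one offset per row.
offsets = pd.to_timedelta(list(range(len(df))), unit="min")

# Add the offsets to an arbitrary midnight and keep only the time of day.
df["Start Time"] = (pd.Timestamp("2000-01-01") + offsets).time
print(df)
```

Changing `unit="min"` to `unit="s"` would give the one-second steps shown in the question's expected output instead.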
I have a dataframe as shown below
df1 = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-03 12:59:00',
               '2173-04-03 13:14:00', '2173-04-03 13:37:00', '2173-04-04 11:30:00',
               '2173-04-05 16:00:00', '2173-04-05 22:00:00', '2173-04-06 04:00:00',
               '2173-04-06 04:30:00', '2173-04-06 08:00:00']
})
I would like to create another column called tdiff to calculate the time difference
This is what I tried
df1['time_1'] = pd.to_datetime(df1['time_1'])
df1['time_2'] = df1['time_1'].shift(-1)
df1['tdiff'] = (df1['time_2'] - df1['time_1']).dt.total_seconds() / 3600
But this produces the output shown below. As you can see, it subtracts from the next date. Instead, I would like to restrict the time difference to the same day. Ex: if Jan 15th 20:00:00 is the last record for that day, then I expect the tdiff to be 4:00:00 (24:00:00 - 20:00:00)
I understand this happens because I am shifting the time values before subtracting, so the highlighted rows pick up records from the next date. Is there a way to avoid this and calculate the time difference only between records on the same day?
I expect my output to be like this. Here NaN should be replaced using the end of the current date (23:59:00). If you check the differences, you will get the idea
Is there an existing method or pandas function that can do this date-wise timedelta? How can I shift the values date-wise?
IIUC, you can use:
s=pd.to_timedelta(24,unit='h')-(df1.time_1-df1.time_1.dt.normalize())
df1['tdiff']=df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
#df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds()/3600
subject_id time_1 tdiff
0 1 2173-04-03 12:35:00 00:15:00
1 1 2173-04-03 12:50:00 00:09:00
2 1 2173-04-03 12:59:00 00:15:00
3 1 2173-04-03 13:14:00 00:23:00
4 1 2173-04-03 13:37:00 10:23:00
5 1 2173-04-04 11:30:00 12:30:00
6 1 2173-04-05 16:00:00 06:00:00
7 1 2173-04-05 22:00:00 02:00:00
8 1 2173-04-06 04:00:00 00:30:00
9 1 2173-04-06 04:30:00 03:30:00
10 1 2173-04-06 08:00:00 16:00:00
You could use Series.where and Series.dt.ceil to decide whether to subtract from time_2 or from the midnight that follows time_1:
sameDayOrMidnight = df1.time_2.where(df1.time_1.dt.date==df1.time_2.dt.date, df1.time_1.dt.ceil(freq='1d'))
df1['tdiff'] = (sameDayOrMidnight - df1.time_1).dt.total_seconds() / 3600
result:
subject_id time_1 time_2 tdiff
0 1 2173-04-03 12:35:00 2173-04-03 12:50:00 0.250000
1 1 2173-04-03 12:50:00 2173-04-03 12:59:00 0.150000
2 1 2173-04-03 12:59:00 2173-04-03 13:14:00 0.250000
3 1 2173-04-03 13:14:00 2173-04-03 13:37:00 0.383333
4 1 2173-04-03 13:37:00 2173-04-04 11:30:00 10.383333
5 1 2173-04-04 11:30:00 2173-04-05 16:00:00 12.500000
6 1 2173-04-05 16:00:00 2173-04-05 22:00:00 6.000000
7 1 2173-04-05 22:00:00 2173-04-06 04:00:00 2.000000
8 1 2173-04-06 04:00:00 2173-04-06 04:30:00 0.500000
9 1 2173-04-06 04:30:00 2173-04-06 08:00:00 3.500000
10 1 2173-04-06 08:00:00 NaT 16.000000
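Another way to express "next record on the same day, otherwise the following midnight" is to shift within each day. The sketch below uses a shortened sample of the question's timestamps; `day`, `next_in_day` and `end` are illustrative names, not from either answer. Because it adds one day to the normalized date instead of calling dt.ceil, a record stamped exactly 00:00:00 gets 24 hours rather than 0.

```python
import pandas as pd

# Shortened sample of the question's data.
df1 = pd.DataFrame({
    "subject_id": [1, 1, 1, 1, 1],
    "time_1": pd.to_datetime([
        "2173-04-03 13:14:00", "2173-04-03 13:37:00",
        "2173-04-04 11:30:00",
        "2173-04-05 16:00:00", "2173-04-05 22:00:00",
    ]),
})

# Midnight at the start of each record's day, used as the grouping key.
day = df1["time_1"].dt.normalize()

# Next timestamp within the same day; NaT for the last record of a day.
next_in_day = df1.groupby(day)["time_1"].shift(-1)

# Last record of a day: measure up to the following midnight instead.
end = next_in_day.fillna(day + pd.Timedelta(days=1))

df1["tdiff"] = (end - df1["time_1"]).dt.total_seconds() / 3600
print(df1)
```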
I am working on a dataframe in pandas with four columns of user_id, time_stamp1, time_stamp2, and interval. Time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up the interval values for each user_id in the dataframe, and I tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works: the value of duration is NaT after running the code.
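UPDATE: you can use the transform() method. It returns one value per original row, whereas a plain groupby sum is indexed by user_id, which is probably why assigning it back to df produced NaT: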
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
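To make the alignment issue concrete, here is a small sketch with made-up data (three rows, user ids 10 and 20, none of which come from the question). The names `per_user` and `misaligned` are illustrative only. Assigning a Series to a column aligns on the index, which the explicit reindex below imitates.

```python
import pandas as pd

# Made-up data: row index is 0..2, user ids are 10 and 20.
df = pd.DataFrame({
    "user_id": [10, 10, 20],
    "interval": pd.to_timedelta(["1 days", "2 days", "5 days"]),
})

# Aggregated result: one value per user, indexed by user_id (10, 20).
per_user = df.groupby("user_id")["interval"].sum()

# Column assignment aligns on the index; labels 0..2 are not in per_user,
# so every row comes out as NaT.
misaligned = per_user.reindex(df.index)

# transform broadcasts the group total back to every original row.
df["duration"] = df.groupby("user_id")["interval"].transform("sum")

# Alternative: look the total up by user_id.
df["duration_map"] = df["user_id"].map(per_user)
print(df)
```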
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...