I have this df:
df = pd.DataFrame({"on": [1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]},
index=pd.date_range(start = "2020-04-09 6:45", periods = 30, freq = '8H'))
and want to create a weekly profile for the column df['on'].
I can insert the weekdays and times as follows:
df['day_name'] = df.index.day_name()
df['time'] = df.index.time
and get this df:
on day_name time
2020-04-09 06:45:00 1 Thursday 06:45:00
2020-04-09 14:45:00 1 Thursday 14:45:00
2020-04-09 22:45:00 1 Thursday 22:45:00
2020-04-10 06:45:00 1 Friday 06:45:00
2020-04-10 14:45:00 1 Friday 14:45:00
2020-04-10 22:45:00 0 Friday 22:45:00
2020-04-11 06:45:00 0 Saturday 06:45:00
2020-04-11 14:45:00 0 Saturday 14:45:00
2020-04-11 22:45:00 1 Saturday 22:45:00
2020-04-12 06:45:00 0 Sunday 06:45:00
2020-04-12 14:45:00 0 Sunday 14:45:00
2020-04-12 22:45:00 1 Sunday 22:45:00
2020-04-13 06:45:00 1 Monday 06:45:00
2020-04-13 14:45:00 0 Monday 14:45:00
2020-04-13 22:45:00 0 Monday 22:45:00
2020-04-14 06:45:00 0 Tuesday 06:45:00
2020-04-14 14:45:00 0 Tuesday 14:45:00
2020-04-14 22:45:00 1 Tuesday 22:45:00
2020-04-15 06:45:00 0 Wednesday 06:45:00
2020-04-15 14:45:00 1 Wednesday 14:45:00
2020-04-15 22:45:00 1 Wednesday 22:45:00
2020-04-16 06:45:00 0 Thursday 06:45:00
2020-04-16 14:45:00 0 Thursday 14:45:00
2020-04-16 22:45:00 0 Thursday 22:45:00
2020-04-17 06:45:00 1 Friday 06:45:00
2020-04-17 14:45:00 1 Friday 14:45:00
2020-04-17 22:45:00 1 Friday 22:45:00
2020-04-18 06:45:00 0 Saturday 06:45:00
2020-04-18 14:45:00 0 Saturday 14:45:00
2020-04-18 22:45:00 0 Saturday 22:45:00
Can someone help me figure out how to get the probability that at a certain time (e.g. Tuesday, 22:45) the column df['on'] == 1? Ideally this would be a profile over the whole week...
(In this example the probability for Thursday, 22:45 is 1/2.)
Thanks a lot :)
I considered two possible readings of your question:
1. Ratio per weekday:
Here I calculated the ratio (probability) that 'on' == 1 on a given weekday:
# Ratio per weekday
df_2=pd.DataFrame()
df_2['Ones']=df[df['on']==1]['day_name'].value_counts()
df_2['All']=df['day_name'].value_counts()
df_2['Ratio']=df_2['Ones']/df_2['All']
df_2
Here's the output:
Ones All Ratio
Friday 5 6 0.833333
Thursday 3 6 0.500000
Wednesday 2 3 0.666667
Monday 1 3 0.333333
Saturday 1 6 0.166667
Sunday 1 3 0.333333
Tuesday 1 3 0.333333
2. Ratio per weekday and time:
Here I calculated the probability that on day "x" at time "y" the column "on" equals 1:
# Ratio per week day and time
df_3 = df.groupby(['day_name', 'time']).agg({'on': 'count'})
df_3['ones'] = df.groupby(['day_name', 'time']).agg({'on': 'sum'})
df_3['Ratio'] = df_3['ones']/df_3['on']
df_3
Here's the output:
                     on  ones  Ratio
day_name  time
Friday    06:45:00    2     2    1.0
          14:45:00    2     2    1.0
          22:45:00    2     1    0.5
Monday    06:45:00    1     1    1.0
          14:45:00    1     0    0.0
          22:45:00    1     0    0.0
Saturday  06:45:00    2     0    0.0
          14:45:00    2     0    0.0
          22:45:00    2     1    0.5
Sunday    06:45:00    1     0    0.0
          14:45:00    1     0    0.0
          22:45:00    1     1    1.0
Thursday  06:45:00    2     1    0.5
          14:45:00    2     1    0.5
          22:45:00    2     1    0.5
Tuesday   06:45:00    1     0    0.0
          14:45:00    1     0    0.0
          22:45:00    1     1    1.0
Wednesday 06:45:00    1     0    0.0
          14:45:00    1     1    1.0
          22:45:00    1     1    1.0
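As a side note, the two groupby calls above can be merged into a single agg; a minimal equivalent sketch, assuming the same df with its day_name and time helper columns:
df_3 = df.groupby(['day_name', 'time'])['on'].agg(['count', 'sum'])
df_3['Ratio'] = df_3['sum'] / df_3['count']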
Probability Plotting
To answer your request, I modified the code above: I sorted the index into weekday order, flattened the MultiIndex into a single level, and converted it to strings to avoid problems when plotting. Here's the new code:
#Ratio per week day and time
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df_3 = df.groupby(['day_name', 'time']).agg({'on': 'count'})
df_3['ones'] = df.groupby(['day_name', 'time']).agg({'on': 'sum'})
df_3['Ratio'] = df_3['ones']/df_3['on']
df_3 = df_3.reindex(days, level=0)
df_3.index = [str(i) for i in (df_3.index.map('{0[0]} : {0[1]}'.format))]
df_3
With the previous modifications in place, we can easily plot the Ratio:
# Graph of the probabilities
import matplotlib.pyplot as plt
plt.figure()
plt.plot(df_3.index, df_3['Ratio'])
plt.xlabel('Weekday : time')
plt.ylabel('Ratio')
plt.xticks(rotation=90)
plt.title('Probability of "on" = 1')
plt.show()
Here's the graph: [line plot of Ratio over the ordered weekday/time index]
I believe you need mean only:
df1 = df.groupby('day_name', sort=False, as_index=False)['on'].mean()
print (df1)
day_name on
0 Thursday 0.500000
1 Friday 0.833333
2 Saturday 0.166667
3 Sunday 0.333333
4 Monday 0.333333
5 Tuesday 0.333333
6 Wednesday 0.666667
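If you also need the probability per weekday and time, the same idea extends to grouping by both helper columns from the question; a minimal sketch (the column name probability is just illustrative):
df2 = df.groupby(['day_name', 'time'])['on'].mean().reset_index(name='probability')
print(df2)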
Related
I have this df:
Index Dates
0 2017-01-01 23:30:00
1 2017-01-12 22:30:00
2 2017-01-20 13:35:00
3 2017-01-21 14:25:00
4 2017-01-28 22:30:00
5 2017-08-01 13:00:00
6 2017-09-26 09:39:00
7 2017-10-08 06:40:00
8 2017-10-04 07:30:00
9 2017-12-13 07:40:00
10 2017-12-31 14:55:00
The goal was to create a new df from the rows whose time falls between 05:00 and 11:59, labelled "morning". To achieve this I converted those hours to booleans:
hour_morning=(pd.to_datetime(df['Dates']).dt.strftime('%H:%M:%S').between('05:00:00','11:59:00'))
and then built a list of "morning" strings from them:
text_morning=[str('morning') for x in hour_morning if x==True]
The problem is in the last line: the list only ever contains 'morning' values, as if the x ignored the if condition. Why is this happening and how do I fix it?
Do
text_morning=[str('morning') if x==True else 'not_morning' for x in hour_morning ]
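The reason for the original behaviour: a trailing if in a list comprehension filters elements out entirely, while an if/else placed before the for maps every element. A small demonstration:
flags = [True, False, True]
print(['morning' for x in flags if x])                     # ['morning', 'morning'] - False items dropped
print(['morning' if x else 'not_morning' for x in flags])  # ['morning', 'not_morning', 'morning']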
You can also use np.where:
import numpy as np

text_morning = np.where(hour_morning, 'morning', 'not morning')
Given:
Dates values
0 2017-01-01 23:30:00 0
1 2017-01-12 22:30:00 1
2 2017-01-20 13:35:00 2
3 2017-01-21 14:25:00 3
4 2017-01-28 22:30:00 4
5 2017-08-01 13:00:00 5
6 2017-09-26 09:39:00 6
7 2017-10-08 06:40:00 7
8 2017-10-04 07:30:00 8
9 2017-12-13 07:40:00 9
10 2017-12-31 14:55:00 10
Doing:
# df.Dates = pd.to_datetime(df.Dates)
df = df.set_index("Dates")
Now we can use pd.DataFrame.between_time:
new_df = df.between_time('05:00:00','11:59:00')
print(new_df)
Output:
values
Dates
2017-09-26 09:39:00 6
2017-10-08 06:40:00 7
2017-10-04 07:30:00 8
2017-12-13 07:40:00 9
Or use it to update the original dataframe:
df.loc[df.between_time('05:00:00','11:59:00').index, 'morning'] = 'morning'
# Output:
values morning
Dates
2017-01-01 23:30:00 0 NaN
2017-01-12 22:30:00 1 NaN
2017-01-20 13:35:00 2 NaN
2017-01-21 14:25:00 3 NaN
2017-01-28 22:30:00 4 NaN
2017-08-01 13:00:00 5 NaN
2017-09-26 09:39:00 6 morning
2017-10-08 06:40:00 7 morning
2017-10-04 07:30:00 8 morning
2017-12-13 07:40:00 9 morning
2017-12-31 14:55:00 10 NaN
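If you'd rather not leave NaN in the new column, a fillna afterwards labels the remaining rows; a small follow-up sketch ('not morning' is just an illustrative label):
df['morning'] = df['morning'].fillna('not morning')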
I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
Given this DataFrame, I want a third column consecutive_hour such that whenever the value at a timestamp is below 1000 it counts for 3 hours, with consecutive such occurrences accumulating to 6, 9, ... as shown above (the expected column is included in the example).
Lastly, I want to summarize the table by counting the occurrences of each maximal consecutive-hours run, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in: How to groupby consecutive values in pandas DataFrame
and more; I spent several days on it but no luck yet.
I would highly appreciate help with this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
# Running count of consecutive True values, reset to 0 at each False
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)

# Flag values below 1000 (NaN compares False), restart the count each day,
# then multiply by 3 to express the count in hours
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table:
# Keep only the last value of each run, then count how often each length occurs
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
                        .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
                .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
                .rename("number_of_day") \
                .rename_axis("consecutive_hour") \
                .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
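For clarity, the mask (h - h.shift(-1).fillna(0)) > 0 keeps only the last value of each consecutive run, i.e. the positions where the counter is about to drop. A minimal illustration on a plain Series:
import pandas as pd

h = pd.Series([3, 0, 0, 3, 6, 9, 0, 3])
mask = (h - h.shift(-1).fillna(0)) > 0
print(h[mask].tolist())  # [3, 9, 3] - the maximum of each run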
I have a dataframe like as shown below
df1 = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,1,1,1,1],
    'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-03 12:59:00',
               '2173-04-03 13:14:00','2173-04-03 13:37:00','2173-04-04 11:30:00',
               '2173-04-05 16:00:00','2173-04-05 22:00:00','2173-04-06 04:00:00',
               '2173-04-06 04:30:00','2173-04-06 08:00:00']
})
I would like to create another column called tdiff holding the time difference between consecutive rows.
This is what I tried
df1['time_1'] = pd.to_datetime(df1['time_1'])
df1['time_2'] = df1['time_1'].shift(-1)
df1['tdiff'] = (df1['time_2'] - df1['time_1']).dt.total_seconds() / 3600
But this produces output like that shown below. As you can see, it subtracts from the next date. Instead I would like to restrict the time difference to the same day. E.g. if Jan 15th 20:00:00 is the last record for that day, then I expect the tdiff to be 4:00:00 (24:00:00 - 20:00:00).
I understand this is happening because I am shifting the time values to subtract, so obviously the highlighted rows pick up records from the next date. But is there a way to avoid this and calculate the time difference only between records of the same day?
I expect my output to be like this. Here NaN should be replaced using the end of the current day (23:59:00); if you check the difference, you will get the idea.
Is there an existing method or pandas function that can do this datewise timedelta? How can I shift the values datewise?
IIUC, you can use:
# Time remaining until midnight for each timestamp
s = pd.to_timedelta(24, unit='h') - (df1.time_1 - df1.time_1.dt.normalize())
# Diff within each calendar day; the last record of each day falls back to s
df1['tdiff'] = df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
# As float hours:
# df1.groupby(df1.time_1.dt.date).time_1.diff().shift(-1).fillna(s).dt.total_seconds()/3600
subject_id time_1 tdiff
0 1 2173-04-03 12:35:00 00:15:00
1 1 2173-04-03 12:50:00 00:09:00
2 1 2173-04-03 12:59:00 00:15:00
3 1 2173-04-03 13:14:00 00:23:00
4 1 2173-04-03 13:37:00 10:23:00
5 1 2173-04-04 11:30:00 12:30:00
6 1 2173-04-05 16:00:00 06:00:00
7 1 2173-04-05 22:00:00 02:00:00
8 1 2173-04-06 04:00:00 00:30:00
9 1 2173-04-06 04:30:00 03:30:00
10 1 2173-04-06 08:00:00 16:00:00
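If you want the result as float hours in its own column (as the commented line above suggests), Series.dt.total_seconds does the conversion; tdiff_hours is just an illustrative name:
df1['tdiff_hours'] = df1['tdiff'].dt.total_seconds() / 3600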
You could use Series.where and dt.ceil to decide whether to subtract time_2 or the next midnight of time_1:
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date==df.time_2.dt.date, df.time_1.dt.ceil(freq='1d'))
df['tdiff'] = (sameDayOrMidnight - df.time_1).dt.total_seconds() / 3600
result:
subject_id time_1 time_2 tdiff
0 1 2173-04-03 12:35:00 2173-04-03 12:50:00 0.250000
1 1 2173-04-03 12:50:00 2173-04-03 12:59:00 0.150000
2 1 2173-04-03 12:59:00 2173-04-03 13:14:00 0.250000
3 1 2173-04-03 13:14:00 2173-04-03 13:37:00 0.383333
4 1 2173-04-03 13:37:00 2173-04-04 11:30:00 10.383333
5 1 2173-04-04 11:30:00 2173-04-05 16:00:00 12.500000
6 1 2173-04-05 16:00:00 2173-04-05 22:00:00 6.000000
7 1 2173-04-05 22:00:00 2173-04-06 04:00:00 2.000000
8 1 2173-04-06 04:00:00 2173-04-06 04:30:00 0.500000
9 1 2173-04-06 04:30:00 2173-04-06 08:00:00 3.500000
10 1 2173-04-06 08:00:00 NaT 16.000000
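One edge case worth noting: for a timestamp exactly on midnight, dt.ceil(freq='1d') returns the same timestamp, so the fallback difference would be 0 instead of 24 hours. If that matters for your data, computing the next midnight explicitly avoids it; a sketch, not part of the answer above:
nextMidnight = df.time_1.dt.normalize() + pd.Timedelta(days=1)
sameDayOrMidnight = df.time_2.where(df.time_1.dt.date == df.time_2.dt.date, nextMidnight)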
I have df data sorted like below:
day_name Day_id
time
2019-05-20 19:00:00 Monday 0
2018-12-31 15:00:00 Monday 0
2019-02-25 17:00:00 Monday 0
2019-05-06 20:00:00 Monday 0
2019-03-12 12:00:00 Tuesday 1
2019-04-16 15:00:00 Tuesday 1
2019-04-02 18:00:00 Tuesday 1
2019-02-05 09:00:00 Tuesday 1
2019-05-28 21:00:00 Tuesday 1
2019-01-15 12:00:00 Tuesday 1
2019-06-04 20:00:00 Tuesday 1
2018-12-04 07:00:00 Tuesday 1
2019-01-22 11:00:00 Tuesday 1
2019-01-09 07:00:00 Wednesday 2
2019-03-06 16:00:00 Wednesday 2
2019-06-19 17:00:00 Wednesday 2
2019-04-10 20:00:00 Wednesday 2
2019-04-24 15:00:00 Wednesday 2
2019-01-31 08:00:00 Thursday 3
2019-01-03 08:00:00 Thursday 3
2019-02-28 19:00:00 Thursday 3
2019-05-23 20:00:00 Thursday 3
2018-12-20 07:00:00 Thursday 3
2019-05-09 19:00:00 Thursday 3
2019-06-28 15:00:00 Friday 4
2019-03-22 12:00:00 Friday 4
2019-03-29 14:00:00 Friday 4
2018-12-15 08:00:00 Saturday 5
2019-02-17 11:00:00 Sunday 6
2019-06-16 19:00:00 Sunday 6
2018-12-02 08:00:00 Sunday 6
Currently, with help from this post:
df = df.groupby(df.day_name).count().plot(kind="bar")
plt.show()
my output is: [bar chart with weekdays in groupby's default alphabetical order]
How can I plot the histogram with the days of the week in proper order, i.e. Monday, Tuesday, ...?
I have found several approaches (1, 2, 3) to solve this, but can't find a way to apply them in my case.
Thank you all for your hard work.
You need sort=False in groupby:
m = df.groupby(df.day_name,sort=False).count().plot(kind="bar")
plt.show()
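An alternative that works even when the input is not pre-sorted is to make day_name an ordered categorical, so groupby's default sorting follows weekday order; a sketch, assuming matplotlib.pyplot is imported as plt as in the question:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['day_name'] = pd.Categorical(df['day_name'], categories=days, ordered=True)
df.groupby('day_name').count().plot(kind='bar')
plt.show()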
I am working on a pandas DataFrame with four columns: user_id, time_stamp1, time_stamp2, and interval. time_stamp1 and time_stamp2 are of type datetime64[ns] and interval is of type timedelta64[ns].
I want to sum up the interval values for each user_id in the DataFrame, and I tried to calculate it in several ways:
1) df["duration"] = df.groupby('user_id')['interval'].apply(lambda x: x.sum())
2) df["duration"] = df.groupby('user_id').aggregate(np.sum)
3) df["duration"] = df.groupby('user_id').agg(np.sum)
but none of them works; the duration values come out as NaT after running the code.
UPDATE: you can use the transform() method:
In [291]: df['duration'] = df.groupby('user_id')['interval'].transform('sum')
In [292]: df
Out[292]:
a user_id b interval duration
0 2016-01-01 00:00:00 0.01 2015-11-11 00:00:00 51 days 00:00:00 838 days 08:00:00
1 2016-03-10 10:39:00 0.01 2015-12-08 18:39:00 NaT 838 days 08:00:00
2 2016-05-18 21:18:00 0.01 2016-01-05 13:18:00 134 days 08:00:00 838 days 08:00:00
3 2016-07-27 07:57:00 0.01 2016-02-02 07:57:00 176 days 00:00:00 838 days 08:00:00
4 2016-10-04 18:36:00 0.01 2016-03-01 02:36:00 217 days 16:00:00 838 days 08:00:00
5 2016-12-13 05:15:00 0.01 2016-03-28 21:15:00 259 days 08:00:00 838 days 08:00:00
6 2017-02-20 15:54:00 0.02 2016-04-25 15:54:00 301 days 00:00:00 1454 days 00:00:00
7 2017-05-01 02:33:00 0.02 2016-05-23 10:33:00 342 days 16:00:00 1454 days 00:00:00
8 2017-07-09 13:12:00 0.02 2016-06-20 05:12:00 384 days 08:00:00 1454 days 00:00:00
9 2017-09-16 23:51:00 0.02 2016-07-17 23:51:00 426 days 00:00:00 1454 days 00:00:00
OLD answer:
Demo:
In [260]: df
Out[260]:
a b interval user_id
0 2016-01-01 00:00:00 2015-11-11 00:00:00 51 days 00:00:00 1
1 2016-03-10 10:39:00 2015-12-08 18:39:00 NaT 1
2 2016-05-18 21:18:00 2016-01-05 13:18:00 134 days 08:00:00 1
3 2016-07-27 07:57:00 2016-02-02 07:57:00 176 days 00:00:00 1
4 2016-10-04 18:36:00 2016-03-01 02:36:00 217 days 16:00:00 1
5 2016-12-13 05:15:00 2016-03-28 21:15:00 259 days 08:00:00 1
6 2017-02-20 15:54:00 2016-04-25 15:54:00 301 days 00:00:00 2
7 2017-05-01 02:33:00 2016-05-23 10:33:00 342 days 16:00:00 2
8 2017-07-09 13:12:00 2016-06-20 05:12:00 384 days 08:00:00 2
9 2017-09-16 23:51:00 2016-07-17 23:51:00 426 days 00:00:00 2
In [261]: df.dtypes
Out[261]:
a datetime64[ns]
b datetime64[ns]
interval timedelta64[ns]
user_id int64
dtype: object
In [262]: df.groupby('user_id')['interval'].sum()
Out[262]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [263]: df.groupby('user_id')['interval'].apply(lambda x: x.sum())
Out[263]:
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
Name: interval, dtype: timedelta64[ns]
In [264]: df.groupby('user_id').agg(np.sum)
Out[264]:
interval
user_id
1 838 days 08:00:00
2 1454 days 00:00:00
So check your data...
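In particular, if interval is actually of object dtype (e.g. strings such as '51 days 00:00:00'), the groupby sum will not produce sensible timedeltas; converting first should help. A hedged suggestion based on the symptoms described:
print(df.dtypes)  # interval should be timedelta64[ns]
df['interval'] = pd.to_timedelta(df['interval'], errors='coerce')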