How to plot data per weekday and hour in Python

I want to plot my time series of bike counts by showing the mean number of bikes per hour and weekday.
Here is an extract of the initial data:
date nb_bike
2019-09-20 12:00:00 15
2019-09-20 13:00:00 10
2019-09-20 14:00:00 17
2019-09-20 15:00:00 12
2019-09-20 16:00:00 24
I computed the mean per weekday and hour this way:
data_b = data_b_init.groupby([data_b_init.index.weekday.rename('wkday'),
                              data_b_init.index.hour.rename('hour')]).mean()
data_b = data_b.reset_index()
Now I want to plot these data (extract below):
data_b
wkday hour nb_bike_mean
0 0 0.44
0 1 0.11
0 2 0.00
0 3 0.11
0 4 0.00
0 5 0.67
0 6 0.78
0 7 6.44
0 8 13.83
0 9 9.78
I would like to do something like in "How to plot data per hour, grouped by days?" (especially the graph shown there), but I can't find how to do it while keeping only the weekday and hour information, not the individual days.
For example, I tried this code:
sns.lineplot(x='hour', y='nb_bike_mean', data=data_b, hue='wkday')
but it isn't what I want, because I want both wkday and hour on the x-axis.
Do you know a way to plot with two levels on the x-axis?
Or better, to have wkday and hour recognized as a datetime and joined as the index?
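One possible approach (a sketch of my own, not from the original thread): encode weekday and hour as a single "hour of week" value, so both levels appear along one x-axis, and label the ticks with weekday names.

import matplotlib.pyplot as plt
import seaborn as sns

# Sketch: combine wkday and hour into one ordinal x value (0-167).
data_b['week_hour'] = data_b['wkday'] * 24 + data_b['hour']

fig, ax = plt.subplots(figsize=(12, 4))
sns.lineplot(x='week_hour', y='nb_bike_mean', data=data_b, ax=ax)

# Put a tick at the start of each day, labelled with the weekday name.
ax.set_xticks([d * 24 for d in range(7)])
ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()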

Related

int64 to HHMM string

I have the following data frame where the column hour shows hours of the day in int64 form. I'm trying to convert that into a time format; so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a date time format to begin with. Google searches so far have been unproductive.
While I am still in the dark concerning the format of your date column, I will assume the Date column is a string object and the Hr column is an int64. To create the TimeStamp column in pandas Timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='h'), axis=1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
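As a side note (my own addition, not part of the original answer): the row-wise apply works but is slow on large frames; the same conversion can be vectorized.

import pandas as pd

# Vectorized sketch of the same conversion: parse the whole Date column
# at once, then add the hours as timedeltas.
df['TimeStamp'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Hr'], unit='h')
df = df.set_index('TimeStamp')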

Grouping on Weekday histogram

I have a df in the format:
date number category
2014-02-02 17:00:00 4 red
2014-02-03 17:00:00 5 red
2014-02-04 17:00:00 4 blue
2014-02-05 17:00:00 4 blue
2014-02-06 17:00:00 4 red
2014-02-07 17:00:00 4 red
2014-02-08 17:00:00 4 blue
...
How do I group on day of the week and take the total of 'number' for that day, so I'd have a df of 7 items (Monday, Tuesday, etc.) with the total of 'number' on each day? With this I want to make a histogram with number on the y-axis and day of the week on the x-axis.
After reading your question again, I understand why @Quang Hoang answered the way he did. I'm not sure whether that's what you wanted, or whether it's the below:
import pandas as pd
import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['day'] = df['date'].dt.day_name()
counts = df.groupby('day')['number'].sum()  # note: the column is 'number', lowercase
plt.bar(counts.index, counts)
plt.show()
You can use dt.day_name() to extract the day name, then use pd.crosstab to count the number:
pd.crosstab(df['date'].dt.day_name(),df['number'])
Output:
number 4 5
date
Friday 1 0
Monday 0 1
Saturday 1 0
Sunday 1 0
Thursday 1 0
Tuesday 1 0
Wednesday 1 0
And to plot a histogram, you can chain the above with .plot.bar():
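pd.crosstab(df['date'].dt.day_name(), df['number']).plot.bar()  # the chained call implied above (sketch)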

Pandas GroupBy with CumulativeSum per Contiguous Groups

I'm looking to understand the number of times we are in an 'Abnormal State' before we have an 'Event'. My objective is to modify my dataframe to get the following output, where every time we reach an 'Event', the 'Abnormal State Grouping' resets and counts from 0.
We can go through a number of 'Abnormal States' before we reach an 'Event', which is deemed a failure. (i.e. The lightbulb is switched on and off for several periods before it finally shorts out resulting in an event).
I've written the following code to get my AbnormalStateGroupings to increment into relevant groupings for my analysis which has worked fine. However, we want to 'reset' the count of our 'AbnormalStates' after each event (i.e. lightbulb failure):
dataframe['AbnormalStateGrouping'] = (dataframe['AbnormalState'] != dataframe['AbnormalState'].shift()).cumsum()
I have created an additional column which lets me know which 'event' we are at via:
dataframe['Event_Or_Not'].cumsum() #I have a boolean representation of the Event Column represented and we use .cumsum() to get the relevant groupings (i.e. 1st Event, 2nd Event, 3rd Event etc.)
I've come close previously using the following:
eventOrNot = dataframe['Event'].eq(0)
eventMask = (eventOrNot.ne(eventOrNot.shift()) & eventOrNot).cumsum()
dataframe['AbnormalStatePerEvent'] = dataframe.groupby(['Event', eventMask]).cumcount().add(1)
However, this hasn't given me the desired output that I'm after (as per below).
I think I'm close, however. Could anyone please advise what I could try next so that for each lightbulb failure, the abnormal state count resets and starts counting the number of abnormal states we have gone through before the next lightbulb failure?
Desired state with AbnormalStateGrouping (screenshot in the original post): when an 'Event' is detected, the Abnormal State count resets to 1 and then starts counting again.
Current state of the dataframe (screenshot in the original post).
Please find an attached data source below:
https://filebin.net/ctjwk7p3gulmbgkn
I assume that your source DataFrame has only Date/Time (either string
or datetime), Event (string) and AbnormalState (int) columns.
To compute your grouping column, run:
dataframe['AbnormalStateGrouping'] = dataframe.groupby(
    dataframe['Event'][::-1].notnull().cumsum()).AbnormalState\
    .apply(lambda grp: (grp != grp.shift()).cumsum())
The result, for your initial source data (included as a picture), is:
Date/Time Event AbnormalState AbnormalStateGrouping
0 2018-01-01 01:00 NaN 0 1
1 2018-01-01 02:00 NaN 0 1
2 2018-01-01 03:00 NaN 1 2
3 2018-01-01 04:00 NaN 1 2
4 2018-01-01 05:00 NaN 0 3
5 2018-01-01 06:00 NaN 0 3
6 2018-01-01 07:00 NaN 0 3
7 2018-01-01 08:00 NaN 1 4
8 2018-01-01 09:00 NaN 1 4
9 2018-01-01 10:00 NaN 0 5
10 2018-01-01 11:00 NaN 0 5
11 2018-01-01 12:00 NaN 0 5
12 2018-01-01 13:00 NaN 1 6
13 2018-01-01 14:00 NaN 1 6
14 2018-01-01 15:00 NaN 0 7
15 2018-01-01 16:00 Event 0 7
16 2018-01-01 17:00 NaN 1 1
17 2018-01-01 18:00 NaN 1 1
18 2018-01-01 19:00 NaN 0 2
19 2018-01-01 20:00 NaN 0 2
Note the way of grouping:
dataframe['Event'][::-1].notnull().cumsum()
Due to [::-1], the cumsum is computed from the last row to the first.
Thus:
rows with hours 01:00 through 16:00 are in group 1,
the remaining rows (hours 17:00 through 20:00) are in group 0.
Then, separately for each group, a lambda function is applied to
AbnormalState, so each cumulative count starts again from 1 within each
group (i.e. after each Event).
Edit following the comment as of 22:18:12Z
The reason why I compute the cumsum for grouping in reversed order
is that when you run it in normal order:
dataframe['Event'].notnull().cumsum()
then:
rows with index 0 through 14 (before the Event row) have this sum == 0,
the row with index 15 and the following rows have this sum == 1.
Try yourself both versions, without and with [::-1].
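For instance, a minimal sketch (my own illustration) on a toy Event column:

import numpy as np
import pandas as pd

event = pd.Series([np.nan, np.nan, np.nan, 'Event', np.nan, np.nan])

# Normal order: the Event row opens a new group together with the rows after it.
print(event.notnull().cumsum().tolist())              # [0, 0, 0, 1, 1, 1]
# Reversed order: the Event row closes the group of the rows before it.
print(event[::-1].notnull().cumsum()[::-1].tolist())  # [1, 1, 1, 1, 0, 0]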
The result in normal order (without [::-1]) is that:
the Event row is in the same group as the following rows,
so the reset occurs on that very row.
To check the whole result, run my code without [::-1] and you will see
that the ending part of the result contains:
Date/Time Event AbnormalState AbnormalStateGrouping
14 2018-01-01 15:00:00 NaN 0 7
15 2018-01-01 16:00:00 Event 0 1
16 2018-01-01 17:00:00 NaN 1 2
17 2018-01-01 18:00:00 NaN 1 2
18 2018-01-01 19:00:00 NaN 0 3
19 2018-01-01 20:00:00 NaN 0 3
so the Event row has AbnormalStateGrouping == 1.
But you want this row to keep the AbnormalStateGrouping of the preceding
sequence (7 in this case), with the reset occurring from the next row on.
So the Event row should be in the same group as the preceding rows, which
is exactly what my code produces.

Summing a time-series column upon itself with a condition

I'm currently working with some data that I receive from an engineering plant; the data comes out (roughly) as the following:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': np.random.randint(1, 25, size=5),
                   'on/off': np.random.randint(0, 2, size=5),
                   'Time': pd.date_range(start='01/01/2019', periods=5, freq='5s')})
print(df)
ID on/off Time
0 17 0 2019-01-01 00:00:00
1 21 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 1 2019-01-01 00:00:15
4 12 0 2019-01-01 00:00:20
the 0 and 1 in the on/off column correspond to when a machine is on or off (0 = on, 1 = off)
currently, I use the following line of beautiful code to get the difference between rows as the data rolls in:
df['Time Difference'] = df['Time'] - df['Time'].shift()
print(df)
ID on/off Time Time Difference
0 17 0 2019-01-01 00:00:00 NaT
1 21 0 2019-01-01 00:00:05 00:00:05
2 12 1 2019-01-01 00:00:10 00:00:05
3 12 1 2019-01-01 00:00:15 00:00:05
4 12 0 2019-01-01 00:00:20 00:00:05
Now, as this dataframe is quite large (each week I'll receive about 150k rows), what would be the best way to sum the amount of time a machine is off (where df['on/off'] == 1) until the next 0 comes along? So in the above example, for the 1st of January 2019, the machine with ID 12 didn't run for 15 seconds until it resumed at 00:00:20.
Here's an approach that works for a simple example of one machine that varies between on and off during the course of one day. It works regardless of whether the machine is in on or off state in the first row.
df = pd.DataFrame({'ID': [12, 12, 12, 12, 12],
                   'on/off': [0, 0, 1, 0, 1],
                   'Time': ['2019-01-01 00:00:00', '2019-01-01 00:00:05',
                            '2019-01-01 00:00:10', '2019-01-01 00:00:15',
                            '2019-01-01 00:00:20']})
ID on/off Time
0 12 0 2019-01-01 00:00:00
1 12 0 2019-01-01 00:00:05
2 12 1 2019-01-01 00:00:10
3 12 0 2019-01-01 00:00:15
4 12 1 2019-01-01 00:00:20
First I made sure the Time column dtype is datetime64:
df['Time'] = pd.to_datetime(df['Time'])
Then I get the indices of all rows where the state changed (either from off to on, or from on to off):
s = df[df['on/off'].shift(1) != df['on/off']].index
df = df.loc[s]
Then I create a column called time shift, which shows the timestamp of the most recent row where the power state changed:
df['time shift'] = df['Time'].shift(1)
At this point the dataframe looks like this:
ID on/off Time time shift
0 12 0 2019-01-01 00:00:00 NaT
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
3 12 0 2019-01-01 00:00:15 2019-01-01 00:00:10
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
Now, since we want to count the duration that the machine was off, I look at only the row indices where the state became on:
r = df[df['on/off'] == 1].index
df = df.loc[r]
At this point, the dataframe looks as it does below. Notice that the time shift column displays the point at which the machine most recently turned off, prior to the time shown in the Time column, which is the timestamp when the machine turned back on. The difference between these two columns gives the length of each period that the machine was off during the day:
ID on/off Time time shift
2 12 1 2019-01-01 00:00:10 2019-01-01 00:00:00
4 12 1 2019-01-01 00:00:20 2019-01-01 00:00:15
The following line calculates total off-time, by summing the durations of each period that the machine was in its off state:
(df['Time'] - df['time shift']).sum()
Which outputs:
Timedelta('0 days 00:00:15')
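For the multi-machine case from the question (a sketch of my own, assuming the original df with ID, on/off and Time columns, with Time already converted to datetime), the same logic could be wrapped in a groupby on ID:

def off_time(machine):
    # Same steps as above, applied to one machine's rows.
    machine = machine.sort_values('Time')
    changed = machine[machine['on/off'].shift(1) != machine['on/off']].copy()
    changed['time shift'] = changed['Time'].shift(1)
    back_on = changed[changed['on/off'] == 1]
    return (back_on['Time'] - back_on['time shift']).sum()

per_machine = df.groupby('ID').apply(off_time)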
Some additional context on how the Pandas .shift() method works:
Shift takes all the rows in a column and moves them either forward or back by a certain amount. .shift(1) tells pandas to move the index of each row forward, or up, by 1; .shift(-1) tells pandas to move the index of each row back, or down, by 1. Put another way, .shift(1) lets you look at the value of a column at the previous row index, and .shift(-1) lets you look at the value at the next row index, relative to a given row. It's a handy way to compare a column's values across different rows without resorting to for-loops.
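A tiny illustration of that behavior:

import pandas as pd

s = pd.Series([10, 20, 30])
print(s.shift(1).tolist())   # [nan, 10.0, 20.0] - each row sees the previous value
print(s.shift(-1).tolist())  # [20.0, 30.0, nan] - each row sees the next value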

Combine different tables with time in minutes columns

I have the following data (just a snippet). They start at 0 min and end at 65 min.
R.Time (min) Intensity 215 Intensity 260 Intensity 280
0 0.00000 0 0 0
1 0.01067 0 0 0
2 0.02133 0 0 0
3 0.03200 0 0 0
and
Time %B c B c KCl
0 16.01 0.00 0.0000 0.00
1 16.01 0.00 0.0000 0.00
2 17.00 0.85 0.0085 4.25
3 18.00 1.70 0.0170 8.50
How can I create a dataframe with the time [min] column and all other columns at the correct row for that time? So I need to tell pandas the time column and how to merge, and then it sorts the rows? But I also need to combine rows when the time is the same.
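One possible approach (a sketch, not from the original thread; the frame names df1 and df2 and the renamed column are my assumptions): align the two tables on their time columns with a nearest-match merge, then combine rows that share the same time.

import pandas as pd

# Assumed names: df1 holds the intensity table, df2 the %B/KCl table.
df1 = df1.rename(columns={'R.Time (min)': 'Time'}).sort_values('Time')
df2 = df2.sort_values('Time')

# Nearest-time alignment of the two tables.
combined = pd.merge_asof(df1, df2, on='Time', direction='nearest')

# Combine rows that end up with exactly the same time, e.g. by averaging.
combined = combined.groupby('Time', as_index=False).mean()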
