Grouping on Weekday histogram - python

I have a df in the format:
date number category
2014-02-02 17:00:00 4 red
2014-02-03 17:00:00 5 red
2014-02-04 17:00:00 4 blue
2014-02-05 17:00:00 4 blue
2014-02-06 17:00:00 4 red
2014-02-07 17:00:00 4 red
2014-02-08 17:00:00 4 blue
...
How do I group by day of the week and take a total of 'number' for each day, so I'd have a df of 7 items (Monday, Tuesday, etc.) with the total of 'number' on that day? With this I want to make a histogram with number on the y-axis and day of the week on the x-axis.

After reading your question again, I understand why @Quang Hoang answered the way he did. I'm not sure whether that's what you wanted, or whether it's the below:
import pandas as pd
import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
df['day'] = df['date'].apply(lambda x: x.day_name())
counts = df.groupby('day')['number'].sum()  # the column is 'number', not 'Number'
plt.bar(counts.index, counts)
plt.show()
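Note that groupby sorts the day names alphabetically, so the bars come out in Friday, Monday, ... order. To get them in calendar order you can reindex before plotting (a small sketch, assuming English day names):
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
counts = counts.reindex(order)  # put the bars in Monday..Sunday order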

You can use dt.day_name() to extract the day name, then use pd.crosstab to count the number:
pd.crosstab(df['date'].dt.day_name(), df['number'])
Output:
number 4 5
date
Friday 1 0
Monday 0 1
Saturday 1 0
Sunday 1 0
Thursday 1 0
Tuesday 1 0
Wednesday 1 0
And to plot a histogram, you can chain the above with .plot.bar():
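For example (a minimal sketch; strictly speaking this draws a bar chart of the counts rather than a true histogram):
import matplotlib.pyplot as plt

pd.crosstab(df['date'].dt.day_name(), df['number']).plot.bar()
plt.show()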

Related

int64 to HHMM string

I have the following data frame, where the column hr shows hours of the day as int64. I'm trying to convert that into a time format, so that hour 1 would show up as '01:00'. I then want to add this to the date column and convert it into a timestamp index.
Using the datetime function in pandas resulted in the column "hr2", which is not what I need. I'm not sure I can even apply datetime directly, as the original data (i.e. in column "hr") is not really a datetime format to begin with. Google searches so far have been unproductive.
While I am still in the dark concerning the format of your date column, I will assume the Date column is a string object and the Hr column is an int64 object. To create the column TimeStamp in pandas timestamp format, this is how I would proceed:
Given df:
Date Hr
0 12/01/2010 1
1 12/01/2010 2
2 12/01/2010 3
3 12/01/2010 4
4 12/02/2010 1
5 12/02/2010 2
6 12/02/2010 3
7 12/02/2010 4
df['TimeStamp'] = df.apply(lambda row: pd.to_datetime(row['Date']) + pd.to_timedelta(row['Hr'], unit='H'), axis = 1)
yields:
Date Hr TimeStamp
0 12/01/2010 1 2010-12-01 01:00:00
1 12/01/2010 2 2010-12-01 02:00:00
2 12/01/2010 3 2010-12-01 03:00:00
3 12/01/2010 4 2010-12-01 04:00:00
4 12/02/2010 1 2010-12-02 01:00:00
5 12/02/2010 2 2010-12-02 02:00:00
6 12/02/2010 3 2010-12-02 03:00:00
7 12/02/2010 4 2010-12-02 04:00:00
The timestamp column can then be used as your index.
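As an aside, the row-wise apply works but gets slow on large frames; the same result can be obtained with vectorized operations (a sketch, assuming the same Date and Hr columns):
df['TimeStamp'] = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Hr'], unit='h')
df = df.set_index('TimeStamp')  # use the new column as the index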

How to Get Day of Week as Integer with First of Month Changing Values?

I am using .weekday() to find the day of the week as an integer (Monday = 0 ... Sunday = 6) for every day from today until next year (+365 days from today). The problem is that if the 1st of the month starts mid-week, I need to return the day of the week with the 1st day of the month now being 0.
Ex. If the month starts Wednesday then Wednesday = 0... Sunday = 4 (for that week only).
[Annotated picture of a month explaining what I want to do]
I originally had the code below, but it is wrong: the first branch always runs, because .weekday() is always less than 7.
import datetime
from datetime import date

for day in range(1, 365):
    departure_date = date.today() + datetime.timedelta(days=day)
    if departure_date.weekday() < 7:
        day_of_week = departure_date.day
    else:
        day_of_week = departure_date.weekday()
The following seems to do the job properly:
import datetime as dt

def custom_weekday(date):
    if date.weekday() > (date.day - 1):
        return date.day - 1
    else:
        return date.weekday()

for day in range(1, 366):
    departure_date = dt.date.today() + dt.timedelta(days=day)
    day_of_week = custom_weekday(date=departure_date)
    print(departure_date, day_of_week, departure_date.weekday())
Your code had two small bugs:
- the if condition was wrong
- days are represented inconsistently: date.weekday() is 0-based, date.day is 1-based
For every date, get the first week of that month. Then, check if the date is within that first week. If it is, use the .day - 1 value (since you are 0-based). Otherwise, use the .weekday().
from datetime import date, timedelta

for day in range(-5, 40):
    departure_date = date.today() + timedelta(days=day)
    first_week = date(departure_date.year, departure_date.month, 1).isocalendar()[1]
    if first_week == departure_date.isocalendar()[1]:
        day_of_week = departure_date.day - 1
    else:
        day_of_week = departure_date.weekday()
    print(departure_date, day_of_week)
2021-08-27 4
2021-08-28 5
2021-08-29 6
2021-08-30 0
2021-08-31 1
2021-09-01 0
2021-09-02 1
2021-09-03 2
2021-09-04 3
2021-09-05 4
2021-09-06 0
2021-09-07 1
2021-09-08 2
2021-09-09 3
2021-09-10 4
2021-09-11 5
2021-09-12 6
2021-09-13 0
2021-09-14 1
2021-09-15 2
2021-09-16 3
2021-09-17 4
2021-09-18 5
2021-09-19 6
2021-09-20 0
2021-09-21 1
2021-09-22 2
2021-09-23 3
2021-09-24 4
2021-09-25 5
2021-09-26 6
2021-09-27 0
2021-09-28 1
2021-09-29 2
2021-09-30 3
2021-10-01 0
2021-10-02 1
2021-10-03 2
2021-10-04 0
2021-10-05 1
2021-10-06 2
2021-10-07 3
2021-10-08 4
2021-10-09 5
2021-10-10 6
For any date D.M.Y, get the weekday W of 1.M.Y.
Then you need to adjust weekday value only for the first 7-W days of that month. To adjust, simply subtract the value W.
Example for September 2021: the first day of the month (1.9.2021) is a Wednesday, so W is 2. You need to adjust the weekdays for the dates 1.9.2021 to 5.9.2021 (because 7-2 is 5) by subtracting 2.
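A minimal sketch of that approach (my own illustration of the idea above, with a hypothetical helper name):
import datetime as dt

def adjusted_weekday(d):
    w = d.replace(day=1).weekday()  # weekday W of the 1st of d's month
    if d.day <= 7 - w:              # d falls in the first 7-W days of the month
        return d.weekday() - w
    return d.weekday()

print(adjusted_weekday(dt.date(2021, 9, 1)))  # 0: a Wednesday, but it is the 1st
print(adjusted_weekday(dt.date(2021, 9, 6)))  # 0: the first full week starts here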

How can I join columns by DatetimeIndex, matching day, month and hour from data from different years?

I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar 6-column dataframe with a DatetimeIndex belonging to 2019.
You can create three additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
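From there, the join itself could look like this (a sketch with hypothetical frame names: meteo is the frame built above, power stands in for a consumption frame from an earlier year):
meteo = df  # the frame from above, already carrying hour/day/month
power = pd.DataFrame({'consumption': range(10)},
                     index=ind - pd.DateOffset(years=2))  # same dates, two years earlier
power['hour'] = power.index.hour
power['day'] = power.index.day
power['month'] = power.index.month

joined = meteo.merge(power, on=['month', 'day', 'hour'], how='left')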

Python Pandas group datetimes by hour and count row

This is my transaction dataframe, where each row means a transaction:
date station
30/10/2017 15:20 A
30/10/2017 15:45 A
31/10/2017 07:10 A
31/10/2017 07:25 B
31/10/2017 07:55 B
I need to group the date into hourly intervals and count the transactions per station, so the end result will be:
date hour station count
30/10/2017 16:00 A 2
31/10/2017 08:00 A 1
31/10/2017 08:00 B 2
Here the first row means: from 15:00 to 16:00 on 30/10/2017, there are 2 transactions at station A.
How to do this in Pandas?
I tried this code, but the result is wrong:
df_start_tmp = df_trip[['Start Date', 'Start Station']]
times = pd.DatetimeIndex(df_start_tmp['Start Date'])
df_start = df_start_tmp.groupby([times.hour, df_start_tmp['Start Station']]).count()
Thanks a lot for the help
IIUC, you can use size together with pd.Grouper:
df['date'] = pd.to_datetime(df['date'])
df.groupby([pd.Grouper(key='date', freq='H'), df['station']]).size().reset_index(name='count')
Out[235]:
date station count
0 2017-10-30 15:00:00 A 2
1 2017-10-31 07:00:00 A 1
2 2017-10-31 07:00:00 B 2
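One note: pd.Grouper labels each hourly bin by its left edge (15:00 for the interval 15:00-16:00). If you want the end-of-interval labels shown in the question (16:00), you can pass label='right':
df.groupby([pd.Grouper(key='date', freq='H', label='right'), df['station']]).size().reset_index(name='count')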

Python pandas groupby manipulation?

An example of what the df looks like:
customer order_datetime
a 01-03-2017 12:00:00 PM
b 01-04-2017 12:00:00 PM
c 01-07-2017 12:00:00 PM
a 01-08-2017 12:00:00 PM
b 01-09-2017 12:00:00 PM
a 01-11-2017 12:00:00 PM
There are 2 things I want to achieve, but I'm still in the learning process; I'd really appreciate any help to guide me in the right direction.
1. Create a list of "time between orders" where I can find the min, mean, max
2. Find out if "time between orders" gets faster/slower, i.e. time between order_3 and order_2 vs time between order_2 and order_1
This example should set you in the right direction for your assignment.
First I'm creating a DataFrame similar to the one you show in the question:
import pandas as pd
import numpy as np
import datetime as dt

orders = pd.DataFrame({
    'client': np.random.randint(65, 70, size=15),
    'date': np.random.randint(0, 30, size=15)})
orders.client = orders.client.apply(chr)
orders.date = orders.date.apply(
    pd.to_datetime, unit='d', origin=dt.date(2017, 1, 1))
# Sort by date within each client: the plain .diff() used below relies on it
orders.sort_values(['client', 'date'], inplace=True)
orders.reset_index(inplace=True, drop=True)
orders.head()
>>>>
client date
0 A 2017-01-27
1 A 2017-01-29
2 A 2017-01-30
3 B 2017-01-03
4 B 2017-01-13
The key to the solution is in the line orders.groupby('client').date.diff().
First we use groupby to group the orders using client as a key, then we select the date column only and use diff to compute the difference of each record with the previous one. This is why the dates in each group must be sorted (done above); with unsorted dates you would first sort within each group, e.g. with .apply(pd.Series.sort_values), before taking the diff.
The rest of the code is just to visualize the result, i.e. renaming the Series you obtain and concatenating it with the initial DataFrame.
diff_df = pd.concat([
    orders,
    orders.groupby('client').date.diff().rename('diff')], axis=1)
diff_df.head(10)
>>>>
client date diff
0 A 2017-01-27 NaT
1 A 2017-01-29 2 days
2 A 2017-01-30 1 days
3 B 2017-01-03 NaT
4 B 2017-01-13 10 days
5 B 2017-01-18 5 days
6 B 2017-01-24 6 days
7 C 2017-01-01 NaT
8 C 2017-01-02 1 days
9 C 2017-01-03 1 days
Once you have the time differences you can compute all kinds of in-group metrics you need.
First you can try pd.Series.describe:
diff_df.groupby('client')['diff'].describe()  # bracket access: .diff would collide with the GroupBy.diff method
>>>>
count mean std min \
client
A 1 5 days 00:00:00 NaT 5 days 00:00:00
B 1 12 days 00:00:00 NaT 12 days 00:00:00
C 3 4 days 00:00:00 1 days 17:34:09.189773 2 days 00:00:00
D 1 4 days 00:00:00 NaT 4 days 00:00:00
E 4 5 days 00:00:00 3 days 03:53:40.789838 2 days 00:00:00
25% 50% 75% max
client
A 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
B 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00
C 3 days 12:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
D 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00
E 2 days 18:00:00 4 days 12:00:00 6 days 18:00:00 9 days 00:00:00
If that is not enough you can define your own aggregations.
You will need a list of functions if you work on a single Series:
metrics = [pd.Series.count, pd.Series.nunique, pd.Series.min, pd.Series.max, pd.Series.mean]
diff_df.groupby('client')['diff'].aggregate(metrics)
>>>>
count nunique min max mean
client
A 1 1 5 days 5 days 5 days
B 1 1 12 days 12 days 12 days
C 3 2 2 days 5 days 4 days
D 1 1 4 days 4 days 4 days
E 4 4 2 days 9 days 5 days
Or a dictionary of {column -> function, column -> function_list} if you work on the whole DataFrame:
metrics = {
    'date': [pd.Series.count, pd.Series.nunique],
    'diff': [pd.Series.min, pd.Series.max, pd.Series.mean],
}
diff_df.groupby('client').aggregate(metrics)
>>>>
diff date
min max mean count nunique
client
A 5 days 5 days 5 days 2 2
B 12 days 12 days 12 days 2 2
C 2 days 5 days 4 days 4 4
D 4 days 4 days 4 days 2 2
E 2 days 9 days 5 days 5 5
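For the second point, whether the time between orders gets faster or slower, a minimal sketch on top of diff_df is to diff the gaps themselves within each client: a negative value means the gap shrank (orders speeding up), a positive value means it grew.
diff_df['trend'] = diff_df.groupby('client')['diff'].diff()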
