I have a dataframe with one column in datetime format and the other columns as integers and floats. I would like to group the dataframe by the weekday of the first column; the other columns would be summed.
print (df)
Day Butter Bread Coffee
2019-07-01 00:00:00 2 2 4
2019-07-01 00:00:00 1 2 1
2019-07-02 00:00:00 5 4 8
Basically the outcome would look something like:
print (df)
Day Butter Bread Coffee
Monday 3 4 5
Tuesday 5 4 8
I am flexible about whether it says exactly Monday, or MO, or 01 for the first day of the week, as long as it is visible which consumption was done on Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, and Sunday.
You should convert your "Day" column to datetime type; then you can extract the day of the week and aggregate over the rest of the columns:
import pandas as pd
df['Day'] = pd.to_datetime(df['Day'])
df.groupby(df['Day'].dt.day_name()).sum()
Try using .dt.day_name() with groupby() and sum():
df = pd.DataFrame(data={'day':['2019-07-01 00:00:00','2019-07-01 00:00:00','2019-07-02 00:00:00'],
'butter':[2,1,5],
'bread':[2,2,4],
'coffee':[4,1,8]})
df['day'] = pd.to_datetime(df['day']).dt.day_name()
df.groupby(['day'],as_index=False).sum()
day butter bread coffee
0 Monday 3 4 5
1 Tuesday 5 4 8
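Note that grouping by the day name sorts the result alphabetically (Friday would come before Monday). If you want Monday-to-Sunday order, one option is to reindex with an explicit weekday list; a sketch using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({'day': ['2019-07-01', '2019-07-01', '2019-07-02'],
                   'butter': [2, 1, 5],
                   'bread': [2, 2, 4],
                   'coffee': [4, 1, 8]})

# Group by weekday name, then reindex to force Monday-Sunday order
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
         'Friday', 'Saturday', 'Sunday']
out = (df.assign(day=pd.to_datetime(df['day']).dt.day_name())
         .groupby('day').sum()
         .reindex(order)
         .dropna(how='all'))  # drop weekdays absent from the data
```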
This is a tricky one for me, so bear with me. I'm creating a daily dataset that compiles units using timestamps.
day of the week month day hour week of year units
Monday January 3 16 1 1
Monday January 3 19 1 1
Tuesday January 4 21 1 1
Tuesday January 4 22 1 1
Wednesday January 5 23 1 1
Monday January 10 16 2 1
Monday January 10 19 2 1
Tuesday January 11 21 2 1
Tuesday January 11 22 2 1
Wednesday January 12 23 2 1
The various columns are created using Pandas' excellent time functions, and it is relatively trivial to create pivot plots based on a single column, such as day (date of the month), month, hour, or even day of the week (thanks to this excellent code sample, although lord knows where I found it on SO).
cats = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=cats, ordered=True)
df['day of the week'] = df['day of the week'].astype(cat_type)
As the dataset increases in size, what I'd like to be able to do is pivot on both week of year and day of the week.
units
week of year day of the week
1 Friday 15.2
Monday 22.8
2 Friday 19.0
3 Thursday 28.0
Unfortunately, when I perform pd.pivot_table using the numeric week of year and the categorical day of the week, I get the numeric column but lose the ordering of the categorical.
I'd also like to be able to visualise the units trend over time (by week as well as by day of the week).
My head says create a matrix plot by week, but that misses the day-of-the-week dimension.
Any ideas? I'm not necessarily looking for a complete solution, although I'll happily write this up if I fix it, as I can't see this being a unique problem.
Update: I have a solution in my head for how I'd solve this in Excel: select day_of_the_week as the row (and then sort), pick the numerical week_of_year as the column, aggregate units as necessary, and then plot.
With the sample data and code you provided, you could try this:
new_df = (
df.groupby(["week of year", "day of the week"]).sum().drop(columns=["day", "hour"])
)
new_df = new_df[new_df["units"] > 0]
So that:
print(new_df)
# Output
units
week of year day of the week
1 Monday 2
Tuesday 2
Wednesday 1
2 Monday 2
Tuesday 2
Wednesday 1
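To pivot this into the week × weekday matrix described in the question while keeping the categorical weekday order, one option is to group on both keys and unstack; a sketch with made-up sample data:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']
cat_type = CategoricalDtype(categories=cats, ordered=True)

df = pd.DataFrame({'day of the week': ['Monday', 'Tuesday', 'Wednesday',
                                       'Monday', 'Tuesday', 'Wednesday'],
                   'week of year': [1, 1, 1, 2, 2, 2],
                   'units': [2, 2, 1, 2, 2, 1]})
df['day of the week'] = df['day of the week'].astype(cat_type)

# observed=True drops weekdays with no data while keeping the
# remaining rows in categorical (Monday-first) order
matrix = (df.groupby(['day of the week', 'week of year'], observed=True)['units']
            .sum()
            .unstack('week of year', fill_value=0))
# matrix is now weekdays x weeks, ready for e.g. a heatmap or matrix.plot()
```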
I have a df with 2 columns, Total Idle Time and Month, as below:
Total Idle Time Month
0 0:00:00 December
1 0:02:24 December
2 26:00:00 December
3 0:53:05 December
4 28:03:39 December
Here the Total Idle Time column is in string format, but I want to convert it into a time format, as I want to add up the total idle time in the month of December.
I tried converting the column to datetime as below:
data['Total Idle Time '] = pd.to_datetime(data['Total Idle Time '], format='%H:%M:%S')
However, I got an error as follow:
time data '28:03:39' does not match format '%H:%M:%S' (match)
I thought of converting the column to int and adding the values up based on hours and minutes, but I was not successful in doing so. Is there any way to do this?
You could try using pd.to_timedelta() instead here:
>>> df['Total Idle Time'] = pd.to_timedelta(df['Total Idle Time'])
>>> df
Total Idle Time Month
0 0 days 00:00:00 December
1 0 days 00:02:24 December
2 1 days 02:00:00 December
3 0 days 00:53:05 December
4 1 days 04:03:39 December
You can use this to convert to numeric if you want, by scaling the results of .total_seconds():
# in hours
>>> df['Total Idle Time'] = df['Total Idle Time'].dt.total_seconds() / 3600
>>> df
Total Idle Time Month
0 0.000000 December
1 0.040000 December
2 26.000000 December
3 0.884722 December
4 28.060833 December
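To actually add up the idle time for December (the original goal), the timedeltas can be summed directly; a sketch using the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'Total Idle Time': ['0:00:00', '0:02:24', '26:00:00',
                                       '0:53:05', '28:03:39'],
                   'Month': ['December'] * 5})

# Timedeltas handle hour values above 24 and sum natively
df['Total Idle Time'] = pd.to_timedelta(df['Total Idle Time'])
totals = df.groupby('Month')['Total Idle Time'].sum()
# As a plain number of hours per month:
total_hours = totals.dt.total_seconds() / 3600
```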
I am trying to create 2 columns based of a column that contains numerical values.
Value
0
4
10
24
null
49
Expected Output:
Value Day Hour
0 Sunday 12:00am
4 Sunday 4:00am
10 Sunday 10:00am
24 Monday 12:00am
null No Day No Time
49 Tuesday 1:00am
Continued.....
Code I am trying out:
value = df.value.unique()
Sunday_Starting_Point = pd.to_datetime('Sunday 2015')
(Sunday_Starting_Point + pd.to_timedelta(Value, 'h')).dt.strftime('%A %I:%M%P')
Thanks for looking!
I think the unique values are not necessary; you can use dt.strftime twice for the 2 columns, with replace for the NaT values:
Sunday_Starting_Point = pd.to_datetime('Sunday 2015')
x = pd.to_numeric(df.Value, errors='coerce')
s = Sunday_Starting_Point + pd.to_timedelta(x, unit='h')
df['Day'] = s.dt.strftime('%A').replace('NaT','No Day')
df['Hour'] = s.dt.strftime('%I:%M%p').replace('NaT','No Time')
print (df)
Value Day Hour
0 0.0 Sunday 12:00AM
1 4.0 Sunday 04:00AM
2 10.0 Sunday 10:00AM
3 24.0 Monday 12:00AM
4 NaN No Day No Time
5 49.0 Tuesday 01:00AM
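A self-contained variant of the same approach: parsing 'Sunday 2015' relies on dateutil's guessing, so this sketch pins an explicit Sunday date (2015-01-04, an assumption for illustration) instead, and uses where() rather than replace() since newer pandas returns NaN, not the string 'NaT', from strftime:

```python
import pandas as pd

df = pd.DataFrame({'Value': [0, 4, 10, 24, None, 49]})

# 2015-01-04 is a Sunday, chosen here just as an explicit week origin
start = pd.Timestamp('2015-01-04')
x = pd.to_numeric(df['Value'], errors='coerce')
s = start + pd.to_timedelta(x, unit='h')

# where() keeps valid rows and fills the NaT positions in one step
df['Day'] = s.dt.strftime('%A').where(s.notna(), 'No Day')
df['Hour'] = s.dt.strftime('%I:%M%p').where(s.notna(), 'No Time')
```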
I have a dataframe containing hourly data. I want to get the max for each week of the year, so I used resample to group the data by week:
weeks = data.resample("W").max()
The problem is that the week max is calculated starting from the first Monday of the year, while I want it calculated starting from the first day of the year.
I obtain the following result, where you can notice that there are 53 weeks, and the last week spills into the next year, although 2017 doesn't exist in the data:
Date dots
2016-01-03 0.647786
2016-01-10 0.917071
2016-01-17 0.667857
2016-01-24 0.669286
2016-01-31 0.645357
Date dots
2016-12-04 0.646786
2016-12-11 0.857714
2016-12-18 0.670000
2016-12-25 0.674571
2017-01-01 0.654571
Is there a way to calculate weeks for a pandas dataframe starting from the first day of the year?
Find the starting day of the year; for example, let's say it's a Friday. Then you can pass an anchoring suffix to resample. Note that the weekly anchor names the day each bin ends on, so for weeks starting on Friday you want bins that end on Thursday:
weeks = data.resample("W-THU").max()
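Rather than hard-coding the anchor, it can be derived from the data itself; a sketch assuming hourly data on a DatetimeIndex (resample's weekly bins are labelled by the day they end on, six days after the week's first day):

```python
import numpy as np
import pandas as pd

# Hourly data for 2016; 2016-01-01 falls on a Friday
idx = pd.date_range('2016-01-01', '2016-12-31 23:00', freq='h')
data = pd.DataFrame({'dots': np.random.rand(len(idx))}, index=idx)

# Anchor on the weekday six days after the first observation,
# so each weekly bin starts on the year's first weekday
anchor = (data.index[0] + pd.Timedelta(days=6)).day_name()[:3].upper()
weeks = data.resample(f"W-{anchor}").max()
```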
One quick remedy, given your data lies within one year, is to group it by day first and then take groups of 7 days:
new_df = (df.resample("D", on='Date').dots
          .max().reset_index()
          )
new_df = new_df.groupby(new_df.index // 7).agg({'Date': 'min', 'dots': 'max'})
new_df.head()
Output:
Date dots
0 2016-01-01 0.996387
1 2016-01-08 0.999775
2 2016-01-15 0.997612
3 2016-01-22 0.979376
4 2016-01-29 0.998240
5 2016-02-05 0.995030
6 2016-02-12 0.987500
and tail:
Date dots
48 2016-12-02 0.999910
49 2016-12-09 0.992910
50 2016-12-16 0.996877
51 2016-12-23 0.992986
52 2016-12-30 0.960348
I have a pandas dataframe which contains items and the quantity bought on a certain date, e.g.:
date Item qty
2016-01-04 Rice 3
2016-01-04 Ball 3
2016-01-10 Rice 5
2016-02-02 Coffee 10
2016-02-06 Rice 3
..... ... ..
The data covers two years, from 2016 to May 2018.
I want to know how much of every item was sold per month, from Jan 2016 to May 2018, and plot a line graph for it (x axis: months, y axis: quantities of products).
For that I thought of creating a dataframe in this format:
Date Rice Coffee Ball
Jan 16 8 0 3
Feb 16 10 17 5
.... ... ... ...
May 18 11 9 12
How can I get the data into this format?
One option I thought of was:
df.groupby([df.date.dt.year.rename('year'),df.date.dt.month.rename('month')]).agg({'qty':np.sum}).reset_index()
But it is not working. Is there a better way to get the results in the above format, or a better way to store the results so that they are convenient to plot?
I think you want something like this:
import matplotlib.pyplot as plt

df = df.set_index('date')  # df.index.year/.month below need a DatetimeIndex
df = df.groupby([df.index.year, df.index.month, 'Item']).sum().unstack(fill_value=0)
df.columns = df.columns.droplevel()
df.plot(kind='bar')
plt.show()
Output:
Given
>>> df
date Item qty
0 2016-01-04 Rice 3
1 2016-01-04 Ball 3
2 2016-01-10 Rice 5
3 2016-02-02 Coffee 10
4 2016-02-06 Rice 3
with
>>> df.dtypes
date datetime64[ns]
Item object
qty int64
dtype: object
you can do
>>> from pandas.tseries.offsets import MonthEnd
>>> offset = MonthEnd()
>>>
>>> df.set_index('date').groupby([offset.rollforward, 'Item']).sum().unstack(fill_value=0)
qty
Item Ball Coffee Rice
2016-01-31 3 0 8
2016-02-29 0 10 3
I'd keep the index like this because there are usable dates in there. If you really must convert these to strings like 'Jan 16', you can do so with:
>>> result = df.set_index('date').groupby([offset.rollforward, 'Item']).sum().unstack(fill_value=0)
>>> result.index = result.index.map(lambda d: d.strftime('%b %y'))
>>> result
qty
Item Ball Coffee Rice
Jan 16 3 0 8
Feb 16 0 10 3
Use Series.dt.strftime for custom format of datetimes and aggregate sum:
df = df.groupby([df.date.dt.strftime('%b %y'), 'Item'])['qty'].sum().unstack(fill_value=0)
If order of datetimes is important use ordered categoricals:
df = df.sort_values('date')
dates = df.date.dt.strftime('%b %y')
dates = pd.Categorical(dates, ordered=True, categories=dates.unique())
df1 = df.groupby([dates, 'Item'])['qty'].sum().unstack(fill_value=0)
Or reindex:
df = df.sort_values('date')
dates = df.date.dt.strftime('%b %y')
df1 = df.groupby([dates, 'Item'])['qty'].sum().unstack(fill_value=0).reindex(dates.unique())
print (df1)
Item Ball Coffee Rice
Jan 16 3 0 8
Feb 16 0 10 3
Finally, plot with DataFrame.plot.bar:
df1.plot.bar()
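Since the original goal was a line graph of monthly quantities per item, any of the wide tables above can be plotted directly; a sketch using a monthly PeriodIndex (sample data taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2016-01-04', '2016-01-04',
                                           '2016-01-10', '2016-02-02',
                                           '2016-02-06']),
                   'Item': ['Rice', 'Ball', 'Rice', 'Coffee', 'Rice'],
                   'qty': [3, 3, 5, 10, 3]})

# A monthly PeriodIndex keeps chronological order without string labels
wide = (df.groupby([df['date'].dt.to_period('M'), 'Item'])['qty']
          .sum()
          .unstack(fill_value=0))
# wide.plot()  # line graph: x = months, one line per item
```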