I'm using pandas to do some calculations, and I need to sum values like count1 and count2 based on dates, with weekly or daily frequency and so on.
my df =
id count1 ... count2 date
0 1 1 ... 52 2019-12-09
1 2 1 ... 23 2019-12-10
2 3 1 ... 0 2019-12-11
3 4 1 ... 17 2019-12-18
4 5 1 ... 20 2019-12-20
5 6 1 ... 4 2019-12-21
6 7 1 ... 2 2019-12-21
How can I group by date with weekly frequency?
I tried many ways but got different errors.
Many thanks.
Use DataFrame.resample with weekly frequency 'W' and aggregate with sum:
#convert date column to datetimes
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W', on='date')[['count1', 'count2']].sum()
Or use Grouper:
df1 = df.groupby(pd.Grouper(freq='W', key='date'))[['count1', 'count2']].sum()
print (df1)
count1 count2
date
2019-12-15 3 75
2019-12-22 4 43
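The same pattern works for the other frequencies the question mentions, e.g. daily (a minimal sketch reusing the columns above):
df_daily = df.resample('D', on='date')[['count1', 'count2']].sum()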
First, generate a week column so you can group by it:
#Series.dt.week is deprecated, use isocalendar() instead
df['week_id'] = df['date'].dt.isocalendar().week
Then group the dataframe and iterate through each group to do your per-week work:
grouped_df = df.groupby('week_id')
for index, sub_df in grouped_df:
    #sub_df holds the data for one week, e.g.:
    print(index, sub_df[['count1', 'count2']].sum())
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this and it works, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
new_df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
ids = new_df['id'].unique()  # assuming ids holds the unique ids
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find the Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id', week]).agg({'date': 'max', 'count': 'sum'}).reset_index()
  id level_1    date  count
0  1   35 22  9/3/22     11
1  1   36 22  9/9/22     28
2  2   35 22  9/3/22      7
3  2   36 22  9/9/22     13
Note that date is still a string here, so 'max' compares lexicographically; that is why week 36 shows 9/9/22 instead of 9/10/22. Convert the column with pd.to_datetime first if you need the true latest date.
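Alternatively, a fully vectorized sketch using pd.Grouper with a Saturday-anchored weekly frequency (this assumes date is parsed to datetime first):
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
'W-SAT' bins dates into weeks ending on Saturday, which matches the Sunday-to-Saturday week definition in the question.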
I tried to understand as much as I can :)
Here is my process:
from io import StringIO
import pandas as pd

# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
And what I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                 date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00
I am looking for a way to identify the row that is the 'master' row. The way I am defining the master row is, for each group_id, the row that has the minimum cust_hierarchy; if there is a tie, the row with the most recent date.
I have supplied some sample tables below:
row_id  group_id  cust_hierarchy  most_recent_date  master (what I am looking for)
1       0         2               2020-01-03        1
2       0         7               2019-01-01        0
3       1         7               2019-05-01        0
4       1         6               2019-04-01        0
5       1         6               2019-04-03        1
I was thinking of possibly ordering by the two columns (cust_hierarchy ascending, most_recent_date descending) and then adding a new column that places a 1 on the first row for each group_id?
Does anyone have any helpful code for this?
You can basically do a groupby with idxmin(), with a little bit of sorting to ensure the most recent date is selected by the min operation:
import pandas as pd
import numpy as np

# example data
dates = ['2020-01-03', '2019-01-01', '2019-05-01',
         '2019-04-01', '2019-04-03']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'group_id': [0, 0, 1, 1, 1],
                   'cust_hierarchy': [2, 7, 7, 6, 6],
                   'most_recent_date': dates})

# solution: sort newest-first so idxmin lands on the most recent date in a tie
df = df.sort_values('most_recent_date', ascending=False)
idxs = df.groupby('group_id')['cust_hierarchy'].idxmin()
df['master'] = np.where(df.index.isin(idxs), True, False)
df = df.sort_index()
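As a side note, np.where(cond, True, False) is redundant here: df.index.isin(idxs) already returns a boolean array, so df['master'] = df.index.isin(idxs) is equivalent.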
df before:
group_id cust_hierarchy most_recent_date
0 0 2 2020-01-03
1 0 7 2019-01-01
2 1 7 2019-05-01
3 1 6 2019-04-01
4 1 6 2019-04-03
df after:
group_id cust_hierarchy most_recent_date master
0 0 2 2020-01-03 True
1 0 7 2019-01-01 False
2 1 7 2019-05-01 False
3 1 6 2019-04-01 False
4 1 6 2019-04-03 True
Use duplicated on sort_values:
df['master'] = 1 - (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                                   ascending=[False, True])
                      .duplicated('group_id', keep='last')
                      .astype(int)
                    )
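If you only need the master rows themselves rather than a flag column, a minimal sketch with drop_duplicates (same sort logic, just keeping the winner per group):
masters = (df.sort_values(['cust_hierarchy', 'most_recent_date'],
                          ascending=[True, False])
             .drop_duplicates('group_id', keep='first'))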
I have a data frame like this, and I need to find the missing weeks and the count of missing weeks between them.
year Data Id
20180406 57170 A
20180413 55150 A
20180420 51109 A
20180427 57170 A
20180504 55150 A
20180525 51109 A
The output should be like this.
Id  start-year  end-year  count
A 20180420 20180420 1
A 20180518 20180525 2
Use:
#convert to weekly periods ending on Thursday (weeks here run Friday to Thursday)
df['year'] = pd.to_datetime(df['year'], format='%Y%m%d').dt.to_period('W-Thu')
#resample weekly ('W-Thu') with asfreq so missing weeks appear as NaN
df1 = (df.set_index('year')
.groupby('Id')['Id']
.resample('W-Thu')
.asfreq()
.rename('val')
.reset_index())
print (df1)
Id year val
0 A 2018-04-06/2018-04-12 A
1 A 2018-04-13/2018-04-19 A
2 A 2018-04-20/2018-04-26 A
3 A 2018-04-27/2018-05-03 A
4 A 2018-05-04/2018-05-10 A
5 A 2018-05-11/2018-05-17 NaN
6 A 2018-05-18/2018-05-24 NaN
7 A 2018-05-25/2018-05-31 A
#converting periods to datetimes with their start dates
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-between-representations
df1['year'] = df1['year'].dt.to_timestamp('D', how='s')
print (df1)
Id year val
0 A 2018-04-06 A
1 A 2018-04-13 A
2 A 2018-04-20 A
3 A 2018-04-27 A
4 A 2018-05-04 A
5 A 2018-05-11 NaN
6 A 2018-05-18 NaN
7 A 2018-05-25 A
m = df1['val'].notnull().rename('g')
#cumulative sum creates a unique group id for each run of consecutive NaNs
df1.index = m.cumsum()
#filter only the NaN rows and aggregate first, last and size
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
       .agg(['first', 'last', 'size'])
       .reset_index(level=1, drop=True)
       .reset_index())
print (df2)
Id first last size
0 A 2018-05-11 2018-05-18 2
I have a df with columns date, employee, and event. 'event' has a value in [1, 3, 5] if someone exits, or in [0, 2, 4] if someone enters. 'employee' is a unique number for each employee. This is the head of df:
employee event registration date
0 4 1 1 2010-10-18 18:11:00
1 17 1 1 2010-10-18 18:15:00
2 6 0 1 2010-10-19 06:28:00
3 8 0 0 2010-10-19 07:04:00
4 15 0 1 2010-10-19 07:34:00
I filtered df down to one month's values (year and month are my variables):
df = df.where(df['date'].dt.year == year).dropna()
df = df.where(df['date'].dt.month== month).dropna()
I want to create a df which shows the total time at work for each employee.
Employees come in and go out on the same day, and they can do it several times a day.
It seems you need boolean indexing first, then a groupby where the time difference is computed with diff and summed:
year = 2010
month = 10
df = df[(df['date'].dt.year == year) & (df['date'].dt.month== month)]
A more general solution is to add year and month to the groupby:
df = df['date'].groupby([df['employee'],
                         df['event'],
                         df['date'].rename('year').dt.year,
                         df['date'].rename('month').dt.month]).apply(lambda x: x.diff().sum())
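For completeness, if the entry and exit rows strictly alternate per employee (an assumption about this data), one possible sketch of the entry-to-exit pairing:
df = df.sort_values(['employee', 'date'])
#time elapsed since the employee's previous event
df['delta'] = df.groupby('employee')['date'].diff()
#keep only exit rows, so each delta is an entry-to-exit span
work_time = df[df['event'].isin([1, 3, 5])].groupby('employee')['delta'].sum()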
I have the following time series dataset of the number of sales per day, as a pandas data frame:
date, sales
20161224,5
20161225,2
20161227,4
20161231,8
Now, if I have to fill in the missing data points (i.e. missing dates) with a constant value (zero) to make it look the following way, how can I do this efficiently (assuming the data frame is ~50MB) using pandas?
date, sales
20161224,5
20161225,2
20161226,0**
20161227,4
20161228,0**
20161229,0**
20161231,8
** Missing rows that have been added to the data frame.
Any help will be appreciated.
You can first cast column date with to_datetime, then set_index and reindex by the min and max values of the index, reset_index, and if necessary change the format with strftime:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
df = df.set_index('date')
df = (df.reindex(pd.date_range(df.index.min(), df.index.max()), fill_value=0)
        .reset_index()
        .rename(columns={'index':'date'}))
print (df)
date sales
0 2016-12-24 5
1 2016-12-25 2
2 2016-12-26 0
3 2016-12-27 4
4 2016-12-28 0
5 2016-12-29 0
6 2016-12-30 0
7 2016-12-31 8
Last, if you need to change the format:
df.date = df.date.dt.strftime('%Y%m%d')
print (df)
date sales
0 20161224 5
1 20161225 2
2 20161226 0
3 20161227 4
4 20161228 0
5 20161229 0
6 20161230 0
7 20161231 8
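A small alternative sketch is DataFrame.asfreq, which fills the missing daily rows directly once date is a DatetimeIndex (same to_datetime conversion assumed):
df = df.set_index('date')
df1 = df.asfreq('D', fill_value=0).reset_index()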