Sum time in pandas - python

I have a df with columns date, employee and event. 'Event' has a value in [1,3,5] if someone exits or [0,2,4] if someone enters. 'Employee' is a private number for each employee. This is the head of df:
   employee  event  registration                date
0         4      1             1 2010-10-18 18:11:00
1        17      1             1 2010-10-18 18:15:00
2         6      0             1 2010-10-19 06:28:00
3         8      0             0 2010-10-19 07:04:00
4        15      0             1 2010-10-19 07:34:00
I filtered df so that I only have values from one month [year and month are my variables]:
df = df.where(df['date'].dt.year == year).dropna()
df = df.where(df['date'].dt.month == month).dropna()
I want to create a df which shows me the sum of time at work for each employee.
Employees come in and go out on the same day, and they can do so several times each day.

It seems you need boolean indexing, then groupby where the difference is computed by diff and aggregated with sum:
year = 2010
month = 10
df = df[(df['date'].dt.year == year) & (df['date'].dt.month == month)]
A more general solution is to add year and month to the groupby:
df = df['date'].groupby([df['employee'],
                         df['event'],
                         df['date'].rename('year').dt.year,
                         df['date'].rename('month').dt.month]).apply(lambda x: x.diff().sum())
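For context, here is a minimal, self-contained sketch of the filtering-plus-groupby idea on made-up data (the employee numbers, timestamps and the pairing rule are illustrative assumptions, not taken from the original df). Unlike the groupby above, it assumes that, once sorted by time, each employee's events alternate entry/exit within a day, and it sums only the entry-to-exit gaps:
import pandas as pd

# Illustrative sample: one employee, two entry/exit pairs on the same day.
df = pd.DataFrame({
    'employee': [4, 4, 4, 4],
    'event':    [0, 1, 2, 3],  # even = entry, odd = exit
    'date': pd.to_datetime(['2010-10-19 06:28', '2010-10-19 12:00',
                            '2010-10-19 13:00', '2010-10-19 18:11']),
})

year, month = 2010, 10
df = df[(df['date'].dt.year == year) & (df['date'].dt.month == month)]

# Sort so each exit follows its entry, then keep only the entry -> exit gaps.
df = df.sort_values(['employee', 'date'])
worked = (df.groupby([df['employee'], df['date'].dt.date])['date']
            .apply(lambda x: x.diff().iloc[1::2].sum()))
print (worked)  # total time at work per employee and day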


Aggregating the counts on a certain day of the week in python

I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this, and it works, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
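As a variation on the same idea (not part of the answer above), pd.Grouper anchored on Saturday labels each group directly with its Saturday date, which matches the desired output. A minimal sketch, assuming 'date' still holds the m/d/yy strings from the question:
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

# Weekly buckets ending on Saturday; the group label is the Saturday itself.
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
print (out)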
I tried to understand as much as I can :)
Here is my process:
# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11 2022-09-03 00:00:00
1   1    36     28 2022-09-10 00:00:00
2   2    35      7 2022-09-03 00:00:00
3   2    36     13 2022-09-10 00:00:00

Create a new column based on timestamp values to count the days by steps

I would like to add a column to my dataset which corresponds to the timestamp and counts the days in steps. That is, for one year there should be 365 "steps", and I would like all grouped payments for each account on day 1 to be labeled 1 in this column, all payments on day 2 labeled 2, and so on up to day 365. I would like it to look something like this:
account time steps
0 A 2022.01.01 1
1 A 2022.01.02 2
2 A 2022.01.02 2
3 B 2022.01.01 1
4 B 2022.01.03 3
5 B 2022.01.05 5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x

df = df.groupby('account').apply(day_step)
however, it only counts within each month; once a new month begins it starts again from 1.
How can I fix this so it provides the step count for the entire year?
Use GroupBy.transform with 'first' (or 'min'), subtract it from the time column, convert the timedeltas to days and add 1:
df['time'] = pd.to_datetime(df['time'])
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print (df)
account time steps steps1
0 A 2022-01-01 1 1
1 A 2022-01-02 2 2
2 A 2022-01-02 2 2
3 B 2022-01-01 1 1
4 B 2022-01-03 3 3
5 B 2022-01-05 5 5
A first idea, which works only if the first date of each account is January 1:
df['steps'] = df['time'].dt.dayofyear
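For reference, here is a self-contained version of the transform approach, rebuilding the sample frame from the question (a sketch; only the values shown above are assumed):
import pandas as pd

df = pd.DataFrame({
    'account': ['A', 'A', 'A', 'B', 'B', 'B'],
    'time': ['2022.01.01', '2022.01.02', '2022.01.02',
             '2022.01.01', '2022.01.03', '2022.01.05'],
})
df['time'] = pd.to_datetime(df['time'], format='%Y.%m.%d')

# Day offset of each payment from the account's first payment, 1-based.
df['steps'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                         .dt.days
                         .add(1))
print (df)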

How to crosstab or count dataframe rows by date in pandas

I am fairly new to working with pandas. I have a dataframe with individual entries like this:
dfImport:
id  date_created  date_closed
0   01-07-2020
1   02-09-2020    10-09-2020
2   07-03-2019    02-09-2020
I would like to filter it in a way that I get the total number of created and closed objects (count of ids), grouped by Year, Quarter and Month, like this:
dfInOut:
Year  Qrt  month      number_created  number_closed
2019  1    March      1               0
2020  3    July       1               0
           September  1               2
I guess I'd have to use some combination of crosstab or groupby, but I tried out a lot of ideas and already did research on the problem, and I can't seem to figure out a way. I guess it's an issue of understanding. Thanks in advance!
Use DataFrame.melt with crosstab:
df['date_created'] = pd.to_datetime(df['date_created'], dayfirst=True)
df['date_closed'] = pd.to_datetime(df['date_closed'], dayfirst=True)
df1 = df.melt(value_vars=['date_created','date_closed']).dropna()
df = (pd.crosstab([df1['value'].dt.year.rename('Year'),
                   df1['value'].dt.quarter.rename('Qrt'),
                   df1['value'].dt.month.rename('Month')], df1['variable'])
        [['date_created','date_closed']])
print (df)
variable date_created date_closed
Year Qrt Month
2019 1 3 1 0
2020 3 7 1 0
9 1 2
df = df.rename_axis(None, axis=1).reset_index()
print (df)
Year Qrt Month date_created date_closed
0 2019 1 3 1 0
1 2020 3 7 1 0
2 2020 3 9 1 2
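The desired output uses month names rather than month numbers; if that matters, the Month level can be built from dt.month_name() instead of dt.month (my own variation, not part of the answer above; note that month names sort alphabetically in the index):
# Same crosstab, but with month names in the index.
df = (pd.crosstab([df1['value'].dt.year.rename('Year'),
                   df1['value'].dt.quarter.rename('Qrt'),
                   df1['value'].dt.month_name().rename('Month')], df1['variable'])
        [['date_created','date_closed']]
        .rename_axis(None, axis=1)
        .reset_index())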

Filling Pandas dataframe based on date columns and date range

I have a pandas dataframe that looks like this,
id start end
0 1 2020-02-01 2020-04-01
1 2 2020-04-01 2020-04-28
I have two additional parameters that are date values, say x and y. x and y will always be the first day of a month.
I want to expand the above data frame into the one shown below, for x = "2020-01-01" and y = "2020-06-01":
id month status
0 1 2020-01 -1
1 1 2020-02 1
2 1 2020-03 2
3 1 2020-04 2
4 1 2020-05 -1
5 1 2020-06 -1
6 2 2020-01 -1
7 2 2020-02 -1
8 2 2020-03 -1
9 2 2020-04 1
10 2 2020-05 -1
11 2 2020-06 -1
The dataframe is expanded such that for each id there will be months_between(x, y) rows. A status column is added and filled in such that:
If the month column value is equal to the month of the start column, fill status with 1.
If the month column value is greater than the month of the start column but less than or equal to the month of the end column, fill it with 2.
If the month column value is less than the month of start, fill it with -1. Also, if the month column value is greater than the month of end, fill status with -1.
I'm trying to solve this in pandas without looping. The current solution I have is with loops and takes longer to run with huge datasets.
Are there any pandas functions that can help me here?
Thanks @Code Different for the solution. It solves the issue. However, there is an extension to the problem where the dataframe can look like this:
id start end
0 1 2020-02-01 2020-02-20
1 1 2020-04-01 2020-05-10
2 2 2020-04-10 2020-04-28
One id can have more than one entry. For the above x and y, which are 6 months apart, I want to have 6 rows for each id in the dataframe. The solution currently creates 6 rows for each row in the dataframe, which is okay, but not ideal when dealing with a dataframe with millions of ids.
Make sure the start and end columns are of type Timestamp:
# Explode each month between x and y
x = '2020-01-01'
y = '2020-06-01'
df['month'] = [pd.date_range(x, y, freq='MS')] * len(df)
df = df.explode('month').drop_duplicates(['id', 'month'])
# Determine the status
df['status'] = -1
cond = df['start'] == df['month']
df.loc[cond, 'status'] = 1
cond = (df['start'] < df['month']) & (df['month'] <= df['end'])
df.loc[cond, 'status'] = 2
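If the month column should also be rendered as the 'YYYY-MM' strings shown in the desired output, a small follow-up step (my own addition, not part of the answer above) converts it afterwards:
# Render the month column as 'YYYY-MM' strings, as in the desired output.
df['month'] = pd.to_datetime(df['month']).dt.strftime('%Y-%m')
df = df[['id', 'month', 'status']].reset_index(drop=True)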

How to calculate weeks difference and add missing Weeks with count in python pandas

I have a data frame like this, and I have to get the missing week values and the count between them.
year Data Id
20180406 57170 A
20180413 55150 A
20180420 51109 A
20180427 57170 A
20180504 55150 A
20180525 51109 A
The output should be like this.
Id Start year end-year count
A 20180420 20180420 1
A 20180518 20180525 2
Use:
# convert to weekly periods ending on Thursday (each week runs Friday-Thursday)
df['year'] = pd.to_datetime(df['year'], format='%Y%m%d').dt.to_period('W-Thu')
# resample to weekly frequency with asfreq to expose the missing weeks
df1 = (df.set_index('year')
         .groupby('Id')['Id']
         .resample('W-Thu')
         .asfreq()
         .rename('val')
         .reset_index())
print (df1)
Id year val
0 A 2018-04-06/2018-04-12 A
1 A 2018-04-13/2018-04-19 A
2 A 2018-04-20/2018-04-26 A
3 A 2018-04-27/2018-05-03 A
4 A 2018-05-04/2018-05-10 A
5 A 2018-05-11/2018-05-17 NaN
6 A 2018-05-18/2018-05-24 NaN
7 A 2018-05-25/2018-05-31 A
# convert the periods to datetimes using the start date of each week
#http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-between-representations
df1['year'] = df1['year'].dt.to_timestamp('D', how='s')
print (df1)
Id year val
0 A 2018-04-06 A
1 A 2018-04-13 A
2 A 2018-04-20 A
3 A 2018-04-27 A
4 A 2018-05-04 A
5 A 2018-05-11 NaN
6 A 2018-05-18 NaN
7 A 2018-05-25 A
m = df1['val'].notnull().rename('g')
# create index by cumulative sum to get a unique group for each run of consecutive NaNs
df1.index = m.cumsum()
# filter only the NaN rows and aggregate first, last and size
df2 = (df1[~m.values].groupby(['Id', 'g'])['year']
                     .agg(['first','last','size'])
                     .reset_index(level=1, drop=True)
                     .reset_index())
print (df2)
Id first last size
0 A 2018-05-11 2018-05-18 2
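To match the column names and the YYYYMMDD date format used in the question's expected output (my own addition, not part of the answer above), the result can be reshaped afterwards:
# Rename columns and format the dates back to YYYYMMDD strings, as in the question.
df2 = df2.rename(columns={'first': 'Start year', 'last': 'end-year', 'size': 'count'})
df2['Start year'] = df2['Start year'].dt.strftime('%Y%m%d')
df2['end-year'] = df2['end-year'].dt.strftime('%Y%m%d')
print (df2)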
