How to resolve ValueError: cannot reindex from a duplicate axis - python

Input
Client  First name  Last Name  Start Date  End Date    Amount      Invoice Date
XXX     John        Kennedy    15-01-2021  28-02-2021  137,586.00  20-04-2021
YYY     Peter       Paul       7-02-2021   31-03-2021  38,750.00   20-04-2021
ZZZ     Michael     K          10-03-2021  29-04-2021  137,586.00  30-04-2021
Code
df = pd.read_excel ('file.xlsx',parse_dates=['Start Date','End Date'] )
df['Start Date'] = pd.to_datetime(df['Start Date'],format='%d-%m-%Y')
df['End Date'] = pd.to_datetime(df['End Date'],format='%d-%m-%Y')
df['r'] = df.apply(lambda x: pd.date_range(x['Start Date'],x['End Date']), axis=1)
df = df.explode('r')
print(df)
months = df['r'].dt.month
starts, ends = months.ne(months.groupby(level=0).shift(1)), months.ne(months.groupby(level=0).shift(-1))
df2 = pd.DataFrame({'First Name': df['First name'],
'Start Date': df.loc[starts, 'r'].dt.strftime('%Y-%m-%d'),
'End Date': df.loc[ends, 'r'].dt.strftime('%Y-%m-%d'),
'Date Diff': df.loc[ends, 'r'].dt.strftime('%d').astype(int)-df.loc[starts, 'r'].dt.strftime('%d').astype(int)+1})
df = df.loc[~df.index.duplicated(), :]
df2 = pd.merge(df, df2, left_index=True, right_index=True)
df2['Amount'] = df['Amount'].mul(df2['Date_Diff'])
print(df['Amount'])
print (df)
df.to_excel('report.xlsx', index=True)
Error
ValueError: cannot reindex from a duplicate axis
Expected output
How can I resolve this issue?

The error itself comes from the duplicated index that explode produces: when df2 is built from several Series aligned on that index, pandas has to reindex over duplicate labels and refuses. A cleaner approach avoids the problem entirely.
Start with a small correction in your input Excel file, namely change First name
to First Name - with a capital "N", just like in the other columns.
Then, to read your Excel file, it is enough to run:
df = pd.read_excel('Input.xlsx', parse_dates=['Start Date', 'End Date',
                   'Invoice Date'], dayfirst=True)
No need to call to_datetime.
Note also that since Invoice Date also contains dates, I added this column to the
parse_dates list.
Then define two functions:
A function to get monthly data for the current row:
def getMonthData(grp, amnt, dayNo):
    return pd.Series([grp.min(), grp.max(), amnt * grp.size / dayNo],
                     index=['Start Date', 'End Date', 'Amount'])
It converts the input Series of dates (for a single month) into the "new" content of
the output row: the start / end dates and the share of the total amount attributable
to that month.
It will be called from the following function (a standalone check of it is shown after both functions).
A function to "explode" the current row:
def rowExpl(row):
    ind = pd.date_range(row['Start Date'], row['End Date']).to_series()
    rv = ind.groupby(pd.Grouper(freq='M')).apply(getMonthData,
        amnt=row.Amount, dayNo=ind.size).unstack().reset_index(drop=True)
    rv.insert(0, 'Client', row.Client)
    rv.insert(1, 'First Name', row['First Name'])
    rv.insert(2, 'Last Name', row['Last Name'])
    return rv.assign(**{'Invoice Date': row['Invoice Date']})
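As a quick standalone check of getMonthData, here is a small sketch using values taken from the first input row (January covers 17 of that stay's 45 days); it only illustrates the function above and is not part of the final pipeline:
import pandas as pd

# January portion of the 2021-01-15 .. 2021-02-28 stay (17 days out of a 45-day total)
jan = pd.date_range('2021-01-15', '2021-01-31').to_series()
print(getMonthData(jan, amnt=137586.00, dayNo=45))
# Amount comes out as 137586 * 17 / 45, i.e. about 51976.93 - matching the first output row below.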
And the last step is to get the result. Apply rowExpl to each row and concatenate
the partial results into a single output DataFrame:
result = pd.concat(df.apply(rowExpl, axis=1).values, ignore_index=True)
The result, for your data sample, is:
Client First Name Last Name Start Date End Date Amount Invoice Date
0 XXX John Kennedy 2021-01-15 2021-01-31 51976.9 2021-04-20
1 XXX John Kennedy 2021-02-01 2021-02-28 85609.1 2021-04-20
2 YYY Peter Paul 2021-02-07 2021-02-28 16084.9 2021-04-20
3 YYY Peter Paul 2021-03-01 2021-03-31 22665.1 2021-04-20
4 ZZZ Michael K 2021-03-10 2021-03-31 59350.8 2021-04-30
5 ZZZ Michael K 2021-04-01 2021-04-29 78235.2 2021-04-30
Don't be put off by the seemingly low precision of the Amount column.
It is only the way Jupyter Notebook displays the DataFrame.
When you run result.iloc[0, 5], you will get:
51976.933333333334
i.e. the full precision actually held.

Related

Locate the Upcoming Expiry date and Assign the Value based on it - Python Data frame

There are two dataframes; I need to extract the nearest upcoming Expiry date from Dataframe 2, based on the Active date in Dataframe 1, to obtain the correct Value.
This is a sample; the original data contains thousands of rows.
Dataframe 1
df_1 = pd.DataFrame({'Category': ['A','B'],
                     'Active date': ['2021-06-20','2021-06-25']})
Dataframe 2
df_2 = pd.DataFrame({'Category': ['A','A','A','A','A','B','B','B'],
                     'Expiry date': ['2021-05-22','2021-06-23','2021-06-24','2021-06-28','2021-07-26','2021-06-27','2021-06-28','2021-08-29'],
                     'Value': [20,21,23,45,12,34,17,34]})
Final Output -
The code I was trying -
df = pd.merge(df_1, df_2, on='Category', how='inner')
#Removed all the dates which are less than Active date
df = df.loc[(df_1['Active Date'] <= df_2['Expiry Date'])]
I believe this solution keeps a lot of your existing code and will accomplish what you are looking for.
df_1 = pd.DataFrame({'Category': ['A','B'],
                     'Active date': ['2021-06-20','2021-06-25']})
df_2 = pd.DataFrame({'Category': ['A','A','A','A','A','B','B','B'],
                     'Expiry date': ['2021-05-22','2021-06-23','2021-06-24','2021-06-28','2021-07-26','2021-06-27','2021-06-28','2021-08-29'],
                     'Value': [20,21,23,45,12,34,17,34]})
df = pd.merge(df_1, df_2, on='Category', how='inner')
# Removed all the dates which are less than Active date
df = df.loc[(df['Active date'] <= df['Expiry date'])]
df = df.rename(columns={'Expiry date': 'Next Expiry Date'})
df = df.loc[df['Next Expiry Date'] == df.groupby('Category')['Next Expiry Date'].transform('min')]
Output:
Category Active date Next Expiry Date Value
1 A 2021-06-20 2021-06-23 21
5 B 2021-06-25 2021-06-27 34
You can use pandas merge_asof with direction set to 'forward'. Note that for merge_asof, both data frames must be sorted on the join keys:
import numpy as np

df_1 = df_1.transform(pd.to_datetime, errors='ignore')
df_2 = df_2.astype({"Expiry date": np.datetime64})
df_2 = df_2.sort_values('Expiry date')
pd.merge_asof(df_1,
              df_2,
              left_on='Active date',
              right_on='Expiry date',
              direction='forward',
              by='Category')
Category Active date Expiry date Value
0 A 2021-06-20 2021-06-23 21
1 B 2021-06-25 2021-06-27 34
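In this sample df_1 already happens to be ordered by Active date; if it were not, it would also need sorting before the merge_asof call, e.g.:
df_1 = df_1.sort_values('Active date')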

combine two complete rows if certain criteria is met

I've been able to extract data from two separate xlsx and combine them into a single xlsx sheet using pandas.
I now have a table that looks like this.
Home Start Date Gross Earning Tax Gross Rental Commission Net Rental
3157 2020-03-26 00:00:00 -268.8 -28.8 -383.8 -36 -338.66
3157 2020-03-26 00:00:00 268.8 28.8 153.8 36 108.66
3157 2020-03-24 00:00:00 264.32 28.32 149.32 35.4 104.93
3157 2020-03-13 00:00:00 625.46 67.01 510.46 83.7675 405.4225
3157 2020-03-13 00:00:00 558.45 0 443.45 83.7675 342.9325
3157 2020-03-11 00:00:00 142.5 0 27.5 21.375 1.855
3157 2020-03-11 00:00:00 159.6 17.1 44.6 21.375 17.805
3157 2020-03-03 00:00:00 349.52 0 234.52 52.428 171.612
3157 2020-03-03 00:00:00 391.46 41.94 276.46 52.428 210.722
So if you take a look at the first two rows, the name in the Home column is the same (in this example, 3157 Tocoa), and it is also the same for the next few rows. But in the Start Date column, only the first two items are the same (in this case 3/26/2020 12:00:00 AM). So what I need to do is the following:
If the dates are the same, and the Home is the same, then I need the sum of all of the following columns.
(In this case, I would need the sum of -268.8 and 268.8, the sum of -28.8 and 28.8, and so on.) It is also important to mention there are instances where there are more than two matching start dates.
I will include the code I have used to get to where I am now. I would like to mention I am fairly new to Python, so I'm sure there is a simple way to do this that I am just not familiar with.
I am also new to Stack Overflow, so if I am missing something or added something I should have, please forgive me.
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import numpy as np
import matplotlib.pyplot as plt
import os
# class airbnb:
#Gets the location path for the reports that come raw from the channel
airbnb_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with python\Data_to_read\Bnb_feb_report.xlsx')
empty_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with python\Data_to_read\empty.xlsx')
#Defines the data frame
df_airbnb = pd.read_excel(airbnb_excel_file)
df_empty = pd.read_excel(empty_excel_file)
gross_earnings = df_airbnb['Gross Earnings']
tax_amount = df_airbnb['Gross Earnings'] * 0.06
gross_rental = df_airbnb['Gross Earnings'] - df_airbnb['Cleaning Fee']
com = ((gross_rental - tax_amount) + df_airbnb['Cleaning Fee']) * 0.15
net_rental = (gross_rental - (com + df_airbnb['Host Fee']))
house = df_airbnb['Listing']
start_date = df_airbnb['Start Date']
# df = pd.DataFrame(df_empty)
# df_empty.replace('nan', '')
#
# print(net_rental)
df_report = pd.DataFrame(
    {'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': tax_amount,
     'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})
df_report.loc[(df_report.Home == 'New house, Minutes from Disney & Attraction'), 'Home'] = '3161 Tocoa'
df_report.loc[(df_report.Home == 'Brand-New House, located minutes from Disney 5151'), 'Home'] = '5151 Adelaide'
df_report.loc[(df_report.Home == 'Luxury House, Located Minutes from Disney-World 57'), 'Home'] = '3157 Tocoa'
df_report.loc[(df_report.Home == 'Big house, Located Minutes from Disney-World 55'), 'Home'] = '3155 Tocoa'
df_report.sort_values(by=['Home'], inplace=True)
# writer = ExcelWriter('Final_Report.xlsx')
# df_report.to_excel(writer, 'sheet1', index=False)
# writer.save()
# class homeaway:
homeaway_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with python\Data_to_read\PayoutSummaryReport2020-03-01_2020-03-29.xlsx')
df_homeaway = pd.read_excel(homeaway_excel_file)
cleaning = int(115)
house = df_homeaway['Address']
start_date = df_homeaway['Check-in']
gross_earnings = df_homeaway['Gross booking amount']
taxed_amount = df_homeaway['Lodging Tax Owner Remits']
gross_rental = (gross_earnings - cleaning)
com = ((gross_rental-taxed_amount) + cleaning) * 0.15
net_rental = (gross_rental - (com + df_homeaway['Deductions']))
df_report2 = pd.DataFrame(
    {'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': taxed_amount,
     'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})
# writer = ExcelWriter('Final_Report2.xlsx')
# df_report2.to_excel(writer, 'sheet1', index=False)
# writer.save()
df_combined = pd.concat([df_report, df_report2])
writer = ExcelWriter('Final_Report_combined.xlsx')
df_report2.to_excel(writer, 'sheet1', index=False)
writer.save()
One possible approach is to group by Home and Start Date and
then compute the sum of the rows involved:
df.groupby(['Home', 'Start Date']).sum()
Fortunately, all "other" columns are numeric, so no column specification is needed.
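If the frame also contained non-numeric columns, the columns to sum could be listed explicitly instead, for example (column names taken from the table above):
df.groupby(['Home', 'Start Date'])[['Gross Earning', 'Tax', 'Gross Rental',
                                    'Commission', 'Net Rental']].sum()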
But if there are more than 2 rows with same Home and Start Date
and you want to:
break them into pairs of consecutive rows,
and then compute their sums (for each pair separately),
you should apply a "2-tier" grouping:
first tier - group by Home and Start Date (as before),
second tier - group into pairs,
and compute sums for each second-level group.
In this case the code should be:
df.groupby(['Home', 'Start Date']).apply(
    lambda grp: grp.groupby(np.arange(len(grp.index)) // 2).sum())\
    .reset_index(level=-1, drop=True)
Additional operation required here is to drop the last level of the index
(reset_index).
To test this approach, e.g. add the following row to your DataFrame:
1234 Bogus Street,2020-03-26 00:00:00,20.0,2.0,15.0,3,10.0
so that 1234 Bogus Street / 2020-03-26 00:00:00 group now contains
three rows.
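One way to append such a test row is sketched below; it assumes the combined report is held in a DataFrame named df with the columns shown earlier (the answer does not say how the row was actually added):
import pandas as pd

# Append the test row with the values listed above
extra = pd.DataFrame([{'Home': '1234 Bogus Street',
                       'Start Date': pd.Timestamp('2020-03-26 00:00:00'),
                       'Gross Earning': 20.0, 'Tax': 2.0,
                       'Gross Rental': 15.0, 'Commission': 3, 'Net Rental': 10.0}])
df = pd.concat([df, extra], ignore_index=True)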
When you run the 2-tier grouping code above, you will get:
Gross Earning Tax Gross Rental Commission Net Rental
Home Start Date
1234 Bogus Street 2020-03-03 00:00:00 740.98 41.94 510.98 104.856 382.334
2020-03-11 00:00:00 302.10 17.10 72.10 42.750 19.660
2020-03-13 00:00:00 1183.91 67.01 953.91 167.535 748.355
2020-03-24 00:00:00 264.32 28.32 149.32 35.400 104.930
2020-03-26 00:00:00 0.00 0.00 -230.00 0.000 -230.000
2020-03-26 00:00:00 20.00 2.00 15.00 3.000 10.000
Note the last row. It contains:
repeated Start Date (from the previous row),
values from the added row.
And the last-but-one row contains the sums of only the first two rows
with the respective Home / Start Date.

Resample() returning incorrect figures for non-existent dates

I have a data frame in this format:
Date Posted Receipt Amount Centre Brand
07-10-2019 6000.0 Centre 1 Brand 1
07-05-2019 6346.66 Centre 2 Brand 1
03-01-2019 6173.34 Centre 1 Brand 2
11-06-2019 6000.0 Centre 1 Brand 2
13-09-2019 6346.66 Centre 3 Brand 1
07-11-2019 6098.34 Centre 4 Brand 1
I am re-sampling the data for time series forecasting purposes:
df=pd.read_csv("File Directory")
df["Receipt Amount"] = df["Receipt Amount"].astype(float)
brands=list((pd.Series(df["Brand"].unique())).dropna())
df['Date Posted'] = pd.DatetimeIndex(df['Date Posted'])
df.index = df['Date Posted']
df=df.drop(["Date Posted"],axis=1)
for brand in brands:
    brand_filter = df['Brand'] == brand
    brand_df = df[brand_filter]
    brand_df = brand_df[["Receipt Amount"]]
    brand_df = brand_df.resample('D').sum()
    brand_df.reset_index(level=0, inplace=True)
    brand_df = brand_df.rename({'Date Posted': 'ds'}, axis=1)
    brand_df = brand_df.rename({'Receipt Amount': 'y'}, axis=1)
However, this returns some of the sum values as 0, which I know to be false.
It also returns values for days in December which, once again, I know to be false (none of the data is more recent than November).
This is the code in its entirety, so I am unsure where I have made a mistake.
I have now resolved this issue, so here is the solution for future desperate Googlers.
The dates weren't being read in correctly by:
df['Date Posted'] = pd.DatetimeIndex(df['Date Posted'])
Some dates were being read as dd/mm/yyyy while others were being read as mm/dd/yyyy.
To solve this, parse the dates with to_datetime and dayfirst=True:
df['Date Posted'] = pd.to_datetime(df['Date Posted'], dayfirst=True)
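A minimal illustration of why the months get mixed up, using two date strings from the sample data (the exact default behaviour varies by pandas version; newer versions warn about mixed formats):
import pandas as pd

s = pd.Series(['07-10-2019', '13-09-2019'])

# Without dayfirst, '07-10-2019' is typically read month-first (July 10),
# while '13-09-2019' can only be read day-first (there is no month 13),
# so rows drift into the wrong month.
print(pd.to_datetime(s, dayfirst=True))
# 0   2019-10-07
# 1   2019-09-13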

Pandas group, aggregate two columns and return the earliest Start Date for one column

I am trying to group a csv file in Pandas (by one column: ID) in order to get the earliest Start Date and latest End Date. Then I am trying to group by multiple columns in order to get the SUM of a value. For each ID in the second grouped-by dataframe, I want to present the dates.
I am loading a csv in order to group and aggregate data.
01) First I load the csv
def get_csv():
    # Read csv file
    df = pd.read_csv('myFile.csv', encoding="ISO-8859-1", parse_dates=['Start Date', 'End Date'])
    return df
02) Group and aggregate the data for the columns (ID and Site)
def do_stuff():
    df = get_csv()
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which works as expected and I am getting the following (example):
03) And ideally, for the same ID I want to present the earliest date in the Start Date column and the latest one in the End Date column. The aggregation for the value works perfectly. What I want to get is the following:
I do not know how to change my current code above. I have tried this so far:
def do_stuff():
    df = get_csv()
    md = get_csv()
    minStart = md[md['A or B'].str.contains('AAAA')].groupby([md['ID']]).agg({'Start Date': 'min'})
    df['earliestStartDate'] = minStart
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'), df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which fails and also tried changing the above to:
def do_stuff():
    df = get_csv()
    md = get_csv()
    df['earliestStartDate'] = md.loc[md['ID'] == df['ID'], 'Start Date'].min()
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'), df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
Ideally, I will just change something in the groupedBy instead of having to read the csv twice and aggregate the data twice. Is that possible? If not, what can I change to make the script work? I am trying to test random things to get more experience in Pandas and Python.
I am guessing I have to create two dataframes here: one to get the grouped-by data for all the columns needed (and the SUM of the Value), and a second one to get the earliest Start Date and latest End Date for each ID. Then I need to find a way to concatenate the two dataframes. Is that a good approach, or do you think there is an easier way to achieve that?
UPD: My code where I have created two dataframes (not sure whether this is the right solution) is given below:
#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
md = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
#Calculate the Clean Value
df['Clean Cost'] = (df['Value'] - df['Value2']) #.apply(lambda x: round(x,0))
#Get the min/max Dates
minMaxDates = md[md['Random'].str.contains('Y')].groupby([md['ID']]).agg({'Start Date': 'min', 'End Date': 'max'})
#Group by and aggregate (return Earliest Start Date, Latest End Date and SUM of the Values)
groupedBy = df[df['Random'].str.contains('Y')].groupby([df['ID'], df['Site'].fillna('Other')]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum', 'Value2': 'sum', 'Clean Cost': 'sum'})
and if I print the two dataframes, I am getting the following:
and
If I print the df.head(), I am getting the following:
      ID A or B Start Date   End Date  Value  Site  Value2 Random
0  45221   AAAA 2017-12-30 2017-09-30     14  S111       7      Y
1  45221   AAAA 2017-01-15 2017-09-30     15  S222       7      Y
2  85293   BBBB 2017-05-12 2017-07-24     29  S111       3      Y
3  85293   AAAA 2017-03-22 2017-10-14     32  S222       4      Y
4  45221   AAAA 2017-01-15 2017-09-30     30  S222       7      Y
A link to the file is given here: LINK
I think you need transform:
df = pd.read_csv('sampleBionic.csv')
print (df)
ID A or B Start Date End Date Value Site Value2 Random
0 45221 AAAA 12/30/2017 09/30/2017 14 S111 7 Y
1 45221 AAAA 01/15/2017 09/30/2017 15 S222 7 Y
2 85293 BBBB 05/12/2017 07/24/2017 29 S111 3 Y
3 85293 AAAA 03/22/2017 10/14/2017 32 S222 4 Y
4 45221 AAAA 01/15/2017 09/30/2017 30 S222 7 Y
groupedBy = (df[df['A or B'].str.contains('AAAA')]
             .groupby([df['ID'], df['Site'].fillna('Other'),])
             .agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 12/30/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
g = groupedBy.groupby(level=0)
groupedBy['Start Date'] = g['Start Date'].transform('min')
groupedBy['End Date'] = g['End Date'].transform('max')
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 01/15/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
I have managed to create a script that does what I want. I will paste the answer in case somebody needs it in the future. Jezrael's answer worked fine too. So, considering that the original csv looks like the sample shown above,
my script is:
import pandas as pd
import os
import csv
import time
import dateutil.parser as dparser
import datetime
def get_csv():
    # Read csv file
    df = pd.read_csv('myFile.csv', encoding="ISO-8859-1", mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
    df = df[df['A or B'].str.contains('AAAA')]
    return df

def do_stuff():
    df = get_csv()
    # Get the min Start Date, max End Date, sum of the Value and Value2 and calculate the Net Cost
    varA = 'ID'
    dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'}).copy()
    varsToKeep = ['ID', 'Site', 'Random', 'Start Date_grp', 'End Date_grp', 'Value', 'Value2']
    dfTemp = pd.merge(df, dfGrouped, how='inner', on='ID', suffixes=(' ', '_grp'), copy=True)[varsToKeep]
    dfBreakDown = dfTemp.groupby(['ID', 'Site', 'Random', 'Start Date_grp',
                                  'End Date_grp']).sum()
    # Calculate the Net Cost
    dfTemp['Net Cost'] = (dfTemp['Value'] - dfTemp['Value2'])
    groupedBy = dfTemp.groupby(['ID', 'Site', 'Random']).agg({'Start Date_grp': 'min', 'End Date_grp': 'max', 'Value': 'sum', 'Value2': 'sum', 'Net Cost': 'sum'})
    csvoutput(groupedBy)

def csvoutput(df):
    # Csv output
    df.to_csv(path_or_buf='OUT.csv', sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression=None, quoting=None, quotechar='"', line_terminator='\n', chunksize=None, tupleize_cols=False, date_format=None, doublequote=True, escapechar=None, decimal='.')

if __name__ == "__main__":
    # start things here
    do_stuff()

Pandas dataframe: Create additional column based on date columns comparison

Assuming I have the following dataset saved in a Pandas dataframe - note the last column [Status] is the column I'd like to create:
Department Employee Issue Date Submission Date ***Status***
A Joe 18/05/2014 25/06/2014 0
A Joe 1/06/2014 28/06/2014 1
A Joe 23/06/2014 30/06/2014 2
A Mark 1/03/2015 13/03/2015 0
A Mark 23/04/2015 15/04/2015 0
A William 15/07/2016 30/07/2016 0
A William 1/08/2016 23/08/2016 0
A William 20/08/2016 19/08/2016 1
B Liz 18/05/2014 7/06/2014 0
B Liz 1/06/2014 15/06/2014 1
B Liz 23/06/2014 16/06/2014 0
B John 1/03/2015 13/03/2015 0
B John 23/04/2015 15/04/2015 0
B Alex 15/07/2016 30/07/2016 0
B Alex 1/08/2016 23/08/2016 0
B Alex 20/08/2016 19/08/2016 1
I'd like to create an additional column [Status] based on the following conditions:
For every unique [Department] & [Employee] combination (e.g. there are three rows corresponding to Joe in Department A), sort the [Issue Date] from oldest to newest
If the current row's [Issue Date] is greater than ALL previous rows' [Submission Date], then flag the [Status] with 0; else [Status] = the number of previous rows whose [Submission Date] is later than the current [Issue Date]
As an example: for employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is after the [Issue Date], therefore [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', row 1 & 2's [Submission Date]s are both after the [Issue Date], therefore [Status] = 2 for row 3. We need to perform this calculation for every unique combination of Department and Employee.
Note: the real dataset is not sorted as nicely as the displayed example.
This question was posted 6 months ago but hopefully my answer still provides some help.
First, import the libraries and make the dataframe:
# import libraries
import numpy as np
import pandas as pd
# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 +\
                                ['Mark']*2 +\
                                ['William']*3 +\
                                ['Liz']*3 +\
                                ['John']*2 +\
                                ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})
Second, convert Issue Date and Submission Date to datetime:
# Convert 'Issue Date', 'Submission Date' to pd.datetime
df.loc[:, 'Issue Date'] = pd.to_datetime(df.loc[:, 'Issue Date'],
                                         dayfirst = True)
df.loc[:, 'Submission Date'] = pd.to_datetime(df.loc[:, 'Submission Date'],
                                              dayfirst = True)
Third, reset the index and sort the values by Department, Employee, and Issue Date:
# Sort by 'Department', 'Employee', 'Issue Date' and reset index
# (chaining reset_index() with sort_values(inplace=True) would not modify df)
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)
Fourth, group by Department, Employee; cumulative count the rows; insert into the originial df:
# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())
Fifth, create a no_issue and no_submission dataframe and merge them together on Department and Employee:
# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)
# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)
# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how = 'outer',
                        on = ['Department',
                              'Employee'])
This duplicates the Submission Dates by the number of Issue Dates per Department, Employee group.
Here's what it looks like for Joe:
Sixth, create a dataframe that only keeps rows where grouped count_x is less than grouped count_y, then sort by Department, Employee, and Issue Date:
# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
                                                                     'Employee',
                                                                     'Issue Date'])
Seventh, insert the status column as a boolean where Issue Date is less than Submission Date:
# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])
Eighth, group by Department, Employee, and Issue Date, sum the Status, and reset the index:
# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : np.sum}).reset_index()
This returns a dataframe with all the correct Statuses, except that the minimum Issue Date row of each Department, Employee group is missing
Ninth, group the original merged dataframe by Department and Employee, find the minimum Issue Date, and reset the index:
# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date' : 'min'}).reset_index()
Tenth, concatenate merged1 with merged, fill the na with 0 (since the minimum Issue Date will always have a Status of 0) and sort by Department, Employee, and Issue Date:
# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
                                                                        'Employee',
                                                                        'Issue Date'])
Eleventh, inner merge the original df with the concatenated dataframe on Department, Employee, and Issue Date, then drop the grouped count:
# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how = 'inner',
                 on = ['Department',
                       'Employee',
                       'Issue Date']).drop('grouped count',
                                           axis = 1)
Voila! Here is your final dataframe:
# Final df
final
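For comparison, here is a much shorter sketch of the same per-group logic: for each row, count how many of the previous rows in its Department / Employee group have a Submission Date later than the current Issue Date. This is an alternative added for illustration, not part of the original answer; it assumes df is the frame built above with both date columns already converted to datetime:
def status_for_group(g):
    # Sort the group's rows chronologically, then count, for each row,
    # how many earlier Submission Dates fall after the current Issue Date.
    g = g.sort_values('Issue Date')
    status = [(g['Submission Date'].iloc[:i] > issue).sum()
              for i, issue in enumerate(g['Issue Date'])]
    return pd.Series(status, index=g.index)

df['Status'] = (df.groupby(['Department', 'Employee'], group_keys=False)
                  .apply(status_for_group))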
