Pandas dataframe: Create additional column based on date columns comparison - python

Assuming I have the following dataset saved in a Pandas dataframe - note the last column [Status] is the column I'd like to create:
Department Employee Issue Date Submission Date ***Status***
A Joe 18/05/2014 25/06/2014 0
A Joe 1/06/2014 28/06/2014 1
A Joe 23/06/2014 30/06/2014 2
A Mark 1/03/2015 13/03/2015 0
A Mark 23/04/2015 15/04/2015 0
A William 15/07/2016 30/07/2016 0
A William 1/08/2016 23/08/2016 0
A William 20/08/2016 19/08/2016 1
B Liz 18/05/2014 7/06/2014 0
B Liz 1/06/2014 15/06/2014 1
B Liz 23/06/2014 16/06/2014 0
B John 1/03/2015 13/03/2015 0
B John 23/04/2015 15/04/2015 0
B Alex 15/07/2016 30/07/2016 0
B Alex 1/08/2016 23/08/2016 0
B Alex 20/08/2016 19/08/2016 1
I'd like to create an additional column [Status] based on the following conditions:
For every unique [Department] & [Employee] combination (e.g. there are three rows corresponding to Joe in Department A), sort the [Issue Date] from oldest to newest
If the current row's [Issue Date] is later than ALL previous rows' [Submission Date]s, then set [Status] to 0; otherwise [Status] = the number of previous rows whose [Submission Date] is later than the current [Issue Date]
As an example: for employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is after the [Issue Date], therefore [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', row 1 & 2's [Submission Date]s are both after the [Issue Date], therefore [Status] = 2 for row 3. We need to perform this calculation for every unique combination of Department and Employee.
Note: real dataset is not sorted nicely as the displayed example.

This question was posted 6 months ago but hopefully my answer still provides some help.
First, import the libraries and make the dataframe:
# import libraries
import numpy as np
import pandas as pd
# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
'Employee' : ['Joe']*3 +\
['Mark']*2 +\
['William']*3 +\
['Liz']*3 +\
['John']*2 +\
['Alex']*3,
'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
'1/03/2015', '23/04/2015',
'15/07/2016', '1/08/2016', '20/08/2016',
'18/05/2014', '1/06/2014', '23/06/2014',
'1/03/2015', '23/04/2015',
'15/07/2016', '1/08/2016', '20/08/2016'],
'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
'13/03/2015', '15/04/2015',
'30/07/2016', '23/08/2016', '19/08/2016',
'7/06/2014', '15/06/2014', '16/06/2014',
'13/03/2015', '15/04/2015',
'30/07/2016', '23/08/2016', '19/08/2016']})
Second, convert Issue Date and Submission Date to datetime:
# Convert 'Issue Date', 'Submission Date' to pd.datetime
df['Issue Date'] = pd.to_datetime(df['Issue Date'],
dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'],
dayfirst = True)
Third, sort the values by Department, Employee, and Issue Date, then reset the index:
# Sort by 'Department', 'Employee', 'Issue Date' and reset the index;
# assign the result back so that df itself is sorted
df = df.sort_values(by = ['Department',
'Employee',
'Issue Date']).reset_index(drop = True)
Fourth, group by Department, Employee; cumulative count the rows; insert into the original df:
# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
'grouped count',
df.groupby(['Department',
'Employee']).cumcount())
Fifth, create a no_issue and no_submission dataframe and merge them together on Department and Employee:
# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)
# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)
# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
how = 'outer',
on = ['Department',
'Employee'])
This duplicates each Submission Date once for every Issue Date in its Department, Employee group.
Here's what it looks like for Joe:
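As a minimal sketch of that merge restricted to Joe's three rows (the names joe and pairs are illustrative and not part of the answer's code), every Submission Date gets paired with every Issue Date of the same employee:

```python
import pandas as pd

# Joe's rows only, already sorted by Issue Date, with the per-group row counter
joe = pd.DataFrame({
    'Department': ['A'] * 3,
    'Employee': ['Joe'] * 3,
    'Issue Date': pd.to_datetime(['18/05/2014', '1/06/2014', '23/06/2014'],
                                 dayfirst=True),
    'Submission Date': pd.to_datetime(['25/06/2014', '28/06/2014', '30/06/2014'],
                                      dayfirst=True),
    'grouped count': [0, 1, 2],
})

# Same merge as above: the left side keeps Submission Date, the right side keeps Issue Date
pairs = joe.drop('Issue Date', axis=1).merge(
    joe.drop('Submission Date', axis=1),
    how='outer',
    on=['Department', 'Employee'],
)
print(pairs)
```

Three rows on each side give 3 x 3 = 9 pairs, and the shared 'grouped count' column comes back as 'grouped count_x' (Submission Date side) and 'grouped count_y' (Issue Date side).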
Sixth, create a dataframe that only keeps rows where grouped count_x is less than grouped count_y, then sort by Department, Employee, and Issue Date:
# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
'Employee',
'Issue Date'])
Seventh, insert the status column as a boolean where Issue Date is less than Submission Date:
# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
'Status',
merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])
Eighth, group by Department, Employee, and Issue Date, sum the Status, and reset the index:
# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
merged1 = merged1.groupby(['Department',
'Employee',
'Issue Date']).agg({'Status' : 'sum'}).reset_index()
This will return a dataframe with all the correct Statuses minus the minimum Issue Date for each Department, Employee group
Ninth, group the original merged dataframe by Department and Employee, find the minimum Issue Date, and reset the index:
# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
'Employee']).agg({'Issue Date' : 'min'}).reset_index()
Tenth, concatenate merged1 with merged, fill the na with 0 (since the minimum Issue Date will always have a Status of 0) and sort by Department, Employee, and Issue Date:
# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
'Employee',
'Issue Date'])
Eleventh, inner merge the original df with the concatenated dataframe on Department, Employee, and Issue Date, then drop the grouped count:
# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
how = 'inner',
on = ['Department',
'Employee',
'Issue Date']).drop('grouped count',
axis = 1)
Voila! Here is your final dataframe:
# Final df
final
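For comparison, the same rule can be sketched much more directly. This is not the method used in the answer above, only a compact alternative under the same assumptions (date columns already converted to datetime, strict comparison Issue Date < Submission Date); the helper name add_status and the demo frame are made up for illustration:

```python
import pandas as pd

def add_status(df):
    """For each (Department, Employee), count the earlier-issued rows whose
    Submission Date is later than the current row's Issue Date."""
    out = df.sort_values(['Department', 'Employee', 'Issue Date']).reset_index(drop=True)
    status = pd.Series(0, index=out.index, dtype='int64')
    for _, group in out.groupby(['Department', 'Employee']):
        issue = group['Issue Date'].to_numpy()
        submitted = group['Submission Date'].to_numpy()
        for pos, idx in enumerate(group.index):
            # Only rows before the current one (in Issue Date order) are compared
            status.loc[idx] = int((submitted[:pos] > issue[pos]).sum())
    out['Status'] = status
    return out

# Hypothetical, deliberately unsorted subset of the question's data (Joe and Liz)
demo = pd.DataFrame({
    'Department': ['B', 'A', 'B', 'A', 'A', 'B'],
    'Employee': ['Liz', 'Joe', 'Liz', 'Joe', 'Joe', 'Liz'],
    'Issue Date': ['23/06/2014', '1/06/2014', '18/05/2014',
                   '23/06/2014', '18/05/2014', '1/06/2014'],
    'Submission Date': ['16/06/2014', '28/06/2014', '7/06/2014',
                        '30/06/2014', '25/06/2014', '15/06/2014'],
})
for col in ['Issue Date', 'Submission Date']:
    demo[col] = pd.to_datetime(demo[col], dayfirst=True)

result = add_status(demo)
print(result)
```

The inner loop is quadratic in the group size, which should be fine for a handful of rows per employee; for very large groups the merge-based approach above may scale better.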

Related

Pandas Dataframe Getting a count of semi-unique values from columns in a CSV

I don't think my title accurately conveys my question but I struggled on it for a bit.
I have a range of CSV files. These files contain column names and values. My current code works exactly as I want it to, in that it groups the data by time and then gets me a count of uses per hour and revenue per hour.
However I now want to refine this, in my CSV there is a column name called Machine Name. Each value in this column is unique, but they share the same naming scheme. They can either be Dryer #39 or Dryer #38 or Washer #1 or Washer #12. What I want is to get a count of Dryers and Washers used per hour and I do not care what number washer or dryer it was. Just that it was a washer or dryer.
Here is my code.
for i in range(1): # len(csvList))
    df = wr.s3.read_csv(path=[f's3://{csvList[i].bucket_name}/{csvList[i].key}'])
    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
        machines_used_per_hour=('Machine Name', 'count'),
        revenue_per_hour=('Total Revenue', 'sum')
    ).reset_index() # Reset the index for the timestamp column
    for j in df.iterrows():
        dbInsert = """INSERT INTO `store-machine-use`(store_id, timestamp, machines_used_per_hour, revenue_per_hour, notes) VALUES (%s, %s, %s, %s, %s)"""
        values = (int(storeNumberList[i]), str(j[1]['Timestamp']), int(j[1]['machines_used_per_hour']), int(j[1]['revenue_per_hour']),'')
        cursor.execute(dbInsert, values)
    cnx.commit()
This data enters the database and looks like:
store_id, Timestamp, machines_used_per_hour, revenue_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, Test
I want to get an individual count of the types of machines used every hour, in the case of my example it would look like:
store_id, Timestamp, machines_used_per_hour, revenue_per_hour, washers_per_hour, dryers_per_hour, notes
10, 2021-08-22 06:00:00, 4, 14, 1, 3, Test
you could use pd.Series.str.startswith and then sum in the aggregation:
df['is_dryer'] = df['Machine Name'].str.startswith('Dryer')
df['is_washer'] = df['Machine Name'].str.startswith('Washer')
df = df.groupby(df['Timestamp'].dt.floor('h')).agg(
machines_used_per_hour=('Machine Name', 'count'),
revenue_per_hour=('Total Revenue', 'sum'),
washers_per_hour=('is_washer', 'sum'),
dryers_per_hour=('is_dryer', 'sum')
).reset_index() # Reset the index for the timestamp column
note that if you need more complex pattern matching for determining which machine belongs to which category, you can use regexes with pd.Series.str.match
example
for instance with some fake data, if I have:
dataframe = pd.DataFrame(
{"machine": ["Dryer #1", "Dryer #2", "Washer #43", "Washer #89", "Washer #33"],
"aggregation_key": [1, 2, 1, 2, 2]}
)
after creating the boolean columns with
dataframe["is_dryer"] = dataframe.machine.str.startswith("Dryer")
dataframe["is_washer"] = dataframe.machine.str.startswith("Washer")
dataframe will be
machine aggregation_key is_dryer is_washer
0 Dryer #1 1 True False
1 Dryer #2 2 True False
2 Washer #43 1 False True
3 Washer #89 2 False True
4 Washer #33 2 False True
and then aggregation gives you what you want:
dataframe.groupby(dataframe["aggregation_key"]).agg(
washers_per_hour=('is_washer', 'sum'),
dryers_per_hour=('is_dryer', 'sum')
).reset_index()
result will be
aggregation_key washers_per_hour dryers_per_hour
0 1 1 1
1 2 2 1
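If the names are less regular than a fixed prefix, the regex route mentioned above could look roughly like this. The machine names below (including the "Stack Dryer" variant) are invented for illustration; only Series.str.match itself is a real pandas call:

```python
import pandas as pd

# Hypothetical machine names with inconsistent casing, spacing and prefixes
machines = pd.Series(
    ["Dryer #1", "dryer  #2", "Washer #43", "Stack Dryer #7", "Washer #89"]
)

# str.match anchors the pattern at the start of each string;
# case=False makes the comparison case-insensitive
is_dryer = machines.str.match(r"(?:stack\s+)?dryer\s*#\d+$", case=False)
is_washer = machines.str.match(r"washer\s*#\d+$", case=False)

print(is_dryer.sum(), is_washer.sum())
```

The resulting boolean series can be stored as is_dryer / is_washer columns and summed in the aggregation exactly as in the startswith version.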
you can use a regex to strip the machine number from each name and create a Machine Type series, which you can then aggregate on. Note the + quantifier, which removes the whole number rather than just its first digit.
df['Machine Type'] = df['Machine Name'].str.replace(r' #[0-9]+$', '', regex=True)
you can then group on the Machine Type together with the hour
df = df.groupby([df['Timestamp'].dt.floor('h'), 'Machine Type']).agg(
machines_used_per_hour=('Machine Type', 'count'),
revenue_per_hour=('Total Revenue', 'sum')
).reset_index()
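To get one row per hour with separate washer and dryer columns, as in the desired output of the question, one possible sketch combines the extracted machine type with pd.crosstab. The sample rows are made up, and the column names simply follow the question's naming:

```python
import pandas as pd

# Made-up usage log for a single store
df = pd.DataFrame({
    'Timestamp': pd.to_datetime([
        '2021-08-22 06:05', '2021-08-22 06:10', '2021-08-22 06:40',
        '2021-08-22 06:55', '2021-08-22 07:15',
    ]),
    'Machine Name': ['Dryer #39', 'Dryer #38', 'Washer #1', 'Dryer #12', 'Washer #12'],
    'Total Revenue': [3, 4, 5, 2, 6],
})

# Keep only the word before " #<number>"
df['Machine Type'] = df['Machine Name'].str.extract(r'^(\w+) #\d+$', expand=False)
hour = df['Timestamp'].dt.floor('h').rename('hour')

# Overall figures per hour
per_hour = df.groupby(hour).agg(
    machines_used_per_hour=('Machine Name', 'count'),
    revenue_per_hour=('Total Revenue', 'sum'),
)

# One column per machine type, counting uses per hour
counts = pd.crosstab(hour, df['Machine Type'])
counts.columns = [f'{name.lower()}s_per_hour' for name in counts.columns]

summary = per_hour.join(counts).reset_index()
print(summary)
```

The crosstab creates a column for every machine type it finds, so a new type in the data would show up as an extra column without further code changes.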

Locate the Upcoming Expiry date and Assign the Value based on it - Python Data frame

There are two dataframes. For each row of Dataframe 1, I need to find the nearest upcoming Expiry date in Dataframe 2 (on or after the Active date, within the same Category) and take its Value.
This is a sample. Original data contains thousands of rows
Dataframe 1
df_1 = pd.DataFrame({'Category': ['A','B'],
'Active date': ['2021-06-20','2021-06-25']})
Dataframe 2
df_2 = pd.DataFrame({'Category': ['A','A','A','A','A','B','B','B'],
'Expiry date': ['2021-05-22','2021-06-23','2021-06-24','2021-06-28','2021-07-26','2021-06-27','2021-06-28','2021-08-29'],
'Value': [20,21,23,45,12,34,17,34]})
Final Output -
The code I was trying -
df = pd.merge(df_1, df_2, on='Category', how='inner')
#Removed all the dates which are less than Active date
df = df.loc[(df_1['Active Date'] <= df_2['Expiry Date'])]
I believe this solution keeps a lot of your existing code and will accomplish what you are looking for.
df_1 = pd.DataFrame({'Category': ['A','B'],
'Active date': ['2021-06-20','2021-06-25']})
df_2 = pd.DataFrame({'Category': ['A','A','A','A','A','B','B','B'],
'Expiry date': ['2021-05-22','2021-06-23','2021-06-24','2021-06-28','2021-07-26','2021-06-27','2021-06-28','2021-08-29'],
'Value': [20,21,23,45,12,34,17,34]})
df = pd.merge(df_1, df_2, on='Category', how='inner')
# Removed all the dates which are less than Active date
df = df.loc[(df['Active date'] <= df['Expiry date'])]
df = df.rename(columns={'Expiry date': 'Next Expiry Date'})
df = df.loc[df['Next Expiry Date'] == df.groupby('Category')['Next Expiry Date'].transform('min')]
Output:
Category Active date Next Expiry Date Value
1 A 2021-06-20 2021-06-23 21
5 B 2021-06-25 2021-06-27 34
You can use pandas merge_asof with direction set to forward. Note that for merge_asof, the key columns must be datetimes (or numbers) and both data frames must be sorted on them:
df_1['Active date'] = pd.to_datetime(df_1['Active date'])
df_2['Expiry date'] = pd.to_datetime(df_2['Expiry date'])
df_1 = df_1.sort_values('Active date')
df_2 = df_2.sort_values('Expiry date')
pd.merge_asof(df_1,
df_2,
left_on='Active date',
right_on='Expiry date',
direction='forward',
by='Category')
Category Active date Expiry date Value
0 A 2021-06-20 2021-06-23 21
1 B 2021-06-25 2021-06-27 34

How to resolve ValueError: cannot reindex from a duplicate axis

Input
Client First name Last Name Start Date End Date Amount Invoice Date
XXX John Kennedy 15-01-2021 28-02-2021 137,586.00 20-04-2021
YYY Peter Paul 7-02-2021 31-03-2021 38,750.00 20-04-2021
ZZZ Michael K 10-03-2021 29-04-2021 137,586.00 30-04-2021
Code
df = pd.read_excel ('file.xlsx',parse_dates=['Start Date','End Date'] )
df['Start Date'] = pd.to_datetime(df['Start Date'],format='%d-%m-%Y')
df['End Date'] = pd.to_datetime(df['End Date'],format='%d-%m-%Y')
df['r'] = df.apply(lambda x: pd.date_range(x['Start Date'],x['End Date']), axis=1)
df = df.explode('r')
print(df)
months = df['r'].dt.month
starts, ends = months.ne(months.groupby(level=0).shift(1)), months.ne(months.groupby(level=0).shift(-1))
df2 = pd.DataFrame({'First Name': df['First name'],
'Start Date': df.loc[starts, 'r'].dt.strftime('%Y-%m-%d'),
'End Date': df.loc[ends, 'r'].dt.strftime('%Y-%m-%d'),
'Date Diff': df.loc[ends, 'r'].dt.strftime('%d').astype(int)-df.loc[starts, 'r'].dt.strftime('%d').astype(int)+1})
df = df.loc[~df.index.duplicated(), :]
df2 = pd.merge(df, df2, left_index=True, right_index=True)
df2['Amount'] = df['Amount'].mul(df2['Date_Diff'])
print(df['Amount'])
print (df)
df.to_excel('report.xlsx', index=True)
Error
ValueError: cannot reindex from a duplicate axis
Expected output
how to resolve this issue?
Start with some correction in your input Excel file, namely change First name
to First Name - with capital "N", just like in other columns.
Then, to read your Excel file, it is enough to run:
df = pd.read_excel('Input.xlsx', parse_dates=['Start Date', 'End Date',
'Invoice Date'], dayfirst=True)
No need to call to_datetime.
Note also that since Invoice Date contains dates too, I added this column to the
parse_dates list.
Then define two functions:
A function to get monthly data for the current row:
def getMonthData(grp, amnt, dayNo):
    return pd.Series([grp.min(), grp.max(), amnt * grp.size / dayNo],
        index=['Start Date', 'End Date', 'Amount'])
It converts the input Series of dates (for a single month) into the "new" content of
the output rows (start / end dates and the proper share of the total amount, to be
accounted for this month).
It will be called in the following function.
A function to "explode" the current row:
def rowExpl(row):
    ind = pd.date_range(row['Start Date'], row['End Date']).to_series()
    # Group the days by calendar month (pandas 2.2+ spells this frequency 'ME')
    rv = ind.groupby(pd.Grouper(freq='M')).apply(getMonthData,
        amnt=row.Amount, dayNo=ind.size).unstack().reset_index(drop=True)
    rv.insert(0, 'Client', row.Client)
    rv.insert(1, 'First Name', row['First Name'])
    rv.insert(2, 'Last Name', row['Last Name'])
    return rv.assign(**{'Invoice Date': row['Invoice Date']})
And the last step is to get the result. Apply rowExpl to each row and concatenate
the partial results into a single output DataFrame:
result = pd.concat(df.apply(rowExpl, axis=1).values, ignore_index=True)
The result, for your data sample is:
Client First Name Last Name Start Date End Date Amount Invoice Date
0 XXX John Kennedy 2021-01-15 2021-01-31 51976.9 2021-04-20
1 XXX John Kennedy 2021-02-01 2021-02-28 85609.1 2021-04-20
2 YYY Peter Paul 2021-02-07 2021-02-28 16084.9 2021-04-20
3 YYY Peter Paul 2021-03-01 2021-03-31 22665.1 2021-04-20
4 ZZZ Michael K 2021-03-10 2021-03-31 59350.8 2021-04-30
5 ZZZ Michael K 2021-04-01 2021-04-29 78235.2 2021-04-30
Don't be put off by the seemingly low precision of the Amount column.
It is only the way Jupyter Notebook displays the DataFrame.
When you run result.iloc[0, 5], you will get:
51976.933333333334
with full, actually held precision.
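The first Amount can be checked by hand: the period 15-01-2021 to 28-02-2021 has 45 days, 17 of them in January, so January's share is 137586 * 17 / 45. A small sketch of that arithmetic, independent of the functions above (the variable names are illustrative):

```python
import pandas as pd

total_amount = 137586.00
days = pd.date_range('2021-01-15', '2021-02-28').to_series()

# Number of days that fall into each calendar month of the period
days_per_month = days.groupby(days.dt.to_period('M')).size()

# Each month receives the total in proportion to its number of days
shares = total_amount * days_per_month / days_per_month.sum()
print(days_per_month)
print(shares)
```

The two shares add up to the original amount again, which is a quick sanity check for any row.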

Get a single rating against particular id by finding nearest date from a series of dates

Dataframe 1 has three columns (customer_id, date and rating) and Dataframe 2 has three as well (customer_id, start_date, instrument_id). For each instrument_id in DF2, I need the rating from DF1 whose date is closest to start_date.
DF1:
customer_id date rating
84952608 31-Mar-20 4-
84952608 31-Dec-19 3-
84952608 30-Jun-19 4
84952608 31-Mar-19 5-
DF2:
Instrument_id customer_id start_date
000LCLN190240003 84952608 31-Mar-2019
Result DF:
Instrument_id customer_id rating
000LCLN190240003 84952608 5-
5- selected since start_date is closest to date
I got a working sample, however the compute time is significant in this case. For around 3k records it takes around 40-50 seconds
DF2 is exposure and DF1 is file
for w in range(len(exposure)):
    max_preceeding_date = file.loc[(file['customer_id']==exposure.loc[w,'customer_id']) & (file['date']<=exposure.loc[w,'start_date']),['rating','date']].sort_values('date', ascending=False)
    value = max_preceeding_date.iloc[0,0]
I also tried using df.merge to first merge both the dataframes, however unable to figure out how to use groupby to get the final output.
Appreciate your time and effort in helping on this one.
Merging dataframes and comparing datetime objects:
In [254]: res_df = df2.merge(df1, how='left', on='customer_id')
In [255]: res_df[['start_date', 'date']] = res_df[['start_date', 'date']].apply(lambda s: pd.to_datetime(s))
In [256]: res_df[res_df['date'] <= res_df['start_date']].sort_values(['start_date', 'date'], ascending=[False, False]).drop(['start_date', 'date'], axis=1)
Out[256]:
Instrument_id customer_id rating
3 000LCLN190240003 84952608 5-
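Since the concern in the question is the run time of the row-by-row loop, a vectorised variant may be worth trying. The sketch below uses pd.merge_asof with direction='backward', which mirrors the loop in the question (latest rating dated on or before start_date); it is an alternative, not the method of the answer above, and the date formats are assumed from the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'customer_id': [84952608] * 4,
    'date': ['31-Mar-20', '31-Dec-19', '30-Jun-19', '31-Mar-19'],
    'rating': ['4-', '3-', '4', '5-'],
})
df2 = pd.DataFrame({
    'Instrument_id': ['000LCLN190240003'],
    'customer_id': [84952608],
    'start_date': ['31-Mar-2019'],
})

# Assumed formats: two-digit year in DF1, four-digit year in DF2
df1['date'] = pd.to_datetime(df1['date'], format='%d-%b-%y')
df2['start_date'] = pd.to_datetime(df2['start_date'], format='%d-%b-%Y')

# merge_asof needs both frames sorted on their key columns
result = pd.merge_asof(
    df2.sort_values('start_date'),
    df1.sort_values('date'),
    left_on='start_date',
    right_on='date',
    by='customer_id',
    direction='backward',
)
print(result[['Instrument_id', 'customer_id', 'rating']])
```

With direction='nearest' the match would instead be the closest date in either direction, if that is what "closest" is meant to cover.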

Pandas group, aggregate two columns and return the earliest Start Date for one column

I am trying to group by a csv file in Pandas (by one column: ID) in order to get the earliest Start Date and latest End Date. Then I am trying to group by multiple columns in order to get the SUM of a value. For each ID in the second groupedby dataframe, I want to present the dates.
I am loading a csv in order to group and aggregate data.
01) First I load the csv
def get_csv():
    #Read csv file
    df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",parse_dates=['Start Date', 'End Date'])
    return df
02) Group and aggregate the data for the columns (ID and Site)
def do_stuff():
    df = get_csv()
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which works as expected and I am getting the following (example):
03) And ideally, for the same ID I want to present the earliest date in the Start Date column and the latest one in the End Date column. The aggregation for the value works perfectly. What I want to get is the following:
I do not know how to change my current code above. I have tried this so far:
def do_stuff():
    df = get_csv()
    md = get_csv()
    minStart = md[md['A or B'].str.contains('AAAA')].groupby([md['ID']]).agg({'Start Date': 'min'})
    df['earliestStartDate'] = minStart
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which fails and also tried changing the above to:
def do_stuff():
    df = get_csv()
    md = get_csv()
    df['earliestStartDate'] = md.loc[ md['ID'] == df['ID'], 'Start Date'].min()
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
Ideally, I will just change something in the groupedBy instead of having to read the csv twice and aggregate the data twice. Is that possible? If not, what can I change to make the script work? I am trying to test random things to get more experience in Pandas and Python.
I am guessing I have to create two dataframes here. One to get the groupedby data for all the columns needed (and the SUM of the Value). A second one to get the earliest Start Date and latest End Date for each ID. Then I need to find a way to concatenate the two dataframes. Is that a good result or do you think that there is an easier way to achieve that?
UPD: My code where I have created two dataframes (not sure whether this is the right solution) is given below:
#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1", parse_dates=['Start Date', 'End Date'])
md = pd.read_csv('myFile.csv', encoding = "ISO-8859-1", parse_dates=['Start Date', 'End Date'])
#Calculate the Clean Value
df['Clean Cost'] = (df['Value'] - df['Value2']) #.apply(lambda x: round(x,0))
#Get the min/max Dates
minMaxDates = md[md['Random'].str.contains('Y')].groupby([md['ID']]).agg({'Start Date': 'min', 'End Date': 'max'})
#Group by and aggregate (return Earliest Start Date, Latest End Date and SUM of the Values)
groupedBy = df[df['Random'].str.contains('Y')].groupby([df['ID'], df['Site'].fillna('Other')]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum', 'Value2': 'sum', 'Clean Cost': 'sum'})
and if I print the two dataframes, I am getting the following:
and
If I print the df.head(), I am getting the following:
ID A or B Start Date End Date Value Site Value2 Random
0 45221 AAAA 2017-12-30 2017-09-30 14 S111 7 Y
1 45221 AAAA 2017-01-15 2017-09-30 15 S222 7 Y
2 85293 BBBB 2017-05-12 2017-07-24 29 S111 3 Y
3 85293 AAAA 2017-03-22 2017-10-14 32 S222 4 Y
4 45221 AAAA 2017-01-15 2017-09-30 30 S222 7 Y
A link of the file is given here:LINK
I think you need transform:
df = pd.read_csv('sampleBionic.csv')
print (df)
ID A or B Start Date End Date Value Site Value2 Random
0 45221 AAAA 12/30/2017 09/30/2017 14 S111 7 Y
1 45221 AAAA 01/15/2017 09/30/2017 15 S222 7 Y
2 85293 BBBB 05/12/2017 07/24/2017 29 S111 3 Y
3 85293 AAAA 03/22/2017 10/14/2017 32 S222 4 Y
4 45221 AAAA 01/15/2017 09/30/2017 30 S222 7 Y
groupedBy = (df[df['A or B'].str.contains('AAAA')]
.groupby([df['ID'], df['Site'].fillna('Other'),])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 12/30/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
g = groupedBy.groupby(level=0)
groupedBy['Start Date'] = g['Start Date'].transform('min')
groupedBy['End Date'] = g['End Date'].transform('max')
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 01/15/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
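The dates in the printout above are still strings, so 'min' and 'max' compare them as text. A sketch of the same transform idea with the dates parsed first, applied before the (ID, Site) aggregation so that the file is read and filtered only once. The frame below re-creates the sample rows, and the name subset is illustrative:

```python
import pandas as pd

# Sample rows from the question, with the date columns parsed as datetimes
df = pd.DataFrame({
    'ID': [45221, 45221, 85293, 85293, 45221],
    'A or B': ['AAAA', 'AAAA', 'BBBB', 'AAAA', 'AAAA'],
    'Start Date': pd.to_datetime(
        ['12/30/2017', '01/15/2017', '05/12/2017', '03/22/2017', '01/15/2017'],
        format='%m/%d/%Y'),
    'End Date': pd.to_datetime(
        ['09/30/2017', '09/30/2017', '07/24/2017', '10/14/2017', '09/30/2017'],
        format='%m/%d/%Y'),
    'Value': [14, 15, 29, 32, 30],
    'Site': ['S111', 'S222', 'S111', 'S222', 'S222'],
})

subset = df[df['A or B'].str.contains('AAAA')].copy()
subset['Site'] = subset['Site'].fillna('Other')

# transform returns one value per row, so every row of an ID receives
# that ID's earliest Start Date and latest End Date
subset['Start Date'] = subset.groupby('ID')['Start Date'].transform('min')
subset['End Date'] = subset.groupby('ID')['End Date'].transform('max')

groupedBy = subset.groupby(['ID', 'Site']).agg(
    {'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
print(groupedBy)
```

Because the dates are already identical within an ID after the transform, the 'min' and 'max' in the final aggregation only pick that shared value.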
I have managed to create a script that does what I want. I will paste the answer in case somebody needs it in the future. Jezrael's answer worked fine too. So, considering that the original csv is like this:
my script is:
import pandas as pd

def get_csv():
    # Read the csv file and keep only the 'AAAA' rows
    df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1", parse_dates=['Start Date', 'End Date'])
    df = df[df['A or B'].str.contains('AAAA')]
    return df

def do_stuff():
    df = get_csv()
    # Get the min Start Date and max End Date per ID
    varA = 'ID'
    dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'})
    varsToKeep = ['ID', 'Site', 'Random', 'Start Date_grp', 'End Date_grp', 'Value', 'Value2']
    # Attach the per-ID dates (suffix '_grp') to every row of the original data
    dfTemp = pd.merge(df, dfGrouped, how='inner', on='ID', suffixes=(' ', '_grp'))[varsToKeep]
    dfBreakDown = dfTemp.groupby(['ID', 'Site', 'Random', 'Start Date_grp',
                                  'End Date_grp']).sum()
    # Calculate the Net Cost
    dfTemp['Net Cost'] = (dfTemp['Value'] - dfTemp['Value2'])
    # Sum the values per ID, Site and Random, keeping the per-ID dates
    groupedBy = dfTemp.groupby(['ID', 'Site', 'Random']).agg({'Start Date_grp': 'min', 'End Date_grp': 'max', 'Value': 'sum', 'Value2': 'sum', 'Net Cost': 'sum'})
    csvoutput(groupedBy)

def csvoutput(df):
    # Csv output; the remaining to_csv options are left at their defaults
    df.to_csv('OUT.csv', sep=',', header=True, index=True)

if __name__ == "__main__":
    # start things here
    do_stuff()
