I collect data and analyze it. Sometimes data collected earlier (yesterday or last week, say) is missing a value that gets filled in when the record becomes available at a later date, or a row's value is modified afterwards. See the sample dataframes:
First dataframe to receive
import pandas as pd
cars = {'Date': ['2020-09-11','2020-10-11','2021-01-12','2020-01-03', '2021-02-01'],
'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Mercedes'],
'Price': [22000,25000,27000,35000,45000],
'Mileage': [2000,'NAN',47000,3500,5000]
}
df = pd.DataFrame(cars, columns = ['Date','Brand', 'Price', 'Mileage'])
print (df)
Modification done on first dataframe
import pandas as pd
cars2 = {'Date': ['2020-09-11','2020-10-11','2021-01-12','2020-01-03', '2021-02-01'],
'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4','Mercedes'],
'Price': [22000,5000,27000,35000,45000],
'Mileage': [2000,100,47000,3500,600]
}
df2 = pd.DataFrame(cars2, columns = ['Date','Brand', 'Price', 'Mileage'])
print (df2)
Now I'd like to know how I can select only the rows that were modified relative to the first dataframe. My expected output contains only the rows that were modified at a later date. I have tried this, but it gives me the old rows too:
df_diff = pd.concat([df,df2], sort=False).drop_duplicates(keep=False, inplace=False)
Expected output
import pandas as pd
cars3 = {'Date': ['2020-10-11', '2021-02-01'],
'Brand': ['Toyota Corolla','Mercedes'],
'Price': [5000,45000],
'Mileage': [100,600]
}
df3 = pd.DataFrame(cars3, columns = ['Date','Brand', 'Price', 'Mileage'])
print (df3)
Because both dataframes have the same index and columns, you can use DataFrame.ne to compare them element-wise for inequality, test whether at least one value per row is True with DataFrame.any(axis=1), and filter with boolean indexing:
df3 = df2[df.ne(df2).any(axis=1)]
print (df3)
Date Brand Price Mileage
1 2020-10-11 Toyota Corolla 5000 100
4 2021-02-01 Mercedes 45000 600
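The comparison above relies on both dataframes sharing the same index. If the updated data can arrive in a different row order, one option is to align the two frames on key columns first. The following is a minimal sketch, assuming that Date and Brand together identify a record uniquely (an assumption, not something stated in the question); the names old, new and keys are illustrative only. Note that ne treats NaN as different from everything, including another NaN, so a value that is missing in both frames would still be flagged.

```python
import pandas as pd

old = pd.DataFrame({
    'Date': ['2020-09-11', '2020-10-11', '2021-02-01'],
    'Brand': ['Honda Civic', 'Toyota Corolla', 'Mercedes'],
    'Price': [22000, 25000, 45000],
    'Mileage': [2000, None, 5000],
})
# Same records, delivered in a different row order, with two rows updated.
new = pd.DataFrame({
    'Date': ['2021-02-01', '2020-09-11', '2020-10-11'],
    'Brand': ['Mercedes', 'Honda Civic', 'Toyota Corolla'],
    'Price': [45000, 22000, 5000],
    'Mileage': [600, 2000, 100],
})

keys = ['Date', 'Brand']  # assumed to identify a record uniquely
old_k = old.set_index(keys)
new_k = new.set_index(keys)

# Put the old rows in the same order as the new rows, then compare value by value.
old_aligned = old_k.reindex(new_k.index)
changed = new_k[new_k.ne(old_aligned).any(axis=1)].reset_index()
print(changed)
```

Rows that exist only in the new frame come out of reindex as all-NaN, so they are reported as changed as well.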
Related
import pandas as pd
list_sample = [{'name': 'A', 'fame': 0, 'data': {'date':['2021-01-01', '2021-02-01', '2021-03-01'],
'credit_score':[800, 890, 895],
'spend':[1500, 25000, 2400],
'average_spend':5000}},
{'name': 'B', 'fame': 1, 'data': {'date':['2022-01-01', '2022-02-01', '2022-03-01'],
'credit_score':[2800, 390, 8900],
'spend':[15000, 5000, 400],
'average_spend':3000}}]
df = pd.DataFrame()
for row in list_sample:
    name = row['name']
    fame = row['fame']
    data = row['data']
    df_temp = pd.DataFrame(data)
    df_temp['name'] = name
    df_temp['fame'] = fame
    df = pd.concat([df, df_temp])
Above is how I am building my dataframe. It is a dummy example, but the problem is that it takes a lot of time as the list grows and as the number of entries in each data array grows. Maybe concat is the issue, or maybe something else. Is there a better way (in terms of run time) to do this?
One way of doing this is to flatten the nested data dictionary inside each element of the list_sample list. You can do this with json_normalize.
import pandas as pd
# json_normalize is available as pd.json_normalize since pandas 1.0;
# the pandas.io.json import path is deprecated
df = pd.DataFrame(list_sample)
df = pd.concat([df.drop(['data'], axis=1), pd.json_normalize(df['data'].tolist())], axis=1)
It looks like you don't care about normalizing the data column. If that's the case, you can just do df = pd.DataFrame(list_sample) to achieve the same result. I think you'd only need to do the kind of iterating you're doing if you wanted to normalize the data.
Combine all dicts in list_sample to fit a dataframe structure and concat them at once:
df = pd.concat([pd.DataFrame(d['data'] | {'name': d['name'], 'fame': d['fame']})
for d in list_sample])
print(df)
date credit_score spend average_spend name fame
0 2021-01-01 800 1500 5000 A 0
1 2021-02-01 890 25000 5000 A 0
2 2021-03-01 895 2400 5000 A 0
0 2022-01-01 2800 15000 3000 B 1
1 2022-02-01 390 5000 3000 B 1
2 2022-03-01 8900 400 3000 B 1
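The dict merge operator | used above needs Python 3.9 or later. As a hedged alternative sketch of the same "build all frames first, concatenate once" idea, DataFrame.assign can attach the name and fame columns instead; the variable frames is illustrative, and ignore_index=True is an optional choice that gives a fresh 0..n-1 index rather than the repeated 0, 1, 2 shown above.

```python
import pandas as pd

list_sample = [
    {'name': 'A', 'fame': 0,
     'data': {'date': ['2021-01-01', '2021-02-01', '2021-03-01'],
              'credit_score': [800, 890, 895],
              'spend': [1500, 25000, 2400],
              'average_spend': 5000}},
    {'name': 'B', 'fame': 1,
     'data': {'date': ['2022-01-01', '2022-02-01', '2022-03-01'],
              'credit_score': [2800, 390, 8900],
              'spend': [15000, 5000, 400],
              'average_spend': 3000}},
]

# Build every small frame first, then concatenate exactly once.
# Calling pd.concat inside the loop copies the growing result on every iteration.
frames = [
    pd.DataFrame(d['data']).assign(name=d['name'], fame=d['fame'])
    for d in list_sample
]
df = pd.concat(frames, ignore_index=True)
print(df)
```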
Here I am creating two dataframes and then merging them. How can I verify that all the columns are incorporated in the merged DataFrame by using a simple comparison operator in Python?
import pandas as pd
# elements of first dataset
first_Set = {'Prod': ['Laptop', 'Mobile Phone',
'Desktop', 'LED'],
'Price_1': [25000, 8000, 20000, 35000]
}
# creation of Dataframe 1
df1 = pd.DataFrame(first_Set, columns=['Prod', 'Price_1'])
print(df1)
# elements of second dataset
second_Set = {'Prod': ['Laptop', 'Mobile Phone',
'Desktop', 'LED'],
'Price_2': [25000, 10000, 15000, 30000]
}
# creation of Dataframe 2
df2 = pd.DataFrame(second_Set, columns=['Prod', 'Price_2'])
print(df2)
# merging datasets on column = Prod
df_tech = pd.merge(df1, df2, on = 'Prod')
print(df_tech)
Let's say my modified df1 and df2 have different sets of 'Prod':
import pandas as pd
df1 = pd.DataFrame({'Prod': ['Mobile Phone', 'Desktop', 'LED'], 'Price_1': [8000, 20000, 35000]})
df2 = pd.DataFrame({'Prod': ['Laptop', 'Mobile Phone', 'Desktop'], 'Price_2': [25000, 10000, 15000]})
By default, pd.merge performs an inner merge, so you only see the rows whose 'Prod' appears in both dataframes.
df_tech = pd.merge(df1, df2, on = 'Prod')
print(df_tech)
Prod Price_1 Price_2
0 Mobile Phone 8000 10000
1 Desktop 20000 15000
Approach 1 is to do an outer merge instead:
option1 = df1.merge(df2, on='Prod', how='outer')
Then check which rows contain NaN to identify those that would have been dropped by an inner merge:
print(option1[option1.isna().any(axis=1)])
Prod Price_1 Price_2
2 LED 35000.0 NaN
3 Laptop NaN 25000.0
LED and Laptop show up, as expected.
Another approach is to compare each of the two original dataframes with the merged one, using isin:
option2a = df1[~df1['Prod'].isin(df_tech['Prod'])]
option2b = df2[~df2['Prod'].isin(df_tech['Prod'])]
option2a, for example, gives us the rows of df1 that are not carried over to the merged dataframe:
print(option2a)
Prod Price_1
2 LED 35000
and similar for option2b
print(option2b)
Prod Price_2
0 Laptop 25000
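A third option, sketched here as a suggestion rather than as part of the answer above, is the indicator parameter of merge, which labels each row of an outer merge with where it came from. The same sketch also shows a plain set comparison for the column check the question asks about. The names check, unmatched and all_columns_present are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Prod': ['Mobile Phone', 'Desktop', 'LED'],
                    'Price_1': [8000, 20000, 35000]})
df2 = pd.DataFrame({'Prod': ['Laptop', 'Mobile Phone', 'Desktop'],
                    'Price_2': [25000, 10000, 15000]})

# indicator=True adds a '_merge' column holding 'both', 'left_only' or 'right_only'.
check = df1.merge(df2, on='Prod', how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Prod', '_merge']])

# Column check: every column of both inputs should be present in the merged frame.
df_tech = df1.merge(df2, on='Prod')
all_columns_present = (set(df1.columns) | set(df2.columns)) <= set(df_tech.columns)
print(all_columns_present)
```

The <= operator on sets is a subset test, so all_columns_present is True only when no input column went missing in the merge.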
There are two dataframes. For each Category, I need to extract the nearest upcoming Expiry date from Dataframe 2 relative to the Active date in Dataframe 1, in order to obtain the correct Value.
This is a sample. Original data contains thousands of rows
Dataframe 1
df_1 = pd.DataFrame({'Category': ['A','B'],
'Active date': ['2021-06-20','2021-06-25']})
Dataframe 2
df_2 = pd.DataFrame({'Category': ['A','A','A','A','A','B','B','B'],
'Expiry date': ['2021-05-22','2021-06-23','2021-06-24','2021-06-28','2021-07-26','2021-06-27','2021-06-28','2021-08-29'],
'Value': [20,21,23,45,12,34,17,34]})
Final Output -
The code I was trying -
df = pd.merge(df_1, df_2, on='Category', how='inner')
#Removed all the dates which are less than Active date
df = df.loc[(df_1['Active Date'] <= df_2['Expiry Date'])]
I believe this solution keeps a lot of your existing code and will accomplish what you are looking for.
df_1 = pd.DataFrame({'Category': ['A','B'],
'Active date': ['2021-06-20','2021-06-25']})
df_2 = pd.DataFrame({'Category': ['A','A','A','A','A','B','B','B'],
'Expiry date': ['2021-05-22','2021-06-23','2021-06-24','2021-06-28','2021-07-26','2021-06-27','2021-06-28','2021-08-29'],
'Value': [20,21,23,45,12,34,17,34]})
df = pd.merge(df_1, df_2, on='Category', how='inner')
# Removed all the dates which are less than Active date
df = df.loc[(df['Active date'] <= df['Expiry date'])]
df = df.rename(columns={'Expiry date': 'Next Expiry Date'})
df = df.loc[df['Next Expiry Date'] == df.groupby('Category')['Next Expiry Date'].transform('min')]
Output:
Category Active date Next Expiry Date Value
1 A 2021-06-20 2021-06-23 21
5 B 2021-06-25 2021-06-27 34
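The solution above compares the dates as strings, which happens to work because ISO-formatted dates sort chronologically. A minimal variant sketch, assuming you would rather compare real datetimes, converts both columns first and picks the earliest remaining expiry per category with idxmin; the name nearest is illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({'Category': ['A', 'B'],
                     'Active date': ['2021-06-20', '2021-06-25']})
df_2 = pd.DataFrame({'Category': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
                     'Expiry date': ['2021-05-22', '2021-06-23', '2021-06-24', '2021-06-28',
                                     '2021-07-26', '2021-06-27', '2021-06-28', '2021-08-29'],
                     'Value': [20, 21, 23, 45, 12, 34, 17, 34]})

# Compare real datetimes rather than strings.
df_1['Active date'] = pd.to_datetime(df_1['Active date'])
df_2['Expiry date'] = pd.to_datetime(df_2['Expiry date'])

df = df_1.merge(df_2, on='Category', how='inner')
# Keep only expiries on or after the active date.
df = df[df['Expiry date'] >= df['Active date']]
# idxmin returns the row label of the earliest remaining expiry in each category.
nearest = df.loc[df.groupby('Category')['Expiry date'].idxmin()].reset_index(drop=True)
print(nearest)
```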
You can use pandas merge_asof with direction set to forward. Note that for merge_asof, both dataframes must be sorted by their key columns, and the keys must be datetime-like or numeric:
df_1['Active date'] = pd.to_datetime(df_1['Active date'])
df_2['Expiry date'] = pd.to_datetime(df_2['Expiry date'])
df_1 = df_1.sort_values('Active date')
df_2 = df_2.sort_values('Expiry date')
pd.merge_asof(df_1,
df_2,
left_on='Active date',
right_on='Expiry date',
direction='forward',
by='Category')
Category Active date Expiry date Value
0 A 2021-06-20 2021-06-23 21
1 B 2021-06-25 2021-06-27 34
I am trying to group by a csv file in Pandas (by one column: ID) in order to get the earliest Start Date and latest End Date. Then I am trying to group by multiple columns in order to get the SUM of a value. For each ID in the second groupedby dataframe, I want to present the dates.
I am loading a csv in order to group and aggregate data.
01) First I load the csv
def get_csv():
    # Read csv file
    df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1", parse_dates=['Start Date', 'End Date'])
    return df
02) Group and aggregate the data for the columns (ID and Site)
def do_stuff():
    df = get_csv()
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which works as expected and I am getting the following (example):
03) And ideally, for the same ID I want to present the earliest date in the Start Date column and the latest one in the End Date column. The aggregation for the value works perfectly. What I want to get is the following:
I do not know how to change my current code above. I have tried this so far:
def do_stuff():
    df = get_csv()
    md = get_csv()
    minStart = md[md['A or B'].str.contains('AAAA')].groupby([md['ID']]).agg({'Start Date': 'min'})
    df['earliestStartDate'] = minStart
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
which fails and also tried changing the above to:
def do_stuff():
    df = get_csv()
    md = get_csv()
    df['earliestStartDate'] = md.loc[ md['ID'] == df['ID'], 'Start Date'].min()
    groupedBy = df[df['A or B'].str.contains('AAAA')].groupby([df['ID'], df['Site'].fillna('Other'),df['earliestStartDate']]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'})
Ideally, I will just change something in the groupedBy instead of having to read the csv twice and aggregate the data twice. Is that possible? If not, what can I change to make the script work? I am trying to test random things to get more experience in Pandas and Python.
I am guessing I have to create two dataframes here. One to get the groupedby data for all the columns needed (and the SUM of the Value). A second one to get the earliest Start Date and latest End Date for each ID. Then I need to find a way to concatenate the two dataframes. Is that a good result or do you think that there is an easier way to achieve that?
UPD: My code where I have created two dataframes (not sure whether this is the right solution) is given below:
#Read csv file
df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
md = pd.read_csv('myFile.csv', encoding = "ISO-8859-1",mangle_dupe_cols=True, parse_dates=['Start Date', 'End Date'])
#Calculate the Clean Value
df['Clean Cost'] = (df['Value'] - df['Value2']) #.apply(lambda x: round(x,0))
#Get the min/max Dates
minMaxDates = md[md['Random'].str.contains('Y')].groupby([md['ID']]).agg({'Start Date': 'min', 'End Date': 'max'})
#Group by and aggregate (return Earliest Start Date, Latest End Date and SUM of the Values)
groupedBy = df[df['Random'].str.contains('Y')].groupby([df['ID'], df['Site'].fillna('Other')]).agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum', 'Value2': 'sum', 'Clean Cost': 'sum'})
and if I print the two dataframes, I am getting the following:
and
If I print the df.head(), I am getting the following:
      ID A or B  Start Date    End Date  Value  Site  Value2 Random
0  45221   AAAA  2017-12-30  2017-09-30     14  S111       7      Y
1  45221   AAAA  2017-01-15  2017-09-30     15  S222       7      Y
2  85293   BBBB  2017-05-12  2017-07-24     29  S111       3      Y
3  85293   AAAA  2017-03-22  2017-10-14     32  S222       4      Y
4  45221   AAAA  2017-01-15  2017-09-30     30  S222       7      Y
A link of the file is given here:LINK
I think you need transform:
df = pd.read_csv('sampleBionic.csv')
print (df)
ID A or B Start Date End Date Value Site Value2 Random
0 45221 AAAA 12/30/2017 09/30/2017 14 S111 7 Y
1 45221 AAAA 01/15/2017 09/30/2017 15 S222 7 Y
2 85293 BBBB 05/12/2017 07/24/2017 29 S111 3 Y
3 85293 AAAA 03/22/2017 10/14/2017 32 S222 4 Y
4 45221 AAAA 01/15/2017 09/30/2017 30 S222 7 Y
groupedBy = (df[df['A or B'].str.contains('AAAA')]
.groupby([df['ID'], df['Site'].fillna('Other'),])
.agg({'Start Date': 'min', 'End Date': 'max', 'Value': 'sum'}))
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 12/30/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
g = groupedBy.groupby(level=0)
groupedBy['Start Date'] = g['Start Date'].transform('min')
groupedBy['End Date'] = g['End Date'].transform('max')
print (groupedBy)
Start Date End Date Value
ID Site
45221 S111 01/15/2017 09/30/2017 14
S222 01/15/2017 09/30/2017 45
85293 S222 03/22/2017 10/14/2017 32
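In the output above the dates are still strings, so min and max compare them as text; that gives the right answer for this sample but is not guaranteed in general (for example across different years in MM/DD/YYYY format). Below is a minimal sketch of the same transform idea applied after parsing the dates. The sample rows are copied from the printed frame above; the names sel and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [45221, 45221, 85293, 85293, 45221],
    'A or B': ['AAAA', 'AAAA', 'BBBB', 'AAAA', 'AAAA'],
    'Start Date': ['12/30/2017', '01/15/2017', '05/12/2017', '03/22/2017', '01/15/2017'],
    'End Date': ['09/30/2017', '09/30/2017', '07/24/2017', '10/14/2017', '09/30/2017'],
    'Value': [14, 15, 29, 32, 30],
    'Site': ['S111', 'S222', 'S111', 'S222', 'S222'],
})

# Parse the dates so that min/max compare chronologically, not as text.
for col in ['Start Date', 'End Date']:
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

sel = df[df['A or B'].str.contains('AAAA')].copy()
sel['Site'] = sel['Site'].fillna('Other')

# transform broadcasts the per-ID earliest start / latest end back onto every row.
sel['Start Date'] = sel.groupby('ID')['Start Date'].transform('min')
sel['End Date'] = sel.groupby('ID')['End Date'].transform('max')

out = sel.groupby(['ID', 'Site', 'Start Date', 'End Date'], as_index=False)['Value'].sum()
print(out)
```

Doing the transform before the final groupby means the file is read once and the dates are already identical within each ID when the values are summed.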
I have managed to create a script that does what I want. I will paste the answer in case somebody needs it in the future. Jezrael's answer worked fine too. So, considering that the original csv is like this:
My script is:
import pandas as pd
import os
import csv
import time
import dateutil.parser as dparser
import datetime
def get_csv():
    # Read csv file and keep only the rows whose 'A or B' contains 'AAAA'
    df = pd.read_csv('myFile.csv', encoding = "ISO-8859-1", parse_dates=['Start Date', 'End Date'])
    df = df[df['A or B'].str.contains('AAAA')]
    return df
def do_stuff():
    df = get_csv()
    # Get the min Start Date, max End date, sum of the Value and Value2 and calculate the Net Cost
    varA = 'ID'
    dfGrouped = df.groupby(varA, as_index=False).agg({'Start Date': 'min', 'End Date': 'max'}).copy()
    varsToKeep = ['ID', 'Site', 'Random', 'Start Date_grp', 'End Date_grp', 'Value', 'Value2']
    dfTemp = pd.merge(df, dfGrouped, how='inner', on='ID', suffixes=(' ', '_grp'), copy=True)[varsToKeep]
    dfBreakDown = dfTemp.groupby(['ID', 'Site', 'Random', 'Start Date_grp',
                                  'End Date_grp']).sum()
    # Calculate the Net Cost
    dfTemp['Net Cost'] = (dfTemp['Value'] - dfTemp['Value2'])
    groupedBy = dfTemp.groupby(['ID', 'Site', 'Random']).agg({'Start Date_grp': 'min', 'End Date_grp': 'max', 'Value': 'sum', 'Value2': 'sum', 'Net Cost': 'sum'})
    csvoutput(groupedBy)
def csvoutput(df):
    # Csv output (arguments that only restated defaults or no longer exist in current pandas are omitted)
    df.to_csv('OUT.csv', sep=',', na_rep='', header=True, index=True, mode='w', quotechar='"', doublequote=True, decimal='.')

if __name__ == "__main__":
    # start things here
    do_stuff()
Assuming I have the following dataset saved in a Pandas dataframe - note the last column [Status] is the column I'd like to create:
Department Employee Issue Date Submission Date ***Status***
A Joe 18/05/2014 25/06/2014 0
A Joe 1/06/2014 28/06/2014 1
A Joe 23/06/2014 30/06/2014 2
A Mark 1/03/2015 13/03/2015 0
A Mark 23/04/2015 15/04/2015 0
A William 15/07/2016 30/07/2016 0
A William 1/08/2016 23/08/2016 0
A William 20/08/2016 19/08/2016 1
B Liz 18/05/2014 7/06/2014 0
B Liz 1/06/2014 15/06/2014 1
B Liz 23/06/2014 16/06/2014 0
B John 1/03/2015 13/03/2015 0
B John 23/04/2015 15/04/2015 0
B Alex 15/07/2016 30/07/2016 0
B Alex 1/08/2016 23/08/2016 0
B Alex 20/08/2016 19/08/2016 1
I'd like to create an additional column [Status] based on the following conditions:
For every unique [Department] & [Employee] combination (e.g. there are three rows corresponding to Joe in Department A), sort the [Issue Date] from oldest to newest
If the current row's [Issue Date] is greater than ALL previous rows' [Submission Date], then flag the [Status] with 0; otherwise [Status] = the number of previous rows for which [Issue Date] < [Submission Date]
As an example: for employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is after the [Issue Date], therefore [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', row 1 & 2's [Submission Date]s are both after the [Issue Date], therefore [Status] = 2 for row 3. We need to perform this calculation for every unique combination of Department and Employee.
Note: real dataset is not sorted nicely as the displayed example.
This question was posted 6 months ago but hopefully my answer still provides some help.
First, import the libraries and make the dataframe:
# import libraries
import numpy as np
import pandas as pd
# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
'Employee' : ['Joe']*3 +\
['Mark']*2 +\
['William']*3 +\
['Liz']*3 +\
['John']*2 +\
['Alex']*3,
'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
'1/03/2015', '23/04/2015',
'15/07/2016', '1/08/2016', '20/08/2016',
'18/05/2014', '1/06/2014', '23/06/2014',
'1/03/2015', '23/04/2015',
'15/07/2016', '1/08/2016', '20/08/2016'],
'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
'13/03/2015', '15/04/2015',
'30/07/2016', '23/08/2016', '19/08/2016',
'7/06/2014', '15/06/2014', '16/06/2014',
'13/03/2015', '15/04/2015',
'30/07/2016', '23/08/2016', '19/08/2016']})
Second, convert Issue Date and Submission Date to datetime:
# Convert 'Issue Date', 'Submission Date' to pd.datetime
df['Issue Date'] = pd.to_datetime(df['Issue Date'], dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'], dayfirst = True)
Third, reset the index and sort the values by Department, Employee, and Issue Date:
# Reset index and sort_values by 'Department', 'Employee', 'Issue Date'
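# (assign the result back: sort_values(inplace=True) on the output of reset_index would only sort a temporary copy)
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)
Fourth, group by Department, Employee; cumulative count the rows; insert into the original df: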
# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
'grouped count',
df.groupby(['Department',
'Employee']).cumcount())
Fifth, create a no_issue and no_submission dataframe and merge them together on Department and Employee:
# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)
# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)
# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
how = 'outer',
on = ['Department',
'Employee'])
This duplicates the Submission Date by the number of Issue Dates per Department,Employee group
Here's what it looks like for Joe:
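To see the effect on a single group, here is a small self-contained sketch that repeats steps one to five for Joe's three rows only. The names joe and merged_joe are illustrative; the dates are taken from the question's table.

```python
import pandas as pd

joe = pd.DataFrame({
    'Department': ['A'] * 3,
    'Employee': ['Joe'] * 3,
    'Issue Date': pd.to_datetime(['18/05/2014', '1/06/2014', '23/06/2014'],
                                 format='%d/%m/%Y'),
    'Submission Date': pd.to_datetime(['25/06/2014', '28/06/2014', '30/06/2014'],
                                      format='%d/%m/%Y'),
})
joe['grouped count'] = joe.groupby(['Department', 'Employee']).cumcount()

# Self-merge on the group keys: every Submission Date is paired with every Issue Date.
merged_joe = joe.drop(columns='Issue Date').merge(
    joe.drop(columns='Submission Date'),
    how='outer',
    on=['Department', 'Employee'])
print(merged_joe)
```

Three rows in the group produce 3 x 3 = 9 rows, and the two grouped count columns (suffixed _x and _y by merge) record which submission row and which issue row each pair came from.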
Sixth, create a dataframe that only keeps rows where grouped count_x is less than grouped count_y, then sort by Department, Employee, and Issue Date:
# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
'Employee',
'Issue Date'])
Seventh, insert the status column as a boolean where Issue Date is less than Submission Date:
# Insert 'Status' as a boolean when 'Issue Date' < 'Submission Date'
merged1.insert(merged.shape[1],
'Status',
merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])
Eighth, group by Department, Employee, and Issue Date, sum the Status, and reset the index:
# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : 'sum'}).reset_index()
This will return a dataframe with all the correct Statuses minus the minimum Issue Date for each Department, Employee group
Ninth, group the original merged dataframe by Department and Employee, find the minimum Issue Date, and reset the index:
# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
'Employee']).agg({'Issue Date' : 'min'}).reset_index()
Tenth, concatenate merged1 with merged, fill the na with 0 (since the minimum Issue Date will always have a Status of 0) and sort by Department, Employee, and Issue Date:
# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
'Employee',
'Issue Date'])
Eleventh, inner merge the original df with the concatenated dataframe on Department, Employee, and Issue Date, then drop the grouped count:
# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
how = 'inner',
on = ['Department',
'Employee',
'Issue Date']).drop('grouped count',
axis = 1)
Voila! Here is your final dataframe:
# Final df
final
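For comparison, the rule from the question can also be written as an explicit loop over each Department and Employee group. This is only a sketch of an alternative, not the method above: for each row it counts how many earlier rows (by Issue Date) have a Submission Date later than the current Issue Date. It uses a subset of the question's data (Joe, William and Liz), and the name status is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['A'] * 6 + ['B'] * 3,
    'Employee': ['Joe'] * 3 + ['William'] * 3 + ['Liz'] * 3,
    'Issue Date': ['18/05/2014', '1/06/2014', '23/06/2014',
                   '15/07/2016', '1/08/2016', '20/08/2016',
                   '18/05/2014', '1/06/2014', '23/06/2014'],
    'Submission Date': ['25/06/2014', '28/06/2014', '30/06/2014',
                        '30/07/2016', '23/08/2016', '19/08/2016',
                        '7/06/2014', '15/06/2014', '16/06/2014'],
})
for col in ['Issue Date', 'Submission Date']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')

status = pd.Series(0, index=df.index)
for _, group in df.groupby(['Department', 'Employee']):
    group = group.sort_values('Issue Date')
    issue = group['Issue Date'].to_numpy()
    submitted = group['Submission Date'].to_numpy()
    for i, row_label in enumerate(group.index):
        # Earlier rows whose Submission Date falls after the current Issue Date.
        status.loc[row_label] = int((submitted[:i] > issue[i]).sum())

df['Status'] = status
print(df)
```

The inner loop is quadratic in the group size, which is usually acceptable when each employee has only a handful of rows; the merge-based approach above trades that loop for a larger intermediate dataframe.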