dropna() not working on DataFrame with NaN values? - python

I have a data frame with NaN values. For some reason, df.dropna() doesn't work when I try to drop these rows. Any thoughts?
Example of a row:
30754 22 Nan Nan Nan Nan Nan Nan Jewellery-Women N
df = pd.read_csv('/Users/xxx/Desktop/CS 677/Homework_4/FashionDataset.csv')
df.dropna()
df.head().to_dict()
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'BrandName': {0: 'life',
1: 'only',
2: 'fratini',
3: 'zink london',
4: 'life'},
'Deatils': {0: 'solid cotton blend collar neck womens a-line dress - indigo',
1: 'polyester peter pan collar womens blouson dress - yellow',
2: 'solid polyester blend wide neck womens regular top - off white',
3: 'stripes polyester sweetheart neck womens dress - black',
4: 'regular fit regular length denim womens jeans - stone'},
'Sizes': {0: 'Size:Large,Medium,Small,X-Large,X-Small',
1: 'Size:34,36,38,40',
2: 'Size:Large,X-Large,XX-Large',
3: 'Size:Large,Medium,Small,X-Large',
4: 'Size:26,28,30,32,34,36'},
'MRP': {0: 'Rs\n1699',
1: 'Rs\n3499',
2: 'Rs\n1199',
3: 'Rs\n2299',
4: 'Rs\n1699'},
'SellPrice': {0: '849', 1: '2449', 2: '599', 3: '1379', 4: '849'},
'Discount': {0: '50% off',
1: '30% off',
2: '50% off',
3: '40% off',
4: '50% off'},
'Category': {0: 'Westernwear-Women',
1: 'Westernwear-Women',
2: 'Westernwear-Women',
3: 'Westernwear-Women',
4: 'Westernwear-Women'}}
This is what I get when using df.head().to_dict()

Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [12, 20, np.nan, np.nan],
                   "col2": [10, np.nan, np.nan, 40]})
df1 = df.dropna()
# df;
col1 col2
0 12.0 10.0
1 20.0 NaN
2 NaN NaN
3 NaN 40.0
# df1;
col1 col2
0 12.0 10.0
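Note that dropna() returns a new DataFrame rather than modifying the frame in place, so in the question's code the result of df.dropna() is discarded. Also, if the file literally contains the text "Nan" (as in the sample row shown), pandas reads it as a string, not a missing value. A minimal sketch of both fixes (the CSV path is a placeholder):

```python
import numpy as np
import pandas as pd

# dropna() returns a new DataFrame -- it does not modify df in place,
# so the result must be assigned back (or pass inplace=True):
df = pd.DataFrame({"col1": [12, 20, np.nan], "col2": [10, np.nan, 40]})
df = df.dropna()

# If the file literally contains the text "Nan", pandas reads it as a
# string, not a missing value.  Tell read_csv to treat it as NA
# (the filename here is a placeholder):
# df = pd.read_csv("FashionDataset.csv", na_values="Nan")
```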


Column Pair-wise aggregation and reorganization in Pandas

I am importing a csv file into a pandas dataframe such as:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
0   1         2                   3              4                   5              6                   7              8                   9
ID  Net Cost  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount
1   30        Surcharge A         9.5            Discount X          -11.5          Discount Y          -3.25          Surcharge B         2.5
2   40        Discount X          -12.5
3   50        Discount X          -11.5
4   35        Discount X          -5.5           Surcharge B         3.5
5   45        Surcharge A         9.5            Discount X          -10.5          Surcharge B         4.5
The first row contains the headers, with the column names Charge Description and Charge Amount forming pairs and appearing multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
ID  Net Cost  Surcharge A  Surcharge B  Discount X  Discount Y
1   30        9.5          2.5          -11.5       -3.25
2   40        0            0            -12.5       0
3   50        0            0            -11.5       0
4   35        0            3.5          -5.5        0
5   45        9.5          4.5          -10.5       0
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
I used the file you shared and reset the column names to match the initial dataframe df (pandas automatically adds suffixes to duplicate columns to make them unique), in order to keep the non-unique names:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount',
'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
The first step is to transform to long form with pivot_longer from pyjanitor. In this case we take advantage of the fact that each Charge Description is followed by its Charge Amount, so we can safely pair them and reshape into two columns. After that is done, we flip back to wide form, getting the Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here primarily for convenience; the same can be achieved with pivot plus a few cleanup steps (under the hood, pivot_wider uses pd.pivot).
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
 .pivot_longer(
     index=index,
     names_to=arr,
     names_pattern=arr,
     dropna=True)
 .pivot_wider(
     index=index,
     names_from='Charge Description',
     values_from='Charge Amount')
 .fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
Another option: since the data is fairly consistent in its ordering, you can drop down into numpy, reshape into a two-column array, keep track of the ID and Net Cost columns (ensuring they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get ->
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
If needed, you can then convert the Discount/Surcharge columns to numeric dtypes.
You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
.melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
out[~m].reset_index(drop=True)], axis=1)
.query("_value != ''")
.pivot_table(index=['ID', 'Net Cost'], columns='_value',
values='value', aggfunc='first')
.rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can use pivot_table after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
Output:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
I would use melt to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1.
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5
My first thought was to read the data out into a list of dictionaries, one per row (forming both the keys and the values from the data values), then build a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
https://trinket.io/python3/555f860855
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]

Dynamic mean calculation of float values using dask and groupby?

I'm trying to dynamically calculate the mean for all float64 columns. The end user should be able to filter through any of the category columns against a chart and get the various means for each instance. To achieve this, I've written the Python script below using dask and its groupby function.
However...
When performing groupby on the columns below, the object columns vanish from the output CSV file as a result of the aggregation and mean calculation on the float64 columns. In this example, I'm using dask to read the dataframe (pandas is not an option due to high memory usage) and save the output file as CSV.
The dtypes of the input CSV columns are:
Dtype                      Column
String (e.g. 2018-P01/02)  Time Period
Integer                    Journey Code
Object                     Journey Name
Object                     Route
Object                     Route Name
Object                     Area
Object                     Area Name
Object                     Fuel Type
Float64                    Fuel Load
Float64                    Mileage
Float64                    Odometer Reading
Float64                    Delay Minutes
My code for reading/saving the CSV and performing the mean calculation is:
import numpy as np
import dask.dataframe as dd
import pandas as pd
filename = "H:\\testing3.csv"
data = dd.read_csv("H:\\testing2.csv")
cols=['Time Period','Journey Code','Journey Name','Route',
'Route Name','Area','Area Name','Fuel Type']
x = data.groupby(cols).agg(['mean'])
x.to_csv(filename, index = False)
An example of the original dataset is:
Time Period Journey Code Route Route Name Area Area Name
2018-P01 261803 High France-Germany WE West
2018-P01-02 325429 High France-Germany EA Eastern
2018-P01-02 359343 High France-Germany WCS West Coast South
2018-P01-02 359343 High France-Germany WE West
2018-P01-03 370697 High France-Germany WE West
2018-P01-04 392535 High France-Germany EA Eastern
2018-P01-04 394752 High France-Germany WCS West Coast South
2018-P01-05 408713 High France-Germany WE West
Fuel Type Fuel Load Mileage Odometer Reading Delay Minutes
Diesel 165 6 14567.1 2
Diesel 210 12 98765.8 0
Diesel 210 5 23406.2 0
Diesel 130 10 54418.8 0
Diesel 152.5 37 58838.35 2
Diesel 142 140 63257.9 37.1194012
Diesel 131.5 120 67677.45 0
Diesel 121 13 72097 1.25
Why are the object columns vanishing from the resulting CSV file and how can I produce a result like the below?
Desired output (example based on lines 2 and 3): no average for the starting line of a group, but any subsequent float64 values contain the average (current against previous). I'm splitting each instance separately to get the dynamic result, but any ideas are welcome.
Time Period  Journey Code  Route  Route Name      Area  Area Name
2018-P01-02  325429        High   France-Germany  EA    Eastern
…….
             359343        High   France-Germany  WCS   West Coast South
Fuel Type  Fuel Load  Mileage  Odometer Reading  Delay Minutes
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
Diesel     210        12       98765.8           0
......
Diesel     170        8.5      23406.2           NaN
Edit: Added sample dataset in format of df.head(10).to_dict()
{'Time Period': {0: '2018-P01', 1: '2018-P01-02', 2: '2018-P01-02', 3: '2018-P01-02', 4: '2018-P01-03', 5: '2018-P01-04', 6: '2018-P01-04', 7: '2018-P01-05', 8: '2018-P01-06', 9: '2018-P01-07'}, 'Odometer Reading': {0: 14567.1, 1: 98765.8, 2: 23406.2, 3: 54418.8, 4: 58838.35, 5: 63257.9, 6: 67677.45, 7: 72097.0, 8: 89221.0, 9: 89221.0}, 'Journey Code': {0: 261803, 1: 325429, 2: 359343, 3: 359343, 4: 370697, 5: 392535, 6: 394752, 7: 408713, 8: 408714, 9: 408715}, 'Fuel Type': {0: 'Diesel', 1: 'Diesel', 2: 'Diesel', 3: 'Diesel', 4: 'Diesel', 5: 'Diesel', 6: 'Diesel', 7: 'Diesel', 8: 'Diesel', 9: 'Diesel'}, 'Route Name': {0: 'France-Germany', 1: 'France-Germany', 2: 'France-Germany', 3: 'France-Germany', 4: 'France-Germany', 5: 'France-Germany', 6: 'France-Germany', 7: 'France-Germany', 8: 'France-Germany', 9: 'France-Germany'}, 'Area': {0: 'WE', 1: 'EA', 2: 'WCS', 3: 'WE', 4: 'WE', 5: 'EA', 6: 'WCS', 7: 'WE', 8: 'WE', 9: 'WE'}, 'Route': {0: 'High', 1: 'High', 2: 'High', 3: 'High', 4: 'High', 5: 'High', 6: 'High', 7: 'High', 8: 'High', 9: 'High'}, 'Fuel Load': {0: 165.0, 1: 210.0, 2: 170.0, 3: 130.0, 4: 152.5, 5: 142.0, 6: 131.5, 7: 121.0, 8: 121.0, 9: 121.0}, 'Delay Minutes': {0: 2.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 2.0, 5: 37.119401200000006, 6: 0.0, 7: 1.25, 8: 2.56, 9: 2.56}, 'Mileage': {0: 6.0, 1: 12.0, 2: 8.5, 3: 10.0, 4: 37.0, 5: 140.0, 6: 120.0, 7: 13.0, 8: 13.0, 9: 13.0}, 'Area Name': {0: 'West', 1: 'Eastern', 2: 'West Coast South', 3: 'West', 4: 'West', 5: 'Eastern', 6: 'West Coast South', 7: 'West', 8: 'West', 9: 'West'}}
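A likely explanation (a sketch, not from the original thread): after groupby, the grouping columns move into the index of the result, and writing with index=False silently discards the index, and with it every object column. Calling reset_index() before saving turns them back into regular columns. A minimal pandas illustration using two of the question's columns (dask's groupby API behaves the same way):

```python
import pandas as pd

df = pd.DataFrame({
    "Area": ["WE", "EA", "WE"],
    "Fuel Load": [165.0, 210.0, 130.0],
})

# After groupby, the grouping column "Area" becomes the index.
out = df.groupby("Area").mean()

# to_csv(..., index=False) would drop that index.  reset_index()
# restores the grouping columns as regular columns first:
out = out.reset_index()
# out.to_csv("testing3.csv", index=False)   # hypothetical output path
```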

Including missing combination of values based on a group of grouped data

I am expanding on an earlier thread: Including missing combinations of values in a pandas groupby aggregation
In the above thread, the accepted answer computes all possible combinations for the grouping variable. In this version, I'd like to compute combinations based on a group of groups.
Let's take an example.
Here's input dataframe:
Here, one group is [Year,Quarter] i.e.
Year Quarter
2014 Q1
2015 Q2
2015 Q3
Another set of group is Name:
Name
Adam
Smith
Now, I want to apply groupby and sum such that missing combinations of the above groups appear as NaN
Here's sample output:
I'd appreciate any help.
Here's sample input and output in dict format:
input=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Adam', 4: 'Smith'},
'Value': {0: 2, 1: 3, 2: 4, 3: 5, 4: 5}}
output=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015, 5: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3', 5: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Smith', 4: 'Smith', 5: 'Adam'},
'Value': {0: 2.0, 1: 3.0, 2: 9.0, 3: nan, 4: 5.0, 5: nan}}
Clarification:
I am looking for a method without doing melt and cast. i.e. without playing around with long and wide format.
The post you linked is the correct approach: groupby to get the sum, then unstack to expose the missing combinations, then stack with dropna=False (see the docs on stack):
df.groupby(['Year','Quarter','Name']).sum().unstack().stack(dropna=False).reset_index()
Year Quarter Name Value
0 2014 Q1 Adam 2.0
1 2014 Q1 Smith 3.0
2 2015 Q2 Adam 9.0
3 2015 Q2 Smith NaN
4 2015 Q3 Adam NaN
5 2015 Q3 Smith 5.0
Using pivot_table (you can add reset_index at the end):
df.pivot_table(index=['Year','Quarter'],columns='Name',values='Value',aggfunc='sum').stack(dropna=False)
Year Quarter Name
2014 Q1 Adam 2.0
Smith 3.0
2015 Q2 Adam 9.0
Smith NaN
Q3 Adam NaN
Smith 5.0
dtype: float64
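As an alternative sketch (not from the thread): you can build the wanted index explicitly, crossing only the observed (Year, Quarter) pairs with every Name, and reindex. This avoids any melt/cast-style reshaping entirely:

```python
import pandas as pd

df = pd.DataFrame({
    "Year":    [2014, 2014, 2015, 2015, 2015],
    "Quarter": ["Q1", "Q1", "Q2", "Q2", "Q3"],
    "Name":    ["Adam", "Smith", "Adam", "Adam", "Smith"],
    "Value":   [2, 3, 4, 5, 5],
})

g = df.groupby(["Year", "Quarter", "Name"])["Value"].sum()

# Only the (Year, Quarter) pairs that actually occur, crossed with every Name:
pairs = g.index.droplevel("Name").unique()
names = df["Name"].unique()
full = pd.MultiIndex.from_tuples(
    [(y, q, n) for y, q in pairs for n in names],
    names=["Year", "Quarter", "Name"],
)

# Missing combinations within an observed (Year, Quarter) appear as NaN:
out = g.reindex(full).reset_index()
```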

Python 2.7: DataFrame groupby and find find the percentage distribution of values within group

I have a dataframe and I would like to find the percentage distribution of values in a column within a group.
An example of a group is df.groupby(['race', 'tyre', 'stint']).get_group(("Australian Grand Prix", "Super soft", 1))
I would like to find out the percentage distribution of the "total diff" values for each row of the group.
Here is the dataframe in dictionary format. There will be many other groups, but the df below only shows the first group.
{'driverRef': {0: 'vettel',
1: 'raikkonen',
2: 'rosberg',
4: 'hamilton',
6: 'ricciardo',
7: 'alonso',
14: 'haryanto'},
'race': {0: 'Australian Grand Prix',
1: 'Australian Grand Prix',
2: 'Australian Grand Prix',
4: 'Australian Grand Prix',
6: 'Australian Grand Prix',
7: 'Australian Grand Prix',
14: 'Australian Grand Prix'},
'stint': {0: 1.0, 1: 1.0, 2: 1.0, 4: 1.0, 6: 1.0, 7: 1.0, 14: 1.0},
'total diff': {0: 125147.50728499777,
1: 281292.0366694695,
2: 166278.41312954266,
4: 64044.234019635056,
6: 648383.28046950256,
7: 400675.77449897071,
14: 2846411.2560531585},
'tyre': {0: u'Super soft',
1: u'Super soft',
2: u'Super soft',
4: u'Super soft',
6: u'Super soft',
7: u'Super soft',
14: u'Super soft'}}
If I understand correctly what you need, this might help:
sums = df.groupby(['race', 'tyre', 'stint'])['total diff'].sum()
df = df.set_index(['race', 'tyre', 'stint']).assign(pct=sums).reset_index()
df['pct'] = df['total diff'] / df['pct']
# race tyre stint driverRef total diff pct
# 0 Australian Grand Prix Super soft 1.0 vettel 1.251475e+05 0.027613
# 1 Australian Grand Prix Super soft 1.0 raikkonen 2.812920e+05 0.062065
# 2 Australian Grand Prix Super soft 1.0 rosberg 1.662784e+05 0.036688
# 3 Australian Grand Prix Super soft 1.0 hamilton 6.404423e+04 0.014131
# 4 Australian Grand Prix Super soft 1.0 ricciardo 6.483833e+05 0.143060
# 5 Australian Grand Prix Super soft 1.0 alonso 4.006758e+05 0.088406
# 6 Australian Grand Prix Super soft 1.0 haryanto 2.846411e+06 0.628037
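A more compact alternative (a sketch, not the original answer): groupby(...).transform('sum') broadcasts each group's total back onto its own rows, so the share is a single vectorised division with no set_index/assign round-trip. Column names follow the question's data; the values here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "race":       ["Australian Grand Prix"] * 3,
    "tyre":       ["Super soft"] * 3,
    "stint":      [1.0, 1.0, 1.0],
    "total diff": [100.0, 300.0, 600.0],
})

# transform('sum') returns a Series aligned with df's rows, holding each
# row's group total -- the per-row share is then a plain division:
df["pct"] = df["total diff"] / df.groupby(["race", "tyre", "stint"])["total diff"].transform("sum")
```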

pandas data frames delta (subtraction)

I have 2 non-indexed data frames as follows:
df1
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Maria Murphy 30/03/1989
Seth Black 21/06/1991
and df2
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Seth Black 21/06/1991
Joe Maher 28/09/1990
Debby White 03/01/1992
I want to have a data delta, where only the records that are in df2 and not df1 will appear: i.e.
Joe Maher 28/09/1990
Debby White 03/01/1992
Is there a way to achieve this?
I tried an inner join, but I couldn't find a way to subtract it from df2.
Any help is much appreciated.
You can use a list comprehension together with join to create unique keys for each table, consisting of the first name, last name and the date field (I assumed date of birth). Each field needs to be converted to a string if it is not already.
You then use another list comprehension together with enumerate to get the index location of each key in key2 that is not also in key1.
Finally, use iloc to get all rows in df2 based on the indexing from the previous step.
df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
key1 = ["".join([first, last, dob])
for first, last, dob in zip(df1.First, df1.Last, df1.dob)]
key2 = ["".join([first, last, dob])
for first, last, dob in zip(df2.First, df2.Last, df2.dob)]
idx = [n for n, k in enumerate(key2)
if k not in key1]
>>> df2.iloc[idx, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
Assuming you do not have any other columns in your dataframe, you could use drop_duplicates as suggested by @SebastianWozny. However, you need to select only the new rows added (not df1). You can do that as follows:
>>> df1.append(df2).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
You can append the two frames and use drop_duplicates to get the unique rows, then, as suggested by @Alexander, use iloc to get the rows you want:
df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
>>> df1.append(df2).drop_duplicates()
First Last dob
0 John Mullen 12/08/1993
1 Lisa Bush 06/12/1990
2 Maria Murphy 30/03/1989
3 Seth Black 21/06/1991
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
>>> df1.append(df2).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
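As a further sketch (not from the answers above; note that DataFrame.append was removed in pandas 2.0): a left merge with indicator=True gives the same anti-join directly, without relying on drop_duplicates or row positions:

```python
import pandas as pd

df1 = pd.DataFrame({"First": ["John", "Lisa"],
                    "Last":  ["Mullen", "Bush"],
                    "dob":   ["12/08/1993", "06/12/1990"]})
df2 = pd.DataFrame({"First": ["John", "Joe"],
                    "Last":  ["Mullen", "Maher"],
                    "dob":   ["12/08/1993", "28/09/1990"]})

# Merge df2 against df1 on all shared columns; indicator=True adds a
# _merge column recording where each row came from.  'left_only' rows
# are the ones present in df2 but absent from df1.
delta = (df2.merge(df1, how="left", indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns="_merge"))
```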
