I am importing a CSV file into a pandas dataframe, like so:
df = pd.DataFrame( {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}} )
0   1         2                   3              4                   5              6                   7              8                   9
ID  Net Cost  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount  Charge Description  Charge Amount
1   30        Surcharge A         9.5            Discount X          -11.5          Discount Y          -3.25          Surcharge B         2.5
2   40        Discount X          -12.5
3   50        Discount X          -11.5
4   35        Discount X          -5.5           Surcharge B         3.5
5   45        Surcharge A         9.5            Discount X          -10.5          Surcharge B         4.5
The first row holds the headers; the column names Charge Description and Charge Amount form pairs and appear multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
ID  Net Cost  Surcharge A  Surcharge B  Discount X  Discount Y
1   30        9.5          2.5          -11.5       -3.25
2   40        0            0            -12.5       0
3   50        0            0            -11.5       0
4   35        0            3.5          -5.5        0
5   45        9.5          4.5          -10.5       0
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
I used the file you shared and reset the column names to restore the duplicates (pandas automatically adds suffixes when reading to make column names unique):
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost',
                   'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
First step is to transform to long form with pivot_longer from pyjanitor - in this case we take advantage of the fact that charge description is followed by charge amount - we can safely pair them and reshape into two columns. After that is done, we flip back to wide form - getting Surcharge and Discount values as headers. Thankfully, the index is unique, so a pivot works without extras. I used pivot_wider here, primarily for convenience - the same can be achieved with pivot, with just a few cleanup steps - under the hood pivot_wider uses pd.pivot.
# pip install pyjanitor
import pandas as pd
import janitor
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
(invoice
 .pivot_longer(
     index=index,
     names_to=arr,
     names_pattern=arr,
     dropna=True)
 .pivot_wider(
     index=index,
     names_from='Charge Description',
     values_from='Charge Amount')
 .fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
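For reference, here is a plain-pandas sketch of the pivot_wider step the paragraph above alludes to, assuming the intermediate output of the pivot_longer call is assigned to a hypothetical name long:
# same reshape without pyjanitor; 'long' stands in for the pivot_longer output
wide = (long
        .pivot(index=index, columns='Charge Description',
               values='Charge Amount')
        .fillna(0)
        .rename_axis(columns=None)
        .reset_index())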
Another option - since the data is fairly consistent with the ordering, you can dump down into numpy, reshape into a two column array, keep track of the ID and Net Cost columns (ensure they are correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']
invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns = arr)
# reshape above is in order `C` - default
# so we can safely repeat the index
# with a value of 4
# which is what you get ->
# invoice.columns.size // 2
# to correctly pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size//2)
# get rid of nulls, and flip to wide form
(out
.dropna(how='all')
.set_index('Charge Description', append=True)
.squeeze()
.unstack('Charge Description', fill_value=0)
.rename_axis(columns = None)
.reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
The reshaped columns come back as strings, so as a final step convert them to numeric.
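A minimal sketch of that conversion, assuming the chained result above was assigned to a hypothetical name out:
# the CSV delivered everything as text, so coerce all columns except ID
num_cols = out.columns.difference(['ID'])
out[num_cols] = out[num_cols].apply(pd.to_numeric)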
You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
.melt(['ID', 'Net Cost'], ignore_index=False))
m = out['variable'] == 'Charge Description'
# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
out[~m].reset_index(drop=True)], axis=1)
.query("_value != ''")
.pivot_table(index=['ID', 'Net Cost'], columns='_value',
values='value', aggfunc='first')
.rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can use pivot_table after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
{0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'}, 1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'}, 2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'}, 3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'}, 4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'}, 5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'}, 6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'}, 7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'}, 8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''}, 9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})
# setting first row as header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)
df = pd.concat([df.iloc[:, [0,1,i,i+1]] for i in range(2, len(df.columns), 2)]).replace('', 0)
print(df[df['Charge Description']!=0]
.pivot_table(columns='Charge Description', values='Charge Amount', index=['ID', 'Net Cost'])
.fillna(0))
Output:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
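Note that Charge Amount is still stored as text at this point, and pivot_table's default mean aggregation needs numbers, so depending on your pandas version you may have to coerce the column before the pivot:
# convert the amounts to numeric ahead of pivot_table's mean
df['Charge Amount'] = pd.to_numeric(df['Charge Amount'])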
I would use melt to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little: drop the melted variable column (named 0, after the
# header row's index label) and rename the added 'value' column from df1.
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5
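The requested output shows 0 instead of NaN, so one small step on top of the answer finishes it off:
# fill the gaps and restore ID / Net Cost as regular columns
df1 = df1.fillna(0).reset_index()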
My first thought was to read the data out in to a list of dictionaries representing each Row (making both the keys and values from the data values), then form a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain
rows = [
{
name: value
for name, value in chain(
[
("ID", row[0]),
("Net Cost", row[1]),
],
zip(row[2::2], row[3::2]) # pairs of columns: (2,3), (4,5), etc
)
if name
}
for ix, row in df.iloc[1:].iterrows() # Skips the row with the column headers
]
df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
https://trinket.io/python3/555f860855
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]
Related
I have a data frame with Nan values. For some reason, df.dropna() doesn't work when I try to drop these rows. Any thoughts?
Example of a row:
30754 22 Nan Nan Nan Nan Nan Nan Jewellery-Women N
df = pd.read_csv('/Users/xxx/Desktop/CS 677/Homework_4/FashionDataset.csv')
df.dropna()
df.head().to_dict()
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'BrandName': {0: 'life',
1: 'only',
2: 'fratini',
3: 'zink london',
4: 'life'},
'Deatils': {0: 'solid cotton blend collar neck womens a-line dress - indigo',
1: 'polyester peter pan collar womens blouson dress - yellow',
2: 'solid polyester blend wide neck womens regular top - off white',
3: 'stripes polyester sweetheart neck womens dress - black',
4: 'regular fit regular length denim womens jeans - stone'},
'Sizes': {0: 'Size:Large,Medium,Small,X-Large,X-Small',
1: 'Size:34,36,38,40',
2: 'Size:Large,X-Large,XX-Large',
3: 'Size:Large,Medium,Small,X-Large',
4: 'Size:26,28,30,32,34,36'},
'MRP': {0: 'Rs\n1699',
1: 'Rs\n3499',
2: 'Rs\n1199',
3: 'Rs\n2299',
4: 'Rs\n1699'},
'SellPrice': {0: '849', 1: '2449', 2: '599', 3: '1379', 4: '849'},
'Discount': {0: '50% off',
1: '30% off',
2: '50% off',
3: '40% off',
4: '50% off'},
'Category': {0: 'Westernwear-Women',
1: 'Westernwear-Women',
2: 'Westernwear-Women',
3: 'Westernwear-Women',
4: 'Westernwear-Women'}}
This is what I get when using df.head().to_dict()
Try this. Note that dropna returns a new dataframe rather than modifying in place, so the result has to be assigned:
df = pd.DataFrame({"col1":[12,20,np.nan,np.nan],
"col2":[10,np.nan,np.nan,40]})
df1 = df.dropna()
# df;
col1 col2
0 12.0 10.0
1 20.0 NaN
2 NaN NaN
3 NaN 40.0
# df1;
col1 col2
0 12.0 10.0
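Also worth checking: the sample row prints Nan as literal text, and dropna only removes real missing values. If the file really contains the string 'Nan', a sketch of the fix:
# turn literal 'Nan' strings into real NaN, then drop those rows
df = df.replace('Nan', np.nan).dropna()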
I have a dataframe with fields ['Transaction Description', 'Transaction Date', 'Debit Amount']. I want to sum the Debit Amount column by month and by Transaction Description to see how much I am spending on different things each month.
I tried
df.groupby(['Transaction Description', 'Transaction Date'])['Debit Amount'].sum().sort_values(ascending=False).head(180)
which gives me a sum for each Transaction Description by day.
Alternatively,
df.groupby(['Transaction Description'])['Debit Amount'].sum().sort_values(ascending=False).head(180)
gives me what I want but for the whole dataframe, not split by month.
I would like the output to have the months in order and for each one show the total amount spent on each Transaction Description, sorted from largest to smallest. This is so I can look at a given month and see what I have been spending my money on.
Here is a sample from the dataframe
{'Transaction Date': {0: Timestamp('2022-05-04 00:00:00'),
1: Timestamp('2022-05-04 00:00:00'),
2: Timestamp('2022-04-04 00:00:00'),
3: Timestamp('2022-04-04 00:00:00'),
4: Timestamp('2022-04-04 00:00:00'),
5: Timestamp('2022-04-04 00:00:00'),
6: Timestamp('2022-04-04 00:00:00'),
7: Timestamp('2022-04-04 00:00:00'),
8: Timestamp('2022-04-04 00:00:00'),
9: Timestamp('2022-01-04 00:00:00')},
'Transaction Description': {0: 'School',
1: 'Cleaner',
2: 'Taxi',
3: 'shop',
4: 'MOBILE',
5: 'Restaurant',
6: 'Restaurant',
7: 'shop',
8: 'Taxi',
9: 'shop'},
'Debit Amount': {0: 15.0,
1: 26.0,
2: 48.48,
3: 9.18,
4: 7.0,
5: 10.05,
6: 9.1,
7: 2.14,
8: 16.0,
9: 11.68}}
In this case I would like the output to be something like:
2022-01 shop 11.68
2022-04 Taxi 64.48
shop 23.00
MOBILE 7.00
Restaurant 19.15
2022-05 School 15.00
Cleaner 26.00
Use pd.Grouper. I assume your Transaction Date is already of datetime type:
df.groupby([pd.Grouper(key="Transaction Date", freq="MS"), "Transaction Description"]).sum()
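To also get each month's totals ordered largest first, as the question asks, here is a sketch building on the same Grouper (assuming Transaction Date is already datetime):
monthly = df.groupby([pd.Grouper(key="Transaction Date", freq="MS"),
                      "Transaction Description"])["Debit Amount"].sum()
# sort descending within each month while keeping the months in order
monthly = monthly.groupby(level=0, group_keys=False).apply(
    lambda s: s.sort_values(ascending=False))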
Try this out; if you had shared an MRE, I would have validated it and shared the result set.
# create a ym column and use that in your groupby
(df.assign(ym=pd.to_datetime(df['Transaction Date']).dt.strftime('%Y-%m'))
   .groupby(['ym', 'Transaction Description'])['Debit Amount'].sum()
)
ym Transaction Description
2022-01 shop 11.68
2022-04 MOBILE 7.00
Restaurant 19.15
Taxi 64.48
shop 11.32
2022-05 Cleaner 26.00
School 15.00
Name: Debit Amount, dtype: float64
OR
(df.assign(ym=pd.to_datetime(df['Transaction Date']).dt.strftime('%Y-%m'))
   .groupby(['ym', 'Transaction Description'], as_index=False)['Debit Amount'].sum()
)
ym Transaction Description Debit Amount
0 2022-01 shop 11.68
1 2022-04 MOBILE 7.00
2 2022-04 Restaurant 19.15
3 2022-04 Taxi 64.48
4 2022-04 shop 11.32
5 2022-05 Cleaner 26.00
6 2022-05 School 15.00
This question already has answers here: python: how to melt dataframe retaining specific order / custom sorting
Say I have a dataframe
data_dict = {'Number': {0: 1, 1: 2, 2: 3}, 'mw link': {0: 'SAM3703_2SAM3944 2', 1: 'SAM3720_2SAM4115 2', 2: 'SAM3729_2SAM4121_ 2'}, 'site_a': {0: 'SAM3703', 1: 'SAM3720', 2: 'SAM3729'}, 'name_a': {0: 'Chelak', 1: 'KattakurganATC', 2: 'Payariq'}, 'site_b': {0: 'SAM3944', 1: 'SAM4115', 2: 'SAM4121'}, 'name_b': {0: 'Turkibolo', 1: 'Kattagurgon Sement Zavod', 2: 'Payariq Dehgonobod'}, 'distance km': {0: 3.618, 1: 7.507, 2: 9.478}, 'manufacture': {0: 'ZTE NR 8150/8250', 1: 'ZTE NR 8150/8250', 2: 'ZTE NR 8150/8250'}}
df = pd.DataFrame(data_dict)
Expected output:
There are two columns, site_a and site_b, which I want to melt into rows. Applying a simple melt stacks all the site_a values first and then all the site_b values; I want them to alternate row by row:
Number mw link distance km manufacture variable value
0 1 SAM3703_2SAM3944 2 3.618 ZTE NR 8150/8250 site_a SAM3703
1 1 SAM3703_2SAM3944 2 3.618 ZTE NR 8150/8250 site_b SAM3944
2 2 SAM3720_2SAM4115 2 7.507 ZTE NR 8150/8250 site_a SAM3720
3 2 SAM3720_2SAM4115 2 7.507 ZTE NR 8150/8250 site_b SAM4115
4 3 SAM3729_2SAM4121_ 2 9.478 ZTE NR 8150/8250 site_a SAM3729
5 3 SAM3729_2SAM4121_ 2 9.478 ZTE NR 8150/8250 site_b SAM4121
My Solution :
This is what I have tried
df1 = pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b'])
which gives me all the site_a rows stacked first, followed by all the site_b rows.
You just add sort_values(['Number', 'variable']):
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['Number', 'variable'])
Alternatives:
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['mw link', 'variable'])
Or:
pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'], value_vars=['site_a', 'site_b']).sort_values(['distance km', 'variable'])
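To reproduce the clean 0-5 row numbering from the expected output, sort_values can renumber as it sorts (ignore_index requires pandas 1.0+):
(pd.melt(df, id_vars=['Number', 'mw link', 'distance km', 'manufacture'],
         value_vars=['site_a', 'site_b'])
   .sort_values(['Number', 'variable'], ignore_index=True))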
I'm trying to dynamically calculate the mean for all float64 columns. The end user should be able to filter through any of the category columns against a chart and get the various means for each instance. To achieve this, I've written the Python script below using dask and its groupby function.
However...
When performing a groupby on the columns below, the object columns vanish from the output CSV file as a result of the aggregation and mean calculation on the float64 columns. In this example I'm using dask to read the dataframe (pandas is not an option due to high memory usage) and saving the output file as CSV.
The dtypes of the input CSV columns are:
Dtype                      Column
String (e.g. 2018-P01/02)  Time Period
Integer                    Journey Code
Object                     Journey Name
Object                     Route
Object                     Route Name
Object                     Area
Object                     Area Name
Object                     Fuel Type
Float64                    Fuel Load
Float64                    Mileage
Float64                    Odometer Reading
Float64                    Delay Minutes
My code for reading/saving the CSV and performing the mean calculation is:
import dask.dataframe as dd

filename = "H:\\testing3.csv"
data = dd.read_csv("H:\\testing2.csv")
cols = ['Time Period', 'Journey Code', 'Journey Name', 'Route',
        'Route Name', 'Area', 'Area Name', 'Fuel Type']
x = data.groupby(cols).agg(['mean'])  # .aggregation does not exist; .agg does
x.to_csv(filename, index=False)
An example of the original dataset is:
Time Period Journey Code Route Route Name Area Area Name
2018-P01 261803 High France-Germany WE West
2018-P01-02 325429 High France-Germany EA Eastern
2018-P01-02 359343 High France-Germany WCS West Coast South
2018-P01-02 359343 High France-Germany WE West
2018-P01-03 370697 High France-Germany WE West
2018-P01-04 392535 High France-Germany EA Eastern
2018-P01-04 394752 High France-Germany WCS West Coast South
2018-P01-05 408713 High France-Germany WE West
Fuel Type Fuel Load Mileage Odometer Reading Delay Minutes
Diesel 165 6 14567.1 2
Diesel 210 12 98765.8 0
Diesel 210 5 23406.2 0
Diesel 130 10 54418.8 0
Diesel 152.5 37 58838.35 2
Diesel 142 140 63257.9 37.1194012
Diesel 131.5 120 67677.45 0
Diesel 121 13 72097 1.25
Why are the object columns vanishing from the resulting CSV file and how can I produce a result like the below?
Desired output (example on lines 2 and 3): no average for the first line of a group, but every later float64 value holds the average of the current against the previous values. I'm splitting each instance separately to get the dynamic result, but any ideas are welcome.
Time Period   Journey Code  Route  Route Name      Area  Area Name
2018-P01-02   325429        High   France-Germany  EA    Eastern
…….
359343 High France-Germany WCS West Coast South
Fuel Type Fuel Load Mileage Odometer Reading Delay Minutes
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
Diesel 210 12 98765.8 0
......
Diesel 170 8.5 23406.2 NaN
Edit: Added sample dataset in format of df.head(10).to_dict()
{'Time Period': {0: '2018-P01', 1: '2018-P01-02', 2: '2018-P01-02', 3: '2018-P01-02', 4: '2018-P01-03', 5: '2018-P01-04', 6: '2018-P01-04', 7: '2018-P01-05', 8: '2018-P01-06', 9: '2018-P01-07'}, 'Odometer Reading': {0: 14567.1, 1: 98765.8, 2: 23406.2, 3: 54418.8, 4: 58838.35, 5: 63257.9, 6: 67677.45, 7: 72097.0, 8: 89221.0, 9: 89221.0}, 'Journey Code': {0: 261803, 1: 325429, 2: 359343, 3: 359343, 4: 370697, 5: 392535, 6: 394752, 7: 408713, 8: 408714, 9: 408715}, 'Fuel Type': {0: 'Diesel', 1: 'Diesel', 2: 'Diesel', 3: 'Diesel', 4: 'Diesel', 5: 'Diesel', 6: 'Diesel', 7: 'Diesel', 8: 'Diesel', 9: 'Diesel'}, 'Route Name': {0: 'France-Germany', 1: 'France-Germany', 2: 'France-Germany', 3: 'France-Germany', 4: 'France-Germany', 5: 'France-Germany', 6: 'France-Germany', 7: 'France-Germany', 8: 'France-Germany', 9: 'France-Germany'}, 'Area': {0: 'WE', 1: 'EA', 2: 'WCS', 3: 'WE', 4: 'WE', 5: 'EA', 6: 'WCS', 7: 'WE', 8: 'WE', 9: 'WE'}, 'Route': {0: 'High', 1: 'High', 2: 'High', 3: 'High', 4: 'High', 5: 'High', 6: 'High', 7: 'High', 8: 'High', 9: 'High'}, 'Fuel Load': {0: 165.0, 1: 210.0, 2: 170.0, 3: 130.0, 4: 152.5, 5: 142.0, 6: 131.5, 7: 121.0, 8: 121.0, 9: 121.0}, 'Delay Minutes': {0: 2.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 2.0, 5: 37.119401200000006, 6: 0.0, 7: 1.25, 8: 2.56, 9: 2.56}, 'Mileage': {0: 6.0, 1: 12.0, 2: 8.5, 3: 10.0, 4: 37.0, 5: 140.0, 6: 120.0, 7: 13.0, 8: 13.0, 9: 13.0}, 'Area Name': {0: 'West', 1: 'Eastern', 2: 'West Coast South', 3: 'West', 4: 'West', 5: 'Eastern', 6: 'West Coast South', 7: 'West', 8: 'West', 9: 'West'}}
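The short explanation is that groupby moves the grouping columns into the index, and index=False then drops that index on write. A sketch of a version that keeps them, assuming the column names listed above (single_file=True makes dask write one CSV):
num_cols = ['Fuel Load', 'Mileage', 'Odometer Reading', 'Delay Minutes']
x = data.groupby(cols)[num_cols].mean().reset_index()
# reset_index turns the group keys back into ordinary columns,
# so they survive index=False on write
x.to_csv(filename, index=False, single_file=True)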
I have 2 non-indexed data frames as follows:
df1
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Maria Murphy 30/03/1989
Seth Black 21/06/1991
and df2
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Seth Black 21/06/1991
Joe Maher 28/09/1990
Debby White 03/01/1992
I want a data delta where only the records that are in df2 and not in df1 appear, i.e.
Joe Maher 28/09/1990
Debby White 03/01/1992
Is there a way to achieve this?
I tried an inner join, but I couldn't find a way to subtract it from df2.
Any help is much appreciated.
You can use a list comprehension together with str.join to create unique keys for each table, consisting of the first name, last name and the date field (I assumed date of birth). Each field needs to be converted to a string if it is not already.
You then use another list comprehension together with enumerate to get the index location of each key in key2 that is not also in key1.
Finally, use iloc to get all rows in df2 based on the indexing from the previous step.
df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
key1 = ["".join([first, last, dob])
for first, last, dob in zip(df1.First, df1.Last, df1.dob)]
key2 = ["".join([first, last, dob])
for first, last, dob in zip(df2.First, df2.Last, df2.dob)]
idx = [n for n, k in enumerate(key2)
if k not in key1]
>>> df2.iloc[idx, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
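For reference, the same delta can be computed without building string keys, via an anti-join on all three columns (a sketch using merge's indicator flag):
delta = (df2.merge(df1, how='left', indicator=True)
            .query("_merge == 'left_only'")
            .drop(columns='_merge'))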
Assuming you do not have any other columns in your dataframe, you could use drop_duplicates as suggested by @SebastianWozny. However, you need to select only the new rows added (not those from df1). You can do that as follows:
>>> df1.append(df2).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
You can append the two frames and use drop_duplicates to get the unique rows, then as suggested by @Alexander use iloc to get the rows you want:
df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
>>> df1.append(df2).drop_duplicates()
First Last dob
0 John Mullen 12/08/1993
1 Lisa Bush 06/12/1990
2 Maria Murphy 30/03/1989
3 Seth Black 21/06/1991
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
>>> df1.append(df2).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
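One version note on the two answers above: DataFrame.append was removed in pandas 2.0, so on current pandas the same idea reads (a sketch, otherwise unchanged):
# pd.concat replaces the removed DataFrame.append
delta = pd.concat([df1, df2]).drop_duplicates().iloc[df1.shape[0]:]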