pandas data frames delta (subtraction) - python

I have 2 non-indexed data frames as follows:
df1
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Maria Murphy 30/03/1989
Seth Black 21/06/1991
and df2
John Mullen 12/08/1993
Lisa Bush 06/12/1990
Seth Black 21/06/1991
Joe Maher 28/09/1990
Debby White 03/01/1992
I want to compute a delta, where only the records that are in df2 and not in df1 will appear, i.e.
Joe Maher 28/09/1990
Debby White 03/01/1992
Is there a way to achieve this?
I tried an inner join, but I couldn't find a way to subtract it from df2.
Any help is much appreciated.
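(As an aside, not from the answers below: the inner-join idea can be finished with merge's indicator flag, which effectively performs an anti-join. A minimal sketch, assuming df1 and df2 are built as in the answers below.)

# Sketch: anti-join using merge's indicator flag.
# With df2 as the left frame, rows that exist only in df2 get _merge == 'left_only'.
import pandas as pd

merged = df2.merge(df1, how='left', indicator=True)
delta = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(delta)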

You can use a list comprehension together with join to create unique keys for each table, consisting of the first name, last name and the date field (I assumed date of birth). Each field needs to be converted to a string if it is not already.
You then use another list comprehension together with enumerate to get the index location of each key in key2 that is not also in key1.
Finally, use iloc to get all rows in df2 based on the indexing from the previous step.
import pandas as pd

df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
                    'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
                    'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
                    'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
                    'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})

key1 = ["".join([first, last, dob])
        for first, last, dob in zip(df1.First, df1.Last, df1.dob)]
key2 = ["".join([first, last, dob])
        for first, last, dob in zip(df2.First, df2.Last, df2.dob)]

idx = [n for n, k in enumerate(key2)
       if k not in key1]
>>> df2.iloc[idx, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
Assuming you do not have any other columns in your dataframe, you could use drop_duplicates as suggested by @SebastianWozny. However, you need to select only the newly added rows (those not in df1). You can do that as follows:
>>> df1.append(df2).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992

You can append the two frames and use drop_duplicates to get the unique rows, then, as suggested by @Alexander, you can use iloc to get the rows you want:
df1 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Maria', 3: 'Seth'},
                    'Last': {0: 'Mullen', 1: 'Bush', 2: 'Murphy', 3: 'Black'},
                    'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '30/03/1989', 3: '21/06/1991'}})
df2 = pd.DataFrame({'First': {0: 'John', 1: 'Lisa', 2: 'Seth', 3: 'Joe', 4: 'Debby'},
                    'Last': {0: 'Mullen', 1: 'Bush', 2: 'Black', 3: 'Maher', 4: 'White'},
                    'dob': {0: '12/08/1993', 1: '06/12/1990', 2: '21/06/1991', 3: '28/09/1990', 4: '03/01/1992'}})
>>> df1.append(df2).drop_duplicates()
First Last dob
0 John Mullen 12/08/1993
1 Lisa Bush 06/12/1990
2 Maria Murphy 30/03/1989
3 Seth Black 21/06/1991
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
>>> df1.append(df2).drop_duplicates().iloc[df1.shape[0]:, :]
First Last dob
3 Joe Maher 28/09/1990
4 Debby White 03/01/1992
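A small caveat on the snippets above: DataFrame.append was deprecated and removed in pandas 2.0, so on recent versions the same idea can be written with pd.concat. A minimal sketch:

# equivalent to df1.append(df2) on pandas versions where append has been removed
delta = pd.concat([df1, df2]).drop_duplicates().iloc[df1.shape[0]:, :]
print(delta)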

Column Pair-wise aggregation and reorganization in Pandas

I am importing a csv file into a pandas dataframe such as:
df = pd.DataFrame(
    {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'},
     1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'},
     2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'},
     3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'},
     4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'},
     5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'},
     6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'},
     7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'},
     8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''},
     9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}}
)
0    1          2                    3               4                    5               6                    7               8                    9
ID   Net Cost   Charge Description   Charge Amount   Charge Description   Charge Amount   Charge Description   Charge Amount   Charge Description   Charge Amount
1    30         Surcharge A          9.5             Discount X           -11.5           Discount Y           -3.25           Surcharge B          2.5
2    40         Discount X           -12.5
3    50         Discount X           -11.5
4    35         Discount X           -5.5            Surcharge B          3.5
5    45         Surcharge A          9.5             Discount X           -10.5           Surcharge B          4.5
The first row are the headers with column names Charge Description and Charge Amount forming pairs and appearing multiple times.
Desired output is a df with a unique column for each description, with the reorganized columns sorted alphabetically and NaNs showing as 0:
ID   Net Cost   Surcharge A   Surcharge B   Discount X   Discount Y
1    30         9.5           2.5           -11.5        -3.25
2    40         0             0             -12.5        0
3    50         0             0             -11.5        0
4    35         0             3.5           -5.5         0
5    45         9.5           4.5           -10.5        0
This post looks like a good starting point but then I need a column for each Charge Description and only a single row per ID.
I used the file you shared and reset the column names to match the initial dataframe df you posted (pandas automatically adds suffixes to duplicate column names to make them unique), so the repeated names are preserved:
invoice = pd.read_csv('Downloads/Example Invoice.csv')
invoice.columns = ['ID', 'Net Cost', 'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount',
                   'Charge Description', 'Charge Amount']
print(invoice)
ID Net Cost Charge Description Charge Amount ... Charge Description Charge Amount Charge Description Charge Amount
0 1 30 Surcharge A 9.5 ... Discount Y -3.25 Surcharge B 2.5
1 2 40 Discount X -12.5 ... NaN NaN NaN NaN
2 3 50 Discount X -11.5 ... NaN NaN NaN NaN
3 4 35 Discount X -5.5 ... NaN NaN NaN NaN
4 5 45 Surcharge A 9.5 ... Surcharge B 4.50 NaN NaN
The first step is to transform to long form with pivot_longer from pyjanitor. In this case we take advantage of the fact that each Charge Description column is followed by its Charge Amount, so we can safely pair them and reshape into two columns. After that is done, we flip back to wide form, getting the Surcharge and Discount values as headers. Thankfully the index is unique, so the pivot works without extras. I used pivot_wider here primarily for convenience; the same can be achieved with pivot plus a few cleanup steps (under the hood, pivot_wider uses pd.pivot).
# pip install pyjanitor
import pandas as pd
import janitor

index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']

(invoice
 .pivot_longer(
     index=index,
     names_to=arr,
     names_pattern=arr,
     dropna=True)
 .pivot_wider(
     index=index,
     names_from='Charge Description',
     values_from='Charge Amount')
 .fillna(0)
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0.00 0.0 0.0
2 3 50 -11.5 0.00 0.0 0.0
3 4 35 -5.5 0.00 0.0 3.5
4 5 45 -10.5 0.00 9.5 4.5
Another option: since the data is fairly consistent in its ordering, you can drop down into numpy, reshape into a two-column array, keep track of the ID and Net Cost columns (making sure they stay correctly paired), and then pivot to get your final data:
index = ['ID', 'Net Cost']
arr = ['Charge Description', 'Charge Amount']

invoice = invoice.set_index(index)
out = invoice.to_numpy().reshape(-1, 2)
out = pd.DataFrame(out, columns=arr)

# the reshape above is in order `C` (the default), so we can safely repeat
# the index with a value of invoice.columns.size // 2 (here 4) to correctly
# pair the index with the new dataframe
out.index = invoice.index.repeat(invoice.columns.size // 2)

# get rid of nulls, and flip to wide form
(out
 .dropna(how='all')
 .set_index('Charge Description', append=True)
 .squeeze()
 .unstack('Charge Description', fill_value=0)
 .rename_axis(columns=None)
 .reset_index()
)
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can then convert the amount columns to numeric dtypes.
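For example, a minimal sketch of that conversion, assuming the reshaped frame from above was assigned to a variable (here called result); the column names are taken from the output shown above:

# convert the reshaped charge columns from object dtype to numbers
amount_cols = ['Discount X', 'Discount Y', 'Surcharge A', 'Surcharge B']
result[amount_cols] = result[amount_cols].apply(pd.to_numeric)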
You can flatten your dataframe first with melt then reshape with pivot_table after cleaning it up:
# 1st pass
out = (pd.DataFrame(df.iloc[1:].values, columns=df.iloc[0].tolist())
         .melt(['ID', 'Net Cost'], ignore_index=False))

m = out['variable'] == 'Charge Description'

# 2nd pass
out = (pd.concat([out[m].reset_index(drop=True).add_prefix('_'),
                  out[~m].reset_index(drop=True)], axis=1)
         .query("_value != ''")
         .pivot_table(index=['ID', 'Net Cost'], columns='_value',
                      values='value', aggfunc='first')
         .rename_axis(columns=None).reset_index().fillna(0))
Output:
>>> out
ID Net Cost Discount X Discount Y Surcharge A Surcharge B
0 1 30 -11.5 -3.25 9.5 2.5
1 2 40 -12.5 0 0 0
2 3 50 -11.5 0 0 0
3 4 35 -5.5 0 0 3.5
4 5 45 -10.5 0 9.5 4.5
You can use pivot_table after concatenating pair-wise:
import pandas as pd
df = pd.DataFrame.from_dict(
    {0: {0: 'ID', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5'},
     1: {0: 'Net Cost', 1: '30', 2: '40', 3: '50', 4: '35', 5: '45'},
     2: {0: 'Charge Description', 1: 'Surcharge A', 2: 'Discount X', 3: 'Discount X', 4: 'Discount X', 5: 'Surcharge A'},
     3: {0: 'Charge Amount', 1: '9.5', 2: '-12.5', 3: '-11.5', 4: '-5.5', 5: '9.5'},
     4: {0: 'Charge Description', 1: 'Discount X', 2: '', 3: '', 4: 'Surcharge B', 5: 'Discount X'},
     5: {0: 'Charge Amount', 1: '-11.5', 2: '', 3: '', 4: '3.5', 5: '-10.5'},
     6: {0: 'Charge Description', 1: 'Discount Y', 2: '', 3: '', 4: '', 5: 'Surcharge B'},
     7: {0: 'Charge Amount', 1: '-3.25', 2: '', 3: '', 4: '', 5: '4.5'},
     8: {0: 'Charge Description', 1: 'Surcharge B', 2: '', 3: '', 4: '', 5: ''},
     9: {0: 'Charge Amount', 1: '2.5', 2: '', 3: '', 4: '', 5: ''}})

# set the first row as the header
df.columns = df.iloc[0, :]
df.drop(index=0, inplace=True)

# stack the Description/Amount pairs under a single pair of columns
df = pd.concat([df.iloc[:, [0, 1, i, i + 1]]
                for i in range(2, len(df.columns), 2)]).replace('', 0)

print(df[df['Charge Description'] != 0]
      .pivot_table(columns='Charge Description', values='Charge Amount',
                   index=['ID', 'Net Cost'])
      .fillna(0))
Output:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 0.00 0.0 0.0
3 50 -11.5 0.00 0.0 0.0
4 35 -5.5 0.00 0.0 3.5
5 45 -10.5 0.00 9.5 4.5
I would use melt to stack the identically named columns, then pivot to create the outcome you want.
# Ensure the first line is now the column names, and then delete the first line.
df.columns = df.iloc[0]
df = df[1:]
# Create two melted df's, and join them on index.
df1 = df.melt(['ID', 'Net Cost'], ['Charge Description']).sort_values(by='ID').reset_index(drop=True)
df2 = df.melt(['ID', 'Net Cost'], ['Charge Amount']).sort_values(by='ID').reset_index(drop=True)
df1['Charge Amount'] = df2['value']
# Clean up a little, rename the added 'value' column from df1.
df1 = df1.drop(columns=[0]).rename(columns={'value': 'Charge Description'})
df1 = df1.dropna()
# Pivot the data.
df1 = df1.pivot(index=['ID', 'Net Cost'], columns='Charge Description', values='Charge Amount')
Result of df1:
Charge Description Discount X Discount Y Surcharge A Surcharge B
ID Net Cost
1 30 -11.5 -3.25 9.5 2.5
2 40 -12.5 NaN NaN NaN
3 50 -11.5 NaN NaN NaN
4 35 -5.5 NaN NaN 3.5
5 45 -10.5 NaN 9.5 4.5
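To match the requested output, where missing combinations show as 0 instead of NaN, a trailing fillna (not part of the answer above) would finish the job:

# fill the missing combinations with 0 and bring ID / Net Cost back as columns
df1 = df1.fillna(0).reset_index()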
My first thought was to read the data out into a list of dictionaries representing each row (making both the keys and values from the data values), then form a new dataframe from that.
For your example, that would make...
[
{
'ID': '1',
'Net Cost': '30',
'Discount X': '-11.5',
'Discount Y': '-3.25',
'Surcharge A': '9.5',
'Surcharge B': '2.5',
},
{
'ID': '2',
'Net Cost': '40',
'Discount X': '-12.5',
},
{
'ID': '3',
'Net Cost': '50',
'Discount X': '-11.5',
},
{
'ID': '4',
'Net Cost': '35',
'Discount X': '-5.5',
'Surcharge B': '3.5',
},
{
'ID': '5',
'Net Cost': '45',
'Discount X': '-10.5',
'Surcharge A': '9.5',
'Surcharge B': '4.5',
},
]
For the SMALL sample dataset, using comprehensions appears to be quite quick for that...
import pandas as pd
from itertools import chain

rows = [
    {
        name: value
        for name, value in chain(
            [
                ("ID", row[0]),
                ("Net Cost", row[1]),
            ],
            zip(row[2::2], row[3::2])  # pairs of columns: (2,3), (4,5), etc.
        )
        if name
    }
    for ix, row in df.iloc[1:].iterrows()  # skips the row with the column headers
]

df2 = pd.DataFrame(rows).fillna(0)
Demo (including timings of this and three other answers):
https://trinket.io/python3/555f860855
EDIT:
To sort the column names, add the following...
df2 = df2[['ID', 'Net Cost', *sorted(df2.columns[2:])]]

dropna() not working on dataframe with NaN values?

I have a data frame with Nan values. For some reason, df.dropna() doesn't work when I try to drop these rows. Any thoughts?
Example of a row:
30754 22 Nan Nan Nan Nan Nan Nan Jewellery-Women N
df = pd.read_csv('/Users/xxx/Desktop/CS 677/Homework_4/FashionDataset.csv')
df.dropna()
df.head().to_dict()
{'Unnamed: 0': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4},
'BrandName': {0: 'life',
1: 'only',
2: 'fratini',
3: 'zink london',
4: 'life'},
'Deatils': {0: 'solid cotton blend collar neck womens a-line dress - indigo',
1: 'polyester peter pan collar womens blouson dress - yellow',
2: 'solid polyester blend wide neck womens regular top - off white',
3: 'stripes polyester sweetheart neck womens dress - black',
4: 'regular fit regular length denim womens jeans - stone'},
'Sizes': {0: 'Size:Large,Medium,Small,X-Large,X-Small',
1: 'Size:34,36,38,40',
2: 'Size:Large,X-Large,XX-Large',
3: 'Size:Large,Medium,Small,X-Large',
4: 'Size:26,28,30,32,34,36'},
'MRP': {0: 'Rs\n1699',
1: 'Rs\n3499',
2: 'Rs\n1199',
3: 'Rs\n2299',
4: 'Rs\n1699'},
'SellPrice': {0: '849', 1: '2449', 2: '599', 3: '1379', 4: '849'},
'Discount': {0: '50% off',
1: '30% off',
2: '50% off',
3: '40% off',
4: '50% off'},
'Category': {0: 'Westernwear-Women',
1: 'Westernwear-Women',
2: 'Westernwear-Women',
3: 'Westernwear-Women',
4: 'Westernwear-Women'}}
This is what I get when using df.head().to_dict()
Try this:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [12, 20, np.nan, np.nan],
                   "col2": [10, np.nan, np.nan, 40]})
df1 = df.dropna()
# df;
col1 col2
0 12.0 10.0
1 20.0 NaN
2 NaN NaN
3 NaN 40.0
# df1;
col1 col2
0 12.0 10.0
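Two things are worth checking in the original code: dropna returns a new frame rather than modifying df in place, and the sample row shows the literal string 'Nan' rather than a real missing value, which dropna will not touch. A hedged sketch covering both:

import numpy as np

# dropna() returns a new frame; keep the result instead of discarding it
df = df.dropna()

# if the empty cells are actually the literal string 'Nan' (as in the sample row),
# convert them to real missing values first, then drop
df = df.replace('Nan', np.nan).dropna()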

Fuzzy Matching with different fuzz ratios

I have two large datasets. df1 is about 1m lines, and df2 is about 10m lines. I need to find matches for lines in df1 from df2.
I have posted an original version of this question separately. See here. It was well answered by @laurent, but I have some added specificities now. I would now like to:
1. Get the fuzz ratios for each of fname and lname in a column in my final matched dataframe.
2. Write the code such that the fuzz ratio for fname is set to >60, while the fuzz ratio for lname is set to >75. In other words, a true match occurs if the fuzz ratio for fname is >60 and the fuzz ratio for lname is >75; otherwise it is not a true match. A match would not be true if the fuzz ratio for fname were 80 while the fuzz ratio for lname were 60. While I understand that this can be done as post-hoc filtering after (1), it makes more sense to build it into the matching itself.
I post here an example of my data. The solution by @laurent for the original problem can be found in the above link.
import pandas as pd
df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)
df2 = pd.DataFrame(
    {
        "lname": {0: "Cupper", 1: "Cruise", 2: "Cruz", 3: "Couper"},
        "fname": {0: "Bradley", 1: "Tom", 2: "Thomas", 3: "M Brad"},
        "score": {0: 3, 1: 3.5, 2: 4, 3: 2.5},
    }
)
Expected output is:
df3 = pd.DataFrame(
    {
        "df1_ein": {0: 1001, 1: 1500, 2: 3000},
        "df1_ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "df1_lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "df1_fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
        "fuzz_ratio_lname": {0: 83, 1: 100, 2: pd.NA},
        "fuzz_ratio_fname": {0: 62, 1: 67, 2: pd.NA},
        "df2_lname": {0: "Couper", 1: "Cruise", 2: pd.NA},
        "df2_fname": {0: "M Brad", 1: "Tom", 2: pd.NA},
        "df2_score": {0: 2.5, 1: 3.5, 2: pd.NA},
    }
)
Note from the above expected output: Bradley Cupper is a bad match for Bradley Cooper based on the fuzz ratios that I assigned. The better match for Bradley Cooper is M Brad Couper. Similarly, Thomas Cruise matches with Tom Cruise rather than with Thomas Cruz.
I am a user of Stata primarily (haha) and the reclink2 ado file can do the above in theory, i.e. if Stata can handle the size of the data. However, with the size of data I have, nothing even starts after hours.
Here is one way to do it:
import pandas as pd
from fuzzywuzzy import fuzz

# Setup
df1.columns = [f"df1_{col}" for col in df1.columns]

# Add new columns
df1["fuzz_ratio_lname"] = (
    df1["df1_lname"]
    .apply(
        lambda x: max(
            [(value, fuzz.ratio(x, value)) for value in df2["lname"]],
            key=lambda x: x[1],
        )
    )
    .apply(lambda x: x if x[1] > 75 else pd.NA)
)
df1[["df2_lname", "fuzz_ratio_lname"]] = pd.DataFrame(
    df1["fuzz_ratio_lname"].tolist(), index=df1.index
)
df1 = (
    pd.merge(left=df1, right=df2, how="left", left_on="df2_lname", right_on="lname")
    .drop(columns="lname")
    .rename(columns={"fname": "df2_fname"})
)
df1["df2_fname"] = df1["df2_fname"].fillna(value="")
for i, (x, value) in enumerate(zip(df1["df1_fname"], df1["df2_fname"])):
    ratio = fuzz.ratio(x, value)
    df1.loc[i, "fuzz_ratio_fname"] = ratio if ratio > 60 else pd.NA

# Cleanup
df1["df2_fname"] = df1["df2_fname"].replace("", pd.NA)
df1 = df1[
    [
        "df1_ein",
        "df1_ein_name",
        "df1_lname",
        "df1_fname",
        "fuzz_ratio_lname",
        "fuzz_ratio_fname",
        "df2_lname",
        "df2_fname",
        "score",
    ]
]
print(df1)
# Output
df1_ein df1_ein_name df1_lname df1_fname fuzz_ratio_lname \
0 1001 H for Humanity Cooper Bradley 83.0
1 1500 Labor Union Cruise Thomas 100.0
2 3000 Something something Pitt Brad NaN
fuzz_ratio_fname df2_lname df2_fname score
0 62.0 Couper M Brad 2.5
1 67.0 Cruise Tom 3.5
2 <NA> <NA> <NA> NaN
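A side note on scale: with roughly 1m and 10m rows, the pairwise comparisons above will be slow. If installing rapidfuzz is an option (a faster reimplementation of the same scorers), the best-match lookup for one column can be sketched like this, using the question's df1/df2 as defined above (before the df1_ renaming); treat it as an illustration rather than a drop-in replacement for the answer:

import pandas as pd
from rapidfuzz import fuzz, process

# best-scoring last name in df2 for each last name in df1,
# keeping only matches whose ratio clears the 75 cutoff
choices = df2["lname"].tolist()
best = [
    process.extractOne(name, choices, scorer=fuzz.ratio, score_cutoff=75)
    for name in df1["lname"]
]
# extractOne returns (match, score, index), or None when nothing clears the cutoff
df1["df2_lname"] = [m[0] if m is not None else pd.NA for m in best]
df1["fuzz_ratio_lname"] = [m[1] if m is not None else pd.NA for m in best]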

Check if column in dataframe is missing values

I have a column full of state names.
I know how to iterate down through it, but I don't know what syntax to use to have it check for empty values as it goes. I tried isnull(), but that seems to be the wrong approach. Does anyone know a way?
I was thinking something like:
for state_name in datFrame.state_name:
    if datFrame.state_name.isnull():
        print('no name value')  # plus the other values from the row
    else:
        print('row is good')
df.head():
state_name state_ab city zip_code
0 Alabama AL Chickasaw 36611
1 Alabama AL Louisville 36048
2 Alabama AL Columbiana 35051
3 Alabama AL Satsuma 36572
4 Alabama AL Dauphin Island 36528
to_dict():
{'state_name': {0: 'Alabama',
1: 'Alabama',
2: 'Alabama',
3: 'Alabama',
4: 'Alabama'},
'state_ab': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL'},
'city': {0: 'Chickasaw',
1: 'Louisville',
2: 'Columbiana',
3: 'Satsuma',
4: 'Dauphin Island'},
'zip_code': {0: '36611', 1: '36048', 2: '35051', 3: '36572', 4: '36528'}}
Based on your description, you can use np.where to check if rows are either null or empty strings.
import numpy as np

df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
(MCVE) For example, suppose you have the following dataframe
state
0 New York
1 Nevada
2
3 None
4 New Jersey
then,
state status
0 New York Good
1 Nevada Good
2 Not Good
3 None Not Good
4 New Jersey Good
It's always worth mentioning that you should avoid loops whenever possible, because they are much slower than vectorized masking.
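If the goal is just to see which rows are missing a state name rather than to add a status column, a boolean mask does the same check without a loop. A sketch using the column names from your df.head():

# rows whose state_name is missing or an empty string
mask = datFrame['state_name'].isnull() | datFrame['state_name'].eq('')
print(datFrame[mask])    # rows with no state name value
print(datFrame[~mask])   # rows that are good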

Including missing combination of values based on a group of grouped data

I am expanding on an earlier thread: Including missing combinations of values in a pandas groupby aggregation
In the above thread, the accepted answer computes all possible combinations of the grouping variable. In this version, I'd like to compute combinations based on a group of groups.
Let's take an example.
Here's the input dataframe (also given in dict form below):
Here, one group is [Year,Quarter] i.e.
Year Quarter
2014 Q1
2015 Q2
2015 Q3
Another group is Name:
Name
Adam
Smith
Now, I want to apply groupby and sum such that missing combinations of the above groups show up as NaN.
Here's sample output:
I'd appreciate any help.
Here's sample input and output in dict format:
input=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Adam', 4: 'Smith'},
'Value': {0: 2, 1: 3, 2: 4, 3: 5, 4: 5}}
output=
{'Year': {0: 2014, 1: 2014, 2: 2015, 3: 2015, 4: 2015, 5: 2015},
'Quarter': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q2', 4: 'Q3', 5: 'Q3'},
'Name': {0: 'Adam', 1: 'Smith', 2: 'Adam', 3: 'Smith', 4: 'Smith', 5: 'Adam'},
'Value': {0: 2.0, 1: 3.0, 2: 9.0, 3: nan, 4: 5.0, 5: nan}}
Clarification:
I am looking for a method that avoids melt and cast, i.e. without playing around with long and wide formats.
The post you linked already shows the correct approach: groupby and sum, then unstack to surface the missing combinations, then stack with dropna=False (see the docs on stack):
df.groupby(['Year','Quarter','Name']).sum().unstack().stack(dropna=False).reset_index()
Year Quarter Name Value
0 2014 Q1 Adam 2.0
1 2014 Q1 Smith 3.0
2 2015 Q2 Adam 9.0
3 2015 Q2 Smith NaN
4 2015 Q3 Adam NaN
5 2015 Q3 Smith 5.0
Using pivot_table (you can add reset_index at the end):
df.pivot_table(index=['Year','Quarter'],columns='Name',values='Value',aggfunc='sum').stack(dropna=False)
Year Quarter Name
2014 Q1 Adam 2.0
Smith 3.0
2015 Q2 Adam 9.0
Smith NaN
Q3 Adam NaN
Smith 5.0
dtype: float64
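Since the clarification asks for a way that avoids going wide and back, one alternative (not from the answers above) is to build the full set of observed (Year, Quarter) pairs crossed with all names, and left-join the sums onto it. A sketch, assuming pandas 1.2+ for how='cross':

df = pd.DataFrame(input)  # the `input` dict shown in the question

sums = df.groupby(['Year', 'Quarter', 'Name'], as_index=False)['Value'].sum()
pairs = df[['Year', 'Quarter']].drop_duplicates()
names = df[['Name']].drop_duplicates()

# every observed (Year, Quarter) pair combined with every name,
# with missing combinations left as NaN by the left join
out = (pairs.merge(names, how='cross')
            .merge(sums, on=['Year', 'Quarter', 'Name'], how='left'))
print(out)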
