The sample data looks like this:
import pandas as pd

d = pd.DataFrame({'name': ['Adam', 'Adam', 'Bob', 'Bob', 'Craig'],
                  'number': [111, 222, 333, 444, 555],
                  'type': ['phone', 'fax', 'phone', 'phone', 'fax']})
name number type
------ ------ -----
Adam 111 phone
Adam 222 fax
Bob 333 phone
Bob 444 phone
Craig 555 fax
I am trying to convert the numbers (phone and fax) to a wide format; the ideal output is:
name fax phone
---- ----- -----
Adam 222.0 111.0
Bob NaN 333.0
Bob NaN 444.0
Craig 555.0 NaN
When I tried to use the pivot method and run the following code:
p = d.pivot(index='name', columns = 'type', values='number').reset_index()
I received the error ValueError: Index contains duplicate entries, cannot reshape because Bob has two phone numbers.
Is there a workaround for this?
You can use cumcount to create an additional key, then pivot_table:
d['key'] = d.groupby(['name', 'type']).cumcount()
p = d.pivot_table(index=['key', 'name'], columns='type', values='number', aggfunc='sum').reset_index()
p
Out[71]:
type key name fax phone
0 0 Adam 222.0 111.0
1 0 Bob NaN 333.0
2 0 Craig 555.0 NaN
3 1 Bob NaN 444.0
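If you want the frame ordered like the asker's ideal output, a small cleanup step (a sketch, not part of the original answer) can sort by name and drop the helper key:

p = p.sort_values('name').drop(columns='key').rename_axis(None, axis=1).reset_index(drop=True)
print(p)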
I have a dataframe of expense claims made by staff:
import pandas as pd
data = {'Claim ID': [1, 2, 3, 4, 5, 6, 7],
        'User': ['John', 'John', 'Jake', 'Bob', 'Bob', 'Tom', 'Tom'],
        'Category': ['Meal', 'Meal', 'Stationary', 'Phone Charges', 'Phone Charges', 'Transport', 'Transport'],
        'Amount': [12.00, 13.00, 20.00, 30.00, 30.00, 60.00, 60.00]}
df = pd.DataFrame(data)
Output:
Claim ID User Category Amount
1 John Meal 12.0
2 John Meal 13.0
3 Jake Stationary 20.0
4 Bob Phone Charges 30.0
5 Bob Phone Charges 30.0
6 Tom Transport 60.0
7 Tom Transport 60.0
I used the following code to find duplicate claims based on User, Category and Amount and gave a unique group number to each set of duplicates found:
# Tag each duplicate set with a unique number
conditions = ['User', 'Amount', 'Category']
df['Group'] = df.groupby(conditions).ngroup().add(1)
# Then remove groups with only one row
df = df[df.groupby('Group')['Group'].transform('count') > 1]
Output:
Claim ID User Category Amount Group
4 Bob Phone Charges 30.0 1
5 Bob Phone Charges 30.0 1
6 Tom Transport 60.0 5
7 Tom Transport 60.0 5
Now my question is: I want to find duplicates with the same User and Category, but instead of requiring the exact same Amount, I want to allow a tolerance of a few dollars in the amount claimed, say around $1. Using the sample dataframe above, the expected output would be:
Claim ID User Category Amount Group
1 John Meal 12.0 1
2 John Meal 13.0 1
4 Bob Phone Charges 30.0 2
5 Bob Phone Charges 30.0 2
6 Tom Transport 60.0 3
7 Tom Transport 60.0 3
I don't know if this is the fastest way, but it works, and it handles fuzzy conditions like tolerances well:
import numpy as np

df['group'] = np.piecewise(
    np.zeros(len(df)),
    [list((df.User.values == user)
          & (df.Category.values == category)
          & (df.Amount.values >= amount - 1)
          & (df.Amount.values <= amount + 1))
     for user, category, amount in zip(df.User.values, df.Category.values, df.Amount.values)],
    df['Claim ID'].values
)
df[df.groupby('group')['group'].transform('count') > 1]
# Result:
Claim ID User Category Amount group
0 1 John Meal 12.0 2.0
1 2 John Meal 13.0 2.0
3 4 Bob Phone Charges 30.0 5.0
4 5 Bob Phone Charges 30.0 5.0
5 6 Tom Transport 60.0 7.0
6 7 Tom Transport 60.0 7.0
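A different technique worth knowing (my sketch, not the answer above; it assumes the original, unfiltered frame): sort within User and Category and start a new group whenever consecutive amounts differ by more than the tolerance. Note that this chains tolerances transitively, so 12.0 and 13.9 could land in one group via 13.0.

tol = 1.0
df = df.sort_values(['User', 'Category', 'Amount'])
# A new group starts when User/Category changes or the amount jumps by more than tol
new_group = ((df['User'] != df['User'].shift())
             | (df['Category'] != df['Category'].shift())
             | (df['Amount'].diff().abs() > tol))
df['Group'] = new_group.cumsum()
df[df.groupby('Group')['Group'].transform('count') > 1]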
I need to make a function to expand a dataframe. For example, the input of the function is :
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
    'Cart': ['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
Suppose the value of n is 3. Then, for each person in the Name column, I have to add 3 more rows and leave Cart as np.nan. The output should look like this:
df = pd.DataFrame({
    'Name': ['Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha'],
    'Cart': ['book', 'phonecase', 'shirt', np.nan, np.nan, np.nan, 'phone', 'food', 'bag', np.nan, np.nan, np.nan]
})
How can I solve this using copy() and append()?
You can use np.repeat with pd.Series.unique:
n = 3
print(df.append(pd.DataFrame(np.repeat(df["Name"].unique(), n), columns=["Name"])))
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Sasha phone
4 Sasha food
5 Sasha bag
0 Ali NaN
1 Ali NaN
2 Ali NaN
3 Sasha NaN
4 Sasha NaN
5 Sasha NaN
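Note that DataFrame.append was removed in pandas 2.0. Under that assumption, an equivalent sketch with pd.concat:

n = 3
extra = pd.DataFrame({"Name": np.repeat(df["Name"].unique(), n)})
print(pd.concat([df, extra]))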
Try this one (it adds n rows to each group of rows sharing the same Name value):
import pandas as pd
import numpy as np
n = 3
list_of_df_unique_names = [df[df["Name"] == name] for name in df["Name"].unique()]
df2 = pd.concat([d.append(pd.DataFrame({"Name": np.repeat(d["Name"].values[-1], n)}))
                 for d in list_of_df_unique_names]).reset_index(drop=True)
print(df2)
Output:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
4 Ali NaN
5 Ali NaN
6 Sasha phone
7 Sasha food
8 Sasha bag
9 Sasha NaN
10 Sasha NaN
11 Sasha NaN
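On pandas 2.0+, where append is gone, a groupby-based sketch (my version, not the answerer's) produces the same interleaved result:

def pad(g, n=3):
    # Append n rows with Cart left as NaN to each name's group
    extra = pd.DataFrame({'Name': [g.name] * n, 'Cart': [np.nan] * n})
    return pd.concat([g, extra])

df2 = df.groupby('Name', group_keys=False).apply(pad).reset_index(drop=True)
print(df2)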
Maybe not the most beautiful of all solutions, but it works. Say that you want to add 4 NaN rows by group. Then, given your df:
df = pd.DataFrame({
    'Name': ['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
    'Cart': ['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
you can create an empty list DF and loop through range(4), filtering the df you had and adding an empty row in every loop:
DF = []
names = list(set(df.Name))
for i in range(4):
    for name in names:
        gf = df[df['Name'] == name]
        a = pd.concat([gf, gf.groupby('Name')['Cart'].apply(lambda x: x.shift(-1).iloc[-1]).reset_index()]).sort_values('Name').reset_index(drop=True)
        DF.append(a)
DF_full = pd.concat(DF)
Now, you'll end up with copies of your original df, so you need to dump them without dumping the NaN rows:
DFF = DF_full.sort_values(['Name','Cart'])
DFF = DFF[(~DFF.duplicated()) | (DFF['Cart'].isnull())]
which gives:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
3 Ali NaN
3 Ali NaN
3 Ali NaN
2 Sasha bag
1 Sasha food
0 Sasha phone
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
I have data in long format and am trying to reshape to wide, but there doesn't seem to be a straightforward way to do this using melt/stack/unstack:
Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2
Becomes:
Salesman Height product_1 price_1 product_2 price_2 product_3 price_3
Knut 6 bat 5 ball 1 wand 3
Steve 5 pen 2 NA NA NA NA
I think Stata can do something like this with the reshape command.
Here's another, more fleshed-out solution, taken from Chris Albon's site.
Create "long" dataframe
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
Make a "wide" data
df.pivot(index='patient', columns='obs', values='score')
A simple pivot might be sufficient for your needs, but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within group counter/index will get you most of the way there but the column labels will not be as you desired:
print(df.pivot(index='Salesman', columns='idx')[['product', 'price']])
product price
idx 0 1 2 0 1 2
Salesman
Knut bat ball wand 5 1 3
Steve pen NaN NaN 2 NaN NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman', columns='prod_idx', values='product')
prc = df.pivot(index='Salesman', columns='prc_idx', values='price')
reshape = pd.concat([product, prc], axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)
product_0 product_1 product_2 price_0 price_1 price_2 Height
Salesman
Knut bat ball wand 5 1 3 6
Steve pen NaN NaN 2 NaN NaN 5
Edit: if you want to generalize the procedure to more variables, I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product', 'price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman', columns='tmp_idx', values=var))
reshape = pd.concat(tmp, axis=1)
@Luke said:
I think Stata can do something like this with the reshape command.
You can, but I think you also need a within-group counter for reshape in Stata to produce your desired output:
+-------------------------------------------+
| salesman idx height product price |
|-------------------------------------------|
1. | Knut 0 6 bat 5 |
2. | Knut 1 6 ball 1 |
3. | Knut 2 6 wand 3 |
4. | Steve 0 5 pen 2 |
+-------------------------------------------+
If you add idx then you could do reshape in stata:
reshape wide product price, i(salesman) j(idx)
Karl D's solution gets at the heart of the problem. But I find it's far easier to pivot everything (with .pivot_table because of the two index columns) and then sort and assign the columns to collapse the MultiIndex:
df['idx'] = df.groupby('Salesman').cumcount() + 1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
                    values=['product', 'price'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x, y in df.columns]
df = df.reset_index()
Output:
Salesman Height price_1 product_1 price_2 product_2 price_3 product_3
0 Knut 6 5.0 bat 1.0 ball 3.0 wand
1 Steve 5 2.0 pen NaN NaN NaN NaN
A bit old but I will post this for other people.
What you want can be achieved, but you probably shouldn't want it ;)
Pandas supports hierarchical indexes for both rows and columns.
In Python 2.7.x ...
import pandas as pd
from StringIO import StringIO
raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')
print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')
This produces a representation that is probably more convenient than the one you were looking for:
price
product ball bat pen wand
Salesman Height
Knut 6 1 5 NaN 3
Steve 5 NaN NaN 2 NaN
The advantage of using set_index and unstack over a single function like pivot is that you can break the operations down into clear, small steps, which simplifies debugging.
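For Python 3, the same idea is a near-direct translation (only io.StringIO and the print() call change):

from io import StringIO

import pandas as pd

raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep=r'\s+')
print(dff.set_index(['Salesman', 'Height', 'product']).unstack('product'))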
pivoted = df.pivot(index='Salesman', columns='product', values='price')

(p. 192, Python for Data Analysis)
An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):
# pip install pyjanitor
import pandas as pd
import janitor
idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)
(df.assign(idx=idx)
   .pivot_wider(index=['Salesman', 'Height'], names_from='idx'))
Salesman Height product_1 product_2 product_3 price_1 price_2 price_3
0 Knut 6 bat ball wand 5.0 1.0 3.0
1 Steve 5 pen NaN NaN 2.0 NaN NaN
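Since pivot_wider is a wrapper, here is a plain-pandas sketch of the same reshape (my rendering, not pyjanitor's exact internals; it assumes pandas >= 1.1, where pivot accepts list arguments):

out = (df.assign(idx=idx)
         .pivot(index=['Salesman', 'Height'], columns='idx',
                values=['product', 'price']))
out.columns = [f'{a}_{b}' for a, b in out.columns]
out = out.reset_index()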
I have a DataFrame with 3 columns: ID, BossID and Name. Each row has a unique ID and has a corresponding name. BossID is the ID of the boss of the person in that row. Suppose I have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'bossId': [np.nan, 1, 2, 2, 3],
                   'name': ['Anne Boe', 'Ben Coe', 'Cate Doe', 'Dan Ewe', 'Erin Aoi']})
So here, Anne is Ben's boss and Ben Coe is Cate and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
   id  bossId      name boss_name
0   1     NaN  Anne Boe       NaN
1   2     1.0   Ben Coe  Anne Boe
2   3     2.0  Cate Doe   Ben Coe
3   4     2.0   Dan Ewe   Ben Coe
4   5     3.0  Erin Aoi  Cate Doe
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?
This should work:
bossmap = df.set_index('id')['name']
df['boss_name'] = df['bossId'].map(bossmap)
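For reference, bossmap is just the id-to-name lookup Series:

id
1    Anne Boe
2     Ben Coe
3    Cate Doe
4     Dan Ewe
5    Erin Aoi
Name: name, dtype: object

df['bossId'].map(bossmap) then looks each bossId up by value, so no index alignment is involved and Anne's missing boss simply maps to NaN.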
You can set id as the index, then use pd.Series.reindex:
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy() # or .to_list()
    bossId      name boss_name
id
1      NaN  Anne Boe       NaN
2      1.0   Ben Coe  Anne Boe
3      2.0  Cate Doe   Ben Coe
4      2.0   Dan Ewe   Ben Coe
5      3.0  Erin Aoi  Cate Doe
Create a separate dataframe with 'id' and 'name'.
Rename 'name' and set 'id' as the index.
Merge df with the new dataframe.
import numpy as np
import pandas as pd

# test dataframe
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'bossId': [np.nan, 1, 2, 2, 3],
                   'name': ['Anne Boe', 'Ben Coe', 'Cate Doe', 'Dan Ewe', 'Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
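A compact self-merge variant (a sketch of the same idea, without building a separate lookup frame):

df = df.merge(df[['id', 'name']].rename(columns={'id': 'bossId', 'name': 'boss_name'}),
              on='bossId', how='left')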
I am kind of new to data frames and have been trying to find a solution for shifting data in specific columns to another row when ID and CODE match.
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['123', '123', '123', '154', '167', '167'],
                   'NAME': ['Adam', 'Adam', 'Adam', 'Bob', 'Charlie', 'Charlie'],
                   'CODE': ['1001', '1001', '1011', '1002', 'A0101', 'A0101'],
                   'TAG1': ['A123', 'B123', 'K123', 'D123', 'E123', 'G123'],
                   'TAG2': [np.nan, 'C123', 'L123', np.nan, 'F123', 'H123'],
                   'TAG3': [np.nan, 'M123', np.nan, np.nan, np.nan, np.nan]})
ID NAME CODE TAG1 TAG2 TAG3
0 123 Adam 1001 A123 NaN NaN
1 123 Adam 1001 B123 C123 M123
2 123 Adam 1011 K123 L123 NaN
3 154 Bob 1002 D123 NaN NaN
4 167 Charlie A0101 E123 F123 NaN
5 167 Charlie A0101 G123 H123 NaN
The code above builds the initial data frame. Notice that ID '123' has two rows with the same code and varying values in the TAG columns. I would like to shift the TAG data from the second row into the first row, after TAG1 (or into TAG1 if it is empty), and delete the second row completely. The same applies to ID '167'.
Below is another snippet, with a sample data frame built manually, that depicts the desired final result; any suggestions would be great.
df_result = pd.DataFrame({'ID': ['123', '123', '154', '167'],
                          'NAME': ['Adam', 'Adam', 'Bob', 'Charlie'],
                          'CODE': ['1001', '1011', '1002', 'A0101'],
                          'TAG1': ['A123', 'K123', 'D123', 'E123'],
                          'TAG2': ['B123', 'L123', np.nan, 'F123'],
                          'TAG3': ['C123', np.nan, np.nan, 'G123'],
                          'TAG4': ['M123', np.nan, np.nan, 'H123']})
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
The code I tried is below; it gets close, but not exactly the output I wanted:
df2 = pd.pivot_table(test_shift, index=['ID', 'NAME', 'CODE'],
                     columns=test_shift.groupby(['ID', 'CODE']).cumcount().add(1),
                     values=['TAG1'], aggfunc='sum')
NOTE: Sorry for the bad posting the first time; I tried adding the dataframe as code to help you visually, but I failed. I will try to learn that over the coming days and be a better member of Stack Overflow.
Thank you for the help.
Here is one way:
def f(x):
    x = x.stack().to_frame(name='val')
    x = x.assign(tags='Tag' + x['val'].notna().cumsum().astype(str))
    x = (x.reset_index(level=3, drop=True)
          .set_index('tags', append=True)['val'].unstack())
    return x

df_out = (df.set_index(['ID', 'NAME', 'CODE'])
            .groupby(level=[0, 1, 2], as_index=False).apply(f)
            .reset_index().drop('level_0', axis=1))
print(df_out)
Output:
ID NAME CODE Tag1 Tag2 Tag3 Tag4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
An approach with pd.wide_to_long: flatten the TAG columns, map the index to the group defined by ID and CODE, then unstack and join the de-duplicated ID, NAME and CODE rows.
u = pd.wide_to_long(df.filter(like='TAG').reset_index(), 'TAG', 'index', 'j').dropna()
idx = u.index.get_level_values(0).map(df.groupby(['ID', 'CODE']).ngroup())
u = u.set_index(pd.MultiIndex.from_arrays((idx, u.groupby(idx).cumcount() + 1))).unstack()
u.columns = u.columns.map('{0[0]}{0[1]}'.format)
out = df[['ID', 'NAME', 'CODE']].drop_duplicates().reset_index(drop=True).join(u)
print(out)
ID NAME CODE TAG1 TAG2 TAG3 TAG4
0 123 Adam 1001 A123 B123 C123 M123
1 123 Adam 1011 K123 L123 NaN NaN
2 154 Bob 1002 D123 NaN NaN NaN
3 167 Charlie A0101 E123 F123 G123 H123
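For comparison, a third route (my sketch, not from either answer; it assumes pandas >= 1.1, where pivot accepts a list index): melt the TAG columns while keeping the original row index, restore row-major order so tags from one row stay together, renumber the surviving tags within each ID/CODE group, and pivot back:

long = (df.melt(id_vars=['ID', 'NAME', 'CODE'], value_name='TAG', ignore_index=False)
          .dropna(subset=['TAG'])
          .sort_index(kind='stable'))
long['j'] = long.groupby(['ID', 'CODE']).cumcount().add(1)
out = (long.pivot(index=['ID', 'NAME', 'CODE'], columns='j', values='TAG')
           .add_prefix('TAG').rename_axis(None, axis=1).reset_index())
print(out)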