I have a table in a pandas DataFrame, df1:
id product_1 product_2 count
1 100 200 10
2 200 600 20
3 100 500 30
4 400 100 40
5 500 700 50
6 200 500 60
7 100 400 70
I also have another table in a DataFrame, df2:
product price
100 5
200 10
300 15
400 20
500 25
600 30
700 35
I want to merge df2 into df1 so that I get price_x and price_y as columns,
and then divide price_y by price_x to get a final perc_diff column.
So I tried to do the merge with:
# Add prices for products 1 and 2
df3 = (df1.
merge(df2, left_on='product_1', right_on='product').
merge(df2, left_on='product_2', right_on='product'))
# Calculate the percent difference
df3['perc_diff'] = (df3.price_y - df3.price_x) / df3.price_x
But the merge gave me duplicate product_1 and product_2 columns.
For example, my df3.head(1) after merging is:
id product_1 product_2 count product_1 product_2 price_x price_y
1 100 200 10 100 200 5 10
How do I remove these duplicate columns of product_1 and product_2, either during or after the merge?
df2_ = df2.set_index('product')
df3 = df1.join(df2_, on='product_1') \
         .join(df2_, on='product_2', lsuffix='_x', rsuffix='_y')
df3 = df3.assign(perc_diff=df3.price_y.div(df3.price_x).sub(1))
To avoid the duplicate column, it is necessary to rename product before the second merge:
df3 = df1.merge(df2, left_on='product_1', right_on='product') \
.merge(df2.rename(columns={'product':'product_2'}), on='product_2')
# perc_diff borrowed from piRSquared's solution
df3 = df3.assign(perc_diff=df3.price_y.div(df3.price_x).sub(1))
print (df3)
id product_1 product_2 count product price_x price_y perc_diff
0 1 100 200 10 100 5 10 1.00
1 3 100 500 30 100 5 25 4.00
2 6 200 500 60 200 10 25 1.50
3 7 100 400 70 100 5 20 3.00
4 2 200 600 20 200 10 30 2.00
5 4 400 100 40 400 20 5 -0.75
6 5 500 700 50 500 25 35 0.40
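A merge is not strictly required here. As an alternative sketch (values copied from the tables above), mapping each product column against a price lookup avoids the duplicate columns entirely:

```python
import pandas as pd

# Rebuild the question's frames
df1 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'product_1': [100, 200, 100, 400, 500, 200, 100],
                    'product_2': [200, 600, 500, 100, 700, 500, 400],
                    'count': [10, 20, 30, 40, 50, 60, 70]})
df2 = pd.DataFrame({'product': [100, 200, 300, 400, 500, 600, 700],
                    'price': [5, 10, 15, 20, 25, 30, 35]})

# Map each product column through a price lookup Series,
# so no extra product columns ever appear
prices = df2.set_index('product')['price']
df3 = df1.assign(price_x=df1['product_1'].map(prices),
                 price_y=df1['product_2'].map(prices))
df3['perc_diff'] = df3['price_y'] / df3['price_x'] - 1
```

This keeps the row order of df1, whereas the chained merges above reorder the rows.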
I have two DataFrames (first, second):
index_first  value_1  value_2
0            100      1
1            200      2
2            300      3

index_second  value_1  value_2
0             50       10
1             100      20
2             150      30
Next I concat the two DataFrames with keys:
z = pd.concat([first, second],keys=['x','y'])
My goal is to calculate the cumulative sum of value_1 and value_2 in z considering the keys.
So the final DataFrame should look like this:
index_z  value_1  value_2
x,0      100      1
x,1      300      3
x,2      600      6
y,0      50       10
y,1      150      30
y,2      300      60
Use GroupBy.cumsum, grouping by the first index level created by keys in concat:
df = z.groupby(level=0).cumsum()
print (df)
value_1 value_2
index_first
x 0 100 1
1 300 3
2 600 6
y 0 50 10
1 150 30
2 300 60
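Putting it together as a self-contained sketch (the frame contents are taken from the tables above):

```python
import pandas as pd

first = pd.DataFrame({'value_1': [100, 200, 300], 'value_2': [1, 2, 3]})
second = pd.DataFrame({'value_1': [50, 100, 150], 'value_2': [10, 20, 30]})

# keys= adds an outer index level ('x'/'y'); grouping on that level
# restarts the cumulative sum for each original frame
z = pd.concat([first, second], keys=['x', 'y'])
out = z.groupby(level=0).cumsum()
```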
I have the data frame below:
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of the values under each letter and placing it in a second column, which ideally should look like this:
a 500
100
200
200
b 90
20
30
40
c 450
400
50
If you need sums per group: convert column col to numeric, forward-fill the non-numeric labels to build the groups, and use GroupBy.transform:
s = pd.to_numeric(df['col'], errors='coerce')
mask = s.isna()
df.loc[mask, 'new'] = s.groupby(df['col'].where(mask).ffill()).transform('sum')
print (df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
Or, using numpy.where to leave empty strings in the non-label rows (note that new must be computed first):
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df['new'] = np.where(mask, new.astype(int), '')
print (df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50
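The same steps as a self-contained sketch (the column values are the ones from the question):

```python
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})

s = pd.to_numeric(df['col'], errors='coerce')   # NaN on the label rows
mask = s.isna()                                  # True for 'a', 'b', 'c'
groups = df['col'].where(mask).ffill()           # repeat each label downwards
df.loc[mask, 'new'] = s.groupby(groups).transform('sum')
```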
This is my dataframe after pivoting:
Country London Shanghai
PriceRange 100-200 200-300 300-400 100-200 200-300 300-400
Code
A 1 1 1 2 2 2
B 10 10 10 20 20 20
Is it possible to add columns after every country to achieve the following:
Country London Shanghai All
PriceRange 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal 100-200 200-300 300-400 SubTotal
Code
A 1 1 1 3 2 2 2 6 3 3 3 9
B 10 10 10 30 20 20 20 60 30 30 30 90
I know I can use margins=True, however that just adds a final grand total.
Are there any options that I can use to achieve this? Thanks.
Let's use sum with join:
s = df.sum(level=0, axis=1)
s.columns = pd.MultiIndex.from_product([list(s), ['subgroup']])
df = df.join(s).sort_index(level=0, axis=1).assign(Group=df.sum(axis=1))
df
A B Group
1 2 3 subgroup 1 2 3 subgroup
Code
A 1 1 1 3 2 2 2 6 9
B 10 10 10 30 20 20 20 60 90
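Note that `df.sum(level=0, axis=1)` has since been removed from pandas (2.0+). As a sketch against the question's actual London/Shanghai frame, the same per-country subtotal can be built by transposing and grouping:

```python
import pandas as pd

# Rebuild the pivoted frame from the question
cols = pd.MultiIndex.from_product(
    [['London', 'Shanghai'], ['100-200', '200-300', '300-400']],
    names=['Country', 'PriceRange'])
df = pd.DataFrame([[1, 1, 1, 2, 2, 2], [10, 10, 10, 20, 20, 20]],
                  index=pd.Index(['A', 'B'], name='Code'), columns=cols)

# Sum within each country (level 0 of the columns) via a transpose,
# label the new columns 'SubTotal', and interleave them back
s = df.T.groupby(level=0).sum().T
s.columns = pd.MultiIndex.from_product([s.columns, ['SubTotal']],
                                       names=['Country', 'PriceRange'])
out = df.join(s).sort_index(axis=1, level=0)
```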
I have a dataframe that looks like this
Depth DT DT DT GR GR GR
1 100 NaN 45 NaN 100 50 NaN
2 200 NaN 45 NaN 100 50 NaN
3 300 NaN 45 NaN 100 50 NaN
4 400 NaN NaN 50 100 50 NaN
5 500 NaN NaN 50 100 50 NaN
I need to merge the same name columns into one if there are null values and keep the first occurrence of the column if other columns are not null.
In the end the data frame should look like
Depth DT GR
1 100 45 100
2 200 45 100
3 300 45 100
4 400 50 100
5 500 50 100
I am a beginner in pandas. I tried drop_duplicates, but it didn't do what I wanted. Any suggestions?
IIUC, you can do:
(df.set_index('Depth')
.groupby(level=0, axis=1).first()
.reset_index())
output:
Depth DT GR
0 100 45.0 100.0
1 200 45.0 100.0
2 300 45.0 100.0
3 400 50.0 100.0
4 500 50.0 100.0
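The `axis=1` form of groupby is deprecated in recent pandas; as a self-contained sketch, transposing first gives the same first-non-null collapse:

```python
import pandas as pd
import numpy as np

# Rebuild the frame with duplicate column names from the question
df = pd.DataFrame([[100, np.nan, 45, np.nan, 100, 50, np.nan],
                   [200, np.nan, 45, np.nan, 100, 50, np.nan],
                   [300, np.nan, 45, np.nan, 100, 50, np.nan],
                   [400, np.nan, np.nan, 50, 100, 50, np.nan],
                   [500, np.nan, np.nan, 50, 100, 50, np.nan]],
                  columns=['Depth', 'DT', 'DT', 'DT', 'GR', 'GR', 'GR'])

# Group the transposed rows by name and keep the first non-null value
out = (df.set_index('Depth')
         .T.groupby(level=0).first()
         .T.reset_index())
```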
I have a pandas DataFrame df:
Category NET A B C_DIFF 1 2 DD_DIFF .....
0 tom CD 10 20 NaN 30 40 NaN
1 tom CD 100 200 NaN 300 400 NaN
2 tom CD 100 200 NaN 300 400 NaN
3 tom CD 100 200 NaN 300 400 NaN
4 tom CD 100 200 NaN 300 400 NaN
Now the columns whose names end with _DIFF (i.e. C_DIFF and DD_DIFF) should hold the difference of the two columns immediately before them: A - B should go into C_DIFF, and 1 - 2 into DD_DIFF. How do I get this desired output?
Edit: there are 20 columns ending with _DIFF, so I need to do this programmatically rather than hard-coding the columns.
Generalizing this:
m=df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:,m]=pd.concat([df.iloc[:,a]-df.iloc[:,b] for a,b in zip(m-2,m-1)],axis=1).values
print(df)
Category NET A B C_DIFF 1 2 DD_DIFF
0 tom CD 10 20 -10 30 40 -10
1 tom CD 100 200 -100 300 400 -100
2 tom CD 100 200 -100 300 400 -100
3 tom CD 100 200 -100 300 400 -100
4 tom CD 100 200 -100 300 400 -100
Explanation:
df.filter(like='DIFF') selects the columns whose names contain 'DIFF'.
df.columns.get_indexer (i.e. pd.Index.get_indexer) returns the positional indices of those columns.
We then zip the shifted positions, compute each difference, concat the results into one frame, and finally use .values to assign them back.
EDIT:
To handle strings you can take help of pd.to_numeric() with errors='coerce':
m=df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:,m]=pd.concat([pd.to_numeric(df.iloc[:,a],errors='coerce')-
pd.to_numeric(df.iloc[:,b],errors='coerce') for a,b in zip(m-2,m-1)],axis=1).values
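Putting the generalized version together as a runnable sketch (two rows of the question's data, with the _DIFF columns starting as NaN):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': ['tom', 'tom'], 'NET': ['CD', 'CD'],
                   'A': [10, 100], 'B': [20, 200], 'C_DIFF': [np.nan, np.nan],
                   '1': [30, 300], '2': [40, 400], 'DD_DIFF': [np.nan, np.nan]})

# Positions of the *_DIFF columns; each one gets
# (column two to its left) - (column one to its left)
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:, m] = pd.concat([df.iloc[:, a] - df.iloc[:, b]
                           for a, b in zip(m - 2, m - 1)], axis=1).values
```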