pandas calculate difference based on a pattern - python

I have a pandas dataframe as
df
  Category NET    A    B  C_DIFF    1    2  DD_DIFF  .....
0      tom  CD   10   20     NaN   30   40      NaN
1      tom  CD  100  200     NaN  300  400      NaN
2      tom  CD  100  200     NaN  300  400      NaN
3      tom  CD  100  200     NaN  300  400      NaN
4      tom  CD  100  200     NaN  300  400      NaN
Now the columns whose names end with _DIFF (i.e. C_DIFF and DD_DIFF) should hold the difference of the two columns immediately before them: A-B should go into C_DIFF and 1-2 into DD_DIFF. How do I get this desired output?
Edit: there are 20 columns ending with _DIFF, so this needs to be done programmatically rather than hard-coding the column names.

Generalizing this:
# integer positions of all columns whose name contains 'DIFF'
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
# for each DIFF column, subtract its immediate predecessor from the column before that
df.iloc[:, m] = pd.concat([df.iloc[:, a] - df.iloc[:, b]
                           for a, b in zip(m - 2, m - 1)], axis=1).values
print(df)
  Category NET    A    B  C_DIFF    1    2  DD_DIFF
0      tom  CD   10   20     -10   30   40      -10
1      tom  CD  100  200    -100  300  400     -100
2      tom  CD  100  200    -100  300  400     -100
3      tom  CD  100  200    -100  300  400     -100
4      tom  CD  100  200    -100  300  400     -100
Explanation:
df.filter(like='DIFF') selects the columns whose names contain 'DIFF'.
df.columns.get_indexer (i.e. pd.Index.get_indexer) returns the integer positions of those columns.
We then zip those positions with the positions of the two columns before each of them, compute the differences in a list comprehension, concat the results, and finally take .values to assign them back into the DIFF columns.
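To make the indexing concrete, here is what the intermediates look like on a one-row version of the sample frame (a quick sketch, reusing the column layout from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame([['tom', 'CD', 10, 20, np.nan, 30, 40, np.nan]],
                  columns=['Category', 'NET', 'A', 'B', 'C_DIFF', '1', '2', 'DD_DIFF'])

m = df.columns.get_indexer(df.filter(like='DIFF').columns)
print(m)                        # [4 7] -> positions of C_DIFF and DD_DIFF
print(list(zip(m - 2, m - 1)))  # pairs (2, 3) and (5, 6) -> (A, B) and ('1', '2')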
EDIT:
To handle columns containing strings, you can use pd.to_numeric() with errors='coerce', which turns anything that cannot be parsed into NaN instead of raising:
m = df.columns.get_indexer(df.filter(like='DIFF').columns)
df.iloc[:, m] = pd.concat([pd.to_numeric(df.iloc[:, a], errors='coerce') -
                           pd.to_numeric(df.iloc[:, b], errors='coerce')
                           for a, b in zip(m - 2, m - 1)], axis=1).values
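As a quick illustration of what errors='coerce' does on its own:
import pandas as pd

print(pd.to_numeric(pd.Series(['10', 'oops', None]), errors='coerce').tolist())
# [10.0, nan, nan] -- unparsable values become NaN rather than raising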

Related

Pandas: Row reduction via Groupby

For example, I have a simple DataFrame like:
index  data1  replace_me  agg_me  ID
1      100    (+)         25      1
2      200    (-)         35      2
3      200    (+)         45      2
4      300    (+)         55      3
5      400    (+)         10      4
6      400    (+)         10      4
7      400    (-)         10      4
8      400    (-)         10      4
I am trying to aggregate certain rows together, namely those whose ID occurs more than once. Where len(groupby ID) > 1, I am looking to:
Add the "agg_me" column values together
Replace (-) and (+) with (=)
Enter min(agg_me) / sum(agg_me) into a new column called "Percent"
Do this such that it only "pairs" off rows, i.e. it doesn't collapse 4 rows into 1.
So as a result:
index  data1  replace_me  agg_me  ID  Percent
1      100    (+)         25      1   0
2      200    (=)         80      2   0.4375
4      300    (+)         55      3   0
5      400    (=)         20      4   0.5
6      400    (=)         20      4   0.5
Any help is appreciated!
Try this:
# mark rows whose ID occurs more than once
vc = df['ID'].map(df['ID'].value_counts()).gt(1)
pd.concat([df.loc[~vc],  # rows with unique IDs pass through unchanged
           df.loc[vc]
             # pair off consecutive rows within each ID
             .groupby(['ID', df.groupby('ID').cumcount().floordiv(2)])
             .agg(index=('index', 'first'),
                  data1=('data1', 'first'),
                  replace_me=('replace_me', lambda x: '(=)'),
                  agg_me=('agg_me', 'sum'),
                  Percent=('agg_me', lambda x: x.min() / x.sum()))
             .reset_index(level=0)]).fillna(0).sort_values('ID').reset_index(drop=True)
Output:
   index  data1 replace_me  agg_me  ID  Percent
0      1    100        (+)      25   1   0.0000
1      2    200        (=)      80   2   0.4375
2      4    300        (+)      55   3   0.0000
3      5    400        (=)      20   4   0.5000
4      7    400        (=)      20   4   0.5000
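The pairing comes from the cumcount().floordiv(2) key: within each ID the rows are numbered 0, 1, 2, ... and integer division by 2 puts consecutive rows into pairs. A quick sketch of that key on the sample frame (assuming the IDs [1, 2, 2, 3, 4, 4, 4, 4] from the question):
pair = df.groupby('ID').cumcount().floordiv(2)
print(pair.tolist())  # [0, 0, 0, 0, 0, 0, 1, 1]
# ID 2's two rows share key 0 -> one pair; ID 4's four rows get keys
# 0, 0, 1, 1 -> two pairs, so its 4 rows collapse to 2 rows, not 1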

Python loop for calculating sum of column values in pandas

I have the below data frame (a single column col with group labels followed by their values):
col
a
100
200
200
b
20
30
40
c
400
50
I need help calculating the sum of values for each item and placing it in a second column, which ideally should look like below:
a 500
100
200
200
b 90
20
30
40
c 450
400
50
If you need sums by groups, with column col converted to numeric, use GroupBy.transform; the non-numeric values are the group labels, repeated downwards via ffill:
s = pd.to_numeric(df['col'], errors='coerce')  # numbers; letter rows become NaN
mask = s.isna()                                # True on the letter (label) rows
df.loc[mask, 'new'] = s.groupby(df['col'].where(mask).ffill()).transform('sum')
print(df)
col new
0 a 500.0
1 100 NaN
2 200 NaN
3 200 NaN
4 b 90.0
5 20 NaN
6 30 NaN
7 40 NaN
8 c 450.0
9 400 NaN
10 50 NaN
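To see why this works, here is what the grouping key looks like on the sample data (a quick illustration):
import pandas as pd

df = pd.DataFrame({'col': ['a', 100, 200, 200, 'b', 20, 30, 40, 'c', 400, 50]})

s = pd.to_numeric(df['col'], errors='coerce')  # letters -> NaN
mask = s.isna()                                # True on 'a', 'b', 'c' rows
key = df['col'].where(mask).ffill()            # keep only the labels, then forward-fill
print(key.tolist())
# ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c']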
Or, to show empty strings instead of NaN (this needs numpy imported as np, and the transform result stored first):
new = s.groupby(df['col'].where(mask).ffill()).transform('sum')
df['new'] = np.where(mask, new.astype(int), '')
print(df)
col new
0 a 500
1 100
2 200
3 200
4 b 90
5 20
6 30
7 40
8 c 450
9 400
10 50

Shrink the dataset by taking mean or median

Assuming I have the following dataframe df:
Number Apples
1 40
2 50
3 60
4 70
5 80
6 90
7 100
8 110
9 120
I want to shrink this dataset and create a dataframe df2 with only 3 observations: take the average of rows 1-3 as the first row, rows 4-6 as the second row, and rows 7-9 as the third row.
The end result will be the following
Number Apples
1 50
2 80
3 110
This is a simpler approach and should run much faster than a groupby. rolling(3).mean() computes a 3-row moving average at every position, and the [2::3] slice keeps every third row starting from position 2, i.e. exactly the rows where a full window of three values has been seen:
df.rolling(3).mean()[2::3]
   apples
2    50.0
5    80.0
8   110.0
You can do:
n = 3
s = df.groupby((df.Number - 1) // n).Apples.mean()
Number
0     50
1     80
2    110
Name: Apples, dtype: int64
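The grouping key here is plain integer division: with n = 3, (df.Number - 1) // n maps Numbers 1-3 to group 0, 4-6 to group 1, and 7-9 to group 2. A one-line check (illustrative):
print(((df.Number - 1) // 3).tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]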

Merge Columns with the Same name in the same dataframe if null

I have a dataframe that looks like this
   Depth  DT  DT  DT   GR  GR  GR
1    100 NaN  45 NaN  100  50 NaN
2    200 NaN  45 NaN  100  50 NaN
3    300 NaN  45 NaN  100  50 NaN
4    400 NaN NaN  50  100  50 NaN
5    500 NaN NaN  50  100  50 NaN
I need to merge the same-name columns into one, keeping the first non-null value across the duplicates for each row.
In the end the data frame should look like
Depth DT GR
1 100 45 100
2 200 45 100
3 300 45 100
4 400 50 100
5 500 50 100
I am a beginner in pandas. I tried drop_duplicates, but it couldn't do what I wanted. Any suggestions?
IIUC, you can do:
(df.set_index('Depth')
   .groupby(level=0, axis=1).first()  # group duplicate column names, keep first non-null
   .reset_index())
output:
Depth DT GR
0 100 45.0 100.0
1 200 45.0 100.0
2 300 45.0 100.0
3 400 50.0 100.0
4 500 50.0 100.0
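Note: DataFrame.groupby(..., axis=1) is deprecated in recent pandas (2.1+). An equivalent that avoids it is to group on the transposed frame instead (a sketch of the same idea):
out = (df.set_index('Depth')
         .T.groupby(level=0).first()  # rows are now the duplicated column names
         .T.reset_index())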

Removing Duplicate columns while doing pandas merge

I have a table in a pandas dataframe df1:
id product_1 product_2 count
1 100 200 10
2 200 600 20
3 100 500 30
4 400 100 40
5 500 700 50
6 200 500 60
7 100 400 70
I also have another table in dataframe df2:
product price
100 5
200 10
300 15
400 20
500 25
600 30
700 35
I wanted to merge df2 with df1 so that I get price_x and price_y as columns, and then divide price_y by price_x to get a final column perc_diff.
So I tried to merge using:
# Add prices for products 1 and 2
df3 = (df1
       .merge(df2, left_on='product_1', right_on='product')
       .merge(df2, left_on='product_2', right_on='product'))
# Calculate the percent difference
df3['perc_diff'] = (df3.price_y - df3.price_x) / df3.price_x
But when I did the merge I got multiple columns of product_1 and product_2.
For example, my df3.head(1) after merging is:
   id  product_1  product_2  count  product_1  product_2  price_x  price_y
    1        100        200     10        100        200        5       10
So how do I remove these duplicate product_1 and product_2 columns, either while merging or afterwards?
df2_ = df2.set_index('product')
# join matches on df2_'s index, so no extra 'product' column is brought in
df3 = df.join(df2_, on='product_1') \
        .join(df2_, on='product_2', lsuffix='_x', rsuffix='_y')
df3.assign(perc_diff=df3.price_y.div(df3.price_x).sub(1))
To remove the duplicate column, it is necessary to rename before the second merge:
df3 = df1.merge(df2, left_on='product_1', right_on='product') \
         .merge(df2.rename(columns={'product': 'product_2'}), on='product_2')
# borrowed from piRSquared's solution
df3 = df3.assign(perc_diff=df3.price_y.div(df3.price_x).sub(1))
print(df3)
   id  product_1  product_2  count  product  price_x  price_y  perc_diff
0   1        100        200     10      100        5       10       1.00
1   3        100        500     30      100        5       25       4.00
2   6        200        500     60      200       10       25       1.50
3   7        100        400     70      100        5       20       3.00
4   2        200        600     20      200       10       30       2.00
5   4        400        100     40      400       20        5      -0.75
6   5        500        700     50      500       25       35       0.40
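Note the merged result still carries the helper product column from the first merge; if it is unwanted it can be dropped afterwards (a small follow-up):
df3 = df3.drop(columns='product')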
