I need to select rows within a pandas groupby based on the following conditions.
Condition 1: for a given group (R1, R2, W), if the TYPE A row's amount2 equals the TYPE B row's amount2, output the complete TYPE A row.
Condition 2: for a given group (R1, R2, W), if the TYPE A row's amount2 does not equal the TYPE B row's amount2, sum the amount1 and amount2 values of the TYPE A and TYPE B rows, and take the remaining columns from the TYPE A row.
Input dataframe
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 1 B 111 222 D 2.5
2 123 12 2 A 222 222 A 1.5
3 123 12 2 B 333 333 D 2.5
4 123 12 3 A 444 444 D 2.5
5 123 12 3 B 333 333 E 3.5
Expected output
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
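For reference, a minimal setup that reproduces the input above (a sketch; build df however your data actually arrives):
import pandas as pd

df = pd.DataFrame({
    'R1': [123] * 6, 'R2': [12] * 6, 'W': [1, 1, 2, 2, 3, 3],
    'TYPE': ['A', 'B'] * 3,
    'amount1': [111, 111, 222, 333, 444, 333],
    'amount2': [222, 222, 222, 333, 444, 333],
    'Status': ['C', 'D', 'A', 'D', 'D', 'E'],
    'Exchange': [1.5, 2.5, 1.5, 2.5, 2.5, 3.5],
})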
First, find all groups where the A and B rows have equal amount2: reshape with DataFrame.set_index and Series.unstack, compare the A and B columns with Series.eq, and map the result back to the original rows with DataFrame.join so the mask has the same length as the original:
df1 = df.set_index(['R1','R2','W','TYPE'])['amount2'].unstack()
m = df1['A'].eq(df1['B']).rename('m')
m = df.join(m, on=['R1','R2','W'])['m']
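For orientation, with the sample input the reshaped df1 has one column per TYPE, and the joined mask m flags the rows of groups whose two amount2 values match:
print (df1)
TYPE         A    B
R1  R2 W
123 12 1   222  222
       2   222  333
       3   444  333
print (m.tolist())
[True, True, False, False, False, False]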
Then, for the matching rows (here the first group), keep only the A rows with boolean indexing, chaining both conditions with & (bitwise AND):
df2 = df[m & df['TYPE'].eq('A')]
print (df2)
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
Then filter the remaining groups with the inverted mask ~ and aggregate with GroupBy.agg, using sum for the amount columns and GroupBy.first for all other columns:
cols = df.columns.difference(['R1','R2','W','amount1','amount2'])
d1 = dict.fromkeys(['amount1','amount2'], 'sum')
d2 = dict.fromkeys(cols, 'first')
df3 = df[~m].groupby(['R1','R2','W'], as_index=False).agg({**d1, **d2}).assign(TYPE='A')
print (df3)
R1 R2 W amount1 amount2 Exchange Status TYPE
0 123 12 2 555 555 1.5 A A
1 123 12 3 777 777 2.5 D A
Finally, join the two pieces with concat and, if necessary, sort with DataFrame.sort_values:
df4 = pd.concat([df2, df3], ignore_index=True, sort=False).sort_values(['R1','R2','W'])
print (df4)
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
Another solution:
#get the rows for A for each grouping
#assumption is TYPE is already sorted with A always ahead of B
core = ['R1','R2','W']
A = df.groupby(core).first()
#get rows for B for each grouping
B = df.groupby(core).last()
#first condition
cond1 = (A.amount1.eq(B.amount1)) & (A.amount2.eq(B.amount2))
#extract outcome from A to get the first part
part1 = A.loc[cond1]
#second condition
cond2 = A.amount2.ne(B.amount2)
#add the 'amount1' and 'amount2' columns based on the second condition
part2 = (B.loc[cond2].filter(['amount1','amount2'])
         + A.loc[cond2].filter(['amount1','amount2']))
#merge with A to get the remaining columns
part2 = part2.join(A[['TYPE','Status','Exchange']])
#merge part1 and 2 to get final result
pd.concat([part1,part2]).reset_index()
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
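Here part1 happens to precede part2 in group order; in general the two pieces can interleave, so a safer final step sorts on the group index first:
pd.concat([part1, part2]).sort_index().reset_index()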
Related
This is a DataFrame sample:
Folder Model
0 123 A
1 123 A
2 123 A
3 4541 A
4 4541 B
5 4541 C
6 4541 A
7 11 B
8 11 C
9 222 D
10 222 D
11 222 B
12 222 A
I need to separate the Folders that have items with Model A and also another Model (B, C, or D). The final DataFrame should look like this:
Folder Model
3 4541 A
4 4541 B
5 4541 C
6 4541 A
9 222 D
10 222 D
11 222 B
12 222 A
I suppose it is something in the groupby universe, but couldn't get to a conclusion. Any suggestions?
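For reference, a minimal setup that reproduces the sample:
import pandas as pd

df = pd.DataFrame({
    'Folder': [123] * 3 + [4541] * 4 + [11] * 2 + [222] * 4,
    'Model': ['A', 'A', 'A', 'A', 'B', 'C', 'A', 'B', 'C', 'D', 'D', 'B', 'A'],
})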
A group must contain 'A' and must not contain only 'A'.
Use GroupBy.filter:
(df
.groupby('Folder')
.filter(
lambda x: (x['Model'].eq('A').sum() > 0) & (x['Model'].ne('A').sum() > 0)
)
)
or, if you prefer, use transform + boolean indexing:
cond1 = (df
.groupby('Folder')['Model']
.transform(
lambda x: (x.eq('A').sum() > 0) & (x.ne('A').sum() > 0)
)
)
df[cond1]
You can use set operations (is the set of the Models per group greater than A alone?):
out = (df.groupby('Folder')
.filter(lambda x: set(x['Model'])>{'A'})
)
A bit longer, but potentially more efficient approach:
m = df.groupby('Folder')['Model'].agg(lambda x: set(x)>{'A'})
out = df[df['Folder'].isin(m[m].index)]
Output:
Folder Model
3 4541 A
4 4541 B
5 4541 C
6 4541 A
9 222 D
10 222 D
11 222 B
12 222 A
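The set comparison works because > on Python sets tests for a proper superset, so it is True only when a group contains 'A' plus at least one other model:
{'A', 'B'} > {'A'}   # True  -> keep the group
{'A'} > {'A'}        # False -> the group has only A
{'B', 'C'} > {'A'}   # False -> the group has no A at all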
My dataframe looks like this:
id std number
A 1 1
A 0 12
B 123.45 34
B 1 56
B 12 78
C 134 90
C 1234 100
C 12345 111
I'd like to select random ids while retaining all of the rows for each selected id, so that the dataframe would look like this:
id std number
A 1 1
A 0 12
C 134 90
C 1234 100
C 12345 111
I tried it with
size = 1000
replace = True
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df2 = df1.groupby('Id', as_index=False).apply(fn)
and
df2 = df1.sample(n=1000).groupby('id')
but obviously that didn't work. Any help would be appreciated.
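For reference, the sample above can be reproduced with:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'std': [1, 0, 123.45, 1, 12, 134, 1234, 12345],
    'number': [1, 12, 34, 56, 78, 90, 100, 111],
})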
You need to sample random ids first and then compare the original id column with Series.isin in boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
#replace=False guarantees N distinct ids (np.random.choice samples with replacement by default)
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N, replace=False))]
I have a pandas data frame as follows:
id group type action cost
101 A 1 10
101 A 1 repair 3
102 B 1 5
102 B 1 repair 7
102 B 1 grease 2
102 B 1 inflate 1
103 A 2 12
104 B 2 9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id group type action_std action_extra
101 A 1 10 3
102 B 1 5 10
103 A 2 12 0
104 B 2 9 0
In other words, for rows with an empty action field the cost value should go under the action_std column, while for rows with a non-empty action field the cost values should be summed under the action_extra column.
I've attempted several combinations of groupby / agg / pivot but cannot find a fully working solution...
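For reference, a sketch of this input, assuming the empty action cells are empty strings (the first answer below treats them as NaN instead, so for it replace '' with np.nan):
import pandas as pd

df = pd.DataFrame({
    'id': [101, 101, 102, 102, 102, 102, 103, 104],
    'group': ['A', 'A', 'B', 'B', 'B', 'B', 'A', 'B'],
    'type': [1, 1, 1, 1, 1, 1, 2, 2],
    'action': ['', 'repair', '', 'repair', 'grease', 'inflate', '', ''],
    'cost': [10, 3, 5, 7, 2, 1, 12, 9],
})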
I would suggest simply splitting the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np
result = df.assign(
cost_extra=lambda df: np.where(
df['action'].notnull(), df['cost'], np.nan
)
).assign(
cost=lambda df: np.where(
df['action'].isnull(), df['cost'], np.nan
)
).groupby(
["id", "group", "type"]
)["cost", "cost_extra"].agg(
"sum"
)
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
Check groupby with unstack (this treats the empty action cells as empty strings):
df.cost.groupby([df.id,df.group,df.type,df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
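To match the requested column names, a small follow-up (the rename labels are an assumption about the desired output):
out = (df.cost.groupby([df.id, df.group, df.type, df.action.eq('')])
         .sum().unstack(fill_value=0)
         .rename(columns={True: 'action_std', False: 'action_extra'})
         .reset_index())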
Thanks for your hints, this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])[["action_std", "action_extra"]].sum().reset_index()
I want to add two columns from two different dataframes on the condition that the name in column a matches:
import pandas as pd
df1 = pd.DataFrame([("Apple",2),("Litchi",4),("Orange",6)], columns=['a','b'])
df2 = pd.DataFrame([("Apple",200),("Orange",400),("Litchi",600)], columns=['a','c'])
Now I want to add columns b and c where the name in a is the same.
I tried df1['b+c'] = df1['b'] + df2['c'], but it simply adds the columns positionally, so the result comes out as
a b b+c
0 Apple 2 202
1 Litchi 4 404
2 Orange 6 606
but I want
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
I guess I have to use isin, but I am not getting how.
Columns b and c are aligned by index values in the sum operation, so it is necessary to create the index from column a with DataFrame.set_index:
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = (s1+s2).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202
1 Litchi 604
2 Orange 406
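An equivalent approach with Series.map, assuming the keys in df2['a'] are unique, keeps df1 intact and avoids reindexing:
df1['b+c'] = df1['b'] + df1['a'].map(df2.set_index('a')['c'])
Drop column b afterwards if you only want a and b+c.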
EDIT: If you need the original value for unmatched keys, use Series.add with the parameter fill_value=0:
df2 = pd.DataFrame([("Apple",200),("Apple",400),("Litchi",600)], columns=['a','c'])
print (df2)
a c
0 Apple 200
1 Apple 400
2 Litchi 600
s1 = df1.set_index('a')['b']
s2 = df2.set_index('a')['c']
df1 = s1.add(s2, fill_value=0).reset_index(name='b+c')
print (df1)
a b+c
0 Apple 202.0
1 Apple 402.0
2 Litchi 604.0
3 Orange 6.0
I have been working with multi-level DataFrames pretty recently, and I have found that they can significantly reduce computation time for large data sets. For example, consider the simple data frame:
df = pd.DataFrame([
[1, 111, 0], [2, 222, 0], [1, 111, 0],
[2, 222, 1], [1, 111, 1], [2, 222, 2]
], columns=["ID", "A", "B"], index=[1, 1, 2, 2, 3, 3]
)
df.head(6)
ID A B
1 1 111 0
1 2 222 0
2 1 111 0
2 2 222 1
3 1 111 1
3 2 222 2
which can be pivoted by ID to create a multi-level data frame:
pivot_df = df.pivot(columns="ID")
pivot_df.head()
A B
ID 1 2 1 2
1 111 222 0 0
2 111 222 0 1
3 111 222 1 2
The great thing about having my data in this format is that I can perform "vector" operations across all IDs simply by referencing the level 0 columns:
pivot_df["A"] * (1 + pivot_df["B"])**2
ID 1 2
1 111 222
2 111 888
3 444 999
These operations are really helpful for me! In real life, my computations are much more complex and need to be performed for > 1000 IDs. A common DataFrame size that I work with contains 10 columns (at level 0) with 1000 IDs (at level 1) with 350 rows.
I am interested in figuring out how to do two things: updating values for a particular field in this pivoted DataFrame, and creating a new column for it. Something like
pivot_df["A"] = pivot_df["A"] * (1 + pivot_df["B"])**2
or
pivot_df["C"] = pivot_df["A"] * (1 + pivot_df["B"])**2
I do not get any errors when I perform either of these, but the DataFrame remains unchanged. I have also tried using .loc and .iloc, but I am having no success.
I think the problem is maintaining the multi-level structure of the computed DataFrames, but I am pretty new to using multi-level DataFrames and not sure how to solve this efficiently. I have a clumsy workaround that is not efficient (create a dictionary of computed DataFrames and then merge them all together):
from collections import OrderedDict
from functools import reduce

import numpy as np
import pandas as pd

df_dict = OrderedDict()
df_dict["A"] = pivot_df["A"]
df_dict["B"] = pivot_df["B"]
df_dict["C"] = pivot_df["A"] * (1 + pivot_df["B"])**2
dfs = [val.T.set_index(np.repeat(key, val.shape[1]), append=True).T for key, val in df_dict.items()]
final_df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), dfs)
final_df.columns = final_df.columns.swaplevel(0, 1)
or similarly,
df_dict = OrderedDict()
df_dict["A"] = pivot_df["A"] * (1 + pivot_df["B"])**2
df_dict["B"] = pivot_df["B"]
dfs = [val.T.set_index(np.repeat(key, val.shape[1]), append=True).T for key, val in df_dict.items()]
final_df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), dfs)
final_df.columns = final_df.columns.swaplevel(0, 1)
This is not necessarily clunky (I was kind of proud of the workaround), but this is certainly not efficient or computationally optimized. Does anyone have any recommendations?
Option 1
Don't pivot first!
You stated that it was convenient to pivot because you could perform vector calculations in the new pivoted form. This is a misrepresentation, because you could just as easily have performed those calculations prior to the pivot.
df['C'] = df["A"] * (1 + df["B"]) ** 2
df.pivot(columns='ID')
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Or in a piped one-liner if you prefer
df.assign(C=df.A * (1 + df.B) ** 2).pivot(columns='ID')
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Option 2
pd.concat
But to answer your question...
pdf = df.pivot(columns='ID')
pd.concat([
pdf.A, pdf.B, pdf.A * (1 + pdf.B) ** 2
], axis=1, keys=['A', 'B', 'C'])
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Option 3
more pd.concat
Add another level to columns prior to concat
pdf = df.pivot(columns='ID')
c = pdf.A * (1 + pdf.B) ** 2
c.columns = [['C'] * len(c.columns), c.columns]
pd.concat([pdf, c], axis=1)
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
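A variant of Option 3 that avoids assigning the columns attribute manually: pd.concat can add the outer level itself via keys (a sketch reusing pdf and c from above):
pdf = df.pivot(columns='ID')
c = pdf.A * (1 + pdf.B) ** 2
pd.concat([pdf, pd.concat([c], axis=1, keys=['C'])], axis=1)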