I have 2 dataframes which look like this:
df1 = pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'B': ['C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2', 'C2'],
'rank': [2, 5, 1, 8, 6, 3, 4, 7]})
Out[3]:
A B rank
0 A C1 2
1 B C1 5
2 C C1 1
3 D C2 8
4 E C2 6
5 F C2 3
6 G C2 4
7 H C2 7
df2 = pd.DataFrame({'B': ['C1', 'C1', 'C1', 'C2'],
'C': [1, 2, 3, 4]})
Out[6]:
B C
0 C1 1
1 C1 2
2 C1 3
3 C2 4
I would like to select the 3 highest-ranked rows (by column "rank") from df1, but I can only select a maximum of 4 names per group (column B), and this maximum must include the count of rows already present in each group in df2.
The resulting dataframe should look like this:
A B rank
2 C C1 1
5 F C2 3
6 G C2 4
Logic:
The count of rows in df2 for group C1 is 3 (leaving a maximum of 1 more row to select from this group in df1), and the count for C2 is 1 (leaving a maximum of 3 more rows to select from df1).
Item C has the highest rank, so it gets selected; the total count of group C1 is now 4.
Items F and G are the next highest ranked and belong to group C2; that brings its total count to 3, which is still less than 4.
I tried the following:
df1.sort_values('rank').groupby('B').head(4).head(5)
but this only limits the selection to a maximum of 4 rows per group within df1 and ignores the row counts from df2
Here's an idea:
max_per_group = 4
# maximal rows to pick from each group
max_sizes = max_per_group - df2.groupby('B').size()
# up to 4 top-ranked rows from each group
heads = df1.sort_values('rank').groupby('B').head(max_per_group)
# enumerate the rows within each group
enum = heads.groupby('B').cumcount()
# output
heads[enum < heads['B'].map(max_sizes).fillna(max_per_group)].head(3)
Output:
A B rank
2 C C1 1
5 F C2 3
6 G C2 4
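To make the masking step concrete, here is what the intermediate objects look like for the sample data (a quick sketch; the values in the comments are computed from the frames above):
print(max_sizes)
# B
# C1    1    <- df2 already holds 3 rows for C1, so at most 1 more may come from df1
# C2    3    <- df2 holds 1 row for C2, so at most 3 more may come from df1
# dtype: int64
print(heads.assign(enum=enum))
#    A   B  rank  enum
# 2  C  C1     1     0
# 0  A  C1     2     1
# 5  F  C2     3     0
# 6  G  C2     4     1
# 1  B  C1     5     2
# 4  E  C2     6     2
# 7  H  C2     7     3
Rows with enum < max_sizes for their group are kept (C, F, G, E), and .head(3) then returns C, F and G.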
First, find the number remaining by group:
In [4]: remaining = (4 - df2.groupby('B').size()).to_dict()
Then, select that number from each sorted group in your groupby:
In [5]: (
...: df1.sort_values('rank').groupby('B').apply(
...: lambda x: x.sort_values('rank').head(remaining.get(x.name, 4))
...: ).sort_values('rank').iloc[:3].reset_index('B', drop=True)
...: )
Out[5]:
A B rank
2 C C1 1
5 F C2 3
6 G C2 4
I have a dataframe which looks like this:
pd.DataFrame(
{
'A':
[
'C1', 'C1', 'C1', 'C1',
'C2', 'C2', 'C2', 'C2',
'C3', 'C3', 'C3', 'C3'
],
'B':
[
1, 4, 8, 9, 1, 3, 8, 9, 1, 4, 7, 0
]
}
)
Out[40]:
A B
0 C1 1
1 C1 4
2 C1 8
3 C1 9
4 C2 1
5 C2 3
6 C2 8
7 C2 9
8 C3 1
9 C3 4
10 C3 7
11 C3 0
For each group in A, I want to find the row with the smallest value of B that is greater than 5.
My resulting dataframe should look like this:
A B
2 C1 8
6 C2 8
10 C3 7
I have tried this, but it does not give me the whole row:
df[df.B >= 4].groupby('A')['B'].min()
What do I need to change?
Use idxmin instead of min to extract the index, then use loc:
df.loc[df[df.B > 5].groupby('A')['B'].idxmin()]
Output:
A B
2 C1 8
6 C2 8
10 C3 7
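Here, idxmin returns the index label of each group's minimum, and loc then pulls the corresponding full rows; a quick illustration of the intermediate result for the sample data:
print(df[df.B > 5].groupby('A')['B'].idxmin())
# A
# C1     2
# C2     6
# C3    10
# Name: B, dtype: int64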
Alternatively, you can use sort_values followed by drop_duplicates:
df[df.B > 5].sort_values('B').drop_duplicates('A')
Output:
A B
10 C3 7
2 C1 8
6 C2 8
Another way: filter rows where B is greater than five, then group by A and take the minimum of B in each group.
df[df.B.gt(5)].groupby('A')['B'].min().reset_index()
A B
0 C1 8
1 C2 8
2 C3 7
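If the whole rows (with their original index and any other columns) are wanted from this last approach, a transform-based variant can be used. This is a sketch, not part of the original answer; note that ties would return more than one row per group:
filtered = df[df.B.gt(5)]
print(filtered[filtered['B'] == filtered.groupby('A')['B'].transform('min')])
#      A  B
# 2   C1  8
# 6   C2  8
# 10  C3  7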
I have 2 dataframes which look like this:
df1 = pd.DataFrame({'A': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
'B': ['C1', 'C1', 'C1', 'C1', 'C2', 'C2', 'C2', 'C2'],
'Y': [0, 1, 1, 0, 1, 1, 0, 1],
'Z': [4, 5, 2, 1, 2, 1, 3, 5]})
Out[51]:
A B Y Z
0 A C1 0 4
1 B C1 1 5
2 C C1 1 2
3 D C1 0 1
4 E C2 1 2
5 F C2 1 1
6 G C2 0 3
7 H C2 1 5
df2 = pd.DataFrame({'A': ['A', 'B', 'E', 'F', 'H'],
'B': ['C1', 'C1', 'C2', 'C2', 'C2'],
'V': [2, 3, 1, 4, 2]})
Out[52]:
A B V
0 A C1 2
1 B C1 3
2 E C2 1
3 F C2 4
4 H C2 2
I would like to select all rows in df1 where Y == 1 and the cumulative sum of Z within each group is <= the sum of V for the corresponding group in df2. The group is column B.
THIS IS MY DESIRED OUTPUT
A B Y Z
1 B C1 1 5
4 E C2 1 2
5 F C2 1 1
LOGIC
df2.groupby('B')['V'].sum()
Out[57]:
B
C1 5
C2 7
Name: V, dtype: int64
so the following should hold TRUE
for rows in group C1: df1.loc[(df1.Y==1)].groupby('B')['Z'].cumsum() should be <=5
for rows in group C2: df1.loc[(df1.Y==1)].groupby('B')['Z'].cumsum() should be <=7
how can I make this selection in 1 line of code?
You can use map on df1.B with the result from df2. Note that I use where rather than loc: it simply replaces the values in Z with NaN where Y != 1.
print(df1[df1['Z'].where(df1['Y'] == 1).groupby(df1['B']).cumsum()
          <= df1['B'].map(df2.groupby('B')['V'].sum())])
A B Y Z
1 B C1 1 5
4 E C2 1 2
5 F C2 1 1
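For readability, the same one-liner can be broken into steps; this is a sketch of the intermediate pieces (variable names are just for illustration):
limits = df2.groupby('B')['V'].sum()            # per-group budget: C1 -> 5, C2 -> 7
z_if_y1 = df1['Z'].where(df1['Y'] == 1)         # Z where Y == 1, NaN elsewhere
running = z_if_y1.groupby(df1['B']).cumsum()    # per-group running total (NaN rows stay NaN)
print(df1[running <= df1['B'].map(limits)])     # NaN <= x is False, so rows with Y != 1 drop out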
This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
There are two Pandas DataFrames:
df_A = pd.DataFrame([['r1', ['a','b']], ['r2',['aabb','b']], ['r3', ['xyz']]], columns=['col1', 'col2'])
col1 col2
r1 [a, b]
r2 [aabb, b]
r3 [xyz]
df_B = pd.DataFrame([['a', 10], ['b',2]], columns=['C1', 'C2'])
C1 C2
a 10
b 2
I want to join both dataframes such that df_C is:
col1 C1 C2
r1 a 10
r1 b 2
r2 aabb 0
r2 b 2
r3 xyz 0
You need:
import numpy as np
import pandas as pd

df = pd.DataFrame([['r1', ['a','b']], ['r2',['aabb','b']], ['r3', ['xyz']]], columns=['col1', 'col2'])
df = pd.DataFrame({'col1': np.repeat(df.col1.values, df.col2.str.len()),
                   'C1': np.concatenate(df.col2.values)})
df_B = pd.DataFrame([['a', 10], ['b',2]], columns=['C1', 'C2'])
df_B = dict(zip(df_B.C1, df_B.C2))
# {'a': 10, 'b': 2}
df['C2']= df['C1'].apply(lambda x: df_B[x] if x in df_B.keys() else 0)
print(df)
Output:
col1 C1 C2
0 r1 a 10
1 r1 b 2
2 r2 aabb 0
3 r2 b 2
4 r3 xyz 0
Edit
The code below gives the length of the list in each row of the original df (before the reassignment above).
print(df.col2.str.len())
# 0 2
# 1 2
# 2 1
np.repeat will repeat the values from col1 based on the lengths obtained above,
e.g. r1 and r2 will each be repeated twice.
print(np.repeat(df.col1.values, df.col2.str.len()))
# ['r1' 'r1' 'r2' 'r2' 'r3']
Using np.concatenate on col2.values flattens the lists into a plain 1-D array.
print(np.concatenate(df.col2.values))
# ['a' 'b' 'aabb' 'b' 'xyz']
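As the duplicate link above suggests, pandas 0.25+ also provides DataFrame.explode, which shortens the unnesting step; a sketch assuming that version is available:
df_C = df_A.explode('col2').rename(columns={'col2': 'C1'}).merge(df_B, on='C1', how='left')
df_C['C2'] = df_C['C2'].fillna(0).astype(int)
print(df_C)
#   col1    C1  C2
# 0   r1     a  10
# 1   r1     b   2
# 2   r2  aabb   0
# 3   r2     b   2
# 4   r3   xyz   0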
I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd
df1 = pd.DataFrame({ 'C1': ['A', 'B', 'A'],
'C2': [1, 1, 3],
'C3': [2, 3, 2],
'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({ 'C1': ['A', 'B', 'Q'],
'C2': [1, 1, 4],
'C3': [3, 2, 1],
'C4': ['E', 'C', 'Z']})
Adding a key and using your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
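For the sample frames, df_filter should then contain only the last row of df1:
print(df_filter)
#   C1  C2  C3 C4
# 2  A   3   2  B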
You can use an anti-join: do an outer join on the specified columns with an indicator that records which frame each row came from. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
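A variant that avoids the rename/drop cleanup (a sketch, not part of the original answer) is to left-merge only df2's key columns onto df1 and keep the rows that exist only on the left:
key_only = df1.merge(df2[['C1', 'C2']], on=['C1', 'C2'], how='left', indicator=True)
df1_only = key_only[key_only['_merge'] == 'left_only'].drop(columns='_merge')
Note that the merge resets the index, and duplicate key pairs in df2 would duplicate rows from df1.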
import pandas as pd
df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
common = pd.merge(df1, df2,on=['C1','C2'])
R1 = df1[~((df1.C1.isin(common.C1))&(df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1))&(df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z
I have two data frames:
df1
A1 B1
1 a
2 s
3 d
and
df2
A1 B1
1 a
2 x
3 d
I want to compare df1 and df2 on column B1. The column A1 can be used to join. I want to know:
Which rows are different in df1 and df2 with respect to column B1?
Whether there is a mismatch in the values of column A1, for example whether df2 is missing some values that are present in df1 and vice versa; and if so, which ones?
I tried using merge and join but that is not what I am looking for.
I've edited the raw data to illustrate the case of A1 keys that appear in one dataframe but not the other.
When doing your merge, you want to specify an 'outer' merge so that you can see these items with an A1 key in one dataframe but not the other.
I've included the suffixes '_1' and '_2' to indicate the dataframe source (_1 = df1 and _2 = df2) of column B1.
df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'])
df3['check'] = df3.B1_1 == df3.B1_2
>>> df3
A1 B1_1 B1_2 check
0 1 a a True
1 2 b d False
2 3 c c True
3 4 d NaN False
4 5 NaN e False
To check for missing A1 keys in df1 and df2:
# A1 value missing in `df1`
>>> df3[df3.B1_1.isnull()]
A1 B1_1 B1_2 check
4 5 NaN e False
# A1 value missing in `df2`
>>> df3[df3.B1_2.isnull()]
A1 B1_1 B1_2 check
3 4 d NaN False
EDIT
Thanks to @EdChum (the source of all Pandas knowledge...).
df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)
df3['check'] = df3.B1_1 == df3.B1_2
>>> df3
A1 B1_1 B1_2 _merge check
0 1 a a both True
1 2 b d both False
2 3 c c both True
3 4 d NaN left_only False
4 5 NaN e right_only False
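With the indicator column, the same checks can be written without testing for NaN; a short sketch using the df3 built above:
# A1 keys present only in df2 / only in df1
print(df3[df3['_merge'] == 'right_only'])
print(df3[df3['_merge'] == 'left_only'])
# rows where B1 differs between the two frames (this also includes the missing-key rows)
print(df3[~df3['check']])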