Drop rows in a pandas dataframe by criteria from another dataframe - python

I have the following dataframe containing scores for a competition as well as a column that counts what number entry for each person.
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df
Then I have another table that stores data on the maximum number of entries that each person can have:
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'],'Limit': [2,3,1]})
df2
I am trying to drop rows from df where the entry number is greater than that person's Limit in df2, so that my expected output is this:
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
If anyone has ideas on how to achieve this, that would be fantastic! Thanks

You can use pandas.merge to create another dataframe and drop rows by your condition:
df3 = pd.merge(df, df2, on="Name", how="left")
df3[df3["Entry_No"] <= df3["Limit"]][df.columns].reset_index(drop=True)
Name Score Entry_No
0 John 10 1
1 Jim 8 1
2 John 9 2
3 Jim 3 2
4 Jim 0 3
5 Jack 5 1
I used how="left" to keep the order of df and reset_index(drop=True) to reset the index of the resulting dataframe.
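As a compact variant of the same idea, here is a sketch assuming the df3 merge above: DataFrame.query can express the filter in one step.
# same filter written with DataFrame.query; df.columns restores the original columns
df3.query("Entry_No <= Limit")[df.columns].reset_index(drop=True)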

You could join the 2 dataframes, and then drop with a condition:
import pandas as pd
df = pd.DataFrame({'Name': ['John', 'Jim', 'John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'John', 'Jim', 'Jack', 'Jack', 'Jack', 'Jack'],
                   'Score': [10, 8, 9, 3, 5, 0, 1, 2, 3, 4, 5, 6, 8, 9]})
df['Entry_No'] = df.groupby(['Name']).cumcount() + 1
df2 = pd.DataFrame({'Name': ['John', 'Jim', 'Jack'], 'Limit': [2, 3, 1]})
df2 = df2.set_index('Name')
df = df.join(df2, on='Name')
df.drop(df[df.Entry_No > df.Limit].index, inplace=True)
This gives the expected output (note that the joined Limit column remains on df; drop it with df.drop(columns='Limit') if it is not needed).
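As an aside, here is a minimal sketch of an alternative that avoids the join entirely, assuming df and df2 as originally defined (i.e. df2 still has Name as a column): map each name to its limit with Series.map and filter on the result.
# look up each row's limit by name, then keep only rows within that limit
limits = df['Name'].map(df2.set_index('Name')['Limit'])
df[df['Entry_No'] <= limits].reset_index(drop=True)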

Related

Comparison of two data frames and getting the differences by two columns - Pandas

I would like to compare two data frames. For example:
import pandas as pd
az_df = pd.DataFrame({'name': ['CR1', 'CR2'], 'age': [1, 5], 'dr':[1, 2]})[['name', 'age']]
za_df = pd.DataFrame({'name': ['CR2', 'CR1'], 'age': [2, 1], 'dr':[2, 4]})[['name', 'age']]
AZ_DF table:
name age dr
CR1 1 1
CR2 5 2
ZA_DF table:
name age dr
CR1 1 4
CR2 2 2
And I want to get the summary table of different values grouped by the 'name' and 'age' columns between az_df and za_df, like:
name only in AZ only in ZA
CR2 5 2
So far, I have merged them:
merge = pd.merge(az_df, za_df, how='outer', indicator=True)
For az_df, different values are:
only_in_az = merge[merge['_merge'] == 'left_only']
And for za_df:
only_in_za = merge[merge['_merge'] == 'right_only']
However, I don't know how to build the summary table I mentioned above, showing the differing ages by name for the az and za data frames.
Thank you.
Try this:
import pandas as pd
az_df = pd.DataFrame({'name': ['CR1', 'CR2'], 'age': [1, 5], 'dr':[1, 2]})[['name', 'age']]
za_df = pd.DataFrame({'name': ['CR2', 'CR1'], 'age': [2, 1], 'dr':[2, 4]})[['name', 'age']]
merge = pd.merge(az_df, za_df, on='name', how='outer')
merge.rename(columns={'age_x': 'only in AZ', 'age_y': 'only in ZA'}, inplace=True)
merge
name only in AZ only in ZA
0 CR1 1 1
1 CR2 5 2
If you want to remove duplicates:
merge = merge[merge['only in AZ'] != merge['only in ZA']]
merge
name only in AZ only in ZA
1 CR2 5 2
I think you can get what you want from the merge dataframe built in the question (the one created with indicator=True) using pivot_table:
(pd.pivot_table(merge.query('_merge != "both"'), values='age', index='name', columns='_merge')
   .reset_index()
   .rename_axis(None, axis=1)
   .rename(columns={'left_only': 'only in AZ', 'right_only': 'only in ZA'}))
name only in AZ only in ZA
0 CR2 5 2
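As an aside, pandas 1.1+ also ships DataFrame.compare, which yields a self/other pair for every differing cell; a minimal sketch, assuming the az_df and za_df frames from the question:
# align both frames on 'name', then diff them cell by cell
diff = (az_df.set_index('name').sort_index()
        .compare(za_df.set_index('name').sort_index()))
diff.columns = ['only in AZ', 'only in ZA']  # flatten the ('age', 'self')/('age', 'other') pair
diff.reset_index()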

Combining columns for each row in a data frame

I would like to combine two columns (column 1 + column 2) row by row. Unfortunately, my attempt didn't work. How do I solve this?
import pandas as pd
import numpy as np
d = {'Nameid': [1, 2, 3, 1], 'Name': ['Michael', 'Max', 'Susan', 'Michael'], 'Project': ['S455', 'G874', 'B7445', 'Z874']}
df = pd.DataFrame(data=d)
display(df.head(10))
df['Dataframe']='df'
d2 = {'Nameid': [4, 2, 5, 1], 'Name': ['Petrova', 'Michael', 'Mike', 'Gandalf'], 'Project': ['Z845', 'Q985', 'P512', 'Y541']}
df2 = pd.DataFrame(data=d2)
display(df2.head(10))
df2['Dataframe']='df2'
What I tried
df_merged = pd.concat([df,df2])
df_merged.head(10)
df3 = pd.concat([df,df2])
df3['unique_string'] = df['Nameid'].astype(str) + df['Dataframe'].astype(str)
df3.head(10)
As you can see, it didn't combine every row correctly; the rows coming from df2 got their values from df instead. How can I combine the two columns row by row?
What I want is a unique_string column that joins Nameid and Dataframe for every single row (1df, 2df, ..., 4df2, 2df2, ...).
You can simply concatenate the strings like this. Note that the right-hand side uses df_merged, not df, and you don't need df['Dataframe'].astype(str) since that column already holds strings:
In [363]: df_merged['unique_string'] = df_merged.Nameid.astype(str) + df_merged.Dataframe
In [365]: df_merged
Out[365]:
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
0 4 Petrova Z845 df2 4df2
1 2 Michael Q985 df2 2df2
2 5 Mike P512 df2 5df2
3 1 Gandalf Y541 df2 1df2
Please make sure you use df3, not df, on the right-hand side of the assignment; also do a reset_index first:
df3 = df3.reset_index()
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe'].astype(str)
Use df3 instead of df; also pass ignore_index=True so a default index is added:
df3 = pd.concat([df,df2], ignore_index=True)
df3['unique_string'] = df3['Nameid'].astype(str) + df3['Dataframe']
print (df3)
Nameid Name Project Dataframe unique_string
0 1 Michael S455 df 1df
1 2 Max G874 df 2df
2 3 Susan B7445 df 3df
3 1 Michael Z874 df 1df
4 4 Petrova Z845 df2 4df2
5 2 Michael Q985 df2 2df2
6 5 Mike P512 df2 5df2
7 1 Gandalf Y541 df2 1df2
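As a small aside, Series.str.cat performs the same row-wise concatenation and makes the intent explicit; a sketch assuming df3 as built above:
# str.cat joins two string Series element-wise, row by row
df3['unique_string'] = df3['Nameid'].astype(str).str.cat(df3['Dataframe'])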

How to aggregate, combining dataframes, with pandas groupby

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:
Original dataframe:
name table
Bob Pandas df1
Joe Pandas df2
Bob Pandas df3
Bob Pandas df4
Emily Pandas df5
After groupby:
name table
Bob Pandas df containing the appended df1, df3, and df4
Joe Pandas df2
Emily Pandas df5
I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.
df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))
I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.
Thanks for your help!!
Given 3 dataframes
import pandas as pd
dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})
Given another dataframe, with dataframes in the columns
df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})
# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>
Each group of dataframes can be combined into a single dataframe by using .groupby, aggregating a list for each group, and combining the dataframes in the list with pd.concat:
# if there is only one column, or if there are multiple columns of dataframes to aggregate
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
# display(dfg.loc['Bob', 'table'])
a
0 1
1 2
2 3
3 a
4 b
5 c
6 pie
7 steak
8 milk
# to specify a single column, or specify multiple columns, from many columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
Not a duplicate
Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.
df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))
However, these all result in a StopIteration error when the values being aggregated are dataframes.
Here let's create a dataframe with dataframes as columns:
First, I start with three dataframes:
import pandas as pd
# create the dataframes that we will assign to Bob and Joe; notice the b's and j's:
df1 = pd.DataFrame({'var1': [12, 34, -4, None], 'letter': ['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1': [1, 23, 44, 0], 'letter': ['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1': [22, -3, 7, 78], 'letter': ['b5', 'b6', 'b7', 'b8']})
# let's make a list of dictionaries:
list_of_dfs = [
    {'name': 'Bob', 'table': df1},
    {'name': 'Joe', 'table': df2},
    {'name': 'Bob', 'table': df3}
]
# construct the main dataframe:
original_df = pd.DataFrame(list_of_dfs)
print(original_df)
original_df.shape  # shows (3, 2)
Now that we have created the original dataframe as the input, we will produce the resulting new dataframe. In doing so, we use groupby(), agg(), and pd.concat(), and we also reset the index.
new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)
#check that Bob's table is now a concatenated table of df1 and df3:
new_df[new_df['name']=='Bob']['table'][0]
The output to the last line of code is:
var1 letter
0 12.0 b1
1 34.0 b2
2 -4.0 b3
3 NaN b4
0 22.0 b5
1 -3.0 b6
2 7.0 b7
3 78.0 b8
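As a quick sanity check (assuming new_df as built above), Bob's combined table should contain all eight rows of df1 and df3:
# the concatenated table keeps every row of df1 and df3
bob_table = new_df.loc[new_df['name'] == 'Bob', 'table'].iloc[0]
print(bob_table.shape)  # (8, 2)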

Compare two dataframes and get the nearest matching dataframe

I have two dataframes with the following columns:
df1
name cell marks
tom 2 21862
df2
name cell marks passwd
tom 2 11111 2548
matt 2 158416 2483
 2 21862 26846
How do I compare df2 with df1 and get the nearest matching rows in df2?
expected_output:
df2
name cell marks passwd
tom 2 11111 2548
 2 21862 26846
I tried merge, but the data is dynamic: in one case the name might change, and in another case the marks might change.
You can try the following:
import pandas as pd
dict1 = {'name': ['tom'], 'cell': [2], 'marks': [21862]}
dict2 = {'name': ['tom', 'matt'], 'cell': [2, 2],
         'marks': [21862, 158416], 'passwd': [2548, 2483]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
# element-wise comparison on aligned index and column labels
compare = df2.isin(df1)
# keep the rows of df2 that have at least one matching cell
df2 = df2.iloc[df2.where(compare).dropna(how='all').index]
print(df2)
Output:
name cell marks passwd
0 tom 2 21862 2548
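For context, DataFrame.isin with a DataFrame argument matches element-wise and requires both the index and the column labels to line up, which is why only positionally aligned rows are flagged; a tiny illustration with hypothetical frames:
import pandas as pd
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [1, 9]})
# True only where the same row/column label holds the same value in both frames
print(a.isin(b))
#        x
# 0   True
# 1  False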
You can use pandas.merge with the option indicator=True, filtering the result for 'both':
import pandas as pd

df1 = pd.DataFrame([['tom', 2, 11111]], columns=["name", "cell", "marks"])
df2 = pd.DataFrame([['tom', 2, 11111, 2548],
                    ['matt', 2, 158416, 2483]],
                   columns=["name", "cell", "marks", "passwd"])

def compare_dataframes(df1, df2):
    """Find rows which are similar between two DataFrames."""
    comparison_df = df1.merge(df2, indicator=True, how='outer')
    return comparison_df[comparison_df['_merge'] == 'both'].drop(columns=["_merge"])

print(compare_dataframes(df1, df2))
Returns:
name cell marks passwd
0 tom 2 11111 2548
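A brief aside: since only the 'both' rows are kept, an inner merge on the shared columns gives the same result directly:
# an inner join keeps only rows present in both frames
print(df1.merge(df2, how='inner'))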

Processing a dataframe with another dataframe

I have two data frames, df1 and df2. They both include information like 'ID', 'Name', 'Score' and 'Status'. What I need is to update the 'Score' in df1 if that person's status in df2 is "Edit", and to drop the row in df1 if that person's status in df2 is "Cancel".
For example:
import pandas as pd
dic1 = {'ID': [1, 2, 3],
        'Name': ['Jack', 'Tom', 'Annie'],
        'Score': [20, 10, 25],
        'Status': ['New', 'New', 'New']}
dic2 = {'ID': [1, 2],
        'Name': ['Jack', 'Tom'],
        'Score': [28, 10],
        'Status': ['Edit', 'Cancel']}
df1 = pd.DataFrame(dic1)
df2 = pd.DataFrame(dic2)
The output should be like:
ID Name Score Status
1 Jack 28 Edit
3 Annie 25 New
Any pointers or hints?
Use DataFrame.merge with a left join first, then filter out the Cancel rows and also the columns ending with _ (the original columns from df1):
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('_', ''))
df = df.loc[df['Status'] != 'Cancel', ~df.columns.str.endswith('_')]
print (df)
ID Name Score Status
0 1 Jack 28 Edit
EDIT: add DataFrame.combine_first to replace the missing rows:
df = df1.merge(df2, on=['ID','Name'], how='left', suffixes=('', '_'))
df = df.loc[df['Status_'] != 'Cancel']
df1 = df.loc[:, df.columns.str.endswith('_')]
df = df1.rename(columns=lambda x: x.rstrip('_')).combine_first(df).drop(df1.columns, axis=1)
print (df)
ID Name Score Status
0 1.0 Jack 28.0 Edit
2 3.0 Annie 25.0 New
Use the pandas.DataFrame.update command of the pandas package:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
df1.update(df2)
print(df1)
df1 = df1[df1.Status != "Cancel"]
print(df1)
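One caveat worth noting: DataFrame.update aligns on the index, not on ID, so this works here only because both frames happen to share index labels 0 and 1. A sketch that aligns on ID explicitly, assuming df1 and df2 as defined in the question:
# align on ID so updates land on the right person regardless of row order
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df1.update(df2)
df1 = df1[df1.Status != 'Cancel'].reset_index()
print(df1)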
