I have two pandas DataFrames of different sizes (at least 500,000 rows in each). For simplicity, call them df1 and df2. I'm interested in finding the rows of df1 which are not present in df2. Neither data frame is necessarily a subset of the other, and the order of the rows does not matter.
For example, the ith observation in df1 may be the jth observation in df2 and I need to consider it as present (order won't matter). Another important point is that both data frames may contain null values, so the operation has to work for that case as well.
A simple example of both data frames would be
df1 = pandas.DataFrame(data={'col1': [1, 2, 3, 100], 'col2': [10, 11, numpy.nan, 50]})
df2 = pandas.DataFrame(data={'col1': [1, 2, 3, 4, 5, 100], 'col2': [20, 21, numpy.nan, 13, 14, 50]})
in this case the solution would be
df3 = pandas.DataFrame(data={'col1': [1, 2], 'col2': [10, 11]})
Please note that in reality both data frames have 15 columns (exactly the same column names and exactly the same data types). Also, I'm using Python 2.7 in a Jupyter Notebook on Windows 7. I have used the pandas built-in function df1.isin(df2), but it does not provide the accurate results that I want.
Moreover, I have also seen this question, but it assumes that one data frame is a subset of the other, which is not necessarily true in my case.
Here's one way:
import pandas as pd, numpy as np
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 100], 'col2' : [10, 11, np.nan, 50]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 100], 'col2' : [20, 21, np.nan, 13, 14, 50]})
x = set(map(tuple, df1.fillna(-1).values)) - set(map(tuple, df2.fillna(-1).values))
# {(1.0, 10.0), (2.0, 11.0)}
pd.DataFrame(list(x), columns=['col1', 'col2'])
If you have np.nan data in your result, it'll come through as -1, but you can easily convert it back. This assumes you won't have negative numbers in your underlying data; if you do, replace -1 with some other impossible value.
The reason for the complication is that np.nan == np.nan is considered False.
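To recover the NaN values in the result, you can replace the sentinel after rebuilding the frame; a minimal sketch, assuming -1 never occurs as real data:
# rebuild the result and turn the -1 sentinel back into NaN
df3 = pd.DataFrame(list(x), columns=['col1', 'col2']).replace(-1, np.nan)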
Here is one solution:
pd.concat([df1, df2.loc[df2.col1.isin(df1.col1)]], keys=[1, 2]).drop_duplicates(keep=False).loc[1]
Out[892]:
col1 col2
0 1 10.0
1 2 11.0
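Another approach worth testing, if your pandas version supports the indicator argument of merge, is a left merge on all columns. In my experience merge treats NaN join keys as equal, but verify that on your data before relying on it; this is a sketch, not a drop-in answer:
import pandas as pd, numpy as np

df1 = pd.DataFrame({'col1': [1, 2, 3, 100], 'col2': [10, 11, np.nan, 50]})
df2 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 100], 'col2': [20, 21, np.nan, 13, 14, 50]})

# tag every row of df1 with whether it also appears in df2,
# then keep only the rows found in df1 alone
merged = df1.merge(df2, on=list(df1.columns), how='left', indicator=True)
df3 = merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
# df3 contains (1, 10.0) and (2, 11.0)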
I'm creating a function to filter many dataframes using groupby. The dataframes look like the one below; however, each dataframe does not always contain the same number of columns.
df = pd.DataFrame({
    'xyz CODE': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]})
For each dataframe I always apply groupby to the first column, which is named differently across dataframes. All other columns are named consistently across all dataframes.
My question: Is it possible to run groupby using a combination of column location and column names? How can I do it?
I wrote the following function and got the error TypeError: unhashable type: 'list':
def filter_all_df(df):
    df['max_c'] = df.groupby(df.columns[0])['a'].transform('max')
    newdf = df[df['a'] == df['max_c']].drop(['max_c'], axis=1)
    newdf['max_score'] = newdf.groupby([newdf.columns[0], 'a', 'b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == newdf['max_score']]
    newdf = newdf.sort_values([newdf.columns[0]]).drop_duplicates([newdf.columns[0], 'a', 'b', 'c'], keep='last')
    newdf.to_csv('newdf_all.csv')
    return newdf
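On the underlying question of mixing column position and column names: since df.columns[0] resolves to a plain column label, it can sit in the same list as literal names when grouping. A minimal sketch, reusing the example frame above:
import pandas as pd

df = pd.DataFrame({
    'xyz CODE': [1, 2, 3, 3, 4, 5, 6, 7, 7, 8],
    'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
    'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
    'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10]})

# df.columns[0] is just the label 'xyz CODE', so positional and named
# group keys can be combined in one flat list
max_c_per_group = df.groupby([df.columns[0], 'a', 'b'])['c'].transform('max')
print(max_c_per_group)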
I have two dataframes of different shape.
The 'ANTENNA1' and 'ANTENNA2' columns in the bigger dataframe correspond to the ID column in the smaller dataframe. I want to merge the smaller dataframe into the bigger one so that the bigger dataframe will have '(POSITION, col1)', '(POSITION, col2)', '(POSITION, col3)' according to ANTENNA1 == ID.
Edit: I tried pd.merge, but it changes the original dataframe column values.
Original:
df = pd.merge(df_main, df_sub, left_on='ANTENNA1', right_on ='id', how = 'left')
Result:
I want to keep the original dataframe columns as they are.
Assuming your first dataframe (with positions) is called df1 and the second is called df2, with your loaded data you could just use pandas.DataFrame.merge (i.e. pd.merge(...)):
df = pd.merge(df1,df2,left_on='id', right_on='ANTENNA1')
Then you can select only the columns you need (col1, col2, ...) to get the desired result: df[["col1", "col2", ...]].
A simple example:
import pandas as pd
# creating dataframes as df1 and df2
df1 = pd.DataFrame({'ID': [1, 2, 3, 5, 7, 8],
                    'Name': ['Sam', 'John', 'Bridge', 'Edge', 'Joe', 'Hope']})
df2 = pd.DataFrame({'id': [1, 2, 4, 5, 6, 8, 9],
                    'Marks': [67, 92, 75, 83, 69, 56, 81]})
# merging df1 and df2 by ID
# i.e. the rows with common ID's get
# merged i.e. {1,2,5,8}
df = pd.merge(df1, df2, left_on="ID", right_on="id")
print(df)
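On the concern that the merge 'changes' the original dataframe: merge does not alter values, but overlapping column names get _x/_y suffixes by default, and a plain inner join drops non-matching rows. A hedged sketch of keeping every row of the left frame and controlling the suffixes, reusing df1 and df2 from the example above:
# keep all rows of df1; suffixes only applies to column names
# that exist in both frames
df = pd.merge(df1, df2, left_on='ID', right_on='id', how='left',
              suffixes=('', '_sub'))
print(df)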
I am a newbie in Python and I need some help.
I have 2 data frames, each containing a list of users with a list of recommended friends, drawn from two tables.
I would like to achieve the following:
Sort the list of recommended friends in ascending order for each user, in both data frames.
Match the recommended friends from dataframe2 against dataframe1 for each user, returning only the matched values.
I have tried my code but it didn't achieve the desired results.
import pandas as pd
import numpy as np
# load data from CSV
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)
# convert values to a list and sort by recommended friends ID
df1.values.tolist()
df1.sort_values(by=['User','RecommendedFriends'])
df2.values.tolist()
df2.sort_values(by=['User','RecommendedFriends'])
# obtain only the matched recommended friends from df1 and df2
df3 = df1.merge(df2, how='inner', on='User')
# return dataframe with user and matched recommended friends IDs
print(df3)
Problem encountered:
The elements in each list are not sorted in ascending order.
When matching elements through a pandas merge with an inner join, it does not seem to match certain elements.
Update: below are the data frame headers which cause some errors in the code.
This should be the solution to your problem. You might have to change a few variables but you get the idea: you merge the two dataframes on the users, so you get a dataframe with both lists for each user. You then take the intersection of both lists and store that in a new column.
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array([[1, [5, 7, 10, 11]], [2, [3, 8, 5, 12]]]),
                   columns=['User', 'Recommended friends'])
df2 = pd.DataFrame(np.array([[1, [5, 7, 9]], [2, [4, 7, 10]], [3, [15, 7, 9]]]),
                   columns=['User', 'Recommended friends'])
df3 = pd.merge(df1, df2, on='User')
df3['intersection'] = [list(set(a).intersection(set(b)))
                       for a, b in zip(df3['Recommended friends_x'], df3['Recommended friends_y'])]
Output df3:
User Recommended friends_x Recommended friends_y intersection
0 1 [5, 7, 10, 11] [5, 7, 9] [5, 7]
1 2 [3, 8, 5, 12] [4, 7, 10] []
I do not quite understand what exactly your problem is, but in general you will have to assign the dataframe back to itself.
import pandas as pd
import numpy as np
df1 = pd.read_csv('CommonFriend.csv')
df2 = pd.read_csv('InfluenceFriend.csv')
print(df1)
print(df2)
# sort_values works on the DataFrame directly; there is no need to convert to a list first
df1 = df1.sort_values(by=['User', 'RecommendedFriends'])
df2 = df2.sort_values(by=['User', 'RecommendedFriends'])
df3 = df1.merge(df2, how='inner', on='User')
print(df3)
I'm trying to write a function that takes a pandas DataFrame as an argument and at some point concatenates this dataframe with another.
For example:
def concat(df):
    df = pd.concat((df, pd.DataFrame({'E': [1, 1, 1]})), axis=1)
I would like this function to modify the input df in place, but I can't find how to achieve this. When I do
...
print(df)
concat(df)
print(df)
the dataframe df is identical before and after the function call.
Note: I don't want to do df['E'] = [1, 1, 1] because I don't know how many columns will be added to df. So I want to use pd.concat(), if possible.
This will edit the original DataFrame in place and give the desired output, as long as the new data contains the same number of rows as the original and there are no conflicting column names.
It's the same idea as your df['E'] = [1, 1, 1] suggestion, except it will work for an arbitrary number of columns.
I don't think there is a way to achieve this using pd.concat, as it doesn't have an inplace parameter like some other pandas functions do.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'C': [10, 20, 30], 'D': [40, 50, 60]})
df[df2.columns] = df2
Results (df):
A B C D
0 1 4 10 40
1 2 5 20 50
2 3 6 30 60
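To fold this into a function that mutates the caller's DataFrame, the same column assignment works, because it modifies the passed-in object instead of rebinding the local name. A small sketch; add_columns_inplace is a hypothetical helper name, not part of the original code:
import pandas as pd

def add_columns_inplace(df, new_cols):
    # column assignment mutates the caller's DataFrame,
    # unlike pd.concat, which returns a new object
    df[new_cols.columns] = new_cols

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
add_columns_inplace(df, pd.DataFrame({'E': [1, 1, 1], 'F': [7, 8, 9]}))
print(df)  # columns A, B, E, F - the original df was modified in place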
I would like to drop all data in a pandas dataframe, but am getting TypeError: drop() takes at least 2 arguments (3 given). I essentially want a blank dataframe with just my column headers.
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df.drop(axis=0, inplace=True)
print df
You need to pass the labels to be dropped.
df.drop(df.index, inplace=True)
By default, it operates on axis=0.
You can achieve the same with
df.iloc[0:0]
which is much more efficient.
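A minimal sketch of what df.iloc[0:0] gives you: zero rows, with the column names (and, in my experience, the dtypes) carried over from the original frame:
import pandas as pd

df = pd.DataFrame({'Day': [1, 2, 3], 'Visitors': [43, 43, 34]})
empty = df.iloc[0:0]

print(len(empty))            # 0
print(list(empty.columns))   # ['Day', 'Visitors']
print(empty.dtypes)          # dtypes preserved from df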
My favorite:
df = df.iloc[0:0]
But be aware that df.index.max() will then be nan.
To add items I use (note that this needs import math):
df.loc[0 if math.isnan(df.index.max()) else df.index.max() + 1] = data
My favorite way is:
df = df[0:0]
Overwrite the dataframe with something like this:
import pandas as pd
df = pd.DataFrame(None)
or if you want to keep columns in place
df = pd.DataFrame(columns=df.columns)
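One caveat, as a small sketch assuming a numeric original frame: rebuilding with the constructor keeps the column names but, in the pandas versions I've used, resets every dtype to object, whereas the df.iloc[0:0] approach shown earlier keeps the original dtypes:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})

print(pd.DataFrame(columns=df.columns).dtypes)  # typically all object
print(df.iloc[0:0].dtypes)                      # int64 / float64 preserved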
If your goal is to drop the whole dataframe's contents, you need to pass all the columns. For me, the best way is to pass a list comprehension to the columns kwarg; this then works regardless of the different columns in a df.
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 2, 6],
             'Visitors': [43, 43, 34, 23, 43, 23],
             'Bounce_Rate': [3, 2, 4, 3, 5, 5]}
df = pd.DataFrame(web_stats)
df = df.drop(columns=[i for i in df.columns])
This code makes a clean dataframe:
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
#clean
df = pd.DataFrame()