I have two dataframes containing the result of a corr() call on different parts of a single source (a CSV file). Now I want to compare all the values in the two dataframes to check whether they are equal, or whether they fall within a particular range. So the pseudo code would be something like:
df1['column1']['row1'] == df2['column1']['row1']
Is there a simple way of doing this in Pandas?
There are many ways to do that. One of the ways I follow is shown below:
df3 = df2[df1.ne(df2).any(axis=1)]
df3 will list all the rows in which at least one cell does not match.
FYI, ne here stands for "not equal".
Example:
# create df1
import pandas as pd

data = [['batman', 10], ['joker', 15], ['alfred', 14]]
df1 = pd.DataFrame(data, columns=['Name', 'Age'])
# create df2, which is slightly different from df1
data = [['batman', 10], ['joker', 6], ['alfred', 17]]
df2 = pd.DataFrame(data, columns = ['Name', 'Age'])
# extract the rows with at least one unequal cell
df3 = df2[df1.ne(df2).any(axis=1)]
df3
# print the resultant df3
     Name  Age
1   joker    6   # the age is different in df1 and df2 for joker
2  alfred   17   # the age is different in df1 and df2 for alfred
Now, from the resultant dataframe, you can check the range requirements as per your business case.
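For the range part specifically, here is a minimal sketch assuming both frames are all-numeric (as corr() output is) and share the same index and columns; the frames and the tolerance atol below are hypothetical stand-ins, not from the question:

import pandas as pd

# hypothetical stand-ins for the two corr() results
df1 = pd.DataFrame({'a': [1.00, 0.42], 'b': [0.42, 1.00]}, index=['a', 'b'])
df2 = pd.DataFrame({'a': [1.00, 0.45], 'b': [0.45, 1.00]}, index=['a', 'b'])

# hypothetical tolerance: values are "equal enough" if they differ by at most 0.05
atol = 0.05

# boolean DataFrame: True where the absolute difference is within the tolerance
close = (df1 - df2).abs().le(atol)

# rows where at least one cell falls outside the tolerance
df3 = df2[~close.all(axis=1)]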
Related
I need to compare two dataframes, df1 (blue) and df2 (orange), store only the rows of df2 (orange) that are not in df1 in a separate data frame, and then add that to df1 while assigning Function 6 and Sector 20 to the employees that were not present in df1 (blue).
I know how to find the differences between the two data frames and store them in a third data frame, but I'm stuck trying to figure out how to keep only the rows of df2 that are not in df1.
You can try this:
Get a list with the data of orange you want to keep
Filter df2 with that list
Append
# df1 --> blue, df2 --> orange
import pandas as pd

# assign the new values to every row of df2
df2['Function'] = 6
df2['Sector'] = 20

# keep only the IDs of df2 that do not already appear in df1
ids_df2_keep = [e for e in df2['ID'] if e not in list(df1['ID'])]
df2 = df2[df2['ID'].isin(ids_df2_keep)]

# DataFrame.append is deprecated in modern pandas, so use pd.concat instead
df1 = pd.concat([df1, df2], ignore_index=True)
This has been answered in pandas get rows which are NOT in other dataframe
Store the result of a merge and simply select the rows that do not share common values.
The ~ negates the expression, selecting all rows that are NOT IN the merge result instead of IN.
common = df1.merge(df2,on=['ID','Name'])
df = df2[(~df2['ID'].isin(common['ID']))&(~df2['Name'].isin(common['Name']))]
This was tested using some of your data:
df1 = pd.DataFrame({'ID':[125,134,156],'Name':['John','Mary','Bill'],'func':[1,2,2]})
df2 = pd.DataFrame({'ID':[125,139,133],'Name':['John','Joana','Linda']})
Output is:
ID Name
1 139 Joana
2 133 Linda
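As an alternative worth noting (this is the technique used in the linked question, sketched here on the same hypothetical data): merge's indicator flag tags each row with which frame it came from, which avoids the two separate isin checks.

# indicator=True adds a _merge column saying which frame each row came from
merged = df2.merge(df1[['ID', 'Name']], on=['ID', 'Name'], how='left', indicator=True)

# keep only the rows that exist in df2 but not in df1
df = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')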
I have two dataframes, df1 and df2.
df1 contains integers and df2 contains booleans.
df1 and df2 are exactly the same size (like both are 10x10).
I would like to create a df3 that would take the data from df1 only if the value in the same location in df2 is True. All False values would be replaced by NaN in df3.
Thanks in advance!
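A minimal sketch of one way to do this, assuming df1 and df2 share the same shape, index, and columns: DataFrame.where keeps a value where the mask is True and fills in NaN where it is False. The small frames below are hypothetical stand-ins for the 10x10 frames in the question.

import numpy as np
import pandas as pd

# hypothetical 3x3 stand-ins for the 10x10 frames
df1 = pd.DataFrame(np.arange(9).reshape(3, 3))
df2 = pd.DataFrame([[True, False, True],
                    [False, True, False],
                    [True, True, False]])

# keep df1's value where df2 is True, NaN elsewhere
df3 = df1.where(df2)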
As someone who is super new to merge/append in Python, I am trying to merge two different DFs together.
DF1 has 2 columns, Text and ID, and 100 rows.
DF2 has 3 columns, Text, ID, and Match, and 20 rows.
My goal is to combine the two DFs together so the "Match" column from DF2 can be merged into DF1.
The Match column is all True values, so when it gets merged over, the other 80 rows in DF1 can be NaN and I can fix them later.
Thank you to everyone for the help and support!
Try a left merge using .merge(), like this:
DF_out = DF1.merge(DF2, on=['Text', 'ID'], how='left')
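A quick sketch of how that behaves, with hypothetical three-row stand-ins for the 100-row and 20-row DFs:

import pandas as pd

DF1 = pd.DataFrame({'Text': ['a', 'b', 'c'], 'ID': [1, 2, 3]})
DF2 = pd.DataFrame({'Text': ['b'], 'ID': [2], 'Match': [True]})

# how='left' keeps every row of DF1; rows with no partner in DF2 get NaN in Match
DF_out = DF1.merge(DF2, on=['Text', 'ID'], how='left')
print(DF_out)
#   Text  ID Match
# 0    a   1   NaN
# 1    b   2  True
# 2    c   3   NaN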
I would like to display all rows of the DataFrame whose value in the column 'Nameid' matches one of the first two distinct values found in that column.
In the example below, the first two distinct values in the column 'Nameid' are 1 and 2. I want to select all rows for which 'Nameid' equals either 1 or 2, and discard the rest. How do I do that?
What I have:
import pandas as pd
df = pd.DataFrame(data={
'Nameid': [1, 2, 3, 1],
'Name': ['Michael', 'Max', 'Susan', 'Michael'],
'Project': ['S455', 'G874', 'B7445', 'Z874'],
})
display(df.head(10))
What I want:
First sort by the column Nameid with DataFrame.sort_values:
df = df.sort_values('Nameid')
then use Series.isin with the first 2 unique values from Series.unique:
df1 = df[df['Nameid'].isin(df['Nameid'].unique()[:2])].reset_index(drop=True)
print(df1)
Nameid Name Project
0 1 Michael S455
1 1 Michael Z874
2 2 Max G874
Alternative with Series.drop_duplicates:
df1 = df[df['Nameid'].isin(df['Nameid'].drop_duplicates()[:2])].reset_index(drop=True)
EDIT: If you want to filter by values equal to or less than 2, thank you @DarrylG:
df2 = df[df['Nameid'] <= 2].reset_index(drop=True)
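One caveat: sorting first turns "the first two different values" into "the two smallest values". If you instead want the first two distinct values in order of appearance, a minimal sketch is to skip the sort, since Series.unique already preserves appearance order:

# unique() preserves order of appearance, so no sort is needed
df1 = df[df['Nameid'].isin(df['Nameid'].unique()[:2])].reset_index(drop=True)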
I don't seem to be able to subset data using integer column names with the loc command.
# 6x4 data set with column names x, y, 8, 9
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (6, 4)),
                  index=('a', 'b', 'c', '1', '2', '3'),
                  columns=['x', 'y', 8, 9])
df2 = df.loc[:,:'x']
df3 = df.loc[:,:'8']
df2 works, but df3 throws an error.
You can do either:
df3 = df.loc[:,8]
To get only column 8
Or:
df3 = df.loc[:,df.columns[:list(df.columns).index(8)+1]]
To get all columns up to column 8 (inclusive; remove the +1 to make it exclusive).
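For completeness, the likely root cause (my reading of the setup, not stated in the answer): the column label is the integer 8, while '8' in the failing slice is a string that does not appear in the columns at all. Using the integer label directly in the slice should also work:

# the labels are the integers 8 and 9, so slice with the integer, not the string '8'
df3 = df.loc[:, :8]   # columns x, y and 8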