Find Mismatched Data Between Two Different Dataframe Columns - python

I am trying to compare two columns, both from different dataframes. The issue is that these columns are not in the same order.
Given the two dataframes below:
df1:

index  Letter
1      C,D,X
2      E,F
3      A,B

df2:

index  Letter
1      A
2      C,D
3      F
I want to identify all data that these two dataframes do not have in common. For example, one dataframe has 'A,B' while the other one has only 'A'. I am trying to flag the missing 'B' in df2. Would the best approach be to split each row of 'Letter' into a list of lists and then create a dictionary to compare?
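One way to sketch this, assuming only the individual letters matter and not which row they sit in: flatten each Letter column into a set and take set differences (no dictionary needed; variable names here are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Letter': ['C,D,X', 'E,F', 'A,B']}, index=[1, 2, 3])
df2 = pd.DataFrame({'Letter': ['A', 'C,D', 'F']}, index=[1, 2, 3])

# Split each comma-separated string into individual letters, then flatten to a set
set1 = set(df1['Letter'].str.split(',').explode())
set2 = set(df2['Letter'].str.split(',').explode())

print(sorted(set1 - set2))  # letters in df1 missing from df2
print(sorted(set1 ^ set2))  # all letters the two frames do not share
```

Here `set1 - set2` flags the missing 'B' (along with 'E' and 'X', which df2 also lacks).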

Related

Get the index numbers of mismatching column values between two dataframes

I have two similar dataframes, like this:
Dataframe 1:

ID  classification
1   MISS
2   MISS
3   CORRECT
4   MISS
5   CORRECT

Dataframe 2:

ID  classification
1   CORRECT
2   CORRECT
3   MISS
4   MISS
5   CORRECT
I would like to get the index numbers for each row where the values in the classification column differ between dataset 1 and dataset 2. The datasets are the same length and the remaining columns are equal to each other.
Because both DataFrames have the same number of rows and the same indices, you can compare the classification columns for inequality with Series.ne and filter the values with boolean indexing:
# ID is the index
df1.index[df1['classification'].ne(df2['classification'])]
Or, if ID is a column:
df1.loc[df1['classification'].ne(df2['classification']), 'ID']
If the DataFrames do not have the same number of rows, align them by ID with Series.map (here ID is a column):
s = df2.set_index('ID')['classification']
df1.loc[df1['classification'].ne(df1['ID'].map(s)), 'ID']
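Putting the first variant together as a runnable sketch on the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'classification': ['MISS', 'MISS', 'CORRECT', 'MISS', 'CORRECT']})
df2 = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                    'classification': ['CORRECT', 'CORRECT', 'MISS', 'MISS', 'CORRECT']})

# Element-wise "not equal" comparison, then boolean indexing on the ID column
mismatched_ids = df1.loc[df1['classification'].ne(df2['classification']), 'ID']
print(mismatched_ids.tolist())  # [1, 2, 3]
```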

How to remove mirror duplicate pair rows in pandas?

This is my dataframe, consisting of columns a, b, c, d.
Here

a b c d
1 2 3 4

has a mirror duplicate pair row of

3 4 1 2

Removing the duplicate pair should give me:
df.loc[pd.DataFrame(np.sort(df[['a','b','c','d']], axis=1), index=df.index).drop_duplicates(keep='first').index]
You can use np.sort to sort each row's values into ascending order, then .drop_duplicates to get rid of the rows that become identical.
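A runnable sketch of that idea, with hypothetical data since the question's full dataframe is not shown:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row mirrors the first (a,b swapped with c,d)
df = pd.DataFrame({'a': [1, 5, 3], 'b': [2, 6, 4],
                   'c': [3, 7, 1], 'd': [4, 8, 2]})

# Sort each row's values so mirror pairs become identical, then drop duplicates
sorted_rows = pd.DataFrame(np.sort(df[['a', 'b', 'c', 'd']], axis=1), index=df.index)
deduped = df.loc[sorted_rows.drop_duplicates(keep='first').index]
print(deduped)
```

The mirrored row (index 2) is dropped; the first occurrence (index 0) is kept.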

Filtering pandas dataframe using dictionary for column values

Premise
I need to use a dictionary as a filter on a large dataframe, where the key-value pairs are values in different columns.
This dictionary is obtained from a separate dataframe, using dict(zip(df.id_col, df.rank_col)) so if a dictionary isn't the best way to go, that is open to change.
This is very similar to this question: Filter a pandas dataframe using values from a dict but fundamentally (I think) different because my dictionary contains column-paired values:
Example data
df_x = pd.DataFrame({'id': [1,1,1,2,2,2,3,3,3],
                     'B': [1,1,1,0,1,0,1,0,1],
                     'Rank': ['1','2','3','1','2','3','1','2','3'],
                     'D': [1,2,3,4,5,6,7,8,9]})
filter_dict = {'1':'1', '2':'3', '3':'2'}
For this dataframe df_x I would want to be able to look at the filter dictionary and apply it to a set of columns, here id and Rank, so the dataframe is pared down to:
The actual source dataframe is approx 1M rows, and the dictionary is >100 key-value pairs.
Thanks for any help.
You can check membership with isin after pairing the two columns into tuples:
df_x[df_x[['id','Rank']].astype(str).apply(tuple,1).isin(filter_dict.items())]
Out[182]:
   id  B Rank  D
0   1  1    1  1
5   2  0    3  6
7   3  0    2  8
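If the tuple trick feels opaque, an alternative sketch (not from the original answer) is to turn the dictionary into a small DataFrame and inner-merge on both key columns; note that id has to be cast to str here to match the dictionary keys:

```python
import pandas as pd

df_x = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                     'B': [1, 1, 1, 0, 1, 0, 1, 0, 1],
                     'Rank': ['1', '2', '3', '1', '2', '3', '1', '2', '3'],
                     'D': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
filter_dict = {'1': '1', '2': '3', '3': '2'}

# Turn the dict into a two-column frame and inner-merge on the pair of keys
pairs = pd.DataFrame(list(filter_dict.items()), columns=['id', 'Rank'])
result = df_x.astype({'id': str}).merge(pairs, on=['id', 'Rank'])
print(result)
```

An inner merge on a small key frame avoids building a tuple per row, which may matter at ~1M rows.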

DataFrame merge on column gives NaN

I have two DataFrames, with the first, df:

indegree  interrupts  Subject
1         2           Weather
2         3           Weather
4         5           Weather

The second, join:

Subject  interrupts_mean  indegree_mean
weather  2                3
But the second is a lot shorter, since it holds the means over all the different subjects in the first dataframe.
When I want to merge both DataFrames
pd.merge(df,join,left_index=True,right_index=True,how='left')
it merges, but it gives NaNs for the second dataframe's columns in the result, and I suppose that is because the DataFrames are not the same length. How can I still merge on subject, so that the values from the second DataFrame are repeated for each matching row in the new DataFrame?
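The question has no answer in this excerpt, but a plausible sketch is to merge on the Subject column rather than on the index, normalising case first since the second frame holds 'weather' in lower case (frame and column names follow the question):

```python
import pandas as pd

df = pd.DataFrame({'indegree': [1, 2, 4], 'interrupts': [2, 3, 5],
                   'Subject': ['Weather', 'Weather', 'Weather']})
join = pd.DataFrame({'Subject': ['weather'],
                     'interrupts_mean': [2], 'indegree_mean': [3]})

# Merge on the Subject column, lower-casing both sides so 'Weather' matches 'weather'
join['Subject'] = join['Subject'].str.lower()
merged = pd.merge(df.assign(Subject=df['Subject'].str.lower()),
                  join, on='Subject', how='left')
print(merged)
```

With `on='Subject'` the single mean row is repeated for every matching row of df, instead of being aligned (and lost) by index position.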

Merge pandas dataframe, with column operation

I searched the archive but did not find what I wanted (probably because I don't really know what keywords to use).
Here is my problem: I have a bunch of dataframes that need to be merged; I also want to update the values of a subset of columns with the sum across the dataframes.
For example, I have two dataframes, df1 and df2:
df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=["a", "b"])
df2 = pd.DataFrame([[1, 5], [0, 6]], columns=["a", "b"], index=[0, 2])
   a  b        a  b
0  1  2     0  1  5
1  1  3     2  0  6
2  0  4

After merging, I'd like to have column 'b' updated with the sum of matched records, while column 'a' should stay as in df1 (or df2, I don't really care):

   a   b
0  1   7
1  1   3
2  0  10
Now, expand this to merging three or more data frames.
Are there straightforward, built-in tricks to do this, or do I need to process the frames one by one, line by line?
===== Edit / Clarification =====
In the real world example, each data frame may contain indexes that are not in the other data frames. In this case, the merged data frame should have all of them and update the shared entries/indexes with sum (or some other operation).
Only a partial, not complete, solution yet, but the main point is solved:
df3 = pd.concat([df1, df2], join = "outer", axis=1)
df4 = df3.b.sum(axis=1)
df3 will have two 'a' columns and two 'b' columns. The sum() on df3.b adds the two 'b' columns together, ignoring NaNs. Now df4 holds the sum of df1's and df2's 'b' columns over all the indexes.
This did not solve column 'a', though. In my real case there are quite a few NaNs in df3.a, while the other values in df3.a should all be the same. I haven't found a straightforward way to build a column 'a' in df4 filled with the non-NaN values. I am now searching for a "count"-style function to get the occurrence of elements in the rows of df3.a (imagine it has a few dozen 'a' columns).
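One way to handle both columns at once (a sketch going beyond the partial solution, not from the thread): stack the frames with concat along the rows, then group by the original index, keeping the first 'a' seen and summing 'b'. This extends naturally to three or more frames by listing them all in concat:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2], [1, 3], [0, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[1, 5], [0, 6]], columns=['a', 'b'], index=[0, 2])

# Stack the frames row-wise, then group rows by their original index:
# keep the first 'a' value seen, sum the 'b' values across frames
merged = (pd.concat([df1, df2])
            .groupby(level=0)
            .agg({'a': 'first', 'b': 'sum'}))
print(merged)
```

Indexes present in only one frame survive unchanged, and shared indexes get their 'b' values summed, matching the expected output above.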
