Comparing two arrays element wise in pandas - python

I have two data frames.
My final goal is to compare a column in both data frames and return those values which donot match with each other
example:
df_1["column_1"]= ["A45", "kl24", "mhg", "tz22" ]
df_2["column_2"]= ["KL24", "tz22", "mhg", "A 45"]
I need a code that comparing two array values in the respective dataframe["column"] and returns those values from df_1 which did not match in df_2(Ex: from our example "A45" and "kl24" will return because there is a space and upper and lower case error)
Can anyone kindly please help me with this !

Use pd.merge(..., how='left', indicator=True) and select the rows with left_only:
>>> df = df_1.merge(df_2, how='outer', left_on='column_1', right_on='column_2', indicator=True)
>>> df
column_1 column_2 _merge
0 A45 NaN left_only
1 kl24 NaN left_only
2 mhg mhg both
3 tz22 tz22 both
4 NaN KL24 right_only
5 NaN A 45 right_only
>>> df[df._merge == 'left_only']['column_1']
0 A45
1 kl24
Name: column_1, dtype: object

You can use df.isin and df.loc to do the trick
df_1["column_1"].loc[~df_1["column_1"].isin(df_2["column_2"])]
column_1
0 A45
1 kl24

Related

Merge dataframes based on substrings

I want to merge/join two large dataframes while the 'id' column the dataframe on the right is assumed to be substrings of the left 'id' column.
For illustration purposes:
import pandas as pd
import numpy as np
df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})
This is df1:
id numbers add_info
0 abc 1.0 3123.0
1 adcfek 2.0 NaN
2 acefeasdq NaN 312441.0
And this is df2:
matching needed_info other_info
0 adc 1 22
1 fek 2 33
2 acefeasdq 3 11
3 abcef 4 44
4 acce 5 55
5 dcf 6 66
And this is the desired output:
id numbers add_info needed_info other_info
0 abc 1.0 3123.0 NaN NaN
1 adcfek 2.0 NaN 2.0 33.0
2 adcfek 2.0 NaN 6.0 66.0
3 acefeasdq NaN 312441.0 3.0 11.0
So as described, I only want to merge the additional columns only when the 'matching' column is a substring of the 'id' column. If it is the other way around, e.g. 'abc' is a substring of 'adcef', nothing should happen.
In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where 'id's contain multiple 'matching's. For the moment, it is okish to ignore these cases but I'd like to learn how I can tackle this issue. And additionally, is it possible to mark out the rows that are merged based on substrings and the rows that are merged exactly?
You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows. And then filter the dataframe using a boolean series:
df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
Docs:
pd.merge
Indexing and selecting data
If your processing can handle CROSS JOIN (problematic with large datasets), then you could cross join and then delete/filter only those you want.
map= cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1) #create map of booleans
final = cross[map] #get only those where condition was met

Add default value to merge in pandas

Similar to this topic : Add default values while merging tables in pandas
The answer to this topic fills all NaN in the resulting DataFrame and that's not what I want to do.
Let's imagine the following situation : I have two dataframes df1 and df2. Each of this DataFrame might contains some Nan, the columns of df1 are 'a' and col1, the columns of df2 are 'a' and col2 where col1 and col2 are disjoints list of columns name (For example df1 and df2 could have respectively 'a', 'b', 'c' and 'a', 'd', 'e' as columns names). I want to perform a left merge on df1 and df2 and fill all the missing values of that merge(any row of df1 with a value of the column 'a' that is not a value of column 'a' in df2) with a default value. We can imagine that I have a dict default_values that match any element of col2 to a default values.
To give you a concrete example :
df1
a b c
0 0 0.038108 0.961687
1 1 0.107457 0.616689
2 2 0.661485 0.240353
3 3 0.457169 0.560912
4 5 5.000000 5.000000
df2
a d e
0 0 0.405170 0.934776
1 1 0.684532 0.168738
2 2 0.729693 0.967310
3 3 0.844770 NaN
4 4 0.842673 0.941324
default_values = {'d':42, 'e':43}
Expected Output :
a b c d e
0 0 0.038108 0.961687 0.405170 0.934776
1 1 0.107457 0.616689 0.684532 0.168738
2 2 0.661485 0.240353 0.729693 0.967310
3 3 0.457169 0.560912 0.844770 NaN
4 5 5.000000 5.000000 42 43
While writing this question, I found a working solution. I still think it's an interesting question. Here's a solution to get the expected output :
df3 = pd.DataFrame(default_values,
index = df1.set_index('a').index.difference(df2.a))
df3['a'] = df3.index
df1.merge(pd.concat((df2, df3), sort=False))
This solution works for a left/right merge, and it can be extended to work for an outer merge (by completing the first dataframe as well).
Edit : The how='left' argument is not specified in my merge because the DataFrame I'm merging with is constructed to have all the value of the column 'a' in df1 in its own column 'a'. We could add an how='left' to this call of merge, and it would give the same output.

Python adding two dataframes based on index (edited)

(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with Panda and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the quantities of columns based on Index. Columns do not have the same name and may contain strings. (I have in fact 50 columns on each df). I want to consider nan as 0. The result should look:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
For sure a double while or for would do the trick, just not very elegant...
indices=0
columna=0
while indices<len(df.index)-1:
while columna<numbercolumns-1:
df3.iloc[indices,columna]=df1.iloc[indices,columna] +df2.iloc[indices,columna]
indices += 1
columna += 1
Thank you.
You can try of concatenating both dataframes, then add based on the index group
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0

How would I pivot this basic table using pandas?

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know how many atc_1...atc_7...?atc_100 columns there will need to be in advance. I just need to gather all associated atc_codes into one row with each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!
Use cumcount first for count values per groups which create columns by function pivot. Then add missing columns with reindex_axis and change column names by add_prefix. Last reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
.reindex_axis(range(1, 8), 1)
.add_prefix('atc_')
.reset_index()
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN

pandas - merging with missing values

There appears to be a quirk with the pandas merge function. It considers NaN values to be equal, and will merge NaNs with other NaNs:
>>> foo = DataFrame([
['a',1,2],
['b',4,5],
['c',7,8],
[np.NaN,10,11]
], columns=['id','x','y'])
>>> bar = DataFrame([
['a',3],
['c',9],
[np.NaN,12]
], columns=['id','z'])
>>> pd.merge(foo, bar, how='left', on='id')
Out[428]:
id x y z
0 a 1 2 3
1 b 4 5 NaN
2 c 7 8 9
3 NaN 10 11 12
[4 rows x 4 columns]
This is unlike any RDB I've seen, normally missing values are treated with agnosticism and won't be merged together as if they are equal. This is especially problematic for datasets with sparse data (every NaN will be merged to every other NaN, resulting in a huge DataFrame!)
Is there a way to ignore missing values during a merge without first slicing them out?
You could exclude values from bar (and indeed foo if you wanted) where id is null during the merge. Not sure it's what you're after, though, as they are sliced out.
(I've assumed from your left join that you're interested in retaining all of foo, but only want to merge the parts of bar that match and are not null.)
foo.merge(bar[pd.notnull(bar.id)], how='left', on='id')
Out[11]:
id x y z
0 a 1 2 3
1 b 4 5 NaN
2 c 7 8 9
3 NaN 10 11 NaN
If You want to preserve the NaNs from both tables without slicing them out, you could use the outer join method as follows:
pd.merge(foo, bar.dropna(subset=['id']), how='outer', on='id')
It basically returns the union of foo and bar
if do not need NaN in both left and right DF, use
pd.merge(foo.dropna(subset=['id']), bar.dropna(subset=['id']), how='left', on='id')
else if need NaN in left DF, use
pd.merge(foo, bar.dropna(subset=['id']), how='left', on='id')
Another approach, which also keeps all rows if performing an outer join:
foo['id'] = foo.id.fillna('missing')
pd.merge(foo, bar, how='left', on='id')

Categories

Resources