Get index of matching string from two dataframes - python

I have two data frames. I need to search through dataframe 2 to see which of its strings match the strings in dataframe 1, and replace each string with its index. So I want a third data frame giving, for every string in dataframe 1, the index of the matching string in dataframe 2.
X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),columns=['a','b','c'])
a b c
0 A B C
1 D AA AB
2 AC AD BA
3 BB BC AD
Y = pd.DataFrame(np.array(['A','AA','AC','D','B','AB','C','AD','BC','BB']).reshape(10,1),columns=['X'])
X
0 A
1 AA
2 AC
3 D
4 B
5 AB
6 C
7 AD
8 BC
9 BB
Resulting dataframe:
a b c
0 0 4 6
1 3 1 5
2 2 7 NA
3 9 8 7
Someone suggested the following code, but it does not seem to work:
t = pd.merge(df1.stack().reset_index(), df2.reset_index(), left_on = 0, right_on = "0")
res = t.set_index(["level_0", "level_1"]).drop([0, "0"], axis=1).unstack()
print(res)
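For reference, the merge-based idea can be fixed. The likely bug is `right_on = "0"`: it assumes the second frame's value column is named "0", while here it is named 'X'. A sketch against the X and Y defined above:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),
                 columns=['a','b','c'])
Y = pd.DataFrame({'X': ['A','AA','AC','D','B','AB','C','AD','BC','BB']})

# Stack X into long form, merge on the string value, then pivot back.
t = pd.merge(X.stack().reset_index(), Y.reset_index(), left_on=0, right_on='X')
res = t.set_index(['level_0', 'level_1'])['index'].unstack()
print(res)
```

The inner merge silently drops values with no match (here 'BA'), so those cells come back as NaN after the unstack.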

Use apply with map:
Y = Y.reset_index().set_index('X')['index']
X = X.apply(lambda x: x.map(Y))
print(X)
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0

Step 1: create a mapping from Y:
mapping = {value: key for key, value in Y.T.to_dict("records")[0].items()}
mapping
{'A': 0,
'AA': 1,
'AC': 2,
'D': 3,
'B': 4,
'AB': 5,
'C': 6,
'AD': 7,
'BC': 8,
'BB': 9}
Step 2: stack X, map the mapping over the stacked frame, and unstack to get back to the original shape:
X.stack().map(mapping).unstack()
a b c
0 0.0 4.0 6.0
1 3.0 1.0 5.0
2 2.0 7.0 NaN
3 9.0 8.0 7.0
Alternatively, you can avoid the stack/unstack step and use replace together with pd.to_numeric:
X.replace(mapping).apply(pd.to_numeric, errors="coerce")
No tests done, just my gut feeling that mapping should be faster.
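As a quick sanity check (not a benchmark), the sketch below confirms both routes give the same values; it builds the mapping with dict(zip(...)), which is equivalent to the dict comprehension above:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),
                 columns=['a','b','c'])
Y = pd.DataFrame({'X': ['A','AA','AC','D','B','AB','C','AD','BC','BB']})

mapping = dict(zip(Y['X'], Y.index))  # {'A': 0, 'AA': 1, ...}

via_map = X.stack().map(mapping).unstack()
via_replace = X.replace(mapping).apply(pd.to_numeric, errors='coerce')

# Compare values element-wise, treating NaN == NaN as equal
# (the two results can differ in per-column dtype, so .equals() is too strict).
same = np.allclose(via_map.to_numpy(dtype=float),
                   via_replace.to_numpy(dtype=float),
                   equal_nan=True)
print(same)
```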

Short solution based on applymap:
X.applymap(lambda x: Y[Y.X==x].index.max())
result:
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0

Y = pd.Series(Y.index, index=Y.X).sort_index()
will give you a more easily searchable object... then something like
flat = X.to_numpy().flatten()
Y = Y.reindex(np.unique(flat))  # all items need to be in the index to be able to use .loc with a list
res = pd.DataFrame(Y.loc[flat].to_numpy().reshape(X.shape), columns=X.columns)
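Put together, a runnable version of that idea might look like this (lookup is just a renamed Y, to keep the original frame intact; the Series is converted to an array before reshaping):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),
                 columns=['a','b','c'])
Y = pd.DataFrame({'X': ['A','AA','AC','D','B','AB','C','AD','BC','BB']})

# Position lookup: string label -> original row index in Y.
lookup = pd.Series(Y.index, index=Y['X']).sort_index()

flat = X.to_numpy().flatten()
lookup = lookup.reindex(np.unique(flat))  # unmatched labels ('BA') become NaN
res = pd.DataFrame(lookup.loc[flat].to_numpy().reshape(X.shape), columns=X.columns)
print(res)
```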

Let us do
X = X.where(X.isin(Y.X.tolist())).replace(dict(zip(Y.X,Y.index)))
Out[15]:
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0


Pandas: Find the max value in one column containing lists

I have a dataframe like this:
fly_frame:
day place
0 [1,2,3,4,5] A
1 [1,2,3,4] B
2 [1,2] C
3 [1,2,3,4] D
I want to find the max value of each list in the day column.
For example:
fly_frame:
day place
0 5 A
1 4 B
2 2 C
3 4 D
What should I do?
Thanks for your help.
df.day.apply(max)
#0 5
#1 4
#2 2
#3 4
Use apply with max:
#if strings
#import ast
#print (type(df.loc[0, 'day']))
#<class 'str'>
#df['day'] = df['day'].apply(ast.literal_eval)
print (type(df.loc[0, 'day']))
<class 'list'>
df['day'] = df['day'].apply(max)
Or list comprehension:
df['day'] = [max(x) for x in df['day']]
print (df)
day place
0 5 A
1 4 B
2 2 C
3 4 D
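If the day column was read from a CSV, the lists may actually arrive as strings; a self-contained sketch combining the ast.literal_eval step from the comments with apply(max):

```python
import ast
import pandas as pd

# Lists stored as strings, as they would arrive from a CSV file.
df = pd.DataFrame({'day': ['[1,2,3,4,5]', '[1,2,3,4]', '[1,2]', '[1,2,3,4]'],
                   'place': list('ABCD')})

df['day'] = df['day'].apply(ast.literal_eval)  # parse each string into a real list
df['day'] = df['day'].apply(max)               # then take the max of each list
print(df)
```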
Try a combination of pd.concat() and df.apply() with:
import numpy as np
import pandas as pd
fly_frame = pd.DataFrame({'day':[[1,2,3,4,5],[1,2,3,4],[1,2],[1,2,3,4]],'place':['A','B','C','D']})
df = pd.concat([fly_frame['day'].apply(max),fly_frame.drop('day',axis=1)],axis=1)
print(df)
day place
0 5 A
1 4 B
2 2 C
3 4 D
Edit
You can also use df.join() with:
fly_frame.drop('day',axis=1).join(fly_frame['day'].apply(np.max))
place day
0 A 5
1 B 4
2 C 2
3 D 4
I suggest bringing your dataframe into a better format first.
>>> df
day place
0 [1, 2, 3, 4, 5] A
1 [1, 2, 3, 4] B
2 [1, 2] C
3 [1, 2, 3, 4] D
>>>
>>> df = pd.concat([df.pop('day').apply(pd.Series), df], axis=1)
>>> df
0 1 2 3 4 place
0 1.0 2.0 3.0 4.0 5.0 A
1 1.0 2.0 3.0 4.0 NaN B
2 1.0 2.0 NaN NaN NaN C
3 1.0 2.0 3.0 4.0 NaN D
Now everything is easier, for example computing the maximum of numeric values along the columns.
>>> df.max(axis=1)
0 5.0
1 4.0
2 2.0
3 4.0
dtype: float64
Edit: renaming the index might also be useful to you.
>>> df.max(axis=1).rename(df['place'])
A 5.0
B 4.0
C 2.0
D 4.0
dtype: float64

Drop a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column,axis=1)
There is no error when executing the above code, but when doing df.apply(check) there are a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract it from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation fails with dynamic column names from a variable (it tries to select a column literally named column):
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[np.nan,np.nan,np.nan,5,5,np.nan],
                   'C':[np.nan,8,np.nan,np.nan,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,np.nan],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Alternatively, you can use count, which counts non-null values:
In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b

Pandas merge issue on key of object type containing number and string values

I have two dataframes df1 and df2 as shown below:
df1 = pd.DataFrame({'x': [1, '3', 5,'t','m','u'],'y':[2, 4, 6, 4, 4, 8]})
df2 = pd.DataFrame({'x': [1, 3, '4','t'],'z':[2, 4, 6,7]})
I am trying to merge (left join) the two data frames as:
df=pd.merge(df1, df2, how='left', on='x')
The output is:
df
Out[25]:
x y z
0 1 2 2.0
1 3 4 NaN
2 5 6 NaN
3 t 4 7.0
4 m 4 NaN
5 u 8 NaN
Clearly for the second row above, i.e. for x=3, I would like to have z=4 instead of NaN. Is there an option to define the data type of the key during the merge, or any other workaround where I can change the dtype of the keys to string in both data frames and get the desired output?
You can use assign to temporarily assign new dtype to the x column:
pd.merge(df1.assign(x=df1.x.astype(str)),
         df2.assign(x=df2.x.astype(str)),
         how='left', on='x')
Output:
x y z
0 1 2 2.0
1 3 4 4.0
2 5 6 NaN
3 t 4 7.0
4 m 4 NaN
5 u 8 NaN
Your df1 and df2 have different dtypes for 3: one is numeric, the other is str. So we convert both frames to string and the keys can match:
df=pd.merge(df1.astype(str), df2.astype(str), how='left', on='x')
df
Out[914]:
x y z
0 1 2 2
1 3 4 4
2 5 6 NaN
3 t 4 7
4 m 4 NaN
5 u 8 NaN

Understanding how pandas join works

Can somebody please explain this result to me? In particular, I don't know where the NaNs come from in the result. Also, I don't know how the join will decide what row to match with what row in this case.
left_df = pd.DataFrame.from_dict({'unique_l':[0, 1, 2, 3, 4], 'join':['a', 'a', 'b','b', 'c'] })
right_df = pd.DataFrame.from_dict({'unique_r':[10, 11, 12, 13, 14], 'join':['a', 'b', 'b','c', 'c'] })
join unique_l
0 a 0
1 a 1
2 b 2
3 b 3
4 c 4
join unique_r
0 a 10
1 b 11
2 b 12
3 c 13
4 c 14
print(left_df.join(right_df, on='join', rsuffix='_r'))
join unique_l join_r unique_r
0 a 0 NaN NaN
1 a 1 NaN NaN
2 b 2 NaN NaN
3 b 3 NaN NaN
4 c 4 NaN NaN
The join method makes use of indices. What you want is merge:
In [6]: left_df.merge(right_df, on="join", suffixes=("_l", "_r"))
Out[6]:
join unique_l unique_r
0 a 0 10
1 a 1 10
2 b 2 11
3 b 2 12
4 b 3 11
5 b 3 12
6 c 4 13
7 c 4 14
Here is a related (but, IMO, not quite a duplicate) question that explains the difference between join and merge in more detail.
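To see join succeed on the same data, move the key into the index on both sides first; a minimal sketch:

```python
import pandas as pd

left_df = pd.DataFrame({'unique_l': [0, 1, 2, 3, 4], 'join': ['a', 'a', 'b', 'b', 'c']})
right_df = pd.DataFrame({'unique_r': [10, 11, 12, 13, 14], 'join': ['a', 'b', 'b', 'c', 'c']})

# join matches on the index, so move the key there first;
# duplicate keys then combine many-to-many, just like merge.
res = left_df.set_index('join').join(right_df.set_index('join'))
print(res)
```

This produces the same eight rows as the merge above, with 'join' as the index instead of a column.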

Pandas sum two columns, skipping NaN

If I add two columns to create a third, any columns containing NaN (representing missing data in my world) cause the resulting output column to be NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...
With fillna():
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
Or, as suggested:
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving:
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
As an expansion to the answer above, frame[["a", "b"]].sum(axis=1) will return 0 for rows where all values are NaN:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want the sum of an all-NaN row to be NaN, add the min_count argument, as described in the docs:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN
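Another option worth noting, for the two-column case, is Series.add with fill_value: it treats a single missing operand as 0 but still returns NaN when both sides are missing:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, np.nan, np.nan], 'b': [3, np.nan, 4, np.nan]})

# fill_value substitutes 0 only when one of the two operands is missing;
# NaN + NaN stays NaN, preserving the "both missing" case.
frame['c'] = frame['a'].add(frame['b'], fill_value=0)
print(frame)
```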
