Get index of matching string from two dataframes - python

I have two data frames. I need to search through dataframe 2 to see which of its strings match the strings in dataframe 1, and replace each string with its index. So I want a third data frame giving, for every string in dataframe 1, the index of the matching string in dataframe 2.
X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),columns=['a','b','c'])
a b c
0 A B C
1 D AA AB
2 AC AD BA
3 BB BC AD
Y = pd.DataFrame(np.array(['A','AA','AC','D','B','AB','C','AD','BC','BB']).reshape(10,1),columns=['X'])
X
0 A
1 AA
2 AC
3 D
4 B
5 AB
6 C
7 AD
8 BC
9 BB
Resulting dataframe:
a b c
0 0 4 6
1 3 1 5
2 2 7 NA
3 9 8 7
Someone suggested the following code, but it does not seem to work:
t = pd.merge(df1.stack().reset_index(), df2.reset_index(), left_on = 0, right_on = "0")
res = t.set_index(["level_0", "level_1"]).drop([0, "0"], axis=1).unstack()
print(res)
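For reference, the merge-based idea can be fixed. The likely bug is `right_on = "0"`: it assumes the second frame's value column is named "0", while here it is named 'X'. A sketch against the X and Y defined above:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),
                 columns=['a','b','c'])
Y = pd.DataFrame({'X': ['A','AA','AC','D','B','AB','C','AD','BC','BB']})

# Stack X into long form, merge on the string value, then pivot back.
t = pd.merge(X.stack().reset_index(), Y.reset_index(), left_on=0, right_on='X')
res = t.set_index(['level_0', 'level_1'])['index'].unstack()
print(res)
```

The inner merge silently drops values with no match (here 'BA'), so those cells come back as NaN after the unstack.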

Use apply with map:
Y = Y.reset_index().set_index('X')['index']
X = X.apply(lambda x: x.map(Y))
print(X)
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0

Step 1: create a mapping from Y:
mapping = {value: key for key, value in Y.T.to_dict("records")[0].items()}
mapping
{'A': 0,
'AA': 1,
'AC': 2,
'D': 3,
'B': 4,
'AB': 5,
'C': 6,
'AD': 7,
'BC': 8,
'BB': 9}
Step 2: stack X, map the mapping over the stacked frame, and unstack to get back to the original shape:
X.stack().map(mapping).unstack()
a b c
0 0.0 4.0 6.0
1 3.0 1.0 5.0
2 2.0 7.0 NaN
3 9.0 8.0 7.0
Alternatively, you can avoid the stack/unstack step and use replace together with pd.to_numeric:
X.replace(mapping).apply(pd.to_numeric, errors="coerce")
No tests done, just my gut feeling that mapping should be faster.
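As a quick sanity check (not a benchmark), the sketch below confirms both routes give the same values; it builds the mapping with dict(zip(...)), which is equivalent to the dict comprehension above:

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),
                 columns=['a','b','c'])
Y = pd.DataFrame({'X': ['A','AA','AC','D','B','AB','C','AD','BC','BB']})

mapping = dict(zip(Y['X'], Y.index))  # {'A': 0, 'AA': 1, ...}

via_map = X.stack().map(mapping).unstack()
via_replace = X.replace(mapping).apply(pd.to_numeric, errors='coerce')

# Compare values element-wise, treating NaN == NaN as equal
# (the two results can differ in per-column dtype, so .equals() is too strict).
same = np.allclose(via_map.to_numpy(dtype=float),
                   via_replace.to_numpy(dtype=float),
                   equal_nan=True)
print(same)
```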

Short solution based on applymap:
X.applymap(lambda x: Y[Y.X==x].index.max())
result:
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0

Y = pd.Series(Y.index, index=Y.X).sort_index()
will give you a more easily searchable object... then something like
flat = X.to_numpy().flatten()
Y = Y.reindex(np.unique(flat))  # all items need to be in the index to be able to use .loc with a list
res = pd.DataFrame(Y.loc[flat].to_numpy().reshape(X.shape), columns=X.columns)
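Put together, a runnable version of that idea might look like this (lookup is just a renamed Y, to keep the original frame intact; the Series is converted to an array before reshaping):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','BA','BB','BC','AD']).reshape(4,3),
                 columns=['a','b','c'])
Y = pd.DataFrame({'X': ['A','AA','AC','D','B','AB','C','AD','BC','BB']})

# Position lookup: string label -> original row index in Y.
lookup = pd.Series(Y.index, index=Y['X']).sort_index()

flat = X.to_numpy().flatten()
lookup = lookup.reindex(np.unique(flat))  # unmatched labels ('BA') become NaN
res = pd.DataFrame(lookup.loc[flat].to_numpy().reshape(X.shape), columns=X.columns)
print(res)
```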

Let us do
X = X.where(X.isin(Y.X.tolist())).replace(dict(zip(Y.X,Y.index)))
Out[15]:
a b c
0 0 4 6.0
1 3 1 5.0
2 2 7 NaN
3 9 8 7.0


Pandas: Find the max value in one column containing lists

I have a dataframe like this:
fly_frame:
day place
0 [1,2,3,4,5] A
1 [1,2,3,4] B
2 [1,2] C
3 [1,2,3,4] D
I want to find the max value of each list in the day column.
For example:
fly_frame:
day place
0 5 A
1 4 B
2 2 C
3 4 D
What should I do?
Thanks for your help.
df.day.apply(max)
#0 5
#1 4
#2 2
#3 4
Use apply with max:
#if strings
#import ast
#print (type(df.loc[0, 'day']))
#<class 'str'>
#df['day'] = df['day'].apply(ast.literal_eval)
print (type(df.loc[0, 'day']))
<class 'list'>
df['day'] = df['day'].apply(max)
Or list comprehension:
df['day'] = [max(x) for x in df['day']]
print (df)
day place
0 5 A
1 4 B
2 2 C
3 4 D
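If the day column was read from a CSV, the lists may actually arrive as strings; a self-contained sketch combining the ast.literal_eval step from the comments with apply(max):

```python
import ast
import pandas as pd

# Lists stored as strings, as they would arrive from a CSV file.
df = pd.DataFrame({'day': ['[1,2,3,4,5]', '[1,2,3,4]', '[1,2]', '[1,2,3,4]'],
                   'place': list('ABCD')})

df['day'] = df['day'].apply(ast.literal_eval)  # parse each string into a real list
df['day'] = df['day'].apply(max)               # then take the max of each list
print(df)
```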
Try a combination of pd.concat() and df.apply() with:
import numpy as np
import pandas as pd
fly_frame = pd.DataFrame({'day':[[1,2,3,4,5],[1,2,3,4],[1,2],[1,2,3,4]],'place':['A','B','C','D']})
df = pd.concat([fly_frame['day'].apply(max),fly_frame.drop('day',axis=1)],axis=1)
print(df)
day place
0 5 A
1 4 B
2 2 C
3 4 D
Edit
You can also use df.join() with:
fly_frame.drop('day',axis=1).join(fly_frame['day'].apply(np.max))
place day
0 A 5
1 B 4
2 C 2
3 D 4
I suggest bringing your dataframe into a better format first.
>>> df
day place
0 [1, 2, 3, 4, 5] A
1 [1, 2, 3, 4] B
2 [1, 2] C
3 [1, 2, 3, 4] D
>>>
>>> df = pd.concat([df.pop('day').apply(pd.Series), df], axis=1)
>>> df
0 1 2 3 4 place
0 1.0 2.0 3.0 4.0 5.0 A
1 1.0 2.0 3.0 4.0 NaN B
2 1.0 2.0 NaN NaN NaN C
3 1.0 2.0 3.0 4.0 NaN D
Now everything is easier, for example computing the maximum of numeric values along the columns.
>>> df.max(axis=1)
0 5.0
1 4.0
2 2.0
3 4.0
dtype: float64
Edit: renaming the index might also be useful to you.
>>> df.max(axis=1).rename(df['place'])
A 5.0
B 4.0
C 2.0
D 4.0
dtype: float64

Drop a column if it exceeds a specific number of NA values

I want to write a program that drops a column if it exceeds a specific number of NA values. This is what I did:
def check(x):
    for column in df:
        if df.column.isnull().sum() > 2:
            df.drop(column,axis=1)
There is no error when executing the above code, but when doing df.apply(check) there are a ton of errors.
P.S.: I know about the thresh argument in df.dropna(thresh, axis).
Any tips? Why isn't my code working?
Thanks
Although jezrael's answer works, that is not the approach I would take. Instead, create a mask, ~df.isnull().sum().gt(2), and apply it with .loc[:, m] to select columns.
Full example:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [np.nan, np.nan, np.nan, 5, 5, np.nan],
    'C': [np.nan, 8, np.nan, np.nan, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, np.nan],
    'F': list('aaabbb')
})
m = ~df.isnull().sum().gt(2)
df = df.loc[:,m]
print(df)
Returns:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Explanation
Assume we print the columns and the mask before applying it.
print(df.columns.tolist())
print(m.tolist())
It would return this:
['A', 'B', 'C', 'D', 'E', 'F']
[True, False, False, True, True, True]
Columns B and C are unwanted (False). They are removed when the mask is applied.
I think the best option here is dropna with the thresh parameter:
thresh : int, optional
Require that many non-NA values.
So for a vectorized solution, subtract it from the length of the DataFrame:
N = 2
df = df.dropna(thresh=len(df)-N, axis=1)
print (df)
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
I suggest using DataFrame.pipe to apply the function to the input DataFrame, and changing df.column to df[column], because dot notation fails with dynamic column names from a variable (it tries to select a column literally named column):
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[np.nan,np.nan,np.nan,5,5,np.nan],
                   'C':[np.nan,8,np.nan,np.nan,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,np.nan],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a NaN NaN 1 5.0 a
1 b NaN 8.0 3 3.0 a
2 c NaN NaN 5 6.0 a
3 d 5.0 NaN 7 9.0 b
4 e 5.0 2.0 1 2.0 b
5 f NaN 3.0 0 NaN b
def check(df):
    for column in df:
        if df[column].isnull().sum() > 2:
            df.drop(column, axis=1, inplace=True)
    return df
print (df.pipe(check))
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b
Alternatively, you can use count, which counts non-null values:
In [23]: df.loc[:, df.count().gt(len(df.index) - 2)]
Out[23]:
A D E F
0 a 1 5.0 a
1 b 3 3.0 a
2 c 5 6.0 a
3 d 7 9.0 b
4 e 1 2.0 b
5 f 0 NaN b

Pandas merge issue on key of object type containing number and string values

I have two dataframes df1 and df2 as shown below:
df1 = pd.DataFrame({'x': [1, '3', 5,'t','m','u'],'y':[2, 4, 6, 4, 4, 8]})
df2 = pd.DataFrame({'x': [1, 3, '4','t'],'z':[2, 4, 6,7]})
I am trying to merge (left join) the two data frames as:
df=pd.merge(df1, df2, how='left', on='x')
The output is:
df
Out[25]:
x y z
0 1 2 2.0
1 3 4 NaN
2 5 6 NaN
3 t 4 7.0
4 m 4 NaN
5 u 8 NaN
Clearly for the second row above, i.e. for x=3, I would like to have z=4 instead of NaN. Is there an option to define the data type of the key during the merge, or any other workaround where I can change the dtype of the keys to string in both data frames and get the desired output?
You can use assign to temporarily assign new dtype to the x column:
pd.merge(df1.assign(x=df1.x.astype(str)),
         df2.assign(x=df2.x.astype(str)),
         how='left', on='x')
Output:
x y z
0 1 2 2.0
1 3 4 4.0
2 5 6 NaN
3 t 4 7.0
4 m 4 NaN
5 u 8 NaN
Your df1 and df2 have different dtypes for 3: one is numeric, the other is str. So we convert both frames to string and the keys can match:
df=pd.merge(df1.astype(str), df2.astype(str), how='left', on='x')
df
Out[914]:
x y z
0 1 2 2
1 3 4 4
2 5 6 NaN
3 t 4 7
4 m 4 NaN
5 u 8 NaN

Understanding how pandas join works

Can somebody please explain this result to me? In particular, I don't know where the NaNs come from in the result. Also, I don't know how the join will decide what row to match with what row in this case.
left_df = pd.DataFrame.from_dict({'unique_l':[0, 1, 2, 3, 4], 'join':['a', 'a', 'b','b', 'c'] })
right_df = pd.DataFrame.from_dict({'unique_r':[10, 11, 12, 13, 14], 'join':['a', 'b', 'b','c', 'c'] })
join unique_l
0 a 0
1 a 1
2 b 2
3 b 3
4 c 4
join unique_r
0 a 10
1 b 11
2 b 12
3 c 13
4 c 14
print(left_df.join(right_df, on='join', rsuffix='_r'))
join unique_l join_r unique_r
0 a 0 NaN NaN
1 a 1 NaN NaN
2 b 2 NaN NaN
3 b 3 NaN NaN
4 c 4 NaN NaN
The join method makes use of indices. What you want is merge:
In [6]: left_df.merge(right_df, on="join", suffixes=("_l", "_r"))
Out[6]:
join unique_l unique_r
0 a 0 10
1 a 1 10
2 b 2 11
3 b 2 12
4 b 3 11
5 b 3 12
6 c 4 13
7 c 4 14
Here is a related (but, IMO, not quite a duplicate) question that explains the difference between join and merge in more detail.
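To see join succeed on the same data, move the key into the index on both sides first; a minimal sketch:

```python
import pandas as pd

left_df = pd.DataFrame({'unique_l': [0, 1, 2, 3, 4], 'join': ['a', 'a', 'b', 'b', 'c']})
right_df = pd.DataFrame({'unique_r': [10, 11, 12, 13, 14], 'join': ['a', 'b', 'b', 'c', 'c']})

# join matches on the index, so move the key there first;
# duplicate keys then combine many-to-many, just like merge.
res = left_df.set_index('join').join(right_df.set_index('join'))
print(res)
```

This produces the same eight rows as the merge above, with 'join' as the index instead of a column.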

Pandas sum two columns, skipping NaN

If I add two columns to create a third, any columns containing NaN (representing missing data in my world) cause the resulting output column to be NaN as well. Is there a way to skip NaNs without explicitly setting the values to 0 (which would lose the notion that those values are "missing")?
In [42]: frame = pd.DataFrame({'a': [1, 2, np.nan], 'b': [3, np.nan, 4]})
In [44]: frame['c'] = frame['a'] + frame['b']
In [45]: frame
Out[45]:
a b c
0 1 3 4
1 2 NaN NaN
2 NaN 4 NaN
In the above, I would like column c to be [4, 2, 4].
Thanks...
With fillna():
frame['c'] = frame.fillna(0)['a'] + frame.fillna(0)['b']
Or, as suggested:
frame['c'] = frame.a.fillna(0) + frame.b.fillna(0)
giving:
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
Another approach:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
As an expansion to the answer above, frame[["a", "b"]].sum(axis=1) will return 0 for rows where all values are NaN:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN 0
If you want the sum of an all-NaN row to be NaN, add the min_count argument, as described in the docs:
>>> frame["c"] = frame[["a", "b"]].sum(axis=1, min_count=1)
>>> frame
a b c
0 1 3 4
1 2 NaN 2
2 NaN 4 4
3 NaN NaN NaN
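Another option worth noting, for the two-column case, is Series.add with fill_value: it treats a single missing operand as 0 but still returns NaN when both sides are missing:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, 2, np.nan, np.nan], 'b': [3, np.nan, 4, np.nan]})

# fill_value substitutes 0 only when one of the two operands is missing;
# NaN + NaN stays NaN, preserving the "both missing" case.
frame['c'] = frame['a'].add(frame['b'], fill_value=0)
print(frame)
```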
