Create and populate dataframe column simulating (excel) vlookup function - python

I am trying to create a new column in a dataframe and populate it with values from a column of another dataframe, matched on a column that the two dataframes have in common.
DF1        DF2
A  B       W  B
-----      -----
Y  2       X  2
N  4       F  4
Y  5       T  5
I thought the following would do the trick:
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So result should be:
DF2
W B new_col
X 2 Y -> because DF1['B'] == 2 and the value in the same row is Y
F 4 N
T 5 Y
But I get the error below. I believe that is because the dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what I am doing wrong and what the best way is to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien's solution, I still get the error below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote:
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
    .drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but the new columns it adds to df2 are all NaN.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
    .reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
b a c new_col Three
0 2 1 3 NaN NaN
1 5 4 6 NaN NaN
2 8 7 9 NaN NaN
Why is the code above not working?

Your question is not clear: why is F associated with N and T with Y, rather than F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
...     .drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact, you can consider the column B as an additional index level of each dataframe.
Using join:
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
...     .reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
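If each B value appears only once in df1, or taking the first match is acceptable (which is what Excel's VLOOKUP does for exact matches), a plain key lookup may be enough. The following is a minimal sketch under that assumption, using the frames as printed in the question; the name lookup is introduced here for illustration only.

```python
import pandas as pd

# Frames as shown in the question
df1 = pd.DataFrame({"A": ["Y", "N", "Y"], "B": [2, 4, 5]})
df2 = pd.DataFrame({"W": ["X", "F", "T"], "B": [2, 4, 5]})

# Lookup table B -> A; keep only the first row per key (assumption: first match wins)
lookup = df1.drop_duplicates("B").set_index("B")["A"]

# Keys missing from the lookup become NaN, then get the fallback label
df2["new_col"] = df2["B"].map(lookup).fillna("Not found")
print(df2)
```

Unlike the index-based merge, this ignores row position entirely, so it does not resolve the ambiguity when the same B value maps to different A values.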
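Setup for the code in your updates. Build the frames from plain lists rather than np.array: a NumPy array holds a single dtype, so mixing 'x' with numbers turns every value into a string, and column b of df1 no longer matches the integer column b of df2.
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['a', 'b', 'c'])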
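If the frames already come from np.array and cannot easily be rebuilt, checking the key column's dtype and casting it should let the merge go through. This is a sketch using the frames from the updates; merging on b alone assumes b is unique in df1, and the name out is illustrative.

```python
import numpy as np
import pandas as pd

# Same construction as in the updates: the mixed array stores strings only
df1 = pd.DataFrame(np.array([["x", 2, 3], ["y", 5, 6], ["z", 8, 9]]),
                   columns=["One", "b", "Three"])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=["a", "b", "c"])

print(df1["b"].dtype)  # a non-numeric dtype: the values are '2', '5', '8'
print(df2["b"].dtype)  # an integer dtype

# Cast the key in df1 to the dtype used by df2 before merging
df1["b"] = df1["b"].astype(df2["b"].dtype)
out = (df2.merge(df1[["b", "One"]], on="b", how="left")
          .rename(columns={"One": "new_col"}))
print(out)
```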

Related

Find top N highest values in a pandas dataframe, and return column name [duplicate]

I have a dataframe with multiple columns and I would like to add two more: one for the highest number in the row and another for the second highest. However, instead of the number, I would like to show the name of the column where it is found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values in each row and assign the top two:
import numpy as np
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
To assign them in descending order (max1 first), reverse the two selected columns:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top 2 column names and the top 2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
A B C D E top1 top2 max1 max2
0 1 2 3 40 5 D E 40 5
1 50 6 7 8 9 A E 50 9
2 10 11 12 13 14 E D 14 13
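The same argsort idea extends to any N. The sketch below wraps it in a helper; top_n_columns is a hypothetical name introduced here, not a pandas function, and it assumes the selected columns are numeric so that negating the values is valid.

```python
import numpy as np
import pandas as pd

def top_n_columns(df, cols, n):
    """Hypothetical helper: names of the n largest columns in each row."""
    vals = df[cols].to_numpy()
    # Sorting the negated values gives descending order;
    # kind="stable" keeps the original column order for ties
    order = np.argsort(-vals, axis=1, kind="stable")[:, :n]
    names = np.array(cols)[order]
    return pd.DataFrame(names, index=df.index,
                        columns=[f"top{i + 1}" for i in range(n)])

df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
result = top_n_columns(df, ['A', 'B', 'C', 'D', 'E'], 3)
print(result)
```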
Another approach: get the first max, mask it out, and take the max again to get the second max. Masking per row avoids replacing a value in one row just because it happens to be the maximum of another row.
import pandas as pd
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
# hide each row's own maximum (set to NaN) so max/idxmax find the runner-up
masked = df.mask(df.eq(max1, axis=0))
# max2 comes back as float because the masked frame contains NaN
max2 = masked.max(axis=1)
maxcolum2 = masked.idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)

Pandas dataframe remove rows by aggregated data

I have a dataframe like this:
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
                     columns=['a', 'b', 'c'])
a b c
0 1 9 3
1 1 5 6
2 2 1 9
I want to keep the rows of an 'a' value only if the sum of 'b' over all rows with that 'a' is greater than 10.
For this case, the desired output is:
a b c
0 1 9 3
1 1 5 6
My solution is as below:
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
                     columns=['a', 'b', 'c'])
tmp_ = test1.groupby("a").sum().reset_index()
test1[test1["a"].isin(tmp_[tmp_["b"]>10]["a"].to_list())]
I am just wondering if there is a more elegant way to do that?
Group by 'a' and use transform:
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
                     columns=['a', 'b', 'c'])
b_sum_by_a = test1.groupby('a')['b'].transform('sum')
test1 = test1[b_sum_by_a > 10]
>>> test1
a b c
0 1 9 3
1 1 5 6
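GroupBy.filter expresses the same rule directly as a per-group condition. A minimal sketch on the same data (built from a plain list here; the name kept is illustrative):

```python
import pandas as pd

test1 = pd.DataFrame([[1, 9, 3], [1, 5, 6], [2, 1, 9]], columns=["a", "b", "c"])

# Keep all rows of every group whose 'b' values sum to more than 10
kept = test1.groupby("a").filter(lambda g: g["b"].sum() > 10)
print(kept)
```

The transform version is usually faster on large frames because it avoids calling a Python function per group, while filter reads closer to the stated requirement.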

column is not getting dropped

Why is column A not getting dropped from the train, valid and test dataframes?
import pandas as pd
train = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
for df in [train,valid,test]:
    df = df.drop(['A'],axis=1)
print('A' in train.columns)
print('A' in test.columns)
print('A' in valid.columns)
#True
#True
#True
You can pass inplace=True, because DataFrame.drop can also modify the DataFrame in place:
for df in [train,valid,test]:
    df.drop(['A'],axis=1, inplace=True)
print('A' in train.columns)
False
print('A' in test.columns)
False
print('A' in valid.columns)
False
The column is not removed because drop returns a new DataFrame, and df = ... only rebinds the loop variable df to it; train, valid and test still refer to the original, unchanged objects.
Another idea is to create a list of DataFrames and assign each changed DataFrame back into the list:
L = [train,valid,test]
for i in range(len(L)):
    L[i] = L[i].drop(['A'],axis=1)
print (L)
[ B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e]
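If the goal is for the names train, valid and test themselves to point at the reduced frames, unpacking the results back into those names is one option. A small sketch; make_frame is a hypothetical helper used only to build the sample data:

```python
import pandas as pd

def make_frame():
    # Hypothetical helper: builds a small frame shaped like those in the question
    return pd.DataFrame({'A': [0, 1, 2], 'B': [5, 6, 7], 'C': ['a', 'b', 'c']})

train, valid, test = make_frame(), make_frame(), make_frame()

# drop() returns new frames; unpacking rebinds the three names to them
train, valid, test = [d.drop(columns='A') for d in (train, valid, test)]

print('A' in train.columns, 'A' in valid.columns, 'A' in test.columns)
```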

Pandas keep column after multiple aggregations

I'm trying to do multiple aggregations over a pandas dataframe. The problem is that I want to keep the column I group by as a regular column.
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg('sum')
X Y
0 A 4
1 B 6
That's good but what I want is multiple aggregations like this
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg(['sum', 'mean'])
It gives me
Y
sum mean
X
A 4 2
B 6 3
But I want this
X Y
sum mean
0 A 4 2
1 B 6 3
To move X from the index to a column, use reset_index (grouping without as_index=False here, so that X is reliably in the index first):
In [4]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [5]: df3.groupby('X').agg(['sum', 'mean']).reset_index()
Out[5]:
X Y
sum mean
0 A 4 2
1 B 6 3
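If flat column names are acceptable instead of the two-level header, named aggregation keeps X as a column without a reset_index step. A sketch; the output names Y_sum and Y_mean are chosen here for illustration:

```python
import pandas as pd

df3 = pd.DataFrame({'X': ['A', 'B', 'A', 'B'], 'Y': [1, 4, 3, 2]})

# Each keyword is an output column: name=(input column, aggregation)
out = df3.groupby('X', as_index=False).agg(Y_sum=('Y', 'sum'),
                                           Y_mean=('Y', 'mean'))
print(out)
```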

Selecting from multi-level groupby in pandas

Let's say I have two dataframes: df with columns ('a', 'b', 'c') and tf with columns ('a', 'b'). I group df on the two common columns and sum:
grouped_sum = df.groupby(['a', 'b']).sum()
How can I "add" the column c to tf according to grouped_sum, i.e.
tf[i]['c'] = grouped_sum[tf[i]['a'], tf[i]['b']]
for all rows i of the second data frame? For a groupby with a single level it works simply by indexing the group with the corresponding column of tf.
If you groupby with as_index=False you can merge with tf:
In [11]: tf = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
In [12]: df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))
In [13]: grouped_sum = df.groupby(['a', 'b'], as_index=False).sum()
In [14]: grouped_sum
Out[14]:
a b c
0 1 2 7
1 3 4 5
In [15]: tf.merge(grouped_sum) # this won't always be the same as grouped_sum!
Out[15]:
a b c
0 1 2 7
1 3 4 5
Another option is to set a and b as the index of tf.
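A closely related index-based variant is to leave tf as it is and join its a and b columns against the MultiIndex of the grouped result. A minimal sketch; the extra row (5, 6) in tf is an assumption added here to show what happens when a key has no group:

```python
import pandas as pd

tf = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=list('ab'))
df = pd.DataFrame([[1, 2, 3], [1, 2, 4], [3, 4, 5]], columns=list('abc'))

# Default as_index=True: the result is indexed by the (a, b) pairs
grouped_sum = df.groupby(['a', 'b']).sum()

# Match tf's a and b columns against that index; unmatched rows get NaN
result = tf.join(grouped_sum, on=['a', 'b'])
print(result)
```

Because join defaults to a left join, every row of tf is kept, which is the behaviour the per-row assignment in the question describes.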
