column is not getting dropped - python

Why column A is not getting dropped in train,valid,test data frames?
import pandas as pd
train = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4],'B': [5, 6, 7, 8, 9],'C': ['a', 'b', 'c', 'd', 'e']})
for df in [train,valid,test]:
df = df.drop(['A'],axis=1)
print('A' in train.columns)
print('A' in test.columns)
print('A' in valid.columns)
#True
#True
#True

You can use inplace=True parameter, because DataFrame.drop function working also inplace:
for df in [train,valid,test]:
df.drop(['A'],axis=1, inplace=True)
print('A' in train.columns)
False
print('A' in test.columns)
False
print('A' in valid.columns)
False
Reason why is not removed column is df is not assign back, so DataFrames are not changed.
Another idea is create list of DataFrames and assign each changed DataFrame back:
L = [train,valid,test]
for i in range(len(L)):
L[i] = L[i].drop(['A'],axis=1)
print (L)
[ B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e]

Related

Find top N highest values in a pandas dataframe, and return column name [duplicate]

I have a code with multiple columns and I would like to add two more, one for the highest number on the row, and another one for the second highest. However, instead of the number, I would like to show the column name where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sorting values and assign top2 values:
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
A B C D E max2 max1
0 1 2 3 4 5 4 5
1 5 6 7 8 9 8 9
2 10 11 12 13 14 13 14
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: For get top2 columns names and top2 values use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max2','max1']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
A B C D E top1 top2 max2 max1
0 1 2 3 40 5 D E 40 5
1 50 6 7 8 9 A E 50 9
2 10 11 12 13 14 E D 14 13
Another approaches to you can get first max then remove it and get max again to get the second max
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1=df.max(axis=1)
maxcolum1=df.idxmax(axis=1)
max2 = df.replace(np.array(df.max(axis=1)),0).max(axis=1)
maxcolum2=df.replace(np.array(df.max(axis=1)),0).idxmax(axis=1)
df2 =pd.DataFrame({ 'max1': max1, 'max2': max2 ,'maxcol1':maxcolum1,'maxcol2':maxcolum2 })
df.join(df2)

Create and populate dataframe column simulating (excel) vlookup function

I am trying to create a new column in a dataframe and polulate it with a value from another data frame column which matches a common column from both data frames columns.
DF1 DF2
A B W B
——— ———
Y 2 X 2
N 4 F 4
Y 5 T 5
I though the following could do the tick.
df2[‘new_col’] = df1[‘A’] if df1[‘B’] == df2[‘B’] else “Not found”
So result should be:
DF2
W B new_col
X 2 Y -> Because DF1[‘B’] == 2 and value in same row is Y
F 4 N
T 5 Y
but I get the below error, I believe that is because dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects”)
Can you help me understand what am I doing wrong and what is the best way to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien solution I still get the below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
.drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but it does not seem to add columns in df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
.reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
b a c new_col Three
0 2 1 3 NaN NaN
1 5 4 6 NaN NaN
2 8 7 9 NaN NaN
Why is the code above not working?
Your question is not clear because why is F associated with N and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
.drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact you can consider the column B as an additional index of each dataframe.
Using join
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
.reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
columns=['a', 'b', 'c'])

Pandas dataframe remove rows by aggregated data

I have a dataframe like this
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
columns=['a', 'b', 'c'])
a
b
c
0
1
9
3
1
1
5
6
2
2
1
9
I want to keep 'a' iff the sum of 'b's under the same 'a' is greater than 10.
For this case, the desire output is:
a
b
c
0
1
9
3
1
1
5
6
My solution is as below:
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
columns=['a', 'b', 'c'])
tmp_ = test1.groupby("a").sum().reset_index()
test1[test1["a"].isin(tmp_[tmp_["b"]>10]["a"].to_list())]
I am just wondering if there is a more elegant way to do that?
Group by 'a' and use transform
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
columns=['a', 'b', 'c'])
b_sum_by_a = test1.groupby('a')['b'].transform('sum')
test1 = test1[b_sum_by_a > 10]
>>> test1
a b c
0 1 9 3
1 1 5 6

python pandas elegant dataframe access rows 2:end

I have a dataframe, dF = pd.DataFrame(X) where X is a numpy array of doubles. I want to remove the last row from the dataframe. I know for the first row I can do something like this dF.ix[1:]. I want to do something similar for the last row. I know in matlab you could do something like this dF[1:end-1]. What is a good and readable way to do this with pandas?
The end goal is to achieve this:
first matrix
1 2 3
4 5 6
7 8 9
second matrix
a b c
d e f
g h i
now get rid of first row of first matrix and last row of second matrix and horizontally concatentate them like so:
4 5 6 a b c
7 8 9 d e f
done. In matlab a = firstMatrix. b = secondMatrix. c = [a[2:end,:] b[1:end-1,:]] where c is the resulting matrix.
you can do it this way:
In [129]: df1
Out[129]:
c1 c2 c3
0 1 2 3
1 4 5 6
2 7 8 9
In [130]: df2
Out[130]:
c1 c2 c3
0 a b c
1 d e f
2 g h i
In [131]: df1.iloc[1:].reset_index(drop=1).join(df2.iloc[:-1].reset_index(drop=1), rsuffix='_2')
Out[131]:
c1 c2 c3 c1_2 c2_2 c3_2
0 4 5 6 a b c
1 7 8 9 d e f
Or a pure NumPy solution:
In [132]: a1 = df1.values
In [133]: a1
Out[133]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]], dtype=int64)
In [134]: a2 = df2.values
In [135]: a2
Out[135]:
array([['a', 'b', 'c'],
['d', 'e', 'f'],
['g', 'h', 'i']], dtype=object)
In [136]: a1[1:]
Out[136]:
array([[4, 5, 6],
[7, 8, 9]], dtype=int64)
In [137]: a2[:-1]
Out[137]:
array([['a', 'b', 'c'],
['d', 'e', 'f']], dtype=object)
In [138]: np.concatenate((a1[1:], a2[:-1]), axis=1)
Out[138]:
array([[4, 5, 6, 'a', 'b', 'c'],
[7, 8, 9, 'd', 'e', 'f']], dtype=object)

Pandas keep column after multiple aggregations

I'm trying to do multiple aggragations over a pandas dataframe, the problem is that I want to keep the column over I aggregate
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg('sum')
X Y
0 A 4
1 B 6
That's good but what I want is multiple aggregations like this
df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
df3.groupby('X', as_index=False).agg(['sum', 'mean'])
It gives me
Y
sum mean
X
A 4 2
B 6 3
But I want this
X Y
sum mean
0 A 4 2
1 B 6 3
To move X from the index to a column use reset_index:
In [4]: df3 = pd.DataFrame({'X' : ['A', 'B', 'A', 'B'], 'Y' : [1, 4, 3, 2]})
In [5]: df3.groupby('X', as_index=False).agg(['sum', 'mean']).reset_index()
Out[5]:
X Y
sum mean
0 A 4 2
1 B 6 3

Categories

Resources