Pandas dataframe remove rows by aggregated data - python

I have a dataframe like this
import numpy as np
import pandas as pd

test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
                     columns=['a', 'b', 'c'])
   a  b  c
0  1  9  3
1  1  5  6
2  2  1  9
I want to keep the rows for a value of 'a' iff the sum of the 'b' values under that same 'a' is greater than 10.
For this case, the desired output is:
   a  b  c
0  1  9  3
1  1  5  6
My solution is as below:
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
                     columns=['a', 'b', 'c'])
tmp_ = test1.groupby("a").sum().reset_index()
test1[test1["a"].isin(tmp_[tmp_["b"] > 10]["a"].to_list())]
I am just wondering if there is a more elegant way to do that?

Group by 'a' and use transform
test1 = pd.DataFrame(np.array([[1, 9, 3], [1, 5, 6], [2, 1, 9]]),
                     columns=['a', 'b', 'c'])
b_sum_by_a = test1.groupby('a')['b'].transform('sum')
test1 = test1[b_sum_by_a > 10]
>>> test1
   a  b  c
0  1  9  3
1  1  5  6


Find top N highest values in a pandas dataframe, and return column name [duplicate]

I have a dataframe with multiple columns and I would like to add two more: one for the highest number in each row, and another for the second highest. However, instead of the numbers, I would like to show the names of the columns where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top-2 values:
import numpy as np

cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print(df)
    A   B   C   D   E  max2  max1
0   1   2   3   4   5     4     5
1   5   6   7   8   9     8     9
2  10  11  12  13  14    13    14
To get them in max1, max2 column order instead, reverse the slice:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top-2 column names and the top-2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
# values as a numpy array
vals = df[cols].to_numpy()
# column names as an array
cols = np.array(cols)
# indices that would sort each row in descending order
arr = np.argsort(-vals, axis=1)
# top-2 column names
df[['top1','top2']] = cols[arr[:, :2]]
# top-2 values, in the same descending order as the names
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print(df)
    A   B   C   D   E top1 top2  max1  max2
0   1   2   3  40   5    D    E    40     5
1  50   6   7   8   9    A    E    50     9
2  10  11  12  13  14    E    D    14    13
Another approach: take the first max, then mask it out and take the max again to get the second max. (Note that the replace-based masking below is fragile: it zeroes every cell equal to any row maximum, wherever it occurs.)
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
max1 = df.max(axis=1)
maxcolum1 = df.idxmax(axis=1)
max2 = df.replace(np.array(df.max(axis=1)), 0).max(axis=1)
maxcolum2 = df.replace(np.array(df.max(axis=1)), 0).idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolum1, 'maxcol2': maxcolum2})
df.join(df2)
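A slower but arguably more readable per-row alternative can be sketched with Series.nlargest (the top2_row and out names are mine); it avoids the replace-based masking entirely:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12],
                   'D': [4, 8, 13], 'E': [5, 9, 14]})

def top2_row(row):
    # nlargest(2) returns the two biggest values, labelled by column name
    top = row.nlargest(2)
    return pd.Series({'max1': top.iloc[0], 'maxcol1': top.index[0],
                      'max2': top.iloc[1], 'maxcol2': top.index[1]})

# apply row-wise, then join the four new columns back onto df
out = df.join(df.apply(top2_row, axis=1))
print(out)
```

apply with axis=1 runs Python code per row, so prefer the vectorized argsort version on large frames.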

Appending columns to other columns in Pandas

Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like:
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this for a script with different column names, thus referencing columns by name is not possible. I have tried something along the lines of df.iloc[:,x] to achieve this.
You can use:
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
           df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
Or, using numpy:
N = 2
pd.DataFrame(
    df.values.reshape((-1, df.shape[1] // 2, N))
      .reshape(-1, N, order='F'),
    columns=df.columns[:N]
)
This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert your initial dict d into a pandas DataFrame and then apply the concat function.
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]]),
       'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]])}
d_2 is your required dict. Convert it to a dataframe if you need to:
df_2 = pd.DataFrame(d_2)
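Since the question asks for a purely positional solution, a minimal two-halves sketch (the n, left, right, out names are mine; the groupby answer above generalizes to more than two blocks):

```python
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5],
     'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)

# split the columns in half by position, give the right half the left
# half's names so concat stacks the blocks vertically
n = df.shape[1] // 2
left = df.iloc[:, :n]
right = df.iloc[:, n:].set_axis(left.columns, axis=1)
out = pd.concat([left, right], ignore_index=True)
print(out)
```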

Create and populate dataframe column simulating (excel) vlookup function

I am trying to create a new column in a dataframe and populate it with a value from another dataframe's column, matched on a column that both dataframes have in common.
DF1       DF2
A  B      W  B
----      ----
Y  2      X  2
N  4      F  4
Y  5      T  5
I thought the following could do the trick.
df2['new_col'] = df1['A'] if df1['B'] == df2['B'] else "Not found"
So the result should be:
DF2
W  B  new_col
X  2  Y        -> because DF1['B'] == 2 and the value in the same row is Y
F  4  N
T  5  Y
but I get the below error, I believe that is because dataframes are different sizes?
raise ValueError("Can only compare identically-labeled Series objects")
Can you help me understand what am I doing wrong and what is the best way to achieve what I am after?
Thank you in advance.
UPDATE 1
Trying Corralien solution I still get the below:
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
This is the code I wrote
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2.reset_index().merge(df1.reset_index(), on=['b'], how='left') \
   .drop(columns='index').rename(columns={'One': 'new_col'})
UPDATE 2
Here is the second option, but it does not seem to add the columns to df2.
df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2 = df2.set_index('b', append=True).join(df1.set_index('b', append=True)) \
         .reset_index('b').rename(columns={'One': 'new_col'})
print(df2)
   b  a  c new_col Three
0  2  1  3     NaN   NaN
1  5  4  6     NaN   NaN
2  8  7  9     NaN   NaN
Why is the code above not working?
Your question is not clear: why is F associated with N and T with Y? Why not F with Y and T with N?
Using merge:
>>> df2.merge(df1, on='B', how='left')
W B A
0 X 2 Y
1 F 4 N # What you want
2 F 4 Y # Another solution
3 T 4 N # What you want
4 T 4 Y # Another solution
How do you decide on the right value? With row index?
Update
So you need to use the index position:
>>> df2.reset_index().merge(df1.reset_index(), on=['index', 'B'], how='left') \
.drop(columns='index').rename(columns={'A': 'new_col'})
W B new_col
0 X 2 Y
1 F 4 N
2 T 4 Y
In fact you can consider the column B as an additional index of each dataframe.
Using join
>>> df2.set_index('B', append=True).join(df1.set_index('B', append=True)) \
.reset_index('B').rename(columns={'A': 'new_col'})
B W new_col
0 2 X Y
1 4 F N
2 4 T Y
Setup:
df1 = pd.DataFrame([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]],
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                   columns=['a', 'b', 'c'])
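As for the UPDATE 1 error: wrapping mixed types in np.array coerces every df1 column to strings, so its b becomes object while df2's b stays numeric. A sketch of the cast that makes the merge from the update work (the out name is mine):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.array([['x', 2, 3], ['y', 5, 6], ['z', 8, 9]]),
                   columns=['One', 'b', 'Three'])
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

# np.array stored '2', '5', '8' as strings; cast back so both 'b'
# columns share a numeric dtype before merging
df1['b'] = df1['b'].astype(int)

out = df2.merge(df1[['b', 'One']], on='b', how='left') \
         .rename(columns={'One': 'new_col'})
print(out)
```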

Pandas DataFrame filter by multiple column criterias and multiple intervals

I have checked several answers but found no luck so far.
My dataset is like this:
df = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Place': [1, 2, 3, 4, 2, 3],
    'Value1': [1, 1, 2, 3, 4, 5],
    'Value2': [1, 1, 2, 3, 4, 5]
}, columns=['Location', 'Place', 'Value1', 'Value2'])
Location Place Value1 Value2
A 1 1 1
A 2 1 1
A 3 2 2
B 4 3 3
C 2 4 4
C 3 5 5
and I have a list of intervals:
A: [0, 1]
A: [3, 5]
B: [1, 3]
C: [1, 4]
C: [6, 10]
Now I want to keep every row whose Location appears in the filter list and whose Place falls within one of the intervals for that Location. So the desired output will be:
Location Place Value1 Value2
A 1 1 1
A 3 2 2
C 2 4 4
C 3 5 5
I know that I can chain multiple between conditions with |, but I have a really long list of intervals, so entering the conditions manually is not feasible. I also considered a for loop to slice the data by location first, but I think there could be a more efficient way.
Thank you for your help.
Edit: Currently the list of intervals is just strings like this
A 0 1
A 3 5
B 1 3
C 1 4
C 6 10
but I would like to slice them into a list of dicts. A better structure for them is also welcome!
First define dataframe df and filters dff:
df = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Place': [1, 2, 3, 4, 2, 3],
    'Value1': [1, 1, 2, 3, 4, 5],
    'Value2': [1, 1, 2, 3, 4, 5]
}, columns=['Location', 'Place', 'Value1', 'Value2'])
dff = pd.DataFrame({'Location': ['A', 'A', 'B', 'C', 'C'],
                    'fPlace': [[0, 1], [3, 5], [1, 3], [1, 4], [6, 10]]})
dff[['p1', 'p2']] = pd.DataFrame(dff["fPlace"].to_list())
now dff is:
Location fPlace p1 p2
0 A [0, 1] 0 1
1 A [3, 5] 3 5
2 B [1, 3] 1 3
3 C [1, 4] 1 4
4 C [6, 10] 6 10
where fPlace has been split into lower and upper bounds p1 and p2, the filters that should be applied to Place. Next:
df.merge(dff).query('Place >= p1 and Place <= p2').drop(columns=['fPlace', 'p1', 'p2'])
result:
Location Place Value1 Value2
0 A 1 1 1
5 A 3 2 2
7 C 2 4 4
9 C 3 5 5
Prerequisites:
# presumed setup for your intervals:
intervals = {
"A": [
[0, 1],
[3, 5],
],
"B": [
[1, 3],
],
"C": [
[1, 4],
[6, 10],
],
}
Actual solution:
# one exploded row per (Location, interval) pair, still aligned to df's index
x = df["Location"].map(intervals).explode().str
l, r = x[0], x[1]  # lower and upper bound of each interval
res = df["Place"].loc[l.index].between(l, r)
res = res.loc[res].index.unique()  # indices where at least one interval matched
res = df.loc[res]
Outputs:
>>> res
Location Place Value1 Value2
0 A 1 1 1
2 A 3 2 2
4 C 2 4 4
5 C 3 5 5
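Since the edit says the intervals start out as whitespace-separated strings, here is a parsing sketch (the raw, intervals, mask, res names are mine) that builds the dict form used above and applies it row-wise:

```python
import pandas as pd

df = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Place': [1, 2, 3, 4, 2, 3],
    'Value1': [1, 1, 2, 3, 4, 5],
    'Value2': [1, 1, 2, 3, 4, 5]})

raw = """A 0 1
A 3 5
B 1 3
C 1 4
C 6 10"""

# parse each "Location lo hi" line into {Location: [(lo, hi), ...]}
intervals = {}
for line in raw.splitlines():
    loc, lo, hi = line.split()
    intervals.setdefault(loc, []).append((int(lo), int(hi)))

# keep a row if Place falls inside any interval for its Location
mask = df.apply(lambda r: any(lo <= r['Place'] <= hi
                              for lo, hi in intervals.get(r['Location'], [])),
                axis=1)
res = df[mask]
print(res)
```

apply with axis=1 is slow on large frames, so the merge/query approach scales better, but this keeps the interval logic explicit.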

column is not getting dropped

Why is column A not getting dropped from the train, valid, and test dataframes?
import pandas as pd

train = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})
test = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})
valid = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9], 'C': ['a', 'b', 'c', 'd', 'e']})

for df in [train, valid, test]:
    df = df.drop(['A'], axis=1)

print('A' in train.columns)  # True
print('A' in test.columns)   # True
print('A' in valid.columns)  # True
You can use the inplace=True parameter, because the DataFrame.drop function also works in place:
for df in [train, valid, test]:
    df.drop(['A'], axis=1, inplace=True)

print('A' in train.columns)  # False
print('A' in test.columns)   # False
print('A' in valid.columns)  # False
The reason the column is not removed is that df is never assigned back: df = df.drop(...) only rebinds the loop variable to a new DataFrame, so train, valid and test are unchanged.
Another idea is to create a list of DataFrames and assign each changed DataFrame back:
L = [train, valid, test]
for i in range(len(L)):
    L[i] = L[i].drop(['A'], axis=1)
print(L)
[ B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e, B C
0 5 a
1 6 b
2 7 c
3 8 d
4 9 e]
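A related pattern keeps the frames in a dict, so the assignment goes back through the key instead of rebinding a bare loop variable. A minimal sketch (the frames and make names are mine):

```python
import pandas as pd

def make():
    return pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9],
                         'C': ['a', 'b', 'c', 'd', 'e']})

frames = {'train': make(), 'valid': make(), 'test': make()}

# assigning through frames[name] replaces the stored object, so the change
# is visible afterwards, unlike rebinding a plain loop variable
for name in frames:
    frames[name] = frames[name].drop(columns='A')
print(frames['train'].columns.tolist())
```

This also keeps the frames addressable by name, which is handy when the same preprocessing applies to every split.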
