Building column assignment from list of columns - python

I am trying to assign values to a column (let's call it 'AAA') based on other columns ('BBB', 'CCC') in a pandas DataFrame. It works great when I know the exact column names, but in my scenario 'BBB' and 'CCC' come from a list.
A loop works, but is there a more elegant and faster solution?
import pandas as pd

columns = ['BBB', 'CCC']
df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})

# This obviously works
df.loc[(df['BBB'] > 40) | (df['CCC'] > 40), 'AAA'] = 0.1

# This works as well
for col in columns:
    df.loc[df[col] > 40, 'AAA'] = 0.1

IIUC, you need any() over axis=1 here:
import numpy as np

df.AAA = np.where(df[columns].gt(40).any(axis=1), 0.1, df.AAA)
# or, equivalently:
# df.AAA = df.AAA.mask(df[columns].gt(40).any(axis=1), 0.1)
print(df)
   AAA  BBB  CCC
0  0.1   10  100
1  0.1   20   50
2  6.0   30  -30
3  7.0   40  -50
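For completeness, the same any(axis=1) mask also plugs straight into the .loc assignment pattern from the question, avoiding both the loop and the np.where round trip (a minimal sketch reusing the df and columns defined above):

# build one boolean mask across every column in the list, then assign once
df.loc[df[columns].gt(40).any(axis=1), 'AAA'] = 0.1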

Related

Writing a DataFrame to an Excel file where items in a list are put into separate cells

Consider a dataframe like pivoted below, where replicates of some data are given as lists in a dataframe:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
     'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
     'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
              [20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
              [30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an Excel file as follows:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that replicate data are put in side-by-side cells? Desired Excel file:
You need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')
   .assign(n=lambda d: d.groupby(level=0).cumcount())
   .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
   .droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
         A    A    A    B    B    B    B   C   C   C    C
Conc
0.1     10  9.7    8  100  110   80  NaN   1   1   0  NaN
0.5     50   40   30    3    4    5    6   2   1   2    2
1.0    100   90   80   20   15   10  NaN  10   5   9    3
2.0    NaN  NaN  NaN  NaN  NaN  NaN  NaN  30  40  50   20
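If the goal is the Excel file from the question, the same pipeline can feed straight into the ExcelWriter block shown earlier (a minimal sketch, reusing the output.xlsx target and Sheet1 name from the question):

out = (df.explode('Data')
         .assign(n=lambda d: d.groupby(level=0).cumcount())
         .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
         .droplevel('n', axis=1).rename_axis(columns=None))

with pd.ExcelWriter('output.xlsx') as writer:
    out.to_excel(writer, sheet_name='Sheet1', index_label='Conc')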
Besides mozway's answer, just for formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
         .pivot(index='Conc', columns=['Compound', 'col'], values='Data')
         .rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')

pandas groupby ID and select row with minimal value of specific columns

I want to select the whole row in which the minimal value of 3 selected columns is found, in a dataframe like this:
It is supposed to look like this afterwards:
I tried something like
dfcheckminrow = dfquery[dfquery == dfquery['A':'C'].min().groupby('ID')]
obviously it didn't work out well.
Thanks in advance!
Bkeesey's answer looks like it almost got you to your solution. I added one more step to get the overall minimum for each group.
import pandas as pd

# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6],
                   })
# set "ID" as the index
df = df.set_index('ID')
# get the group-wise min for each column
mindf = df[['A', 'B']].groupby('ID').transform('min')
# get the min between columns and add it to df
df['min'] = mindf.apply(min, axis=1)
# filter df for when A or B matches the min
df2 = df.loc[(df['A'] == df['min']) | (df['B'] == df['min'])]
print(df2)
In my simplified example, I'm just finding the minimum between columns A and B. Here's the output:
      A    B  C  min
ID
1    14    1  2    1
2   100    2  3    2
3     1  100  5    1
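As a side note, the row-wise apply(min, axis=1) can be swapped for the built-in vectorized reduction, which produces the same column with less overhead:

# vectorized equivalent of mindf.apply(min, axis=1)
df['min'] = mindf.min(axis=1)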
One method to filter the initial DataFrame based on a groupby conditional could be to use transform to find the minimum for an "ID" group and then use loc to filter the initial DataFrame where any(axis=1) (checking across the row) is met.
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})
# set "ID" as the index
df = df.set_index('ID')
Sample df:
      A    B
ID
1    30   10
1    14    1
2   100    2
2    67    5
3     1  100
3    20    3
Use groupby and transform to find the minimum value within each "ID" group, then use loc to filter the initial df to the rows where any(axis=1) holds:
df.loc[(df == df.groupby('ID').transform('min')).any(axis=1)]
Output:
      A    B
ID
1    14    1
2   100    2
2    67    5
3     1  100
3    20    3
In this example only the first row is removed, since it is the only row where neither column holds the minimum for its "ID" group.
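Since the question asks about the minimum of specific columns only, the same transform trick can be restricted to a column list (a small sketch, assuming 'A' and 'B' are the columns of interest):

cols = ['A', 'B']  # columns whose group minimum should drive the filter
mask = (df[cols] == df.groupby('ID')[cols].transform('min')).any(axis=1)
df.loc[mask]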

Pandas find value corresponding to absolute minimum

I am trying to find the actual value that corresponds to the absolute minimum from multiple columns. For example:
df = pd.DataFrame({'A': [10, -5, -20, 50], 'B': [-5, 10, 30, 300], 'C': [15, 30, 15, 10]})
The output for this should be another column with the values -5, -5, 15 and 10.
I tried df['D'] = df[['A', 'B', 'C']].abs().min(axis=1), but it returns the minimum of absolutes, thereby losing the sign.
Try with idxmin:
df['D'] = df.values[df.index, df.columns.get_indexer(df[['A', 'B', 'C']].abs().idxmin(axis=1))]
df
Out[176]:
    A    B   C   D
0  10   -5  15  -5
1  -5   10  30  -5
2 -20   30  15  15
3  50  300  10  10
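If readability matters more than raw speed, a row-wise version expresses the same idea without the positional indexing (a sketch, assuming the same column list as the question):

cols = ['A', 'B', 'C']
# for each row, look up the signed value at the label where |value| is smallest
df['D'] = df[cols].apply(lambda r: r.loc[r.abs().idxmin()], axis=1)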

df.loc produces an error if the dtype of the index is mixed int/str

I have a data set with mixed index values, int and str, which pandas reads back from CSV as an object index.
If I try to slice the rows, it does not work; I get a TypeError.
I know I can work around it by changing the index dtype, but I would like to understand why this happens, or whether there is a different way of slicing these mixed-dtype indices.
I've created the following test case:
import pandas as pd

# all str index
df1 = pd.DataFrame({'Col': [0, 20, 30, 10]}, index=['a', 'b', 'c', 'd'])
# all int index
df2 = pd.DataFrame({'Col': [0, 20, 30, 10]}, index=[1, 2, 3, 4])
# all str index with numbers
df3 = pd.DataFrame({'Col': [0, 20, 30, 10]}, index=['a', 'b', '3', '4'])
# mixed str/int
df4 = pd.DataFrame({'Col': [0, 20, 30, 10]}, index=['a', 'b', 3, 4])
df1.loc['b':'d']
   Col
b   20
c   30
d   10

df2.loc[2:4]
   Col
2   20
3   30
4   10

df3.loc['b':'4']
   Col
b   20
3   30
4   10

df4.loc['b':4]
TypeError

df4.index = df4.index.map(str)
df4.loc['b':'4']
   Col
b   20
3   30
4   10
Why does the slice not work for df4?
Can you 'fix it' within the slice?
Is changing the dtype of the index the only option?
is changing the dtype of the index the only option?
No. You can achieve this using get_loc, which finds the integer position of a label in the index; that position can then be used with iloc[]:
df4.iloc[df4.index.get_loc('b') : df4.index.get_loc(4)+1]
   Col
b   20
3   30
4   10
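As for the why: a label slice makes pandas order-compare the endpoint labels against the index, and comparing str with int raises TypeError in Python 3, so the mixed object index cannot be sliced by label range. If this pattern comes up often, the get_loc approach can be wrapped in a small helper (a hypothetical convenience function, not part of pandas):

def loc_slice(frame, start, stop):
    # hypothetical helper: inclusive label slice via integer positions;
    # assumes unique index labels so get_loc returns a single position
    i = frame.index.get_loc(start)
    j = frame.index.get_loc(stop)
    return frame.iloc[i:j + 1]

loc_slice(df4, 'b', 4)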

How to aggregate two largest values per group in pandas?

I was going through this link: Return top N largest values per group using pandas, and found multiple ways to find the top N values per group.
However, I prefer the dictionary method with the agg function, and would like to know whether it is possible to get the equivalent of the dictionary method for the following problem.
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
print(df)
   A  B   C  D
0  1  1  10  X
1  1  1  20  Y
2  1  2  30  X
3  2  2  40  Y
4  2  1  50  Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
   A   C
0  1  30
1  1  20
2  2  50
3  2  40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required:
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want to get a dictionary mapping each A to the 2 top values from C, you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
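If a DataFrame is more convenient than a dict, the same result can be spread into one column per rank (a small follow-on sketch built from the dict above):

res = df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
# one row per group in A, one column per top-N rank
top = pd.DataFrame.from_dict(res, orient='index')
print(top)
#     0   1
# 1  30  20
# 2  50  40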
