I am trying to find the actual value that corresponds to the absolute minimum from multiple columns. For example:
df = pd.DataFrame({'A': [10, -5, -20, 50], 'B': [-5, 10, 30, 300], 'C': [15, 30, 15, 10]})
The output for this should be another another column with values -5, -5, 15 and 10.
I tried df['D'] = df[['A', 'B', 'C']].abs().min(axis=1), but it returns the minimum of absolutes, thereby losing the sign.
Try with idxmin
df['D'] = df.values[df.index,df.columns.get_indexer(df[['A', 'B', 'C']].abs().idxmin(1))]
df
Out[176]:
A B C D
0 10 -5 15 -5
1 -5 10 30 -5
2 -20 30 15 15
3 50 300 10 10
Related
Consider a dataframe like pivoted, where replicates of some data are given as lists in a dataframe:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
[20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
[30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an excel file as such:
with pd.ExcelWriter('output.xlsx') as writer:
pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written where replicate data are given in side-by-side cells? Desired excel file:
Then you need to pivot your data in a slightly different way, first explode the Data column, and deduplicate with groupby.cumcount:
(df.explode('Data')
.assign(n=lambda d: d.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'n'], values='Data')
.droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
Beside the #mozway's answer, just for formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
We start with an interval axis that is divided into bins of length 5. (0,5], (5, 10], ...
There is a timestamp column that has some timestamps >= 0. By using pd.cut() the interval bin that corresponds to the timestamp is determined. (e.g. "timestamp" = 3.0 -> "time_bin" = (0,5]).
If there is a time bin that has no corresponding timestamp, it does not show up in the interval column. Thus, there can be interval gaps in the "time_bin" column, e.g., (5,10], (15,20]. (i.e., interval (10,15] is missing // note that the timestamp column is sorted)
The goal is to obtain a column "connected_interval" that indicates whether the current row interval is connected to the previous row interval; connected meaning no interval gaps, i.e., (0,5], (5,10], (10, 15] would be assigned the same integer ID) and a column "conn_interv_length" that indicates for each largest possible connected interval the length of the interval. The interval (0,5], (5,10], (10, 15] would be of length 15.
The initial dataframe has columns "group_id", "timestamp", "time_bin". Columns "connected_interval" & "conn_interv_len" should be computed.
Note: any solution to obtaining the length of populated connected intervals is welcome.
df = pd.DataFrame({"group_id":['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],\
"timestamp": [0.0, 3.0, 9.0, 24.2, 30.2, 0.0, 136.51, 222.0, 237.0, 252.0],\
"time_bin": [pd.Interval(0, 5, closed='left'), pd.Interval(0, 5, closed='left'), pd.Interval(5, 10, closed='left'), pd.Interval(20, 25, closed='left'), pd.Interval(30, 35, closed='left'), pd.Interval(0, 5, closed='left'), pd.Interval(135, 140, closed='left'), pd.Interval(220, 225, closed='left'), pd.Interval(235, 240, closed='left'), pd.Interval(250, 255, closed='left')],\
"connected_interval":[0, 0, 0, 1, 2, 0, 1, 2, 3, 4],\
"conn_interv_len":[10, 10, 10, 5, 5, 5, 5, 5, 5, 5],\
})
input with expected output columns:
group_id timestamp time_bin connected_interval conn_interv_len
0 A 0.00 [0, 5) 0 10
1 A 3.00 [0, 5) 0 10
2 A 9.00 [5, 10) 0 10
3 A 24.20 [20, 25) 1 5
4 A 30.20 [30, 35) 2 5
5 B 0.00 [0, 5) 0 5
6 B 136.51 [135, 140) 1 5
7 B 222.00 [220, 225) 2 5
8 B 237.00 [235, 240) 3 5
9 B 252.00 [250, 255) 4 5
IIUC, you can sort the intervals, drop duplicates, extract the left/right bound, create groups based on the match/mismatch of the successive left/right, then merge again the output to the original:
df2 = (df[['group_id', 'time_bin']]
# extract bounds and sort intervals
.assign(left=df['time_bin'].array.left,
right=df['time_bin'].array.right)
.sort_values(by=['group_id', 'left', 'right'])
# ensure no duplicates
.drop_duplicates(['group_id', 'time_bin'])
# compute connected intervals and connected length
.assign(connected_interval=lambda d:
d.groupby('group_id', group_keys=False)
.apply(lambda g: g['left'].ne(g['right'].shift())
.cumsum().sub(1)),
conn_interv_len=lambda d:
(g := d.groupby(['group_id', 'connected_interval']))['right'].transform('max')
-g['left'].transform('min')
)
.drop(columns=['left', 'right'])
)
# merge to restore missing dropped duplicated rows
out = df.merge(df2)
output:
group_id timestamp time_bin connected_interval conn_interv_len
0 A 0.00 [0, 5) 0 10
1 A 3.00 [0, 5) 0 10
2 A 9.00 [5, 10) 0 10
3 A 24.20 [20, 25) 1 5
4 A 30.20 [30, 35) 2 5
5 B 0.00 [0, 5) 0 5
6 B 136.51 [135, 140) 1 5
7 B 222.00 [220, 225) 2 5
8 B 237.00 [235, 240) 3 5
9 B 252.00 [250, 255) 4 5
I have two dataframes:
df_1 = pd.DataFrame({'a' : [7,8, 2], 'b': [6, 6, 11], 'c': [4, 8, 6]})
df_1
and
df_2 = pd.DataFrame({'d' : [8, 4, 12], 'e': [16, 2, 1], 'f': [9, 3, 4]})
df_2
My goal is something like:
In a way that 'in one shot' I can subtract each column multiple times.
I'm trying for loop but I´m stuck!
You can subtract them as numpy arrays (using .values) and then put the result in a dataframe:
df_3 = pd.DataFrame(df_1.values - df_2.values, columns=list('xyz'))
# x y z
# 0 -1 -10 -5
# 1 4 4 5
# 2 -10 10 2
Or rename df_1.columns and df_2.columns to ['x','y','z'] and you can subtract them directly:
df_1.columns = df_2.columns = list('xyz')
df_3 = df_1 - df_2
# x y z
# 0 -1 -10 -5
# 1 4 4 5
# 2 -10 10 2
I have the following dataframe:
import pandas as pd
data = {0: [-1, -14], 1: [-3, 2], 2: [7, 10], 4: [-10, 15]}
df = pd.DataFrame(data)
I know how to sort an specific row:
df.sort_values(by=0, ascending=False, axis=1)
How is it possible to sort the dataframe by the absolute value of the first row?
In this case I will have something like:
sorted_data = {0: [-10, 15], 1: [7, 10], 2: [-3, 2], 4: [-1, -14]}
sort series by slicing of row 0 and passing its index to indexing the original df
df_sorted = df[df.iloc[0].abs().sort_values(ascending=False).index]
Out[94]:
4 2 1 0
0 -10 7 -3 -1
1 15 10 2 -14
Pandas 1.1 gives a key argument :
df.sort_values(0, axis=1, key=np.abs, ascending=False)
4 2 1 0
0 -10 7 -3 -1
1 15 10 2 -14
Let us try argsort
df = df.iloc[:,(-df.loc[0].abs()).argsort()]
I am trying to assign values to a column (lets call it 'AAA') based on other columns ('BBB', 'CCC') in a pandas dataframe. It works great when I know the exact column names, but in my scenario, 'BBB' and 'CCC' come from a list.
A loop works, but is there a more elegant and faster solution?
columns = ['BBB', 'CCC']
df = pd.DataFrame({'AAA': [4, 5, 6, 7],
'BBB': [10, 20, 30, 40],
'CCC': [100, 50, -30, -50]})
#This obviously works
df.loc[(df['BBB'] > 40) | (df['CCC'] > 40), 'AAA'] = 0.1
#This works as well
for col in columns:
df.loc[df[col]>40, 'AAA'] = 0.1
IIUC, You need any() over axis=1 here:
df.AAA=np.where(df[columns].gt(40).any(1),0.1,df.AAA)
#df.AAA=df.AAA.mask(df[columns].gt(40).any(1),0.1)
print(df)
AAA BBB CCC
0 0.1 10 100
1 0.1 20 50
2 6.0 30 -30
3 7.0 40 -50