Select rows with minimum mean column - python

I have a DataFrame with a 3-level MultiIndex, for example:
import pandas as pd

df = pd.DataFrame({
    'col0': [0, 8, 3, 1, 2, 2, 0, 0],
    'col1': range(8),
}, index=pd.MultiIndex.from_product([[0, 1]] * 3, names=['idx0', 'idx1', 'idx2']))
>>> df
                col0  col1
idx0 idx1 idx2
0    0    0        0     0
          1        8     1
     1    0        3     2
          1        1     3
1    0    0        2     4
          1        2     5
     1    0        0     6
          1        0     7
For each idx0, I want to find the idx1 that has the lowest mean of col0. This gives me idx0, idx1 pairs. Then I'd like to select all the rows matching those pairs.
In the example above, the pairs are [(0, 1), (1, 1)] (with means 2 and 0, respectively) and the desired result is:
                col0  col1
idx0 idx1 idx2
0    1    0        3     2
          1        1     3
1    1    0        0     6
          1        0     7
What I have tried
Step 1: Group by idx0, idx1 and calculate the mean of col0:
mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
>>> mean_col0
idx0  idx1
0     0       4.0
      1       2.0
1     0       2.0
      1       0.0
Name: col0, dtype: float64
Step 2: Select the idxmin within each group of idx0:
level_idxs = mean_col0.groupby('idx0').idxmin()
>>> level_idxs
idx0
0    (0, 1)
1    (1, 1)
Name: col0, dtype: object
Step 3: Use that to filter the original dataframe.
That's the main problem. When I simply try df.loc[level_idxs], I get a ValueError due to a shape mismatch. I would need the third index value or a wildcard.
I think I have a solution. Putting it all together with the steps above:
mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
level_idxs = mean_col0.groupby(["idx0"]).idxmin()
result = df[df.index.droplevel(2).isin(level_idxs)]
But it seems quite complicated. Is there a better way?

You can use .apply().
For each group of idx0, query only those idx1 values that have the smallest mean in col0:
df.groupby('idx0').apply(lambda g:
    g.query(f"idx1 == {g.groupby('idx1')['col0'].mean().idxmin()}")
).droplevel(0)
The same can be written in this (hopefully more readable) way:
def f(df):
    chosen_idx1 = df.groupby('idx1')['col0'].mean().idxmin()
    return df.query('idx1 == @chosen_idx1')

df.groupby('idx0').apply(f).droplevel(0)
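As a quick check (assuming the example df from the question), this reproduces the desired rows:
result = df.groupby('idx0').apply(f).droplevel(0)
print(result)
#                 col0  col1
# idx0 idx1 idx2
# 0    1    0        3     2
#           1        1     3
# 1    1    0        0     6
#           1        0     7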

Related

Sort dataframe based on minimum value of two columns

Let's assume I have the following dataframe:
import pandas as pd

d = {'col1': [1, 2, 3, 4], 'col2': [4, 2, 1, 3], 'col3': [1, 0, 1, 1], 'outcome': [1, 0, 1, 0]}
df = pd.DataFrame(data=d)
I want this dataframe sorted by the minimum value of col1 and col2 per row. The order of the indexes should be 2, 0, 1, 3.
I tried df.sort_values(by=['col2', 'col1']), but that sorts by one column first and then by the other. Is there any way to sort by taking the minimum of two columns?
Using numpy.lexsort:
import numpy as np

order = np.lexsort(np.sort(df[['col1', 'col2']])[:, ::-1].T)
out = df.iloc[order]
Output:
   col1  col2  col3  outcome
2     3     1     1        1
0     1     4     1        1
1     2     2     0        0
3     4     3     1        0
Note that you can easily handle any number of columns:
df.iloc[np.lexsort(np.sort(df[['col1', 'col2', 'col3']])[:, ::-1].T)]

   col1  col2  col3  outcome
1     2     2     0        0
2     3     1     1        1
0     1     4     1        1
3     4     3     1        0
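To unpack how this works (a sketch, using the same df and numpy import as above): np.sort orders each row's values ascending, the [:, ::-1] flip puts the row minimum last, and the transpose turns the columns into lexsort keys. Since np.lexsort treats its last key as the primary one, rows end up ordered by their minimum first and their maximum second:
keys = np.sort(df[['col1', 'col2']])  # per-row ascending: [[1, 4], [2, 2], [1, 3], [3, 4]]
print(keys[:, ::-1].T)                # first row: maxima (secondary key); last row: minima (primary key)
# [[4 2 3 4]
#  [1 2 1 3]]
print(np.lexsort(keys[:, ::-1].T))    # -> [2 0 1 3]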
One way (not the most efficient):
idx = df[['col2', 'col1']].apply(lambda x: tuple(sorted(x)), axis=1).sort_values().index
Output:
>>> df.loc[idx]
   col1  col2  col3  outcome
2     3     1     1        1
0     1     4     1        1
1     2     2     0        0
3     4     3     1        0
>>> idx
Int64Index([2, 0, 1, 3], dtype='int64')
You can decorate-sort-undecorate, where the decorations are the minimal and the other (i.e., maximal) values per row:
cols = ["col1", "col2"]

(df.assign(_min=df[cols].min(axis=1), _other=df[cols].max(axis=1))
   .sort_values(["_min", "_other"])
   .drop(columns=["_min", "_other"]))
to get
   col1  col2  col3  outcome
2     3     1     1        1
0     1     4     1        1
1     2     2     0        0
3     4     3     1        0
I would compute min(col1, col2) as a new column and then sort by it:
import pandas as pd

d = {'col1': [1, 2, 3, 4], 'col2': [4, 2, 1, 3], 'col3': [1, 0, 1, 1], 'outcome': [1, 0, 1, 0]}
df = pd.DataFrame(data=d)

df['colmin'] = df[['col1', 'col2']].min(axis=1)          # compute the row-wise min
df = df.sort_values(by='colmin').drop(columns='colmin')  # sort by it, then drop it
print(df)
which gives the output
   col1  col2  col3  outcome
0     1     4     1        1
2     3     1     1        1
1     2     2     0        0
3     4     3     1        0
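Note (an observation, not from the original answer): rows 0 and 2 tie on the minimum value 1, and a stable sort keeps them in their original order, which is why this output begins 0, 2 rather than the requested 2, 0, 1, 3. Adding the row maximum as a secondary key, as in the decorate-sort-undecorate answer above, breaks the tie:
df['colmin'] = df[['col1', 'col2']].min(axis=1)
df['colmax'] = df[['col1', 'col2']].max(axis=1)
df = df.sort_values(by=['colmin', 'colmax']).drop(columns=['colmin', 'colmax'])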

Get first occurrence of 1 in column2 based on column 1 condition

There is a dataframe df with two columns, col1 and col2. Both columns have randomly spread 0s and 1s, with more zeros than ones.
If col1 has a 1 at some index, the program should look for the next 1 in col2 and get the difference between the indices of the two rows.
The distribution is different every time, and so is the sequence length.
Try with idxmax:
id1 = df.col1.idxmax()
id2 = df.loc[id1:, 'col2'].idxmax()

>>> id2 - id1
2
>>> id2
4
>>> id1
2
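The question's data was only posted as an image, so for illustration here is a minimal frame (an assumption) consistent with the numbers above:
import pandas as pd

df = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 1],
                   'col2': [0, 0, 0, 0, 1, 0]})
# df.col1.idxmax()            -> 2, the first 1 in col1
# df.loc[2:, 'col2'].idxmax() -> 4, the first 1 in col2 at or after index 2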
I cannot see your posted image.
How about this.
import random
import pandas as pd

numrows = 10
df = pd.DataFrame({'c1': [random.randint(0, 1) for _ in range(numrows)],
                   'c2': [random.randint(0, 1) for _ in range(numrows)]})
print(df)

col1_index = None
for index, row in df.iterrows():
    if col1_index is not None:
        if row['c2'] == 1:
            diff = col1_index - index
            print(f'first occurrence of 1 at c2 is at index {index}, the index diff is {diff}')
            col1_index = None
    elif row['c1'] == 1:
        col1_index = index
        print(f'this index {index} has value 1 at c1')
Typical output
   c1  c2
0   1   0
1   0   0
2   0   0
3   1   1
4   0   1
5   0   0
6   1   1
7   0   1
8   0   1
9   1   1
this index 0 has value 1 at c1
first occurrence of 1 at c2 is at index 3, the index diff is -3
this index 6 has value 1 at c1
first occurrence of 1 at c2 is at index 7, the index diff is -1
this index 9 has value 1 at c1

Pandas DataFrame replace negative values with latest preceding positive value

Consider a DataFrame such as
df = pd.DataFrame({'a': [1, -2, 0, 3, -1, 2],
                   'b': [-1, -2, -5, -7, -1, -1],
                   'c': [-1, -2, -5, 4, 5, 3]})
For each column, how do you replace any negative value with the last positive value or zero? "Last" here means scanning from top to bottom within each column. The closest solution I have found is, for instance, df[df < 0] = 0, which replaces with zero rather than with the preceding value.
The expected result would be a DataFrame such as
df_res = pd.DataFrame({'a': [1, 1, 0, 3, 3, 2],
                       'b': [0, 0, 0, 0, 0, 0],
                       'c': [0, 0, 0, 4, 5, 3]})
You can use DataFrame.mask to convert all values < 0 to NaN, then use ffill and fillna:
df = df.mask(df.lt(0)).ffill().fillna(0).convert_dtypes()
   a  b  c
0  1  0  0
1  1  0  0
2  0  0  0
3  3  0  4
4  3  0  5
5  2  0  3
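Spelled out with comments (a restatement of the one-liner above, assuming the question's df):
out = (df.mask(df.lt(0))     # values < 0 become NaN
         .ffill()            # each NaN takes the last non-NaN value above it
         .fillna(0)          # leading NaNs (no previous positive value) become 0
         .convert_dtypes())  # restore integer dtype after the NaN detour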
Use pandas where
df.where(df.gt(0)).ffill().fillna(0).astype(int)
   a  b  c
0  1  0  0
1  1  0  0
2  1  0  0
3  3  0  4
4  3  0  5
5  2  0  3
The expected result may be obtained with these manipulations:
import numpy as np

mask = df >= 0                    # boolean mask for non-negative values
df_res = (df.where(mask, np.nan)  # replace negative values with NaN
            .ffill()              # forward-fill the NaNs
            .fillna(0))           # fill the remaining NaNs with zeros
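A quick check (assuming the df from the question) that this matches the expected df_res, apart from the float dtype that the NaN detour introduces:
print(df_res.astype(int))
#    a  b  c
# 0  1  0  0
# 1  1  0  0
# 2  0  0  0
# 3  3  0  4
# 4  3  0  5
# 5  2  0  3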

pandas series add with previous row on condition

I need to add to the previous row's running total only if a condition matches in the current cell. Here's the dataframe:
import pandas as pd

data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to add +1 to the previous running total if the cell's value is > 0, and subtract 1 otherwise. So the result I'm expecting is:
    col1  continuous
0      1           1    # +1 as it's non-zero
1      2           2    # +1 as it's non-zero
2      1           3    # +1 as it's non-zero
3      0           2    # -1 as it's zero
4      0           1
5      0           0
6      0           0    # not to go below 0
7      3           1
8      2           2
9      2           3
10     0           2
11     0           1
Case 2: the same, but where instead of > 0 the condition is < -0.1:
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.042161609,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = -0.1
s = df['col1'].lt(lim)
out = s.where(s, -1).cumsum()
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
print(df)
The key problem here, to me, is to keep the running total out from going below zero. With that in mind, we can mask the output where it's negative and adjust accordingly:
# a little longer data for a corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0, 0, 0, 0, 2, 3, 4]})

s = df.col1.gt(0)
out = s.where(s, -1).cumsum()
df['continuous'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
Output:
    col1  continuous
0      1           1
1      2           2
2      1           3
3      0           2
4      0           1
5      0           0
6      0           0
7      3           1
8      2           2
9      2           3
10     0           2
11     0           1
12     0           0
13     0           0
14     0           0
15     2           1
16     3           2
17     4           3
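To see why the correction term works, here is a sketch of the intermediates (using the same longer df as above). out is the raw +1/-1 running total, which can dip below zero; the masked-and-ffilled series remembers the latest below-zero value reached during a run of zeros, and subtracting it lifts the total so those runs bottom out at 0:
s = df.col1.gt(0)
out = s.where(s, -1).cumsum()                          # raw +1/-1 running total; may go negative
floor = out.where((out < 0) & (~s)).ffill().fillna(0)  # latest sub-zero value hit during a zero run
df['continuous'] = out - floor                         # lifted so zero runs bottom out at 0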
You can do this using the cumsum function on booleans.
Give me a +1 whenever col1 is not zero:
(df.col1 != 0).cumsum()
Give me a -1 whenever col1 is zero:
-(df.col1 == 0).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0).cumsum() - (df.col1 == 0).cumsum()
However, this does not satisfy the "not dropping below zero" criterion you mentioned.
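For completeness (an addition, not part of this answer): the difference of cumsums above is the same raw running total as out in the previous answer, assuming col1 is non-negative as in the example, so the same below-zero correction can be bolted on:
raw = (df.col1 != 0).cumsum() - (df.col1 == 0).cumsum()
floor = raw.where((raw < 0) & (df.col1 == 0)).ffill().fillna(0)  # latest sub-zero value in a zero run
df['continuous'] = raw - floor                                   # clamped so the total never shows below 0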

pandas: Use if-else to populate new column

I have a DataFrame like this:
col1  col2
1     0
0     1
0     0
0     0
3     3
2     0
0     4
I'd like to add a column that is a 1 if col2 is > 0 or 0 otherwise. If I was using R I'd do something like
df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)
How would I do this in python / pandas?
You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):
df['col3'] = (df.col2 > 0).astype('int')
(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.)
This produces col3 as:
   col2  col3
0     0     0
1     1     1
2     0     0
3     0     0
4     3     1
5     0     0
6     4     1
Another way to create the column could be to use np.where, which lets you specify a value for each of the true and false cases and is perhaps closer to the syntax of the R function ifelse. For example:
>>> import numpy as np
>>> np.where(df['col2'] > 0, 4, -1)
array([-1,  4, -1, -1,  4, -1,  4])
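Applied to the question's 1/0 case, that becomes:
df['col3'] = np.where(df['col2'] > 0, 1, 0)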
I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.
df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
                    'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1
>>> df1
   col1  col2  col3
0     1     0     0
1     0     1     1
2     0     0     0
3     0     0     0
4     3     3     1
5     2     0     0
6     0     4     1
You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.
df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)
