I have this DF:
df = pd.DataFrame(data=[[-2.000000, -1.958010, 0.2],
[-1.958010, -1.916030, 0.4],
[-1.916030, -1.874040, 0.3],
[-1.874040, -1.832050, 0.6],
[-1.832050, -1.790070, 0.8],
[-1.790070, -1.748080, 0.2]],columns=['egystart','egyend','fx'])
So I want to group by every two rows and take fx as the mean of the two rows. egystart should be the egystart of the first row and egyend should be the egyend of the second row.
In this case I should obtain:
-2.000000 -1.916030 0.3
-1.916030 -1.832050 0.45
-1.832050 -1.748080 0.5
So I have tried something like this:
df.groupby((df.egystart == df.egyend.shift(1)).cumsum()).agg({'egystart':min, 'egyend':max, 'fx':HERE_THE_MEAN_VALUE})
But it doesn't work.
You could try this to group every 2 rows and take the mean of each column:
result = df.groupby(np.arange(len(df))//2).mean()
print(result)
egystart egyend fx
0 -1.979005 -1.937020 0.30
1 -1.895035 -1.853045 0.45
2 -1.811060 -1.769075 0.50
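If egystart should come from the first row of each pair and egyend from the second row (as in the desired output in the question), a sketch combining the same grouper with agg:
import numpy as np

result = df.groupby(np.arange(len(df)) // 2).agg(
    {'egystart': 'first', 'egyend': 'last', 'fx': 'mean'})
print(result)
#    egystart    egyend    fx
# 0 -2.000000 -1.916030  0.30
# 1 -1.916030 -1.832050  0.45
# 2 -1.832050 -1.748080  0.50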
I'll admit that this question is quite specific. I'm trying to write a function that reads two time columns (same label) in separate dataframes, df1['gps'] and df2['gps']. I want to look for elements in the first column which are close to those in the second column, not necessarily in the same row. When the condition on time distance is met, I want to save the close elements of df1['gps'] and df2['gps'] in a new dataframe called coinc, in separate columns coinc['gps1'] and coinc['gps2'], in the fastest and most efficient way. This is my code:
def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    index_boolean = False
    if df2 is None:
        df2 = df1.copy()
    coincs = pd.DataFrame()
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps+shift-df2.gps)<tdelta]
        print(r1.gps)
        coincs_single = pd.DataFrame()
        if len(ctrig)>0:
            coincs_single['gps1'] = r1.gps
            coincs_single['gps2'] = ctrig.gps
            coincs = pd.concat((coincs, coincs_single), axis=0, ignore_index=index_boolean)
            index_boolean = True
        else:
            pass
    return coincs
The script runs fine, but when investigating the output, I find that one column of coinc is all NaN and I don't understand why. Test case with generated data:
a = pd.DataFrame() #define dataframes and fill them
b = pd.DataFrame()
a['gps'] = [0.12, 0.13, 0.6, 0.7]
b['gps'] = [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]
find_coinc(a, b, 0.16, 0)
The output yielded is:
gps1 gps2
0 NaN 0.10
1 NaN 0.10
2 NaN 0.50
3 NaN 0.81
4 NaN 0.82
5 NaN 0.83
How can I write coinc so that both columns turn out fine?
Well, here is another solution. Instead of concatenating two dataframes, just add new rows to the 'coincs' DataFrame, as shown below.
def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    if df2 is None:
        df2 = df1.copy()
    coincs = pd.DataFrame(columns=['gps1', 'gps2'])
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps+shift-df2.gps) < tdelta]
        if len(ctrig) > 0:
            for ctrig_value in ctrig['gps']:
                # Add n rows based on 'ctrig' length.
                coincs.loc[len(coincs)] = [r1.gps, ctrig_value]
        else:
            pass
    return coincs
# -------------------
a = pd.DataFrame() # define dataframes and fill them
b = pd.DataFrame()
a['gps'] = [0.12, 0.13, 0.6, 0.7]
b['gps'] = [0.1, 0.3, 0.5, 0.81, 0.82, 0.83]
coins = find_coinc(a, b, 0.16, 0)
print('\n\n')
print(coins.to_string())
Result:
gps1 gps2
0 0.12 0.10
1 0.13 0.10
2 0.60 0.50
3 0.70 0.81
4 0.70 0.82
5 0.70 0.83
I hope I could help! :D
So the issue is that there are multiple elements in df2['gps'] which satisfy the condition of being within the time window of df1['gps']. I think I found a solution, but I'm looking for a better one if possible. The modified line in the original function is highlighted with a ### FIX UPDATE comment:
def find_coinc(df1, df2=None, tdelta=.25, shift=0):
    index_boolean = False
    if df2 is None:
        df2 = df1.copy()
    coincs = pd.DataFrame()
    for _, r1 in tqdm(df1.iterrows(), total=len(df1)):
        ctrig = df2.loc[abs(r1.gps+shift-df2.gps)<tdelta]
        ctrig.reset_index(drop=True, inplace=True)
        coincs_single = pd.DataFrame()
        if len(ctrig) > 0:
            coincs_single['gps1'] = [r1.gps]*len(ctrig)  ### FIX UPDATE
            coincs_single['gps2'] = ctrig.gps
            print(ctrig.gps)
            coincs = pd.concat((coincs, coincs_single), axis=0, ignore_index=index_boolean)
            index_boolean = True
        else:
            pass
    return coincs
The solution I chose, since I want all the instances where the condition is met, was to write the same element of df1['gps'] into coinc['gps1'] the required number of times.
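If speed matters, a rough vectorized sketch that avoids iterrows altogether by comparing every pair of values with a numpy broadcast (the name find_coinc_vec is an assumption of mine, and it requires that a len(df1) x len(df2) boolean matrix fits in memory):
import numpy as np
import pandas as pd

def find_coinc_vec(df1, df2=None, tdelta=.25, shift=0):
    if df2 is None:
        df2 = df1.copy()
    g1 = df1['gps'].to_numpy()
    g2 = df2['gps'].to_numpy()
    # Entry (i, j) is True when df1.gps[i] (shifted) and df2.gps[j] are within tdelta.
    close = np.abs(g1[:, None] + shift - g2[None, :]) < tdelta
    i, j = np.nonzero(close)
    return pd.DataFrame({'gps1': g1[i], 'gps2': g2[j]})
On the test case above this should return the same six pairs as the fixed loop version.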
I have a simple dataframe as the following:
n_obs = 3
dd = pd.DataFrame({
    'WTL_exploded': [0, 1, 2]*n_obs,
    'hazard': [0.3, 0.4, 0.5, 0.2, 0.8, 0.9, 0.6, 0.6, 0.65],
}, index=[1, 1, 1, 2, 2, 2, 3, 3, 3])
dd
I want to group by the index and get the cumulative product of the hazard column. However, the product should only include all but the last element of each group.
Desired output:
index  hazard
1      0.3
1      0.12
2      0.2
2      0.16
3      0.6
3      0.36
How can I do that?
You can use:
out = dd.groupby(level=0, group_keys=False).apply(lambda x: x.cumprod().iloc[:-1])
Or:
out = dd.groupby(level=0).apply(lambda x: x.cumprod().iloc[:-1]).droplevel(1)
output:
WTL_exploded hazard
1 0 0.30
1 0 0.12
2 0 0.20
2 0 0.16
3 0 0.60
3 0 0.36
NB. you can also use lambda x: x.cumprod().head(-1).
The solution I found is a bit intricate but works for the test case.
First, get rid of the last row of each group:
ff = dd.groupby(lambda x:x, as_index=False).apply(lambda x: x.iloc[:-1])
ff
Then, restore the original index, group-by again and use pandas cumprod:
ff.reset_index().set_index('level_1').groupby(lambda x:x).cumprod()
Is there a more direct way?
I would like to transform the values of a Pandas DataFrame so that, for instance, the 3 smallest values in each row are set to zero:
row1: 0.21, 0.11, 0.24, 0.52, 0.12
row2: 0.31, 0.01, 0.44, 0.52, 0.52
Would become:
row1: 0.0, 0.0, 0.24, 0.52, 0.0
row2: 0.0, 0.0, 0.0, 0.52, 0.52
I would preferably like to do this without a loop.
We can use where + rank on axis=1. rank with method='min' and ascending=False establishes an ordering within the row such that the largest value is 1 and the smallest is 5 (the total length of the row). We then use where to keep the values with rank less than 3 (the two largest) and replace everything else with 0:
df = df.where(df.rank(axis=1, method='min', ascending=False) < 3, 0)
We can also use the opposite condition with mask to replace the values whose rank is 3 or greater (the three smallest) with 0 and keep the rest:
df = df.mask(df.rank(axis=1, method='min', ascending=False) >= 3, 0)
Either option produces df:
0 1 2 3 4
0 0.0 0.0 0.24 0.52 0.00
1 0.0 0.0 0.00 0.52 0.52
*Note: depending on the desired behaviour we may also want method='dense' or method='first', which changes how duplicated values are handled in the ranking (see the example after the setup below).
Setup:
import pandas as pd
df = pd.DataFrame({
    0: [0.21, 0.31],
    1: [0.11, 0.01],
    2: [0.24, 0.44],
    3: [0.52, 0.52],
    4: [0.12, 0.52]
})
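To make the tie-handling note concrete, here is how rank treats the duplicated 0.52 values in the second row under two of the methods (ranks shown as comments, assuming the setup above):
# method='min': both 0.52s get rank 1 but still occupy two positions,
# so 0.44 is ranked 3 and gets zeroed out together with the smaller values.
print(df.rank(axis=1, method='min', ascending=False))
#      0    1    2    3    4
# 0  3.0  5.0  2.0  1.0  4.0
# 1  4.0  5.0  3.0  1.0  1.0

# method='dense': 0.44 moves up to rank 2, so it would be kept
# and only the two smallest values of that row become 0.
print(df.rank(axis=1, method='dense', ascending=False))
#      0    1    2    3    4
# 0  3.0  5.0  2.0  1.0  4.0
# 1  3.0  4.0  2.0  1.0  1.0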
You can try:
A - Use list(df["col"].unique()) and sort/sorted to get the first three values. Put it into a list.
B - Use df.loc to remove the rows with a value within this new list
(something like df.loc[df["col"].isin(a)]; see the sketch below)
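A minimal sketch of steps A and B, assuming a hypothetical column named "col" and that the selected rows should be set to zero:
# A - the three smallest unique values of the column
a = sorted(df["col"].unique())[:3]

# B - select the rows whose value is in that list and zero them out
df.loc[df["col"].isin(a), "col"] = 0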
I have a pandas dataframe with three columns and want to multiply/increase the float numbers of each row by the same amount until the sum of all three cells in a row fulfils the criterion (a value equal to or greater than 0.9).
df = pd.DataFrame({'A':[0.03, 0.0, 0.4],
'B': [0.1234, 0.4, 0.333],
'C': [0.5, 0.4, 0.0333]})
Outcome:
The cells in each row were multiplied so that the sum of all three cells of each row is 0.9 (the sums below are not exactly 0.9 because I only approximated the factors with simple multiplication; the actual outcome would reach 0.9). It is important that cells which are 0 stay 0.
print (df)
A B C
0 0.0414 0.170292 0.690000
1 0.0000 0.452000 0.452000
2 0.4720 0.392940 0.039294
You can take the sum on axis=1, subtract it from 0.9, then divide by df.shape[1] and add the result back:
df.add((0.9-df.sum(axis=1))/df.shape[1],axis=0)
A B C
0 0.112200 0.205600 0.582200
1 0.033333 0.433333 0.433333
2 0.444567 0.377567 0.077867
You want to apply a scaling function along the rows:
def scale(xs, target=0.9):
    """Scale the features such that their sum equals the target."""
    xs_sum = xs.sum()
    if xs_sum < target:
        return xs * (target / xs_sum)
    else:
        return xs
df.apply(scale, axis=1)
For example:
df = pd.DataFrame({'A':[0.03, 0.0, 0.4],
'B': [0.1234, 0.4, 0.333],
'C': [0.5, 0.4, 0.0333]})
df.apply(scale, axis=1)
Should give:
A B C
0 0.041322 0.169972 0.688705
1 0.000000 0.450000 0.450000
2 0.469790 0.391100 0.039110
The rows of that dataframe all sum to 0.9:
df.apply(scale, axis=1).sum(axis=1)
0 0.9
1 0.9
2 0.9
dtype: float64
I have a pandas dataframe that contains the results of a computation and need to:
take the maximum value of a column and for that value find the maximum value of another column
take the minimum value of a column and for that value find the maximum value of another column
Is there a more efficient way to do it?
Setup
from collections import namedtuple

metrictuple = namedtuple('metrics', 'prob m1 m2')
l1 =[metrictuple(0.1, 0.4, 0.04),metrictuple(0.2, 0.4, 0.04),metrictuple(0.4, 0.4, 0.1),metrictuple(0.7, 0.2, 0.3),metrictuple(1.0, 0.1, 0.5)]
df = pd.DataFrame(l1)
# df
# prob m1 m2
#0 0.1 0.4 0.04
#1 0.2 0.4 0.04
#2 0.4 0.4 0.10
#3 0.7 0.2 0.30
#4 1.0 0.1 0.50
tmp = df.loc[(df.m1.max() == df.m1), ['prob','m1']]
res1 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.4, 0.4)
tmp = df.loc[(df.m2.min() == df.m2), ['prob','m2']]
res2 = tmp.loc[tmp.prob.max() == tmp.prob, :].to_records(index=False)[0]
#(0.2, 0.04)
Pandas isn't ideal for numerical computations like this because there is significant overhead in slicing and selecting data, in this example via df.loc.
The good news is that pandas interacts well with numpy, so you can easily drop down to the underlying numpy arrays.
Below I've defined some helper functions which make the code more readable. Note that numpy slicing is performed via row and column numbers starting from 0.
arr = df.values

def arr_max(x, col):
    return x[x[:, col] == x[:, col].max()]

def arr_min(x, col):
    return x[x[:, col] == x[:, col].min()]

res1 = arr_max(arr_max(arr, 1), 0)[:, :2]      # array([[ 0.4,  0.4]])
res2 = arr_max(arr_min(arr, 2), 0)[:, [0, 2]]  # array([[ 0.2 ,  0.04]])