How to combine rows in groupby with several conditions? - python

I want to combine rows in a pandas df with the following logic:
the dataframe is grouped by user
rows are ordered by start_at_min
rows are combined when:
Case A: if start_at_min <= 200:
combine when row2[start_at_min] - row1[stop_at_min] < 5
(e.g. 101 - 100 = 1 -> combine; 200 - 100 = 100 -> don't combine)
Case B: if 200 < start_at_min < 400:
change the threshold to 3
Case C: if start_at_min > 400:
never combine
Example df
user start_at_min stop_at_min
0 1 100 150
1 1 152 201 # combine with row 0 (152 - 150 = 2 < 5)
2 1 205 260 # don't combine with row 1 -> start_at_min above 200 -> threshold = 3
3 2 65 100 # no
4 2 200 265 # no
5 2 300 451 # no
6 2 452 460 # no -> start_at_min above 400 -> never combine
Expected output:
user start_at_min stop_at_min
0 1 100 201 # rows 0 and 1 combined
2 1 205 260 # rows 1 and 2 not combined -> start_at_min above 200 -> threshold = 3
3 2 65 100 # no
4 2 200 265 # no
5 2 300 451 # no
6 2 452 460 # no -> start_at_min above 400 -> never combine
I have written the function combine_rows, which takes two Series and applies this logic:
def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row2 - stop_at_min of row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T
However, I am unable to apply this function to the dataframe.
This was my attempt:
df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows)  # this does not work
Here is the full code:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "user": [1, 1, 2, 2],
    'start_at_min': [60, 101, 65, 200],
    'stop_at_min': [100, 135, 100, 265]
})

def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row2 - stop_at_min of row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T

df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows)  # this does not work

version 1: one condition
Perform a custom groupby.agg:
threshold = 5

# if the gap between the previous stop and the next start within a group
# is at or above the threshold, start a new group
group = (df['start_at_min']
         .sub(df.groupby('user')['stop_at_min'].shift())
         .ge(threshold)
         .cumsum()
         )

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'})
       )
Output:
user start_at_min stop_at_min
0 1 60 135
1 2 65 100
2 2 200 265
Intermediate:
(df['start_at_min']
 .sub(df.groupby('user')['stop_at_min'].shift())
)

0      NaN
1      1.0    # below threshold, this will be merged
2      NaN
3    100.0    # above threshold, keep separate
dtype: float64
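To make the mechanics explicit, here is the grouping key on the 4-row example df above (a quick check, assuming the same df as in the question): cumsum turns the "a new block starts here" mask into group labels, and rows sharing a label are aggregated together.
gap = df['start_at_min'].sub(df.groupby('user')['stop_at_min'].shift())
group = gap.ge(5).cumsum()
print(group.tolist())  # [0, 0, 0, 1]; combined with the user column this yields three groups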
version 2: multiple conditions
# define a variable threshold: 5 while start_at_min <= 200, otherwise 3
threshold = np.where(df['start_at_min'].le(200), 5, 3)
# array([5, 5, 3, 5, 5, 3, 3])

# compute the new starts of group like in version 1,
# but using the now variable threshold
m1 = (df['start_at_min']
      .sub(df.groupby('user')['stop_at_min'].shift())
      .ge(threshold)
      )

# add a second restart condition (start_at_min > 400)
m2 = df['start_at_min'].gt(400)

# if either mask is True, start a new group
group = (m1 | m2).cumsum()

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'})
       )
Output:
user start_at_min stop_at_min
0 1 100 201
1 1 205 260
2 2 65 100
3 2 200 265
4 2 300 451
5 2 452 460
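Putting it all together as a self-contained script (a sketch that reuses the question's 7-row example data and the column names above; rows are assumed to be pre-sorted by user and start_at_min):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user':         [1, 1, 1, 2, 2, 2, 2],
    'start_at_min': [100, 152, 205, 65, 200, 300, 452],
    'stop_at_min':  [150, 201, 260, 100, 265, 451, 460],
})

# variable threshold: 5 while start_at_min <= 200, otherwise 3
threshold = np.where(df['start_at_min'].le(200), 5, 3)

# a row starts a new block when the gap to the previous stop reaches the
# threshold, or when it starts after minute 400 (never combine)
gap = df['start_at_min'].sub(df.groupby('user')['stop_at_min'].shift())
new_block = gap.ge(threshold) | df['start_at_min'].gt(400)

out = (df.groupby(['user', new_block.cumsum()], as_index=False)
         .agg({'start_at_min': 'min', 'stop_at_min': 'max'}))
print(out)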

Related

Sample Pandas Dataframe with equal number based on binary column

I have a dataframe with a data column, and a value column, as in the example below. The value column is always binary, 0 or 1:
data,value
173,1
1378,0
926,0
643,0
1279,0
472,0
706,0
1345,0
1167,1
1401,1
1236,0
447,1
1204,1
398,0
714,0
734,0
1732,0
98,0
1696,0
160,0
1611,0
274,1
562,0
625,0
1028,0
1766,0
511,0
1691,0
898,1
I need to sample the dataset so that I end up with an equal number of both values. So if there are originally fewer rows of class 1, I need to use that count as the reference; in turn, if there are fewer rows of class 0, I need to use that.
Any clues on how to do this? I'm working in a Jupyter notebook, Python 3.6 (I cannot go up versions).
Sample data
data = [173,926,634,706,398]
value = [1,0,0,1,0]
df = pd.DataFrame({"data": data, "value": value})
print(df)
# data value
# 0 173 1
# 1 926 0
# 2 634 0
# 3 706 1
# 4 398 0
Filter to two DFs
ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
# data value
# 1 926 0
# 2 634 0
# 4 398 0
Truncate as required
Find which of the two is smaller, then truncate the other one to that length (take the first n rows):
if len(ones) <= len(zeros):
    zeros = zeros.iloc[:len(ones), :]
else:
    ones = ones.iloc[:len(zeros), :]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
#
#
# data value
# 1 926 0
# 2 634 0
Group your dataframe by value, then take a sample of the smallest group's size from each group.
grouped = df.groupby(['value'])
smallest = grouped.size().min()  # number of rows in the rarer class
try:  # Pandas 1.1.0+
    print(grouped.sample(smallest))
except AttributeError:  # pre-1.1.0: GroupBy has no sample method
    print(grouped.apply(lambda g: g.sample(smallest)))
Output:
data value
25 1766 0
3 643 0
10 1236 0
1 1378 0
14 714 0
6 706 0
24 1028 0
8 1167 1
9 1401 1
0 173 1
12 1204 1
11 447 1
28 898 1
21 274 1
This should do it.
df.groupby('value').sample(df.groupby('value').size().min())
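If you need the result to be reproducible, a minimal sketch (assuming pandas 1.1+, which still supports Python 3.6) is to pass random_state and reset the index afterwards:
import pandas as pd

df = pd.DataFrame({"data": [173, 926, 634, 706, 398],
                   "value": [1, 0, 0, 1, 0]})

n = df['value'].value_counts().min()           # size of the rarer class (2 here)
balanced = (df.groupby('value')
              .sample(n=n, random_state=42)    # same rows on every run
              .reset_index(drop=True))
print(balanced['value'].value_counts())        # two rows of each class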

Count until condition is reached in Pandas

I need some input from you. The idea is that I would like to see how long (in rows) it takes before a new value appears in column SUB_B1, and a new value in SUB_B2; i.e. how many steps there are between SUB_A1 and SUB_B1, and between SUB_A2 and SUB_B2.
I have structured the data something like this: (I sort the index in descending order by the results column. After that I separate index B and A and place them in new columns)
df.sort_values(['A','result'], ascending=[True,False]).set_index(['A','B'])
                   result  SUB_A1  SUB_A2  SUB_B1  SUB_B2
A       B
10_125  10_173  0.903257      10     125      10     173
        10_332  0.847333      10     125      10     332
        10_243  0.842802      10     125      10     243
        10_522  0.836335      10     125      10     522
        58_941  0.810760      10     125      58     941
...        ...       ...     ...     ...     ...     ...
10_173  10_125  0.903257      10     173      10     125
        58_941  0.847333      10     173      58     941
        1_941   0.842802      10     173       1     941
        96_512  0.836335      10     173      96     512
        10_513  0.810760      10     173      10     513
This is what I have done so far (edit: I think I need to iterate over values[], but I haven't managed to loop beyond the first rows yet):
def func(group):
    if group.SUB_A1.values[0] == group.SUB_B1.values[0]:
        group.R1.values[0] = 1
    else:
        group.R1.values[0] = 0
    if group.SUB_A1.values[0] == group.SUB_B1.values[1] and group.R1.values[0] == 1:
        group.R1.values[1] = 2
    else:
        group.R1.values[1] = 0

df['R1'] = 0
df.groupby('A').apply(func)
Expected outcome:
                   result  SUB_B1  SUB_B2  R1  R2
A       B
10_125  10_173  0.903257      10     173   1   0
        10_332  0.847333      10     332   2   0
        10_243  0.842802      10     243   3   0
        10_522  0.836335      10     522   4   0
        58_941  0.810760      58     941   0   0
...        ...       ...     ...     ...  ..  ..
Are you looking for something like this:
Sample dataframe:
df = pd.DataFrame(
    {"SUB_A": [1, -1, -2, 3, 3, 4, 3, 6, 6, 6],
     "SUB_B": [1, 2, 3, 3, 3, 3, 4, 6, 6, 6]},
    index=pd.MultiIndex.from_product([range(1, 3), range(1, 6)], names=("A", "B"))
)

     SUB_A  SUB_B
A B
1 1      1      1
  2     -1      2
  3     -2      3
  4      3      3
  5      3      3
2 1      4      3
  2      3      4
  3      6      6
  4      6      6
  5      6      6
Now this
equal = df.SUB_A == df.SUB_B
df["R"] = equal.groupby(equal.groupby("A").diff().fillna(True).cumsum()).cumsum()
leads to
     SUB_A  SUB_B  R
A B
1 1      1      1  1
  2     -1      2  0
  3     -2      3  0
  4      3      3  1
  5      3      3  2
2 1      4      3  0
  2      3      4  0
  3      6      6  1
  4      6      6  2
  5      6      6  3
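To see why the nested groupby works, it helps to inspect the intermediate grouping key on the sample dataframe above (a quick check; the astype(bool) is added here only so the cumsum prints as plain integers):
equal = df.SUB_A == df.SUB_B
# a new label starts whenever the True/False flag flips, or a new A group begins
streak_id = equal.groupby("A").diff().fillna(True).astype(bool).cumsum()
print(streak_id.tolist())
# [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
# the outer cumsum then counts consecutive True rows within each label
print(equal.groupby(streak_id).cumsum().tolist())
# [1, 0, 0, 1, 2, 0, 0, 1, 2, 3]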
Try using pandas.DataFrame.iterrows and pandas.DataFrame.shift.
You can iterate through the dataframe, compare the current row with the previous one, and then add your condition:
df['SUB_A2_last'] = df['SUB_A2'].shift()
count = 0
# fill the column with zeros
df['count_series'] = 0
for index, row in df.iterrows():
    subA = row['SUB_A2']
    subA_last = row['SUB_A2_last']
    if subA == subA_last:
        count += 1
    else:
        count = 0
    df.loc[index, 'count_series'] = count
Then repeat for the B column. It is possible to get a better approach using pandas.DataFrame.apply and a custom function.
Puh! Super! Thanks for the input you guys
def func(group):
    for each in range(len(group)):
        if group.SUB_A1.values[0] == group.SUB_B1.values[each]:
            group.R1.values[each] = each + 1
            continue
        elif group.SUB_A1.values[0] == group.SUB_B1.values[each] and group.R1.values[each] == each + 1:
            group.R1.values[each] = each + 1
        else:
            group.R1.values[each] = 0
    return group

df['R1'] = 0
df.groupby('A').apply(func)

How do I create a while loop for this df that has moving average in every stage? [duplicate]

This question already has an answer here: For loop that adds and deducts from pandas columns (1 answer). Closed 1 year ago.
So I want to spread the shipments per ID across the group one by one, using the average (BAL/SALES) to decide which store gets each unit.
Here's my dataframe:
ID STOREID BAL SALES SHIP
1 STR1 50 5 18
1 STR2 6 7 18
1 STR3 74 4 18
2 STR1 35 3 500
2 STR2 5 4 500
2 STR3 54 7 500
While SHIP (grouped by ID) is greater than 0, calculate AVG (BAL/SALES); the row with the lowest AVG in each group gets +1 added to its BAL column and +1 to its Final column. Then repeat the process until SHIP is 0. The AVG is different at every stage, which is why I wanted a while loop.
Sample output of the first round is below. So do this until SHIP is 0 and the sum of Final per ID equals SHIP:
ID STOREID BAL SALES SHIP AVG Final
1 STR1 50 5 18 10 0
1 STR2 6 4 18 1.5 1
1 STR3 8 4 18 2 0
2 STR1 35 3 500 11.67 0
2 STR2 5 4 500 1.25 1
2 STR3 54 7 500 7.71 0
I've tried a couple of ways in SQL, but I thought it would be better to do it in Python; I just haven't been doing a great job with my loop. Here's what I tried so far:
df['AVG'] = 0
df['FINAL'] = 0
for i in df.groupby(["ID"])['SHIP']:
    if i > 0:
        df['AVG'] = df['BAL'] / df['SALES']
        df['SHIP'] = df.groupby(["ID"])['SHIP'] - 1
        total = df.groupby(["ID"])["FINAL"].transform("cumsum")
        df['FINAL'] = + 1
        df['A'] = + 1
    else:
        df['FINAL'] = 0
This was challenging because more than one row in the group can have the same average calculation, which then throws off the allocation.
This works on the example dataframe, if I understood you correctly.
d = {'ID': [1, 1, 1, 2, 2, 2],
     'STOREID': ['str1', 'str2', 'str3', 'str1', 'str2', 'str3'],
     'BAL': [50, 6, 74, 35, 5, 54],
     'SALES': [5, 7, 4, 3, 4, 7],
     'SHIP': [18, 18, 18, 500, 500, 500]}
df = pd.DataFrame(data=d)

df['AVG'] = 0
df['FINAL'] = 0

def calc_something(x):
    # print(x.iloc[0]['SHIP'])
    for i in range(x.iloc[0]['SHIP'])[0:500]:   # cap at 500 iterations per group
        x['AVG'] = x['BAL'] / x['SALES']
        x['SHIP'] = x['SHIP'] - 1
        x = x.sort_values('AVG').reset_index(drop=True)
        # print(x.iloc[0, 2])
        x.iloc[0, 2] = x['BAL'][0] + 1      # column 2 is BAL
        x.iloc[0, 6] = x['FINAL'][0] + 1    # column 6 is FINAL
    return x

df_final = (df.groupby('ID')
              .apply(calc_something)
              .reset_index(drop=True)
              .sort_values(['ID', 'STOREID']))
df_final
ID STOREID BAL SALES SHIP AVG FINAL
1 1 STR1 50 5 0 10.000 0
0 1 STR2 24 7 0 3.286 18
2 1 STR3 74 4 0 18.500 0
4 2 STR1 127 3 0 42.333 92
5 2 STR2 170 4 0 42.500 165
3 2 STR3 297 7 0 42.286 243
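If the positional iloc indexing feels brittle (the column numbers shift as soon as a column is added), a label-based sketch of the same allocation loop might look like this (same column names and df as above; intended to match the behaviour of the answer, with ties on AVG going to the first row just like the sort-based version):
def allocate(group):
    group = group.copy()
    ship = int(group['SHIP'].iloc[0])
    group['FINAL'] = 0
    for _ in range(ship):
        # recompute the average and hand one unit to the current minimum
        avg = group['BAL'] / group['SALES']
        target = avg.idxmin()
        group.loc[target, 'BAL'] += 1
        group.loc[target, 'FINAL'] += 1
    group['SHIP'] = 0
    group['AVG'] = group['BAL'] / group['SALES']
    return group

df_alt = df.groupby('ID', group_keys=False).apply(allocate)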

Group by one column compare another column and add values to a new column in Python?

I have these columns:
index, area, key0
I have to group by index (it is a normal column called index) in order to take the rows that have the same value
# all the ones, all the twos, etc.
Some of the rows are unique, though.
For the ones that are not unique: I have to check, with a groupby, which row in each group has the largest area, and give its key0 value to the others in its group in a new column called key1.
The unique rows keep the same value they had in key0 in the new key1 column.
What I have done so far:
First I checked which index values occur more than once, in order to know which ones are going to form groups:
df['index'].value_counts()[df['index'].value_counts()>1]
359 9
391 8
376 7
374 6
354 5
446 4
403 4
348 4
422 4
424 4
451 4
364 3
315 3
100 3
245 3
345 3
247 3
346 3
347 3
351 3
which worked fine. The question now is how to do the rest.
the dataset:
df = pd.DataFrame({"index": [1, 2, 3, 5, 1, 2, 3, 3, 3],
                   "area": [50, 60, 70, 80, 90, 100, 10, 20, 70],
                   "key0": ["1f", 2, "3d", 4, 5, 6, 7, 8, 9]})
print(df)
# INPUT
   area  index key0
    50      1   1f
    60      2    2
    70      3   3d
    80      5    4
    90      1    5
   100      2    6
    10      3    7
    20      3    8
    70      3    9
dataset:
import geopandas as gpd

inte = gpd.read_file('in.shp')
inte["rank_gr"] = inte.groupby("index")["area_of_poly"].rank(ascending=False, method="first")
inte["key1_temp"] = inte.apply(lambda row: str(row[""]) if row["rank_gr"] == 1.0 else "", axis=1)
inte["CAD_ADMIN_FINAL"] = inte.groupby("index")["key1_temp"].transform("sum")
print(inte[["area_of_poly", "index", "CAD_ADMIN", "CAD_ADMIN_FINAL"]])
Checked with the data you provided, and it works. I haven't found any "key0" column, so I assumed it can be "CAD_ADMIN". "AREA" has only one value, so I took "AREA_2".
import geopandas as gpd

# set your path
path = r"p\in.shp"
p = gpd.read_file(path)
p["rank_gr"] = p.groupby("index")["AREA_2"].rank(ascending=False, method="first")
p["key1_temp"] = p.apply(lambda row: str(row["CAD_ADMIN"]) if row["rank_gr"] == 1.0 else "", axis=1)
p["key1"] = p.groupby("index")["key1_temp"].transform("sum")
p = p[["AREA_2", "index", "CAD_ADMIN", "key1"]]
print(p.sort_values(by=["index"]))

         AREA_2  index CAD_ADMIN     key1
   1.866706e+06      0   0113924  0113924
   1.559865e+06      1   0113927  0113926
   1.593623e+06      1   0113926  0113926
   1.927774e+06      2   0113922  0113922
   1.927774e+06      3   0113922  0113922
Do you mean something like this?
import pandas as pd

df = pd.DataFrame({"index": [1, 2, 3, 5, 1, 2, 3, 3, 3],
                   "area": [50, 60, 70, 80, 90, 100, 10, 20, 70],
                   "key0": ["1f", 2, "3d", 4, 5, 6, 7, 8, 9]})
print(df)
# INPUT
   area  index key0
    50      1   1f
    60      2    2
    70      3   3d
    80      5    4
    90      1    5
   100      2    6
    10      3    7
    20      3    8
    70      3    9

# rank the areas within each index group; rank 1 is the largest area
df["rank_gr"] = df.groupby("index")["area"].rank(ascending=False, method="first")
# keep key0 only on the top-ranked row of each group ...
df["key1_temp"] = df.apply(lambda row: str(row["key0"]) if row["rank_gr"] == 1.0 else "", axis=1)
# ... and broadcast it to the whole group by summing the strings
df["key1"] = df.groupby("index")["key1_temp"].transform("sum")
print(df[["area", "index", "key0", "key1"]])
# OUTPUT
   area  index key0 key1
    50      1   1f    5
    60      2    2    6
    70      3   3d   3d
    80      5    4    4
    90      1    5    5
   100      2    6    6
    10      3    7   3d
    20      3    8   3d
    70      3    9   3d
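A slightly more direct alternative (a sketch under the same assumptions, not part of the original answer) is to build a lookup from each index value to the key0 of its largest-area row and map it back; note this keeps key0's original values instead of converting them to strings:
largest = (df.sort_values("area", ascending=False)
             .drop_duplicates("index")              # keep the largest-area row per index
             .set_index("index")["key0"])
df["key1"] = df["index"].map(largest)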

Subtracting many columns in a df by one column in another df

I'm trying to subtract a df "p_df" (144 rows x 1 col) from a df "stock_returns" (144 rows x 517 cols).
I have tried:
stock_returns - p_df
stock_returns.rsub(p_df, axis=1)
stock_returns.subtract(p_df)
But none of them work; they all return NaN values.
I'm passing it through this function, and using the for loop to get the args:
def disp_calc(returns, p, wi):  # apply(disp_calc, rows=...)
    wi = wi / np.sum(wi)
    rp = (col_len(returns) * (returns - p)**2).sum()  # returns - p causing problems
    return np.sqrt(rp)

for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]  # .apply(np.mean, axis=1)
    portfolio_return = returns_rolling[i]
    p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc, args=(portfolio_return, wi))
My expected output is to subtract the single column of p_df from all 517 columns of the first df, so the final result would still have 517 columns. Thanks.
You're almost there, just need to set axis=0 to subtract along the indexes:
>>> stock_returns = pd.DataFrame([[10, 100, 200],
...                               [15, 115, 215],
...                               [20, 120, 220],
...                               [25, 125, 225],
...                               [30, 130, 230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
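Applied to the question's disp_calc, the fix is the same idea: replace returns - p with a row-aligned subtraction (a sketch; col_len, wi, and the surrounding loop are taken from the question and assumed to exist, with numpy imported as np):
def disp_calc(returns, p, wi):
    # p is the portfolio return Series, aligned with returns on the row index
    wi = wi / np.sum(wi)
    # subtract the single portfolio column from every one of the 517 columns
    rp = (col_len(returns) * returns.sub(p, axis=0)**2).sum()
    return np.sqrt(rp)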
data['new_col3'] = data['col1'] - data['col2']
