How to efficiently combine dataframe rows based on conditions? - python

I have the following dataset, which contains the cluster number, the number of observations in that cluster, and the maximum value of another variable x within that cluster.
import numpy as np
import pandas as pd

clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])
clust obs x_max
0 0 1041 10
1 1 544 20
2 2 310 30
3 3 1648 40
4 4 1862 50
5 5 2120 60
6 6 2916 70
7 7 5148 80
8 8 12733 90
9 9 1 100
My task is to merge clusters with adjacent rows so that each resulting cluster contains at least 1000 observations.
My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.
condition = True
while condition:
    condition = False
    for i in np.arange(0, len(df)):
        if df.loc[i, 'obs'] < 1000:
            # merge this cluster into the next one and re-aggregate
            df.loc[i, 'clust'] = df.loc[i, 'clust'] + 1
            df = df.groupby('clust', as_index=False).agg({'obs': 'sum', 'x_max': 'max'})
            condition = True
            break
Is there perhaps a more efficient way of doing this? I come from a SAS background, where such situations would be solved with the if last.row condition, but there seems to be no equivalent in pandas.
The resulting table should look like this:
clust obs x_max
0 1041 10
1 2502 40
2 1862 50
3 2120 60
4 2916 70
5 5148 80
6 12734 100

Here is another way. A fully vectorized approach is difficult to implement here, but looping over an array (or a list) is faster than using loc at each iteration. Also, it is not good practice to modify df within the loop; that can only cause problems.
# define variables
s = 0   # for the sum of observations
gr = [] # for the final grouping values
i = 0   # for the group indices

# loop over observations from an array
for obs in df['obs'].to_numpy():
    s += obs
    gr.append(i)
    # check that the size of the group is big enough
    if s > 1000:
        s = 0
        i += 1

# condition to deal with the last rows if the last group is not big enough
if s != 0:
    gr = [i - 1 if val == i else val for val in gr]
# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
# cluster obs x_max
# 0 0 1041 10
# 1 1 2502 40
# 2 2 1862 50
# 3 3 2120 60
# 4 4 2916 70
# 5 5 5148 80
# 6 6 12734 100
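If you need this in more than one place, the loop wraps naturally into a small helper (a sketch under the same assumptions; merge_small_clusters and the min_obs parameter are names introduced here, not from the question):
def merge_small_clusters(df, min_obs=1000):
    # build group labels that merge adjacent rows until each
    # group's running observation count exceeds min_obs
    s, i, gr = 0, 0, []
    for obs in df['obs'].to_numpy():
        s += obs
        gr.append(i)
        if s > min_obs:
            s, i = 0, i + 1
    # fold an undersized trailing group into the previous one
    if s != 0:
        gr = [i - 1 if val == i else val for val in gr]
    return gr

new_df = (
    df.groupby(merge_small_clusters(df)).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)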

Related

How to combine rows in groupby with several conditions?

I want to combine rows in pandas df with the following logic:
dataframe is grouped by users
rows are ordered by start_at_min
rows are combined when:
Case A: if start_at_min <= 200:
combine when row2[start_at_min] - row1[stop_at_min] < 5
(e.g. 101 - 100 = 1 -> combine; 200 - 100 = 100 -> don't combine)
Case B: if 200 < start_at_min <= 400:
change the threshold to 3
Case C: if start_at_min > 400:
never combine
Example df
user start_at_min stop_at_min
0 1 100 150
1 1 152 201 #row0 with row1 combine
2 1 205 260 #row1 with row2 NO -> start_at_min above 200 -> threshold = 3
3 2 65 100 #no
4 2 200 265 #no
5 2 300 451 #no
6 2 452 460 #no -> start_at_min above 400 -> never combine
Expected output:
user start_at_min stop_at_min
0 1 100 201 #row0 and row1 combined
2 1 205 260 #not combined -> start_at_min above 200 -> threshold = 3
3 2 65 100 #no
4 2 200 265 #no
5 2 300 451 #no
6 2 452 460 #no -> start_at_min above 400 -> never combine
I have written the function combine_rows, which takes 2 Series and applies this logic:
def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row2 - stop_at_min of row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T
However, I am unable to apply this function to the dataframe.
This was my attempt:
df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows) # this is not working
Here is the full code:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user': [1, 1, 2, 2],
    'start_at_min': [60, 101, 65, 200],
    'stop_at_min': [100, 135, 100, 265]
})

def combine_rows(s1: pd.Series, s2: pd.Series):
    # take 2 rows and combine them if start_at_min of row2 - stop_at_min of row1 < 5
    if s2['start_at_min'] - s1['stop_at_min'] < 5:
        return pd.Series({
            'user': s1['user'],
            'start_at_min': s1['start_at_min'],
            'stop_at_min': s2['stop_at_min']
        })
    else:
        return pd.concat([s1, s2], axis=1).T

df.groupby('user').sort_values(by=['start_at_min']).apply(combine_rows) # this is not working
version 1: one condition
Perform a custom groupby.agg:
threshold = 5

# if the successive stop/start per group are above threshold,
# start a new group
group = (df['start_at_min']
         .sub(df.groupby('user')['stop_at_min'].shift())
         .ge(threshold).cumsum()
         )

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'})
       )
Output:
user start_at_min stop_at_min
0 1 60 135
1 2 65 100
2 2 200 265
Intermediate:
(df['start_at_min']
.sub(df.groupby('user')['stop_at_min'].shift())
)
0 NaN
1 1.0 # below threshold, this will be merged
2 NaN
3 100.0 # above threshold, keep separate
dtype: float64
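For reference, the group labels this produces on the four-row example df (a quick check) are:
print(group.tolist())
# [0, 0, 0, 1]
# rows 0-2 share label 0, but grouping by ['user', group] still
# separates user 1 from user 2; row 3 starts a new group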
version 2: multiple conditions
# define a variable threshold
threshold = np.where(df['start_at_min'].le(200), 5, 3)
# array([5, 5, 3, 5, 5, 3, 3])

# compute the new group starts as in version 1,
# but using the now-variable threshold
m1 = (df['start_at_min']
      .sub(df.groupby('user')['stop_at_min'].shift())
      .ge(threshold)
      )

# add a second restart condition (>400)
m2 = df['start_at_min'].gt(400)

# if either mask is True, start a new group
group = (m1 | m2).cumsum()

# groupby.agg
out = (df.groupby(['user', group], as_index=False)
         .agg({'start_at_min': 'min',
               'stop_at_min': 'max'})
       )
Output:
user start_at_min stop_at_min
0 1 100 201
1 1 205 260
2 2 65 100
3 2 200 265
4 2 300 451
5 2 452 460
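For completeness, here is a sketch of the seven-row frame this output assumes; the values are transcribed from the question's example df:
df = pd.DataFrame({
    'user': [1, 1, 1, 2, 2, 2, 2],
    'start_at_min': [100, 152, 205, 65, 200, 300, 452],
    'stop_at_min': [150, 201, 260, 100, 265, 451, 460],
})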

Create a master data set comprised of multiple data frames

I have been stuck on this problem for a while now! Included below is a very simplified version of my program, along with some context. Essentially, what I want is one large dataframe that contains all of my desired permutations based on my input variables. This is in the context of scenario analysis, and it will help me avoid doing on-demand calculations through my BI tool when the user wants to change variables to visualise the output.
I have tried:
Creating a function out of my code and trying to apply the function with each of the step-size changes of my input variables (no idea what I am doing there).
Literally manually changing the input variables myself (as a noob I realise this is not the way to go, but I first had to check that my code for appending df's was working).
Essentially what I want to achieve is as follows:
use the variables "date_offset" and "cost" and vary each of them by the defined step size over the required number of steps
As an example, if there are 2 values for date_offset (step size 1) and 2 values for cost (step size 1), there are 4 possible combinations; therefore the data set will be 4 times the size of the df in my code below.
Once I have all of the permutations of the input variables and the corresponding data frame for each permutation, I would like to append the data frames together.
I should be left with one data frame for all of the possible scenarios which I can then visualise with a BI tool.
I hope you guys can help :)
Here is my code.....
import pandas as pd
import numpy as np
# want to iterate through, starting at a date_offset of 0 with a total of 5 steps and a step size of 1
date_offset = 0
steps_1 = 5
stepsize_1 = 1

# want to iterate through, starting at a cost of 5 with a total of 4 steps and a step size of 1
cost = 5
steps_2 = 4
stepsize_2 = 1
df = {'id':['1a', '2a', '3a', '4a'],'run_life':[10,20,30,40]}
df = pd.DataFrame(df)
df['date_offset'] = date_offset
df['cost'] = cost
df['calc_col1'] = df['run_life']*cost
Are you trying to do something like this:
from itertools import product
data = {'id': ['1a', '2a', '3a', '4a'], 'run_life': [10, 20, 30, 40]}
df = pd.DataFrame(data)
date_offset = 0
steps_1 = 5
stepsize_1 = 1
cost = 5
steps_2 = 4
stepsize_2 = 1
df2 = pd.DataFrame(
    product(
        range(date_offset, date_offset + steps_1 * stepsize_1 + 1, stepsize_1),
        range(cost, cost + steps_2 * stepsize_2 + 1, stepsize_2)
    ),
    columns=['offset', 'cost']
)
result = df.merge(df2, how='cross')
result['calc_col1'] = result['run_life'] * result['cost']
Output:
id run_life offset cost calc_col1
0 1a 10 0 5 50
1 1a 10 0 6 60
2 1a 10 0 7 70
3 1a 10 0 8 80
4 1a 10 0 9 90
.. .. ... ... ... ...
115 4a 40 5 5 200
116 4a 40 5 6 240
117 4a 40 5 7 280
118 4a 40 5 8 320
119 4a 40 5 9 360
[120 rows x 5 columns]
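Note that how='cross' requires pandas 1.2 or newer. On older versions, a common workaround is a cross join via a temporary constant key (a sketch; _key is just a throwaway column name):
# cross join via a constant key, for pandas < 1.2
result = (
    df.assign(_key=1)
      .merge(df2.assign(_key=1), on='_key')
      .drop(columns='_key')
)
result['calc_col1'] = result['run_life'] * result['cost']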

How to use multiple conditions, including selecting on quantile in Python

Imagine the following dataset df:
Row  Population_density  Distance
1    400                 50
2    500                 30
3    300                 40
4    200                 120
5    500                 60
6    1000                50
7    3300                30
8    500                 90
9    700                 100
10   1000                110
11   900                 200
12   850                 30
How can I make a new dummy column that contains a 1 when df['Population_density'] is above the third quartile (the top 25%) AND df['Distance'] is < 100, and a 0 for the remainder of the data? Consequently, rows 6 and 7 should get a 1 while the other rows should get a 0.
Creating a dummy variable with only one criterion is fairly easy. For instance, the following works for creating a new dummy variable that contains a 1 when Distance is < 100 and a 0 otherwise: df['Distance_Below_100'] = np.where(df['Distance'] < 100, 1, 0). However, I do not know how to combine conditions when one of them involves a quantile selection (in this case, the upper 25% of the variable Population_density).
import pandas as pd

# assign data as lists
data = {'Row': range(1, 13, 1),
        'Population_density': [400, 500, 300, 200, 500, 1000, 3300, 500, 700, 1000, 900, 850],
        'Distance': [50, 30, 40, 120, 60, 50, 30, 90, 100, 110, 200, 30]}

# create DataFrame
df = pd.DataFrame(data)
You can use & or | to join the conditions:
import numpy as np

df['Distance_Below_100'] = np.where(
    df['Population_density'].gt(df['Population_density'].quantile(0.75))
    & df['Distance'].lt(100),
    1, 0
)
print(df)
Row Population_density Distance Distance_Below_100
0 1 400 50 0
1 2 500 30 0
2 3 300 40 0
3 4 200 120 0
4 5 500 60 0
5 6 1000 50 1
6 7 3300 30 1
7 8 500 90 0
8 9 700 100 0
9 10 1000 110 0
10 11 900 200 0
11 12 850 30 0
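Equivalently (a minor variant of the same idea), you can build the boolean mask first and cast it to int, which can read more clearly than one long np.where line:
mask = (
    df['Population_density'].gt(df['Population_density'].quantile(0.75))
    & df['Distance'].lt(100)
)
df['Distance_Below_100'] = mask.astype(int)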
Hey, to apply a function to a data frame I recommend using lambda.
For example, this is your function:
def myFunction(value):
    pass
To create a new column 'new_column' (pick_cell is the cell you want to apply the function to), note axis=1 so the lambda receives each row:
df['new_column'] = df.apply(lambda x: myFunction(x.pick_cell), axis=1)
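For instance (a toy sketch; my_function and the Distance column are placeholders, not from the question):
import pandas as pd

df = pd.DataFrame({'Distance': [50, 120, 90]})

def my_function(value):
    # placeholder logic: flag distances under 100
    return 1 if value < 100 else 0

# axis=1 passes each row to the lambda as a Series
df['new_column'] = df.apply(lambda row: my_function(row.Distance), axis=1)
print(df)
#    Distance  new_column
# 0        50           1
# 1       120           0
# 2        90           1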

How to split each column into more columns in a given dataframe

I have over 100 week columns, and for each week column I want to apportion its value into days, assigning row-specific values across 7 new columns.
I am new to python. I know I need while loops and for loops, but I'm not sure how to go about doing this. Can anyone help?
Based on previous advice I had from this forum, the below works for Week1; can someone advise me on how to loop over weeks 2, 3, 4 to the nth week?
import pandas as pd

df = pd.DataFrame({"Week1": [9, 30, 35, 65], "Week2": [20, 10, 25, 55],
                   "Week3": [19, 35, 40, 15], "Week4": [7, 10, 70, 105]})

# define which columns need to be created: col1 .. col7, one per day
columns_to_fill = ["col" + str(i) for i in range(1, 8)]

# now, go through each row of your dataframe
for indx, row in df.iterrows():
    # for each of the new columns, check if the day number is
    # smaller than or equal to the row's weekly total;
    # if it is, fill the column with 1, else fill it with 0
    for number, column in enumerate(columns_to_fill):
        if number + 1 <= row["Week1"]:
            df.loc[indx, column] = 1
        else:
            df.loc[indx, column] = 0
    # now check if there is a remainder
    remainder = row["Week1"] - 7
    # while the remainder is greater than 0,
    # we need to keep adding +1 to the columns
    while remainder > 0:
        for number, column in enumerate(columns_to_fill):
            if number + 1 <= remainder:
                df.loc[indx, column] += 1
            else:
                continue
        # update the remainder
        remainder = remainder - 7
Here is a vectorized option: first repeat each row 7 times (the number of days in a week), then add an extra index level with set_index holding the day number.
import numpy as np

_df = (
    df.loc[df.index.repeat(7)]
      .set_index(np.array(list(range(1, 8)) * len(df)), append=True)
)
print(_df.head(10))
# Week1 Week2 Week3 Week4
# 0 1 9 20 19 7
# 2 9 20 19 7
# 3 9 20 19 7
# 4 9 20 19 7
# 5 9 20 19 7
# 6 9 20 19 7
# 7 9 20 19 7
# 1 1 30 10 35 10
# 2 30 10 35 10
# 3 30 10 35 10
Now calculate the integer division by 7, then add the remainder where needed using the modulo %, which you can compare against the extra index level created above (the day number).
# integer division
res = _df // 7
# add the remainder where needed
res += (_df % 7 >= _df.index.get_level_values(1).to_numpy()[:, None]).astype(int)
print(res)
# Week1 Week2 Week3 Week4
# 0 1 2 3 3 1
# 2 2 3 3 1
# 3 1 3 3 1
# 4 1 3 3 1
# 5 1 3 3 1
# 6 1 3 2 1
# 7 1 2 2 1
# 1 1 5 2 5 2
# 2 5 2 5 2
# 3 4 2 5 2
Finally, reshape and rename columns if wanted.
# reshape the result
res = res.unstack()
# rename the columns if you don't want multiindex
res.columns = [f'{w}_col{i}' for w, i in res.columns]
print(res)
# Week1_col1 Week1_col2 Week1_col3 Week1_col4 Week1_col5 Week1_col6 \
# 0 2 2 1 1 1 1
# 1 5 5 4 4 4 4
# 2 5 5 5 5 5 5
# 3 10 10 9 9 9 9
# Week1_col7 Week2_col1 Week2_col2 Week2_col3 Week2_col4 Week2_col5 \
# 0 1 3 3 3 3 3
# 1 4 2 2 2 1 1
# ...
and you can still join the result to your original dataframe:
res = df.join(res)
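As a quick consistency check (a sketch), each week's seven day columns should sum back to the original weekly totals:
for week in ['Week1', 'Week2', 'Week3', 'Week4']:
    day_cols = [c for c in res.columns if c.startswith(f'{week}_')]
    assert res[day_cols].sum(axis=1).equals(df[week])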
If you want to use while/for loops, this iterates over the original data frame. The data frame rows can be of any length and have any number of header elements. The sub-header can have any number of elements (1D).
# imports
import numpy as np
import pandas as pd

# example data frame and sub-header
df = pd.DataFrame({"Week1": [9, 30, 35, 65], "Week2": [20, 10, 25, 55],
                   "Week3": [19, 35, 40, 15], "Week4": [7, 10, 70, 105]})
subHeader = ['day1', 'day2', 'day3', 'day4', 'day5', 'day6', 'day7']

# sort data frame and sub-header
df = df.reindex(sorted(df.columns), axis=1)
subHeader.sort()

# extract relevant variables
cols = df.shape[1]
rows = df.shape[0]
subHeadLen = len(subHeader)
mainHeader = list(df.columns)
mainHeadLen = len(mainHeader)

# MultiIndex: main header combined with sub-header
header = pd.MultiIndex.from_product([mainHeader, subHeader], names=['Week', 'Day'])

# hold values in a temporary matrix
mat = np.zeros((rows, mainHeadLen * subHeadLen))

# iterate over data frame weeks; for every value in each row, distribute
# over matrix indices by incrementing elements day by day
# (note: this consumes df by decrementing its values in place)
for col in range(cols):
    for val in range(rows):
        while df.iat[val, col] > 0:
            for subVal in range(subHeadLen):
                if df.iat[val, col] > 0:
                    mat[val][col * subHeadLen + subVal] += 1
                    df.iat[val, col] -= 1

# final data frame
df2 = pd.DataFrame(mat, columns=header)
print(df2)
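A similar sanity check works here too (a sketch; note the loop above consumes df by decrementing it, so compare against a fresh copy of the totals):
df_orig = pd.DataFrame({"Week1": [9, 30, 35, 65], "Week2": [20, 10, 25, 55],
                        "Week3": [19, 35, 40, 15], "Week4": [7, 10, 70, 105]})
for week in df_orig.columns:
    # df2[week] selects that week's seven day columns from the MultiIndex
    assert (df2[week].sum(axis=1) == df_orig[week]).all()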

Grouping By and Referencing Shifted Values

I am trying to track inventory levels of individual items over time, comparing projected outbound and availability. There are times when the projected outbound exceeds the availability, and when that occurs I want Post Available to be 0. I am trying to create the Pre Available and Post Available columns below:
Item Week Inbound Outbound Pre Available Post Available
A 1 500 200 500 300
A 2 0 400 300 0
A 3 100 0 100 100
B 1 50 50 50 0
B 2 0 80 0 0
B 3 0 20 0 0
B 4 20 20 20 0
I have tried the below code:
def custsum(x):
    total = 0
    for i, v in x.iterrows():
        total += df['Inbound'] - df['Outbound']
        x.loc[i, 'Post Available'] = total
        if total < 0:
            total = 0
    return x

df.groupby('Item').apply(custsum)
But I receive the below error message:
ValueError: Incompatible indexer with Series
I am a relative novice to Python so any help would be appreciated.
Thank you!
You could use
import numpy as np
import pandas as pd

df = pd.DataFrame({'Inbound': [500, 0, 100, 50, 0, 0, 20],
                   'Item': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Outbound': [200, 400, 0, 50, 80, 20, 20],
                   'Week': [1, 2, 3, 1, 2, 3, 4]})
df = df[['Item', 'Week', 'Inbound', 'Outbound']]

def custsum(x):
    total = 0
    for i, v in x.iterrows():
        total += x.loc[i, 'Inbound'] - x.loc[i, 'Outbound']
        if total < 0:
            total = 0
        x.loc[i, 'Post Available'] = total
    x['Pre Available'] = x['Post Available'].shift(1).fillna(0) + x['Inbound']
    return x

result = df.groupby('Item').apply(custsum)
result = result[['Item', 'Week', 'Inbound', 'Outbound', 'Pre Available', 'Post Available']]
print(result)
which yields
Item Week Inbound Outbound Pre Available Post Available
0 A 1 500 200 500.0 300.0
1 A 2 0 400 300.0 0.0
2 A 3 100 0 100.0 100.0
3 B 1 50 50 50.0 0.0
4 B 2 0 80 0.0 0.0
5 B 3 0 20 0.0 0.0
6 B 4 20 20 20.0 0.0
The main difference between this code and the code you posted is:
total += x.loc[i, 'Inbound'] - x.loc[i, 'Outbound']
x.loc is used to select the numeric value in the row indexed by i and in
the Inbound or Outbound column. So the difference is numeric and total
remains numeric. In contrast,
total += df['Inbound'] - df['Outbound']
adds an entire Series to total. That leads to the ValueError later. (See below for more on why that occurs).
The conditional
if total < 0:
total = 0
was moved above x.loc[i, 'Post Available'] = total to ensure that Post
Available is always non-negative.
If you didn't need this conditional, then the entire for-loop could be replaced by
x['Post Available'] = (x['Inbound'] - x['Outbound']).cumsum()
And since column-wise arithmetic and cumsum are vectorized operations, the calculation would be performed much more quickly.
Unfortunately, the conditional prevents us from eliminating the for-loop and vectorizing the calculation.
In your original code, the error
ValueError: Incompatible indexer with Series
occurs on this line
x.loc[i, 'Post Available'] = total
because total is (sometimes) a Series not a simple numeric value. Pandas is
attempting to align the Series on the right-hand side with the indexer, (i, 'Post Available'), on the left-hand side. The indexer (i, 'Post Available') gets
converted to a tuple like (0, 4), since Post Available is the column at
index 4. But (0, 4) is not an appropriate index for the 1-dimensional Series
on the right-hand side.
You can confirm total is a Series by putting print(total) inside your for-loop,
or by noting that the right-hand side of
total += df['Inbound'] - df['Outbound']
is a Series.
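A minimal reproduction of that error (a sketch; on recent pandas versions the exact exception message may differ):
import pandas as pd

x = pd.DataFrame({'a': [1, 2]})
s = pd.Series([10, 20, 30])
# assigning a whole Series through a single-cell indexer:
x.loc[0, 'b'] = s  # ValueError: Incompatible indexer with Series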
There is no need for a self-defined function; you can use groupby + shift to create PreAvailable, and clip (setting the lower boundary to 0) for PostAvailable:
df['PostAvailable'] = (df.Inbound - df.Outbound).clip(lower=0)
df['PreAvailable'] = df.groupby('Item').apply(lambda x: x['Inbound'].add(x['PostAvailable'].shift(), fill_value=0)).values
df
Out[213]:
Item Week Inbound Outbound PreAvailable PostAvailable
0 A 1 500 200 500.0 300
1 A 2 0 400 300.0 0
2 A 3 100 0 100.0 100
3 B 1 50 50 50.0 0
4 B 2 0 80 0.0 0
5 B 3 0 20 0.0 0
6 B 4 20 20 20.0 0
