Grouping By and Referencing Shifted Values - python

I am trying to track inventory levels of individual items over time
comparing projected outbound and availability. There are times in
which the projected outbound exceed the availability and when that
occurs I want the Post Available to be 0. I am trying to create the
Pre Available and Post Available columns below:
Item  Week  Inbound  Outbound  Pre Available  Post Available
A     1     500      200       500            300
A     2     0        400       300            0
A     3     100      0         100            100
B     1     50       50        50             0
B     2     0        80        0              0
B     3     0        20        0              0
B     4     20       20        20             0
I have tried the below code:
def custsum(x):
    total = 0
    for i, v in x.iterrows():
        total += df['Inbound'] - df['Outbound']
        x.loc[i, 'Post Available'] = total
        if total < 0:
            total = 0
    return x

df.groupby('Item').apply(custsum)
But I receive the below error message:
ValueError: Incompatible indexer with Series
I am a relative novice to Python so any help would be appreciated.
Thank you!

You could use
import numpy as np
import pandas as pd

df = pd.DataFrame({'Inbound': [500, 0, 100, 50, 0, 0, 20],
                   'Item': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Outbound': [200, 400, 0, 50, 80, 20, 20],
                   'Week': [1, 2, 3, 1, 2, 3, 4]})
df = df[['Item', 'Week', 'Inbound', 'Outbound']]

def custsum(x):
    total = 0
    for i, v in x.iterrows():
        total += x.loc[i, 'Inbound'] - x.loc[i, 'Outbound']
        if total < 0:
            total = 0
        x.loc[i, 'Post Available'] = total
    x['Pre Available'] = x['Post Available'].shift(1).fillna(0) + x['Inbound']
    return x

result = df.groupby('Item').apply(custsum)
result = result[['Item', 'Week', 'Inbound', 'Outbound', 'Pre Available', 'Post Available']]
print(result)
which yields
  Item  Week  Inbound  Outbound  Pre Available  Post Available
0    A     1      500       200          500.0           300.0
1    A     2        0       400          300.0             0.0
2    A     3      100         0          100.0           100.0
3    B     1       50        50           50.0             0.0
4    B     2        0        80            0.0             0.0
5    B     3        0        20            0.0             0.0
6    B     4       20        20           20.0             0.0
The main difference between this code and the code you posted is:
total += x.loc[i, 'Inbound'] - x.loc[i, 'Outbound']
x.loc is used to select the numeric value in the row indexed by i and in
the Inbound or Outbound column. So the difference is numeric and total
remains numeric. In contrast,
total += df['Inbound'] - df['Outbound']
adds an entire Series to total. That leads to the ValueError later. (See below for more on why that occurs).
The conditional
if total < 0:
total = 0
was moved above x.loc[i, 'Post Available'] = total to ensure that Post
Available is always non-negative.
If you didn't need this conditional, then the entire for-loop could be replaced by
x['Post Available'] = (x['Inbound'] - x['Outbound']).cumsum()
And since column-wise arithmetic and cumsum are vectorized operations, the calculation could be performed much more quickly.
Unfortunately, the conditional prevents us from eliminating the for-loop and vectorizing the calculation.
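As an illustration of that vectorized alternative (ignoring the clamp-at-zero requirement), the per-item running balance can be computed in one shot; a minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Inbound': [500, 0, 100, 50, 0, 0, 20],
                   'Outbound': [200, 400, 0, 50, 80, 20, 20]})

# net change per week, accumulated within each Item (no clamping at zero)
net = df['Inbound'] - df['Outbound']
df['Post Available'] = net.groupby(df['Item']).cumsum()
print(df['Post Available'].tolist())  # [300, -100, 0, 0, -80, -100, -100]
```

The negative values in the result show exactly why the conditional matters for this data.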
In your original code, the error
ValueError: Incompatible indexer with Series
occurs on this line
x.loc[i, 'Post Available'] = total
because total is (sometimes) a Series not a simple numeric value. Pandas is
attempting to align the Series on the right-hand side with the indexer, (i, 'Post Available'), on the left-hand side. The indexer (i, 'Post Available') gets
converted to a tuple like (0, 4), since Post Available is the column at
index 4. But (0, 4) is not an appropriate index for the 1-dimensional Series
on the right-hand side.
You can confirm that total is a Series by putting print(total) inside your for-loop,
or by noting that the right-hand side of
total += df['Inbound'] - df['Outbound']
is a Series.
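To see this concretely, a tiny sketch of what happens on the first loop iteration of the original code:

```python
import pandas as pd

df = pd.DataFrame({'Inbound': [500, 0], 'Outbound': [200, 400]})

total = 0
total += df['Inbound'] - df['Outbound']  # adds a whole Series to the scalar

# total is now a Series, not a number; assigning it through
# x.loc[i, 'Post Available'] is what raises
# "ValueError: Incompatible indexer with Series"
print(type(total))
```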

There is no need for a self-defined function; you can use groupby + shift to create PreAvailable, and clip (setting the lower boundary to 0) for PostAvailable:
df['PostAvailable'] = (df.Inbound - df.Outbound).clip(lower=0)
df['PreAvailable'] = df.groupby('item').apply(lambda x: x['Inbound'].add(x['PostAvailable'].shift(), fill_value=0)).values
df
Out[213]:
  item  Week  Inbound  Outbound  PreAvailable  PostAvailable
0    A     1      500       200         500.0            300
1    A     2        0       400         300.0              0
2    A     3      100         0         100.0            100
3    B     1       50        50          50.0              0
4    B     2        0        80           0.0              0
5    B     3        0        20           0.0              0
6    B     4       20        20          20.0              0

Related

Pandas: Set values in all rows that are between two rows in a dataframe?

Say we have a pandas dataframe like the below:
df = pd.DataFrame({"Basis": [300, 1500, 400, 260, 50, -10], "Weights": [0, -1, 0, 0, 0, 0]})
print(df)
   Basis  Weights
0    300        0
1   1500       -1
2    400        0
3    260        0
4     50        0
5    -10        0
So I found out how I can set the value of a column within a row based on the value in another column of that same row. So in this dataframe I can set all weights to -1 where Basis > 1000:
df.loc[df['Basis'] > 1000, 'Weights'] = -1
What I want to be able to do is: in a large df of this format, take all the rows between a row with a weight of -1 and a later row where Basis <= 0, and set their weight value to -1 (so in this case, I want to set rows 1-4's weight values to -1). I have to work out how to do this without looping through the entire dataframe, as I have to work with a very large dataset.
The desired output would be:
   Basis  Weights
0    300        0
1   1500       -1
2    400       -1
3    260       -1
4     50       -1
5    -10        0
Is there an elegant way to do this that avoids looping through the entire df? I.e. some quick way of implementing the condition that a weight equals the previous weight if Basis >= 0?
If you only have >0 or -1 values in Weights, you can set up groups starting at negative Basis and get the cummin Weight:
group = df['Basis'].lt(0).cumsum()
df['Weights'] = df.groupby(group)['Weights'].cummin()
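Putting that first approach together as a self-contained sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Basis": [300, 1500, 400, 260, 50, -10],
                   "Weights": [0, -1, 0, 0, 0, 0]})

# each row with negative Basis starts a new group (including that row)
group = df['Basis'].lt(0).cumsum()
# within a group, once a -1 appears, every later row's cummin is -1
df['Weights'] = df.groupby(group)['Weights'].cummin()
print(df['Weights'].tolist())  # [0, -1, -1, -1, -1, 0]
```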
If you have arbitrary values, this is a bit more complex, you first need to mask the non -1 values, ffill per group, then restore the other values:
group = df['Basis'].lt(0).cumsum()
df['Weights'] = (df['Weights']
                 .where(df['Weights'].eq(-1))
                 .groupby(group).ffill()
                 .fillna(df['Weights'], downcast='infer')
                 )
output:
   Basis  Weights
0    300        0
1   1500       -1
2    400       -1
3    260       -1
4     50       -1
5    -10        0
You can replace the 0s with nan and then fillna -
df.loc[(df['Basis'] > 0) & (df['Weights'] >= 0), 'Weights'] = np.nan
df = df.fillna(method='ffill').fillna(0)
Output
   Basis  Weights
0    300      0.0
1   1500     -1.0
2    400     -1.0
3    260     -1.0
4     50     -1.0
5    -10      0.0
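Note that fillna(method='ffill') is deprecated in recent pandas; a runnable sketch of the same mask-then-forward-fill idea using the plain ffill method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Basis": [300, 1500, 400, 260, 50, -10],
                   "Weights": [0, -1, 0, 0, 0, 0]})

# blank out the weights that should inherit from the row above
df.loc[(df['Basis'] > 0) & (df['Weights'] >= 0), 'Weights'] = np.nan
# carry the last seen weight forward, then default any leading gap to 0
df['Weights'] = df['Weights'].ffill().fillna(0)
print(df['Weights'].tolist())  # [0.0, -1.0, -1.0, -1.0, -1.0, 0.0]
```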

How to efficiently combine dataframe rows based on conditions?

I have the following dataset, which contains a column with the cluster number, the number of observations in that cluster and the maximum value of another variable x grouped by that cluster.
clust = np.arange(0, 10)
obs = np.array([1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1])
x_max = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
df = pd.DataFrame(np.c_[clust, obs, x_max], columns=['clust', 'obs', 'x_max'])
   clust    obs  x_max
0      0   1041     10
1      1    544     20
2      2    310     30
3      3   1648     40
4      4   1862     50
5      5   2120     60
6      6   2916     70
7      7   5148     80
8      8  12733     90
9      9      1    100
My task is to combine the clust row values with adjacent rows, so that each cluster contains at least 1000 observations.
My current attempt gets stuck in an infinite loop because the last cluster has only 1 observation.
condition = True
while condition:
    condition = False
    for i in np.arange(0, len(df) + 1):
        if df.loc[i, 'x'] < 1000:
            df.loc[i, 'id'] = df.loc[i, 'id'] + 1
            df = df.groupby('id', as_index=False).agg({'x': 'sum', 'y': 'max'})
            condition = True
            break
Is there perhaps a more efficient way of doing this? I come from a background in SAS, where such situations would be solved with the if last.row condition, but it seems there is no such condition in Python.
The resulting table should look like this
clust    obs  x_max
    0   1041     10
    1   2502     40
    2   1862     50
    3   2120     60
    4   2916     70
    5   5148     80
    6  12734    100
Here is another way. A vectorized way is difficult to implement here, but using a for loop over an array (or a list) will be faster than using loc at each iteration. Also, it is not good practice to change df within the loop; it can only bring problems.
# define variables
s = 0    # for the sum of observations
gr = []  # for the final grouping values
i = 0    # for the group indices

# loop over observations from an array
for obs in df['obs'].to_numpy():
    s += obs
    gr.append(i)
    # check that the size of the group is big enough
    if s > 1000:
        s = 0
        i += 1

# condition to deal with last rows if last group not big enough
if s != 0:
    gr = [i-1 if val == i else val for val in gr]

# now create your new df
new_df = (
    df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
      .reset_index().rename(columns={'index': 'cluster'})
)
print(new_df)
#    cluster    obs  x_max
# 0        0   1041     10
# 1        1   2502     40
# 2        2   1862     50
# 3        3   2120     60
# 4        4   2916     70
# 5        5   5148     80
# 6        6  12734    100
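The same loop can be wrapped in a reusable helper; a sketch under the assumption that the threshold should be a parameter (the name merge_small_clusters and the min_obs parameter are my own, not from the answer):

```python
import numpy as np
import pandas as pd

def merge_small_clusters(df, min_obs=1000):
    """Group consecutive rows until each group holds more than min_obs observations."""
    s, i, gr = 0, 0, []
    for obs in df['obs'].to_numpy():
        s += obs
        gr.append(i)
        if s > min_obs:    # group is big enough: start a new one
            s, i = 0, i + 1
    if s != 0:             # fold an undersized trailing group into the previous one
        gr = [i - 1 if val == i else val for val in gr]
    return (df.groupby(gr).agg({'obs': 'sum', 'x_max': 'max'})
              .reset_index().rename(columns={'index': 'clust'}))

df = pd.DataFrame({'clust': np.arange(10),
                   'obs': [1041, 544, 310, 1648, 1862, 2120, 2916, 5148, 12733, 1],
                   'x_max': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
print(merge_small_clusters(df)['obs'].tolist())
# [1041, 2502, 1862, 2120, 2916, 5148, 12734]
```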

Sliding Window and comparing elements of DataFrame to a threshold

Assume I have the following dataframe:
Time  Flag1
   0      0
  10      0
  30      0
  50      1
  70      1
  90      0
 110      0
My goal is: for each row, take the window where Time is less than, let's say, the Time in that row plus 35; if any element of Flag1 in that window is 1, then that row's output should be 1. For example, consider the table above: the first element of Time is 0, and 0 + 35 = 35; in the window of values less than 35 (which is Time = 0, 10, 30) all the Flag1 values are 0, therefore the first row is assigned 0, and so on. The next window is 10 + 35 = 45, which still includes (0, 10, 30), and the flag is still 0. So the complete output is:
Time  Flag1  Output
   0      0       0
  10      0       0
  30      0       1
  50      1       1
  70      1       1
  90      1       1
 110      1       1
To implement this type of problem, I thought I could use two for loops like this:
output = []
for ii in range(Data.shape[0]):
    count = 0
    th = Data.loc[ii, 'Time'] + 35
    for jj in range(ii, Data.shape[0]):
        if Data.loc[jj, 'Time'] < th and Data.loc[jj, 'Flag1'] == 1:
            count = 1
            break
    output.append(count)
However, this looks tedious, since the inner for loop continues over the entire length of the data. Also, I am not sure this method handles the boundary cases for out-of-bound indices when reaching the end of the dataframe. I would appreciate it if someone could suggest something easier than this. This is like a sliding-window operation, only comparing a number to a threshold.
Edit: I do not want to compare only two consecutive rows. For example, if 30 + 35 = 65, then as long as Time is less than 65, if Flag1 is 1 then the output is 1.
The second example:
Time  Flag1  Output
   0      0       0
  30      0       1
  40      0       1
  60      1       1
  90      1       1
 140      1       1
 200      1       1
 350      1       1
Assuming a window k rows before and k rows after as mentioned in my comment:
import pandas as pd

Data = pd.DataFrame([[0, 0], [10, 0], [30, 0], [50, 1], [70, 1], [90, 1], [110, 1]],
                    columns=['Time', 'Flag1'])

k = 1  # size of window: up to k rows before and up to k rows after
n = len(Data)
output = [0]*n
for i in range(n):
    th = Data['Time'][i] + 35
    j0 = max(0, i - k)
    j1 = min(i + k + 1, n)  # the +1 is because range is non-inclusive of end
    output[i] = int(any((Data['Time'][j0:j1] < th) & (Data['Flag1'][j0:j1] > 0)))
Data['output'] = output
print(Data)
gives the same output as the original example. And you can change the size of the window by modifying k.
Of course, if the idea is to check any row afterward, then just use j1 = n in my example.
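For completeness, a sketch of that look-ahead variant (scanning the current row and every later row, i.e. j1 = n), applied to the Flag1 values from the question's first input table:

```python
import pandas as pd

Data = pd.DataFrame([[0, 0], [10, 0], [30, 0], [50, 1], [70, 1], [90, 0], [110, 0]],
                    columns=['Time', 'Flag1'])

n = len(Data)
output = [0] * n
for i in range(n):
    th = Data['Time'].iloc[i] + 35
    # scan the current row and every row after it
    window_hit = (Data['Time'].iloc[i:] < th) & (Data['Flag1'].iloc[i:] > 0)
    output[i] = int(window_hit.any())
Data['output'] = output
print(output)  # [0, 0, 1, 1, 1, 0, 0]
```

With the input flags as stated in the question (rows 90 and 110 have Flag1 = 0), the last two rows come out 0.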
import pandas as pd

Data = pd.DataFrame([[0, 0], [10, 0], [30, 0], [50, 1], [70, 1], [90, 1], [110, 1]],
                    columns=['Time', 'Flag1'])
output = Data.index.map(lambda x: 1 if any((Data.Time[x+1:] < Data.Time[x] + 35) * (Data.Flag1[x+1:] == 1)) else 0).values
output[-1] = Data.Flag1.values[-1]
Data['output'] = output
print(Data)
# show
Time  Flag1  output
   0      0       0
  30      0       1
  40      0       1
  50      1       1
  70      1       1
  90      1       1
 110      1       1

index counter for if conditions python pandas

I want to generate a sort of cycle counter for my dataFrame. One cycle in the example below has a length of 4. The last column is how it is supposed to look; the rest are attempts on my behalf.
My current code looks like this:
import pandas as pd
import numpy as np

l = list(np.linspace(0, 10, 12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))

length = len(df)
df.loc[0, 'cycle'] = 1
df['cycle'] = length/4 + df.loc[0, 'cycle']
i = 0
for i in range(0, length):
    df.loc[i, 'new_cycle'] = i + 1
df['want_cycle'] = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(length)
print(df)
I do need an if condition in the code, to only increase the value of df['new_cycle'] when the index counter reaches a multiple of 4, for example. But so far I have failed to find a proper way to implement such a condition.
Try this with the default range index. Because your dataframe row index is a range starting at 0 (the default index of a dataframe), you can use floor division to calculate your cycle:
df['cycle'] = df.index//4 + 1
Output:
         time    A    B  cycle
0    0.000000  0.0    0      1
1    0.909091  5.0  300      1
2    1.818182  0.6   20      1
3    2.727273 -4.8 -280      1
4    3.636364 -0.3  -25      2
5    4.545455  4.9  290      2
6    5.454545  0.2   30      2
7    6.363636 -4.7 -270      2
8    7.272727  0.5   40      3
9    8.181818  5.0  300      3
10   9.090909  0.1  -10      3
11  10.000000 -4.6 -260      3
Now, if your dataframe index isn't the default, then you can use something like this:
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
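For example, with a string index (my own toy data, not from the question), the get_loc version still produces positional cycles of 4:

```python
import pandas as pd

# non-default index: labels 'a'..'h' instead of 0..7
df = pd.DataFrame({'A': range(8)}, index=list('abcdefgh'))

# get_loc converts each label to its integer position before the floor division
df['cycle'] = [df.index.get_loc(i) // 4 + 1 for i in df.index]
print(df['cycle'].tolist())  # [1, 1, 1, 1, 2, 2, 2, 2]
```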
I've added just one thing for you: a new variable called new_cycle, which will keep the count you're after.
In the for loop we check whether i is divisible by 4 without a remainder; if it is, we add 1 to the new variable, and fill the data frame with this value the same way you did.
import pandas as pd
import numpy as np

l = list(np.linspace(0, 10, 12))
data = [
    ('time', l),
    ('A', [0, 5, 0.6, -4.8, -0.3, 4.9, 0.2, -4.7, 0.5, 5, 0.1, -4.6]),
    ('B', [0, 300, 20, -280, -25, 290, 30, -270, 40, 300, -10, -260]),
]
df = pd.DataFrame.from_dict(dict(data))

length = len(df)
df.loc[0, 'cycle'] = 1
df['cycle'] = length/4 + df.loc[0, 'cycle']
new_cycle = 0
for i in range(0, length):
    if i % 4 == 0:
        new_cycle += 1
    df.loc[i, 'new_cycle'] = new_cycle
df['want_cycle'] = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print(length)
print(df)

Creating a list based on column conditions

I have a DataFrame df
>>df
   LED  CFL  Incan  Hall  Reading
0    3    2      1   100      150
1    2    3      1   150      100
2    0    1      3   200      150
3    1    2      4   300      250
4    3    3      1   170      100
I want to create two more column which contain lists, one for "Hall" and another for "Reading"
>>df_output
   LED  CFL  Incan  Hall  Reading  Hall_List  Reading_List
0    3    2      1   100      150    [0,2,0]       [2,0,0]
1    2    3      1   150      100    [0,3,0]       [2,0,0]
2    0    1      3   200      150    [0,1,0]       [0,0,2]
3    1    2      4   300      250    [0,2,0]       [1,0,0]
4    3    3      1   100      100    [0,2,0]       [2,0,0]
Each value within the list is populated as follows:
cfl_rating = 50
led_rating = 100
incan_rating = 25
For the Hall_List:
The preference is CFL > LED > Incan. And only one of them will be used (either CFL or LED or Incan).
We first check if CFL != 0; if True, then we calculate min(ceil(Hall/cfl_rating), CFL). For index=0 this evaluates to 2, hence we have [0,2,0], whereas for index=2 we have [0,1,0].
Similarly for Reading_List, the preference is LED > Incan > CFL.
For index=2, we have LED == 0, so we calculate min(ceil(Reading/Incan_rating),Incan) and hence Reading_List is [0,0,2]
My question is:
Is there a "pandas/pythony-way" of doing this? I am currently iterating through each row, and using if-elif-else conditions to assign values.
My code snippet looks like this:
# Hall_List
for i in range(df.shape[0]):
    Hall = []
    if df['CFL'].iloc[i] != 0:
        Hall.append(0)
        Hall.append(min(math.ceil(df['Hall'].iloc[i]/cfl_rating), df['CFL'].iloc[i]))
        Hall.append(0)
    elif df['LED'].iloc[i] != 0:
        Hall.append(min(math.ceil(df['Hall'].iloc[i]/led_rating), df['LED'].iloc[i]))
        Hall.append(0)
        Hall.append(0)
    else:
        Hall.append(0)
        Hall.append(0)
        Hall.append(min(math.ceil(df['Hall'].iloc[i]/incan_rating), df['Incan'].iloc[i]))
    df['Hall_List'].iloc[i] = Hall
This is really slow and definitely feels like a bad way to code this.
I shortened your formula for simplicity's sake, but you should use df.apply(axis=1).
This takes every row and returns an ndarray; then you can apply whatever function you want, such as:
df = pd.DataFrame([[3, 2, 1, 100, 150], [2, 3, 1, 150, 100]],
                  columns=['LED', 'CFL', 'Incan', 'Hall', 'Reading'])

def create_list(ndarray):
    if ndarray[1] != 0:
        result = [0, ndarray[1], 0]
    else:
        result = [ndarray[2], 0, 0]
    return result

df['Hall_List'] = df.apply(lambda x: create_list(x), axis=1)
just change the function to whatever you like here.
In[49]: df
Out[49]:
   LED  CFL  Incan  Hall  Reading  Hall_List
0    3    2      1   100      150  [0, 2, 0]
1    2    3      1   150      100  [0, 3, 0]
hope this helps
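Following that pattern, here is a sketch of the question's full Hall preference logic (CFL > LED > Incan, with the list ordered [LED, CFL, Incan]); the helper name hall_list is my own:

```python
import math
import pandas as pd

cfl_rating, led_rating, incan_rating = 50, 100, 25

def hall_list(row):
    # preference for Hall: CFL first, then LED, then Incan
    if row['CFL'] != 0:
        return [0, min(math.ceil(row['Hall'] / cfl_rating), row['CFL']), 0]
    if row['LED'] != 0:
        return [min(math.ceil(row['Hall'] / led_rating), row['LED']), 0, 0]
    return [0, 0, min(math.ceil(row['Hall'] / incan_rating), row['Incan'])]

df = pd.DataFrame([[3, 2, 1, 100, 150], [2, 3, 1, 150, 100], [0, 1, 3, 200, 150]],
                  columns=['LED', 'CFL', 'Incan', 'Hall', 'Reading'])
df['Hall_List'] = df.apply(hall_list, axis=1)
print(df['Hall_List'].tolist())  # [[0, 2, 0], [0, 3, 0], [0, 1, 0]]
```

A Reading_List version would follow the same shape with the LED > Incan > CFL preference order.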
