Creating a list based on column conditions - python

I have a DataFrame df
>>> df
   LED  CFL  Incan  Hall  Reading
0    3    2      1   100      150
1    2    3      1   150      100
2    0    1      3   200      150
3    1    2      4   300      250
4    3    3      1   170      100
I want to create two more columns containing lists, one for "Hall" and another for "Reading":
>>> df_output
   LED  CFL  Incan  Hall  Reading Hall_List Reading_List
0    3    2      1   100      150   [0,2,0]      [2,0,0]
1    2    3      1   150      100   [0,3,0]      [2,0,0]
2    0    1      3   200      150   [0,1,0]      [0,0,2]
3    1    2      4   300      250   [0,2,0]      [1,0,0]
4    3    3      1   100      100   [0,2,0]      [2,0,0]
Each value within the list is populated as follows:
cfl_rating = 50
led_rating = 100
incan_rating = 25
For the Hall_List:
The preference is CFL > LED > Incan, and only one of them will be used (either CFL or LED or Incan).
We first check whether CFL != 0; if so, the CFL entry is min(ceil(Hall/cfl_rating), CFL). For index 0 this evaluates to 2, hence we have [0,2,0], whereas for index 2 we have [0,1,0].
Similarly, for Reading_List the preference is LED > Incan > CFL.
For index 2 we have LED == 0, so we calculate min(ceil(Reading/incan_rating), Incan), hence Reading_List is [0,0,2].
My question is:
Is there a "pandas/pythony-way" of doing this? I am currently iterating through each row, and using if-elif-else conditions to assign values.
My code snippet looks like this:
# Hall_List
for i in range(df.shape[0]):
    Hall = []
    if df['CFL'].iloc[i] != 0:
        Hall.append(0)
        Hall.append(min(math.ceil(df['Hall'].iloc[i] / cfl_rating), df['CFL'].iloc[i]))
        Hall.append(0)
    elif df['LED'].iloc[i] != 0:
        Hall.append(min(math.ceil(df['Hall'].iloc[i] / led_rating), df['LED'].iloc[i]))
        Hall.append(0)
        Hall.append(0)
    else:
        Hall.append(0)
        Hall.append(0)
        Hall.append(min(math.ceil(df['Hall'].iloc[i] / incan_rating), df['Incan'].iloc[i]))
    df['Hall_List'].iloc[i] = Hall
This is really slow and definitely feels like a bad way to code this.

I shortened your formula for simplicity's sake, but you should use df.apply(axis=1).
This passes every row to a function, so you can apply whatever logic you want, such as:
df = pd.DataFrame([[3, 2, 1, 100, 150], [2, 3, 1, 150, 100]],
                  columns=['LED', 'CFL', 'Incan', 'Hall', 'Reading'])

def create_list(ndarray):
    if ndarray[1] != 0:
        result = [0, ndarray[1], 0]
    else:
        result = [ndarray[2], 0, 0]
    return result

df['Hall_List'] = df.apply(lambda x: create_list(x), axis=1)
Just change the function to whatever you like here.
In [49]: df
Out[49]:
   LED  CFL  Incan  Hall  Reading  Hall_List
0    3    2      1   100      150  [0, 2, 0]
1    2    3      1   150      100  [0, 3, 0]
hope this helps
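For reference, here is a fuller sketch of the row function that implements the actual Hall preference logic from the question (CFL > LED > Incan, list positions [LED, CFL, Incan], ratings as defined in the question); treat it as an illustration rather than a drop-in answer:

import math

cfl_rating, led_rating, incan_rating = 50, 100, 25

def hall_list(row):
    # Preference: CFL > LED > Incan; list positions are [LED, CFL, Incan].
    if row['CFL'] != 0:
        return [0, min(math.ceil(row['Hall'] / cfl_rating), row['CFL']), 0]
    elif row['LED'] != 0:
        return [min(math.ceil(row['Hall'] / led_rating), row['LED']), 0, 0]
    else:
        return [0, 0, min(math.ceil(row['Hall'] / incan_rating), row['Incan'])]

df['Hall_List'] = df.apply(hall_list, axis=1)

A Reading_List version would be analogous, with the checks reordered to LED > Incan > CFL.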


Drop the entire row if a particular sub-row does not fulfill condition

I have a pandas df with subentries. I would like to impose a condition on a particular subentry and, if the condition is not fulfilled, drop the entire entry so I can update the df.
For example, I would like to check subentry 0 of each entry and, if its pt < 120, drop the whole entry.
                 pt
entry subentry
0     0         100
      1         200
      2         300
1     0         200
      1         300
2     0          80
      1         300
      3         400
      4         300
...   ...       ...
So entries 0 and 2 (with all their subentries) should be deleted:
                 pt
entry subentry
1     0         200
      1         300
...   ...       ...
I tried using:
df.loc[(slice(None), 0), :]["pt"]>100
but it creates a new series, and I cannot pass it to the original df because it does not match the entries/subentries. Thank you.
Try this:
# Count the number of invalid `pt` per `entry`
invalid = df['pt'].lt(120).groupby(df['entry']).sum()
# Valid `entry` are those whose `invalid` count is 0
df[df['entry'].isin(invalid[invalid == 0].index)]
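Note that this assumes entry is a regular column. In the question's display, entry appears to be an index level; here is a sketch of the same idea using the index instead (assuming the level is named 'entry'):

# Same idea when `entry` is an index level rather than a column
entry = df.index.get_level_values('entry')
invalid = df['pt'].lt(120).groupby(entry).sum()
df[entry.isin(invalid[invalid == 0].index)]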
One solution is to group by "entry" and then use transform to compute a per-group minimum, which can then be used with loc to index the correct rows:
df = pd.DataFrame({'entry': [0, 0, 1, 1, 2, 2],
                   'subentry': [1, 2, 1, 2, 1, 2],
                   'pt': [100, 300, 200, 300, 80, 300]})
Initial df:
   entry  subentry   pt
0      0         1  100
1      0         2  300
2      1         1  200
3      1         2  300
4      2         1   80
5      2         2  300
Use loc to select only the rows matching the conditional:
df.loc[df.groupby('entry').transform('min')['pt'] >= 120]
Output:
   entry  subentry   pt
2      1         1  200
3      1         2  300
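A small refinement of the same idea (a sketch): restrict the groupby to the pt column, so the minimum is computed only where it is needed.

# Same result; transform only the `pt` column
df.loc[df.groupby('entry')['pt'].transform('min') >= 120]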

Reduce loop time in python

Loops in Python are taking a lot of time to give a result. The DataFrame contains around 100k records, and the loop below takes a long time. How can the time be reduced?
df['loan_agr'] = df['loan_agr'].astype(int)

for i in range(len(df)):
    if df.loc[i, 'order_mt'] == df.loc[i, 'enr_mt']:
        df['new_N_Loan'] = 1
        df['exist_N_Loan'] = 0
        df['new_V_Loan'] = df['loan_agr']
        df['exist_V_Loan'] = 0
    else:
        df['new_N_Loan'] = 0
        df['exist_N_Loan'] = 1
        df['new_V_Loan'] = 0
        df['exist_V_Loan'] = df['loan_agr']
You can use loc and set the new values in a vectorized way. This approach is much faster than iterating because the operations are performed on entire columns at once rather than on individual values.
For example:
mask = df['order_mt'] == df['enr_mt']

df.loc[mask, ['new_N_Loan', 'exist_N_Loan', 'exist_V_Loan']] = [1, 0, 0]
df.loc[mask, 'new_V_Loan'] = df['loan_agr']
df.loc[~mask, ['new_N_Loan', 'exist_N_Loan', 'new_V_Loan']] = [0, 1, 0]
df.loc[~mask, 'exist_V_Loan'] = df['loan_agr']
Edit:
If the ~ (bitwise not) operator is not supported in your version of pandas, you can build an explicit mask for the "else" condition, similar to the first one.
For example:

mask = df['order_mt'] == df['enr_mt']
else_mask = df['order_mt'] != df['enr_mt']

Then use else_mask for the second set of assignments instead of ~mask.
Sample:
Input:
   order_mt  enr_mt new_N_Loan exist_N_Loan exist_V_Loan new_V_Loan  loan_agr
0         1       1       None         None         None       None       100
1         2       2       None         None         None       None       200
2         3      30       None         None         None       None       300
3         4      40       None         None         None       None       400
Output:
   order_mt  enr_mt  new_N_Loan  exist_N_Loan  exist_V_Loan  new_V_Loan  loan_agr
0         1       1           1             0             0         100       100
1         2       2           1             0             0         200       200
2         3      30           0             1           300           0       300
3         4      40           0             1           400           0       400
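Another option (a sketch that assumes the same columns as above) is to build each output column with numpy.where, which expresses the per-row if/else in one vectorized call per column:

import numpy as np

mask = df['order_mt'] == df['enr_mt']

df['new_N_Loan'] = np.where(mask, 1, 0)
df['exist_N_Loan'] = np.where(mask, 0, 1)
df['new_V_Loan'] = np.where(mask, df['loan_agr'], 0)
df['exist_V_Loan'] = np.where(mask, 0, df['loan_agr'])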
As a minor additional tweak, instead of calling len(...) inside range(...), you could compute the length once, store it in a variable, and reuse it.

Grouping By and Referencing Shifted Values

I am trying to track inventory levels of individual items over time, comparing projected outbound and availability. There are times when the projected outbound exceeds the availability, and when that occurs I want the Post Available to be 0. I am trying to create the Pre Available and Post Available columns below:
Item  Week  Inbound  Outbound  Pre Available  Post Available
A        1      500       200            500             300
A        2        0       400            300               0
A        3      100         0            100             100
B        1       50        50             50               0
B        2        0        80              0               0
B        3        0        20              0               0
B        4       20        20             20               0
I have tried the below code:
def custsum(x):
    total = 0
    for i, v in x.iterrows():
        total += df['Inbound'] - df['Outbound']
        x.loc[i, 'Post Available'] = total
        if total < 0:
            total = 0
    return x

df.groupby('Item').apply(custsum)
But I receive the below error message:
ValueError: Incompatible indexer with Series
I am a relative novice to Python so any help would be appreciated.
Thank you!
You could use
import numpy as np
import pandas as pd
df = pd.DataFrame({'Inbound': [500, 0, 100, 50, 0, 0, 20],
                   'Item': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Outbound': [200, 400, 0, 50, 80, 20, 20],
                   'Week': [1, 2, 3, 1, 2, 3, 4]})
df = df[['Item', 'Week', 'Inbound', 'Outbound']]

def custsum(x):
    total = 0
    for i, v in x.iterrows():
        total += x.loc[i, 'Inbound'] - x.loc[i, 'Outbound']
        if total < 0:
            total = 0
        x.loc[i, 'Post Available'] = total
    x['Pre Available'] = x['Post Available'].shift(1).fillna(0) + x['Inbound']
    return x

result = df.groupby('Item').apply(custsum)
result = result[['Item', 'Week', 'Inbound', 'Outbound', 'Pre Available', 'Post Available']]
print(result)
which yields
  Item  Week  Inbound  Outbound  Pre Available  Post Available
0    A     1      500       200          500.0           300.0
1    A     2        0       400          300.0             0.0
2    A     3      100         0          100.0           100.0
3    B     1       50        50           50.0             0.0
4    B     2        0        80            0.0             0.0
5    B     3        0        20            0.0             0.0
6    B     4       20        20           20.0             0.0
The main difference between this code and the code you posted is:

total += x.loc[i, 'Inbound'] - x.loc[i, 'Outbound']

x.loc selects the numeric value in the row indexed by i and in the Inbound or Outbound column, so the difference is numeric and total remains numeric. In contrast,

total += df['Inbound'] - df['Outbound']

adds an entire Series to total. That leads to the ValueError later (see below for more on why that occurs).
The conditional

if total < 0:
    total = 0

was moved above x.loc[i, 'Post Available'] = total to ensure that Post Available is always non-negative.
If you didn't need this conditional, the entire for-loop could be replaced by

x['Post Available'] = (x['Inbound'] - x['Outbound']).cumsum()

and since column-wise arithmetic and cumsum are vectorized operations, the calculation could be performed much more quickly.
Unfortunately, the conditional prevents us from eliminating the for-loop and vectorizing the calculation.
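That said, if the per-group loop itself becomes a bottleneck on large data, one option (an assumption on my part, not part of the original answer, and it requires the numba package) is to compile the clipped running total:

import numba
import numpy as np

@numba.njit
def clipped_running_total(inbound, outbound):
    # Running total of inbound - outbound, reset to zero whenever it goes negative
    out = np.empty(len(inbound))
    total = 0.0
    for i in range(len(inbound)):
        total += inbound[i] - outbound[i]
        if total < 0.0:
            total = 0.0
        out[i] = total
    return out

This could then be applied per group on the underlying numpy arrays, e.g. clipped_running_total(g['Inbound'].to_numpy(), g['Outbound'].to_numpy()) for each group g.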
In your original code, the error

ValueError: Incompatible indexer with Series

occurs on this line

x.loc[i, 'Post Available'] = total

because total is (sometimes) a Series, not a simple numeric value. Pandas attempts to align the Series on the right-hand side with the indexer (i, 'Post Available') on the left-hand side. The indexer gets converted to a tuple such as (0, 4), since Post Available is the column at index 4, but (0, 4) is not an appropriate index for the 1-dimensional Series on the right-hand side.

You can confirm that total is a Series by putting print(total) inside your for-loop, or by noting that the right-hand side of

total += df['Inbound'] - df['Outbound']

is a Series.
There is no need for a self-defined function: you can use groupby + shift to create PreAvailable, and clip (setting the lower boundary to 0) for PostAvailable.
df['PostAvailable'] = (df.Inbound - df.Outbound).clip(lower=0)
df['PreAvailable'] = df.groupby('item').apply(
    lambda x: x['Inbound'].add(x['PostAvailable'].shift(), fill_value=0)).values
df
Out[213]:
  item  Week  Inbound  Outbound  PreAvailable  PostAvailable
0    A     1      500       200         500.0            300
1    A     2        0       400         300.0              0
2    A     3      100         0         100.0            100
3    B     1       50        50          50.0              0
4    B     2        0        80           0.0              0
5    B     3        0        20           0.0              0
6    B     4       20        20          20.0              0

Check several conditions for all values in a column

I have just started using Python and pandas. I have searched Google and Stack Overflow for an answer to my question but haven't been able to find one.
This is what I need to do:
I have a df with several data rows per person (ID) and a variable called response_go, which can be coded 1 or 0 (type int64), such as this one (just much bigger, with 480 rows per person...):
   ID  response_go
0   1            1
1   1            0
2   1            0
3   1            1
4   2            1
5   2            0
6   2            1
7   2            1
Now, I want to check for each ID/person whether the entries in response_go are all coded 0, all coded 1, or neither (the else condition). So far, I have come up with this:
ids = df['ID'].unique()
for id in ids:
    if (df.response_go.all() == 1):
        print "ID:", id, ": 100% Go"
    elif (df.response_go.all() == 0):
        print "ID:", id, ": 100% NoGo"
    else:
        print "ID:", id, ": Mixed Response Pattern"
However, it gives me the following output:
ID: 1 : 100% NoGo
ID: 2 : 100% NoGo
ID: 2 : Mixed Response Pattern
when it should be (as both ones & zeros are included)
ID: 1 : Mixed Response Pattern
ID: 2 : Mixed Response Pattern
I am really sorry if this question has been asked before; when searching, I found nothing that solves this issue. If it has been answered before, please point me to the solution. Thank you, everyone! I really appreciate it!
Sample (with different data) -
df = pd.DataFrame({'ID': [1] * 3 + [2] * 3 + [3] * 3,
                   'response_go': [0, 0, 0, 1, 1, 1, 0, 1, 0]})
df
   ID  response_go
0   1            0
1   1            0
2   1            0
3   2            1
4   2            1
5   2            1
6   3            0
7   3            1
8   3            0
Use groupby + mean -
v = df.groupby('ID').response_go.mean()
v
ID
1 0.000000
2 1.000000
3 0.333333
Name: response_go, dtype: float64
Use np.select to compute your statuses based on the mean of response_go -
u = np.select([v == 1, v == 0, v < 1], ['100% Go', '100% NoGo', 'Mixed Response Pattern'])
Or, use a nested np.where to do the same thing (slightly faster) -
u = np.where(v == 1, '100% Go', np.where(v == 0, '100% NoGo', 'Mixed Response Pattern'))
Now, assign the result back -
v[:] = u
v
ID
1 100% NoGo
2 100% Go
3 Mixed Response Pattern
Name: response_go, dtype: object
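If you prefer to test the all-0/all-1 conditions directly instead of going through the mean, here is a sketch using groupby + agg (same column names as above):

# Classify each ID by testing the values directly
status = df.groupby('ID')['response_go'].agg(
    lambda s: '100% Go' if s.eq(1).all()
    else '100% NoGo' if s.eq(0).all()
    else 'Mixed Response Pattern')

This trades the vectorized np.select for a per-group Python lambda, so it is typically slower over many groups, but it reads closer to the original if/elif/else.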

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So the resulting array should tell me how many consecutive values fulfilled the condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)

StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter will run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, but I cannot work out the reset part. Is anyone aware of a more elegant numpy-style solution to the above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0

>>> np.where(
        s > threshold,
        s
        .to_frame()          # Convert series to dataframe.
        .assign(_dummy_=1)   # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)  # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
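The same shift-cumsum idea can be written more compactly (a sketch; m is the boolean mask of values above the threshold):

# Runs of consecutive True values share a group id; cumsum counts within each run
m = s > threshold
m.astype(int).groupby((~m).cumsum()).cumsum()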
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting is handled by assigning an appropriate negative value at the end of each interval, so that the cumsum comes back to zero there.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices of each above-threshold interval
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do fair benchmarking is to take the posted sample from the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    indices = np.flatnonzero(np.diff(mask)) + 1
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
