Incrementing a counter under a condition in pandas - python

For the following pandas dataframe
servo_in_position second_servo_in_position Expected output
0 0 1 0
1 0 1 0
2 1 2 1
3 0 3 0
4 1 4 2
5 1 4 2
6 0 5 0
7 0 5 0
8 1 6 3
9 0 7 0
10 1 8 4
11 0 9 0
12 1 10 5
13 1 10 5
14 1 10 5
15 0 11 0
16 0 11 0
17 0 11 0
18 1 12 6
19 1 12 6
20 0 13 0
21 0 13 0
22 0 13 0
I want to increment the column "Expected output" only when "servo_in_position" changes from 0 to 1, and I want "Expected output" to be 0 (null) whenever "servo_in_position" equals 0.
I tried
input_data['second_servo_in_position'] = (input_data.servo_in_position.diff() != 0).cumsum()
but it gives the output shown in the "second_servo_in_position" column, which is not what I wanted.
After that I would like to group by this counter and calculate the mean using:
print("Mean=\n\n",input_data.groupby('second_servo_in_position').mean())

Using cumsum and arithmetic.
u = df['servo_in_position']
(u.eq(1) & u.shift().ne(1)).cumsum() * u
0 0
1 0
2 1
3 0
4 2
5 2
6 0
7 0
8 3
9 0
10 4
11 0
12 5
13 5
14 5
15 0
16 0
17 0
18 6
19 6
20 0
21 0
22 0
Name: servo_in_position, dtype: int64
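To unpack why this works (a commented sketch of the same expression): the comparison marks each rising edge, cumsum numbers the edges, and multiplying by u zeroes the out-of-position rows.
u = df['servo_in_position']
rising = u.eq(1) & u.shift().ne(1)  # True exactly at each 0 -> 1 transition
                                    # (shift() is NaN at row 0, so a leading 1 also counts)
counter = rising.cumsum()           # running count of transitions seen so far
result = counter * u                # zero wherever the servo is out of position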

Use cumsum and mask:
df['E_output'] = df['servo_in_position'].diff().eq(1).cumsum()\
.mask(df['servo_in_position'] == 0, 0)
Output:
servo_in_position second_servo_in_position Expected output E_output
0 0 1 0 0
1 0 1 0 0
2 1 2 1 1
3 0 3 0 0
4 1 4 2 2
5 1 4 2 2
6 0 5 0 0
7 0 5 0 0
8 1 6 3 3
9 0 7 0 0
10 1 8 4 4
11 0 9 0 0
12 1 10 5 5
13 1 10 5 5
14 1 10 5 5
15 0 11 0 0
16 0 11 0 0
17 0 11 0 0
18 1 12 6 6
19 1 12 6 6
20 0 13 0 0
21 0 13 0 0
22 0 13 0 0
Update for the case where the first position already equals 1:
df['servo_in_position'].diff().fillna(df['servo_in_position']).eq(1).cumsum()\
.mask(df['servo_in_position'] == 0, 0)
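The fillna matters because diff() leaves NaN in the first row, so a series that starts at 1 would otherwise miss its first block. A minimal check on a toy series (the series s is mine, not from the question):
s = pd.Series([1, 1, 0, 1])
s.diff().eq(1).cumsum().mask(s == 0, 0)            # -> 0 0 0 1, leading block missed
s.diff().fillna(s).eq(1).cumsum().mask(s == 0, 0)  # -> 1 1 0 2, leading block counted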

Try np.where:
import numpy as np

df['Expected_output'] = np.where(df.servo_in_position.eq(1),
                                 df.servo_in_position.diff().eq(1).cumsum(),
                                 0)

The same idea with cumsum and mul:
df.servo_in_position.diff().eq(1).cumsum().mul(df.servo_in_position.eq(1), axis=0)

Fast with Numba
import numpy as np
from numba import njit

@njit
def f(u):
    out = np.zeros(len(u), np.int64)
    a = out[0] = u[0]
    for i in range(1, len(u)):
        if u[i] == 1:
            if u[i - 1] == 0:
                a += 1
            out[i] = a
    return out
f(df.servo_in_position.to_numpy())
array([0, 0, 1, 0, 2, 2, 0, 0, 3, 0, 4, 0, 5, 5, 5, 0, 0, 0, 6, 6, 0, 0, 0])
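As a rough usage sketch (the input below is hypothetical; timings depend on machine and data), the function compiles on its first call and runs at native speed afterwards:
u = np.random.randint(0, 2, size=1_000_000)  # hypothetical large input
out = f(u)  # first call triggers JIT compilation; subsequent calls are fast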

Related

Pandas add column on condition: If value of cell is True set value of largest number in Period to true

I have a pandas dataframe with lets say two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the maximum value in the last period rows.
For example, boolean == 1 in row 2 (rows counted from 1 here; value = 5). I look at the last period rows: the values are [1, 5] and 5 is the maximum, so new_boolean for that row will be 1.
Second example: row 8 (value = 7): I get the values [7, 4, 12, 9]; 12 is the maximum, so new_boolean in the row with value 12 will be 1.
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    return values.idxmax()

rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
    .apply(idxmax)
    .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
array([ 1,  4,  5, 10])
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
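Worth spelling out why this works: with the default raw=False, rolling.apply hands each window to the function as a Series that keeps its original index, so idxmax returns row labels rather than positions. The same computation as a one-liner (a sketch):
df.rolling(window=4, min_periods=1)['value'].apply(lambda s: s.idxmax()).astype(int)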
I did this in 2 steps, but I think the solution is much clearer:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''), delim_whitespace=True, index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
Result:
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0

Pandas - changing rows where less than n subsequent values are equal

I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set all the rows equal to zero where less than four 1's appear "in a row", i.e. I would like to have the following resulting DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...
Try with groupby and where:
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
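For intuition, the grouper increments at every value change, so each run of equal values gets its own id, and transform('sum') broadcasts each run's total count of 1s back onto its rows. A commented sketch of the same steps:
run_id = df["col"].ne(df["col"].shift()).cumsum()  # new id at every value change
# run_id: 1 1 2 2 2 2 3 3 4 4 5 5 6 6 6 7 8 8 8 8 9 9 9
streaks = df.groupby(run_id).transform("sum")      # per-run total of 1s, broadcast back
df.where(streaks.ge(4), 0)                         # keep only runs with at least four 1s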
We can do
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col'] < 5, 'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
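The threshold is 5 rather than 4 because each group formed by df.col.eq(0).cumsum() starts at a zero row, so a run of four 1s sits in a group of size five. A quick way to see the group sizes (a sketch):
gid = df.col.eq(0).cumsum()     # every 0 opens a new group; the 1s that follow join it
df.groupby(gid)['col'].count()  # a run of four 1s appears as a group of size 5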

When one column cell has a value zero, make the value in another column zero and cells below it zero

I'm reading temperature data from a sensor that is cycling on and off into a dataframe df. Each time the sensor turns on, it takes roughly 5 rows of data to thermally equilibrate. I want to ignore the depressed temperature values from the sensor's warm-up time in any statistics run on the temperature column, and also ignore them when plotting. The three columns in the dataframe are 'Seconds', 'Sensor_State', and 'Temperature'. I have created a fourth column called 'Sensor_Warmup_State' with a loop; it sets the value to 0 for the 5 cells following any 0 detected in the 'Sensor_State' column. Then I multiply 'Temperature' by 'Sensor_Warmup_State' to get 'Processed_Temp'. This works, but I know there should be a more pythonic, faster way to do this; I just don't have the expertise yet.
Here's what I have. To create the dataframe:
import numpy as np
import pandas as pd

a = np.arange(1, 21).tolist()
b = np.zeros(2, dtype=int).tolist()
c = np.ones(18, dtype=int).tolist()
d = b + c
e = [0,0,1,2,4,8,9,10,10,10,10,10,10,10,10,10,10,10,10,10]
data = {'Seconds': a, 'Sensor_State': d, 'Temperature': e}
df = pd.DataFrame.from_dict(data)
df['Sensor_Warmup_State'] = 0
df
To create the final two columns:
NumOfRows = df['Sensor_State'].size
x = 0
for index, value in df['Sensor_State'].iteritems():
    if (value == 0) & (index < NumOfRows - 5):
        df['Sensor_Warmup_State'].iloc[index] = 0
    elif (value == 1) & (index < NumOfRows - 5):
        df.loc[(index + 5), 'Sensor_Warmup_State'] = 1
df['Processed_Temp'] = df['Sensor_Warmup_State'] * df['Temperature']
df
OP here. I figured out a better way by using .shift(); this is much simpler and 30% faster than looping through as initially outlined. I edited the starting dataframe to cover the case where Sensor_State goes from 0 to 1, back to 0, and to 1 again. Hope this helps someone:
In [1]:
import numpy as np
import pandas as pd
a=np.arange(1,24).tolist()
b=[0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
c = [0,0,1,2,4,8,9,10,10,10,0,0,0,0,0,2,4,8,10,10,10,10,10]
data = {'Seconds': a, 'Sensor_State': b, 'Temperature': c}
df = pd.DataFrame.from_dict(data)
df['Sensor_Warmup_State'] = 0
df
Out[1]:
Seconds Sensor_State Temperature Sensor_Warmup_State
0 1 0 0 0
1 2 0 0 0
2 3 1 1 0
3 4 1 2 0
4 5 1 4 0
5 6 1 8 0
6 7 1 9 0
7 8 1 10 0
8 9 1 10 0
9 10 1 10 0
10 11 0 0 0
11 12 0 0 0
12 13 0 0 0
13 14 0 0 0
14 15 0 0 0
15 16 1 2 0
16 17 1 4 0
17 18 1 8 0
18 19 1 10 0
19 20 1 10 0
20 21 1 10 0
21 22 1 10 0
22 23 1 10 0
The new code:
In [2]:
df['Sensor_Warmup_State'] = (df['Sensor_State'] == 1) &\
(df['Sensor_State'].shift(1) == 1) &\
(df['Sensor_State'].shift(2) == 1) &\
(df['Sensor_State'].shift(3) == 1) &\
(df['Sensor_State'].shift(4) == 1) &\
(df['Sensor_State'].shift(5) == 1)
df['Processed_Temp'] = df['Sensor_Warmup_State'] * df['Temperature']
df
Out[2]:
Seconds Sensor_State Temperature Sensor_Warmup_State Processed_Temp
0 1 0 0 False 0
1 2 0 0 False 0
2 3 1 1 False 0
3 4 1 2 False 0
4 5 1 4 False 0
5 6 1 8 False 0
6 7 1 9 False 0
7 8 1 10 True 10
8 9 1 10 True 10
9 10 1 10 True 10
10 11 0 0 False 0
11 12 0 0 False 0
12 13 0 0 False 0
13 14 0 0 False 0
14 15 0 0 False 0
15 16 1 2 False 0
16 17 1 4 False 0
17 18 1 8 False 0
18 19 1 10 False 0
19 20 1 10 False 0
20 21 1 10 True 10
21 22 1 10 True 10
22 23 1 10 True 10
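The six chained shifts can also be written more generally with a rolling minimum: the warm-up condition holds when the current row and the preceding rows are all 1. A sketch, assuming the same warm-up window of w = 6 rows (current row plus the 5 before it):
w = 6
warm = df['Sensor_State'].rolling(window=w).min().eq(1)  # True only when all w values are 1
df['Sensor_Warmup_State'] = warm
df['Processed_Temp'] = warm * df['Temperature']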

Select column data with condition and move it to new column

I have a dataframe that looks like the one below.
T$QOOR
3
14
12
-6
-19
9
I want to move the positive and negative values into separate new columns.
sls_item['SALES'] = sls_item['T$QOOR'].apply(lambda x: x if x >= 0 else 0)
sls_item['RETURN'] = sls_item['T$QOOR'].apply(lambda x: x*-1 if x < 0 else 0)
The result will be as below.
T$QOOR SALES RETURN
3 3 0
14 14 0
12 12 0
-6 0 6
-19 0 19
9 9 0
Any better and cleaner way to do so other than using apply?
Solution with clip_lower and
clip_upper; mul is used to multiply by -1:
sls_item['SALES'] = sls_item['T$QOOR'].clip_lower(0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip_upper(0).mul(-1)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
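Note that clip_lower and clip_upper were removed in pandas 1.0; the same idea with the surviving clip method (a sketch) is:
sls_item['SALES'] = sls_item['T$QOOR'].clip(lower=0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip(upper=0).mul(-1)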
Use where or numpy.where:
sls_item['SALES'] = sls_item['T$QOOR'].where(lambda x: x >= 0, 0)
sls_item['RETURN'] = sls_item['T$QOOR'].where(lambda x: x < 0, 0) * -1
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
mask = sls_item['T$QOOR'] >=0
sls_item['SALES'] = np.where(mask, sls_item['T$QOOR'], 0)
sls_item['RETURN'] = np.where(~mask, sls_item['T$QOOR'] * -1, 0)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
assign + where
df.assign(po=df.where(df['T$QOOR'] > 0, 0), ne=df.where(df['T$QOOR'] < 0, 0))
Out[1355]:
T$QOOR ne po
0 3 0 3
1 14 0 14
2 12 0 12
3 -6 -6 0
4 -19 -19 0
5 9 0 9

Leave blocks of 1 of size >= k in Pandas data frame

I need to keep blocks of '1's of length >= k; all other blocks of '1's should be turned into zeros. For example, k=2:
df=
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
where column a is the original sequence and column b is the desired output.
z = df.a.eq(0)              # True on the zero rows
g = z.cumsum().mask(z, -1)  # label each run of 1s; all zero rows collapse into group -1
k = 2
df['b'] = df.a.groupby(g).transform('size').ge(k).mask(z, 0)  # keep runs of size >= k
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
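The cutoff is fully parameterized by k. For example (a sketch reusing z and g from above), with k = 3 only the run of three 1s at rows 9-11 would survive:
k = 3
df['b'] = df.a.groupby(g).transform('size').ge(k).mask(z, 0).astype(int)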
