Select column data with condition and move it to new column - python

I have a dataframe that looks like the one below.
T$QOOR
3
14
12
-6
-19
9
I want to split the positive and negative values into separate new columns.
sls_item['SALES'] = sls_item['T$QOOR'].apply(lambda x: x if x >= 0 else 0)
sls_item['RETURN'] = sls_item['T$QOOR'].apply(lambda x: x*-1 if x < 0 else 0)
The result should be as below.
T$QOOR SALES RETURN
3 3 0
14 14 0
12 12 0
-6 0 6
-19 0 19
9 9 0
Is there a better and cleaner way to do this than using apply?

Solution with clip (clip_lower and clip_upper did the same thing but were removed in pandas 1.0), plus mul to multiply by -1:
sls_item['SALES'] = sls_item['T$QOOR'].clip(lower=0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip(upper=0).mul(-1)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
Use where or numpy.where:
sls_item['SALES'] = sls_item['T$QOOR'].where(lambda x: x >= 0, 0)
sls_item['RETURN'] = sls_item['T$QOOR'].where(lambda x: x < 0, 0) * -1
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
mask = sls_item['T$QOOR'] >=0
sls_item['SALES'] = np.where(mask, sls_item['T$QOOR'], 0)
sls_item['RETURN'] = np.where(~mask, sls_item['T$QOOR'] * -1, 0)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0

assign + where (note this keeps the negative sign in ne rather than flipping it like RETURN above):
df.assign(po=df.where(df['T$QOOR']>0,0), ne=df.where(df['T$QOOR']<0,0))
Out[1355]:
T$QOOR ne po
0 3 0 3
1 14 0 14
2 12 0 12
3 -6 -6 0
4 -19 -19 0
5 9 0 9
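For reference, a minimal runnable setup for the examples above (the frame construction is an assumption; only the T$QOOR values come from the question):

import pandas as pd

sls_item = pd.DataFrame({'T$QOOR': [3, 14, 12, -6, -19, 9]})

# vectorized split: clip at zero from each direction
sls_item['SALES'] = sls_item['T$QOOR'].clip(lower=0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip(upper=0).mul(-1)
print (sls_item)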

Related

Pandas - changing rows where less than n subsequent values are equal

I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set all the rows equal to zero where less than four 1's appear "in a row", i.e. I would like to have the following resulting DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...
Try with groupby and where:
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
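For intuition, a quick sketch of the intermediates (same data, names as above): the run id increments whenever the value changes, and transform("sum") broadcasts each run's total back to its rows, so runs of ones shorter than 4 fail the streaks.ge(4) test.

import pandas as pd

df = pd.DataFrame({"col": [0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})

run_id = df["col"].ne(df["col"].shift()).cumsum()  # new id at every value change
streaks = df.groupby(run_id).transform("sum")      # per-run total, broadcast to rows
print(run_id.head(10).tolist())                    # [1, 1, 2, 2, 2, 2, 3, 3, 4, 4]
print(streaks["col"].head(10).tolist())            # [0, 0, 4, 4, 4, 4, 0, 0, 2, 2]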
Alternatively, group on the cumulative count of zeros: each group holds one leading zero (when present) plus the run of ones that follows, so a run shorter than 4 gives a group count below 5:
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col']<5,'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
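A sketch of what that grouper looks like on a short input (illustrative):

import pandas as pd

df = pd.DataFrame({"col": [0, 1, 1, 0, 1, 1, 1, 1]})

gid = df.col.eq(0).cumsum()  # a new group starts at every zero
print(gid.tolist())          # [1, 1, 1, 2, 2, 2, 2, 2]
# each group is one leading zero plus the run of ones after it,
# so a run of four ones yields a group count of 5
print(df.groupby(gid)['col'].transform('count').tolist())  # [3, 3, 3, 5, 5, 5, 5, 5]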

Incrementing add under condition in pandas

For the following pandas dataframe
servo_in_position second_servo_in_position Expected output
0 0 1 0
1 0 1 0
2 1 2 1
3 0 3 0
4 1 4 2
5 1 4 2
6 0 5 0
7 0 5 0
8 1 6 3
9 0 7 0
10 1 8 4
11 0 9 0
12 1 10 5
13 1 10 5
14 1 10 5
15 0 11 0
16 0 11 0
17 0 11 0
18 1 12 6
19 1 12 6
20 0 13 0
21 0 13 0
22 0 13 0
I want to increment the column "Expected output" only when "servo_in_position" changes from 0 to 1. I also want "Expected output" to be 0 wherever "servo_in_position" equals 0.
I tried
input_data['second_servo_in_position']=(input_data.servo_in_position.diff()!=0).cumsum()
but it gives the output shown in the "second_servo_in_position" column, which is not what I want.
After that I would like to group and calculate mean using:
print("Mean=\n\n",input_data.groupby('second_servo_in_position').mean())
Using cumsum and arithmetic.
u = df['servo_in_position']
(u.eq(1) & u.shift().ne(1)).cumsum() * u
0 0
1 0
2 1
3 0
4 2
5 2
6 0
7 0
8 3
9 0
10 4
11 0
12 5
13 5
14 5
15 0
16 0
17 0
18 6
19 6
20 0
21 0
22 0
Name: servo_in_position, dtype: int64
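A quick sketch of why this works (toy input, not from the question): u.eq(1) & u.shift().ne(1) is True exactly at the first 1 of each run, so its cumsum numbers the runs, and multiplying by u zeroes the positions where the servo is not in position.

import pandas as pd

u = pd.Series([0, 0, 1, 0, 1, 1, 0, 1])

starts = u.eq(1) & u.shift().ne(1)     # True only at the first 1 of each run
print((starts.cumsum() * u).tolist())  # [0, 0, 1, 0, 2, 2, 0, 3]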
Use cumsum and mask:
df['E_output'] = df['servo_in_position'].diff().eq(1).cumsum()\
    .mask(df['servo_in_position'] == 0, 0)
Output:
servo_in_position second_servo_in_position Expected output E_output
0 0 1 0 0
1 0 1 0 0
2 1 2 1 1
3 0 3 0 0
4 1 4 2 2
5 1 4 2 2
6 0 5 0 0
7 0 5 0 0
8 1 6 3 3
9 0 7 0 0
10 1 8 4 4
11 0 9 0 0
12 1 10 5 5
13 1 10 5 5
14 1 10 5 5
15 0 11 0 0
16 0 11 0 0
17 0 11 0 0
18 1 12 6 6
19 1 12 6 6
20 0 13 0 0
21 0 13 0 0
22 0 13 0 0
Update for the case where the first position equals 1 (diff() yields NaN at the first row, so fillna restores the initial value there):
df['servo_in_position'].diff().fillna(df['servo_in_position']).eq(1).cumsum()\
    .mask(df['servo_in_position'] == 0, 0)
Try np.where:
df['Expected_output'] = np.where(df.servo_in_position.eq(1),
                                 df.servo_in_position.diff().eq(1).cumsum(),
                                 0)
Or cumsum and mul:
df.servo_in_position.diff().eq(1).cumsum().mul(df.servo_in_position.eq(1),axis=0)
Fast with Numba
import numpy as np
from numba import njit

@njit
def f(u):
    out = np.zeros(len(u), np.int64)
    a = out[0] = u[0]
    for i in range(1, len(u)):
        if u[i] == 1:
            if u[i - 1] == 0:  # a 0 -> 1 transition starts a new run
                a += 1
            out[i] = a
    return out
f(df.servo_in_position.to_numpy())
array([0, 0, 1, 0, 2, 2, 0, 0, 3, 0, 4, 0, 5, 5, 5, 0, 0, 0, 6, 6, 0, 0, 0])
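For larger inputs, a minimal timing harness along these lines compares the pandas one-liner with the compiled loop (an illustrative sketch: the random input and sizes are assumptions, and no timings are claimed):

import timeit
import numpy as np
import pandas as pd

arr = np.random.randint(0, 2, size=1_000_000)  # hypothetical large input
s = pd.Series(arr)

f(arr)  # warm-up call so JIT compilation isn't counted in the timing

print("pandas:", timeit.timeit(lambda: (s.eq(1) & s.shift().ne(1)).cumsum() * s, number=10))
print("numba :", timeit.timeit(lambda: f(arr), number=10))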

multiple conditions on dataframes

I'm trying to add a new column 'is_good' that is set to 1 when the 'value' column is in the range 1 to 6 and the 'value2' column is in the range 5 to 10; rows that do not satisfy both conditions get 0.
I know if you do this,
df['is_good'] = [1 if (x >= 1 and x <= 6) else 0 for x in df['value']]
it will fill in 1 or 0 depending on the range of value, but how would I also take the range of value2 into account when marking 1 or 0?
Is there any way I can achieve this without numpy?
Thank you in advance!
I think you need two between calls, with the conditions chained by & (bitwise and):
df = pd.DataFrame({'value':range(13),'value2':range(13)})
df['is_good'] = (df['value'].between(1,6) & df['value2'].between(5,10)).astype(int)
Or use 4 conditions:
df['is_good'] = ((df['value'] >= 1) & (df['value'] <= 6) &
                 (df['value2'] >= 5) & (df['value2'] <= 10)).astype(int)
print (df)
value value2 is_good
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 4 0
5 5 5 1
6 6 6 1
7 7 7 0
8 8 8 0
9 9 9 0
10 10 10 0
11 11 11 0
12 12 12 0
A bit shorter alternative with eval (parentheses are needed, since & binds tighter than the chained comparisons):
In [47]: df['is_good'] = df.eval("(1 <= value <= 6) & (5 <= value2 <= 10)").astype(np.int8)
In [48]: df
Out[48]:
value value2 is_good
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 4 0
5 5 5 1
6 6 6 1
7 7 7 0
8 8 8 0
9 9 9 0
10 10 10 0
11 11 11 0
12 12 12 0
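If more range conditions pile up, a small helper (hypothetical, not from the answers above) keeps the chaining readable by ANDing Series.between checks:

import pandas as pd

def in_ranges(df, ranges):
    # AND together df[col].between(lo, hi) for every (col, (lo, hi)) pair
    mask = pd.Series(True, index=df.index)
    for col, (lo, hi) in ranges.items():
        mask &= df[col].between(lo, hi)
    return mask.astype(int)

df = pd.DataFrame({'value': range(13), 'value2': range(13)})
df['is_good'] = in_ranges(df, {'value': (1, 6), 'value2': (5, 10)})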

Pandas sum above all possible thresholds

I have a dataset with two risk model scores and observations that have a certain amount of value. Something like this:
import numpy as np  # used below for np.sum
import pandas as pd
df = pd.DataFrame(data={'segment': ['A','A','A','A','A','A','A','B','B','B','B','B'],
                        'model1': [9,4,5,2,9,7,7,8,8,5,6,3],
                        'model2': [9,8,2,4,6,8,8,7,7,7,4,4],
                        'dollars': [15,10,-5,-7,6,7,-2,5,7,3,-1,-3]},
                  columns=['segment','model1','model2','dollars'])
print(df)
segment model1 model2 dollars
0 A 9 9 15
1 A 4 8 10
2 A 5 2 -5
3 A 2 4 -7
4 A 9 6 6
5 A 7 8 7
6 A 7 8 -2
7 B 8 7 5
8 B 8 7 7
9 B 5 7 3
10 B 6 4 -1
11 B 3 4 -3
My goal is to determine the simultaneous risk model thresholds where value is maximized, i.e. a cutoff like (model1 >= X) & (model2 >= Y). The risk models are both rank-ordered such that higher numbers are lower risk and generally higher value.
I was able to get the desired output using a loop approach:
df_sum = df.groupby(by=['segment','model1','model2'])['dollars'].agg(['sum']).rename(columns={'sum':'dollar_sum'}).reset_index()
df_sum.loc[:,'threshold_sum'] = 0
# this loop works but runs very slowly on my large dataframe
# calculate the sum of dollars for each combination of possible model score thresholds
for row in df_sum.itertuples():
    # subset the original df down to just the observations above the given model scores
    df_temp = df[((df['model1'] >= getattr(row,'model1')) & (df['model2'] >= getattr(row,'model2')) & (df['segment'] == getattr(row,'segment')))]
    # calculate the sum and add it back to the dataframe
    df_sum.loc[row.Index,'threshold_sum'] = df_temp['dollars'].sum()
# show the max value for each segment
print(df_sum.loc[df_sum.groupby(by=['segment'])['threshold_sum'].idxmax()])
segment model1 model2 dollar_sum threshold_sum
1 A 4 8 10 30
7 B 5 7 3 15
The loop runs incredibly slowly as the size of the dataframe increases. I'm sure there's a faster way to do this (maybe using cumsum() or numpy), but I'm stumped on what it is. Does anyone have a better way to do it? Ideally any code would be easily extendable to n-many risk models and would output all possible combinations of threshold_sum in case I add other optimization criteria down the road.
You'll get some speedup with apply(), using your same approach, but I agree with your hunch, there's probably a faster way.
Here's an apply() solution:
With df_sum as:
df_sum = (df.groupby(['segment','model1','model2'])
            .dollars
            .sum()
            .reset_index()
          )
print(df_sum)
segment model1 model2 dollars
0 A 2 4 -7
1 A 4 8 10
2 A 5 2 -5
3 A 7 8 5
4 A 9 6 6
5 A 9 9 15
6 B 3 4 -3
7 B 5 7 3
8 B 6 4 -1
9 B 8 7 12
apply can be combined with groupby:
def get_threshold_sum(row):
    return (df.loc[(df.segment == row.segment) &
                   (df.model1 >= row.model1) &
                   (df.model2 >= row.model2),
                   ["segment","dollars"]]
            .groupby('segment')
            .sum()
            .dollars
            )
thresholds = df_sum.apply(get_threshold_sum, axis=1)
mask = thresholds.idxmax()
df_sum.loc[mask]
segment model1 model2 dollars
1 A 4 8 10
7 B 5 7 3
To see all possible thresholds, just print the thresholds list.
I finally found a non-loop approach; it requires some reshaping and cumsum().
# duplicate the dollars column so the pivot carries a copy to accumulate
df['cumsum_dollars'] = df['dollars']
df2 = pd.pivot_table(df, index=['segment','model1','model2'], values=['dollars','cumsum_dollars'], fill_value=0, aggfunc=np.sum)
# descending sort ensures that the cumsum happens in the desired direction
df2 = df2.unstack(fill_value=0).sort_index(ascending=False,axis=0).sort_index(ascending=False,axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 0 0 0
6 0 0 0 0 -1 0 0 0 0 0 -1 0
5 0 0 3 0 0 0 0 0 3 0 0 0
3 0 0 0 0 -3 0 0 0 0 0 -3 0
A 9 15 0 0 6 0 0 15 0 0 6 0 0
7 0 5 0 0 0 0 0 5 0 0 0 0
5 0 0 0 0 0 -5 0 0 0 0 0 -5
4 0 10 0 0 0 0 0 10 0 0 0 0
2 0 0 0 0 -7 0 0 0 0 0 -7 0
From here, take the cumulative sum in both the horizontal and vertical directions using the axis parameter of the cumsum() function.
df2['cumsum_dollars'] = df2['cumsum_dollars'].groupby(level='segment').cumsum(axis=0).cumsum(axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 12 12 12
6 0 0 0 0 -1 0 0 0 12 12 11 11
5 0 0 3 0 0 0 0 0 15 15 14 14
3 0 0 0 0 -3 0 0 0 15 15 11 11
A 9 15 0 0 6 0 0 15 15 15 21 21 21
7 0 5 0 0 0 0 15 20 20 26 26 26
5 0 0 0 0 0 -5 15 20 20 26 26 21
4 0 10 0 0 0 0 15 30 30 36 36 31
2 0 0 0 0 -7 0 15 30 30 36 29 24
With the cumulative sums calculated, shape the dataframe back into how it originally looked and take the max of each group.
df3 = df2.stack().reset_index()
print(df3.loc[df3.groupby(by='segment')['cumsum_dollars'].idxmax()])
segment model1 model2 cumsum_dollars dollars
43 A 4 4 36 0
14 B 5 6 15 0
These thresholds where there aren't any observations are actually more valuable than picking any of the options that do have data. Note that idxmax() returns the first occurrence of the maximum, which is sufficient for my purposes. If you need to break ties, additional filtering/sorting would be required before calling idxmax().
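The same double cumulative sum reads even more directly on a plain numpy array, which also hints at the n-model generalization (one cumsum per model axis, taken from the high-score end); this is an illustrative sketch with made-up numbers, not part of the original answer:

import numpy as np

# toy grid: totals[i, j] = summed dollars at (model1 score i, model2 score j)
totals = np.array([[ 0, 10,  0],
                   [-5,  0,  6],
                   [ 0,  7, 15]])

# flip so high scores come first, cumsum over both axes, flip back:
# result[i, j] = sum of totals over all (i' >= i, j' >= j)
threshold_sum = totals[::-1, ::-1].cumsum(axis=0).cumsum(axis=1)[::-1, ::-1]
print(threshold_sum)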

Leave blocks of 1 of size >= k in Pandas data frame

I need to keep blocks of '1' of length >= k; every other block of '1' should be turned into zeros. For example, with k=2:
df=
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
where column a is the original sequence and column b is the desired result.
z = df.a.eq(0)              # True on zeros
g = z.cumsum().mask(z, -1)  # constant id within each block of ones; all zeros lumped into group -1
k = 2
df['b'] = df.a.groupby(g).transform('size').ge(k).astype(int).mask(z, 0)
a b
0 1 1
1 1 1
2 0 0
3 1 0
4 0 0
5 1 0
6 0 0
7 1 0
8 0 0
9 1 1
10 1 1
11 1 1
12 0 0
13 0 0
14 1 0
15 0 0
16 1 1
17 1 1
18 0 0
19 1 0
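For intuition, a sketch of the grouper on a short input (illustrative, same variable names as above):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0, 1, 0, 1, 1, 1]})

z = df.a.eq(0)
g = z.cumsum().mask(z, -1)
print(g.tolist())  # [0, 0, -1, 1, -1, 2, 2, 2]: each block of ones gets its own id
print(df.a.groupby(g).transform('size').tolist())  # [2, 2, 2, 1, 2, 3, 3, 3]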
