multiple conditions on dataframes - python

I'm trying to write a new column 'is_good' that is marked 1 when the 'value' column is in the range 1 to 6 and the 'value2' column is in the range 5 to 10; rows that do not satisfy both conditions are marked 0.
I know if you do this,
df['is_good'] = [1 if (x >= 1 and x <= 6) else 0 for x in df['value']]
it will fill in 1 or 0 depending on the range of value, but how would I also take the range of value2 into account when marking 1 or 0?
Is there any way I can achieve this without numpy?
Thank you in advance!

I think you need two between calls and to chain the conditions with & (bitwise AND):
df = pd.DataFrame({'value':range(13),'value2':range(13)})
df['is_good'] = (df['value'].between(1,6) & df['value2'].between(5,10)).astype(int)
Or use 4 conditions:
df['is_good'] = ((df['value'] >= 1) & (df['value'] <= 6) &
                 (df['value2'] >= 5) & (df['value2'] <= 10)).astype(int)
print (df)
value value2 is_good
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 4 0
5 5 5 1
6 6 6 1
7 7 7 0
8 8 8 0
9 9 9 0
10 10 10 0
11 11 11 0
12 12 12 0

A bit shorter alternative using DataFrame.eval (the parentheses matter, since & binds more tightly than the comparisons):
In [47]: df['is_good'] = df.eval("(1 <= value <= 6) & (5 <= value2 <= 10)").astype('int8')
In [48]: df
Out[48]:
value value2 is_good
0 0 0 0
1 1 1 0
2 2 2 0
3 3 3 0
4 4 4 0
5 5 5 1
6 6 6 1
7 7 7 0
8 8 8 0
9 9 9 0
10 10 10 0
11 11 11 0
12 12 12 0
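If you want to stay close to the original list comprehension and avoid numpy entirely, a minimal sketch iterating over both columns with zip (it should give the same result as the between version above):
df['is_good'] = [1 if (1 <= x <= 6) and (5 <= y <= 10) else 0
                 for x, y in zip(df['value'], df['value2'])]
The vectorized between/& versions will generally be faster on large frames, but the comprehension keeps the style of the original attempt.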

Create a new column that counts backwards from a specific point

I would like to look at an outcome in the time prior to a change in product and after a change in product. Here is an example df:
import pandas as pd
ids = [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2]
date = ["11/4/2020", "12/5/2020", "01/5/2021", "02/5/2020", "03/5/2020", "04/5/2020", "05/5/2020", "06/5/2020", "07/5/2020", "08/5/2020", "09/5/2020",
"01/3/2019", "02/3/2019", "03/3/2019", "04/3/2019", "05/3/2019", "06/3/2019", "07/3/2019", "08/3/2019", "09/3/2019", "10/3/2019"]
months = [0,1,2,3,4,0,1,2,3,4,5,0,1,2,3,4,0,1,2,3,4]
df = pd.DataFrame({'ids': ids,
                   'date': date,
                   'months': months
                   })
df
ids date months
0 1 11/4/2020 0
1 1 12/5/2020 1
2 1 01/5/2021 2
3 1 02/5/2020 3
4 1 03/5/2020 4
5 1 04/5/2020 0
6 1 05/5/2020 1
7 1 06/5/2020 2
8 1 07/5/2020 3
9 1 08/5/2020 4
10 1 09/5/2020 5
11 2 01/3/2019 0
12 2 02/3/2019 1
13 2 03/3/2019 2
14 2 04/3/2019 3
15 2 05/3/2019 4
16 2 06/3/2019 0
17 2 07/3/2019 1
18 2 08/3/2019 2
19 2 09/3/2019 3
20 2 10/3/2019 4
This is what I would like the end result to be:
ids date months new_col
0 1 11/4/2020 0 -5
1 1 12/5/2020 1 -4
2 1 01/5/2021 2 -3
3 1 02/5/2020 3 -2
4 1 03/5/2020 4 -1
5 1 04/5/2020 0 0
6 1 05/5/2020 1 1
7 1 06/5/2020 2 2
8 1 07/5/2020 3 3
9 1 08/5/2020 4 4
10 1 09/5/2020 5 5
11 2 01/3/2019 0 -5
12 2 02/3/2019 1 -4
13 2 03/3/2019 2 -3
14 2 04/3/2019 3 -2
15 2 05/3/2019 4 -1
16 2 06/3/2019 0 0
17 2 07/3/2019 1 1
18 2 08/3/2019 2 2
19 2 09/3/2019 3 3
20 2 10/3/2019 4 4
In other words, I would like to add a column that finds the second instance of months == 0 for a specific id and counts backwards from that point, so I can look at outcomes before that point (all the negative numbers) vs. the outcomes after it (all the positive numbers).
Is there a simple way to do this in pandas?
Thanks in advance
Assuming there are exactly two instances of 0 per id, you don't need to care about ids, because:
(id1, first 0) -> negative counter,
(id1, second 0) -> positive counter,
(id2, first 0) -> negative counter,
(id2, second 0) -> positive counter, and so on.
Create virtual groups to decide whether to build a negative or a positive counter:
odd group: negative counter
even group: positive counter
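For reference, the virtual groups produced by the cumulative sum look like this on the sample data (a quick check):
print(df['months'].eq(0).cumsum().tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]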
df['new_col'] = (
    df.assign(new_col=df['months'].eq(0).cumsum())
      .groupby('new_col')['new_col']
      .apply(lambda x: range(-len(x), 0, 1) if x.name % 2 else range(len(x)))
      .explode().values
)
Output:
>>> df
ids date months new_col
0 1 11/4/2020 0 -5
1 1 12/5/2020 1 -4
2 1 01/5/2021 2 -3
3 1 02/5/2020 3 -2
4 1 03/5/2020 4 -1
5 1 04/5/2020 0 0
6 1 05/5/2020 1 1
7 1 06/5/2020 2 2
8 1 07/5/2020 3 3
9 1 08/5/2020 4 4
10 1 09/5/2020 5 5
11 2 01/3/2019 0 -5
12 2 02/3/2019 1 -4
13 2 03/3/2019 2 -3
14 2 04/3/2019 3 -2
15 2 05/3/2019 4 -1
16 2 06/3/2019 0 0
17 2 07/3/2019 1 1
18 2 08/3/2019 2 2
19 2 09/3/2019 3 3
20 2 10/3/2019 4 4
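If you would rather key on ids explicitly instead of relying on the parity of the virtual groups, here is a sketch, still assuming exactly two months == 0 rows per id (centered_counter is a hypothetical helper name):
import numpy as np

def centered_counter(months):
    # positions of the months == 0 rows within the group; the second one becomes 0
    zeros = np.flatnonzero(months.to_numpy() == 0)
    return pd.Series(np.arange(len(months)) - zeros[1], index=months.index)

df['new_col'] = df.groupby('ids', group_keys=False)['months'].apply(centered_counter)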

Pandas add column on condition: If value of cell is True set value of largest number in Period to true

I have a pandas dataframe with, let's say, two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the maximum value in the last period rows.
For example, boolean == 1 in row 1 (value = 5). So I look at the last period rows: the values are [1, 5], 5 is the maximum, so the value of new_boolean in row 1 will be 1.
Second example: row 7 (value = 7): I get the values [7, 4, 12, 9], 12 is the maximum, so the value of new_boolean in the row with value 12 (row 5) will be 1.
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple rows with the same value but different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    return values.idxmax()

rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
      .apply(idxmax)
      .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
array([ 1,  4,  5, 10])
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
I did this in 2 steps, but I think the solution is much clearer:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''),delim_whitespace=True,index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
Result:
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0
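The row-wise apply in the second step can probably be replaced by a plain vectorized comparison; a sketch that should give the same new_bool column:
# rolling max over the last 4 rows, then keep it only where the row itself is the max and boolean == 1
rolling_max = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = ((df['value'] == rolling_max) & (df['boolean'] == 1)).astype(int)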

Replace values in df col - pandas

I'm aiming to replace values in a df column Num. Specifically:
where 1 is located in Num, I want to replace the preceding 0s with 1, backfilling until the nearest row where Item is 1.
where Num == 1, the corresponding row in Item will always be 0.
Also, Num == 0 will always follow Num == 1.
Input and code:
df = pd.DataFrame({
    'Item' : [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
    'Num' : [0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0]
})
df['Num'] = np.where((df['Num'] == 1) & (df['Item'].shift() > 1), 1, 0)
Item Num
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 0
12 2 0
13 3 0
14 4 1
15 0 0
intended output:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
First, create groups of the rows according to the two start and end conditions using cumsum. Then we can group by this new column and sum over the Num column. In this way, all groups that contain a 1 in the Num column will get the value 1 while all other groups will get 0.
groups = ((df['Num'].shift() == 1) | (df['Item'] == 1)).cumsum()
df['Num'] = df.groupby(groups)['Num'].transform('sum')
Result:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
You could try:
for a, b in zip(df[df['Item'] == 0].index, df[df['Num'] == 1].index):
    df.loc[(df.loc[a+1:b-1, 'Item'] == 1)[::-1].idxmax():b-1, 'Num'] = 1
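The one-liner packs several steps together; here is an expanded sketch of the same idea with the intermediate index made explicit (fill_from is a hypothetical name for the backfill start):
# pair each block start (Item == 0) with the next row where Num == 1
for a, b in zip(df[df['Item'] == 0].index, df[df['Num'] == 1].index):
    # last row between them where Item == 1, found by searching backwards
    fill_from = (df.loc[a + 1:b - 1, 'Item'] == 1)[::-1].idxmax()
    # backfill 1s from that row up to the row just before the existing Num == 1
    df.loc[fill_from:b - 1, 'Num'] = 1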

Create a single column using values from multiple columns

I am trying to create a new column in a Pandas data frame based on the values of three columns: the output is 1 if the value in each of the columns ['A','B','C'] is at least 5, and the output is 0 if any one of the columns ['A','B','C'] has a value below 5.
The data frame looks like this:
A B C
5 8 6
9 2 1
6 0 0
2 2 6
0 1 2
5 8 10
5 5 1
9 5 6
Expected output:
A B C new_column
5 8 6 1
9 2 1 0
6 0 0 0
2 2 6 0
0 1 2 0
5 8 10 1
5 5 1 0
9 5 6 1
I tried using this code, but it is not giving me the desired output:
conditions = [(df['A'] >= 5) , (df['B'] >= 5) , (df['C'] >= 5)]
choices = [1,1,1]
df['new_colum'] = np.select(conditions, choices, default=0)
You need to chain the conditions with & (bitwise AND):
conditions = (df['A'] >= 5) & (df['B'] >= 5) & (df['C'] >= 5)
Or use DataFrame.all to check whether all values in a row are True:
conditions = (df[['A','B','C']] >= 5).all(axis=1)
# if every column needs to be >= 5
conditions = (df >= 5).all(axis=1)
Then convert the boolean mask to integers, so True/False become 1/0:
df['new_colum'] = conditions.astype(int)
Or use numpy.where:
df['new_colum'] = np.where(conditions, 1, 0)
print (df)
A B C new_colum
0 5 8 6 1
1 9 2 1 0
2 6 0 0 0
3 2 2 6 0
4 0 1 2 0
5 5 8 10 1
6 5 5 1 0
7 9 5 6 1
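If you would rather keep the np.select structure from the original attempt, the fix is to pass one combined condition instead of three independent ones; a sketch:
import numpy as np

# a single condition that is True only when all three columns are >= 5
condition = (df[['A', 'B', 'C']] >= 5).all(axis=1)
df['new_colum'] = np.select([condition], [1], default=0)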

Pandas sum above all possible thresholds

I have a dataset with two risk model scores and observations that have a certain amount of value. Something like this:
import pandas as pd
df = pd.DataFrame(data={'segment':['A','A','A','A','A','A','A','B','B','B','B','B'],
                        'model1':[9,4,5,2,9,7,7,8,8,5,6,3],
                        'model2':[9,8,2,4,6,8,8,7,7,7,4,4],
                        'dollars':[15,10,-5,-7,6,7,-2,5,7,3,-1,-3]},
                  columns=['segment','model1','model2','dollars'])
print(df)
segment model1 model2 dollars
0 A 9 9 15
1 A 4 8 10
2 A 5 2 -5
3 A 2 4 -7
4 A 9 6 6
5 A 7 8 7
6 A 7 8 -2
7 B 8 7 5
8 B 8 7 7
9 B 5 7 3
10 B 6 4 -1
11 B 3 4 -3
My goal is to determine the simultaneous risk model thresholds where value is maximized, i.e. a cutoff like (model1 >= X) & (model2 >= Y). The risk-models are both rank-ordered such that higher numbers are lower risk and generally higher value.
I was able to get the desired output using a loop approach:
df_sum = df.groupby(by=['segment','model1','model2'])['dollars'].agg(['sum']).rename(columns={'sum':'dollar_sum'}).reset_index()
df_sum.loc[:,'threshold_sum'] = 0

# this loop works but runs very slowly on my large dataframe
# calculate the sum of dollars for each combination of possible model score thresholds
for row in df_sum.itertuples():
    # subset the original df down to just the observations above the given model scores
    df_temp = df[((df['model1'] >= getattr(row,'model1')) & (df['model2'] >= getattr(row,'model2')) & (df['segment'] == getattr(row,'segment')))]
    # calculate the sum and add it back to the dataframe
    df_sum.loc[row.Index,'threshold_sum'] = df_temp['dollars'].sum()

# show the max value for each segment
print(df_sum.loc[df_sum.groupby(by=['segment'])['threshold_sum'].idxmax()])
segment model1 model2 dollar_sum threshold_sum
1 A 4 8 10 30
7 B 5 7 3 15
The loop runs incredibly slowly as the size of the dataframe increases. I'm sure there's a faster way to do this (maybe using cumsum() or numpy), but I'm stumped on what it is. Does anyone have a better way to do it? Ideally any code would be easily extendable to n-many risk models and would output all possible combinations of threshold_sum in case I add other optimization criteria down the road.
You'll get some speedup with apply(), using your same approach, but I agree with your hunch that there's probably a faster way.
Here's an apply() solution:
With df_sum as:
df_sum = (df.groupby(['segment','model1','model2'])
            .dollars
            .sum()
            .reset_index()
          )
print(df_sum)
segment model1 model2 dollars
0 A 2 4 -7
1 A 4 8 10
2 A 5 2 -5
3 A 7 8 5
4 A 9 6 6
5 A 9 9 15
6 B 3 4 -3
7 B 5 7 3
8 B 6 4 -1
9 B 8 7 12
apply can be combined with groupby:
def get_threshold_sum(row):
    return (df.loc[(df.segment == row.segment) &
                   (df.model1 >= row.model1) &
                   (df.model2 >= row.model2),
                   ["segment", "dollars"]]
            .groupby('segment')
            .sum()
            .dollars
            )
thresholds = df_sum.apply(get_threshold_sum, axis=1)
mask = thresholds.idxmax()
df_sum.loc[mask]
segment model1 model2 dollars
1 A 4 8 10
7 B 5 7 3
To see all possible thresholds, just print the thresholds list.
I finally found a non-loop approach; it requires some reshaping and cumsum().
df['cumsum_dollars'] = df['dollars']
df2 = pd.pivot_table(df,index=['segment','model1','model2'],values=['dollars','cumsum_dollars'],fill_value=0,aggfunc=np.sum)
# descending sort ensures that the cumsum happens in the desired direction
df2 = df2.unstack(fill_value=0).sort_index(ascending=False,axis=0).sort_index(ascending=False,axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 0 0 0
6 0 0 0 0 -1 0 0 0 0 0 -1 0
5 0 0 3 0 0 0 0 0 3 0 0 0
3 0 0 0 0 -3 0 0 0 0 0 -3 0
A 9 15 0 0 6 0 0 15 0 0 6 0 0
7 0 5 0 0 0 0 0 5 0 0 0 0
5 0 0 0 0 0 -5 0 0 0 0 0 -5
4 0 10 0 0 0 0 0 10 0 0 0 0
2 0 0 0 0 -7 0 0 0 0 0 -7 0
From here, take the cumulative sum in both the horizontal and vertical directions using the axis parameter of the cumsum() function.
df2['cumsum_dollars'] = df2['cumsum_dollars'].groupby(level='segment').cumsum(axis=0).cumsum(axis=1)
print(df2)
dollars cumsum_dollars
model2 9 8 7 6 4 2 9 8 7 6 4 2
segment model1
B 8 0 0 12 0 0 0 0 0 12 12 12 12
6 0 0 0 0 -1 0 0 0 12 12 11 11
5 0 0 3 0 0 0 0 0 15 15 14 14
3 0 0 0 0 -3 0 0 0 15 15 11 11
A 9 15 0 0 6 0 0 15 15 15 21 21 21
7 0 5 0 0 0 0 15 20 20 26 26 26
5 0 0 0 0 0 -5 15 20 20 26 26 21
4 0 10 0 0 0 0 15 30 30 36 36 31
2 0 0 0 0 -7 0 15 30 30 36 29 24
With the cumulative sums calculated, shape the dataframe back into how it originally looked and take the max of each group.
df3 = df2.stack().reset_index()
print(df3.loc[df3.groupby(by='segment')['cumsum_dollars'].idxmax()])
segment model1 model2 cumsum_dollars dollars
43 A 4 4 36 0
14 B 5 6 15 0
These thresholds where there aren't any observations are actually more valuable than picking any of the options that do have data. Note that idxmax() returns the first occurrence of the maximum, which is sufficient for my purposes. If you need to break ties, additional filtering/sorting would be required before calling idxmax().
