Python DataFrame new column based on conditional statement

I have a dataframe consisting of a few columns of custom calculations for a trading strategy. I want to add a new column called 'Signals' to this dataframe, consisting of 0s and 1s (long-only strategy). The signals will be generated from the following logic, where each item is a separate column in the dataframe:
if open_price > low_sigma.shift(1) and high_price > high_sigma.shift(1):
    signal = 1
else:
    signal = 0
From my understanding, if statements are not efficient for dataframes. In addition, I haven't been able to get this to output as desired. How do you recommend I generate the signal and add it to the dataframe?

You could assign df['Signals'] to the boolean condition itself, then use astype to convert the booleans to 0s and 1s:
df['Signals'] = (((df['open_price'] > df['low_sigma'].shift(1))
                  & (df['high_price'] > df['high_sigma'].shift(1)))
                 .astype('int'))
for example,
import pandas as pd

df = pd.DataFrame({
    'open_price': [1, 2, 3, 4],
    'low_sigma': [1, 3, 2, 4],
    'high_price': [10, 20, 30, 40],
    'high_sigma': [10, 40, 20, 30]})
#    high_price  high_sigma  low_sigma  open_price
# 0          10          10          1           1
# 1          20          40          3           2
# 2          30          20          2           3
# 3          40          30          4           4

mask = ((df['open_price'] > df['low_sigma'].shift(1))
        & (df['high_price'] > df['high_sigma'].shift(1)))
# 0    False
# 1     True
# 2    False
# 3     True
# dtype: bool

df['Signals'] = mask.astype('int')
print(df)
yields
   high_price  high_sigma  low_sigma  open_price  Signals
0          10          10          1           1        0
1          20          40          3           2        1
2          30          20          2           3        0
3          40          30          4           4        1
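An equivalent alternative (a sketch, not part of the original answer) is np.where, which picks 1 where the condition holds and 0 elsewhere:

import numpy as np

signal_cond = ((df['open_price'] > df['low_sigma'].shift(1))
               & (df['high_price'] > df['high_sigma'].shift(1)))
df['Signals'] = np.where(signal_cond, 1, 0)  # 1 where the condition is True, else 0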

Related

How to make a calculation based on a value and multiple columns in pandas?

I am trying to create a column that should do a calculation per product, based on multiple columns.
Logic for the calculation column:
Calculations should be done per product
Use quantity as default
If promo[Y/N] = 1, take the previous week's quantity * the season percentage change.
Except when the promo falls in the first week of the product; then keep the quantity as well.
In the example below, you can see the calculation column (I placed comments for the logic).
week  product  promo[Y/N]  quantity  Season  calculation
1     A        0           100       6       100   # no promo, so = quantity col
2     A        0           100       10      100   # no promo, so = quantity col
3     A        1           ?         -10     90    # 100 (quantity last week) - 10% (season)
4     A        1           ?         20      108   # quantity last week, what we calculated: 90 + 20% (18) = 108
5     A        0           80        4       80    # no promo, so = quantity col
1     B        1           100       6       100   # promo, but first week of this product, so regular quantity
2     B        0           100       10      100   # no promo, so = quantity col
3     B        1           ?         -10     90    # 100 (quantity last week) - 10% (season)
4     B        1           ?         20      108   # quantity last week, what we calculated: 90 + 20% (18) = 108
5     B        0           80        4       80    # no promo, so = quantity col
I tried to solve this in two ways:
Via a groupby() on the product, but this was messing up my end file (I would like to keep the format above, just with one additional column).
By looping over the dataframe with iterrows(). However, I messed this up because it doesn't distinguish between products.
Anyone an idea what a proper method is to solve this? Appreciated!
Using custom apply function to add column to dataframe
Code
import numpy as np
import pandas as pd

def calc(xf):
    '''
    xf is a dataframe from groupby
    '''
    calc = []
    # Faster to loop over rows using zip than iterrows
    for promo, quantity, season in zip(xf['promo[Y/N]'], xf['quantity'], xf['Season']):
        if not np.isnan(quantity):
            calc.append(quantity)  # not a missing value
        elif promo and calc:  # beyond the first week if calc is not empty
            prev_quantity = float(calc[-1])  # previous quantity
            estimate = round((1 + season/100.)*prev_quantity)  # estimate
            calc.append(estimate)
        else:
            calc.append(quantity)  # use current quantity
    xf['calculated'] = calc  # Add calculated column to dataframe
    return xf
Test
from io import StringIO

s = '''week  product  promo[Y/N]  quantity  Season
1     A       0           100       6
2     A       0           100       10
3     A       1           ?         -10
4     A       1           ?         20
5     A       0           80        4
1     B       1           100       6
2     B       0           100       10
3     B       1           ?         -10
4     B       1           ?         20
5     B       0           80        4'''

# convert '?' to np.nan (missing value, so the column becomes float)
df = pd.read_csv(StringIO(s), sep=r'\s{2,}', na_values=['?'], engine='python')
print(df.dtypes)
# Output
week            int64
product        object
promo[Y/N]      int64
quantity      float64
Season          int64
dtype: object
tf = df.groupby('product').apply(calc)
print(tf)
# Output
   week product  promo[Y/N]  quantity  Season  calculated
0     1       A           0     100.0       6       100.0
1     2       A           0     100.0      10       100.0
2     3       A           1       NaN     -10        90.0
3     4       A           1       NaN      20       108.0
4     5       A           0      80.0       4        80.0
5     1       B           1     100.0       6       100.0
6     2       B           0     100.0      10       100.0
7     3       B           1       NaN     -10        90.0
8     4       B           1       NaN      20       108.0
9     5       B           0      80.0       4        80.0
groupby() is a good approach in my opinion.
Let's build our dataset first:
from io import StringIO

import numpy as np
import pandas as pd

csvfile = StringIO(
    """week\tproduct\tpromo\tquantity\tseason
1\tA\t0\t100\t6
2\tA\t0\t100\t10
3\tA\t1\tnan\t-10
4\tA\t1\tnan\t20
5\tA\t0\t80\t4
1\tB\t1\t100\t6
2\tB\t0\t100\t10
3\tB\t1\tnan\t-10
4\tB\t1\tnan\t20
5\tB\t0\t80\t4""")
df = pd.read_csv(csvfile, sep='\t', engine='python')
I strongly encourage you not to use '?' or other non-numeric characters to represent missing data; it will be a mess in your future analysis. Use np.nan instead.
Then let's build a function which implements your expected behaviour for each product:
def computation(pd_input):
    previous_calculation = np.nan
    # output dataframe
    pd_output = pd.DataFrame(index=pd_input.index, columns=["calculation"])
    # we need to loop over all the rows
    for index, row in pd_input.iterrows():
        # if promo == 0 or for the first row, use quantity
        if row.promo == 0 or np.isnan(previous_calculation):
            pd_output.loc[index, "calculation"] = row.quantity
        # else, use the previous value we computed and apply the percentage
        else:
            pd_output.loc[index, "calculation"] = (1 + row.season/100) * previous_calculation
        previous_calculation = pd_output.loc[index, "calculation"]
    # return the output dataframe
    return pd_output
Finally, we just need to apply the previous function to our grouped dataframe:
calculation = df.groupby("product", group_keys=True).apply(lambda x: computation(x))
Let's check our output:
print(calculation)
          calculation
product
A       0       100.0
        1       100.0
        2        90.0
        3       108.0
        4        80.0
B       5       100.0
        6       100.0
        7        90.0
        8       108.0
        9        80.0
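If you want the result back on df as the single extra column the question asks for, here is a minimal sketch (not part of the original answer), assuming the inner index level matches the original row index as in the printed output above:

# drop the 'product' level added by groupby so the Series aligns on the
# original row index, then assign it as a new column
df["calculation"] = calculation.droplevel("product")["calculation"]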

Compare all column values to another one with Pandas

I am having trouble with Pandas.
I am trying to compare each value of a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the column labelled 'CAC 40'.
If the value is greater I want to turn it into a Boolean 1, or 0 if lower.
This should return a dataframe filled only with 1s and 0s so I can then summarize by column.
I have tried the apply method but this doesn't work.
It returns a pandas Series (attached below):
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Please can someone help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
   A  B  CAC 40
0  0  2       9
1  1  3       9
2  2  4       1
3  3  5       2
4  4  7       2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
   A  B  CAC 40
0  0  0       9
1  0  0       9
2  1  1       1
3  1  1       2
4  1  1       2
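A vectorized alternative (an assumption, not part of the original answer) compares every other column to 'CAC 40' in one step and casts the booleans to 0/1; result.sum() then gives the per-column counts the question mentions:

# compare all columns (except 'CAC 40' itself) to the 'CAC 40' column, row-wise
result = df.drop(columns="CAC 40").gt(df["CAC 40"], axis=0).astype(int)
print(result.sum())  # number of rows where each stock beat the CAC 40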

In filtering a pandas dataframe on multiple conditions, is df[condition1][condition2] equivalent to df[(condition1) & (condition2)]?

Say you have a pandas dataframe df with columns df['year'], df['fish'], and df['age']. In practice (in pandas version 0.22.0), it appears that
df[df['year']<2000][df['fish']=='salmon'][df['age']!=50]
yields results identical to
df[(df['year']<2000) & (df['fish']=='salmon') & (df['age']!=50)]
However, in tutorials and other stackoverflow questions I only see the second version (the one with boolean operators) recommended. Is that just because it's more flexible and can do other boolean operators, or are there situations in which these two methods do not yield the same result?
Why you should not do df[condition1][condition2]
You should go with the second approach. In addition to the greater readability of the second version, the first approach can lead to warnings because the dataframe returned by the first selection might not contain all the indices used by the second selection.
For instance, let's consider this dataframe:
>>> df = pd.DataFrame({'a': [0,1,0,1,0], 'b': [0,0,0,1,1]})
   a  b
0  0  0
1  1  0
2  0  0
3  1  1
4  0  1
And test for equality to 1 on both columns (the example is trivial) with df['a'].eq(1) and df['b'].eq(1). Both return Series of True/False with all the indices of df:
>>> df['a'].eq(1)
0    False
1     True
2    False
3     True
4    False
Name: a, dtype: bool
>>> df['b'].eq(1)
0    False
1    False
2    False
3     True
4     True
Name: b, dtype: bool
But after the first slicing df[df['a'].eq(1)] you get:
   a  b
1  1  0
3  1  1
Thus the second selection tries to use indices that are absent and you get a warning:
>>> df[df['a'].eq(1)][df['b'].eq(1)]
   a  b
3  1  1
UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df[df['a'].eq(1)][df['b'].eq(1)]
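If you do want to keep the chained style, a sketch (not from the original answer) that avoids the warning is to re-evaluate the second condition on the already-filtered frame by passing a callable to .loc:

# the lambda receives the filtered frame, so the boolean Series it returns
# carries exactly the surviving indices
df[df['a'].eq(1)].loc[lambda d: d['b'].eq(1)]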
How you can sometimes do better than df[condition1 & condition2]
When you do df[condition1 & condition2], both tests are done prior to selecting the data. This can be unnecessary if computation of condition2 is expensive.
Let's consider the following example where column a is mostly 0s with a few 1s:
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'a': np.random.choice([0, 1], size=100, p=[0.9, 0.1]),
                   'b': np.random.choice([0, 1], size=100)})
    a  b
0   0  0
1   1  1
2   0  0
3   0  0
4   0  1
...
95  0  1
96  0  1
97  0  0
98  0  0
99  0  1
and consider this (stupid) expensive function to apply to the second column, which inefficiently checks whether the values are equal to 1:
def long_check(s):
    import time
    out = []
    for elem in s:
        time.sleep(0.01)
        out.append(elem == 1)
    return out
Now, if we do df[df['a'].eq(1) & long_check(df['b'])], we get the expected result (rows with only 1s), but it takes 1s to run (10ms per row × 100 rows).
    a  b
1   1  1
33  1  1
34  1  1
50  1  1
52  1  1
We can make it much more efficient by first selecting on condition1, saving the intermediate result, and then selecting on condition2.
df2 = df[df['a'].eq(1)]
df2[long_check(df2['b'])]
The result is exactly the same but now the expensive function runs only on the rows selected by the first condition (10 rows instead of 100). It is thus 10 times faster.

How to standardize values in a Pandas dataframe based on index position?

I have a number of pandas dataframes that each have a 'speaker' column containing one of two labels. Typically the labels are 0 and 1, however in some cases they are 1-2, 1-3, or 0-2. I am trying to find a way to iterate through all of my dataframes and standardize them so that they share the same labels (0-1).
The one consistent feature between them is that the first label to appear (i.e. in the first row of the dataframe) should always be mapped to '0', whereas the second should always be mapped to '1'.
Here is an example of one of the dataframes I would need to change - being mindful that others will have different labels:
import pandas as pd
data = [1,2,1,2,1,2,1,2,1,2]
df = pd.DataFrame(data, columns = ['speaker'])
I would like to be able to change this so that it appears as [0,1,0,1,0,1,0,1,0,1].
Thus far, I have tried inserting the following code within a bigger for loop that iterates through each dataframe. However, it is not working at all:
for label in data['speaker']:
    if label == data['speaker'][0]:
        label = '0'
    else:
        label = '1'
Hopefully, what the above makes clear is that I am attempting to create a rule akin to: "find all instances in 'Speaker' that match the label in the first index position and change this to '0'. For all other instances change this to '1'."
Method 1
We can use iat + np.where here for conditional creation of your column:
import numpy as np

first_val = df['speaker'].iat[0]  # same as df['speaker'].iloc[0]
df['speaker'] = np.where(df['speaker'].eq(first_val), 0, 1)
   speaker
0        0
1        1
2        0
3        1
4        0
5        1
6        0
7        1
8        0
9        1
Method 2:
We can also make use of booleans, since we can cast them to integers:
first_val = df['speaker'].iat[0]
df['speaker'] = df['speaker'].ne(first_val).astype(int)
   speaker
0        0
1        1
2        0
3        1
4        0
5        1
6        0
7        1
8        0
9        1
Only if your values are actually 1 and 2 can we use floor division:
df['speaker'] = df['speaker'] // 2
# same as: df['speaker'] = df['speaker'].floordiv(2)
   speaker
0        0
1        1
2        0
3        1
4        0
5        1
6        0
7        1
8        0
9        1
You can use iloc to get the value in the first row of the first column, and then a mask to set the values:
zero_map = df["speaker"].iloc[0]
mask_zero = df["speaker"] == zero_map
df.loc[mask_zero] = 0
df.loc[~mask_zero] = 1
print(df)
   speaker
0        0
1        1
2        0
3        1
4        0
5        1
6        0
7        1
8        0
9        1
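Another option (a sketch, not taken from the answers above): pd.factorize numbers labels by order of first appearance, so the first speaker label automatically becomes 0 and the other becomes 1, whatever the original pair of labels is:

# factorize returns (codes, uniques); codes are 0, 1, ... in order of first appearance
df['speaker'] = pd.factorize(df['speaker'])[0]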

Pandas: how to join back your data after an operation

I have the following dataframe
df:
group  people  value  value_50
1      5       100    1
2      2       90     1
1      10      80     1
2      20      40     0
1      7       10     0
2      23      30     0
I am trying to apply sklearn's MinMaxScaler to one of the columns, for rows matching a condition on the dataset, and then join the result back into my original data on the pandas index.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
After copying the above data
data = pd.read_clipboard()
minmax = MinMaxScaler(feature_range=(0, 10))

# Applying a filter on "group" and then applying minmax only to those values
val = pd.DataFrame(minmax.fit_transform(data[data['group'] == 1][['value']]),
                   columns=['val_minmax'])
But it looks like we lose the index after the minmax
val
   val_minmax
0   10.000000
1    7.777778
2    0.000000
whereas the index in my original dataset for this filter is:
data[data['group'] == 1]['value']
output:
0    100
2     80
4     10
Desired dataset:
df_out:
group  people  value  value_50  val_minmax
1      5       100    1         10
2      2       90     1         na
1      10      80     1         7.78
2      20      40     0         na
1      7       10     0         0
2      23      30     0         na
Now, how to join back my data at rows in the original data, so that I can get the above output?
You just need to assign it back:
df.loc[df.group == 1, 'val_minmax'] = minmax.fit_transform(df[df['group'] == 1][['value']])
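An equivalent sketch (an assumption, not part of the original answer) that makes the index alignment explicit by building a Series on the filtered index before assigning it:

mask = df['group'] == 1
scaled = minmax.fit_transform(df.loc[mask, ['value']]).ravel()   # 2D array -> 1D
df['val_minmax'] = pd.Series(scaled, index=df.loc[mask].index)   # aligns on index; other rows get NaN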
