How to use previous values from a DataFrame in an if statement - Python

I have a DataFrame df:
import pandas as pd

df = pd.DataFrame([[47,55,47,50], [33,37,30,25], [61,65,54,57], [25,26,21,22], [25,29,23,28]],
                  columns=['open','high','low','close'])
print(df)
   open  high  low  close
0    47    55   47     50
1    33    37   30     25
2    61    65   54     57
3    25    26   21     22
4    25    29   23     28
I want to use the previous row's values in an if statement for comparison.
The logic is as follows:
if (close[i-1] / close[i] > 2) and (high[i] < low[i-1]) and
   ((open[i] > high[i-1]) or (open[i] < low[i-1])):
I have written this code:
for idx, i in df.iterrows():
    if idx != 0:
        if ((prv_close / i['close']) > 2) and (i['high'] < prv_low) and ((i['open'] > prv_high) or (i['open'] < prv_low)):
            print("Successful")
        else:
            print("Not Successful")
    prv_close = i['close']
    prv_open = i['open']
    prv_low = i['low']
    prv_high = i['high']
output:
Not Successful
Not Successful
Successful
Not Successful
But for millions of rows it's taking too much time. Is there another way to implement this faster?
P.S.: The actions to be taken on the data inside the if statement are different; for simplicity I'm using print statements. My columns may not always be in the same order, which is why I'm using iterrows() instead of itertuples().
Any suggestions are welcome.
Thanks.
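(An aside on the P.S.: itertuples yields namedtuples, so fields are read by column name rather than position, and column order shouldn't matter as long as the names are valid Python identifiers. A minimal sketch:)
for row in df.itertuples():
    # attribute access is by column name, independent of column order
    print(row.open, row.high, row.low, row.close)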

You can vectorize this by shifting the frame one row, so each row is compared against its predecessor:
d0 = df.shift()                     # previous row's values, aligned to the current row
cond0 = (d0.close / df.close) > 2   # close[i-1] / close[i] > 2
cond1 = df.high < d0.low            # high[i] < low[i-1]
cond2 = df.open > d0.high           # open[i] > high[i-1]
cond3 = df.open < d0.low            # open[i] < low[i-1]
mask = cond0 & cond1 & (cond2 | cond3)
mask
0    False
1    False
2    False
3     True
4    False
dtype: bool
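To act on the rows without looping, the mask can drive whatever the real per-row action is, provided it can be expressed column-wise. A minimal sketch (the 'result' column is illustrative, not part of the original post):
import numpy as np

# label every row at once; row 0 stays False since it has no previous row
df['result'] = np.where(mask, 'Successful', 'Not Successful')

# or select only the matching rows for further processing
hits = df[mask]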

Related

Finding rows which match both conditions

Say I have a dataframe:
Do  Re  Mi     Fa   So
1   0   Foo    100  50
1   1   Bar    75   20
0   0   True   59   59
1   1   False  0    12
How would I go about finding all rows where the value in columns "Do" and "Re" BOTH equal 1 AND "Fa" is higher than "So"?
I've tried a couple of ways, but the first returns an error complaining of ambiguity:
df['both_1']=((df['Do'] > df['Re']) & (df['Mi'] == df['Fa'] == 1))
I also tried breaking it down into steps, but I realised the last step brings in both True and False rows. I only want True.
df['Do_1'] = df['Do'] == 1
df['Re_1'] = df['Re'] == 1
# This is where I realised I'm bringing in the False rows too
df['both_1'] = (df['Do_1'] == df['Re_1'])
Chain the masks with another & for bitwise AND:
df['both_1']= (df['Fa'] > df['So']) & (df['Do'] == 1) & (df['Re'] == 1)
print (df)
   Do  Re     Mi   Fa  So  both_1
0   1   0    Foo  100  50   False
1   1   1    Bar   75  20    True
2   0   0   True   59  59   False
3   1   1  False    0  12   False
Or, if there may be multiple such columns, compare the subset df[['Do', 'Re']] and test whether all values are True with DataFrame.all:
df['both_1']= (df['Fa'] > df['So']) & (df[['Do', 'Re']] == 1).all(axis=1)
If you need to filter, use boolean indexing:
df1 = df[(df['Fa'] > df['So']) & (df['Do'] == 1) & (df['Re'] == 1)]
And for the second solution:
df1 = df[(df['Fa'] > df['So']) & (df[['Do', 'Re']] == 1).all(axis=1)]
You can combine the three conditions using & (bitwise and):
df[(df['Do'] == 1) & (df['Re'] == 1) & (df['Fa'] > df['So'])]
Output (for your sample data):
   Do  Re   Mi  Fa  So
1   1   1  Bar  75  20
Try:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Do':[1,1,0,1],'Re':[0,1,0,1],'Mi':['Foo','Bar',True,False],'Fa':[100,75,59,0],'Sol':[50,20,59,12]})
df.loc[np.where((df.Do == 1) & (df.Re == 1) & (df.Fa > df.Sol))[0]]
Output:
   Do  Re   Mi  Fa  Sol
1   1   1  Bar  75   20
If you do not want to import anything else, just do this:
df.assign(keep=[1 if (df.loc[i, 'Do'] == 1 and df.loc[i, 'Re'] == 1 and df.loc[i, 'Fa'] > df.loc[i, 'Sol']) else 0
                for i in df.index]).query("keep == 1").drop('keep', axis=1)

In Python Pandas, searching for 4 consecutive rows where values go up

I am trying to figure out how I can mark the rows where the price is part of 4 increasing prices.
The "is_consecutive" column is that mark.
I managed to compute the diff between the rows:
df['diff1'] = df['Price'].diff()
But I didn't manage to find out which rows are part of 4 increasing prices.
I thought of using df.rolling().
The example df:
On rows 0-3, the output in the ["is_consecutive"] column should be True, because ['diff1'] shows 4 consecutive increasing rows.
On rows 8-11, the output in the ["is_consecutive"] column should be False, because the price stays flat there (['diff1'] of zero), so there is no increase.
       Date  Price  diff1  is_consecutive
0   1/22/20      0      0            True
1   1/23/20    130    130            True
2   1/24/20    144     14            True
3   1/25/20    150      6            True
4   1/27/20     60    -90           False
5   1/28/20     95     35           False
6   1/29/20    100      5           False
7   1/30/20     50    -50           False
8   2/01/20    100     50           False
9   1/02/20    100      0           False
10  1/03/20    100      0           False
11  1/04/20    100      0           False
12  1/05/20     50    -50           False
A general example:
if
price = [30,55,60,65,25]
then the diffs between consecutive numbers in the list will be:
diff1 = [0,25,5,5,-40]
So when diff1 is positive, it means the consecutive prices are increasing.
I need to mark (in the df) the rows that are part of 4 consecutive increases.
Thank You for help (-:
Try .rolling with a window of size 4 and min_periods=1:
df["is_consecutive"] = (
df["Price"]
.rolling(4, min_periods=1)
.apply(lambda x: (x.diff().fillna(0) >= 0).all())
.astype(bool)
)
print(df)
Prints:
      Date  Price  is_consecutive
0  1/22/20      0            True
1  1/23/20    130            True
2  1/24/20    144            True
3  1/25/20    150            True
4  1/26/20     60           False
5  1/26/20     95           False
6  1/26/20    100           False
7  1/26/20     50           False
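Note that the >= 0 test also counts flat runs as consecutive, so rows 8-11 of the question (where diff1 is zero) would come out True. If only strictly increasing runs should qualify, a strict variant of the same idea might look like this (a sketch, not from the original answer):
df["is_consecutive"] = (
    df["Price"]
    .rolling(4, min_periods=1)
    .apply(lambda x: (x.diff().dropna() > 0).all())
    .astype(bool)
)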
Assuming the dataframe is sorted, one way is based on the cumsum of the differences, identifying the first time an upward Price move follows a 3-day upward trend (i.e. 4 days of upward movement).
quant1 = (df['Price'].diff().apply(np.sign) == 1).cumsum()
quant2 = (df['Price'].diff().apply(np.sign) == 1).cumsum().where(~(df['Price'].diff().apply(np.sign) == 1)).ffill().fillna(0).astype(int)
df['is_consecutive'] = (quant1-quant2) >= 3
Note that the above takes into account only strictly increasing Prices (not equal ones).
Then we also override the is_consecutive tag for the previous 3 prices to be True, using the self-defined win_view function:
def win_view(x, size):
    if isinstance(x, list):
        x = np.array(x)
    if isinstance(x, pd.core.series.Series):
        x = x.values
    if not isinstance(x, np.ndarray):
        raise Exception('wrong type')
    return np.lib.stride_tricks.as_strided(
        x,
        shape=(x.size - size + 1, size),
        strides=(x.strides[0], x.strides[0])
    )
arr = win_view(df['is_consecutive'], 4)
arr[arr[:,3]] = True
Note that we replace the values in place.
EDIT 1
Inspired by the self-defined win_view function, I realized that the solution can be obtained with win_view alone (without the cumsums), as below:
df['is_consecutive'] = False
arr = win_view(df['Price'].diff(), 4)
arr_ind = win_view(list(df['Price'].index), 4)
mask = arr_ind[np.all(arr[:, 1:] > 0, axis=1)].flatten()
df.loc[mask, 'is_consecutive'] = True
We maintain 2 arrays, one for the returns and one for the indices. We collect the indices where there are 3 consecutive positive returns, np.all(arr[:, 1:] > 0, axis=1) (i.e. 4 up-moving prices), and mark those indices in the original df.
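As an aside, on NumPy 1.20+ the hand-rolled win_view can likely be replaced by the built-in sliding-window helper (a sketch under that version assumption):
import numpy as np

# equivalent of win_view(..., 4) for a 1-D array on NumPy >= 1.20
diffs = df['Price'].diff().to_numpy()
windows = np.lib.stride_tricks.sliding_window_view(diffs, 4)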
The function below returns a column named "consecutive_up", marking all rows that are part of a 5-increase series, and "consecutive_down", marking all rows that are part of a 4-decrease series.
def c_func(temp_df):
    temp_df['increase'] = temp_df['Price'] > temp_df['Price'].shift()
    temp_df['decrease'] = temp_df['Price'] < temp_df['Price'].shift()
    temp_df['consecutive_up'] = False
    temp_df['consecutive_down'] = False
    count = 0  # initialize explicitly so the first += cannot hit an unbound name
    for ind, row in temp_df.iterrows():
        if row['increase'] == True:
            count += 1
        else:
            count = 0
        if count == 5:
            temp_df.iloc[ind - 5:ind + 1, 4] = True
        elif count > 5:
            temp_df.iloc[ind, 4] = True
    count = 0
    for ind, row in temp_df.iterrows():
        if row['decrease'] == True:
            count += 1
        else:
            count = 0
        if count == 4:
            temp_df.iloc[ind - 4:ind + 1, 5] = True
        elif count > 4:
            temp_df.iloc[ind, 5] = True
    return temp_df

What is the equivalent of SAS proc format in Python?

I want
proc format;
    value RNG
        low - 24   = '1'
        24< - 35   = '2'
        35< - 44   = '3'
        44< - high = '4';
I need this in Python pandas.
If you are looking for an equivalent of the mapping function, you can use something like this.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=5), columns=['score'])
print(df)
output:
   score
0     73
1     90
2     83
3     40
4     76
Now let's apply the binning function to the score column and create a new column in the same dataframe.
def format_fn(x):
    # SAS ranges are right-inclusive: low-24, 24<-35, 35<-44, 44<-high
    if x <= 24:
        return '1'
    elif x <= 35:
        return '2'
    elif x <= 44:
        return '3'
    else:
        return '4'

df['binned_score'] = df['score'].apply(format_fn)
print(df)
output:
   score binned_score
0     73            4
1     90            4
2     83            4
3     40            3
4     76            4
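A closer pandas analogue of proc format's interval ranges may be pd.cut, whose default right-closed bins match SAS's low-24, 24<-35, 35<-44, 44<-high ranges. A sketch, reusing the df above:
import numpy as np
import pandas as pd

# right-inclusive bins by default: (-inf, 24], (24, 35], (35, 44], (44, inf)
bins = [-np.inf, 24, 35, 44, np.inf]
df['binned_score'] = pd.cut(df['score'], bins=bins, labels=['1', '2', '3', '4'])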

Alternatives to Dataframe.iterrows() or Dataframe.itertuples()?

My understanding of Pandas dataframe vectorization (through Pandas itself or through NumPy) is that it applies a function to an array, similar to .apply() (please correct me if I'm wrong). Suppose I have the following dataframe:
import pandas as pd
df = pd.DataFrame({'color': ['red','blue','yellow','orange','green',
                             'white','black','brown','orange-red','teal',
                             'beige','mauve','cyan','goldenrod','auburn',
                             'azure','celadon','lavender','oak','chocolate'],
                   'group': [1,1,1,1,1,
                             1,1,1,1,1,
                             1,2,2,2,2,
                             4,4,5,6,7]})
df = df.set_index('color')
df
For this data, I want to apply a special counter for each unique value in 'group'. Here's my current implementation of it:
df['C'] = 0
for value in set(df['group'].values):
    filtered_df = df[df['group'] == value]
    adj_counter = 0
    initialize_counter = -1
    spacing_counter = 20
    special_counters = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7]
    for color, rows in filtered_df.iterrows():
        if len(filtered_df.index) < 7:
            initialize_counter += 1
            df.loc[color, 'C'] = (46 + special_counters[initialize_counter])
        else:
            spacing_counter += 1
            if spacing_counter > 5:
                spacing_counter = 0
            df.loc[color, 'C'] = spacing_counter
df
Is there a faster way to implement this that doesn't involve iterrows or itertuples? Since the counting in the C column is very irregular, I'm not sure how I could implement this through apply or even through vectorization.
What you can do first is create the column 'C' with groupby on the column 'group' and cumcount; this almost represents spacing_counter or initialize_counter, depending on whether len(filtered_df.index) < 7 or not.
df['C'] = df.groupby('group').cumcount()
Now you need to select the appropriate rows for the if and the else parts of your code. One way is to use groupby again with transform to get the size of the group associated with each row. Then use loc on your df with this series: where the value is smaller than 7, map your values through special_counters; otherwise just take them modulo 6.
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7,'C'] = df.loc[ser_size < 7,'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7,'C'] %= 6
At the end, you get the expected result:
print (df)
            group   C
color
red             1   0
blue            1   1
yellow          1   2
orange          1   3
green           1   4
white           1   5
black           1   0
brown           1   1
orange-red      1   2
teal            1   3
beige           1   4
mauve           2  46
cyan            2  47
goldenrod       2  45
auburn          2  48
azure           4  46
celadon         4  47
lavender        5  46
oak             6  46
chocolate       7  46

if else function in pandas dataframe [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 4 years ago.
I'm trying to apply an if condition over a dataframe, but I'm missing something (error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().)
import pandas as pd

raw_data = {'age1': [23,45,21], 'age2': [10,20,50]}
df = pd.DataFrame(raw_data, columns=['age1','age2'])

def my_fun(var1, var2, var3):
    if (df[var1] - df[var2]) > 0:
        df[var3] = df[var1] - df[var2]
    else:
        df[var3] = 0
    print(df[var3])

my_fun('age1','age2','diff')
You can use numpy.where:
import numpy as np

def my_fun(var1, var2, var3):
    df[var3] = np.where((df[var1] - df[var2]) > 0, df[var1] - df[var2], 0)
    return df

df1 = my_fun('age1','age2','diff')
print(df1)
   age1  age2  diff
0    23    10    13
1    45    20    25
2    21    50     0
The error is explained in more detail here.
A slower solution uses apply, which needs axis=1 to process the data by rows:
def my_fun(x, var1, var2, var3):
    print(x)
    if (x[var1] - x[var2]) > 0:
        x[var3] = x[var1] - x[var2]
    else:
        x[var3] = 0
    return x

print(df.apply(lambda x: my_fun(x, 'age1', 'age2', 'diff'), axis=1))
   age1  age2  diff
0    23    10    13
1    45    20    25
2    21    50     0
It is also possible to use loc, but sometimes data can be overwritten:
def my_fun(x, var1, var2, var3):
    print(x)
    mask = (x[var1] - x[var2]) > 0
    x.loc[mask, var3] = x[var1] - x[var2]
    x.loc[~mask, var3] = 0
    return x

print(my_fun(df, 'age1', 'age2', 'diff'))
   age1  age2  diff
0    23    10  13.0
1    45    20  25.0
2    21    50   0.0
There is another way, without np.where or pd.Series.where. I'm not saying it is better, but after trying to adapt this solution to a challenging problem today, I found the syntax of where not so intuitive. In the end I'm not sure whether it would have been possible with where, but the following method lets you look at the subset before you modify it, which for me led more quickly to a solution. It works for the OP here as well.
You deliberately set a value on a slice of the dataframe, something Pandas so often warns you not to do.
This answer shows the correct method for doing that.
The following gives you a slice:
df.loc[df['age1'] - df['age2'] > 0]
...which looks like:
   age1  age2
0    23    10
1    45    20
Add an extra column to the original dataframe for the values you want to remain after modifying the slice:
df['diff'] = 0
Now modify the slice:
df.loc[df['age1'] - df['age2'] > 0, 'diff'] = df['age1'] - df['age2']
...and the result:
   age1  age2  diff
0    23    10    13
1    45    20    25
2    21    50     0
You can use pandas.Series.where
df.assign(age3=(df.age1 - df.age2).where(df.age1 > df.age2, 0))
   age1  age2  age3
0    23    10    13
1    45    20    25
2    21    50     0
You can wrap this in a function:
def my_fun(v1, v2):
    return v1.sub(v2).where(v1 > v2, 0)

df.assign(age3=my_fun(df.age1, df.age2))
   age1  age2  age3
0    23    10    13
1    45    20    25
2    21    50     0
