np.where is not calculating correctly? - python

I am using the np.where function to calculate the Supertrend (a stock market indicator). For this I need to do a simple calculation, but np.where is giving wrong results most of the time. I have done the same calculation in Excel too, and I am not sure why np.where calculates the Upper_Band column differently.
Here is the Excel calculation, which I think is correct.
Here is the Excel sheet generated after running the Python code.
From S.NO 6 onwards the calculation is different.
This is the actual Python code:
import numpy as np

# True Range components
data["High-Low"] = data["High"] - data["Low"]
data["Close-low"] = abs(data["Close"].shift(1) - data["Low"])
data["Close-High"] = abs(data["Close"].shift(1) - data["High"])
data["True_Range"] = data[["High-Low", "Close-low", "Close-High"]].max(axis=1)

# 10-period average true range and the basic bands (midpoint +/- 2 * ATR)
data["ATR"] = data["True_Range"].rolling(window=10).mean()
data["Basic_Upper_Band"] = (data["High"] + data["Low"]) / 2 + (data["ATR"] * 2)
data["Basic_Lower_Band"] = (data["High"] + data["Low"]) / 2 - (data["ATR"] * 2)

# Final bands
data["Upper_Band"] = 0
data["Lower_Band"] = 0
PreviousUB = data["Upper_Band"].shift(1)
BasicUB = data["Basic_Upper_Band"]
PreviousClose = data["Close"].shift(1)
data["Upper_Band"] = np.where((PreviousUB > BasicUB) | (PreviousUB < PreviousClose), BasicUB, PreviousUB)
Here is the output in Python format too.
The main calculation in the code is data["Upper_Band"]. I am using the same formula in Excel and through np.where, but it gives me different results.
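One thing worth noting: np.where evaluates its inputs once, so PreviousUB here is the shift of the zero-filled Upper_Band column created a few lines earlier, not the running band that the Excel sheet carries forward row by row. As an illustration of that recursive dependency, here is a minimal loop sketch (same column names as above; it assumes the NaN rows from the ATR warm-up should keep the basic band, and it is a sketch of the recursion rather than a full Supertrend implementation):
import numpy as np

# Start from the basic band, then carry the previous final band forward
# whenever neither switch condition fires (mirrors the Excel recursion).
ub = data["Basic_Upper_Band"].to_numpy().copy()
close = data["Close"].to_numpy()
for i in range(1, len(ub)):
    if np.isnan(ub[i - 1]):
        continue  # ATR warm-up rows: keep the basic band
    take_basic = (ub[i - 1] > ub[i]) or (ub[i - 1] < close[i - 1])
    if not take_basic:
        ub[i] = ub[i - 1]
data["Upper_Band"] = ub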

Related

Return dataframe with columns in python

I'm trying to create a function to calculate Heikin Ashi candles (for financial analysis).
My indicators.py file looks like
from pandas import DataFrame, Series

def heikinashi(dataframe):
    open = dataframe['open']
    high = dataframe['high']
    low = dataframe['low']
    close = dataframe['close']
    ha_close = 0.25 * (open + high + low + close)
    ha_open = 0.5 * (open.shift(1) + close.shift(1))
    ha_low = max(high, ha_open, ha_close)
    ha_high = min(low, ha_open, ha_close)
    return dataframe, ha_close, ha_open, ha_low, ha_high
And in my main script I'm trying to call this function in the most effective way, to get back those four series: ha_close, ha_open, ha_low and ha_high.
My main script looks something like:
import indicators as ata
ha_close, ha_open, ha_low, ha_high = ata.heikinashi(dataframe)
dataframe['ha_close'] = ha_close(dataframe)
dataframe['ha_open'] = ha_open(dataframe)
dataframe['ha_low'] = ha_low(dataframe)
dataframe['ha_high'] = ha_high(dataframe)
but for some strange reason I cannot find the dataframes.
What would be the most efficient way to do this, code-wise and with minimal calls?
I expect dataframe['ha_close'] etc. to come back with the correct data, as shown in the function.
Any advice appreciated.
Thank you!
In heikinashi() you are returning five elements, dataframe plus the four ha_* series; however, in your main script you are only assigning the return values to four variables.
Try changing the main script to:
dataframe, ha_close, ha_open, ha_low, ha_high = ata.heikinashi(dataframe)
Apart from that, it looks like dataframe which you are passing into heikinashi() is not defined anywhere.
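For what it's worth, here is a minimal sketch of a version that assigns the columns directly on the dataframe, so the main script needs only one call. Two assumptions on my part: the element-wise extremes should come from DataFrame.max(axis=1)/min(axis=1) (the built-in max()/min() on Series raises a ValueError), and ha_high/ha_low look swapped in the original (high should feed the max and low the min):
import pandas as pd

def heikinashi(dataframe):
    # HA close: average of the current bar's OHLC values
    dataframe['ha_close'] = 0.25 * (dataframe['open'] + dataframe['high']
                                    + dataframe['low'] + dataframe['close'])
    # HA open: midpoint of the previous bar's open and close
    dataframe['ha_open'] = 0.5 * (dataframe['open'].shift(1)
                                  + dataframe['close'].shift(1))
    # Element-wise extremes across the three candidate columns
    dataframe['ha_high'] = dataframe[['high', 'ha_open', 'ha_close']].max(axis=1)
    dataframe['ha_low'] = dataframe[['low', 'ha_open', 'ha_close']].min(axis=1)
    return dataframe

dataframe = heikinashi(dataframe)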

Python Time Series has been differenced, how do I undifference to make the values normal again

I'm working on a time series; it has dates as the index and a value field. I used these two lines to difference the data.
df['value2'] = (df['value'] - df.value.rolling(window=12).mean()) / df.value.rolling(window=12).std()
df['value3'] = df['value2'] - df['value2'].shift(12)
This made my dataset stationary, so I'm happy to continue using this.
Now I have run some analysis on it, and I have values which I'm trying to undifference.
If my result dataset is saved in df_results, how do I make these values normal again (undifference them)? Is there a way to reverse the transformations?
** SOLUTION **
I figured out a way to reverse the differencing on the dataset.
# DIFFERENCING
df['stp1'] = (df['cpi'] - df.cpi.rolling(window=12).mean())
df['stp2'] = df['stp1'] / df.cpi.rolling(window=12).std()
df['stp3'] = df['stp2'] - df['stp2'].shift(12)
# INVERSE DIFFERENCING
df['stp3r'] = df['stp3'] + df['stp2'].shift(12)
df['stp2r'] = df['stp3r'] * df.cpi.rolling(window=12).std()
df['stp1r'] = (df['stp2r'] + df.cpi.rolling(window=12).mean())
To apply this to a forecasted dataset I followed a very similar approach. Here the only new variable is 'wmar', which is where the differenced forecast is saved; the last field, 'fcast3', is where the reverse-differenced forecast ends up:
df['fcast'] = wmar + df['stp2'].shift(12)
df['fcast2'] = df['fcast'] * df.cpi.rolling(window=12).std()
df['fcast3'] = (df['fcast2'] + df.cpi.rolling(window=12).mean())
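As a sanity check, here is a self-contained sketch of that round trip on synthetic data (the 'cpi' series below is made up purely for illustration); wherever the rolling windows and shifts are all defined, the reconstruction matches the original exactly:
import numpy as np
import pandas as pd

# Hypothetical price-level series, long enough for a 12-period window
df = pd.DataFrame({'cpi': 100 + np.arange(60.0) + np.random.default_rng(0).normal(0, 2, 60)})

roll_mean = df.cpi.rolling(window=12).mean()
roll_std = df.cpi.rolling(window=12).std()

# DIFFERENCING
df['stp1'] = df['cpi'] - roll_mean
df['stp2'] = df['stp1'] / roll_std
df['stp3'] = df['stp2'] - df['stp2'].shift(12)

# INVERSE DIFFERENCING (undo each step in reverse order)
df['stp3r'] = df['stp3'] + df['stp2'].shift(12)
df['stp2r'] = df['stp3r'] * roll_std
df['stp1r'] = df['stp2r'] + roll_mean

mask = df['stp1r'].notna()
assert np.allclose(df.loc[mask, 'stp1r'], df.loc[mask, 'cpi'])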

Calculating On-Balance Volume (OBV) with Python Pandas

I have a trading Python Pandas DataFrame which includes the "close" and "volume". I want to calculate the On-Balance Volume (OBV). I've got it working over the entire dataset but I want it to be calculated on a rolling series of 10.
The current function looks as follows...
def calculateOnBalanceVolume(df):
    df['obv'] = 0
    index = 1
    while index <= len(df) - 1:
        if df.iloc[index]['close'] > df.iloc[index-1]['close']:
            df.at[index, 'obv'] += df.at[index-1, 'obv'] + df.at[index, 'volume']
        if df.iloc[index]['close'] < df.iloc[index-1]['close']:
            df.at[index, 'obv'] += df.at[index-1, 'obv'] - df.at[index, 'volume']
        index = index + 1
    return df
This creates the "obv" column and works out the OBV over the 300 entries.
Ideally I would like to do something like this...
data['obv10'] = data.volume.rolling(10, min_periods=1).apply(calculateOnBalanceVolume)
This looks like it has potential to work but the problem is the "apply" only passes in the "volume" column so you can't work out the change in closing price.
I also tried this...
data['obv10'] = data[['close','volume']].rolling(10, min_periods=1).apply(calculateOnBalanceVolume)
Which sort of works but it tries to update the "close" and "volume" columns instead of adding the new "obv10" column.
What is the best way of doing this or do you just have to iterate over the data in batches of 10?
I found a more efficient way of writing the code above from this link:
Calculating stocks's On Balance Volume (OBV) in python
import numpy as np

def calculateOnBalanceVolume(df):
    df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
                np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
    return df
The problem is this still does the entire data set. This looks pretty good but how can I cycle through it in batches of 10 at a time without looping or iterating through the entire data set?
*** UPDATE ***
I've got slightly closer to getting this working. I have managed to calculate the OBV in groups of 10.
for gid, df in data.groupby(np.arange(len(data)) // 10):
    df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
                np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
I want this to be calculated rolling not in groups. Any idea how to do this using Pandas in an efficient way?
*** UPDATE ***
It turns out that OBV is supposed to be calculated over the entire data set. I've settled on the following code, which looks correct now.
# calculate on-balance volume (obv)
self.df['obv'] = np.where(self.df['close'] > self.df['close'].shift(1), self.df['volume'],
np.where(self.df['close'] < self.df['close'].shift(1), -self.df['volume'], self.df.iloc[0]['volume'])).cumsum()
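As a small worked example of what the nested np.where does, here it is on a few made-up rows. Note that the first row seeds the series with its own volume, because both comparisons against the shifted NaN are False and fall through to the else value:
import numpy as np
import pandas as pd

df = pd.DataFrame({'close':  [10.0, 10.5, 10.2, 10.8, 11.0],
                   'volume': [100,  150,  120,  80,   200]})

# +volume on up days, -volume on down days, first row's volume as the seed
signed = np.where(df['close'] > df['close'].shift(1), df['volume'],
         np.where(df['close'] < df['close'].shift(1), -df['volume'],
                  df.iloc[0]['volume']))
df['obv'] = signed.cumsum()
print(df['obv'].tolist())  # [100.0, 250.0, 130.0, 210.0, 410.0]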

How to replace a loop that looks at multiple previous values with a formula in Python

My Problem
I have a loop that creates a column using either a formula based on values from other columns, or the previous value in the column, depending on a condition ("days from new low" == 0). It is really slow over a huge dataset, so I want to get rid of the loop and find a faster vectorized formula.
Current Working Code
import numpy as np
import pandas as pd

csv1 = pd.read_csv('stock_price.csv', delimiter=',')
df = pd.DataFrame(csv1)

for x in range(1, len(df.index)):
    if df["days from new low"].iloc[x] == 0:
        df["mB"].iloc[x] = (df["RSI on new low"].iloc[x-1] - df["RSI on new low"].iloc[x]) / -df["days from new low"].iloc[x-1]
    else:
        df["mB"].iloc[x] = df["mB"].iloc[x-1]
df
Input Data and Expected Output
RSI on new low,days from new low,mB
0,22,0
29.6,0,1.3
29.6,1,1.3
29.6,2,1.3
29.6,3,1.3
29.6,4,1.3
21.7,0,-2.0
21.7,1,-2.0
21.7,2,-2.0
21.7,3,-2.0
21.7,4,-2.0
21.7,5,-2.0
21.7,6,-2.0
21.7,7,-2.0
21.7,8,-2.0
21.7,9,-2.0
25.9,0,0.5
25.9,1,0.5
25.9,2,0.5
23.9,0,-1.0
23.9,1,-1.0
Attempt at Solution
def mB_calc(var1, var2, var3):
    df[var3] = np.where(df[var1] == 0, df[var2].shift(1) - df[var2] / -df[var1].shift(1), "")
    return df
df = mB_calc('days from new low','RSI on new low','mB')
First, it gives me "TypeError: can't multiply sequence by non-int of type 'float'", and second, I don't know how to incorporate the "ffill" into the formula.
Any idea how I might be able to do it?
Cheers!
Try this one:
df["mB_temp"] = (df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift()
df["mB"] = df["mB"].shift()
df["mB"].loc[df["days from new low"] == 0]=df["mB_temp"].loc[df["days from new low"] == 0]
df.drop(["mB_temp"], axis=1)
And with np.where:
df["mB"] = np.where(df["days from new low"]==0, df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift(), df["mB"].shift())

Omnet++ / Data in a pandas cell(list) vs pandas series(column)

So I'm using Omnet++, a discrete event network simulator, to simulate different networking scenarios. At some point one can further process Omnet++ output statistics and store them in a .csv file.
The interesting thing about it is that for each time (vectime) there is a value (vecvalue). Those vectime/vecvalue arrays are stored in a single cell of the .csv file. When imported into a Pandas DataFrame, I get something like this:
In [45]: df1[['module','vectime','vecvalue']]
Out[45]:
module vectime vecvalue
237 Tictoc13.tic[1] [2.542245319062, 3.066965320033, 4.78723506093... [0.334535581612, 0.390459633837, 0.50391696492...
249 Tictoc13.tic[4] [2.649303071938, 6.02527384362, 21.42434044990... [2.649303071938, 1.654927100273, 3.11051622577...
261 Tictoc13.tic[3] [4.28876656608, 16.104821448604, 19.5989313700... [2.245250432259, 3.201153958979, 2.39023520069...
277 Tictoc13.tic[2] [13.884917126016, 21.467263378748, 29.59962616... [0.411703261805, 0.764708518232, 0.83288346614...
289 Tictoc13.tic[5] [14.146524815409, 14.349744576545, 24.95022463... [1.732060647139, 8.66456377103, 2.275388282721...
For example, if I needed to plot each vectime/vecvalue for each module, today I'm doing the following...
%pylab

def runningAvg(x):
    sigma_x = np.cumsum(x)
    sigma_n = np.arange(1, x.size + 1)
    return sigma_x / sigma_n

for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
... to obtain this ...
My question is: what's best in terms of performance?
- use the data as is, i.e. those arrays inside each cell, looping over the DF to plot each array;
- convert those arrays to pd.Series; in that case, what would be the best way to keep the module as the index?
- would I benefit from unnesting those arrays into pd.Series?
thanks!
Well, I've dug around a bit, and it seems that converting Omnet data into pd.Series might not be as efficient as I thought.
These are my two methods:
1) Using Omnet data as is, lists inside the Pandas DF.
import datetime

figure(1)
start = datetime.datetime.now()
for row in df1.itertuples():
    t = row.vectime
    x = row.vecvalue
    x = runningAvg(x)
    plot(t, x)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
When running the above, the total is 0.026571 seconds.
2) Converting Omnet data to pd.Series.
To obtain the same result, I had to transpose the series several times.
figure(2)
start = datetime.datetime.now()
t = df1.vectime
v = df1.vecvalue
# expand each list cell into its own row of a wide DataFrame, then transpose
t = t.apply(pd.Series)
v = v.apply(pd.Series)
t = t.T
v = v.T
# running average: cumulative sum over each column divided by the count so far
sigma_v = np.cumsum(v)
sigma_n = np.arange(1, v.shape[0] + 1)
sigma = sigma_v.T / sigma_n
plot(t, sigma.T)
total = (datetime.datetime.now() - start).total_seconds()
print(total)
For the latter, the total is 0.57266 seconds.
So it seems that I'll stick to method 1, looping over the different rows.
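For completeness, a third option I did not benchmark: newer pandas (1.3+) can unnest both list columns in one go with DataFrame.explode, after which the running average falls out of a groupby with expanding().mean(). A rough sketch, reusing df1 from above:
# One (time, value) pair per row, in long form
long_df = (df1[['module', 'vectime', 'vecvalue']]
           .explode(['vectime', 'vecvalue'])
           .astype({'vectime': float, 'vecvalue': float}))

# Running average per module, equivalent to runningAvg() above
long_df['runavg'] = long_df.groupby('module')['vecvalue'].transform(
    lambda v: v.expanding().mean())

for module, grp in long_df.groupby('module'):
    plot(grp['vectime'], grp['runavg'])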
