I have a Python Pandas DataFrame of trading data which includes "close" and "volume" columns. I want to calculate the On-Balance Volume (OBV). I've got it working over the entire dataset, but I want it to be calculated over a rolling window of 10.
The current function looks as follows...
def calculateOnBalanceVolume(df):
    df['obv'] = 0
    index = 1
    while index <= len(df) - 1:
        if df.iloc[index]['close'] > df.iloc[index - 1]['close']:
            df.at[index, 'obv'] = df.at[index - 1, 'obv'] + df.at[index, 'volume']
        elif df.iloc[index]['close'] < df.iloc[index - 1]['close']:
            df.at[index, 'obv'] = df.at[index - 1, 'obv'] - df.at[index, 'volume']
        else:
            # unchanged close: carry the previous OBV forward
            df.at[index, 'obv'] = df.at[index - 1, 'obv']
        index = index + 1
    return df
This creates the "obv" column and works out the OBV over the 300 entries.
Ideally I would like to do something like this...
data['obv10'] = data.volume.rolling(10, min_periods=1).apply(calculateOnBalanceVolume)
This looks like it has potential to work, but the problem is that "apply" only passes in the "volume" column, so you can't work out the change in closing price.
I also tried this...
data['obv10'] = data[['close','volume']].rolling(10, min_periods=1).apply(calculateOnBalanceVolume)
This sort of works, but it tries to update the "close" and "volume" columns instead of adding the new "obv10" column.
What is the best way of doing this or do you just have to iterate over the data in batches of 10?
I found a more efficient way of doing the code above from this link:
Calculating stocks's On Balance Volume (OBV) in python
import numpy as np

def calculateOnBalanceVolume(df):
    df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
                np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
    return df
The problem is that this still runs over the entire data set. This looks pretty good, but how can I cycle through it in batches of 10 at a time without looping or iterating through the entire data set?
*** UPDATE ***
I've got slightly closer to getting this working. I have managed to calculate the OBV in groups of 10.
for gid, df in data.groupby(np.arange(len(data)) // 10):
    df['obv'] = np.where(df['close'] > df['close'].shift(1), df['volume'],
                np.where(df['close'] < df['close'].shift(1), -df['volume'], 0)).cumsum()
I want this to be calculated on a rolling basis, not in fixed groups. Any idea how to do this using Pandas in an efficient way?
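For what it's worth, here is a hedged sketch of one way to get a rolling 10-bar OBV without applying a function per window: OBV is just a cumulative sum of signed volume, so an OBV restricted to the last 10 bars is simply a rolling sum of the same signed-volume series. This assumes a DataFrame named data with "close" and "volume" columns:

import numpy as np

# signed volume: +volume on an up close, -volume on a down close, 0 if unchanged
signed_volume = np.sign(data['close'].diff()).fillna(0) * data['volume']

# full-history OBV is the cumulative sum of the signed volume...
data['obv'] = signed_volume.cumsum()

# ...and a 10-bar version is a rolling sum of the same series
data['obv10'] = signed_volume.rolling(10, min_periods=1).sum()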
*** UPDATE ***
It turns out that OBV is supposed to be calculated over the entire data set. I've settled on the following code which looks correct now.
# calculate on-balance volume (obv)
self.df['obv'] = np.where(self.df['close'] > self.df['close'].shift(1), self.df['volume'],
                 np.where(self.df['close'] < self.df['close'].shift(1), -self.df['volume'],
                          self.df.iloc[0]['volume'])).cumsum()
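As a side note (not part of the original post), the same full-history OBV can be written a touch more compactly with np.sign; the differences are that unchanged closes contribute 0 rather than the first row's volume, and the series starts at 0 on the first row:

# equivalent sketch: sign of the close-to-close change times volume, accumulated
self.df['obv'] = (np.sign(self.df['close'].diff()).fillna(0) * self.df['volume']).cumsum()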
Related
import pandas as pd

def split_trajectories(df):
    trajectories_list = []
    count = 0
    for record in range(len(df)):
        if record == 0:
            continue
        # start a new trajectory when the gap between consecutive rows exceeds 30 seconds
        if df['time'].iloc[record] - df['time'].iloc[record - 1] > pd.Timedelta('0 days 00:00:30'):
            temp_df = df[count:record].reset_index()
            if not temp_df.empty:
                if len(temp_df) > 50:
                    trajectories_list.append(temp_df)
            count = record
    return trajectories_list
This is a Python function that receives a pandas DataFrame and divides it into a list of DataFrames wherever the time delta between consecutive rows is greater than 30 seconds and the resulting DataFrame contains more than 50 records. In my case I need to execute this function thousands of times, and I wonder if anyone can help me optimize it. Thanks in advance!
I have tried to optimize it as much as I could.
You're doing a few things right here, like iterating using a range instead of .iterrows and using .iloc.
Simple things you can do:
Switch .iloc to .iat since you only need 'time' anyway
Don't recalculate Timedelta every time, just save it as a variable
The big thing is that you don't need to actually save each temp_df, or even create it. You can save a tuple (count, record) and retrieve from df afterwards, as needed. (Although, I have to admit I don't quite understand what you're accomplishing with the count:record logic; should one of your .iloc's involve count instead?)
def split_trajectories(df):
    trajectories_list = []
    td = pd.Timedelta('0 days 00:00:30')
    count = 0
    for record in range(1, len(df)):
        if df['time'].iat[record] - df['time'].iat[record - 1] > td:
            if record - count > 50:
                new_tuple = (count, record)
                trajectories_list.append(new_tuple)
            count = record  # start the next candidate trajectory here
    return trajectories_list
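To turn those (count, record) pairs back into DataFrames later, a small usage sketch (positional slicing, matching the count:record convention above):

trajectories = split_trajectories(df)
# materialise only the slices you actually need, when you need them
sub_dfs = [df.iloc[start:stop].reset_index(drop=True) for start, stop in trajectories]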
Maybe you can try the one below. You could use a counter if you want to keep track of the number of DataFrames, or else the length of trajectories_list will give you that info.
def split_trajectories(df):
    trajectories_list = []
    df_difference = df['time'].diff()
    if not df_difference.empty and df_difference.shape[0] > 50:
        trajectories_list.append(df_difference)
    return trajectories_list
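For comparison, a hedged, fully vectorised sketch of the original splitting logic: flag the rows where the gap to the previous row exceeds 30 seconds, turn the cumulative sum of those flags into group ids, and keep only groups longer than 50 rows. The gap and min_len parameters are illustrative names; a sorted, datetime-typed 'time' column is assumed:

import pandas as pd

def split_trajectories_vectorised(df, gap='30s', min_len=50):
    # a new trajectory starts wherever the gap to the previous row exceeds the threshold
    new_trajectory = df['time'].diff() > pd.Timedelta(gap)
    group_ids = new_trajectory.cumsum()
    # keep only trajectories with more than min_len rows
    return [g.reset_index(drop=True)
            for _, g in df.groupby(group_ids)
            if len(g) > min_len]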
I am using the np.where function to calculate the Supertrend (a stock-market indicator). For this I need to do a simple calculation, but np.where is giving wrong results most of the time. I have done the same calculation in Excel too, and I am not sure why np.where gives the wrong result when calculating the Upper_Band column.
Here is the Excel calculation, which I think is correct.
Here is the Excel output generated after running the Python code.
From S.NO 6 onwards the calculation is different.
This is the actual Python code:
import numpy as np
data["High-Low"] = data["High"] - data["Low"]
data["Close-low"] = abs(data["Close"].shift(1) - data["Low"])
data["Close-High"] = abs(data["Close"].shift(1) - data["High"])
data["True_Range"] = data[["High-Low", "Close-low", "Close-High"]].max(axis=1)
data["ATR"] = data["True_Range"].rolling(window=10).mean()
data["Basic_Upper_Band"] = (data["High"] + data["Low"])/2 + (data["ATR"]*2)
data["Basic_Lower_Band"] = (data["High"] + data["Low"])/2 - (data["ATR"]*2)
data["Upper_Band"] = 0
data["Lower_Band"] = 0
PreviousUB = data["Upper_Band"].shift(1)
BasicUB = data["Basic_Upper_Band"]
PreviousClose = data["Close"].shift(1)
data["Upper_Band"] = np.where((PreviousUB > BasicUB) | (PreviousUB < PreviousClose),BasicUB,PreviousUB)
Here is the output in Python format too.
The main calculation in the code is data["Upper_Band"]. I am using the same formula in Excel and through np.where, but it is giving me different results.
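An observation (not part of the original post) on why the results may diverge: in the Excel sheet Upper_Band refers to its own previous value, i.e. the formula is recursive, but in the code PreviousUB is taken from a column that is still all zeros, so a single vectorised np.where cannot reproduce that recursion. A minimal loop sketch of the recursive rule, assuming the columns computed above already exist:

import numpy as np

upper_band = data["Basic_Upper_Band"].copy()
for i in range(1, len(data)):
    prev_ub = upper_band.iat[i - 1]
    basic_ub = data["Basic_Upper_Band"].iat[i]
    prev_close = data["Close"].iat[i - 1]
    # same condition as the np.where line: take the basic band if it is tighter than
    # the previous final band or if the previous close broke above the previous band;
    # while the ATR is still warming up (NaN), just take the basic band
    if np.isnan(prev_ub) or prev_ub > basic_ub or prev_ub < prev_close:
        upper_band.iat[i] = basic_ub
    else:
        upper_band.iat[i] = prev_ub
data["Upper_Band"] = upper_band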
My Problem
I have a loop that creates a column using either a formula based on values from other columns or the previous value in the column depending on a condition ("days from new low == 0"). It is really slow over a huge dataset so I wanted to get rid of the loop and find a formula that is faster.
Current Working Code
import numpy as np
import pandas as pd

csv1 = pd.read_csv('stock_price.csv', delimiter=',')
df = pd.DataFrame(csv1)

for x in range(1, len(df.index)):
    if df["days from new low"].iloc[x] == 0:
        df["mB"].iloc[x] = (df["RSI on new low"].iloc[x-1] - df["RSI on new low"].iloc[x]) / -df["days from new low"].iloc[x-1]
    else:
        df["mB"].iloc[x] = df["mB"].iloc[x-1]

df
Input Data and Expected Output
RSI on new low,days from new low,mB
0,22,0
29.6,0,1.3
29.6,1,1.3
29.6,2,1.3
29.6,3,1.3
29.6,4,1.3
21.7,0,-2.0
21.7,1,-2.0
21.7,2,-2.0
21.7,3,-2.0
21.7,4,-2.0
21.7,5,-2.0
21.7,6,-2.0
21.7,7,-2.0
21.7,8,-2.0
21.7,9,-2.0
25.9,0,0.5
25.9,1,0.5
25.9,2,0.5
23.9,0,-1.0
23.9,1,-1.0
Attempt at Solution
def mB_calc(var1, var2, var3):
    df[var3] = np.where(df[var1] == 0, df[var2].shift(1) - df[var2] / -df[var1].shift(1), "")
    return df

df = mB_calc('days from new low', 'RSI on new low', 'mB')
First, it gives me "TypeError: can't multiply sequence by non-int of type 'float'", and second, I don't know how to incorporate the "ffill" into the formula.
Any idea how I might be able to do it?
Cheers!
Try this one:
df["mB_temp"] = (df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift()
df["mB"] = df["mB"].shift()
df["mB"].loc[df["days from new low"] == 0]=df["mB_temp"].loc[df["days from new low"] == 0]
df.drop(["mB_temp"], axis=1)
And with np.where:
df["mB"] = np.where(df["days from new low"]==0, df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift(), df["mB"].shift())
I am attempting to collect counts of occurrences of an id between two time periods in a dataframe. I have a moderately sized dataframe (about 400 unique ids and just short of 1m rows) containing a time of occurrence and an id for the account which caused the occurrence. I am attempting to get a count of occurrences for multiple time periods (1 hour, 6 hours, 1 day, etc.) prior to a specific occurrence and have run into lots of difficulties.
I am using Python 3.7, and for this instance I only have the pandas package loaded. I have tried using for loops, and while that likely would have worked (eventually), I am looking for something a bit more efficient time-wise. I have also tried using a list comprehension and have run into some errors that I did not anticipate when dealing with datetime columns. Examples of both are below.
## Sample data
data = {'id':[ 'EAED813857474821E1A61F588FABA345', 'D528C270B80F11E284931A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '7B9C7C02F19711E38C670EDFB82A24A9', '80B409D1EC3D4CC483239D15AAE39F2E', '314EB192F25F11E3B68A0EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', '156097CF030E4519DBDF84419B855E10', 'EE80E4C0B82B11E28C561A7D66640965', 'CA9F2DF6B82011E28C561A7D66640965', '6F394474B8C511E2A76C1A7D66640965', '314EB192F25F11E3B68A0EDFB82A24A9', 'D528C270B80F11E284931A7D66640965', '3A024345C1E94CED8C7E0DA3A96BBDCA', '314EB192F25F11E3B68A0EDFB82A24A9', '47C18B6B38E540508561A9DD52FD0B79', 'B72F6EA5565B49BBEDE0E66B737A8E6B', '47C18B6B38E540508561A9DD52FD0B79', 'B92CB51EFA2611E2AEEF1A7D66640965', '136EDF0536F644E0ADE6F25BB293DD17', '7B9C7C02F19711E38C670EDFB82A24A9', 'C5FAF9ACB88D4B55AB8196DBFFE5B3C0', '1557D4ECEFA74B40C718A4E5425F3ACB', '68D30EE473FE11E49C060EDFB82A24A9', '68D30EE473FE11E49C060EDFB82A24A9', 'CAF9D8CD627B422DFE1D587D25FC4035', 'C620D865AEE1412E9F3CA64CB86DC484', '47C18B6B38E540508561A9DD52FD0B79', 'CA9F2DF6B82011E28C561A7D66640965', '06E2501CB81811E290EF1A7D66640965', '68EEE17873FE11E4B5B90AFEF9534BE1', '47C18B6B38E540508561A9DD52FD0B79', '1BFE9CB25AD84B64CC2D04EF94237749', '7B20C2BEB82811E28C561A7D66640965', '261692EA8EE447AEF3804836E4404620', '74D7C3901F234993B4788EFA9E6BEE9E', 'CAF9D8CD627B422DFE1D587D25FC4035', '76AAF82EB8C511E2A76C1A7D66640965', '4BD38D6D44084681AFE13C146542A565', 'B8D27E80B82911E28C561A7D66640965' ], 'datetime':[ "24/06/2018 19:56", "24/05/2018 03:45", "12/01/2019 14:36", "18/08/2018 22:42", "19/11/2018 15:43", "08/07/2017 21:32", "15/05/2017 14:00", "25/03/2019 22:12", "27/02/2018 01:59", "26/05/2019 21:50", "11/02/2017 01:33", "19/11/2017 19:17", "04/04/2019 13:46", "08/05/2019 14:12", "11/02/2018 02:00", "07/04/2018 16:15", "29/10/2016 20:17", "17/11/2018 21:58", "12/05/2017 16:39", "28/01/2016 19:00", "24/02/2019 19:55", "13/06/2019 19:24", "30/09/2016 18:02", "14/07/2018 17:59", "06/04/2018 22:19", "25/08/2017 17:51", "07/04/2019 02:24", "26/05/2018 17:41", "27/08/2014 06:45", "15/07/2016 19:30", "30/10/2016 20:08", "15/09/2018 18:45", "29/01/2018 02:13", "10/09/2014 23:10", "11/05/2017 22:00", "31/05/2019 23:58", "19/02/2019 02:34", "02/02/2019 01:02", "27/04/2018 04:00", "29/11/2017 20:35"]}
df = pd.DataFrame(data)
# parse the timestamp strings so the datetime comparisons below work
df['datetime'] = pd.to_datetime(df['datetime'], format='%d/%m/%Y %H:%M')
df = df.sort_values(['id', 'datetime'], ascending=True)
# for loop attempt
from datetime import timedelta

totalAccounts = df['id'].unique()
for account in totalAccounts:
    oneHourCount = 0
    subset = df[df['id'] == account]
    for i in range(len(subset)):
        onehour = subset['datetime'].iloc[i] - timedelta(hours=1)
        for j in range(len(subset)):
            # count occurrences in the hour before the i-th occurrence
            if (subset['datetime'].iloc[j] >= onehour) and (subset['datetime'].iloc[j] < subset['datetime'].iloc[i]):
                oneHourCount += 1
# list comprehension attempt
df['onehour'] = df['datetime'] - timedelta(hours=1)
for account in totalAccounts:
    onehour = sum([1 for x in subset['datetime'] if x >= subset['onehour'] and x < subset['datetime']])
I am getting either 1) an incredibly long runtime with the for loop or 2) a ValueError regarding the truth value of a Series being ambiguous. I know the issue is dealing with the datetimes, and perhaps it is just going to be slow going, but I want to check here first just to make sure.
So I was able to figure this out using bisection. If you have a similar question please PM me and I'd be more than happy to help.
Solution:
from bisect import bisect_left, bisect_right

# keys is assumed to be the sorted list of datetimes for the current id
left = bisect_left(keys, subset['start_time'].iloc[i])    # calculated time (start of the window)
right = bisect_right(keys, subset['datetime'].iloc[i])    # actual time of occurrence
count = len(subset['datetime'][left:right])
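For anyone after a loop-free alternative (a hedged sketch, not the poster's bisect solution): with the timestamps as a DatetimeIndex you can count occurrences per id inside a trailing time window using a time-based rolling count; swap '1h' for '6h', '1d', etc. for the other windows. Note the window includes the current occurrence, hence the minus 1:

counts = (df.sort_values('datetime')
            .set_index('datetime')
            .groupby('id')['id']
            .rolling('1h')
            .count() - 1)  # occurrences in the hour before (and excluding) each occurrence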
I'm working with a csv file in Python using Pandas.
I'm having some trouble working out how to achieve the following goal.
What I need to achieve is to group entries using a similarity function.
For example, each group X should contain all entries where each pair in the group differs by at most Y on a certain attribute-column value.
Given this example of CSV:
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
and considering the age column, with a difference of at most 3 on this value I would get the following groups:
A = {john;male;newyork;20
jack;male;newyork;21}
B={eric;male;san francisco;29
jenny;female;boston2;30}
C={mary;female;losangeles;45
maryanne;female;losangeles;48}
D={maryanne;female;losangeles;48
mattia;na;BostonDynamics;50}
This is my current work-around, but I hope there is something more Pythonic.
import pandas as pandas
import numpy as numpy


def main():
    csv_path = "../resources/dataset_string.csv"
    csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
    print("\nOriginal Values:")
    print(csv_data_frame)

    sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
    print("\nSorted Values by AGE & NAME:")
    print(sorted_df)

    min_age = int(numpy.min(sorted_df["age"]))
    print("\nMin_Age:", min_age)
    max_age = int(numpy.max(sorted_df["age"]))
    print("\nMax_Age:", max_age)

    threshold = 3
    bins = numpy.arange(min_age, max_age, threshold)
    print("Bins:", bins)
    ind = numpy.digitize(sorted_df["age"], bins)
    print(ind)

    print("\n\nClustering by hand:\n")
    current_min = min_age
    for cluster in range(min_age, max_age, threshold):
        next_min = current_min + threshold
        print("<Cluster({})>".format(cluster))
        print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
        print("</Cluster({})>\n".format(cluster + threshold))
        current_min = next_min


if __name__ == "__main__":
    main()
On one attribute this is simple:
Sort
Linearly scan the data, and whenever the threshold is violated, begin a new group.
While this won't be optimal, it should be better than what you already have, at less cost.
However, in the multivariate case, finding the optimal groups is supposedly NP-hard, so finding the optimal grouping would require a brute-force search in exponential time. You will need to approximate this, either with AGNES (in O(n³)) or with CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operation on your data frame.
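For the single-attribute case, a minimal sketch of the sort-and-scan idea (an illustration only; here the threshold is checked against the first value of the current group, so every pair inside a group differs by at most the threshold):

import pandas as pd

def group_by_similarity(df, column="age", threshold=3):
    # sort once, then a single linear scan: open a new group whenever the current
    # value is more than `threshold` above the value that started the group
    sorted_df = df.sort_values(by=column)
    group_ids = []
    group_id, group_start = 0, None
    for value in sorted_df[column]:
        if group_start is None or value - group_start > threshold:
            group_id += 1
            group_start = value
        group_ids.append(group_id)
    return [g for _, g in sorted_df.groupby(pd.Series(group_ids, index=sorted_df.index))]

As the answer notes, this is greedy rather than optimal, and the groups are disjoint, so it will not reproduce overlapping groups like C and D in the example above.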