How do I get this barssince to work in Python?

I'm trying to convert the Top Bottom indicator by ceyhun from TradingView to Python.
I am stuck converting the barssince lines. One of them looks like this:
per = input(14, title="Bottom Period")
loc = low < lowest(low[1], per) and low <= lowest(low[per], per)
bottom = barssince(loc)
So far I have this in Python:
bottomPeriod = 14
data['counter'] = data.index.where(data.loc[data.index[-1], "low"] < min(data["low"][(-bottomPeriod-1):-1]) and data.loc[data.index[-1], "low"] <= min(data["low"][(-bottomPeriod*2-1):(-bottomPeriod-1)]))
data['counter'].fillna(method="ffill",inplace=True)
data['Rows_since_condition'] = data.index-data['counter']
data.drop(['counter'], axis=1,inplace=True)
I can't get the where to work. I'm almost ready to fall back to iterating over the dataset, but it is a big one and needs to run fast. Any help is much appreciated.
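For reference, a minimal vectorized sketch of barssince, assuming data is a DataFrame with a low column; the two rolling minima mirror lowest(low[1], per) and lowest(low[per], per):
import pandas as pd

per = 14
low = data['low']
# lowest(low[1], per) -> rolling min over per bars, shifted by one
# lowest(low[per], per) -> rolling min over per bars, shifted by per
cond = (low < low.shift(1).rolling(per).min()) & (low <= low.shift(per).rolling(per).min())
# barssince: positional counter minus the position of the last True
pos = pd.Series(range(len(data)), index=data.index)
data['bottom'] = pos - pos.where(cond).ffill()  # NaN until cond is first True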

How do you create a "constant dataframe" in pandas for the sake of adding / subtracting / comparing other dataframes to a constant value?

I'm working with pybacktest, trying to create my own BackTester that will support not only "buy" / "sell" but also "stay" (the middle ground - do nothing).
ohlc = pybacktest.load_from_yahoo('AAPL', start=2000) # Load a pandas dataframe
ohlc.tail()
ms = ohlc.C.rolling(short_ma).mean()
ml = ohlc.C.rolling(long_ma).mean()
eps = 0.001
up = (ms > ml + eps) & (ms.shift() < ml.shift()) # ma cross up
stay = (ml - eps <= ms <= ml + eps)
down = (ms < ml - eps) & (ms.shift() > ml.shift()) # ma cross down
The last part of the code is what I want to do. The code as it exists in the pybacktest tutorial is:
buy = cover = (ms > ml) & (ms.shift() < ml.shift()) # ma cross up
sell = short = (ms < ml) & (ms.shift() > ml.shift()) # ma cross down
And I'm changing buy / sell to up / down to more closely model the instrument I'm trading. And there is no cover / short for this instrument. So I've created my own class and copied the Backtest class code into it, and have started editing. But this has to do with pandas dataframes.
The error I'm getting is related to the fact that I just used a Python floating point and tried adding / subtracting it from a dataframe:
It would have been nice if you could simply do that (I would have coded pandas that way), but you can't. So my question is simple and probably easy to answer. Pandas dataframes usually hold complex-looking data. I'm desiring the opposite of that... all I want is a constant value for every timestamped row in ohlc!
Thank you.
The error was about the chained comparison a <= b <= c being ambiguous on a Series.
Simply convert it to:
stay = (ml - eps <= ms) & (ms <= ml + eps)
using & logic. So pandas does handle +/-/* with constants and casting! Amazing!
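For context, Python expands a <= b <= c into (a <= b) and (b <= c), and the plain and operator calls bool() on the intermediate Series, which raises the ambiguity error; element-wise & avoids that. An equivalent one-liner, assuming ms and ml are index-aligned Series, is Series.between, which is inclusive on both ends by default:
stay = ms.between(ml - eps, ml + eps)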

Pandas for Loop Optimization (Vectorization) when looking at previous row value

I'm looking to optimize the time taken by a function with a for loop. The code below is OK for smaller dataframes, but for larger dataframes it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the value of a previous row for one of the columns. I read that the most efficient way is to use pandas vectorization, but I'm struggling to understand how to implement this when my for loop is considering the previous row value of one column to populate a new column on the current row. I'm a complete novice, but have looked around and can't find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so may have missed something.
The function is below, and I've created a test dataframe and random parameters too. It would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain(Data, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, len(df.index)):
        print('Computing Status{i}/{r}'.format(i=i, r=len(df.index)))
        if (df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin):
            df['MODE'].loc[i] = 1
        elif ((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600)) | ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = (df['C']/6)
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600) & (df['B'].loc[i] < NormalLim1):
            df['Nominal'].loc[i] = rated/6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df
import numpy as np
import pandas as pd

A = np.random.randint(0,28,size=(8000))
B = np.random.randint(0,45,size=(8000))
C = np.random.randint(0,2300,size=(8000))
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)
MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150
finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead:
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif statement evaluates
(df['B'] > NormalLim1) & (df['B'] < MODELim1)
and sets MODE to 1 regardless of the other condition, so you can pull that out and vectorize it. I didn't try it, but this should do it:
df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1
Then you may be able to collapse the other conditions into one statement using |.
Not sure how much all that will save you, but you should cut the time in half by getting rid of the second loop.
For vectorizing it, I suggest you first shift your column into another one:
df['MODE_1'] = df['MODE'].shift(1)
and then use:
(df['MODE_1'] == 1)
After that you should be able to vectorize.
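That said, MODE depends on its own previous value, so the first loop can't be fully vectorized with a plain shift; a common compromise is to vectorize everything that doesn't look backwards and run the recursive part over NumPy arrays. A rough sketch along those lines (untested against the real data, so double-check the conditions):
import numpy as np

def mode_gain_fast(df, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    a = df['A'].to_numpy()
    b = df['B'].to_numpy()
    cond_b = (b > NormalLim1) & (b < MODELim1)  # doesn't depend on the prior row
    mode = np.zeros(len(df), dtype=int)
    for i in range(1, len(df)):
        prev = mode[i - 1]
        if (prev == 1 and a[i] > Normalin) or (prev == 0 and a[i] > NormalLim600) or cond_b[i]:
            mode[i] = 1
    df['MODE'] = mode
    df['Nominal'] = np.where((a > MODEin) & (a < NormalLim600) & (b < NormalLim1), rated/6, 0)
    df['Upgrade'] = df['C']/6 - df['Nominal']  # the unnamed df[''] column folded in
    return df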

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the dataset was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it against the max speed of the vehicle (which is 80 km/h), calculating the speed from the timestamp and the mileage. The result should then be written into the original dataset.
What I've done so far is the following:
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    # iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        # calculate speed
        if (group.mileage.values[i]/timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1
df_ori.valid.value_counts()
It definitely works the way I want it to, however it lacks in performance a lot. The df contains nearly 700k in data (already cleaned). I am still a beginner and can't figure out a better solution. Would really appreciate any of your help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value, per device
df_ori['timeGoneSec'] = \
    df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above will produce NaN values for the first value in each group;
# fill 'valid' with 1 according to the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, i.e. row-by-row iteration, which can be insanely slow.
Since I can't get the full context of your code, please double-check the logic and make sure it works as desired.
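To illustrate, here is a tiny runnable check on the sample rows from the question (maxSpeedKMH assumed to be 80, as stated):
import pandas as pd

df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})
maxSpeedKMH = 80
df_ori = df_ori.sort_values('position_timestamp_measure')
secs = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
speed = df_ori['mileage'] / (secs / 3600)
# first point per device has no predecessor, so it is valid by convention
df_ori['valid'] = (secs.isna() | (speed <= maxSpeedKMH)).astype(int)
print(df_ori)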

Pandas: filter large (50M rows) dataframe if row lat/lng falls inside a bounding box?

I have a large data set that I'm evaluating various means of parsing. It's a set of CSV files, each with ~40 million rows. Reading into a pandas dataframe, I have the following sample data (randomly generated). The account identifiers repeat from time to time, some in the box, some out. I have this working in a 'brute force' manner using csv readers and have gotten the processing time down to ~60 seconds for a randomly generated 10-million-row sample. I'm hoping I can make it more efficient with pandas, because the real data set is much, much larger, and we may need to do this kind of parsing repeatedly in the future.
account lat lng
0 f413e6cd-bbfe-463b-bf58-cba1a74a4aff 50.847615 70.826473
1 8b2ceb89-7ce0-4e14-a5f0-28acb7b05d8b 18.545991 115.078981
2 a51ab728-14b5-473c-bed1-91953da8ba22 30.699439 9.660661
3 83e3964f-130f-49bc-9c4b-d46d4b48c2cb 7.906903 70.507260
4 84c75e57-5a5f-4314-80be-d1271ecd76ef -20.325371 48.310855
This block describes the function by which I want to match the rows of the dataframe. I have multiple bounding boxes that I need to check for membership.
top = 49.3457868     # north lat
left = -124.7844079  # west long
right = -66.9513812  # east long
bottom = 24.7433195  # south lat

def in_bounds(lat, lng):
    if bottom <= lat <= top and left <= lng <= right:
        return True
    elif bottom2 <= lat <= top2 and left2 <= lng <= right2:
        return True
    else:
        return False
I don't really understand how to apply a lambda function to all rows, while collecting the rows that match in a new dataframe, using criteria from multiple columns of the dataframe. I can't find any examples online that come close to this.
What I've got working so far looks like this:
df = pd.read_csv('DATA.csv')
df.columns = ['account', 'lat', 'lng']
accounts= df[(bottom <= df.lat) & (df.lat <= top) & (left <= df.lng) & (df.lng <= right)]
results = df[df.account.isin(accounts.account)]
results.to_csv('pd_out.csv', header=False, )
But that only applies a single bounding box. I can't figure out how to do the similar filtering as above, but using my in_bounds function. So how can I best do this?
Not too beautiful, but to combine both conditions you can use the | operator. Note the parentheses around both box conditions.
box1_cond = (bottom <= df.lat) & (df.lat <= top) & (left <= df.lng) & (df.lng <= right)
box2_cond = (bottom2 <= df.lat) & (df.lat <= top2) & (left2 <= df.lng) & (df.lng <= right2)
accounts = df[box1_cond | box2_cond]
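If the number of boxes grows, looping over box tuples keeps things manageable. A sketch (the second box's bounds bottom2, top2, left2, right2 are assumed to be defined elsewhere), which also carries out the isin step from the question:
import numpy as np

# (bottom, top, left, right) per box; the second tuple is hypothetical
boxes = [
    (24.7433195, 49.3457868, -124.7844079, -66.9513812),
    (bottom2, top2, left2, right2),
]
mask = np.zeros(len(df), dtype=bool)
for b, t, l, r in boxes:
    mask |= (df.lat.between(b, t) & df.lng.between(l, r)).to_numpy()
accounts = df.loc[mask, 'account']
results = df[df.account.isin(accounts)]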

Why is this code producing a different value in a VM?

Basically I have these TIFF images that I need to recurse through, reading pixel data to determine whether a pixel in the image is melting ice or not. This is determined via the threshold value set in the script. It is configured to report both the year's total melt value and each month's. It works fine on my own machine, but I need to run this remotely on a Linux VM. It runs there too, but it produces a total that is exactly 71146 greater than what it should be and what it had been producing.
This is the snippet that does most of the processing and is ultimately causing my problems I believe.
for file in os.listdir(current):
    if os.path.exists(file):
        if file.endswith(".tif"):
            fname = os.path.splitext(file)[0]
            day = fname[4:7]
            im = Image.open(file)
            for x in range(0, 60):
                for y in range(0, 109):
                    p = round(im.getpixel((x, y)), 4)
                    if p >= threshold:
                        combined = str(x) + "-" + str(y)
                        if combined not in coords:
                            melt += 1
                            coords.append(combined)
            totalmelt.append(melt)
And then totalmelt is summed to get the yearly value:
total = sum(totalmelt)
The threshold value has been set previously as follows:
threshold = float(-0.0158)
I feel like I'm missing something obvious. It's been a while since I played with Python...I'm coming over from C++ right now. Thanks for any solutions you might offer!
You need to reset melt to 0 before your inner loops:
melt = 0
for x in range(0, 60):
    for y in range(0, 109):
        ...
        melt += 1
totalmelt.append(melt)
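For reference, a sketch of the corrected per-file accumulation (variable names follow the question; current and threshold come from the original script, and note that coords keeps accumulating across files, which may or may not be what you want for yearly uniqueness):
import os
from PIL import Image

threshold = float(-0.0158)
coords = []
totalmelt = []
for file in os.listdir(current):
    if os.path.exists(file) and file.endswith(".tif"):
        im = Image.open(file)
        melt = 0  # reset for every image
        for x in range(0, 60):
            for y in range(0, 109):
                p = round(im.getpixel((x, y)), 4)
                if p >= threshold:
                    combined = str(x) + "-" + str(y)
                    if combined not in coords:
                        melt += 1
                        coords.append(combined)
        totalmelt.append(melt)
total = sum(totalmelt)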
