Pandas DataFrame Logic - Python

Trying to backtest trading logic for fun, but I can't seem to comprehend how to use numpy to make decisions. For example, I want to set df['position'] = 1 or -1 based on whether the data is below or above the lower and upper lines. If Data <= the lower line, I want to set position = 1 and keep it at 1 until Data is >= the upper line. Once Data is >= the upper line, I want to set position = -1, keep it at -1 until Data is <= the lower line again, and repeat.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = np.random.standard_normal((5, 100)).flatten()
data = data.cumsum()
df = pd.DataFrame({'Data': data})
df['std'] = df['Data'].rolling(50).std()
df['SMA'] = df['Data'].rolling(50).mean()
df['upper'] = df['SMA'] + (2 * df['std'])
df['lower'] = df['SMA'] - (2 * df['std'])
df[['Data', 'SMA', 'upper', 'lower']].plot(figsize=(10, 6))
df['position'] = 0
plt.show()
Here I try to do just that but fail because I don't know how to do this properly.
df['islower'] = np.where(df['Data'] < df['lower'], 1, 0)
df['isupper'] = np.where(df['Data'] > df['upper'], 1, 0)
df['position'] = np.where(df['isupper']==1, -1, 0) | np.where(df['islower']==1, 1, 0)

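I think what you want to do is flag the touches, then carry the last signal forward until the opposite one appears:
df['islower'] = (df['Data'] <= df['lower']).astype(int)
df['isupper'] = (df['Data'] >= df['upper']).astype(int)
df['position'] = (df['islower'] - df['isupper']).replace(0, np.nan).ffill().fillna(0).astype(int)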
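The hold-until-the-opposite-signal rule described in the question can also be spelled out as an explicit loop, which is handy for sanity-checking a vectorized version. This is only a sketch: the helper name `hold_position` and the toy values are made up for illustration, and the column names follow the question.

```python
import pandas as pd

def hold_position(data, lower, upper):
    """Hypothetical helper: +1 after touching `lower`, -1 after touching `upper`,
    0 before any signal has occurred."""
    position = []
    current = 0
    for value, lo, hi in zip(data, lower, upper):
        if value <= lo:
            current = 1
        elif value >= hi:
            current = -1
        # otherwise (including NaN bands) keep the previous position
        position.append(current)
    return position

# invented toy data with constant bands at -2 and +2
toy = pd.DataFrame({
    "Data":  [0.0, -2.5, -1.0, 0.5, 2.2, 1.0, -3.0],
    "lower": [-2.0] * 7,
    "upper": [2.0] * 7,
})
toy["position"] = hold_position(toy["Data"], toy["lower"], toy["upper"])
print(toy["position"].tolist())  # [0, 1, 1, 1, -1, -1, 1]
```

The loop is slower than a vectorized approach on long series, but it makes the state transitions easy to read.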

Related

Python's `.loc` is really slow on selecting subsets of Data

I have a large MultiIndexed (y, t) single-valued DataFrame df. Currently, I'm selecting a subset via df.loc[(Y,T), :] and creating a dictionary from it. The following MWE works, but the selection is very slow for large subsets.
import numpy as np
import pandas as pd
# Full DataFrame
y_max = 50
Y_max = range(1, y_max+1)
t_max = 100
T_max = range(1, t_max+1)
idx_max = tuple((y,t) for y in Y_max for t in T_max)
df = pd.DataFrame(np.random.sample(y_max*t_max), index=idx_max, columns=['Value'])
# Create Dictionary of Subset of Data
y1 = 4
yN = 10
Y = range(y1, yN+1)
t1 = 5
tN = 9
T = range(t1, tN+1)
idx_sub = tuple((y,t) for y in Y for t in T)
data_sub = df.loc[(Y,T), :] #This is really slow
dict_sub = dict(zip(idx_sub, data_sub['Value']))
# result, e.g. (y,t) = (5,7)
dict_sub[5,7] == df.loc[(5,7), 'Value']
I was thinking of using df.loc[(y1,t1):(yN,tN), :], but it does not work properly, as the second index is only bounded in the final year yN.
One idea is to use Index.isin with itertools.product in boolean indexing:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print (dict_sub)
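To see what the isin mask selects, here is a small sketch on a 3 x 4 MultiIndex. The frame `small` and the chosen ranges are invented for illustration; only `Index.isin`, `itertools.product` and `to_dict` are taken from the answer above.

```python
from itertools import product

import numpy as np
import pandas as pd

# toy frame: y in 1..3, t in 1..4, values 0.0 .. 11.0 in index order
index = pd.MultiIndex.from_product([range(1, 4), range(1, 5)], names=["y", "t"])
small = pd.DataFrame({"Value": np.arange(12, dtype=float)}, index=index)

# every (y, t) pair with y in 2..3 and t in 2..3
wanted = list(product(range(2, 4), range(2, 4)))

mask = small.index.isin(wanted)          # boolean array, one entry per row
subset = small.loc[mask, "Value"].to_dict()
print(subset)
# {(2, 2): 5.0, (2, 3): 6.0, (3, 2): 9.0, (3, 3): 10.0}
```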

Create boolean flag in pandas from signal's crossings

I would like to create a flag with a function and apply it to one column in a pandas DataFrame.
The intention of the function is to set the value to 1 when the signal crosses upwards over -1 and reset it to 0 when the signal crosses downwards under 1.
Here is my code example; I just can't get the function to work:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
x = np.arange(0, 10, 0.01)
x2 = np.arange(0, 20, 0.02)
sin1 = np.sin(x)
sin2 = np.sin(x2)
x2 /= 2
sin3 = sin1 + sin2
df = pd.DataFrame(sin3)
#name signal column
df.columns = ['signal']
df.signal.plot()
def my_flag(x):
    #cross over -1
    ok1 = (x.iloc[-1] > -1)*1
    ok2 = (x.iloc[-2] < -1)*1
    activate = (ok1*ok2) > 0.5
    if activate:
        flag_activate = 1
    # OFF
    #cross under 1
    ok3 = (x.iloc[-1] <1)*1
    ok4 = (x.iloc[-2] > 1)*1
    inactivate = (ok3*ok4) > 0.5
    if inactivate:
        flag_activate = 0
    # # add to df
    return flag_activate
df['the_flag'] = df['signal'].apply(my_flag)
#I have set the flag to 0 for plotting purposes for demo,
# should be replaced when my_flag function works
df['the_flag'] = 0
fig, (ax1,ax2) = plt.subplots(2)
ax1.plot(df['signal'])
ax1.set_title('signal')
y1 = -1
y2 = 1
ax1.axhline(y1,color='r')
I have made a "cartoon picture" of what I would like the flag to look like for a sine signal:
We can first detect the -1 and +1 crossings, keeping in mind that they should be an upward and a downward crossing, respectively. This can be done by shifting the signal left and right by 1 and comparing against -1 and +1 with the crossing direction in mind:
neg_1_crossings = np.where((sin3[:-1] < -1) & (sin3[1:] > -1))[0]
pos_1_crossings = np.where((sin3[:-1] > +1) & (sin3[1:] < +1))[0]
For the -1 cross-ups: the first mask requires the previous value to be less than -1, and the second requires the next value to be greater than -1. The +1 case is similar, with the operators flipped.
Now we have:
>>> neg_1_crossings
array([592], dtype=int64)
>>> pos_1_crossings
array([157, 785], dtype=int64)
I'd run for loops here to get the flag:
flag = np.zeros_like(sin3)
for neg_cross in neg_1_crossings:
    # a `neg_cross` raises the flag
    flag[neg_cross:] = 1
    for pos_cross in pos_1_crossings:
        if pos_cross > neg_cross:
            # once we hit a `pos_cross` later on, restrict the flag's ON
            # periods to be between the `neg_cross` and this `pos_cross`
            flag[pos_cross:] = 0
            # we are done with this `neg_cross`
            break
which gives
Overall:
def get_flag(col):
    """
    `col` is a pd.Series
    """
    # signal in numpy domain; also its shifted versions
    signal = col.to_numpy()
    sig_shifted_left = signal[1:]
    sig_shifted_right = signal[:-1]
    # detect crossings
    neg_1_crossings = np.where((sig_shifted_right < -1) & (sig_shifted_left > -1))[0]
    pos_1_crossings = np.where((sig_shifted_right > +1) & (sig_shifted_left < +1))[0]
    # form the `flag` signal
    flag = np.zeros_like(signal)
    for neg_cross in neg_1_crossings:
        # a `neg_cross` raises the flag
        flag[neg_cross:] = 1
        for pos_cross in pos_1_crossings:
            if pos_cross > neg_cross:
                # once we hit a `pos_cross` later on, restrict the flag's ON
                # periods to be between the `neg_cross` and this `pos_cross`
                flag[pos_cross:] = 0
                # we are done with this `neg_cross`
                break
    return flag
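As a loop-free variation on the same idea, the crossings can be written into an otherwise empty event series and forward-filled. This is a sketch under assumptions rather than the answer's own method: the name `flag_by_ffill` and the toy signal are hypothetical, and here the flag changes on the sample just after each crossing rather than on the sample before it.

```python
import numpy as np
import pandas as pd

def flag_by_ffill(signal, low=-1.0, high=1.0):
    """Hypothetical helper: 1 after an upward crossing of `low`,
    0 after a downward crossing of `high`, 0 before any crossing."""
    s = pd.Series(signal, dtype=float)
    prev = s.shift(1)                           # value at the previous sample
    events = pd.Series(np.nan, index=s.index)   # NaN means "no event here"
    events[(prev < low) & (s > low)] = 1.0      # crossed up through `low`
    events[(prev > high) & (s < high)] = 0.0    # crossed down through `high`
    # carry the most recent event forward; start at 0 before the first one
    return events.ffill().fillna(0.0).astype(int)

toy = [0.0, -1.5, -0.5, 0.5, 1.5, 0.5, 0.0, -1.2, -0.8]
print(flag_by_ffill(toy).tolist())  # [0, 0, 1, 1, 1, 0, 0, 0, 1]
```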
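You can use shift and query to find where the signal crosses your interval boundaries:
df["shifted"] = df.signal.shift(-1)  # value at the next sample
# upward crossing of -1: at or below -1 now, at or above -1 next
start = df.query("signal <= -1 and shifted >= -1")
# downward crossing of 1: at or above 1 now, at or below 1 next
stop = df.query("shifted <= 1 and signal >= 1")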
Then you can use these crossings to set your flag column; there is probably a more compact way to do this in pandas:
df["flag"] = False
# pair each left boundary with the closest right one, if any
for l in start.index.values:
    try:
        r = stop.index.values[stop.index.values > l][0]
        df.loc[l:r, "flag"] = True
    except IndexError:
        # no right boundary after this left one
        continue
Let's see if this works:
df.signal.plot()
start.signal.plot(marker="o", lw=0)
stop.signal.plot(marker="o", lw=0)
df.flag.astype(int).plot()

heatmap of values grouped by time - seaborn

I'm plotting the counts of a variable grouped by time as a heatmap. However, when including both hour and minute, the counts are quite low so the resulting heatmap doesn't really provide any real insight. Is it possible to group the counts in a bigger block of time? I'm hoping to test some different periods (5, 10 mins).
I'm also hoping to plot time on the x-axis. Similar to the output attached.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
    array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
g = df2.groupby(['Hour','Min','Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df)
For cases like this, I think downsampling the data with resample is the easiest approach, and the bin width is easy to change. The axis labels in the resulting plot will need adjusting, but I recommend this method.
import seaborn as sns
import pandas as pd
import numpy as np
from datetime import datetime
from datetime import timedelta
start = datetime(1900,1,1,10,0,0)
end = datetime(1900,1,1,13,0,0)
seconds = (end - start).total_seconds()
step = timedelta(minutes = 1)
array = []
for i in range(0, int(seconds), int(step.total_seconds())):
    array.append(start + timedelta(seconds=i))
array = [i.strftime('%Y-%m-%d %H:%M:%S') for i in array]
df2 = pd.DataFrame(array).rename(columns = {0:'Time'})
df2['Count'] = np.random.uniform(0.0, 0.5, size = len(df2))
df2['Count'] = df2['Count'].round(1)
df2['Time'] = pd.to_datetime(df2['Time'])
df2['Hour'] = df2['Time'].dt.hour
df2['Min'] = df2['Time'].dt.minute
df2.set_index('Time', inplace=True)
count_df = df2.resample('10min')['Count'].value_counts().unstack()
count_df.fillna(0, inplace = True)
sns.heatmap(count_df.T)
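To make the effect of time binning concrete, here is a small sketch that counts values per 10-minute bin. It swaps in groupby with pd.Grouper in place of resample (an assumption on my part, chosen because SeriesGroupBy.value_counts is a documented method), and the six-row frame `toy` is invented.

```python
import pandas as pd

# six samples, five minutes apart, starting at 10:00
times = pd.date_range("1900-01-01 10:00", periods=6, freq="5min")
toy = pd.DataFrame({"Count": [0.1, 0.1, 0.2, 0.1, 0.2, 0.2]}, index=times)

# rows: 10-minute bins; columns: distinct Count values; cells: occurrences
binned = (
    toy.groupby(pd.Grouper(freq="10min"))["Count"]
       .value_counts()
       .unstack(fill_value=0)
)
print(binned)
```

Changing `freq` to "5min" or "15min" is enough to try a different bin width; the transposed table can be passed to sns.heatmap as in the answer above.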
One way to achieve this is to create a column of group numbers in which each number is repeated once for every minute in the bin.
For example:
minutes = 3
x = [0, 1, 2]
np.repeat(x, repeats=minutes, axis=0)
# array([0, 0, 0, 1, 1, 1, 2, 2, 2])
and then group your data using this column.
So your code would look like:
...
minutes = 5
x = np.arange(len(df2) // minutes)
df2['group'] = np.repeat(x, repeats=minutes)
# group by the new bin column instead of the raw minute
g = df2.groupby(['group', 'Count'])
count_df = g['Count'].nunique().unstack()
count_df.fillna(0, inplace = True)
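np.repeat only yields a column of the right length when the number of rows is an exact multiple of the bin width. A minimal sketch of one alternative, assuming the rows are evenly spaced one minute apart: integer-divide the row position by the bin width. The toy frame and the use of size() to count occurrences are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Count": [0.1, 0.2, 0.1, 0.1, 0.3, 0.3, 0.2]})  # 7 rows
minutes = 3

# rows 0-2 -> group 0, rows 3-5 -> group 1, row 6 -> group 2
toy["group"] = np.arange(len(toy)) // minutes

# occurrences of each Count value within each group
counts = toy.groupby(["group", "Count"]).size().unstack(fill_value=0)
print(counts)
```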

Select from given level of MultiIndex Series

How can I select all values where the 'displacement' (second level of MultiIndex) is above a certain value, say > 2?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
dicts = {}
index = np.linspace(1, 50)
index[2] = 2.0 # Create a duplicate for later testing
for n in range(5):
    dicts['test' + str(n)] = pd.Series(np.linspace(0, 20) ** (n / 5),
                                       index=index)
s = pd.concat(dicts, names=('test', 'displacement'))
# Something like this?
s[s.index['displacement'] > 2]
I tried reading the docs but couldn't work it out, even trying IndexSlice.
Bonus points: how do I select a range, say between 2 and 4?
Thanks in advance for any help.
import pandas as pd
import numpy as np
dicts = {}
index = np.linspace(1, 50)
for n in range(5):
    dicts['test' + str(n)] = pd.Series(np.linspace(0, 20) ** (n / 5),
                                       index=index)
s = pd.concat(dicts, names=('test', 'displacement'))
displacement = s.index.get_level_values('displacement')
r = s.loc[(displacement > 2) & (displacement < 5)]
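The same get_level_values pattern on a tiny Series makes the result easy to check by eye. The data here is invented; the level names match the question.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["test0", "test1"], [1.0, 2.0, 3.0, 4.0, 5.0]],
    names=["test", "displacement"],
)
s = pd.Series(range(10), index=index)

# one displacement value per row, aligned with s
disp = s.index.get_level_values("displacement")

above = s[disp > 2]                      # displacement 3, 4, 5 in each test
between = s[(disp >= 2) & (disp <= 4)]   # inclusive range 2..4

print(above.tolist())    # [2, 3, 4, 7, 8, 9]
print(between.tolist())  # [1, 2, 3, 6, 7, 8]
```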
Inspired by https://stackoverflow.com/a/18103894/268075

how to make different ranges by using a variable in python?

I want to know whether there is a simpler way to make ranges than the code below (buckets = np.where(...)). If there is, please show me how to do it. In the code below, textdata is my main data, and userid and smstext are my variables.
Taking a subset of textdata:
userfreq = textdata[['userid', 'smstext']]
Calculating the count by userid:
user_freq = userfreq.groupby('userid').agg(len)
Resetting the index:
user_freq.reset_index(inplace=True)
Subsetting smstext to make the buckets:
tobebuckets = user_freq['smstext'] # here smstext holds the per-user frequencies
Making the different ranges:
buckets = np.where(
tobebuckets <= 0, 0,
np.where(
np.logical_and(tobebuckets > 0, tobebuckets <= 10), 10,
np.where(
np.logical_and(tobebuckets > 10,tobebuckets <= 50), 50,
np.where(
np.logical_and(tobebuckets > 50, tobebuckets <= 100), 100,
np.where(
np.logical_and(tobebuckets > 100, tobebuckets <= 500), 500,
np.where(
np.logical_and(tobebuckets > 500, tobebuckets <= 1000),
1000, 1001))))))
Thanks in advance. Please tell me a simple way to do the above in Python.
You are looking for digitize:
import numpy as np
x = np.arange(1500) - 5
top = 1 + x.max()
bins = np.array([0,10,50,100,500,1000,top])
# right=True makes each bin include its upper edge, so 10 maps to 10, not 50
result = bins[np.digitize(x, bins, right=True)]
#if you really want 1001 at the top
result = np.where(result==top,1001,result) #or just clip it
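As a check against the nested np.where in the question, here is a sketch that uses a separate label array so no `top` sentinel is needed. The names `edges` and `labels` and the sample counts are made up; `right=True` gives intervals of the form (lower, upper], matching the `> lower` and `<= upper` tests in the question.

```python
import numpy as np

edges = np.array([0, 10, 50, 100, 500, 1000])
# one label per interval: (-inf, 0], (0, 10], ..., (500, 1000], (1000, inf)
labels = np.array([0, 10, 50, 100, 500, 1000, 1001])

counts = np.array([-3, 0, 1, 10, 11, 50, 100, 101, 1000, 1001, 5000])
buckets = labels[np.digitize(counts, edges, right=True)]
print(buckets.tolist())
# [0, 0, 10, 10, 50, 50, 100, 500, 1000, 1001, 1001]
```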
