I have a pandas df with 3 columns:
Close Top_Barrier Bottom_Barrier
0 441.86 441.964112 426.369888
1 448.95 444.162225 425.227108
2 449.99 446.222271 424.285063
3 449.74 447.947051 423.678282
4 451.97 449.879254 423.029413
...
996 436.97 446.468790 426.600543
997 438.16 446.461401 426.599265
998 437.00 446.093899 426.641434
999 437.52 446.024365 426.631635
1000 437.75 446.114093 426.715907
Objective:
For every row, I need to test whether the Close price in any of the next 30 rows touches the top or bottom barrier fixed at the starting row. For example, starting from row index 0, test whether the Close price (441.86) is greater than Top_Barrier (441.96) or lower than Bottom_Barrier (426.36). If it is greater than Top_Barrier, return 1; if it is lower than Bottom_Barrier, return -1. Otherwise move on to the next row: at index 1 the Close price is 448.95, but it is still tested against the barrier prices from index 0, i.e. Top_Barrier of 441.96 and Bottom_Barrier of 426.36. The loop continues until index 29; if the Close price never touches either barrier, return 0. The next rolling loop runs from index 1 to index 30, and so on.
Attempts:
I tried using .rolling.apply with the following function, but I could not resolve the errors. I am happy to explore any other method as long as it achieves the objective stated above. Thanks!
def tbl_rolling(x):
    start_i = x.index[0]
    for i in range(len(x)):
        # the barrier freeze at index 0
        if x.loc[i, 'Close'] > x.loc[start_i, 'Top_Barrier']:
            return 1
        elif x.loc[i, 'Close'] < x.loc[start_i, 'Bottom_Barrier']:
            return -1
    return 0
The following then throws IndexingError: Too many indexers
test = df.rolling(30).apply(tbl_rolling, raw=False)
You can try something like this if your dataset isn't very big (the self-merge on a constant key is a cross join, so it builds one row per pair of original rows):
import numpy as np
df = df.reset_index().assign(key=1)
def f(x):
    cond1 = x['Close_x'] > x['Top_Barrier_y'].max()
    cond2 = x['Close_x'] < x['Bottom_Barrier_y'].min()
    return np.select([cond1,cond2],[1,-1], default=0)[0]
df.merge(df, on='key').query('index_y <= index_x').groupby('index_x').apply(f)
Output:
index_x
0 0
1 1
2 1
3 1
4 1
996 0
997 0
998 0
999 0
1000 0
dtype: int64
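For reference, the objective as stated in the question (freeze the barriers at the starting row, scan the next 30 closes, report the first touch) can also be written as a plain loop over NumPy arrays. This is only a sketch under assumptions: the name first_touch_labels and the horizon parameter are made up for illustration, the last demo row is hypothetical, and rows near the end of the frame simply get a shorter window.

```python
import numpy as np
import pandas as pd

def first_touch_labels(df, horizon=30):
    """Return 1 / -1 / 0 per row: the first barrier touched within `horizon` rows."""
    close = df["Close"].to_numpy()
    top = df["Top_Barrier"].to_numpy()
    bottom = df["Bottom_Barrier"].to_numpy()
    labels = np.zeros(len(df), dtype=int)
    for i in range(len(df)):
        window = close[i:i + horizon]      # closes from row i onward
        hit_top = window > top[i]          # barriers stay frozen at row i
        hit_bottom = window < bottom[i]
        touched = hit_top | hit_bottom
        if touched.any():
            first = touched.argmax()       # position of the first touch
            labels[i] = 1 if hit_top[first] else -1
    return pd.Series(labels, index=df.index)

# First five rows from the question, plus one hypothetical row that breaks downward.
demo = pd.DataFrame({
    "Close": [441.86, 448.95, 449.99, 449.74, 451.97, 420.00],
    "Top_Barrier": [441.964112, 444.162225, 446.222271, 447.947051, 449.879254, 450.00],
    "Bottom_Barrier": [426.369888, 425.227108, 424.285063, 423.678282, 423.029413, 423.00],
})
labels = first_touch_labels(demo, horizon=30)
print(labels.tolist())
```

The loop is O(n * horizon), which should be fine for around a thousand rows; it avoids building the n-by-n merged frame.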
I have a pandas dataframe called week5_233C. My Python version is 3.19.13.
I wrote an if-statement to add a new column to my data set: Spike. If the value in Value [pV] is not equal to 0, I want to add a 1 to that row. If Value [pV] is 0, then I want to add in the spike column that it is 0.
The data looks like this:
TimeStamp [µs] Value [pV]
0 1906200 0
1 1906300 0
2 1906400 0
3 1906500 -149012
4 1906600 -149012
And I want it to look like this:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
I tried:
week5_233C.loc[week5_233C[' Value [pV]'] != 0, 'Spike'] = 1
week5_233C.loc[week5_233C[' Value [pV]'] == 0, 'Spike'] = 0
but all rows in column Spike get the same value.
I also tried:
week5_233C['Spike'] = week5_233C[' Value [pV]'].apply(lambda x: 0 if x == 0 else 1)
Again, it just adds only 0s or only 1s, but does not work with if and else. See example data:
TimeStamp [µs] Value [pV] Spike
0 1906200 0 1
1 1906300 0 1
2 1906400 0 1
3 1906500 -149012 1
4 1906600 -149012 1
Doing it like this:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C['Spike'] = 1
    elif i == 0:
        week5_233C['Spike'] = 0
does not do anything: does not add a column, does not give an error, and makes Python crash.
However, when I run this if-statement with just a print as such:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        print(1)
    elif i == 0:
        print(0)
then it does print the exact values I want. I cannot figure out how to save these values in a new column.
This:
for i in week5_233C[' Value [pV]']:
    if i != 0:
        week5_233C.concat([1, df.iloc['Spike']])
    elif i == 0:
        week5_233C.concat([0, df.iloc['Spike']])
gives me an error: AttributeError: 'DataFrame' object has no attribute 'concat'
How can I make a new column Spike and add the values 0 and 1 based on the value in column Value [pV]?
I think you should check the dtype of the Value [pV] column. It probably holds strings, which is why every row gets the same value. Try print(df['Value [pV]'].dtype). If it is object, convert it with astype(float) or pd.to_numeric(df['Value [pV]']).
You can also try:
df['spike'] = np.where(df['Value [pV]'] == '0', 0, 1)
Update
To show the bad rows and debug your dataframe, use the following code:
df.loc[pd.to_numeric(df['Value [pV]'], errors='coerce').isna(), 'Value [pV]']
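Putting the dtype check and the conversion together, a minimal sketch might look like the following. The sample frame is hypothetical (values stored as strings, with one unparseable entry), and treating unparseable values as "no spike" is an assumption made here, not something the question specifies.

```python
import pandas as pd

# Hypothetical data: the column was read as strings and contains one bad entry.
df = pd.DataFrame({
    "TimeStamp [µs]": [1906200, 1906300, 1906400, 1906500, 1906600],
    "Value [pV]": ["0", "0", "0", "-149012", "n/a"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN.
values = pd.to_numeric(df["Value [pV]"], errors="coerce")

# Rows that failed to parse, for inspection.
bad_rows = df.loc[values.isna(), "Value [pV]"]

# 1 where the numeric value is non-zero, 0 otherwise (NaN counted as 0 here).
df["Spike"] = values.fillna(0).ne(0).astype(int)
print(df)
```

Comparing the converted numbers rather than the raw strings avoids the situation where every row compares unequal to the integer 0.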
import pandas as pd
df = pd.DataFrame({'TimeStamp [µs]':[1906200, 1906300, 1906400, 1906500, 1906600],
'Value [pV] ':[0, 0, 0, -149012, -149012],
})
df['Spike'] = df.agg({'Value [pV] ': lambda v: int(bool(v))})
print(df)
TimeStamp [µs] Value [pV] Spike
0 1906200 0 0
1 1906300 0 0
2 1906400 0 0
3 1906500 -149012 1
4 1906600 -149012 1
I'm just seeking some guidance on how to do this better. I was doing some basic research to compare Monday's open and low. The code returns two lists: one with the returns ((Monday's open - Monday's low) / Monday's open), and one that is just 1s and 0s to reflect whether the return was positive or not.
Please take a look as I'm sure there's a better way to do it in pandas but I just don't know how.
#Monday only
m_list = []  #results list
h_list = []  #hit list (open-low > 0)
n = 0  #counter variable
for t in history.index:
    #t[1] is the timestamp in the multi-index; weekday() returns 0 for Monday
    if datetime.datetime.weekday(t[1]) == 0:
        x = history.iloc[n]['open'] - history.iloc[n]['low']
        m_list.append(x / history.iloc[n]['open'])
        if x > 0:
            h_list.append(1)
        else:
            h_list.append(0)
    n += 1  #add to index counter
print("Mean: ", mean(m_list), "Max: ", max(m_list), "Min: ",
      min(m_list), "Hit Rate: ", sum(h_list)/len(h_list))
You can do that directly:
(history['open']-history['low'])>0
This gives True for rows where open is greater than low and False otherwise.
If you want 1 and 0 instead, multiply the expression by 1:
((history['open']-history['low'])>0)*1
Example
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.random.random(10),
'b':np.random.random(10)})
Printing the data frame:
print(df)
a b
0 0.675916 0.796333
1 0.044582 0.352145
2 0.053654 0.784185
3 0.189674 0.036730
4 0.329166 0.021920
5 0.163660 0.331089
6 0.042633 0.517015
7 0.544534 0.770192
8 0.542793 0.379054
9 0.712132 0.712552
To make a new column compare that is 1 if a is greater and 0 if b is greater:
df['compare'] = (df['a']-df['b']>0)*1
This adds the new column compare:
a b compare
0 0.675916 0.796333 0
1 0.044582 0.352145 0
2 0.053654 0.784185 0
3 0.189674 0.036730 1
4 0.329166 0.021920 1
5 0.163660 0.331089 0
6 0.042633 0.517015 0
7 0.544534 0.770192 0
8 0.542793 0.379054 1
9 0.712132 0.712552 0
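To tie this back to the Monday-only question, the same vectorized comparison can be combined with a weekday filter on the timestamp level of the index. The sketch below is an assumption-laden illustration: the history frame, the symbol "SPY", and the prices are all made up, and the timestamp is assumed to sit at level 1 of the MultiIndex as in the question's t[1].

```python
import pandas as pd

# Hypothetical history: (symbol, time) MultiIndex, ten daily bars from Monday 2024-01-01.
idx = pd.MultiIndex.from_product(
    [["SPY"], pd.date_range("2024-01-01", periods=10, freq="D")],
    names=["symbol", "time"],
)
history = pd.DataFrame(
    {
        "open": [100.0] * 10,
        "low": [99.0, 100.0, 98.0, 97.0, 100.0, 99.5, 99.0, 100.0, 98.0, 99.0],
    },
    index=idx,
)

# dayofweek is 0 for Monday; level 1 of the index holds the timestamps.
is_monday = history.index.get_level_values(1).dayofweek == 0
mondays = history[is_monday]

# (open - low) / open for Mondays only, plus the 1/0 hit flag.
drop = (mondays["open"] - mondays["low"]) / mondays["open"]
hits = (drop > 0).astype(int)

print("Mean:", drop.mean(), "Max:", drop.max(), "Min:", drop.min(),
      "Hit Rate:", hits.mean())
```

Filtering once with a boolean mask replaces both the manual counter and the per-row lookups.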
I am trying to compare the columns of a dataframe with each other, row by row, like this:
for (i = startday to endday)
    if (df[i] < df[i+1])
        counter = counter + 1
    else
        i = endday + 1
The goal is to find increasing (or decreasing) trends, which need to be consecutive.
And my data looks like this
df= 1 2 3 0 1 1 1
    1 1 1 1 0 1 2
    1 2 1 0 1 1 2
    0 0 0 0 1 0 1
(In this example startday to endday spans 7 columns, but in practice these two are not fixed.)
As a result I expect to find {2,0,1,0}, and I need it to run fast because my data is quite big (1.2 million rows). Because of the time limit I tried not to use loops (for, if, etc.).
I tried the code below but couldn't find a way to stop the counter once the condition is false:
import math
import numpy as np
import pandas as pd
df1=df.copy()
df2=df.copy()
bool1 = (np.less_equal.outer(startday.startday, range(1,13))
& np.greater_equal.outer(endday.endday, range(1,13))
)
bool1= np.c_[np.zeros(len(startday)),bool1].astype('bool')
bool2 = (np.less_equal.outer(startday2.startday2, range(1,13))
& np.greater_equal.outer(endday2.endday2, range(1,13))
)
bool2= np.c_[bool2, np.zeros(len(startday))].astype('bool')
df1.insert(0, 'c_False',math.pi)
df2.insert(12, 'c_False',math.pi)
#df2.head()
arr_bool = (bool1&bool2&(df1.values<df2.values))
df_new = pd.DataFrame(np.sum(arr_bool , axis=1),
index=data_idx, columns=['coll'])
df_new.coll= np.select( condlist = [startday.startday > endday.endday],
choicelist = [-999],
default = df_new.coll)
Append a column of zeros, take np.diff along each row, then find the first non-positive difference with argmin:
(np.diff(np.hstack((df.values, np.zeros((df.values.shape[0], 1)))), axis=1) > 0).argmin(axis=1)
>> array([2, 0, 1, 0], dtype=int64)
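Here is a self-contained sketch of that idea on the example data from the question. One deliberate change, labeled as such: the pad column uses -inf instead of zeros, so the final difference can never be positive even if a row ends in a negative number. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [1, 2, 3, 0, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 2],
    [1, 2, 1, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 0, 1],
])

values = df.to_numpy(dtype=float)

# Pad each row with -inf so the last difference is always non-positive;
# this guarantees argmin finds a False even in a fully increasing row.
pad = np.full((values.shape[0], 1), -np.inf)
rising = np.diff(np.hstack((values, pad)), axis=1) > 0

# argmin on booleans returns the position of the first False,
# which equals the number of consecutive increases from the start.
run_length = rising.argmin(axis=1)
print(run_length.tolist())  # [2, 0, 1, 0]
```

Because everything is done with whole-array operations, there is no Python-level loop over the rows.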
We have this function:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    df["getfirst"] = np.where(df["USDamt"] > CustomAmt, 1, 0)
    wantedprice = "??"
    print(df)
    print()
    print("Wanted Price:",wantedprice)
    return wantedprice
Calling it using a custom USDamt like this:
GetPricePerCustomAmt(500)
gets this result:
Price USDamt getfirst
0 281.48 104.84 0
1 281.44 5140.77 1
2 281.42 10072.24 1
3 281.39 15773.83 1
4 281.33 19314.54 1
5 281.27 22255.55 1
6 281.20 23427.64 1
7 281.13 23708.77 1
8 281.10 23738.77 1
9 281.08 24019.88 1
10 281.01 25986.95 1
11 281.00 26127.45 1
Wanted Price: ??
We want to return the Price value from the row where the first 1 appears in the "getfirst" column.
Examples:
GetPricePerCustomAmt(500)
Wanted Price: 281.44
GetPricePerCustomAmt(15000)
Wanted Price: 281.39
GetPricePerCustomAmt(24000)
Wanted Price: 281.08
How do we do it?
(If you know a more efficient way to get the wanted price please do tell too)
Use next with iter to return a default value when nothing matches and the filtered Series is empty; use boolean indexing for the filtering:
def GetPricePerCustomAmt(CustomAmt):
    data = [{"Price":281.48,"USDamt":104.84},{"Price":281.44,"USDamt":5140.77},{"Price":281.42,"USDamt":10072.24},{"Price":281.39,"USDamt":15773.83},{"Price":281.33,"USDamt":19314.54},{"Price":281.27,"USDamt":22255.55},{"Price":281.2,"USDamt":23427.64},{"Price":281.13,"USDamt":23708.77},{"Price":281.1,"USDamt":23738.77},{"Price":281.08,"USDamt":24019.88},{"Price":281.01,"USDamt":25986.95},{"Price":281.0,"USDamt":26127.45}]
    df = pd.DataFrame(data)
    return next(iter(df.loc[df["USDamt"] > CustomAmt, 'Price']), 'no matched')
print(GetPricePerCustomAmt(500))
281.44
print(GetPricePerCustomAmt(15000))
281.39
print(GetPricePerCustomAmt(24000))
281.08
print(GetPricePerCustomAmt(100000))
no matched
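On the efficiency question, one possible alternative is a binary search, sketched below. It relies on an assumption that holds for the sample data but is not stated in the question: USDamt is sorted in ascending order. The names price_for_amount, PRICES, and USD_AMT are made up for this example.

```python
import numpy as np

# Levels copied from the question; assumed sorted by USDamt in ascending order.
PRICES = np.array([281.48, 281.44, 281.42, 281.39, 281.33, 281.27,
                   281.2, 281.13, 281.1, 281.08, 281.01, 281.0])
USD_AMT = np.array([104.84, 5140.77, 10072.24, 15773.83, 19314.54, 22255.55,
                    23427.64, 23708.77, 23738.77, 24019.88, 25986.95, 26127.45])

def price_for_amount(custom_amt, default=None):
    """Price at the first level whose USDamt is strictly greater than custom_amt."""
    # side="right" skips entries equal to custom_amt, matching the `>` comparison.
    pos = np.searchsorted(USD_AMT, custom_amt, side="right")
    return float(PRICES[pos]) if pos < len(PRICES) else default

print(price_for_amount(500))     # 281.44
print(price_for_amount(15000))   # 281.39
print(price_for_amount(24000))   # 281.08
print(price_for_amount(100000))  # None
```

The search is O(log n) per lookup, which only matters if the function is called many times on a large book; for twelve rows the boolean-indexing version is just as good.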
I have the following dataframe:
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
The table goes on like that. The first column is a timestamp in milliseconds. S_time1 and End_Time_1 bound the interval during which a particular sign (a number) appears. For example, in the fifth row (index 4), S_time1 is 2526631, End_Time_1 is 2520631, and the corresponding Sign_1 is 10, which means the sign 10 is displayed from 2526631 to 2520631. The same goes for S_time2 and End_time_2: the value in Sign_2 appears for the duration from S_time2 to End_time_2.
I want to resample the Timestamp column into 100-millisecond bins and find which bins each sign belongs to. For instance, there is a 2000-millisecond difference between each start time and end time, so the corresponding sign will repeat in around 20 consecutive bins, since each bin is 100 milliseconds wide. So I need only two columns: one with the bin times and one with the signs. It should look like the following table (the bin times are made up for the example):
Bin_time signs
...100 0
...200 0
...300 10
...400 10
...500 10
...600 10
The sign 10 will be for the duration of the corresponding S_time1 to End_time_1. Then the next sign which is 80 continues for the duration of S_time2 to End_time_2. I am not sure if this can be done in pandas or not. But I really need help either in pandas or other methods.
Thanks for your help and suggestion in advance.
Input:
print df
Timestamp S_time1 S_time2 End_Time_1 End_time_2 Sign_1 Sign_2
0 2413044 0 0 0 0 x x
1 2422476 0 0 0 0 x x
2 2431908 0 0 0 0 x x
3 2441341 0 0 0 0 x x
4 2541232 2526631 2528631 2520631 2530631 10 80
5 2560273 2544946 2546496 2546496 2548496 40 80
6 2577224 2564010 2566010 2566010 2568010 null null
7 2592905 2580959 2582959 2582959 2584959 null null
2 approaches:
In [231]: %timeit s(df)
1 loops, best of 3: 2.78 s per loop
In [232]: %timeit m(df)
1 loops, best of 3: 690 ms per loop
def m(df):
    #resample column Timestamp by 100ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df['i'] = 1
    df = df.set_index('Timestamp')
    df1 = df[[]].resample('100ms', how='first').reset_index()
    df1['Timestamp'] = (df1['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #helper column i for merging
    df1['i'] = 1
    #print df1
    out = df1.merge(df,on='i', how='left')
    out1 = out[['Timestamp', 'Sign_1']][(out.Timestamp >= out.S_time1) & (out.Timestamp <= out.End_Time_1)]
    out2 = out[['Timestamp', 'Sign_2']][(out.Timestamp >= out.S_time2) & (out.Timestamp <= out.End_time_2)]
    out1 = out1.rename(columns={'Sign_1':'Bin_time'})
    out2 = out2.rename(columns={'Sign_2':'Bin_time'})
    df = pd.concat([out1, out2], ignore_index=True).drop_duplicates(subset='Timestamp')
    df1 = df1.set_index('Timestamp')
    df = df.set_index('Timestamp')
    df = df.reindex(df1.index).reset_index()
    #print df.head(10)
def s(df):
    #resample column Timestamp by 100ms, convert back to integers
    df['Timestamp'] = df['Timestamp'].astype('timedelta64[ms]')
    df = df.set_index('Timestamp')
    out = df[[]].resample('100ms', how='first')
    out = out.reset_index()
    out['Timestamp'] = (out['Timestamp'] / np.timedelta64(1, 'ms')).astype(int)
    #print out.head(10)
    #search start end
    def search(x):
        mask1 = (df.S_time1<=x['Timestamp']) & (df.End_Time_1>=x['Timestamp'])
        #if at least one True return first value of series
        if mask1.any():
            return df.loc[mask1].Sign_1[0]
        #check second start and end time
        else:
            mask2 = (df.S_time2<=x['Timestamp']) & (df.End_time_2>=x['Timestamp'])
            if mask2.any():
                #if at least one True return first value
                return df.loc[mask2].Sign_2[0]
            else:
                #if all False return NaN
                return np.nan
    out['Bin_time'] = out.apply(search, axis=1)
    #print out.head(10)
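As a rough alternative that avoids resample altogether, the bins can be generated with np.arange and filled interval by interval. This is a sketch, not a drop-in replacement for the functions above: the function name bin_signs and the tiny sample frame are hypothetical, non-numeric signs such as 'x' or 'null' are dropped, empty bins get 0, and when intervals overlap the Sign_1 interval wins (a choice made here).

```python
import numpy as np
import pandas as pd

def bin_signs(df, step=100):
    """Assign a sign to each `step`-wide bin between the first and last Timestamp."""
    # Stack both (start, end, sign) triples into one long table of intervals.
    first = df[["S_time1", "End_Time_1", "Sign_1"]].rename(
        columns={"S_time1": "start", "End_Time_1": "end", "Sign_1": "sign"})
    second = df[["S_time2", "End_time_2", "Sign_2"]].rename(
        columns={"S_time2": "start", "End_time_2": "end", "Sign_2": "sign"})
    intervals = pd.concat([first, second], ignore_index=True)

    # Placeholders such as 'x' or 'null' become NaN and are dropped.
    intervals["sign"] = pd.to_numeric(intervals["sign"], errors="coerce")
    intervals = intervals.dropna(subset=["sign"])

    lo = df["Timestamp"].min() // step * step
    bins = np.arange(lo, df["Timestamp"].max() + step, step)
    signs = np.zeros(len(bins))

    # Walk the intervals in reverse so earlier ones overwrite later ones.
    for start, end, sign in intervals[::-1].itertuples(index=False):
        signs[(bins >= start) & (bins <= end)] = sign
    return pd.DataFrame({"Bin_time": bins, "signs": signs.astype(int)})

# Hypothetical miniature of the question's table.
sample = pd.DataFrame({
    "Timestamp": [0, 1000],
    "S_time1": [0, 300], "S_time2": [0, 500],
    "End_Time_1": [0, 500], "End_time_2": [0, 800],
    "Sign_1": ["x", 10], "Sign_2": ["x", 80],
})
result = bin_signs(sample)
print(result)
```

The loop runs once per interval rather than once per bin, so the cost grows with the number of sign intervals, not with the number of 100 ms bins.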