Pandas: update a column with an if statement - python

My current dataframe looks like this:
midprice ema12 ema26 difference
0 0.002990 0.002990 0.002990 0.000000e+00
1 0.002990 0.002990 0.002990 4.227920e-08
2 0.003018 0.002994 0.002992 2.295777e-06
3 0.003025 0.002999 0.002994 4.579221e-06
4 0.003067 0.003009 0.003000 9.708765e-06
5 0.003112 0.003025 0.003008 1.718520e-05
What I tried is the following:
df.loc[:, 'action'] = np.select(condlist=[df.difference[0] < df.difference[-1] < df.difference[-2],
                                          df.ema12 < df.ema26],
                                choicelist=['buy', 'sell'],
                                default='do nothing')
So: update the column action with 'buy' if, three times in a row, the value of the difference column is smaller than its previous value. Any idea on how to proceed? Thanks!

I think you need:
m1 = df['difference'] < df['difference'].shift(-1)
m2 = df['difference'] < df['difference'].shift(-2)
m3 = df['difference'] < df['difference'].shift(-3)
df['action'] = np.select(condlist=[m1 | m2 | m3, df.ema12 < df.ema26],
                         choicelist=['buy', 'sell'],
                         default='do nothing')
print (df)
midprice ema12 ema26 difference action
0 0.002990 0.002990 0.002990 0.000000e+00 buy
1 0.002990 0.002990 0.002990 4.227920e-08 buy
2 0.003018 0.002994 0.002992 2.295777e-06 buy
3 0.003025 0.002999 0.002994 4.579221e-06 buy
4 0.003067 0.003009 0.003000 9.708765e-06 buy
5 0.003112 0.003025 0.003008 1.718520e-05 do nothing
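If the rule is literally "buy when difference has decreased three times in a row" (each value smaller than the one before it), a minimal sketch of that condition might look like the following. The column names come from the question; using shift(1) assumes the rows are already in time order, so shift(1) is the previous row.
import numpy as np

# each mask checks one step of "smaller than the previous value"
dec1 = df['difference'] < df['difference'].shift(1)
dec2 = df['difference'].shift(1) < df['difference'].shift(2)
dec3 = df['difference'].shift(2) < df['difference'].shift(3)

df['action'] = np.select(condlist=[dec1 & dec2 & dec3, df['ema12'] < df['ema26']],
                         choicelist=['buy', 'sell'],
                         default='do nothing')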

Related

df.apply returns NaN in pandas dataframe

I am trying to fill up a column in a dataframe with 1, 0 or -1 depending on some factors by doing it like this:
def set_order_signal(row):
    if (row.MACD > row.SIGNAL) and (df.iloc[i-1].MACD < df.iloc[i-1].SIGNAL):
        if (row.MACD < 0 and row.SIGNAL < 0) and (row.close > row['200EMA']):
            return 1
    elif (row.MACD < row.SIGNAL) and (df.iloc[i-1].MACD > df.iloc[i-1].SIGNAL):
        if (row.MACD > 0 and row.SIGNAL > 0) and (row.close < row['200EMA']):
            return -1
    else:
        return 0
Sometimes it works but in other rows it returns "NaN". I can't find a reason or solution for this.
The dataframe I work with looks like this:
time open high low close tick_volume spread real_volume EMA_LONG EMA_SHORT MACD SIGNAL HIST 200EMA OrderSignal
0 2018-01-09 05:00:00 1.19726 1.19751 1.19675 1.19717 1773 1 0 1.197605 1.197152 -0.000453 -0.000453 0.000000e+00 1.197170 0.0
1 2018-01-09 06:00:00 1.19717 1.19724 1.19659 1.19681 1477 1 0 1.197538 1.197099 -0.000439 -0.000445 6.258599e-06 1.196989 0.0
2 2018-01-09 07:00:00 1.19681 1.19718 1.19642 1.19651 1622 1 0 1.197452 1.197008 -0.000444 -0.000445 5.327180e-07 1.196828 0.0
3 2018-01-09 08:00:00 1.19650 1.19650 1.19518 1.19560 3543 1 0 1.197298 1.196789 -0.000509 -0.000466 -4.237181e-05 1.196516 NaN
I'm trying to apply it to the df with this:
df['OrderSignal'] = df.apply(set_order_signal, axis=1)
Is it a format problem?
Thank you already!
If you are looking for the index of the row that is passed to the function, you need to use row.name, not i.
Try this and see what you get for your results. I can't tell if the logic is correct in all cases, but these four rows return 0 each time:
def set_order_signal(row):
    if (row.MACD > row.SIGNAL) and (df.iloc[row.name-1].MACD < df.iloc[row.name-1].SIGNAL):
        if (row.MACD < 0 and row.SIGNAL < 0) and (row.close > row['200EMA']):
            return 1
    elif (row.MACD < row.SIGNAL) and (df.iloc[row.name-1].MACD > df.iloc[row.name-1].SIGNAL):
        if (row.MACD > 0 and row.SIGNAL > 0) and (row.close < row['200EMA']):
            return -1
    else:
        return 0
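Note that even with row.name the function can still produce NaN: when the outer condition is true but the inner one is false, no branch returns a value, so apply records None for that row. Adding a final return 0 at the end of the function closes that gap. A vectorized sketch of the same logic with shift, assuming the column names from the question, could look like this:
import numpy as np

# previous-row values of MACD and SIGNAL
prev_macd = df['MACD'].shift(1)
prev_signal = df['SIGNAL'].shift(1)

buy = ((df['MACD'] > df['SIGNAL']) & (prev_macd < prev_signal)
       & (df['MACD'] < 0) & (df['SIGNAL'] < 0) & (df['close'] > df['200EMA']))
sell = ((df['MACD'] < df['SIGNAL']) & (prev_macd > prev_signal)
        & (df['MACD'] > 0) & (df['SIGNAL'] > 0) & (df['close'] < df['200EMA']))

df['OrderSignal'] = np.select([buy, sell], [1, -1], default=0)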

Pandas DataFrames: Efficiently find next value in one column where another column has a greater value

The title describes my situation. I already have a working version of this, but it is very inefficient when scaled to large DataFrames (>1M rows). I was wondering if anyone has a better idea of doing this.
Example with solution and code
Create a new column next_time that has the next value of time where the price column is greater than the current row's price.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
series_to_concat = []
for price in df['price'].unique():
    index_equal_to_price = df[df['price'] == price].index
    series_time_greater_than_price = df[df['price'] > price]['time']
    time_greater_than_price_backfilled = series_time_greater_than_price.reindex(
        index_equal_to_price.union(series_time_greater_than_price.index)).fillna(method='backfill')
    series_to_concat.append(time_greater_than_price_backfilled.reindex(index_equal_to_price))
df['next_time'] = pd.concat(series_to_concat, sort=False)
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This gets me the desired result. When scaled up to some large dataframes, calculating this can take a few minutes. Does anyone have a better idea of how to approach this?
Edit: Clarification of constraints
We can assume the dataframe is sorted by time.
Another way to word this would be, given any row n (Time_n, Price_n), 0 <= n <= len(df) - 1, find x such that Time_x > Time_n AND Price_x > Price_n AND there is no y such that n < y < x with Price_y > Price_n.
These solutions were faster when I tested with %timeit on this sample, but I tested on a larger dataframe and they were much slower than your solution. It would be interesting to see if any of the 3 solutions are faster in your larger dataframe. I would look into dask or check out: https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
I hope someone else is able to post a more efficient solution. Some different answers below:
You can achieve this with a next one-liner that loops through both the time and price columns simultaneously with zip. The next function works much like a list comprehension, except that you need to use parentheses instead of brackets and it only returns the first matching value. You also need to pass None as the default parameter inside next, so that rows with no match don't raise an error.
You need to pass axis=1, since the function is applied to each row.
This should speed up performance, as you don't loop through the entire column: the iteration stops after returning the first value and moves on to the next row.
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
df['next_time'] = df.apply(lambda x: next((z for (y, z) in zip(df['price'], df['time'])
                                           if y > x['price'] if z > x['time']), None), axis=1)
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
As you can see, a list comprehension would return the same result, but in theory it will be a lot slower, as the total number of iterations would increase significantly, especially with a large dataframe.
df['next_time'] = (df.apply(lambda x: [z for (y, z) in zip(df['price'], df['time'])
                                       if y > x['price'] if z > x['time']], axis=1)).str[0]
df
Out[2]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
Another option is creating a function with some numpy and np.where():
def closest(x):
    try:
        lst = df.groupby(df['price'].cummax())['time'].transform('first')
        lst = np.asarray(lst)
        lst = lst[lst > x]
        idx = (np.abs(lst - x)).argmin()
        return lst[idx]
    except ValueError:
        pass

df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['time'].apply(lambda x: closest(x)))
This one returned a variation of your dataframe with 1,000,000 rows and 162,000 unique prices for me in less than 7 seconds. As such, I think that since you ran it on 660,000 rows and 12,000 unique prices, the increase in speed would be 100x-1000x.
The added complexity of your question is the condition that the closest higher price must be at a later time. This answer https://stackoverflow.com/a/53553226/6366770 helped me discover the bisect functions, but it didn't have your added complexity of having to rely on a time column. As such, I had to tackle the problem from a couple of different angles (as you mentioned in a comment regarding my np.where() answer, breaking it down into a couple of different methods).
import pandas as pd
import numpy as np

df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if x < a[mid]: hi = mid
        else: lo = mid+1
    return lo

def get_closest_higher(df, col, val):
    higher_idx = bisect_right(df[col].values, val)
    return higher_idx

df = df.sort_values(['price', 'time']).reset_index(drop=True)
df['next_time'] = df['price'].apply(lambda x: get_closest_higher(df, 'price', x))
df['next_time'] = df['next_time'].map(df['time'])
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])

df = df.sort_values('time').reset_index(drop=True)
df['next_time'] = np.where((df['price'].shift(-1) > df['price']),
                           df['time'].shift(-1),
                           df['next_time'])
df['next_time'] = df['next_time'].ffill()
df['next_time'] = np.where(df['next_time'] <= df['time'], np.nan, df['next_time'])
df
Out[1]:
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
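As a side note, the bisect_right above is essentially the pure-Python implementation from the standard library's bisect module, so it could simply be imported instead:
from bisect import bisect_right

def get_closest_higher(df, col, val):
    return bisect_right(df[col].values, val)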
David did come up with a great solution for finding the closest greater price at a later time. However, what I actually wanted was the very next occurrence of a greater price at a later time. Working with a coworker of mine, we found this solution.
Keep a stack containing tuples (index, price).
Iterate through all rows (index i):
While the stack is non-empty AND the top of the stack has a lesser price, pop it and fill in the popped index with times[i].
Push (i, prices[i]) onto the stack.
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': [15, 30, 45, 60, 75, 90], 'price': [10.00, 10.01, 10.00, 10.01, 10.02, 9.99]})
print(df)
time price
0 15 10.00
1 30 10.01
2 45 10.00
3 60 10.01
4 75 10.02
5 90 9.99
times = df['time'].to_numpy()
prices = df['price'].to_numpy()
stack = []
next_times = np.full(len(df), np.nan)
for i in range(len(df)):
    while stack and prices[i] > stack[-1][1]:
        stack_time_index, stack_price = stack.pop()
        next_times[stack_time_index] = times[i]
    stack.append((i, prices[i]))
df['next_time'] = next_times
print(df)
time price next_time
0 15 10.00 30.0
1 30 10.01 75.0
2 45 10.00 60.0
3 60 10.01 75.0
4 75 10.02 NaN
5 90 9.99 NaN
This solution actually performs very fast. I am not totally sure, but I believe the complexity would be close to O(n), since it is one full pass through the entire dataframe. The reason this performs so well is that the stack is essentially kept sorted, with the largest prices at the bottom and the smallest price at the top.
Here is my test with an actual dataframe in action
print(f'{len(df):,.0f} rows with {len(df["price"].unique()):,.0f} unique prices ranging from ${df["price"].min():,.2f} to ${df["price"].max():,.2f}')
667,037 rows with 11,786 unique prices ranging from $1,857.52 to $2,022.00
def find_next_time_with_greater_price(df):
    times = df['time'].to_numpy()
    prices = df['price'].to_numpy()
    stack = []
    next_times = np.full(len(df), np.nan)
    for i in range(len(df)):
        while stack and prices[i] > stack[-1][1]:
            stack_time_index, stack_price = stack.pop()
            next_times[stack_time_index] = times[i]
        stack.append((i, prices[i]))
    return next_times
%timeit -n10 -r10 df['next_time'] = find_next_time_with_greater_price(df)
434 ms ± 11.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
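If that 434 ms ever becomes a bottleneck on much larger frames, the same monotonic-stack pass is a natural candidate for numba. A rough sketch, assuming numba is installed (the function name here is made up):
import numpy as np
from numba import njit

@njit
def next_greater_times(times, prices):
    # same monotonic-stack pass as above, but compiled; the stack holds row positions only
    n = len(prices)
    out = np.full(n, np.nan)
    stack = np.empty(n, dtype=np.int64)
    top = -1
    for i in range(n):
        while top >= 0 and prices[i] > prices[stack[top]]:
            out[stack[top]] = times[i]
            top -= 1
        top += 1
        stack[top] = i
    return out

# df['next_time'] = next_greater_times(df['time'].to_numpy(dtype='float64'),
#                                      df['price'].to_numpy(dtype='float64'))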

How to insert rows with 0 data for missing quarters into a pandas dataframe?

I have a dataframe with specific Quota values for given quarters (YYYY-Qx format), and need to visualize them with some linecharts. However, some of the quarters are missing (as there was no Quota during those quarters).
Period Quota
2017-Q1 500
2017-Q3 600
2018-Q2 700
I want to add them (starting at 2017-Q1 until today, so 2019-Q2) to the dataframe with a default value of 0 in the Quota column. A desired output would be the following:
Period Quota
2017-Q1 500
2017-Q2 0
2017-Q3 600
2017-Q4 0
2018-Q1 0
2018-Q2 700
2018-Q3 0
2018-Q4 0
2019-Q1 0
2019-Q2 0
I tried
df['Period'] = pd.to_datetime(df['Period']).dt.to_period('Q')
And then resampling the df with 'Q' frequency, but I must be doing something wrong, as it doesn't help with anything.
Any help would be much appreciated.
Use:
df.index = pd.to_datetime(df['Period']).dt.to_period('Q')
end = pd.Period(pd.datetime.now(), freq='Q')
df = (df['Quota'].reindex(pd.period_range(df.index.min(), end), fill_value=0)
.rename_axis('Period')
.reset_index()
)
df['Period'] = df['Period'].dt.strftime('%Y-Q%q')
print (df)
Period Quota
0 2017-Q1 500
1 2017-Q2 0
2 2017-Q3 600
3 2017-Q4 0
4 2018-Q1 0
5 2018-Q2 700
6 2018-Q3 0
7 2018-Q4 0
8 2019-Q1 0
9 2019-Q2 0
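For completeness, the resample attempt mentioned in the question falls short here because resampling only fills the gaps between the existing minimum and maximum quarters (so it stops at 2018-Q2) and does not extend the index to the current quarter, which is why an explicit end period is used above. A rough sketch of that resample variant, with that limitation:
out = (df.set_index(pd.to_datetime(df['Period']))['Quota']
         .resample('QS')   # quarter-start bins, spanning only 2017-Q1 to 2018-Q2
         .asfreq()
         .fillna(0))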
An alternate solution based on a left join:
qtr = ['Q1', 'Q2', 'Q3', 'Q4']
finl = []
for i in range(2017, 2020):
    for j in qtr:
        finl.append(str(i) + '_' + j)
df1 = pd.DataFrame({'year_qtr': finl}).reset_index(drop=True)
df1.head(2)
original_value = ['2017_Q1', '2017_Q3', '2018_Q2']
df_original = pd.DataFrame({'year_qtr': original_value,
                            'value': [500, 600, 700]}).reset_index(drop=True)
final = pd.merge(df1, df_original, how='left', left_on=['year_qtr'], right_on=['year_qtr'])
final.fillna(0)
Output
year_qtr value
0 2017_Q1 500.0
1 2017_Q2 0.0
2 2017_Q3 600.0
3 2017_Q4 0.0
4 2018_Q1 0.0
5 2018_Q2 700.0
6 2018_Q3 0.0
7 2018_Q4 0.0
8 2019_Q1 0.0
9 2019_Q2 0.0
10 2019_Q3 0.0
11 2019_Q4 0.0
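The scaffold of quarters does not have to be built with nested loops; a sketch of the same left-join idea using pd.period_range (the underscore-separated labels are only there to match the merge keys above):
import pandas as pd

# quarter labels from 2017-Q1 up to the current quarter, formatted like '2017_Q1'
quarters = pd.period_range('2017Q1', pd.Timestamp.now(), freq='Q')
df1 = pd.DataFrame({'year_qtr': quarters.strftime('%Y_Q%q')})

df_original = pd.DataFrame({'year_qtr': ['2017_Q1', '2017_Q3', '2018_Q2'],
                            'value': [500, 600, 700]})

final = df1.merge(df_original, how='left', on='year_qtr').fillna({'value': 0})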

difference between two rows pandas

I have a dataframe as:
id|amount|date
20|-7|2017:12:25
20|-170|2017:12:26
20|7|2017:12:27
I want to subtract each row from another for the 'amount' column:
The output should be like:
id|amount|date|amount_diff
20|-7|2017:12:25|0
20|-170|2017:12:26|-177
20|7|2017:12:27|-163
I used the code:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['invoice_amount'].diff()
and obtained the output as:
id|amount|date|amount_diff
20|-7|2017:12:25|163
20|-170|2017:12:26|-218
20|48|2017:12:27|0
IIUC you need:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print (df)
id amount date amount_diff
0 20 -7 2017:12:25 0.0
1 20 -170 2017:12:26 -177.0
2 20 7 2017:12:27 -163.0
Because if you want to subtract, your solution should work:
df.sort_values(by='date',inplace=True)
df['amount_diff1'] = df['amount'].sub(df['amount'].shift()).fillna(0)
df['amount_diff2'] = df['amount'].diff().fillna(0)
print (df)
id amount date amount_diff1 amount_diff2
0 20 -7 2017:12:25 0.0 0.0
1 20 -170 2017:12:26 -163.0 -163.0
2 20 7 2017:12:27 177.0 177.0
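As a side note, the sample only shows a single id, but if there are several ids and the running sum or difference should restart for each one, the same operations can be applied per group (this is an assumption about the rest of the data):
df = df.sort_values(['id', 'date'])
# pairwise sum with the previous row, restarted per id (matches the desired output above)
df['amount_diff'] = (df['amount'] + df.groupby('id')['amount'].shift()).fillna(0)
# or, for an actual difference per id:
# df['amount_diff'] = df.groupby('id')['amount'].diff().fillna(0)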

Dataframe .iloc low speed performance

I am using the pandas library and I am having some performance problems with .iloc.
The idea of the main program is to go through each row and column of the dataframe and, whenever a condition is met, update that specific row and column of the dataframe with a new value.
Below are some lines of this code:
for cont, val in enumerate(id_truck_list):
    print cont
    for index, row in all_travel.iterrows():
        id_tr = int(all_travel.iloc[index, 0])
        begin = all_travel.iloc[index, 5]
        end = all_travel.iloc[index, 11]
        if int(val) == id_tr:
            #print "test1"
            #print id_tr
            #print begin_list[cont]
            #print begin
            #print end_list[cont]
            #print end
            if begin_list[cont] >= begin:
                if end_list[cont] <= begin:
                    pass
                else:
                    #print 'h1'
                    all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 3
            else:
                if begin < end_list[cont]:
                    if end <= end_list[cont]:
                        #print 'h2'
                        #print(all_travel.iloc[index, 18])
                        all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 5
                        #print(all_travel.iloc[index, 18])
                        #print str(index)
                    else:
                        #print 'h3'
                        all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 7
                else:
                    pass
This approach runs very slowly (more or less 10 rows per minute). Do you have any idea how to speed it up using the pandas library?
Below is the output of all_travel.head():
truck_id id_farm gatec_dist gps_go_dist gps_ret_dist t1gatec \
0 2010028.0 76.0 11 11.8617 0.211655 2016-03-09 00:24:00
1 2010028.0 1.0 16.2 9.86 0.0637544 2016-03-13 23:57:00
2 2010028.0 75.0 18 10.78 9.65 2016-03-18 09:17:00
3 2010028.0 62.0 6 8.51291 3.99291 2016-03-19 20:16:00
4 2010028.0 62.0 6 2.91 0.0428008 2016-03-21 03:00:00
t1gps t2gatec t2gps t3gatec \
0 03/09/2016 00:09:58 0 03/09/2016 00:43:46 0
1 03/13/2016 23:46:00 0 03/14/2016 00:53:10 0
2 03/18/2016 09:13:15 0 03/18/2016 10:17:14 0
3 03/19/2016 20:29:59 0 03/19/2016 21:22:40 0
4 03/21/2016 02:49:34 0 03/21/2016 03:38:59 0
t3gps t4gatec t4gps wait_mill \
0 03/09/2016 07:00:15 2016-03-09 02:14:55 03/09/2016 02:14:55 154.500000
1 03/14/2016 13:54:30 2016-03-14 01:12:58 03/14/2016 01:12:58 124.733333
2 03/18/2016 12:07:00 2016-03-18 12:37:41 03/18/2016 12:44:01 408.316667
3 03/19/2016 23:57:22 2016-03-19 22:00:08 03/19/2016 22:00:08 256.083333
4 03/22/2016 00:09:56 2016-03-21 04:01:20 03/21/2016 04:01:20 47.333333
go_field wait_field ret_mill tot_trav maintenance_level
0 33.800000 376.483333 -285.333333 124.950000 1
1 67.166667 781.333333 -761.533333 86.966667 1
2 63.983333 109.766667 37.016667 210.766667 1
3 52.683333 154.700000 -117.233333 90.150000 1
4 49.416667 1230.950000 -1208.600000 71.766667 1
I have come up with another solution that improved my speed a lot.
I converted parts of the dataframe to lists, because iterating over lists performs much better than repeated dataframe access.
The conclusion: now I need to wait two minutes for the answer, not 3 days.
Below is the modification:
for cont, val in enumerate(id_truck_list):
    for cont2, val2 in enumerate(id_truck_list2):
        id_tr = int(id_truck_list2[cont2])
        begin = begin_list2[cont2]
        end = end_list2[cont2]
        if int(id_truck_list[cont]) == id_tr:
            if begin_list[cont] >= begin:
                if begin_list[cont] >= end:
                    pass
                else:
                    maintenance_list[cont2] = maintenance_list[cont2] + 3
            else:
                if begin < end_list[cont]:
                    if end <= end_list[cont]:
                        #print 'h2'
                        maintenance_list[cont2] = maintenance_list[cont2] + 5
                        #print str(index)
                    else:
                        #print 'h3'
                        maintenance_list[cont2] = maintenance_list[cont2] + 7
                else:
                    pass
print 'list size ' + str(len(maintenance_list))
for cont3, val3 in enumerate(maintenance_list):
    print 'list update ' + str(cont3)
    all_travel.iloc[cont3, 18] = maintenance_list[cont3]
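The win here comes from doing all the comparisons on plain Python lists and touching the dataframe as little as possible. One further small step in the same direction is to skip the final element-wise .iloc loop and assign the whole column in one go (this assumes maintenance_list has exactly one entry per row of all_travel):
# assign the whole column at once instead of one .iloc write per row
all_travel.iloc[:, 18] = maintenance_list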
