df.apply returns NaN in pandas dataframe - python

I am trying to fill up a column in a dataframe with 1, 0 or -1 depending on some factors by doing it like this:
def set_order_signal(row):
    if (row.MACD > row.SIGNAL) and (df.iloc[i-1].MACD < df.iloc[i-1].SIGNAL):
        if (row.MACD < 0 and row.SIGNAL < 0) and (row.close > row['200EMA']):
            return 1
    elif (row.MACD < row.SIGNAL) and (df.iloc[i-1].MACD > df.iloc[i-1].SIGNAL):
        if (row.MACD > 0 and row.SIGNAL > 0) and (row.close < row['200EMA']):
            return -1
    else:
        return 0
Sometimes it works but in other rows it returns "NaN". I can't find a reason or solution for this.
The dataframe I work with looks like this:
time open high low close tick_volume spread real_volume EMA_LONG EMA_SHORT MACD SIGNAL HIST 200EMA OrderSignal
0 2018-01-09 05:00:00 1.19726 1.19751 1.19675 1.19717 1773 1 0 1.197605 1.197152 -0.000453 -0.000453 0.000000e+00 1.197170 0.0
1 2018-01-09 06:00:00 1.19717 1.19724 1.19659 1.19681 1477 1 0 1.197538 1.197099 -0.000439 -0.000445 6.258599e-06 1.196989 0.0
2 2018-01-09 07:00:00 1.19681 1.19718 1.19642 1.19651 1622 1 0 1.197452 1.197008 -0.000444 -0.000445 5.327180e-07 1.196828 0.0
3 2018-01-09 08:00:00 1.19650 1.19650 1.19518 1.19560 3543 1 0 1.197298 1.196789 -0.000509 -0.000466 -4.237181e-05 1.196516 NaN
I'm trying to apply it to the df with this:
df['OrderSignal'] = df.apply(set_order_signal, axis=1)
Is it a format problem?
Thank you already!

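If you are looking for the index of the row that is sent to the function, you need to use row.name, not i (with the default 0..n-1 index, the label equals the position).
The NaN values come from the rows where an outer condition is true but the inner one is not: the function then ends without a return, yields None, and pandas stores that as NaN. Returning 0 at the end covers those rows. Try this and see what you get for your results. I can't tell if the logic is correct in all cases, but the four rows shown return 0 each time:
def set_order_signal(row):
    prev = df.iloc[row.name - 1]  # for the first row this wraps around to the last row
    if (row.MACD > row.SIGNAL) and (prev.MACD < prev.SIGNAL):
        if (row.MACD < 0 and row.SIGNAL < 0) and (row.close > row['200EMA']):
            return 1
    elif (row.MACD < row.SIGNAL) and (prev.MACD > prev.SIGNAL):
        if (row.MACD > 0 and row.SIGNAL > 0) and (row.close < row['200EMA']):
            return -1
    # Reached when no signal fired, including when only an outer test matched
    return 0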
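
The same rule can also be written without apply. This is only a minimal sketch, not the answerer's code: it assumes the column names from the question, `order_signals` is a hypothetical helper name, and the demo numbers are made up. `shift(1)` supplies the previous row, and `np.select` falls back to 0, so no row is left as NaN.

```python
import numpy as np
import pandas as pd


def order_signals(df: pd.DataFrame) -> pd.Series:
    """Return 1 (buy), -1 (sell) or 0 for every row; hypothetical helper."""
    prev_macd = df["MACD"].shift(1)      # previous row's MACD (NaN for the first row)
    prev_signal = df["SIGNAL"].shift(1)  # previous row's SIGNAL

    # Comparisons against NaN are False, so the first row never counts as a cross.
    cross_up = (df["MACD"] > df["SIGNAL"]) & (prev_macd < prev_signal)
    cross_down = (df["MACD"] < df["SIGNAL"]) & (prev_macd > prev_signal)

    buy = cross_up & (df["MACD"] < 0) & (df["SIGNAL"] < 0) & (df["close"] > df["200EMA"])
    sell = cross_down & (df["MACD"] > 0) & (df["SIGNAL"] > 0) & (df["close"] < df["200EMA"])

    # default=0 covers every row where neither condition holds.
    return pd.Series(np.select([buy, sell], [1, -1], default=0), index=df.index)


# Made-up demo data, chosen so that each branch is exercised.
demo = pd.DataFrame({
    "MACD":   [-0.3, -0.1, 0.3, 0.1, 0.25],
    "SIGNAL": [-0.2, -0.2, 0.2, 0.2, 0.20],
    "close":  [10.0, 10.0, 10.0, 8.0, 8.0],
    "200EMA": [9.0, 9.0, 9.0, 9.0, 9.0],
})
signals = order_signals(demo)
print(signals.tolist())
```

The last demo row crosses upward while MACD is positive, which is exactly the case that produced NaN with the nested ifs; here it comes out as 0.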

Related

Pandas: AttributeError: 'float' object has no attribute 'MACD'

I would like to compare 2 rows in a pandas dataframe but I always get an Error saying: AttributeError: 'float' object has no attribute 'MACD'.
This is the df:
time open high low close tick_volume spread real_volume EMA_LONG EMA_SHORT MACD SIGNAL HIST 200EMA
0 2018-01-05 03:00:00 1.20775 1.20794 1.20700 1.20724 2887 1 0 1.206134 1.206803 0.000669 0.000669 0.000000 1.207240
1 2018-01-05 04:00:00 1.20723 1.20743 1.20680 1.20710 2349 1 0 1.206216 1.206849 0.000633 0.000649 -0.000016 1.207170
2 2018-01-05 05:00:00 1.20709 1.20755 1.20709 1.20744 1869 1 0 1.206318 1.206941 0.000622 0.000638 -0.000016 1.207261
Now I want to count how many times it would buy and sell based on some information in the rows, so I'm trying to iterate through it like this:
buy = 0
sell = 0
for i, row in df.iterrows():
    if i == 0:
        continue
    if row.MACD > row.SIGNAL and row[i - 1].MACD < row[i - 1].SIGNAL:
        if row.HIST < 0 and row.MACD > row['200EMA'] and row.SIGNAL > row['200EMA']:
            buy += 1
    elif row.MACD < row.SIGNAL and row[i - 1].MACD > row[i - 1].SIGNAL:
        if row.HIST > 0 and row.MACD < row['200EMA'] and row.SIGNAL < row['200EMA']:
            sell += 1
print("BUY: " + buy + "SELL: " + sell)
I am getting the following Error:
AttributeError: 'float' object has no attribute 'MACD'
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-75-ff4a2b3629bc> in <module>
8 if row.HIST < 0 and row.MACD > row['200EMA'] and row.SIGNAL > row['200EMA']:
9 buy += 1
---> 10 elif row.MACD < row.SIGNAL and row[i - 1].MACD > row[i - 1].SIGNAL:
11 if row.HIST > 0 and row.MACD < row['200EMA'] and row.SIGNAL < row['200EMA']:
12 sell += 1
AttributeError: 'float' object has no attribute 'MACD'
I know this error has been asked about here before, but the solutions there didn't help me.
Thank you already!
Your problem is here: row[i - 1].MACD
Inside the loop, row is a single row of the dataframe, so row[i - 1] gives you the value stored at position i - 1 within that row, not another row.
If i = 1, you get row[0], a single value of the current row, and not the previous row of the dataframe. You should probably replace it with df.iloc[i - 1].MACD.
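
Put together, the corrected loop could look like the sketch below. It is an illustration under assumptions, not the answerer's code: `count_signals` is a hypothetical name, the demo numbers are made up, and rows are addressed by position so it does not depend on the index labels. The final print in the question would also fail because it concatenates strings with integers, so an f-string is used here.

```python
import pandas as pd


def count_signals(df: pd.DataFrame) -> tuple:
    """Count buy and sell crossings; hypothetical helper using positional access."""
    buy = 0
    sell = 0
    for pos in range(1, len(df)):   # start at 1 so a previous row always exists
        row = df.iloc[pos]          # current row
        prev = df.iloc[pos - 1]     # previous row of the DataFrame
        if row.MACD > row.SIGNAL and prev.MACD < prev.SIGNAL:
            if row.HIST < 0 and row.MACD > row["200EMA"] and row.SIGNAL > row["200EMA"]:
                buy += 1
        elif row.MACD < row.SIGNAL and prev.MACD > prev.SIGNAL:
            if row.HIST > 0 and row.MACD < row["200EMA"] and row.SIGNAL < row["200EMA"]:
                sell += 1
    return buy, sell


# Made-up demo data: one buy crossing, one sell crossing, one crossing that is filtered out.
demo = pd.DataFrame({
    "MACD":   [1.0, 3.0, 2.0, 2.5],
    "SIGNAL": [2.0, 2.5, 2.4, 2.2],
    "HIST":   [-1.0, -0.1, 0.2, 0.3],
    "200EMA": [0.5, 0.5, 5.0, 5.0],
})
buy, sell = count_signals(demo)
print(f"BUY: {buy} SELL: {sell}")
```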

Pandas: update a column with an if statement

My current dataframe looks like this:
midprice ema12 ema26 difference
0 0.002990 0.002990 0.002990 0.000000e+00
1 0.002990 0.002990 0.002990 4.227920e-08
2 0.003018 0.002994 0.002992 2.295777e-06
3 0.003025 0.002999 0.002994 4.579221e-06
4 0.003067 0.003009 0.003000 9.708765e-06
5 0.003112 0.003025 0.003008 1.718520e-05
What I tried is the following:
df.loc[:, 'action'] = np.select(condlist=[df.difference[0] < df.difference[-1] < df.difference[-2], df.ema12 < df.ema26 ], choicelist=['buy', 'sell'], default='do nothing')
So the column action should be updated with buy if, three times in a row, the value of the column difference is smaller than its previous value. Any idea on how to proceed? Thanks!
I think you need:
m1= df['difference'] < df['difference'].shift(-1)
m2= df['difference'] < df['difference'].shift(-2)
m3= df['difference'] < df['difference'].shift(-3)
df['action'] = np.select(condlist=[m1 | m2 | m3, df.ema12 < df.ema26 ],
                         choicelist=['buy', 'sell'],
                         default='do nothing')
print (df)
midprice ema12 ema26 difference action
0 0.002990 0.002990 0.002990 0.000000e+00 buy
1 0.002990 0.002990 0.002990 4.227920e-08 buy
2 0.003018 0.002994 0.002992 2.295777e-06 buy
3 0.003025 0.002999 0.002994 4.579221e-06 buy
4 0.003067 0.003009 0.003000 9.708765e-06 buy
5 0.003112 0.003025 0.003008 1.718520e-05 do nothing
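
The answer above marks a row as buy when it is smaller than any of the next three values. If the intent is instead "the difference fell three rows in a row", which is only one possible reading of the question, a sketch could look like this. The column names come from the question; the demo numbers and the variable names `falling` and `three_falls` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up demo data with the question's column names.
df = pd.DataFrame({
    "difference": [5.0, 4.0, 3.0, 2.0, 2.5, 2.0],
    "ema12":      [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "ema26":      [2.0, 0.0, 0.0, 0.0, 0.0, 2.0],
})

# True where the value is smaller than the value in the previous row.
falling = df["difference"].diff() < 0

# True where the current row and the two rows before it were all falling.
three_falls = falling.astype(int).rolling(3).sum() == 3

# The first matching condition wins, exactly as in the answer's np.select call.
df["action"] = np.select(
    condlist=[three_falls, df["ema12"] < df["ema26"]],
    choicelist=["buy", "sell"],
    default="do nothing",
)
print(df)
```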

Create a column by applying a conditional statement to multiple other columns of dtypes datetime and integer

I have a dataframe called df that looks similar to this (except the Visits go up to 74 and there are several hundred clients - I have simplified it here).
Client Visit_1 Visit_2 Visit_3 Visit_4 Visit_5 Eligible Active
Client_1 2016-05-10 2016-05-25 2016-06-10 2016-06-25 2016-07-10 0 0
Client_2 2017-05-10 2017-05-25 2017-06-10 2017-06-25 2017-07-10 0 0
Client_3 2018-09-10 2018-09-26 2018-10-10 2018-10-26 2018-11-10 1 0
Client_4 2018-10-10 2018-10-26 2018-11-10 2018-11-26 2018-12-10 1 1
I want to create a new column called Visit in Window with two values, 0 and 1. I want to set Visit in Window to equal 1 if the Client is Eligible (value of '1' in the Eligible column) AND if the Client is Active (value of '1' in the Active column) AND if any one of the 5 columns from Visit_1 to Visit_5 contains a date that falls between 2018-10-25 and 2018-12-15.
So, I want to end up with a dataframe that looks like this:
Client Visit_1 Visit_2 Visit_3 Visit_4 Visit_5 Eligible Active Visit_in_Window
Client_1 2016-05-10 2016-05-25 2016-06-10 2016-06-25 2016-07-10 0 0 0
Client_2 2017-05-10 2017-05-25 2017-06-10 2017-06-25 2017-07-10 0 0 0
Client_3 2018-09-10 2018-09-26 2018-10-10 2018-10-26 2018-11-10 1 0 0
Client_4 2018-10-10 2018-10-26 2018-11-10 2018-11-26 2018-12-10 1 1 1
I can do this for one column by using the following code
df['Visit_in_Window'] = 0
df.loc[((df.Eligible == 1) & (df.Active == 1) &
        (df.Visit_1 > '2018-10-24') &
        (df.Visit_1 < '2018-12-16')), 'Visit_in_Window'] = 1
However, I do not know how to perform this action on multiple columns at the same time. Can anyone help?
I think this is one way to do it:
import pandas as pd
from collections import OrderedDict
df = pd.DataFrame(OrderedDict([
    ("Client", ["Client_1", "Client_2", "Client_3", "Client_4"]),
    ("Visit_1", ["2016-05-10", "2017-05-10", "2018-09-10", "2018-10-10"]),
    ("Visit_2", ["2016-05-25", "2017-05-25", "2018-09-26", "2018-10-26"]),
    ("Visit_3", ["2016-06-10", "2017-06-10", "2018-10-10", "2018-11-10"]),
    ("Visit_4", ["2016-06-25", "2017-06-25", "2018-10-26", "2018-11-26"]),
    ("Visit_5", ["2016-07-10", "2017-07-10", "2018-11-10", "2018-12-10"]),
    ("Eligible", [0, 0, 1, 1]),
    ("Active", [0, 0, 0, 1])
]))
df["Visit_in_Window"] = (
    df["Eligible"] & df["Active"] & (
        (("2018-10-25" < df["Visit_1"]) & (df["Visit_1"] < "2018-12-15")) |
        (("2018-10-25" < df["Visit_2"]) & (df["Visit_2"] < "2018-12-15")) |
        (("2018-10-25" < df["Visit_3"]) & (df["Visit_3"] < "2018-12-15")) |
        (("2018-10-25" < df["Visit_4"]) & (df["Visit_4"] < "2018-12-15")) |
        (("2018-10-25" < df["Visit_5"]) & (df["Visit_5"] < "2018-12-15"))
    )
)
print(df.to_string(index=False))
Which prints:
Client Visit_1 Visit_2 Visit_3 Visit_4 Visit_5 Eligible Active Visit_in_Window
Client_1 2016-05-10 2016-05-25 2016-06-10 2016-06-25 2016-07-10 0 0 False
Client_2 2017-05-10 2017-05-25 2017-06-10 2017-06-25 2017-07-10 0 0 False
Client_3 2018-09-10 2018-09-26 2018-10-10 2018-10-26 2018-11-10 1 0 False
Client_4 2018-10-10 2018-10-26 2018-11-10 2018-11-26 2018-12-10 1 1 True
Update
For a variable number N of columns from Visit_1 to Visit_N, this should work:
N = 5
visits = pd.DataFrame([(("2018-10-25" < df["Visit_" + str(i)]) & (df["Visit_" + str(i)] < "2018-12-15")) for i in range(1, N + 1)])
print(visits)
df["Visit_in_Window"] = df["Eligible"] & df["Active"] & visits.any()
Which prints:
0 1 2 3
Visit_1 False False False False
Visit_2 False False False True
Visit_3 False False False True
Visit_4 False False True True
Visit_5 False False True True
As you can see, only columns 2 and 3 (client 3 and 4) have True where they had visits inside the date range. any will take care of the "merging" which was done beforehand with bitwise operator |.
One of the possible ways to do it is the same as you suggested in the question, but with additional 'or' statements
df['Visit_in_Window'] = 0
df.loc[
    (df.Eligible == 1) &
    (df.Active == 1) &
    ( ((df.Visit_1 > '2018-10-24') & (df.Visit_1 < '2018-12-16')) |
      ((df.Visit_2 > '2018-10-24') & (df.Visit_2 < '2018-12-16')) |
      ((df.Visit_3 > '2018-10-24') & (df.Visit_3 < '2018-12-16')) |
      ((df.Visit_4 > '2018-10-24') & (df.Visit_4 < '2018-12-16')) |
      ((df.Visit_5 > '2018-10-24') & (df.Visit_5 < '2018-12-16'))
    ) ,
    'Visit_in_Window'] = 1
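
Since the question mentions up to 74 visit columns, writing one line per column does not scale. The sketch below is one possible way to cover any number of them. It assumes the visit columns are named Visit_1, Visit_2, ... as in the question and treats both window dates as inclusive; the names `start`, `end`, `visit_cols` and `in_window` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Client":   ["Client_1", "Client_2", "Client_3", "Client_4"],
    "Visit_1":  ["2016-05-10", "2017-05-10", "2018-09-10", "2018-10-10"],
    "Visit_2":  ["2016-05-25", "2017-05-25", "2018-09-26", "2018-10-26"],
    "Visit_3":  ["2016-06-10", "2017-06-10", "2018-10-10", "2018-11-10"],
    "Visit_4":  ["2016-06-25", "2017-06-25", "2018-10-26", "2018-11-26"],
    "Visit_5":  ["2016-07-10", "2017-07-10", "2018-11-10", "2018-12-10"],
    "Eligible": [0, 0, 1, 1],
    "Active":   [0, 0, 0, 1],
})

start = pd.Timestamp("2018-10-25")
end = pd.Timestamp("2018-12-15")

# Select Visit_1 ... Visit_N by name; the pattern skips Visit_in_Window itself.
visit_cols = df.filter(regex=r"^Visit_\d+$").columns

# One boolean column per visit: does the date fall inside the window (inclusive)?
in_window = df[visit_cols].apply(lambda s: pd.to_datetime(s).between(start, end))

# A client qualifies if eligible, active, and any visit is inside the window.
df["Visit_in_Window"] = (
    (df["Eligible"] == 1) & (df["Active"] == 1) & in_window.any(axis=1)
).astype(int)

print(df.to_string(index=False))
```

Converting with `pd.to_datetime` makes the comparison work for real datetime columns as well as for ISO-formatted strings, and `astype(int)` gives the 0/1 values the question asks for instead of False/True.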

Dataframe .iloc low speed performance

I am using the pandas library and I am having performance problems with .iloc.
The idea of the main program is to go through every row of the dataframe and, if a row meets one of the conditions, update a specific column of that row with a new value.
Some lines of the code follow:
for cont, val in enumerate(id_truck_list):
    print cont
    for index, row in all_travel.iterrows():
        id_tr = int(all_travel.iloc[index, 0])
        begin = all_travel.iloc[index, 5]
        end = all_travel.iloc[index, 11]
        if int(val) == id_tr:
            #print "test1"
            #print id_tr
            #print begin_list[cont]
            #print begin
            #print end_list[cont]
            #print end
            if begin_list[cont] >= begin:
                if end_list[cont] <= begin:
                    pass
                else:
                    #print 'h1'
                    all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 3
            else:
                if begin < end_list[cont] :
                    if end <= end_list[cont]:
                        #print 'h2'
                        #print(all_travel.iloc[index, 18])
                        all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 5
                        #print(all_travel.iloc[index, 18])
                        #print str(index)
                    else:
                        #print 'h3'
                        all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 7
                else:
                    pass
This approach runs very slowly (roughly 10 rows per minute). Do you have any idea how to speed it up using the pandas library?
Below is the output of all_travel.head():
truck_id id_farm gatec_dist gps_go_dist gps_ret_dist t1gatec \
0 2010028.0 76.0 11 11.8617 0.211655 2016-03-09 00:24:00
1 2010028.0 1.0 16.2 9.86 0.0637544 2016-03-13 23:57:00
2 2010028.0 75.0 18 10.78 9.65 2016-03-18 09:17:00
3 2010028.0 62.0 6 8.51291 3.99291 2016-03-19 20:16:00
4 2010028.0 62.0 6 2.91 0.0428008 2016-03-21 03:00:00
t1gps t2gatec t2gps t3gatec \
0 03/09/2016 00:09:58 0 03/09/2016 00:43:46 0
1 03/13/2016 23:46:00 0 03/14/2016 00:53:10 0
2 03/18/2016 09:13:15 0 03/18/2016 10:17:14 0
3 03/19/2016 20:29:59 0 03/19/2016 21:22:40 0
4 03/21/2016 02:49:34 0 03/21/2016 03:38:59 0
t3gps t4gatec t4gps wait_mill \
0 03/09/2016 07:00:15 2016-03-09 02:14:55 03/09/2016 02:14:55 154.500000
1 03/14/2016 13:54:30 2016-03-14 01:12:58 03/14/2016 01:12:58 124.733333
2 03/18/2016 12:07:00 2016-03-18 12:37:41 03/18/2016 12:44:01 408.316667
3 03/19/2016 23:57:22 2016-03-19 22:00:08 03/19/2016 22:00:08 256.083333
4 03/22/2016 00:09:56 2016-03-21 04:01:20 03/21/2016 04:01:20 47.333333
go_field wait_field ret_mill tot_trav maintenance_level
0 33.800000 376.483333 -285.333333 124.950000 1
1 67.166667 781.333333 -761.533333 86.966667 1
2 63.983333 109.766667 37.016667 210.766667 1
3 52.683333 154.700000 -117.233333 90.150000 1
4 49.416667 1230.950000 -1208.600000 71.766667 1
I found another solution that improved the speed a lot.
I converted the relevant parts of the dataframe to lists, because plain lists performed much better here than the dataframe.
As a result, I now wait two minutes for the answer instead of 3 days.
The modification follows:
for cont, val in enumerate(id_truck_list):
    for cont2, val2 in enumerate(id_truck_list2):
        id_tr = int(id_truck_list2[cont2])
        begin = begin_list2[cont2]
        end = end_list2[cont2]
        if int(id_truck_list[cont]) == id_tr:
            if begin_list[cont] >= begin:
                if begin_list[cont] >= end:
                    pass
                else:
                    maintenance_list[cont2] = maintenance_list[cont2] + 3
            else:
                if begin < end_list[cont] :
                    if end <= end_list[cont]:
                        #print 'h2'
                        # increment was cut off in the original post; 5 is taken from the question's 'h2' branch
                        maintenance_list[cont2] = maintenance_list[cont2] + 5
                        #print str(index)
                    else:
                        #print 'h3'
                        # increment was cut off in the original post; 7 is taken from the question's 'h3' branch
                        maintenance_list[cont2] = maintenance_list[cont2] + 7
                else:
                    pass
print 'list size ' + str(len(maintenance_list))
for cont3, val3 in enumerate(maintenance_list):
    print 'list update ' + str(cont3)
    all_travel.iloc[cont3, 18] = maintenance_list[cont3]
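
Another option is to keep the outer loop over the intervals but replace the inner row loop with boolean masks. The following is only a sketch under assumptions: it follows the conditions of the question's code (not the list version above), it assumes that positions 0, 5, 11 and 18 correspond to the columns truck_id, t1gatec, t4gatec and maintenance_level shown in the head() output, and `add_maintenance` and the demo data are made up for illustration.

```python
import pandas as pd


def add_maintenance(travel, truck_ids, begins, ends):
    """Return the updated maintenance_level column; hypothetical helper."""
    truck = travel["truck_id"]
    begin = travel["t1gatec"]
    end = travel["t4gatec"]
    level = travel["maintenance_level"].copy()

    for tid, b, e in zip(truck_ids, begins, ends):
        same = truck == tid                # rows of the same truck
        starts_before = begin <= b         # question: begin_list[cont] >= begin
        overlaps = begin < e               # question: not (end_list[cont] <= begin)

        case3 = same & starts_before & overlaps
        later = same & ~starts_before & overlaps
        case5 = later & (end <= e)         # 'h2' branch
        case7 = later & (end > e)          # 'h3' branch

        # Each mask selects disjoint rows, so at most one increment applies per row.
        level += 3 * case3 + 5 * case5 + 7 * case7
    return level


# Made-up demo data with the assumed column names.
travel = pd.DataFrame({
    "truck_id": [1, 1, 1, 2],
    "t1gatec": pd.to_datetime(["2016-03-09", "2016-03-13", "2016-03-13", "2016-03-09"]),
    "t4gatec": pd.to_datetime(["2016-03-11", "2016-03-14", "2016-03-20", "2016-03-11"]),
    "maintenance_level": [1, 1, 1, 1],
})
new_level = add_maintenance(
    travel,
    truck_ids=[1],
    begins=pd.to_datetime(["2016-03-12"]),
    ends=pd.to_datetime(["2016-03-15"]),
)
print(new_level.tolist())
```

Each pass of the loop touches all rows at once, so the cost grows with the number of intervals instead of the number of interval and row pairs handled one by one through .iloc.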

Split string and append parts in running list [Python]

I have a very long list that contains the same pattern. Here is an original example:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
As you can see, there is one line with measurement-data within the string that starts with "Sonntag"
My target is:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000
05:10 10 244.679 0 0 !!
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
I managed to read the txt file into a list, here called "data_list_splitted", catch this one line across the whole txt file, split it and extract the part with the measurements:
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s"%(ii[4],ii[5],ii[6],ii[7],ii[8])
But I can't manage to break this line and add the measurement values to the running list!
I think this shouldn't be that difficult?
Any ideas?
Thank you very much!
You can create another list and insert values into it
new_data_list_splitted = []
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s"%(ii[0],ii[1],ii[2],ii[3],ii[4])
        new_data_list_splitted.append(txt_line)
        txt_line = "%s;%s;%s;%s;%s"%(ii[4],ii[5],ii[6],ii[7],ii[8])
        new_data_list_splitted.append(txt_line)
    else:
        new_data_list_splitted.append(i)
print new_data_list_splitted #this will have a new row for measurement value
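
An alternative that does not depend on the line length or on fixed field positions is to look for the first time stamp in each line and split there. This is only a sketch: it assumes the lines look like the space-separated example in the question and that a measurement always starts with an HH:MM token; `split_date_lines` and `TIME` are made-up names.

```python
import re

# A measurement starts with a time such as 05:10.
TIME = re.compile(r"\b\d{2}:\d{2}\b")


def split_date_lines(lines):
    """Split lines that carry text in front of the first HH:MM token."""
    result = []
    for line in lines:
        match = TIME.search(line)
        # Split only if there is non-blank text before the time stamp.
        if match and line[:match.start()].strip():
            result.append(line[:match.start()].rstrip(" ;"))  # date part
            result.append(line[match.start():])               # measurement part
        else:
            result.append(line)
    return result


lines = [
    "05:00 10 244.680 0 0",
    "Datum Zeit Intervall müM Q Art",
    "Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0",
    "05:20 10 244.688 0 0",
]
new_lines = split_date_lines(lines)
for line in new_lines:
    print(line)
```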
