DataFrame .iloc slow performance

I am using the pandas library and am having performance problems with .iloc.
The program visits every row of a DataFrame and, when certain conditions are met, updates a specific column of that row with a new value.
Here is the relevant part of the code:
for cont, val in enumerate(id_truck_list):
    print cont
    for index, row in all_travel.iterrows():
        id_tr = int(all_travel.iloc[index, 0])
        begin = all_travel.iloc[index, 5]
        end = all_travel.iloc[index, 11]
        if int(val) == id_tr:
            #print "test1"
            #print id_tr
            #print begin_list[cont]
            #print begin
            #print end_list[cont]
            #print end
            if begin_list[cont] >= begin:
                if end_list[cont] <= begin:
                    pass
                else:
                    #print 'h1'
                    all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 3
            else:
                if begin < end_list[cont] :
                    if end <= end_list[cont]:
                        #print 'h2'
                        #print(all_travel.iloc[index, 18])
                        all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 5
                        #print(all_travel.iloc[index, 18])
                        #print str(index)
                    else:
                        #print 'h3'
                        all_travel.iloc[index, 18] = all_travel.iloc[index, 18] + 7
                else:
                    pass
This approach is very slow (roughly 10 rows per minute). Is there a faster way to do this with pandas?
Here is the output of all_travel.head():
truck_id id_farm gatec_dist gps_go_dist gps_ret_dist t1gatec \
0 2010028.0 76.0 11 11.8617 0.211655 2016-03-09 00:24:00
1 2010028.0 1.0 16.2 9.86 0.0637544 2016-03-13 23:57:00
2 2010028.0 75.0 18 10.78 9.65 2016-03-18 09:17:00
3 2010028.0 62.0 6 8.51291 3.99291 2016-03-19 20:16:00
4 2010028.0 62.0 6 2.91 0.0428008 2016-03-21 03:00:00
t1gps t2gatec t2gps t3gatec \
0 03/09/2016 00:09:58 0 03/09/2016 00:43:46 0
1 03/13/2016 23:46:00 0 03/14/2016 00:53:10 0
2 03/18/2016 09:13:15 0 03/18/2016 10:17:14 0
3 03/19/2016 20:29:59 0 03/19/2016 21:22:40 0
4 03/21/2016 02:49:34 0 03/21/2016 03:38:59 0
t3gps t4gatec t4gps wait_mill \
0 03/09/2016 07:00:15 2016-03-09 02:14:55 03/09/2016 02:14:55 154.500000
1 03/14/2016 13:54:30 2016-03-14 01:12:58 03/14/2016 01:12:58 124.733333
2 03/18/2016 12:07:00 2016-03-18 12:37:41 03/18/2016 12:44:01 408.316667
3 03/19/2016 23:57:22 2016-03-19 22:00:08 03/19/2016 22:00:08 256.083333
4 03/22/2016 00:09:56 2016-03-21 04:01:20 03/21/2016 04:01:20 47.333333
go_field wait_field ret_mill tot_trav maintenance_level
0 33.800000 376.483333 -285.333333 124.950000 1
1 67.166667 781.333333 -761.533333 86.966667 1
2 63.983333 109.766667 37.016667 210.766667 1
3 52.683333 154.700000 -117.233333 90.150000 1
4 49.416667 1230.950000 -1208.600000 71.766667 1

I found another solution that improved the speed considerably.
I copied the relevant parts of the DataFrame into plain Python lists, because element-by-element access on lists is much faster than on a DataFrame.
As a result, I now wait about two minutes for the answer instead of 3 days.
Here is the modified code:
for cont, val in enumerate(id_truck_list):
    for cont2, val2 in enumerate(id_truck_list2):
        id_tr = int(id_truck_list2[cont2])
        begin = begin_list2[cont2]
        end = end_list2[cont2]
        if int(id_truck_list[cont]) == id_tr:
            if begin_list[cont] >= begin:
                if begin_list[cont] >= end:
                    pass
                else:
                    maintenance_list[cont2] = maintenance_list[cont2] + 3
            else:
                if begin < end_list[cont] :
                    if end <= end_list[cont]:
                        #print 'h2'
                        maintenance_list[cont2] = maintenance_list[cont2] + 5
                        #print str(index)
                    else:
                        #print 'h3'
                        maintenance_list[cont2] = maintenance_list[cont2] + 7
                else:
                    pass
print 'list size ' + str(len(maintenance_list))
# Write the updated values back into column 18 (maintenance_level)
for cont3, val3 in enumerate(maintenance_list):
    print 'list update ' + str(cont3)
    all_travel.iloc[cont3, 18] = maintenance_list[cont3]
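
A further option, not part of the original answer, is to avoid Python-level loops altogether. The sketch below assumes the three lists (id_truck_list, begin_list, end_list) can be put into a second DataFrame, here called events with the hypothetical column names ev_begin and ev_end. It also assumes columns 5, 11 and 18 are t1gatec, t4gatec and maintenance_level, as the head() output suggests. It follows the conditions of the list-based version above. The toy data is made up for illustration.

```python
import numpy as np
import pandas as pd

def maintenance_increment(ev_begin, ev_end, begin, end):
    """Return 0, 3, 5 or 7 per pair, mirroring the nested if/else of the list-based loop."""
    event_starts_in_trip = (ev_begin >= begin) & (ev_begin < end)
    trip_starts_in_event = (ev_begin < begin) & (begin < ev_end)
    conditions = [
        event_starts_in_trip,
        trip_starts_in_event & (end <= ev_end),
        trip_starts_in_event & (end > ev_end),
    ]
    return np.select(conditions, [3, 5, 7], default=0)

# Made-up trips (stand-in for all_travel) and events (stand-in for the three lists)
travel = pd.DataFrame({
    "truck_id": [1, 1, 2],
    "t1gatec": pd.to_datetime(["2016-03-09 00:00", "2016-03-10 00:00", "2016-03-09 00:00"]),
    "t4gatec": pd.to_datetime(["2016-03-09 06:00", "2016-03-10 06:00", "2016-03-09 06:00"]),
    "maintenance_level": [1, 1, 1],
})
events = pd.DataFrame({
    "truck_id": [1, 1],
    "ev_begin": pd.to_datetime(["2016-03-09 02:00", "2016-03-09 23:00"]),
    "ev_end": pd.to_datetime(["2016-03-09 08:00", "2016-03-10 07:00"]),
})

# One row per (trip, event) pair that shares a truck id
pairs = travel.rename_axis("row").reset_index().merge(events, on="truck_id")
pairs["inc"] = maintenance_increment(
    pairs["ev_begin"], pairs["ev_end"], pairs["t1gatec"], pairs["t4gatec"]
)

# Sum the increments per trip and add them to the existing level
added = pairs.groupby("row")["inc"].sum()
travel["maintenance_level"] = (
    travel["maintenance_level"].add(added, fill_value=0).astype(int)
)
print(travel["maintenance_level"].tolist())
```

The merge builds one row for every trip/event pair of the same truck, so memory use grows with the number of such pairs; for very large inputs it may be necessary to process one truck at a time.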

Related

Converting Matplotlib plots to Plotly Line chart

I have code that draws multiple plots using matplotlib. My code is given below.
index = LFRO_tdms["Measurement Config"].as_dataframe()["Frequencies"]
for vdd in set_vdds:
    for DUT in unique_DUTs:
        temp_df = df[ (df["Serial"] == str(DUT)) & (df["VDD (V)"]==vdd) ]
        temp_df["Serial ID - Temp(C)"] = temp_df["Serial"] + " - " + temp_df["Target Temp (°C)"].astype(dtype=str)
        df_LFRO_data_to_plot = df_LFRO[temp_df["#Magnitude"].to_numpy(dtype=str)]
        df_LFRO_data_to_plot.index = index
        df_LFRO_data_to_plot.columns = temp_df["Serial ID - Temp(C)"]
        df_LFRO_data_to_plot.plot(logx=True, colormap="jet")
        plt.title("Unit: "+ DUT + " Vdd: " + str(vdd))
After running this code, it outputs multiple plots (12 or 13 of them), as shown below.
I need to implement the same thing using plotly. Below is my code, but it outputs the plots in a different way.
fig = go.Figure()
index = LFRO_tdms["Measurement Config"].as_dataframe()["Frequencies"]
for vdd in set_vdds:
    for DUT in unique_DUTs:
        temp_df = df[ (df["Serial"] == str(DUT)) & (df["VDD (V)"]==vdd) ]
        temp_df["Serial ID - Temp(C)"] = temp_df["Serial"] + " - " + temp_df["Target Temp (°C)"].astype(dtype=str)
        df_LFRO_data_to_plot = df_LFRO[temp_df["#Magnitude"].to_numpy(dtype=str)]
        df_LFRO_data_to_plot.index = index
        df_LFRO_data_to_plot.columns = temp_df["Serial ID - Temp(C)"]
        # df_LFRO_data_to_plot.plot(logx=True, colormap="jet")
        # plt.title("Unit: "+ DUT + " Vdd: " + str(vdd))
        fig.add_traces(go.Scatter(x=df_LFRO_data_to_plot.index, y=df_LFRO_data_to_plot.columns, mode='lines', name = ("Unit: "+ DUT + " Vdd: " + str(vdd))))
fig.show()
I need the plots to look like the ones in the first image. What mistake am I making?
Test Station Position Serial Timestamp ETC Temp (°C) ETC Pressure (kPa) ETC Humidity (%RH) Ref Mic Temp (°C) Site Temp (°C) Target Temp (°C) VDD (V) LFRO (Hz) #Magnitude
0 1 1 1 07LX-1 2022-11-17 23:05:51.591926 23.151848 99.334515 3.564379 14.349645 -30.041135 -35.0 1.60 9.221194 0
1 1 1 2 07LX-2 2022-11-17 23:05:51.591926 23.151848 99.334515 3.564379 14.349645 -30.257592 -35.0 1.60 8.995556 1
2 1 1 3 07LX-3 2022-11-17 23:05:51.591926 23.151848 99.334515 3.564379 14.349645 -30.511629 -35.0 1.60 9.452866 2
3 1 1 4 07LX-4 2022-11-17 23:05:51.591926 23.151848 99.334515 3.564379 14.349645 -29.863173 -35.0 1.60 9.299079 3
4 1 1 1 07LX-1 2022-11-17 23:09:41.373825 22.475499 99.338778 3.574306 12.311989 -28.114924 -35.0 1.66 7.390171 4

df.apply returns NaN in pandas dataframe

I am trying to fill up a column in a dataframe with 1, 0 or -1 depending on some factors by doing it like this:
def set_order_signal(row):
    if (row.MACD > row.SIGNAL) and (df.iloc[i-1].MACD < df.iloc[i-1].SIGNAL):
        if (row.MACD < 0 and row.SIGNAL < 0) and (row.close > row['200EMA']):
            return 1
    elif (row.MACD < row.SIGNAL) and (df.iloc[i-1].MACD > df.iloc[i-1].SIGNAL):
        if (row.MACD > 0 and row.SIGNAL > 0) and (row.close < row['200EMA']):
            return -1
    else:
        return 0
Sometimes it works but in other rows it returns "NaN". I can't find a reason or solution for this.
The dataframe I work with looks like this:
time open high low close tick_volume spread real_volume EMA_LONG EMA_SHORT MACD SIGNAL HIST 200EMA OrderSignal
0 2018-01-09 05:00:00 1.19726 1.19751 1.19675 1.19717 1773 1 0 1.197605 1.197152 -0.000453 -0.000453 0.000000e+00 1.197170 0.0
1 2018-01-09 06:00:00 1.19717 1.19724 1.19659 1.19681 1477 1 0 1.197538 1.197099 -0.000439 -0.000445 6.258599e-06 1.196989 0.0
2 2018-01-09 07:00:00 1.19681 1.19718 1.19642 1.19651 1622 1 0 1.197452 1.197008 -0.000444 -0.000445 5.327180e-07 1.196828 0.0
3 2018-01-09 08:00:00 1.19650 1.19650 1.19518 1.19560 3543 1 0 1.197298 1.196789 -0.000509 -0.000466 -4.237181e-05 1.196516 NaN
I'm trying to apply it to the df with this:
df['OrderSignal'] = df.apply(set_order_signal, axis=1)
Is it a format problem?
Thanks in advance!

If you need the index of the row that is passed to the function, use row.name, not i.
Try this and see what results you get. I can't tell whether the logic is correct in all cases, but for the four rows shown it returns 0 each time.
def set_order_signal(row):
    if (row.MACD > row.SIGNAL) and (df.iloc[row.name-1].MACD < df.iloc[row.name-1].SIGNAL):
        if (row.MACD < 0 and row.SIGNAL < 0) and (row.close > row['200EMA']):
            return 1
    elif (row.MACD < row.SIGNAL) and (df.iloc[row.name-1].MACD > df.iloc[row.name-1].SIGNAL):
        if (row.MACD > 0 and row.SIGNAL > 0) and (row.close < row['200EMA']):
            return -1
    else:
        return 0
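
One more thing worth checking, as a possible explanation rather than a confirmed diagnosis: when an outer condition is true but its inner condition is false, the function reaches its end without a return statement and returns None, which would show up as NaN. The sketch below is an alternative formulation that avoids apply entirely by comparing each row with the previous one via shift and defaulting to 0. The column names come from the question; the function name order_signal and the sample values are made up for illustration. Unlike df.iloc[row.name-1], which wraps around to the last row for the first row, shift gives the first row no predecessor, so it always gets 0 here.

```python
import numpy as np
import pandas as pd

def order_signal(df):
    """Return 1, -1 or 0 per row based on a MACD/SIGNAL crossover (vectorized sketch)."""
    prev_macd = df["MACD"].shift(1)
    prev_signal = df["SIGNAL"].shift(1)

    # Crossover relative to the previous row; comparisons with NaN are False
    cross_up = (df["MACD"] > df["SIGNAL"]) & (prev_macd < prev_signal)
    cross_down = (df["MACD"] < df["SIGNAL"]) & (prev_macd > prev_signal)

    buy = cross_up & (df["MACD"] < 0) & (df["SIGNAL"] < 0) & (df["close"] > df["200EMA"])
    sell = cross_down & (df["MACD"] > 0) & (df["SIGNAL"] > 0) & (df["close"] < df["200EMA"])

    # Every row that is neither buy nor sell gets 0, so no row is left undefined
    return pd.Series(np.select([buy, sell], [1, -1], default=0), index=df.index)

# Made-up sample values
sample = pd.DataFrame({
    "MACD": [-0.5, -0.3, 0.4, 0.2],
    "SIGNAL": [-0.4, -0.35, 0.3, 0.3],
    "close": [1.2, 1.2, 1.1, 1.1],
    "200EMA": [1.1, 1.1, 1.2, 1.2],
})
sample["OrderSignal"] = order_signal(sample)
print(sample["OrderSignal"].tolist())
```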

difference between two rows pandas

I have a DataFrame like this:
id|amount|date
20|-7|2017:12:25
20|-170|2017:12:26
20|7|2017:12:27
I want to subtract consecutive rows in the 'amount' column.
The output should look like this:
id|amount|date|amount_diff
20|-7|2017:12:25|0
20|-170|2017:12:26|-177
20|7|2017:12:27|-163
I used this code:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['invoice_amount'].diff()
and obtained this output:
id|amount|date|amount_diff
20|-7|2017:12:25|163
20|-170|2017:12:26|-218
20|48|2017:12:27|0

IIUC you need:
df.sort_values(by='date',inplace=True)
df['amount_diff'] = df['amount'].add(df['amount'].shift()).fillna(0)
print (df)
id amount date amount_diff
0 20 -7 2017:12:25 0.0
1 20 -170 2017:12:26 -177.0
2 20 7 2017:12:27 -163.0
If you do want to subtract, your solution should already work:
df.sort_values(by='date',inplace=True)
df['amount_diff1'] = df['amount'].sub(df['amount'].shift()).fillna(0)
df['amount_diff2'] = df['amount'].diff().fillna(0)
print (df)
id amount date amount_diff1 amount_diff2
0 20 -7 2017:12:25 0.0 0.0
1 20 -170 2017:12:26 -163.0 -163.0
2 20 7 2017:12:27 177.0 177.0

Compare some columns from some tables using python

I need to compare two values, MC and JT, from two tables:
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -8.25705 0.219113 -0.000800014 20.8926 41.4347 5.75852 0 4.13067 0
1 423 17950 18150 210 180 17400 18430 1 0 -4.26426 0.586578 -0.053 77.22 85.2104 22.0534 0 3.551 0
2 468 41790 42020 240 50 41360 42380 0 0 7.82681 0.181248 -0.00269566 90.0646 92.7698 5.0841 0 4.19304 0
and
EID MolIdx TEStart TEEnd TE TZone TBulkBE TBulkAE MC JT zavg vabs vzavg xyd.x xyd.y xydist nnbw vabsprev midhb
0 370 36700 36800 110 20 36150 37090 0 0 -0.846655 0.0218695 2.59898e-05 2.0724 4.1259 0.583259 10 0.412513 0
1 423 17950 18150 210 180 17400 18780 1 0 -0.453311 0.058732 -0.00526783 7.7403 8.52544 2.19627 0 0.354126 0
2 468 41790 42020 240 70 41360 42380 0 0 0.743716 0.0181613 -0.000256186 9.08777 9.21395 0.502506 0 0.419265 0
I need to do this using the csv module. I know how to do it with pandas and xlrd, but not with csv.
Desired output:
Number_of_strings MC JT
and print the rows where the values differ.
import csv
old = csv.reader(open('old.csv', 'rb'), delimiter=',')
row1 = old.next()
new = csv.reader(open('new.csv', 'rb'), delimiter=',')
row2 = new.next()
if (row1[8] == row2[8]) and (row1[9] == row2[9]):
    continue
else:
    print row1[0] + ':' + row1[8] + '!=' + row2[8]

You can try something like the following:
old = list(csv.reader(open('old.csv', 'rb'), delimiter=','))
new = list(csv.reader(open('new.csv', 'rb'), delimiter=','))
old = zip(*old)
new = zip(*new)
print ['%s-%s-%s'%(str(a), str(b), str(c)) for a, b, c in zip(old[0], new[8], old[8]) if b != c]
First, we get a list of lists. zip(*x) will transpose a list of lists. The rest should be easy to decipher ...
You can actually put whatever you want within the string ...
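
The snippets above are written for Python 2 (print statement, 'rb' mode, indexable zip). As a sketch of the same idea for Python 3, the example below reads both files row by row with csv.DictReader and collects the rows where MC or JT differ. It assumes comma-separated files with a header row containing EID, MC and JT; the function name differing_rows and the in-memory sample data are made up for illustration.

```python
import csv
import io

def differing_rows(old_file, new_file, columns=("MC", "JT")):
    """Return (row number, EID, old values, new values) for rows that differ in `columns`."""
    old_reader = csv.DictReader(old_file)
    new_reader = csv.DictReader(new_file)
    diffs = []
    # zip pairs the rows by position and stops at the shorter file
    for row_no, (old_row, new_row) in enumerate(zip(old_reader, new_reader), start=1):
        if any(old_row[c] != new_row[c] for c in columns):
            diffs.append((
                row_no,
                old_row["EID"],
                [old_row[c] for c in columns],
                [new_row[c] for c in columns],
            ))
    return diffs

# Made-up sample data; real files would be opened with open('old.csv', newline='')
old_csv = io.StringIO("EID,MolIdx,MC,JT\n0,370,0,0\n1,423,1,0\n2,468,0,0\n")
new_csv = io.StringIO("EID,MolIdx,MC,JT\n0,370,0,0\n1,423,0,0\n2,468,0,1\n")

result = differing_rows(old_csv, new_csv)
for row_no, eid, old_vals, new_vals in result:
    print(row_no, eid, old_vals, "!=", new_vals)
```

Values are compared as strings, so "0" and "0.0" would count as different; convert them with int or float first if that matters for your data.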

Split string and append parts in running list [Python]

I have a very long list that follows the same pattern throughout. Here is an excerpt from the original:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
As you can see, one line of measurement data is embedded in the line that starts with "Sonntag".
My target is:
04:50 10 244.685 0 0
05:00 10 244.680 0 0
HBCHa 9990 Seite 762
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 31. Dezember 2000
05:10 10 244.679 0 0 !!
05:20 10 244.688 0 0
05:30 10 244.694 0 0
05:40 10 244.688 0 0
I managed to read the txt file into a list, here called "data_list_splitted", find this one line in the whole file, split it, and extract the part with the measurements:
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s"%(ii[4],ii[5],ii[6],ii[7],ii[8])
But I can't work out how to break this line and insert the measurement values into the running list.
I think this shouldn't be that difficult.
Any ideas?
Thank you very much!

You can create another list and append the values to it:
new_data_list_splitted = []
for i in data_list_splitted:
    if len(i) >= 40:
        ii = i.split(";")
        txt_line = "%s;%s;%s;%s;%s"%(ii[0],ii[1],ii[2],ii[3],ii[4])
        new_data_list_splitted.append(txt_line)
        txt_line = "%s;%s;%s;%s;%s"%(ii[4],ii[5],ii[6],ii[7],ii[8])
        new_data_list_splitted.append(txt_line)
    else:
        new_data_list_splitted.append(i)
print new_data_list_splitted #this will have a new row for measurement value
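
An alternative sketch, assuming the lines are whitespace-separated exactly as shown in the question (not ";"-separated): instead of relying on the line length, look for a measurement pattern (time, interval, level, two flags) at the end of a line that has other text in front of it, and split there. The regular expression and the function name split_embedded_measurements are made up for illustration and may need adjusting to the real file.

```python
import re

# Text, then whitespace, then "HH:MM int float int int" at the end of the line
EMBEDDED = re.compile(
    r"^(?P<prefix>.*\S)\s+(?P<data>\d{2}:\d{2}\s+\d+\s+\d+\.\d+\s+\d+\s+\d+)\s*$"
)

def split_embedded_measurements(lines):
    """Return a new list where 'text + measurement' lines are split into two entries."""
    result = []
    for line in lines:
        match = EMBEDDED.match(line)
        if match:
            result.append(match.group("prefix"))
            result.append(match.group("data"))
        else:
            # Pure measurement lines and pure header lines are kept unchanged
            result.append(line)
    return result

lines = [
    "05:00 10 244.680 0 0",
    "Datum Zeit Intervall müM Q Art",
    "Sonntag, 31. Dezember 2000 05:10 10 244.679 0 0",
    "05:20 10 244.688 0 0",
]
cleaned = split_embedded_measurements(lines)
for entry in cleaned:
    print(entry)
```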
