Rename a persistent object within a definition (Python)

Minimal Question:
def smooth(indicator, aggregation, tick):
    storage.ZZZ = []
    storage.ZZZZ = []
is the pertinent part of my definition. When I call that definition I'm using:
MA_now_smooth = smooth(MA, IN, I)[-1]
where MA is an input array and IN and I are constants; the definition is given in full below, but ultimately it returns the last value appended to storage.ZZZZ. What I want is to create custom storage objects that are named according to the "indicator" input, so that the persistent variables don't overlap when calling this definition for myriad array inputs.
i.e.
smooth(MA, IN, I)[-1]
should create:
storage.ZZZ_MA
storage.ZZZZ_MA
but
smooth(MA2, IN, I)[-1]
should create:
storage.ZZZ_MA2
storage.ZZZZ_MA2
In Depth Question:
I'm creating a Simple Moving Average smoothing definition for TA-lib indicators at tradewave.net; TA-lib is a library of black-box functions that give "Financial Technical Analysis" array outputs for things like "moving average", "exponential moving average", "stochastic", etc. My definition is a secondary simple smoothing of these TA-lib functions.
I'm having to do this because when "aggregating" candles counting backwards from current, I'm getting "wiggly" outputs; you can read more about that here if you need background: https://discuss.tradewave.net/t/aggregating-candles-some-thoughts
My definition code works well to create a list of smoothed values when smoothing a single indicator "MA"; a TA-lib array:
import talib

def smooth(indicator, aggregation, tick):
    import math
    A = int(math.ceil(aggregation/tick))
    if info.tick == 0:
        storage.ZZZ = []
        storage.ZZZZ = []
    storage.ZZZ.append(indicator[-1])
    storage.ZZZ = storage.ZZZ[-A:]
    ZZZ = sum(storage.ZZZ) / len(storage.ZZZ)
    storage.ZZZZ.append(ZZZ)
    storage.ZZZZ = storage.ZZZZ[-250:]
    return storage.ZZZZ

def tick():
    I = info.interval
    period = 10
    IN = 3600
    instrument = pairs.btc_usd
    C = data(interval=IN)[instrument].warmup_period('close')
    MA = talib.MA(C, timeperiod=period, matype=0)
    MA_now = MA[-1]
    MA_now_smooth = smooth(MA, IN, I)[-1]
    plot('MA', MA_now)
    plot('MA_smooth', MA_now_smooth)
However, when I attempt to smooth more than one indicator with the same definition, it fails because the persistent variables in the definition are the same for both MA and MA2. This does not work:
import talib

def smooth(indicator, aggregation, tick):
    import math
    A = int(math.ceil(aggregation/tick))
    if info.tick == 0:
        storage.ZZZ = []
        storage.ZZZZ = []
    storage.ZZZ.append(indicator[-1])
    storage.ZZZ = storage.ZZZ[-A:]
    ZZZ = sum(storage.ZZZ) / len(storage.ZZZ)
    storage.ZZZZ.append(ZZZ)
    storage.ZZZZ = storage.ZZZZ[-250:]
    return storage.ZZZZ

def tick():
    I = info.interval
    period = 10
    IN = 3600
    instrument = pairs.btc_usd
    C = data(interval=IN)[instrument].warmup_period('close')
    MA = talib.MA(C, timeperiod=period, matype=0)
    MA2 = talib.MA(C, timeperiod=2*period, matype=0)
    MA_now = MA[-1]
    MA2_now = MA2[-1]
    MA_now_smooth = smooth(MA, IN, I)[-1]
    MA2_now_smooth = smooth(MA2, IN, I)[-1]
    plot('MA', MA_now)
    plot('MA2', MA2_now)
    plot('MA_smooth', MA_now_smooth)
    plot('MA2_smooth', MA2_now_smooth)
What I would like to do... and don't understand how to do:
I'd like for the definition to create a new persistent storage object for each new input, and I'd like the names of my objects to pick up the name of the "indicator" input, i.e.:
storage.ZZZ_MA
storage.ZZZZ_MA
ZZZ_MA
for the "MA" smoothing and
storage.ZZZ_MA2
storage.ZZZZ_MA2
ZZZ_MA2
for "MA2" smoothing
I would like to be able to reuse this definition with many different array inputs for "indicator" and for each instance use the name of the indicator array appended to the persistent object names used in the definition. For example:
storage.ZZZ_MA3
storage.ZZZ_MA4
etc.
In the instances below, info.interval is my tick size of 15 minutes (900 sec) and my aggregation is 1 hour (3600 sec).
With the single output of "MA" and correct smoothing
With dual outputs of "MA" and "MA2" I'm getting incorrect smoothing
In the second image I'm looking for two "smooth" lines, one in the middle of the wiggly red plot and the other in the middle of the wiggly blue plot. Instead I'm getting two identical wiggly lines (purple & orange) that split the difference. I understand why, but I don't know how to fix it.
1) please show me how
2) please tell me what the thing I'm trying to do is "called" and point me to some tags/posts where I can learn more.
Thanks for your help!
LP

Make storage a dict, and use string keys rather than trying to create and access dynamic variables?
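For example, a minimal sketch of that idea (label is a hypothetical string argument such as 'MA' or 'MA2' supplied by the caller, and it assumes storage supports item access with string keys, which the poster's follow-up code uses):

def smooth(indicator, aggregation, tick, label):
    key = 'ZZZ_%s' % label             # e.g. 'ZZZ_MA', 'ZZZ_MA2'
    if info.tick == 0:
        storage[key] = []
    storage[key].append(indicator[-1])
    # ... same trimming and averaging as before, just keyed by label ...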

Well, I've arrived at an interim solution.
While I like this solution, as it's doing everything I need, I would like to eliminate the redundant "label" input. Is there any way for me to reference the name of my input parameter/argument "indicator" instead of its object, so that I could return to my original 3 input parameters rather than 4?
I tried this:
def smooth(indicator, aggregation, tick):
    import math
    A = int(math.ceil(aggregation/tick))
    ZZZ = 'ZZZ_%s' % dict([(t.__name__, t) for t in indicator])
    ZZZZ = 'ZZZZ_%s' % dict([(t.__name__, t) for t in indicator])
    if info.tick == 0:
        storage[ZZZ] = []
        storage[ZZZZ] = []
    storage[ZZZ].append(indicator[-1])
    storage[ZZZ] = storage[ZZZ][-A:]
    ZZZZZ = sum(storage[ZZZ]) / len(storage[ZZZ])
    storage[ZZZZ].append(ZZZZZ)
    storage[ZZZZ] = storage[ZZZZ][-250:]
    return storage[ZZZZ]
but I get:
File "", line 259, in File "", line 31, in tick File "", line 6, in smooth AttributeError: 'numpy.float64' object has no attribute 'name'
Here is my current 4 argument definition smoothing 4 different TA-lib moving averages. This same definition can be used with many other aggregated TA-lib indicators. It should work with ANY aggregate/tick size ratio including 1:1.
import talib

def smooth(indicator, aggregation, tick, label):
    import math
    A = int(math.ceil(aggregation/tick))
    ZZZ = 'ZZZ_%s' % label
    ZZZZ = 'ZZZZ_%s' % label
    if info.tick == 0:
        storage[ZZZ] = []
        storage[ZZZZ] = []
    storage[ZZZ].append(indicator[-1])
    storage[ZZZ] = storage[ZZZ][-A:]
    ZZZZZ = sum(storage[ZZZ]) / len(storage[ZZZ])
    storage[ZZZZ].append(ZZZZZ)
    storage[ZZZZ] = storage[ZZZZ][-250:]
    return storage[ZZZZ]

def tick():
    I = info.interval
    period = 10
    IN = 3600
    instrument = pairs.btc_usd
    C = data(interval=IN)[instrument].warmup_period('close')
    MA1 = talib.MA(C, timeperiod=period, matype=0)
    MA2 = talib.MA(C, timeperiod=2*period, matype=0)
    MA3 = talib.MA(C, timeperiod=3*period, matype=0)
    MA4 = talib.MA(C, timeperiod=4*period, matype=0)
    MA1_now = MA1[-1]
    MA2_now = MA2[-1]
    MA3_now = MA3[-1]
    MA4_now = MA4[-1]
    MA1_now_smooth = smooth(MA1, IN, I, 'MA1')[-1]
    MA2_now_smooth = smooth(MA2, IN, I, 'MA2')[-1]
    MA3_now_smooth = smooth(MA3, IN, I, 'MA3')[-1]
    MA4_now_smooth = smooth(MA4, IN, I, 'MA4')[-1]
    plot('MA1', MA1_now)
    plot('MA2', MA2_now)
    plot('MA3', MA3_now)
    plot('MA4', MA4_now)
    plot('MA1_smooth', MA1_now_smooth)
    plot('MA2_smooth', MA2_now_smooth)
    plot('MA3_smooth', MA3_now_smooth)
    plot('MA4_smooth', MA4_now_smooth)
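To keep from typing each label twice in tick(), the calls can also be driven from a dict (just a sketch; the indicators dict is new here, everything else is unchanged). Inside tick(), after C is computed:

    indicators = {
        'MA1': talib.MA(C, timeperiod=period, matype=0),
        'MA2': talib.MA(C, timeperiod=2*period, matype=0),
        'MA3': talib.MA(C, timeperiod=3*period, matype=0),
        'MA4': talib.MA(C, timeperiod=4*period, matype=0),
    }
    for name, arr in indicators.items():
        plot(name, arr[-1])
        plot(name + '_smooth', smooth(arr, IN, I, name)[-1])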
h/t james for collaboration


Weird "demonic" xtick in matplotlib (jpeg artifacts? No way...)

So I'm comparing NBA betting lines between different sportsbooks over time.
Procedure:
Open pickle file of scraped data
Plot the scraped data
The pickle file is a dictionary of NBA betting lines over time. Each of the two teams are their own nested dictionary. Each key in these team-specific dictionaries represents a different sportsbook. The values for these sportsbook keys are lists of tuples, representing timeseries data. It looks roughly like this:
dicto = {
    'Time': <time that the game starts>,
    'Team1': {
        Market1: [ (time1, value1), (time2, value2), etc...],
        Market2: [ (time1, value1), (time2, value2), etc...],
        etc...
    },
    'Team2': {
        <SAME FORM AS TEAM1>
    }
}
There are no issues with scraping or manipulating this data. The issue comes when I plot it. Here is the code for the script that unpickles and plots these dictionaries:
import matplotlib.pyplot as plt
import pickle, datetime, os, time, re

IMAGEPATH = 'Images'
reg = re.compile(r'[A-Z]+#[A-Z]+[0-9|-]+')
noDate = re.compile(r'[A-Z]+#[A-Z]+')

# Turn 1 into '01'
def zeroPad(num):
    if num < 10:
        return '0' + str(num)
    else:
        return num

# Turn list of time-series tuples into an x list and y list
def unzip(lst):
    x = []
    y = []
    for i in lst:
        x.append(f'{i[0].hour}:{zeroPad(i[0].minute)}')
        y.append(i[1])
    return x, y

# Make exactly 5, evenly spaced xticks
def prune(xticks):
    last = len(xticks)
    first = 0
    mid = int(len(xticks) / 2) - 1
    upMid = int( mid + (last - mid) / 2)
    downMid = int( (mid - first) / 2)
    out = []
    count = 0
    for i in xticks:
        if count in [last, first, mid, upMid, downMid]:
            out.append(i)
        else:
            out.append('')
        count += 1
    return out

def plot(filename, choice):
    IMAGEPATH = 'Images'
    IMAGEPATH = os.path.join(IMAGEPATH, choice)
    with open(filename, 'rb') as pik:
        dicto = pickle.load(pik)
    fig, axs = plt.subplots(2)
    gameID = noDate.search(filename).group(0)
    tm = dicto['Time']
    fig.suptitle(gameID + '\n' + str(tm))
    i = 0
    for team in dicto.keys():
        axs[i].set_title(team)
        if team == 'Time':
            continue
        for market in dicto[team].keys():
            lst = dicto[team][market]
            x, y = unzip(lst)
            axs[i].plot(x, y, label=market)
            axs[i].set_xticks(prune(x))
            axs[i].set_xticklabels(rotation=45, labels=x)
        i += 1
    plt.tight_layout()
    # Finish
    outputFile = reg.search(filename).group(0)
    date = (datetime.datetime.today() - datetime.timedelta(hours=6)).date()
    fig.savefig(os.path.join(IMAGEPATH, str(date), f'{outputFile}.png'))
    plt.close()
Here is the image that results from calling the plot function on one of the dictionaries that I described above. It is pretty much exactly as I intended it, except for one very strange and bothersome problem.
You will notice that the bottom right tick looks haunted, demonic, jpeggy, whatever you want to call it. I am highly suspicious that this problem occurs in the prune function, which I use to set the xtick values of the plot.
The reason that I prune the values with a function like this is because these dictionaries are continuously updated, so setting a static number of xticks would not work. And if I don't prune the xticks, they end up becoming unreadable due to overlapping one another.
I am quite confused as to what could cause an xtick to look like this. It happens consistently, for every dictionary, every time. Before I added the prune function (when the xticks were unpruned and overlapping one another), this issue did not occur. So when I say I'm suspicious that the prune function is the cause, I am really quite certain.
I will be happy to share an instance of one of these dictionaries, but they are saved as .pickle files, and I'm pretty sure it's bad practice to share pickle files over the internet. I have been warned about potential malware, so I'll just stay away from that. But if you need to see the dictionary, I can take the time to prettily print one and share a screenshot. Any help is greatly appreciated!
Matplotlib does this when many xticks or yticks are plotted on the same value. It is normal. If you limit the number of times that specific value is plotted, you can make it appear indistinguishable from the rest of the xticks.
Plot a simple example to test this out and you will see for yourself.
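For instance, here is a minimal sketch of the effect (not the poster's data; it just stacks the same tick label many times at one position):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(10), range(10))
ax.set_xticks([5] * 50)                      # fifty ticks stacked on the same position
ax.set_xticklabels(['5:00'] * 50, rotation=45)
plt.show()

The overdrawn label comes out bold and ragged, much like the "demonic" tick; drawing it only once makes it look normal again.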

How to delete smaller multiple objects in my scene using Blender script?

I am using Blender 2.8. I want to import an object into Blender that is made up of a few pieces that aren't connected. So I want to split the object up and only export the largest of the pieces.
So let's say there are 3 pieces in one object, one big and two small. I'm able to turn this object into three objects, each containing one of the pieces. I would like to delete the two smaller objects and only keep the largest one. I'm thinking maybe I could somehow find the surface area of the three different objects and only keep the largest while deleting all the others? I'm pretty new at Blender.
bpy.ops.import_mesh.stl(filepath='path/of/file.stl')
bpy.ops.mesh.separate(type='LOOSE')
amount_of_pieces = len(context.selected_objects)
if amount_of_pieces > 1:
    highest_surface_area = 0
    # the rest is pseudocode
    for object in scene:
        if object.area > highest_surface_area:
            highest_surface_area = object.area
        else:
            bpy.ops.object.delete()
bpy.ops.export_mesh.stl(filepath='path/of/new/file.stl')
The steps would be :-
import file
break into multiple objects
for safety, get a list of mesh objects
list the surface area of each object
get the max from the list of areas
delete the not biggest objects
export the largest
cleanup
We don't need to use bmesh to get the surface area, the normal mesh data includes polygon.area.
Using list comprehension, we can get most steps into one line each.
import bpy

# import and separate
file = (r'path/of/file.stl')
bpy.ops.import_mesh.stl(filepath=file)
bpy.ops.mesh.separate(type='LOOSE')

# list of mesh objects
mesh_objs = [o for o in bpy.context.scene.objects
             if o.type == 'MESH']

# dict with surface area of each object
obj_areas = {o: sum([f.area for f in o.data.polygons])
             for o in mesh_objs}

# which is biggest
big_obj = max(obj_areas, key=obj_areas.get)

# select and delete not biggest
[o.select_set(o is not big_obj) for o in mesh_objs]
bpy.ops.object.delete(use_global=False, confirm=False)

# export
bpy.ops.export_mesh.stl(filepath='path/of/new/file.stl')

# cleanup
bpy.ops.object.select_all(action='SELECT')
bpy.ops.object.delete(use_global=False, confirm=False)
I was able to write code that works for this; however, it is very long and chaotic. I would appreciate it if anyone could give me some advice on cleaning it up.
import bpy
import os
import bmesh

context = bpy.context
file = (r'path\to\file.stl')
bpy.ops.import_mesh.stl(filepath=file)
fileName = os.path.basename(file)[:-4].capitalize()
bpy.ops.mesh.separate(type='LOOSE')
bpy.ops.object.select_all(action='SELECT')
piece = len(context.selected_objects)
bpy.ops.object.select_all(action='DESELECT')
high = 0
if piece > 1:
    bpy.data.objects[fileName].select_set(True)
    obj = bpy.context.active_object
    bm = bmesh.new()
    bm.from_mesh(obj.data)
    area = sum(f.calc_area() for f in bm.faces)
    high = area
    bm.free()
    bpy.ops.object.select_all(action='DESELECT')
    for x in range(1, piece):
        name = fileName + '.00' + str(x)
        object = bpy.data.objects[name]
        context.view_layer.objects.active = object
        bpy.data.objects[name].select_set(True)
        obj = bpy.context.active_object
        bm = bmesh.new()
        bm.from_mesh(obj.data)
        newArea = sum(f.calc_area() for f in bm.faces)
        bm.free()
        if newArea > high:
            high = newArea
            bpy.ops.object.select_all(action='DESELECT')
        else:
            bpy.ops.object.delete()
            bpy.ops.object.select_all(action='DESELECT')
    if area != high:
        bpy.data.objects[fileName].select_set(True)
        bpy.ops.object.delete()
bpy.ops.export_mesh.stl(filepath='path/to/export/file.stl')
bpy.ops.object.select_all(action='SELECT')
bpy.ops.object.delete(use_global=False, confirm=False)

Storing output from Python function necessary despite not using output

I am trying to understand why I must store the output of a Python function (regardless of the name of the variable I use, and regardless of whether I subsequently use that variable). I think this is more general to Python and not specific to the NEURON software, so I am asking it here on Stack Overflow.
The line of interest is here:
clamp_output = attach_current_clamp(cell)
If I just write attach_current_clamp(cell), without storing the output of the function in a variable, the code does not work (the plot is empty), and yet I don't use clamp_output at all. Why can't I just call the function? Why must I use a variable to store the output even when I don't use the output?
import sys
import numpy
sys.path.append('/Applications/NEURON-7.4/nrn/lib/python')
from neuron import h, gui
from matplotlib import pyplot

# SET UP CELL
class SingleCell(object):
    def __init__(self):
        self.soma = h.Section(name='soma', cell=self)
        self.soma.L = self.soma.diam = 12.6517
        self.all = h.SectionList()
        self.all.wholetree(sec=self.soma)
        self.soma.insert('pas')
        self.soma.e_pas = -65
        for sec in self.all:
            sec.cm = 20

# CURRENT CLAMP
def attach_current_clamp(cell):
    stim = h.IClamp(cell.soma(1))
    stim.delay = 100
    stim.dur = 300
    stim.amp = 0.2
    return stim

cell = SingleCell()

# IF I CALL THIS FUNCTION WITHOUT STORING THE OUTPUT, THEN IT DOES NOT WORK
clamp_output = attach_current_clamp(cell)

# RECORD AND PLOT
soma_v_vec = h.Vector()
t_vec = h.Vector()
soma_v_vec.record(cell.soma(0.5)._ref_v)
t_vec.record(h._ref_t)
h.tstop = 800
h.run()
pyplot.figure(figsize=(8,4))
soma_plot = pyplot.plot(t_vec, soma_v_vec)
pyplot.show()
This is a NEURON+Python specific bug/feature. It has to do with Python garbage collection and the way NEURON implements the Python-HOC interface.
When there are no more references to a NEURON object (e.g. the IClamp) from within Python or HOC, the object is removed from NEURON.
Saving the IClamp as a property of the cell averts the problem in the same way as saving the result, so that could be an option for you:
# In __init__:
self.IClamps = []
# In attach_current_clamp:
stim.amp = 0.2
cell.IClamps.append(stim)
#return stim
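Pulled together, the modified setup might look roughly like this (a sketch of the snippet above; everything not shown stays exactly as in the posted script):

class SingleCell(object):
    def __init__(self):
        # ... soma setup exactly as before ...
        self.IClamps = []              # keep Python references so NEURON keeps the objects alive

def attach_current_clamp(cell):
    stim = h.IClamp(cell.soma(1))
    stim.delay = 100
    stim.dur = 300
    stim.amp = 0.2
    cell.IClamps.append(stim)          # the reference now lives as long as the cell
    # no need to return stim or store the result at the call site

attach_current_clamp(cell)             # now works without assigning the result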

Implementing multiprocessing to deal with heavy input/output on HPC

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and there are around 4500 elements in each spectrum (so each file returns a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, it is not duplicate work, because each time I extract a different segment of the same spectrum. With the help of @Paul Panzer, I already avoid reading the same file multiple times.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectra in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import itertools
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(int(parts1[0]))
        n2.append(int(parts1[1]))
        n3.append(int(parts1[2]))
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))

def data_analysis(n_galaxies):
    n_num = 0
    data = np.zeros((n_galaxies), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
    idx = np.lexsort((n3,n2,n1))
    for kk, gg in itertools.groupby(zip(idx, n1[idx], n2[idx]), lambda x: x[1:]):
        filename = "../../data/" + str(kk[0]) + "/spPlate-" + str(kk[0]) + "-" + str(kk[1]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0].read()
        n_element = fluxx.shape[1]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(n_element)
        wavegrid = np.power(10, logwave)
        for ss, plate1, mjd1 in gg:
            if n_num % 1000000 == 0:
                print n_num
            n3new = n3[ss] - 1
            flux = fluxx[n3new]
            ### following is my data reduction of individual spectra, I will skip here
            ### After all my analysis, I have the data storage as below:
            data['spec'][n_num] = flux_intplt
            data['x'][n_num] = x[ss]
            data['y'][n_num] = y[ss]
            data['r'][n_num] = r[ss]
            data['s'][n_num] = s[ss]
            n_num += 1
    print n_num
    data_output = fitsio.FITS('./analyzedDATA/data_ALL.fits', 'rw')
    data_output.write(data)
I kind of understand that multiprocessing needs me to remove one loop and pass its index to the function. However, there are two loops in my function and they are highly correlated, so I do not know how to approach this. Since the most time-consuming part of this code is reading files from disk, the multiprocessing needs to take full advantage of the cores to read multiple files at once. Could anyone shed some light on this for me?
Get rid of the global vars; you can't use global vars with processes.
Merge your multiple global vars into one container class or dict,
assigning the different segments of the same spectra to one data set.
Move your global with open(... into a def ...
Separate data_output into its own def ...
Try this concept first, without multiprocessing:
for line1, line2 in izip(file_ID, file_c):
    data_set = create data set from (line1, line2)
    result = data_analysis(data_set)
    data_output.write(result)
Consider using 2 processes: one for file reading and one for file writing.
Use multiprocessing.Pool(processes=n) for data_analysis.
Communicate between processes using multiprocessing.Manager().Queue(). A rough sketch of how these pieces could fit together follows below.
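All names in this sketch are hypothetical; analyse_one_file stands in for the per-file part of data_analysis, and the actual reading, reduction, and FITS-writing details are elided:

import multiprocessing as mp

def analyse_one_file(task):
    # one task = one file plus the catalog rows that point into it,
    # so each file is opened exactly once
    filename, rows = task
    results = []
    # ... open filename with fitsio, reduce each requested spectrum, append to results ...
    return results

def writer(queue):
    # a single process owns the output file and drains the queue
    while True:
        chunk = queue.get()
        if chunk is None:              # sentinel: no more results coming
            break
        # ... append chunk to ./analyzedDATA/data_ALL.fits ...

if __name__ == '__main__':
    manager = mp.Manager()
    queue = manager.Queue()
    writer_proc = mp.Process(target=writer, args=(queue,))
    writer_proc.start()

    tasks = []                         # build one (filename, rows) task per file from the catalog
    pool = mp.Pool(processes=4)
    for chunk in pool.imap_unordered(analyse_one_file, tasks):
        queue.put(chunk)
    pool.close()
    pool.join()

    queue.put(None)                    # tell the writer to finish
    writer_proc.join()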

Optimizing Algorithm of large dataset calculations

Once again I find myself stumped with pandas and how best to perform a 'vector operation'. My code works; however, it will take a long time to iterate through everything.
What the code is trying to do is loop through shapes.csv and determine which shape_pt_sequence is a stop_id, then assign the stop_lat and stop_lon to shape_pt_lat and shape_pt_lon, while also marking that shape_pt_sequence as is_stop.
GISTS
stop_times.csv LINK
trips.csv LINK
shapes.csv LINK
Here is my code:
import pandas as pd
from haversine import *

'''
iterate through shapes and match stops along a shape_pt_sequence within
x amount of distance. for shape_pt_sequence that is closest, replace the stop
lat/lon to the shape_pt_lat/shape_pt_lon, and mark is_stop column with 1.
'''

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_index = list(set(shapes['shape_id']))
shapes_index.sort(key=int)
shapes.set_index(['shape_id', 'shape_pt_sequence'], inplace=True)

# readability assignments for trips.csv
trips = pd.read_csv('csv/trips.csv')
trips_index = list(set(trips['trip_id']))
trips.set_index(['trip_id'], inplace=True)

# readability assignments for stop_times.csv
stop_times = pd.read_csv('csv/stop_times.csv')
stop_times.set_index(['trip_id','stop_sequence'], inplace=True)
print(len(stop_times.loc[1423492]))

# readability assignments for stops.csv
stops = pd.read_csv('csv/stops.csv')
stops.set_index(['stop_id'], inplace=True)

# for each trip_id
for i in trips_index:
    print('******NEW TRIP_ID******')
    print(i)
    i = i.astype(int)
    # for each stop_sequence in stop_times
    for x in range(len(stop_times.loc[i])):
        stop_lat = stop_times.loc[i,['stop_lat','stop_lon']].iloc[x,[0,1]][0]
        stop_lon = stop_times.loc[i,['stop_lat','stop_lon']].iloc[x,[0,1]][1]
        stop_coordinate = (stop_lat, stop_lon)
        print(stop_coordinate)
        # shape_id that matches trip_id
        print('**SHAPE_ID**')
        trips_shape_id = trips.loc[i,['shape_id']].iloc[0]
        trips_shape_id = int(trips_shape_id)
        print(trips_shape_id)
        smallest = 0
        for y in range(len(shapes.loc[trips_shape_id])):
            shape_lat = shapes.loc[trips_shape_id].iloc[y,[0,1]][0]
            shape_lon = shapes.loc[trips_shape_id].iloc[y,[0,1]][1]
            shape_coordinate = (shape_lat, shape_lon)
            haversined = haversine_mi(stop_coordinate, shape_coordinate)
            if smallest == 0 or haversined < smallest:
                smallest = haversined
                smallest_shape_pt_indexer = y
            else:
                pass
            print(haversined)
            print('{0:.20f}'.format(smallest))
        print('{0:.20f}'.format(smallest))
        print(smallest_shape_pt_indexer)
        # mark is_stop as 1
        shapes.iloc[smallest_shape_pt_indexer,[2]] = 1
        # replace coordinate value
        shapes.loc[trips_shape_id].iloc[y,[0,1]][0] = stop_lat
        shapes.loc[trips_shape_id].iloc[y,[0,1]][1] = stop_lon
shapes.to_csv('csv/shapes.csv', index=False)
What you could do to optimize this code is use a pool of worker processes instead of those plain for loops.
I recommend using a pool of workers, as it is very simple to use.
In:
for i in trips_index:
You could use something like:
from multiprocessing import Pool

pool = Pool(processes=4)
results = pool.map(func, trips_index)
And then the method func would be like:
def func(i):
    # code here
You could simply put the whole body of the for loop inside this method. Running the work in 4 subprocesses, as in this example, should give it a nice improvement.
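A slightly fuller sketch of the same idea (hypothetical: process_trip is assumed to contain the per-trip body of the question's outer loop, and trips_index comes from the setup code in the question):

from multiprocessing import Pool

def process_trip(trip_id):
    # do the stop/shape matching for a single trip_id here and return the
    # updates, e.g. a list of (shape_id, shape_pt_sequence, lat, lon) tuples
    return []

if __name__ == '__main__':
    pool = Pool(processes=4)
    all_updates = pool.map(process_trip, trips_index)
    pool.close()
    pool.join()
    # apply all_updates to the shapes DataFrame and write csv/shapes.csv once, at the end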
One thing to consider is that a collection of trips will often have the same sequence of stops and the same shape data (the only difference between trips is the timing). So it might make sense to cache the find-closest-point-on-shape operation for each (stop_id, shape_id) pair. I bet that would reduce your runtime by an order of magnitude.
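For instance, a sketch of that caching idea with functools.lru_cache (closest_shape_point is a hypothetical helper wrapping the inner haversine loop from the question, and it assumes the stop coordinates can be looked up by stop_id, e.g. from the stops DataFrame that is already loaded):

from functools import lru_cache

@lru_cache(maxsize=None)
def closest_shape_point(stop_id, shape_id):
    # computed once per (stop_id, shape_id); repeat trips over the same shape hit the cache
    stop_lat, stop_lon = stops.loc[stop_id, ['stop_lat', 'stop_lon']]
    best_y, best_dist = None, float('inf')
    for y in range(len(shapes.loc[shape_id])):
        shape_lat, shape_lon = shapes.loc[shape_id].iloc[y, [0, 1]]
        d = haversine_mi((stop_lat, stop_lon), (shape_lat, shape_lon))
        if d < best_dist:
            best_y, best_dist = y, d
    return best_y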
