I have a nested for-loop in Python that creates a netCDF file. The loop takes a pandas dataframe with time, lat, lon, and parameters, and writes the parameters into the netCDF file at the correct location and time. This is taking too long, since the pandas dataframe has more than 80000 rows and the netCDF file has around 8000 time-steps. I've been looking into using either xargs or multiprocessing, but the former takes files as inputs and the latter produces as many outputs as the number of processes I use. I have no experience in parallel processing, so my assumptions may well be wrong. This is the code that I am using:
with Dataset(os.path.join('Downloads', inv, ''), 'w') as dset:
    dset.createDimension('time_components', 6)
    groups = ['obs', 'mix_apri', 'mix_apos', 'mix_background']
    for group in groups:
        dset[group].createDimension('nt', 8760)
        dset[group].createDimension('nlat', 80)
        dset[group].createDimension('nlon', 100)
        times_start = dset[group].createVariable('times_start', 'i4', ('nt', 'time_components'))
        times_end = dset[group].createVariable('times_end', 'i4', ('nt', 'time_components'))
        lats = dset[group].createVariable('lats', 'f4', ('nlat'))
        lons = dset[group].createVariable('lons', 'f4', ('nlon'))
        times_start[:,:] = list(emis_apri['biosphere']['times_start'])
        times_end[:,:] = list(emis_apri['biosphere']['times_end'])
        lats[:] = list(emis_apri['biosphere']['lats'])
        lons[:] = list(emis_apri['biosphere']['lons'])
    conc_obs = dset['obs'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    conc_mix_apri = dset['mix_apri'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    conc_mix_apos = dset['mix_apos'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    conc_mix_background = dset['mix_background'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    for i in range(8760):
        conc_obs[i,:,:] = emis_apri['biosphere']['emis'][i][:,:]*0
    conc_mix_apri[:,:,:] = list(conc_obs)
    conc_mix_apos[:,:,:] = list(conc_obs)
    conc_mix_background[:,:,:] = list(conc_obs)
    db = obsdb(os.path.join('Downloads', inv, 'observations.apos.tar.gz'))
    nsites = db.sites.shape[0]
    for isite, site in enumerate(db.sites.itertuples()):
        dbs = db.observations.loc[ == site.Index]
        lat = where((array(emis_apri['biosphere']['lats']) >= list(dbs.lat)[0]-0.25) & (array(emis_apri['biosphere']['lats']) <= list(dbs.lat)[0]+0.25))[0][0]
        lon = where((array(emis_apri['biosphere']['lons']) >= list(dbs.lon)[0]-0.25) & (array(emis_apri['biosphere']['lons']) <= list(dbs.lon)[0]+0.25))[0][0]
        for i in range(len(list(dbs.time))):
            for j in range(len(times_start)):
                if datetime(*times_start[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]) and datetime(*times_end[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]):
                    conc_obs[i,lat,lon] = list(dbs.obs)[i]
                    conc_mix_apri[i,lat,lon] = list(dbs.mix_apri)[i]
                    conc_mix_apos[i,lat,lon] = list(dbs.mix_apos)[i]
                    conc_mix_background[i,lat,lon] = list(dbs.mix_background)[i]
The part starting at for isite, site in enumerate(db.sites.itertuples()): is what I need to parallelize. I would really appreciate any insights about this.

Consider the following as pseudocode, as I cannot test anything without sample data. I have usually parallelized my code with mpi4py, and in your case you could start with:
from mpi4py import MPI
comm = MPI.COMM_WORLD  # communicator spanning all started processes
size = comm.Get_size()  # let your program know how many processors you are using
rank = comm.Get_rank()  # let the running program know which processor it is
Now, at the beginning of the code, let one of the processes be the so-called master task, which does all the basic/important work that cannot be done simultaneously by all the tasks, for example opening/initializing the output file. For those parts of your code, you can use:
if rank == 0:
    # do some important stuff
# do something not important (for example a = 5)
comm.barrier()  # this is important to synchronize the processes
Now, to parallelize your code, you can loop over a distributed subset of db.sites, i.e. you divide db.sites.itertuples() over the number of processors you are going to use:
allsites = list(db.sites.itertuples())  # every processor has to know all the sites (a list, so it can be sliced)
sites = allsites[rank::size]  # each starts from its own rank and jumps by size
for isite, site in enumerate(sites):
    dbs = db.observations.loc[ == site.Index]
    lat = where((array(emis_apri['biosphere']['lats']) >= list(dbs.lat)[0]-0.25) & (array(emis_apri['biosphere']['lats']) <= list(dbs.lat)[0]+0.25))[0][0]
    lon = where((array(emis_apri['biosphere']['lons']) >= list(dbs.lon)[0]-0.25) & (array(emis_apri['biosphere']['lons']) <= list(dbs.lon)[0]+0.25))[0][0]
    for i in range(len(list(dbs.time))):
        for j in range(len(times_start)):
            if datetime(*times_start[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]) and datetime(*times_end[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]):
                conc_obs[i,lat,lon] = list(dbs.obs)[i]
                conc_mix_apri[i,lat,lon] = list(dbs.mix_apri)[i]
                conc_mix_apos[i,lat,lon] = list(dbs.mix_apos)[i]
                conc_mix_background[i,lat,lon] = list(dbs.mix_background)[i]
comm.barrier()  # do not forget to synchronize
Note, however, that "isite" now counts positions in the sublist you pass in. So instead of running over 0...len(allsites), it runs over 0...len(allsites)/size. If you need "isite" to be the index into the full list, you have to recalculate it, e.g. isite_global = isite*size+rank gives the index of the site that the processor is actually working on.
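The index bookkeeping can be checked without MPI at all. The following is only a sketch: it simulates the ranks with an ordinary loop, and the names local_share, global_index and the site labels are made up for illustration.

```python
def local_share(items, rank, size):
    """Items handled by `rank` when `size` workers split the list round-robin."""
    return items[rank::size]


def global_index(local_index, rank, size):
    """Map a position in the local share back to the position in the full list."""
    return local_index * size + rank


all_sites = ["site_%02d" % i for i in range(10)]  # hypothetical site labels
size = 3  # pretend we started 3 processes

seen = []
for rank in range(size):  # in a real MPI run, each process has exactly one rank
    for isite, site in enumerate(local_share(all_sites, rank, size)):
        isite_global = global_index(isite, rank, size)
        # The recalculated index points at the same site in the full list.
        assert all_sites[isite_global] == site
        seen.append(isite_global)

covered = sorted(seen)  # every global index is visited exactly once
```

Each site ends up with exactly one rank, so no two processes work on the same site.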
So, in the end how to run the code, I usually do:
mpiexec -np 10 ipython script_name
at the terminal to run the code on 10 processors.
But in any case, the hardest part is parallelizing the I/O operations without specific support from the library. I am not sure whether netCDF4 supports parallel I/O, meaning that if the processors with ranks 0...X open the file simultaneously, each write something to a specific location in the file, and close it, the data from all the processors actually ends up in the file.
Therefore, the safest idea is to let one of the processors (the master) be responsible for the output, and to collect the data that needs to be written from all the other processors before writing.
Hope this helps, good luck with the code!


Process 100 feature classes through a script and append the feature class name to each output

NOTE: Due to work constraints I must use Python 2.7 (I know - eyeroll) and standard modules. I'm still learning Python.
I have about 100 tiled 'area of interest' polygons in a geodatabase that need to be processed through my script. My script has been tested on individual tiles & works great. I need advice how to iterate this process so I don't have to run one at a time. (I don't want to iterate ALL 100 at once in case something fails - I just want to make a list or something to run about 10-15 at a time). I also need to add the tile name that I am processing to each feature class that I output.
So far I have tried using fnmatch.fnmatch, which errors because it does not accept a list. I changed the list to a tuple (parentheses), which did NOT error but did NOT print anything either.
I figure that once the naming piece is done, running the process in the for loop should work. Please advise what I am doing wrong, or whether there is a better way - thanks!
This is just a snippet of the full process:
tilename = 'T0104'
HIFLD_fc = os.path.join(work_dir, 'fc_clipped_lo' + tilename)
HIFLD_fc1 = os.path.join(work_dir, 'fc1_hifldstr_lo' + tilename)
HIFLD_fc2 = os.path.join(work_dir, 'fc2_non_ex_lo' + tilename)
HIFLD_fc3 = os.path.join(work_dir, 'fc3_no_wilder_lo' + tilename)
arcpy.env.workspace = (env_dir)
fcs = arcpy.ListFeatureClasses()
tile_list = ('AK1004', 'AK1005')
for tile in fcs:
filename, ext = os.path.splitext(tile)
if fnmatch.fnmatch(tile, tile_list):
arcpy.Clip_analysis(HIFLD_fc, bufferOut2, HIFLD_fc1, "")
print('HIFLD clipped for analysis')
arcpy.Clip_analysis(HIFLD_fc, env_mask, HIFLD_masked_rds, "")
print('HIFLD clipped by envelopes and excluded from analysis')
arcpy.Clip_analysis(HIFLD_masked_rds, wild_mask, HIFLD_excluded, "")
print('HIFLD clipped by wilderness mask and excluded from analysis')
arcpy.MakeFeatureLayer_management(HIFLD_fc1, 'hifld_lyr')
arcpy.SelectLayerByLocation_management('hifld_lyr', "COMPLETELY_WITHIN", bufferOut1, "", "NEW_SELECTION", "INVERT")
if arcpy.GetCount_management('hifld_lyr') > 0:
arcpy.CopyFeatures_management('hifld_lyr', HIFLD_fc2)
print('HIFLD split features deleted fc2')
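One possible direction, sketched without arcpy so it can be tried anywhere: fnmatch.fnmatch expects a single pattern string per call, so the check against a batch of tiles has to loop over the patterns. The helper names select_tiles and output_path are hypothetical, and the fcs list merely stands in for the result of arcpy.ListFeatureClasses().

```python
from __future__ import print_function  # keeps print() consistent on Python 2.7

import fnmatch
import os


def select_tiles(names, patterns):
    """Return the names that match at least one shell-style pattern."""
    return [n for n in names if any(fnmatch.fnmatch(n, p) for p in patterns)]


def output_path(work_dir, prefix, tile):
    """Build an output name that carries the tile name as a suffix."""
    return os.path.join(work_dir, prefix + "_" + tile)


fcs = ["AK1003", "AK1004", "AK1005", "T0104"]  # stand-in for arcpy.ListFeatureClasses()
batch = ["AK1004", "AK1005"]                   # the 10-15 tiles to run this time

selected = select_tiles(fcs, batch)
for tile in selected:
    # In the real script the output paths would be rebuilt here, once per tile.
    hifld_fc1 = output_path("work_dir", "fc1_hifldstr_lo", tile)
    print(tile, "->", hifld_fc1)
```

The point of the sketch is that the output names are built inside the loop, per tile, instead of once from a fixed tilename at the top of the script. A pattern such as "AK10*" would select a whole block of tiles in one go.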

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one from every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. Unfortunately, there is a lot of overhead / duplicate data, so I want to parse out the relevant parts to do visualizations on them.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it runs really slowly. It has been running for over 1.5 hours now (on my 2019 MacBook Pro 15") and isn't finished yet.
How can I optimize / improve this Python parser?
Or should I reduce the number of files, i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuf file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)
            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:
                    try:
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
                    except KeyError:
                        continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
                                # Append data to the plain Python list
                                datapoints.append((key, ts, delay))
# ...
npdata = np.array(datapoints, dtype=int)
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iterating to using two iterators zipping through the list of updates, this is a little more advanced style of working through a list. Plus the cur/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I remove the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates, None)
                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
        except KeyError:
            continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
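For anyone unfamiliar with the two-iterator idiom, here it is in isolation on plain tuples. This is just a sketch: the (stop_id, delay) pairs are invented, and consecutive_pairs is a hypothetical helper name, not part of any GTFS library.

```python
def consecutive_pairs(items):
    """Yield (current, next) pairs from a sequence.

    `items` must be re-iterable (a list or tuple); passing an iterator
    would make both iter() calls return the same object.
    """
    cur_items = iter(items)
    nxt_items = iter(items)
    next(nxt_items, None)  # put the second iterator one step ahead
    return zip(cur_items, nxt_items)


# Invented sample data: (stop_id, arrival delay in seconds)
updates = [("A", 30), ("B", 45), ("C", 40), ("D", 100)]

edges = []
for (cur_stop, cur_delay), (nxt_stop, nxt_delay) in consecutive_pairs(updates):
    key = "{}/{}".format(cur_stop, nxt_stop)
    edges.append((key, nxt_delay - cur_delay))
```

zip stops as soon as the shorter iterator is exhausted, which is why no explicit [:-1] slice is needed.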
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

Can a high CPU load (from other applications) affect python performance/accuracy?

I'm working on a code to read and display the results of a Finite Element Analysis (FEA) calculation. The results are stored in several (relatively big) text files that contain a list of nodes (ID number, location in space) and lists for the physical fields of relevance (ID of node, value of the field on that point).
However, I have noticed that when an FEA case is running in the background and I run my code at the same time, it returns errors. They are not always the same error and do not always occur at the same iteration. They seem random and appear without any modification to the code or to the input files, just by hitting the RUN button seconds apart between runs.
Example of the errors that I'm getting are:
keys[key] = np.round(np.asarray(keys[key]),7)
TypeError: can't multiply sequence by non-int of type 'float'
triang = tri.Triangulation(x, y)
ValueError: x and y arrays must have a length of at least 3
line = [float(n) for n in line]
ValueError: could not convert string to float: '0.1225471E'
In case you are curious, this is my code (keep in mind that it is not finished yet and that I'm a mechanical engineer, not a programmer). Any feedback on how to make it better is also appreciated:
import matplotlib.pyplot as plt
import matplotlib.tri as tri
import numpy as np
import os
triangle_max_radius = 0.003
respath = 'C:/path'
fields = ['TEMPERATURE']
# Plot figure definition --------------------------------------------------------------------------------------
fig, ax1 = plt.subplots()
fig.subplots_adjust(left=0, right=1, bottom=0.04, top=0.99)
# -------------------------------------------------------------------------------------------------------------
# Read outputfiles --------------------------------------------------------------------------------------------
resfiles = [f for f in os.listdir(respath) if (os.path.isfile(os.path.join(respath,f)) and f[:3]=='csv')]
resfiles = [[f,int(f[4:])] for f in resfiles]
resfiles = sorted(resfiles,key=lambda x: (x[1]))
resfiles = [os.path.join(respath,f[:][0]).replace("\\","/") for f in resfiles]
# -------------------------------------------------------------------------------------------------------------
# Read data inside outputfile ---------------------------------------------------------------------------------
for result_file in resfiles:
keys = {}
keywords = []
with open(result_file, 'r') as res:
for line in res:
if line[0:2] == '##':
if len(line) >= 5:
line = line[:3] + line[7:]
line = line.replace(';',' ')
line = line.split()
if line:
if line[0] == '##':
if len(line) >= 3:
keys[line[1]] = []
elif line[0] in keywords:
curr_key = line[0]
line = [float(n) for n in line]
for key in keys:
keys[key] = np.round(np.asarray(keys[key]),7)
for item in fields:
gob_temp = np.empty((0,4))
for node in keys[item]:
temp_coords, = np.where(node[0] == keys['COORDINATES'][:,0])
gob_temp_coords = [node[0], keys['COORDINATES'][temp_coords,1], keys['COORDINATES'][temp_coords,2], node[1]]
gob_temp = np.append(gob_temp,[gob_temp_coords],axis=0)
x = gob_temp[:,1]
y = gob_temp[:,2]
z = gob_temp[:,3]
triang = tri.Triangulation(x, y)
triangles = triang.triangles
xtri = x[triangles] - np.roll(x[triangles], 1, axis=1)
ytri = y[triangles] - np.roll(y[triangles], 1, axis=1)
maxi = np.max(np.sqrt(xtri**2 + ytri**2), axis=1)
triang.set_mask(maxi > triangle_max_radius)
So, back to the question: is it possible for the accuracy/performance of Python to be affected by CPU load or any other 'external' factors? Or is that not an option, and there is definitely something wrong with my code (which works well in other circumstances, by the way)?
No, other processes only affect how often your process gets time slots to execute -- i.e., from a user's perspective, how quickly it completes its job.
If you're having errors under load, this means there are errors in your program's logic -- most probably, race conditions. They basically boil down to making assumptions about your environment that are no longer true when there's other activity in it. E.g.:
Your program is multithreaded, and the logic makes assumptions about which order threads are executed in. (This includes assumptions about how long some task would take to complete.)
Your program is using shared resources (files, streams etc) that other processes are also using at the same time. (E.g. some other program is in the process of (over)writing a file while you're trying to read it. Or, if you're reading from a stream, not all data are available yet.)
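If the second case applies (the solver is still writing a result file while it is being read), a truncated token such as '0.1225471E' is exactly what a reader would see. Below is a rough sketch of two defensive measures, assuming nothing about the actual file format beyond whitespace/semicolon separated numbers; is_stable and parse_numbers are hypothetical helper names, and the stability check is only a heuristic, since a writer that pauses for longer than the wait interval will fool it.

```python
import os
import tempfile
import time


def is_stable(path, wait=0.05):
    """True if size and modification time did not change during `wait` seconds."""
    before = os.stat(path)
    time.sleep(wait)
    after = os.stat(path)
    return (before.st_size, before.st_mtime) == (after.st_size, after.st_mtime)


def parse_numbers(line):
    """Parse a line of floats; return None when any token is incomplete."""
    try:
        return [float(tok) for tok in line.replace(";", " ").split()]
    except ValueError:
        return None


# Small demonstration on a temporary file that nobody else is writing to.
handle, tmp_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "w") as tmp:
    tmp.write("1.0;2.5\n3 0.1225471E")  # last token is cut off mid-number

stable = is_stable(tmp_path)
with open(tmp_path) as tmp:
    parsed = [parse_numbers(line) for line in tmp]
os.remove(tmp_path)
```

A None entry in parsed marks a line that should be re-read later rather than passed on to the plotting code.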

parallel processing - nearest neighbour search using pysal python?

I have this data frame df1,
id lat_long
400743 2504043 (175.0976323, -41.1141412)
43203 1533418 (173.976683, -35.2235338)
463952 3805508 (174.6947496, -36.7437555)
1054906 3144009 (168.0105269, -46.36193)
214474 3030933 (174.6311167, -36.867717)
1008802 2814248 (169.3183615, -45.1859095)
988706 3245376 (171.2338968, -44.3884099)
492345 3085310 (174.740957, -36.8893026)
416106 3794301 (174.0106383, -35.3876921)
937313 3114127 (174.8436185, -37.80499)
I have constructed the tree for search here,
def construct_geopoints(s):
data_geopoints = [tuple(x) for x in s[['longitude','latitude']].to_records(index=False)]
tree = KDTree(data_geopoints, distance_metric='Arc',
return tree
tree = construct_geopoints(actualdata)
Now I am trying to find all the geopoints that are within 1 km of every geopoint in my data frame df1. Here is how I am doing it:
dfs = []
for name,group in df1.groupby(np.arange(len(df1))//10000):
    s = group.reset_index(drop=True).copy()
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    dfs.append(s)
output = pd.concat(dfs,axis = 0)
Everything here works fine. However, I am trying to parallelise this task: df1 has 2M records, and the process has been running for more than 8 hours. Can anyone help me with this? Another problem is that the result returned by query_ball_point is a list, so it throws a memory error when I process this many records. Is there any way to handle this?
EDIT :- Memory issue, look at the VIRT size.
It should be possible to parallelize your last segment of code with something like this:
from multiprocessing import Pool
def process_group(group):
    s = group[1].reset_index(drop=True)  # .copy() is implicit
    pts = list(s['lat_long'])
    neighbours = tree.query_ball_point(pts, 1)
    s['neighbours'] = pd.Series(neighbours)
    return s
groups = df1.groupby(np.arange(len(df1))//10000)
p = Pool(5)
dfs = p.map(process_group, groups)
output = pd.concat(dfs, axis=0)
But watch out, because the multiprocessing library pickles all the data on its way to and from the workers, and that can add a lot of overhead for data-intensive tasks, possibly cancelling the savings due to parallel processing.
I can't see where you'd be getting out-of-memory errors from. 2 million records is not that much for pandas. Maybe if your searches are producing hundreds of matches per row, that could be a problem. If you say more about that, I might be able to give some more advice.
It also sounds like pysal may be taking longer than necessary to do this. You might be able to get better performance by using GeoPandas or "rolling your own" solution like this:
assign each point to a surrounding 1-km grid cell (e.g., calculate UTM coordinates x and y, then create columns cx=x//1000 and cy=y//1000);
create an index on the grid cell coordinates cx and cy (e.g., df=df.set_index(['cx', 'cy']));
for each point, find the points in the 9 surrounding cells; you can select these directly from the index via df.loc[[(cx-1,cy-1),(cx-1,cy),(cx-1,cy+1),(cx,cy-1),...(cx+1,cy+1)], :];
filter the points you just selected to find the ones within 1 km.
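The four steps above can be sketched in plain Python, with a dict standing in for the pandas index. This assumes the points are already in planar coordinates in metres (e.g. UTM) and that the search radius is no larger than the cell size; build_grid and neighbours_within are made-up names and the sample points are invented.

```python
from collections import defaultdict
from math import hypot


def build_grid(points, cell=1000.0):
    """Map each (cx, cy) grid cell to the indices of the points inside it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(int(x // cell), int(y // cell))].append(idx)
    return grid


def neighbours_within(points, grid, idx, radius=1000.0, cell=1000.0):
    """Indices of all other points within `radius` of points[idx].

    Only the 3x3 block of cells around the point is inspected, which is
    sufficient as long as radius <= cell.
    """
    x, y = points[idx]
    cx, cy = int(x // cell), int(y // cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                px, py = points[j]
                if j != idx and hypot(px - x, py - y) <= radius:
                    found.append(j)
    return sorted(found)


# Invented planar coordinates in metres
points = [(0.0, 0.0), (500.0, 0.0), (999.0, 999.0), (2500.0, 0.0), (-800.0, 0.0)]
grid = build_grid(points)
near_first = neighbours_within(points, grid, 0)
```

Because every lookup touches at most nine cells, the cost per point depends on the local point density rather than on the total number of records.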

Implementing multiprocessing to deal with heavy input/output on HPC

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file returns a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, the work is not duplicated, because each time I extract a different segment of that spectrum. With the help of @Paul Panzer, I already avoid reading the same file multiple times.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectra in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import itertools
import fitsio
x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
for line1, line2 in izip(file_ID,file_c):
parts1 = line1.split()
parts2 = line2.split()
def data_analysis(n_galaxies):
n_num = 0
data = np.zeros((n_galaxies), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
idx = np.lexsort((n3,n2,n1))
for kk,gg in itertools.groupby(zip(idx, n1[idx], n2[idx]), lambda x: x[1:]):
filename = "../../data/" + str(kk[0]) + "/spPlate-" + str(kk[0]) + "-" + str(kk[1]) + ".fits"
fits_spectra = fitsio.FITS(filename)
fluxx = fits_spectra[0].read()
n_element = fluxx.shape[1]
hdu = fits_spectra[0].read_header()
wave_start = hdu['CRVAL1']
logwave = wave_start + 0.0001 * np.arange(n_element)
wavegrid = np.power(10,logwave)
for ss, plate1, mjd1 in gg:
if n_num % 1000000 == 0:
print n_num
n3new = n3[ss]-1
flux = fluxx[n3new]
### following is my data reduction of individual spectra, I will skip here
### After all my analysis, I have the data storage as below:
data['spec'][n_num] = flux_intplt
data['x'][n_num] = x[ss]
data['y'][n_num] = y[ss]
data['r'][n_num] = r[ss]
data['s'][n_num] = s[ss]
n_num += 1
print n_num
data_output = FITS('./analyzedDATA/data_ALL.fits','rw')
I kind of understand that for multiprocessing I need to remove one loop and pass the index to the function instead. However, there are two loops in my function and they are tightly coupled, so I do not know how to approach this. Since the most time-consuming part of this code is reading files from disk, the multiprocessing needs to take full advantage of the cores to read multiple files at once. Could anyone shed some light on this?
Get rid of global vars, you can't use global vars with processes
Merge your multiple global vars into one container class or dict,
assigning different segments of the same spectra into one data set
Move your global with open(... into a def ...
Separate data_output into a own def ...
Try first, without multiprocessing, this concept:
for line1, line2 in izip(file_ID,file_c):
data_set = create data set from (line1, line2)
result = data_analysis(data_set)
Consider to use 2 processes one for file reading and one for file writing.
Use multiprocessing.Pool(processes=n) for data_analysis.
Communicate between processes using multiprocessing.Manager().Queue()
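As a rough illustration of the restructuring, here is a sketch that turns the catalog into one task per file and hands the tasks to multiprocessing.Pool. It uses Pool.map instead of a Manager queue to keep it short, the catalog tuples are invented, and analyse_file is only a placeholder for the real FITS reading and data reduction.

```python
from itertools import groupby
from multiprocessing import Pool


def group_by_file(entries):
    """Turn (n1, n2, n3) catalog entries into [((n1, n2), [n3, ...]), ...].

    Each task then covers exactly one file, so a worker opens it only once.
    """
    ordered = sorted(entries)
    return [(key, [entry[2] for entry in grp])
            for key, grp in groupby(ordered, key=lambda entry: entry[:2])]


def analyse_file(task):
    """Placeholder worker: the real code would read the file for (n1, n2) here."""
    (n1, n2), spectra = task
    return (n1, n2, len(spectra))  # stand-in result: number of spectra handled


def run_parallel(tasks, processes=4):
    """Distribute the per-file tasks over a pool of worker processes."""
    pool = Pool(processes=processes)
    try:
        return pool.map(analyse_file, tasks)
    finally:
        pool.close()
        pool.join()


# Invented catalog entries: (n1, n2, n3)
catalog = [(266, 51602, 3), (266, 51602, 1), (267, 51608, 7), (266, 51602, 3)]
tasks = group_by_file(catalog)
serial_results = [analyse_file(t) for t in tasks]  # same work, no processes
```

run_parallel(tasks) should return the same list as the serial loop. Call it from under an if __name__ == "__main__": guard, and have one process collect the results and write the output file.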

