Fast loading multiple .npy files into data generator - python

I do have thousands of .npy files stored in my hard disk, each containing a single matrix with dimensions [128, T], where T is variable (on average T=800). Each .npy file has size around 2Mb, depending on the matrix shape.
These matrices are then passed to a generator, which yields batches of 32 to a neural network. The Python code used to pass the matrices into the generator is:
def load_batch(path_list):
np_list = []
for path in path_list:
np_list.append(np.load(path))
return np_list
which, given a list of paths of the .npy files, returns a list of the corresponding NumPy matrices.
This code takes, on average, 0.6s to return a list of 32 matrices. I am using append because this is usually a quick operation.
I am aware that the speed of the hard disk buffer does have an influence on timings but, right now, I really would like to shrink the amount of time required as much as possible by just modifying the code in a smart way.
As an alternative, I tried implementing multi-processing:
from multiprocessing import Pool
def reader(filename):
return np.load(filename)
def load_multiprocess(path_list, n_cores=5):
pool = Pool(n_cores)
np_list = pool.map(reader, path_list)
return np_list
However, the performance is much worse. I had a look around stackoverflow, and I got the idea that my specific application could not benefit from multiprocessing.
To summarize, I am looking for any kind of advice for one of these two tasks:
Improving the speed of the first code (even 0.1s less would mean a lot).
Using multiprocessing in the right way, if possible.
SOLUTION AND BENCHMARK
Out of the three methods here proposed, user7138814's solution seems to generally improve a lot the execution speed. However, things seem to change when the data is loaded while training a neural network: even though mapping is by itself still the quicker method for loading data, the overall training time seems to increase, I have no idea where and why, as timings using the mapping load are always better.
Below, I will do a benchmark of the three methods.First, define the methods:
import numpy as np
# my initial method
def load_batch(path_list):
np_list = []
for path in path_list:
np_list.append(np.load(path))
return np_list
# Aaj Kaal's method
def load_batch1(path_list):
return [np.load(path) for path in path_list]
# user7138814's method
def load_batch2(path_list):
np_list = []
for path in path_list:
np_list.append(np.load(path, mmap_mode='r'))
return np_list
I defined a list of paths as follows:
batches_list = []
batch_size = 32
for n in range(0,150):
batches_list.append(X_path_list[n*batch_size:n*batch_size+batch_size])
The list contains 150 batches of 32 paths each, it should be enough to calculate the mean.
Then, each method is executed using passing to it exactly the same data.
import time
# my initial method
timing0 = []
for l in batches_list:
start = time.time()
load_batch(l)
end = time.time()
timing0.append(end-start)
print(np.mean(timing0))
# Aaj Kaal's method
timing1 = []
for l in batches_list:
start = time.time()
load_batch1(l)
end = time.time()
timing1.append(end-start)
print(np.mean(timing1))
# user7138814's method
timing2 = []
for l in batches_list:
start = time.time()
load_batch2(l)
end = time.time()
timing2.append(end-start)
print(np.mean(timing2))
Output (mean timing in seconds over 150 executions):
0.022530150413513184
0.022546884218851725
0.009580903053283692
Results seem to be consistent when changing length of batches_list and batch_size.

Maybe memory mapping the files will be beneficial due to lazy loading. If you would use for example
np.load(filename, mmap_mode='r')
the creation of the numpy array becomes almost a no-op, but later in the pipeline you pay the price. This could provide a speedup if it results in processing the data in parallel with reading from disk.

Did you try using use list comprehension. Replace
def load_batch(path_list):
np_list = []
for path in path_list:
np_list.append(np.load(path))
return np_list
with
def load_batch(path_list):
return [np.load(path) for path in path_list]
In fact you can get rid of the function and directly use list comprehension. If functional call is required use lambda

Related

Attempting to optimize the following loop over numpy arrays? Best method ? (numba or dask)

I am trying to refactor this code in order to minimize its runtime and memory usage (if possible)
for i in range(gbl.NumStoreRows):
cal_effects[i,:,:len(orig_cols)] = cal_effects_vals - **Use ~1gb memory on this line**
priors[i,:len(orig_cols)] = orig_prior_coeffs
priors_SE[i,:len(orig_cols)] = orig_prior_SE
It is only the first operation in the loop which is time/memory intensive, I tried splitting the the memory/runtime intensive line from the other two and created two separate loops. - just made it a second slower, and no memory impact.
I tried to create a jit function for this code block then, but the application stops running later on in the code with error message. - It just stops on one of the LoadFunctions(), so I think jit might be altering the output or my function is incorrectly structured.
Variations of my jit function
Variation 1
#jit
def populate_cal_effects(cal_effects_vals):
for i in range(gbl.NumStoreRows):
cal_effects[i,:,:len(orig_cols)] = cal_effects_vals
populate_cal_effects(cal_effects_vals)
for i in range(gbl.NumStoreRows):
priors[i,:len(orig_cols)] = orig_prior_coeffs
priors_SE[i,:len(orig_cols)] = orig_prior_SE
Variation 2: Adding a return statement to the function
#jit
def populate_cal_effects(cal_effects_vals):
for i in range(gbl.NumStoreRows):
cal_effects[i,:,:len(orig_cols)] = cal_effects_vals
return cal_effects[i,:,:len(orig_cols)]
Variation 3: add the operations from the other for loop to the function
This was the method I expected to be fastest and not affect data output
#jit(parallel=True)
def populate_cal_effects(cal_effects_vals):
for i in prange(gbl.NumStoreRows):
cal_effects[i,:,:len(orig_cols)] = cal_effects_vals
priors[i,:len(orig_cols)] = orig_prior_coeffs
priors_SE[i,:len(orig_cols)] = orig_prior_SE
I wanted to utilize parallel mode and use prange for the loop, but I cannot get this to work.
Context/Other:
I have defined this function inside the main load function. - My next step is too move it out of the Load function and re-run.
If this method doesn't work I was thinking of trying to process in parallel (multiple cores) - not machines. using Dask.
Any pointers on this would be great, maybe I am wasting my time and this is not optimizable, if so, do let me know
Steps to reproduce
gbl.NumstoreRows = 866 (# of stores)
All data types are numpy arrays
cal_effects = np.zeros((gbl.NumStoreRows, n_days, n_cal_effects), dtype=np.float64)
priors = np.zeros((gbl.NumStoreRows, n_cal_effects), dtype=np.float64)
priors_SE = np.zeros((gbl.NumStoreRows, n_cal_effects), dtype=np.float64)
To illustrate my comment:
for i in range(gbl.NumStoreRows):
cal_effects[i,:,:len(orig_cols)] = cal_effects_vals - **Use ~1gb memory on this line**
priors[i,:len(orig_cols)] = orig_prior_coeffs
priors_SE[i,:len(orig_cols)] = orig_prior_SE
from this I deduce cal_effects is a (N,M,L) shape array; priors is (N,L)
big_arr = np.zeros((N,M,L))
arr = np.zeros((N,L)
for i in range(N):
big_arr[i, :, :l] = np.ones((M,l))
arr[i, :l] = np.ones(l)
And apparently np.ones((M,l)) is large, on the order of 1gb.
Do cal_effects_vals and orig_prior_coeffs differ with i. It isn't obvious from the code. If they don't differ, why iterate on i?
So this isn't an answer, but it may help you write a question that is more succinct, and attract more answers.

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one from every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. It's, unfortunately, a lot of overhead / duplicate data, so I want to parse out the relevant stuff to do visualizations on it.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below it performs really slow. It has been running for over 1,5 hours now (on my 2019 Macbook Pro 15') and isn't finished yet.
How can I optimize / improve this python parser?
Or should I reduce the number of files, and i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
try:
# Uncompress and parse protobuff-file using gtfs_realtime_pb2
with gzip.open(directory + filename, 'rb') as file:
response = file.read()
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response)
print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
for trip in feed.entity:
if trip.trip_update.trip.trip_id not in read_trips:
try:
if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
read_trips.add(trip.trip_update.trip.trip_id)
except KeyError:
continue
else:
continue
except OSError:
continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iterating to using two iterators zipping through the list of updates, this is a little more advanced style of working through a list. Plus the cur/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I remove the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
this_trip_update = trip.trip_update
this_trip_id = this_trip_update.trip.trip_id
if this_trip_id not in read_trips:
try:
if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
this_trip_id)
# create two iterators to walk through the list of updates
cur_updates = iter(this_trip_update.stop_time_update)
nxt_updates = iter(this_trip_update.stop_time_update)
# advance the nxt_updates iter so it is one ahead of cur_updates
next(nxt_updates)
for cur_update, next_update in zip(cur_updates, nxt_updates):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(nxt_update.arrival.delay - cur_update.arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(next_update.arrival.time)
key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
read_trips.add(this_trip_id)
except KeyError:
continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
def func(i):
return i*2
def par_func_dict(mydict):
values = mydict['values']
df = mydict['df']
return pd.Series([func(i) for i in values])
N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10,10))
pool = Pool(cores)
gen = ({'values' : i, 'df' : df}
for i in data_split)
data = pd.concat(pool.map(par_func_dict,gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way I can prevent feeding the generator with copies of the dataframe to prevent taking up so much memory.
The answer to the link above suggests using multiprocessing.Process(), but from what I can tell, it's difficult to use that on top of functions that return things (need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.

Parallelize python functions without storing intermediate results

I need to accumulate the results of many trees for some query that outputs a large result. Since all trees can be handled independently it is embarrassingly parallel, except for the fact that the results needs to be summed and I cannot store the intermediate results for all trees in memory. Below is a simple example of a code for the problem which saves all the intermediate results in memory (of course the functions are newer the same in the real problem since that would be doing duplicated work).
import numpy as np
from joblib import Parallel, delayed
functions=[[abs,np.round] for i in range(500)] # Dummy functions
functions=[function for sublist in functions for function in sublist]
X=np.random.normal(size=(5,5)) # Dummy data
def helper_function(function,X=X):
return function(X)
results = Parallel(n_jobs=-1,)(
map(delayed(helper_function), [functions[i] for i in range(1000)]))
results_out = np.zeros(results[0].shape)
for result in results:
results_out+=result
A solution could be the following modification:
import numpy as np
from joblib import Parallel, delayed
functions=[[abs,np.round] for i in range(500)] # Dummy functions
functions=[function for sublist in functions for function in sublist]
X=np.random.normal(size=(5,5)) # Dummy data
results_out = np.zeros(results[0].shape)
def helper_function(function,X=X,results=results_out):
result = function(X)
results += result
Parallel(n_jobs=-1,)(
map(delayed(helper_function), [functions[i] for i in range(1000)]))
But this might cause races. So it is not optimal.
Do you have any suggestions for preforming this without storing the intermediate results and still make it parallel?
The answer is given in the documentation of joblib.
with Parallel(n_jobs=2) as parallel:
accumulator = 0.
n_iter = 0
while accumulator < 1000:
results = parallel(delayed(sqrt)(accumulator + i ** 2)
for i in range(5))
accumulator += sum(results) # synchronization barrier
n_iter += 1
You can do the calculation in chunks and reduce the chunk as you are about to run out of memory.

Python + Ever-increasing memory allocation

I am writing a module to train a ML model on a large dataset - It includes 0.6M datapoints, each of 0.15M dimensions. I am facing problem with loading the data set itself. (its all numpy arrays)
Below is a code snippet (This replicates major behaviour of the actual code):
import numpy
import psutil
FV_length = 150000
X_List = []
Y_List = []
for i in range(0,600000):
feature_vector = numpy.zeros((FV_length),dtype=numpy.int)
# using db data, mark the features to activated
class_label = 0
X_List.append(feature_vector)
Y_List.append(class_label)
if (i%100 == 0):
print(i)
print("Virtual mem %s" %(psutil.virtual_memory().percent))
print("CPU usage %s" %psutil.cpu_percent())
X_Data = np.asarray(X_List)
Y_Data = np.asarray(Y_List)
The code results in ever-increasing memory allocation, until it gets killed. Is there a way to reduce the ever-increasing memory allocation ?
I have tried using gc.collect() but it always returns 0. I have made variables = None explicitly, no use again.
As noted in the comments, the data volume here is just very large, and the Neural Network would probably struggle even if you managed to load the training set. The best option for you is probably looking into some method of dimensional reduction on your datapoints. Something like Principal Component Analysis could help to get the 150K dimensions down to a more reasonable number.
This is what I did for a similar problem. I just always create the empty list again when it should be overwritten.
#initialize
X_List = []
Y_List = []
//do something with the list
Now if you don't need the old values, just create the list again
X_List = []
Y_List = []
But I don't know if this is needed or possible in your case. Maybe its the most idiomatic way but it worked.

Categories

Resources