parallelize 'for' loop in Python 3

I am trying to do some analysis of the MODIS satellite data. My code primarily reads a lot of files (806) of dimension 1200 by 1200 (806*1200*1200 values in total). It does this with a for loop and performs mathematical operations on the data.
Following is the general way in which I read files.
mindex = np.zeros((1200, 1200))
for i in range(1200):
    var1 = xray.open_dataset('filename.nc')['variable'][:,i,:].data
    for j in range(1200):
        var2 = var1[:,j]
        ## Mathematical Calculations to find var3[i,j] ##
        mindex[i,j] = var3[i,j]
Since it's a lot of data to handle, the process is very slow, and I was considering parallelizing it. I tried doing something with joblib, but I have not been able to make it work.
I am unsure how to tackle this problem.

My guess is that you want to work on several files at the same time. To do so, the best way (in my opinion) is to use multiprocessing. To use it, you need to define an elementary step, and your code already does that.
import numpy as np
import multiprocessing as mp
import os
import xarray as xray  # the question uses the old 'xray' name of the xarray package

def f(file):
    mindex = np.zeros((1200, 1200))
    for i in range(1200):
        var1 = xray.open_dataset(file)['variable'][:,i,:].data
        for j in range(1200):
            var2 = var1[:,j]
            ## Mathematical Calculations to find var3[i,j] ##
            mindex[i,j] = var3[i,j]
    return (file, mindex)

if __name__ == '__main__':
    N = mp.cpu_count()
    files = os.scandir(folder)  # 'folder' is the directory holding the .nc files
    with mp.Pool(processes=N) as p:
        results = p.map(f, [file.path for file in files])
This should return a list, results, in which each element is a tuple of the file path and the corresponding mindex matrix. With this, you can work on multiple files at the same time. It is particularly efficient if the computation on each file is long.
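If it helps, here is one way the returned list could be consumed afterwards (an illustration, not part of the original answer):
mindex_by_file = dict(results)                   # map each file path to its mindex array
all_mindex = np.stack([m for _, m in results])   # or stack into a single (n_files, 1200, 1200) array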

Related

Fast loading multiple .npy files into data generator

I have thousands of .npy files stored on my hard disk, each containing a single matrix with dimensions [128, T], where T is variable (on average T=800). Each .npy file is around 2 MB in size, depending on the matrix shape.
These matrices are then passed to a generator, which yields batches of 32 to a neural network. The Python code used to pass the matrices into the generator is:
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
which, given a list of paths of the .npy files, returns a list of the corresponding NumPy matrices.
This code takes, on average, 0.6s to return a list of 32 matrices. I am using append because this is usually a quick operation.
I am aware that the speed of the hard disk buffer does have an influence on timings but, right now, I really would like to shrink the amount of time required as much as possible by just modifying the code in a smart way.
As an alternative, I tried implementing multi-processing:
from multiprocessing import Pool

def reader(filename):
    return np.load(filename)

def load_multiprocess(path_list, n_cores=5):
    pool = Pool(n_cores)
    np_list = pool.map(reader, path_list)
    return np_list
However, the performance is much worse. I had a look around stackoverflow, and I got the idea that my specific application could not benefit from multiprocessing.
To summarize, I am looking for any kind of advice for one of these two tasks:
Improving the speed of the first code (even 0.1s less would mean a lot).
Using multiprocessing in the right way, if possible.
SOLUTION AND BENCHMARK
Out of the three methods proposed here, user7138814's solution seems to improve execution speed the most in general. However, things seem to change when the data is loaded while training a neural network: even though memory mapping is by itself still the quicker method for loading data, the overall training time seems to increase. I have no idea where or why, since the timings of the memory-mapped load are always better.
Below is a benchmark of the three methods. First, define them:
import numpy as np

# my initial method
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list

# Aaj Kaal's method
def load_batch1(path_list):
    return [np.load(path) for path in path_list]

# user7138814's method
def load_batch2(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path, mmap_mode='r'))
    return np_list
I defined a list of paths as follows:
batches_list = []
batch_size = 32
for n in range(0, 150):
    batches_list.append(X_path_list[n*batch_size:n*batch_size+batch_size])
The list contains 150 batches of 32 paths each, which should be enough to calculate a meaningful mean.
Then each method is executed on exactly the same data:
import time

# my initial method
timing0 = []
for l in batches_list:
    start = time.time()
    load_batch(l)
    end = time.time()
    timing0.append(end-start)
print(np.mean(timing0))

# Aaj Kaal's method
timing1 = []
for l in batches_list:
    start = time.time()
    load_batch1(l)
    end = time.time()
    timing1.append(end-start)
print(np.mean(timing1))

# user7138814's method
timing2 = []
for l in batches_list:
    start = time.time()
    load_batch2(l)
    end = time.time()
    timing2.append(end-start)
print(np.mean(timing2))
Output (mean timing in seconds over 150 executions):
0.022530150413513184
0.022546884218851725
0.009580903053283692
Results seem to be consistent when changing the length of batches_list and the batch_size.
Memory mapping the files may be beneficial because of lazy loading. If you use, for example,
np.load(filename, mmap_mode='r')
the creation of the numpy array becomes almost a no-op, but later in the pipeline you pay the price. This could provide a speedup if it results in processing the data in parallel with reading from disk.
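To make the trade-off concrete, here is a minimal sketch (with an illustrative file name) of where the deferred cost shows up:
arr = np.load('example.npy', mmap_mode='r')   # cheap: only the header is parsed, the data stays on disk
batch = np.asarray(arr)                       # the actual read happens here, when the data is materialized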
Did you try using a list comprehension? Replace
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
with
def load_batch(path_list):
    return [np.load(path) for path in path_list]
In fact, you can get rid of the function and use the list comprehension directly. If a callable is required, use a lambda.
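For example, a minimal sketch of the lambda form (reusing the batch list defined above):
load_batch = lambda path_list: [np.load(path) for path in path_list]
np_list = load_batch(batches_list[0])   # behaves like the original function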

How to ensure multiprocessing code uses the configured CPU cores?

I use multiprocessing's Pool to run in parallel. I first tried 4 cores on an HPC system, submitted as a batch job. When it uses 4 cores, the runtime is reduced 4 times compared to 1 core. But when I check with qstat, several times it uses 4 cores and after that just 1 core, with exactly the same code.
Could you please give some advice on what is wrong with my code or the system?
import pandas as pd
import numpy as np
from multiprocessing import Pool
from datetime import datetime

t1 = pd.read_csv("template.csv", header=None)
s1 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_adfr.csv")
s2 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_dock.csv")
s3 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_gemdock.csv")
s4 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_ledock.csv")
s5 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_plants.csv")
s6 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_psovina.csv")
s7 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_quickvina2.csv")
s8 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_smina.csv")
s9 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vina.csv")
s10 = pd.read_csv("/home/donp/dude_1000_raw_raw/dude_1000_raw_raw_vinaxb.csv")

# number of cores and array padding
n = 4
m = (len(t1) // n) + 1
g = m*n - len(t1)
for g1 in range(g):
    t1.loc[len(t1)] = 0

results = []

def block_linear(i):
    temp = pd.DataFrame(np.zeros((m, 29)))
    for a in range(0, m):
        sum_matrix = (t1.iloc[a,0]*s1) + (t1.iloc[a,1]*s2) + (t1.iloc[a,2]*s3) + (t1.iloc[a,3]*s4) + (t1.iloc[a,4]*s5) + (t1.iloc[a,5]*s6) + (t1.iloc[a,6]*s7) + (t1.iloc[a,7]*s8) + (t1.iloc[a,8]*s9) + (t1.iloc[a,9]*s10)
        rank_sum = pd.DataFrame.rank(sum_matrix, axis=0, ascending=True, method='min')  # real-True
        temp.iloc[a,:] = rank_sum.iloc[999].values
    temp['median'] = temp.median(axis=1)
    temp.index = range(i*m, (i+1)*m)
    return temp

start = datetime.now()
if __name__ == '__main__':
    pool = Pool(processes=n)
    results = pool.map(block_linear, range(0, n))
print(datetime.now() - start)

out = pd.concat(results)
out.drop(out.tail(g).index, inplace=True)
out.to_csv('test_10dock_4core.csv', index=False)
The main idea is to cut the large table into smaller ones, run the calculations, and combine the results.
Without more detail about how you are using multiprocessing's Pool it is really difficult to understand the problem and help. Note that Pool does not guarantee parallelization: apply, for example, only uses one worker of the Pool and blocks until it finishes. You can check out more details about it here and there.
But assuming you are using the library properly, you should make sure your code is fully parallelizable: an I/O operation on disk, for example, can bottleneck your parallelization and thus make your code run in only one process at a time.
I hope it helped.
[Edit]
Since you provided more details about your problem, I can give more specific tips:
The first thing is that your code is not actually parallel: you are just calling the same function N times. This is not how multiprocessing should be used.
Instead, the part that should be parallelized is the work that normally sits inside a for loop, like the one you have inside block_linear().
So, here is what I recommend:
You should change your code to first calculate all the weighted sums and only after that do the rest of the operations. This will help a lot with parallelization.
So, put this operation in a function:
def weighted_sum(column, df2):
    temp = pd.DataFrame(np.zeros(m))
    for a in range(0, m):
        result = (t1.iloc[a,column]*df2)
        temp.iloc[a] = result
    return temp
Then you use pool.starmap to parallelize the function over the 10 dataframes you have, something like this:
results = pool.starmap(weighted_sum, [(0,s1), (1,s2), (2,s3), ..., (9,s10)])
PS: pool.starmap is similar to pool.map but accepts a list of tuple arguments. You can find more details about it here.
Last but not least, you should operate over the results to finish the calculation. Since you will have one weighted sum per column, you can sum them together and then apply the rank, as sketched below.
This is not fully runnable code that solves your problem, but a general guide to how you should restructure your code to benefit from multiprocessing. I recommend testing it on a subsample of the data frames to make sure it works properly before running it on all your data.
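A rough sketch of that combination step, building on the (admittedly not fully runnable) weighted_sum above and reusing Pool and n from the question's code, with enumerate standing in for the hand-written tuple list:
score_dfs = [s1, s2, s3, s4, s5, s6, s7, s8, s9, s10]
if __name__ == '__main__':
    with Pool(processes=n) as pool:
        weighted = pool.starmap(weighted_sum, enumerate(score_dfs))
    # sum the ten weighted pieces element-wise, then rank, as block_linear did
    sum_matrix = sum(weighted[1:], weighted[0])
    rank_sum = sum_matrix.rank(axis=0, ascending=True, method='min')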

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
import numpy as np
import pandas as pd
from multiprocessing import Pool

def func(i):
    return i*2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])

N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10,10))

pool = Pool(cores)
gen = ({'values' : i, 'df' : df}
       for i in data_split)
data = pd.concat(pool.map(par_func_dict, gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way to avoid feeding the generator with copies of the dataframe, to prevent taking up so much memory.
The answer to the link above suggests using multiprocessing.Process(), but from what I can tell, it's difficult to use that on top of functions that return things (need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.
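One pattern that is often suggested for this situation (a hedged sketch, not from the original thread) is to hand the dataframe to each worker once via the Pool initializer, so the mapped tasks only carry the small values arrays:
import numpy as np
import pandas as pd
from multiprocessing import Pool

_worker_df = None  # set once per worker by the initializer

def init_worker(df):
    global _worker_df
    _worker_df = df

def func(i):
    return i * 2

def par_func(values):
    # _worker_df is available here without being pickled for every task
    return pd.Series([func(i) for i in values])

if __name__ == '__main__':
    df = pd.DataFrame(np.random.randn(10, 10))
    data_split = np.array_split(list(range(10000)), 3)
    with Pool(3, initializer=init_worker, initargs=(df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)
Each worker still ends up holding its own copy of the dataframe, but it is transferred once per worker rather than once per chunk; truly zero-copy sharing would need something like shared memory.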

How to read multiple files into multiple threads/processes to optimize data analysis?

I am trying to read 3 different files in Python and do something to extract the data out of them. Then I want to merge the data into one big file.
Since each individual file is already big and takes some time to process, I am thinking about whether
I can read all three files at once (in multiple threads/processes),
wait for the processing of all files to finish,
and, when all outputs are ready, pipe all the data to a downstream function to merge it.
Can someone suggest some improvements to this code to do what I want?
import pandas as pd
from functools import reduce

file01_output = ''
file02_output = ''
file03_output = ''

# I want to do all these three "with open(..)" at once.
with open('file01.txt', 'r') as file01:
    for line in file01:
        something01 = do_something(line)  # placeholder for the per-line processing
        file01_output += something01

with open('file02.txt', 'r') as file02:
    for line in file02:
        something02 = do_something(line)
        file02_output += something02

with open('file03.txt', 'r') as file03:
    for line in file03:
        something03 = do_something(line)
        file03_output += something03

def merge(a, b, c):
    a = file01_output
    b = file02_output
    c = file03_output
    # compile the list of dataframes you want to merge
    data_frames = [a, b, c]
    df_merged = reduce(lambda left, right: pd.merge(left, right,
                       on=['common_column'], how='outer'), data_frames).fillna('.')
There are many ways to use multiprocessing for your problem, so I'll just propose one. Since, as you mentioned, the processing of the data in each file is CPU bound, you can run it in a separate process and expect to see some improvement (how much improvement, if any, depends on the problem, the algorithm, the number of cores, etc.). For example, the overall structure could be a pool over which you map a list of all the filenames you need to process, with the computation done inside the mapped function.
It's easier with a concrete example. Let's pretend we have a list of CSVs 'file01.csv', 'file02.csv', 'file03.csv' which have a NUMBER column and we want to compute whether that number is prime (CPU bound). Example, file01.csv:
NUMBER
1
2
3
...
And the other files look similar but with different numbers to avoid duplicating work. The code to compute the primes could then look like this:
import pandas as pd
from multiprocessing import Pool
from sympy import isprime

def compute(filename):
    # IO (probably not faster)
    my_data_df = pd.read_csv(filename)
    # do some computing (CPU)
    my_data_df['IS_PRIME'] = my_data_df.NUMBER.map(isprime)
    return my_data_df

if __name__ == '__main__':
    filenames = ['file01.csv', 'file02.csv', 'file03.csv']
    # construct the pool and map to the workers
    with Pool(2) as pool:
        results = pool.map(compute, filenames)
    print(pd.concat(results))
I've used the sympy package for a convenient isprime method, and I'm sure the structure of your data is quite different, but hopefully that example illustrates a structure you could use too. Performing your CPU-bound computations in a pool (or a list of Processes) and then merging/reducing/concatenating the results is a reasonable approach to the problem.
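To tie this back to the merge step in the question, the mapped results could then be combined with the same reduce/merge pattern (a sketch, assuming each per-file dataframe carries the question's common_column):
from functools import reduce

df_merged = reduce(lambda left, right: pd.merge(left, right,
                   on=['common_column'], how='outer'), results).fillna('.')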

Speed up Python eval when reading and evaluating list of equations from file

I have put together a simple Python script which reads a large list of algebraic expressions from a text file on separate lines, evaluates the mathematics on each line and puts it into a numpy array. The eigenvalues of this matrix are then found. The parameters A,B,C will then be changed and the program run again, hence a function is used to achieve this.
Some of these text files will have millions of lines of equations, so after profiling the code I found that the eval command accounts for approximately 99% of the execution time. I am aware of the dangers of using eval but this code will only ever be used by myself. All other parts of the code are fast, except the call to eval.
Here is the code, where mat_size is set to 500, which represents a 500*500 array, meaning 250,000 lines of equations are read in from the file. I cannot provide the file as it is ~0.5 GB in size, but I have provided an example of what it looks like below; it only uses basic mathematical operations.
import numpy as np
from numpy import *
from scipy.linalg import eigvalsh

mat_size = 500

# Read the file line by line
with open("test_file.txt", 'r') as f:
    lines = f.readlines()

# Function to evaluate the maths and build the numpy array
def my_func(A,B,C):
    lst = []
    for i in lines:
        # Strip the \n
        new = eval(i.rstrip())
        lst.append(new)
    # Build the numpy array
    AA = np.array(lst, dtype=np.float64)
    # Resize it to mat_size
    matt = np.resize(AA, (mat_size, mat_size))
    return matt

# Function to find eigenvalues of the matrix
def optimise(x):
    A,B,C = x
    test = my_func(A,B,C)
    ev = -1*eigvalsh(test)
    return ev[-1]

# Define what A,B,C are; this can be changed each time the program is run
x0 = [7.65, 5.38, 4.00]
# Print result
print(optimise(x0))
A few lines of an example input text file: (mat_size can be changed to 2 to run this file)
.5/A**3*B**5+C
35.5/A**3*B**5+3*C
.8/C**3*A**5+C**9
.5/A*3+B**5-C/45
I am aware eval is usually bad practice and slow, so I looked for other means of achieving a speedup. I tried the methods outlined here, but none of them appeared to work. I also tried applying sympy to the problem, but that caused a massive slowdown. What is a better way of going about this problem?
EDIT
Following the suggestion to use numexpr instead, I have come across an issue where it grinds to a halt compared to standard eval. In some instances the matrix elements contain quite a lot of algebraic terms. Here is an example of just one matrix element, i.e. one of the equations in the file (it contains a few more symbols not defined in the code above, but they can easily be defined at the top of the code):
-71*A**3/(A+B)**7-61*B**3/(A+B)**7-3/2/B**2/C**2*A**6/(A+B)**7-7/4/B**3/m3*A**6/(A+B)**7-49/4/B**2/C*A**6/(A+B)**7+363/C*A**3/(A+B)**7*z3+451*B**3/C/(A+B)**7*z3-3/2*B**5/C/A**2/(A+B)**7-3/4*B**7/C/A**3/(A+B)**7-1/B/C**3*A**6/(A+B)**7-3/2/B**2/C*A**5/(A+B)**7-107/2/C/m3*A**4/(A+B)**7-21/2/B/C*A**4/(A+B)**7-25/2*B/C*A**2/(A+B)**7-153/2*B**2/C*A/(A+B)**7-5/2*B**4/C/m3/(A+B)**7-B**6/C**3/A/(A+B)**7-21/2*B**4/C/A/(A+B)**7-7/4/B**3/C*A**7/(A+B)**7+86/C**2*A**4/(A+B)**7*z3+90*B**4/C**2/(A+B)**7*z3-1/4*B**6/m3/A**3/(A+B)**7-149/4/B/C*A**5/(A+B)**7-65*B**2/C**3*A**4/(A+B)**7-241/2*B/C**2*A**4/(A+B)**7-38*B**3/C**3*A**3/(A+B)**7+19*B**2/C**2*A**3/(A+B)**7-181*B/C*A**3/(A+B)**7-47*B**4/C**3*A**2/(A+B)**7+19*B**3/C**2*A**2/(A+B)**7+362*B**2/C*A**2/(A+B)**7-43*B**5/C**3*A/(A+B)**7-241/2*B**4/C**2*A/(A+B)**7-272*B**3/C*A/(A+B)**7-25/4*B**6/C**2/A/(A+B)**7-77/4*B**5/C/A/(A+B)**7-3/4*B**7/C**2/A**2/(A+B)**7-23/4*B**6/C/A**2/(A+B)**7-11/B/C**2*A**5/(A+B)**7-13/B**2/m3*A**5/(A+B)**7-25*B/C**3*A**4/(A+B)**7-169/4/B/m3*A**4/(A+B)**7-27*B**2/C**3*A**3/(A+B)**7-47*B/C**2*A**3/(A+B)**7-27*B**3/C**3*A**2/(A+B)**7-38*B**2/C**2*A**2/(A+B)**7-131/4*B/m3*A**2/(A+B)**7-25*B**4/C**3*A/(A+B)**7-65*B**3/C**2*A/(A+B)**7-303/4*B**2/m3*A/(A+B)**7-5*B**5/C**2/A/(A+B)**7-49/4*B**4/m3/A/(A+B)**7-1/2*B**6/C**2/A**2/(A+B)**7-5/2*B**5/m3/A**2/(A+B)**7-1/2/B/C**3*A**7/(A+B)**7-3/4/B**2/C**2*A**7/(A+B)**7-25/4/B/C**2*A**6/(A+B)**7-45*B/C**3*A**5/(A+B)**7-3/2*B**7/C**3/A/(A+B)**7-123/2/C*A**4/(A+B)**7-37/B*A**4/(A+B)**7-53/2*B*A**2/(A+B)**7-75/2*B**2*A/(A+B)**7-11*B**6/C**3/(A+B)**7-39/2*B**5/C**2/(A+B)**7-53/2*B**4/C/(A+B)**7-7*B**4/A/(A+B)**7-7/4*B**5/A**2/(A+B)**7-1/4*B**6/A**3/(A+B)**7-11/C**3*A**5/(A+B)**7-43/C**2*A**4/(A+B)**7-363/4/m3*A**3/(A+B)**7-11*B**5/C**3/(A+B)**7-45*B**4/C**2/(A+B)**7-451/4*B**3/m3/(A+B)**7-5/C**3*A**6/(A+B)**7-39/2/C**2*A**5/(A+B)**7-49/4/B**2*A**5/(A+B)**7-7/4/B**3*A**6/(A+B)**7-79/2/C*A**3/(A+B)**7-207/2*B**3/C/(A+B)**7+22/B/C**2*A**5/(A+B)**7*z3+94*B/C**2*A**3/(A+B)**7*z3+76*B**2/C**2*A**2/(A+B)**7*z3+130*B**3/C**2*A/(A+B)**7*z3+10*B**5/C**2/A/(A+B)**7*z3+B**6/C**2/A**2/(A+B)**7*z3+3/B**2/C**2*A**6/(A+B)**7*z3+7/B**3/C*A**6/(A+B)**7*z3+52/B**2/C*A**5/(A+B)**7*z3+169/B/C*A**4/(A+B)**7*z3+131*B/C*A**2/(A+B)**7*z3+303*B**2/C*A/(A+B)**7*z3+49*B**4/C/A/(A+B)**7*z3+10*B**5/C/A**2/(A+B)**7*z3+B**6/C/A**3/(A+B)**7*z3-3/4*B**7/C/m3/A**3/(A+B)**7-7/4/B**3/C/m3*A**7/(A+B)**7-49/4/B**2/C/m3*A**6/(A+B)**7-149/4/B/C/m3*A**5/(A+B)**7-293*B/C/m3*A**3/(A+B)**7+778*B**2/C/m3*A**2/(A+B)**7-480*B**3/C/m3*A/(A+B)**7-77/4*B**5/C/m3/A/(A+B)**7-23/4*B**6/C/m3/A**2/(A+B)**7
numexpr completely chokes when the matrix elements are of this form, whereas eval evaluates them instantaneously. For just a 10*10 matrix (100 equations in the file) numexpr takes about 78 seconds to process the file, whereas eval takes 0.01 seconds. Profiling the numexpr version reveals that numexpr's getExprNames and precompile functions are the cause of the issue, with precompile taking 73.5 seconds of the total time and getExprNames taking 3.5 seconds. Why would precompile, along with getExprNames, cause such a bottleneck in this particular calculation? Is this module just not well suited to long algebraic expressions?
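For reference, the numexpr attempt was presumably something along these lines (a hedged sketch, not the asker's actual code; the extra symbol values are illustrative):
import numexpr as ne

params = {'A': 7.65, 'B': 5.38, 'C': 4.00, 'm3': 1.0, 'z3': 2.0}
lst = [ne.evaluate(line.rstrip(), local_dict=params) for line in lines]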
I found a way to speed eval() up in this particular instance by making use of the multiprocessing library. I read the file in as usual, but then break the list into equally sized sub-lists, which can be processed separately on different CPUs, and recombine the evaluated sub-lists at the end. This offers a nice speedup over the original method. I am sure the code below can be simplified/optimised, but for now it works (for instance, what if there is a prime number of list elements? That would mean unequal lists). Some rough benchmarks show it is ~3 times faster using the 4 CPUs of my laptop. Here is the code:
from multiprocessing import Process, Queue

with open("test.txt", 'r') as h:
    linesHH = h.readlines()

# Get the number of list elements
size = len(linesHH)
# Break apart the list into the desired number of chunks
chunk_size = size // 4  # integer division so the slice indices are ints
chunks = [linesHH[x:x+chunk_size] for x in range(0, len(linesHH), chunk_size)]

# Declare variables
A = 0.1
B = 2
C = 2.1
m3 = 1
z3 = 2

# Declare all the functions that process the substrings
def my_funcHH1(A,B,C,que):  # add an argument to the function for assigning a queue to each chunk
    lstHH1 = []
    for i in chunks[0]:
        HH1 = eval(i)
        lstHH1.append(HH1)
    que.put(lstHH1)

def my_funcHH2(A,B,C,que):
    lstHH2 = []
    for i in chunks[1]:
        HH2 = eval(i)
        lstHH2.append(HH2)
    que.put(lstHH2)

def my_funcHH3(A,B,C,que):
    lstHH3 = []
    for i in chunks[2]:
        HH3 = eval(i)
        lstHH3.append(HH3)
    que.put(lstHH3)

def my_funcHH4(A,B,C,que):
    lstHH4 = []
    for i in chunks[3]:
        HH4 = eval(i)
        lstHH4.append(HH4)
    que.put(lstHH4)

queue1 = Queue()
queue2 = Queue()
queue3 = Queue()
queue4 = Queue()

# Declare the processes
p1 = Process(target=my_funcHH1, args=(A,B,C,queue1))
p2 = Process(target=my_funcHH2, args=(A,B,C,queue2))
p3 = Process(target=my_funcHH3, args=(A,B,C,queue3))
p4 = Process(target=my_funcHH4, args=(A,B,C,queue4))

# Start them
p1.start()
p2.start()
p3.start()
p4.start()

# Collect the results (get before join so the queues do not fill up and block)
HH1 = queue1.get()
HH2 = queue2.get()
HH3 = queue3.get()
HH4 = queue4.get()

p1.join()
p2.join()
p3.join()
p4.join()

# Obtain the final result by combining the lists together again.
mergedlist = HH1 + HH2 + HH3 + HH4
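As noted above, the code can probably be simplified; one possible sketch (not from the original post) uses a Pool so that uneven list lengths are handled and the four near-identical functions collapse into one:
from multiprocessing import Pool

A, B, C, m3, z3 = 0.1, 2, 2.1, 1, 2  # same illustrative parameter values as above

def eval_chunk(chunk):
    # eval picks up the module-level A, B, C, m3, z3
    return [eval(line) for line in chunk]

if __name__ == '__main__':
    with open("test.txt", 'r') as h:
        lines = h.readlines()
    n_proc = 4
    chunk_size = -(-len(lines) // n_proc)  # ceiling division, so uneven lists are covered
    chunks = [lines[x:x+chunk_size] for x in range(0, len(lines), chunk_size)]
    with Pool(n_proc) as pool:
        mergedlist = [value for sub in pool.map(eval_chunk, chunks) for value in sub]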
