I am new here but I wanted to ask something regarding multiprocessing.
So I have some huge raster tiles that I process and extract information and I found that delivering tons of pickle files is faster than appending to a dataframe. The point is that I loop over each of my tiles for processing and I create pools inside a for loop
#This creates a directory for my pickle files
if not os.path.exists('pkl_tmp'):
os.mkdir('pkl_tmp')
Here I start looping over each one of my tiles and create a pool with the grid cells that I want to process, then I use the map function to apply all my nasty processing to each cell of my grid.
for GHSL_tile in ROI.iloc[4:].itertuples():
ct += 1
L18_cells = GHSL_query(GHSL_tile, L18_grid)
vector_tile = poligonize_tile(GHSL_tile)
print(datetime.today())
subdir = './pkl_tmp/{}/'.format(ct)
if not os.path.exists(subdir):
os.mkdir(subdir)
if vector_tile is not None:
# assign how many cores will be used
num_processes = int(multiprocessing.cpu_count() - 15)
chunk_size = 1 # chunk size set to 1 to return cell like outputs
# break the dataframe as a list
chunks = [L18_cells.iloc[i:i + chunk_size, :] for i in range(0, L18_cells.shape[0], chunk_size)]
pool = multiprocessing.Pool(processes=num_processes)
result = pool.map(process_cell, chunks)
del result
else:
print('Tile # {} skipped'.format(ct))
print('GHSL database created')
In fact this has not errors, it takes around 2 days executing due to the size of my data and sometimes I had many iddle cores (specially towards the end of a tile).
My question is:
I tried using map_async instead of map and it was creating files really fast, even sometimes processes multiple tiles at the same time which is wonderful, the problem is that when it creates the directory for my last tile, the code gets out of the for loop and many tasks end up not being executed. What am I doing wrong? How can I make the map_async function work better or how can I avoid iddle cores (slow down) when I use the map function?
Thank you in advance
PC resources is definitely not a problem
Related
I have a very large python code. The fundamental of that is I have a function which use a row of Dataframe and apply some formulas and save the object i've create whit joblib in my files. (Im gonna put a function to capture the essence of the script).
import Multiprocessing as multi
def somefunct(DataFrame_row, some_parameter1, some_parameter2, sema):
python_object = My_object(DataFrame_row['Column1'],DataFrame_row['Column2'])
python_object.some_complicate_method(some_parameter1, some_parameter2)
# for example calculate an integral of My_object data
#takes 50-60 second aprox per row
joblib.dump(python_object, path_save)
#Before of tried a function that save the object i tried afunction that
#save the object in the DataFrame
sema.release()
def apply_all_data_frame(df, n_procces):
sema = multi.Semaphore(n_procesos)
procesos_list = []
for index, row in df.iterrow():
sema.acquire()
p = multi.Process(target = somefunct,
args = (row, some_parameter1, some_parameter2, sema))
procesos_list.append(p)
p.start()
for proceso in procesos_list:
proceso.join()
So, the DataFrame contain 5000 rows and it maybe contain more in the future. I test the script with a data with 100 rows in a computer with 16 cores and 32 logic processor. I choose 30 process and with 100 rows use the 30 process(100% CPU) and finish quickly. But when i try again with all the data the computer only use 4 or 3 process (11%) and use 2.0 gb of RAM each process. Take to long.
My first try with the program was use Pool and Pool.map, but in that case is the same problem and full the RAM an broke everything despite having use less process (16 i think).
I've coment in the script that my first program was saving the object in the DataFrame but when i see that the RAM full 100% i decided to save the object. In that case i tried the Pool and freezing all, because create a python process with 0% work in the CPU.
I tried the function without Semaphore to.
I'm apologize for the English and for the explanation, is my first question online.
screenshot of how the process of computer works
I have edited the code , currently it is working fine . But thinks it is not executing parallely or dynamically . Can anyone please check on to it
Code :
def folderStatistic(t):
j, dir_name = t
row = []
for content in dir_name.split(","):
row.append(content)
print(row)
def get_directories():
import csv
with open('CONFIG.csv', 'r') as file:
reader = csv.reader(file,delimiter = '\t')
return [col for row in reader for col in row]
def folderstatsMain():
freeze_support()
start = time.time()
pool = Pool()
worker = partial(folderStatistic)
pool.map(worker, enumerate(get_directories()))
def datatobechecked():
try:
folderstatsMain()
except Exception as e:
# pass
print(e)
if __name__ == '__main__':
datatobechecked()
Config.CSV
C:\USERS, .CSV
C:\WINDOWS , .PDF
etc.
There may be around 200 folder paths in config.csv
welcome to StackOverflow and Python programming world!
Moving on to the question.
Inside the get_directories() function you open the file in with context, get the reader object and close the file immediately after the moment you leave the context so when the time comes to use the reader object the file is already closed.
I don't want to discourage you, but if you are very new to programming do not dive into parallel programing yet. Difficulty in handling multiple threads simultaneously grows exponentially with every thread you add (pools greatly simplify this process though). Processes are even worse as they don't share memory and can't communicate with each other easily.
My advice is, try to write it as a single-thread program first. If you have it working and still need to parallelize it, isolate a single function with input file path as a parameter that does all the work and then use thread/process pool on that function.
EDIT:
From what I can understand from your code, you get directory names from the CSV file and then for each "cell" in the file you run parallel folderStatistics. This part seems correct. The problem may lay in dir_name.split(","), notice that you pass individual "cells" to the folderStatistics not rows. What makes you think it's not running paralelly?.
There is a certain amount of overhead in creating a multiprocessing pool because creating processes is, unlike creating threads, a fairly costly operation. Then those submitted tasks, represented by each element of the iterable being passed to the map method, are gathered up in "chunks" and written to a multiprocessing queue of tasks that are read by the pool processes. This data has to move from one address space to another and that has a cost associated with it. Finally when your worker function, folderStatistic, returns its result (which is None in this case), that data has to be moved from one process's address space back to the main process's address space and that too has a cost associated with it.
All of those added costs become worthwhile when your worker function is sufficiently CPU-intensive such that these additional costs is small compared to the savings gained by having the tasks run in parallel. But your worker function's CPU requirements are so small as to reap any benefit from multiprocessing.
Here is a demo comparing single-processing time vs. multiprocessing times for invoking a worker function, fn, twice where the first time it only performs its internal loop 10 times (low CPU requirements) while the second time it performs its internal loop 1,000,000 times (higher CPU requirements). You can see that in the first case the multiprocessing version runs considerable slower (you can't even measure the time for the single processing run). But when we make fn more CPU-intensive, then multiprocessing achieves gains over the single-processing case.
from multiprocessing import Pool
from functools import partial
import time
def fn(iterations, x):
the_sum = x
for _ in range(iterations):
the_sum += x
return the_sum
# required for Windows:
if __name__ == '__main__':
for n_iterations in (10, 1_000_000):
# single processing time:
t1 = time.time()
for x in range(1, 20):
fn(n_iterations, x)
t2 = time.time()
# multiprocessing time:
worker = partial(fn, n_iterations)
t3 = time.time()
with Pool() as p:
results = p.map(worker, range(1, 20))
t4 = time.time()
print(f'#iterations = {n_iterations}, single processing time = {t2 - t1}, multiprocessing time = {t4 - t3}')
Prints:
#iterations = 10, single processing time = 0.0, multiprocessing time = 0.35399389266967773
#iterations = 1000000, single processing time = 1.182999849319458, multiprocessing time = 0.5530076026916504
But even with a pool size of 8, the running time is not reduced by a factor of 8 (it's more like a factor of 2) due to the fixed multiprocessing overhead. When I change the number of iterations for the second case to be 100,000,000 (even more CPU-intensive), we get ...
#iterations = 100000000, single processing time = 109.3077495098114, multiprocessing time = 27.202054023742676
... which is a reduction in running time by a factor of 4 (I have many other processes running in my computer, so there is competition for the CPU).
I have a script that uses multiprocessing to open and perform calculation on ~200k .csv files. Here's the workflow:
1) Considering a folder with ~200k .csv files. Each .csv file contains the folowing:
.csv file example:
0, 1
2, 3
4, 5
...
~500 rows
2) The script saves a list of all .csv files in a list()
3) The script divides the list with ~200k .csv files into 8 lists since I have 8 processors available.
4) The script calls do_something_with_csv() 8 times and perform calculation in parallel.
In linear mode, the execution takes around 4 min.
In parallel and in series, if I execute the script for the first time, it takes a much longer time. If I execute for the second, third etc. time, it takes around 1min. Seems like python is caching the IO operations of some sort? It looks like because I have a progress bar, and for example, if I execute until the progress bar is 5k/200k and terminate the program, the next execution will go through the first 5k runs very quickly and then slow down.
Python version: 3.6.1
Pseudo Python code:
def multiproc_dispatch():
lst_of_all_csv_files = get_list_of_files('/path_to_csv_files')
divided_lst_of_all_csv_files = split_list_chunks(lst_of_all_csv_files, 8)
manager = Manager()
shared_dict = manager.dict()
jobs = []
for lst_of_all_csv_files in divided_lst_of_all_csv_files:
p = Process(target=do_something_with_csv, args=(shared_dict, lst_of_all_csv_files))
jobs.append(p)
p.start()
# Wait for the worker to finish
for job in jobs:
job.join()
def read_csv_file(csv_file):
lst_a = []
lst_b = []
with open(csv_file, 'r') as f_read:
csv_reader = csv.reader(f_read, delimiter = ',')
for row in csv_reader:
lst_a.append(float(row[0]))
lst_b.append(float(row[1]))
return lst_a, lst_b
def do_something_with_csv(shared_dict, lst_of_all_csv_files):
temp_dict = lambda: defaultdict(self.mydict)()
for csv_file in lst_of_all_csv_files:
lst_a, lst_b = read_csv_file(csv_file)
temp_dict[csv_file] = (lst_a, lst_b)
shared_dict.update(temp_dict)
if __name__ == '__main__':
multiproc_dispatch()
This is without a doubt RAM caching coming into play, meaning that loading your files is faster the second time as data is already in RAM and is not coming from disk. (struggling to find good references here, any help welcome)
This has nothing to do with multiprocessing, not even with python itself.
Irrelevant since question edit I think the cause of the longer duration taken by your code when run in parallel comes from your shared_dict variable that is accessed from within each subprocess (see e.g. here). Creating and sending data between processes in python is slow and should be reduced to minimum (here you could return one dict per job then merge them).
I am trying to parallelize my code to find the similarity matrix using multiprocessing module in Python. It works fine when I use the small np.ndarray with 10 X 15 elements. But, when I scale my np.ndarray to 3613 X 7040 elements, system runs out of memory.
Below, is my code.
import multiprocessing
from multiprocessing import Pool
## Importing Jacard_similarity_score
from sklearn.metrics import jaccard_similarity_score
# Function for finding the similarities between two np arrays
def similarityMetric(a,b):
return (jaccard_similarity_score(a,b))
## Below functions are used for Parallelizing the scripts
# auxiliary funciton to make it work
def product_helper1(args):
return (similarityMetric(*args))
def parallel_product1(list_a, list_b):
# spark given number of processes
p = Pool(8)
# set each matching item into a tuple
job_args = getArguments(list_a,list_b)
# map to pool
results = p.map(product_helper1, job_args)
p.close()
p.join()
return (results)
## getArguments function is used to get the combined list
def getArguments(list_a,list_b):
arguments = []
for i in list_a:
for j in list_b:
item = (i,j)
arguments.append(item)
return (arguments)
Now when I run the below code, system runs out of memory and gets hanged. I am passing two numpy.ndarrays testMatrix1 and testMatrix2 which are of size (3613, 7040)
resultantMatrix = parallel_product1(testMatrix1,testMatrix2)
I am new to using this module in Python and trying to understand where I am going wrong. Any help is appreciated.
Odds are, the problem is just combinatoric explosion. You're trying to realize all the pairs in the main process up front, rather than generating them live, so you're storing a huge amount of memory. Assuming the ndarrays contain double values, which become Python float, then the memory usage of the list returned by getArguments is roughly the cost of a tuple and two floats per pair, or about:
3613 * 7040 * (sys.getsizeof((0., 0.)) + sys.getsizeof(0.) * 2)
On my 64 bit Linux system, that means ~2.65 GB of RAM on Py3, or ~2.85 GB on Py2, before the workers even do anything.
If you can process the data in a streaming fashion using a generator, so arguments are produced lazily and discarded when no longer needed, you could probably reduce memory usage dramatically:
import itertools
def parallel_product1(list_a, list_b):
# spark given number of processes
p = Pool(8)
# set each matching item into a tuple
# Returns a generator that lazily produces the tuples
job_args = itertools.product(list_a,list_b)
# map to pool
results = p.map(product_helper1, job_args)
p.close()
p.join()
return (results)
This still requires all the results to fit in memory; if product_helper returns floats, then the expected memory usage for the result list on a 64 bit machine would still be around 0.75 GB or so, which is pretty large; if you can process the results in a streaming fashion, iterating the results of p.imap or even better, p.imap_unordered (the latter returns results as computed, not in the order the generator produced the arguments) and writing them to disk or otherwise ensuring they're released in memory quickly would save a lot of memory; the following just prints them out, but writing them to a file in some reingestable format would also be reasonable.
def parallel_product1(list_a, list_b):
# spark given number of processes
p = Pool(8)
# set each matching item into a tuple
# Returns a generator that lazily produces the tuples
job_args = itertools.product(list_a,list_b)
# map to pool
for result in p.imap_unordered(product_helper1, job_args):
print(result)
p.close()
p.join()
The map method sends all data to the workers via inter-process communication. As currently done, this consumes a huge amount of resources, because you're sending
What I would suggest it to modify getArguments to make a list of tuple of indices in the matrix that need to be combined. That's only two numbers that have to be sent to the worker process, instead of two rows of a matrix! Each worker then knows which rows in the matrix to use.
Load the two matrices and store them in global variables before calling map. This way every worker has access to them. And as long as they're not modified in the workers, the OS's virtual memory manager will not copy identical memory pages, keeping memory usage down.
I am attempting to parallelize an algorithm that I have been working on using the Multiprocessing and Pool.map() commands. I ran into a problem and was hoping someone could point me in the right direction.
Let x denote an array of N rows and 1 column, which is initialized to be a vector of zeros. Let C denote an array of length N by 2. The vector x is constructed iteratively by using information from some subsets of C (doing some math operations). The code (not parallelized) as a large for loop looks roughly as follows:
for j in range(0,N)
#indx_j will have n_j <<N entries
indx_j = build_indices(C,j)
#x_j will be entries to be added to vector x at indices indx_j
#This part is time consuming
x_j = build_x_j(indx_j,C)
#Add x_j into entries of x
x[indx_j] = x[indx_j] + x_j
I was able to parallelize this using the multiprocessing module and using the pool.map to eliminate the large for loop. I wrote a function that did the above computations, except the step of adding x_j to x[indx_j]. The parallelized function instead returns two data sets back: x_j and indx_j. After those are computed, I run a for loop (not parallel) to build up x by doing the x[indx_j] = x[indx_j] +x_j computation for j=0,N.
The downside to my method is that pool.map operation returns a gigantic list of N pairs of arrays x_j and indx_j. where both x_j and indx_j were n_j by 1 vectors (n_j << N). For large N (N >20,000) this was taking up way too much memory. Here is my question: Can I somehow, in parallel, do the construction operation x[indx_j] = x[indx_j] + x_j. It seems to me each process in pool.map() would have to be able to interact with the vector x. Do I place x in some sort of shared memory? How would I do such a thing? I suspect that this has to be possible somehow, as I assume people assemble matrices in parallel for finite element methods all the time. How can I have multiple processes interact with a vector without having some sort of problem? I'm worried that perhaps for j= 20 and j = 23, if they happen simultaneously, they might try to add to x[indx_20] = x[indx_20] + x_20 and simultaneously x[indx_30] = x[indx_30] + x_30 and maybe some error will happen. I also don't know how to even have this computation done via the pool.map() (I don't think I can feed x in as an input, as it would be changing after each process).
I'm not sure if it matters or not, but the sets indx_j will have non-trivial intersection (e.g., indx_1 and indx_2 may have indices [1,2,3] and [3,4,5] for example).
If this is unclear, please let me know and I will attempt to clarify. This is my first time trying to work in parallel, so I am very unsure of how to proceed. Any information would be greatly appreciated. Thanks!
I dont know If I am qualified to give proper advice on the topic of shared memory arrays, but I had a similar need to share arrays across processes in python recently and came across a small custom numpy.ndarray implementation for a shared memory array in numpy using the shared ctypes within multiprocessing. Here is a link to the code: shmarray.py. It acts just like a normal array,except the underlying data is stored in shared memory, meaning separate processes can both read and write to the same array.
Using Shared Memory Array
In threading, all information available to the thread (global and local namespace) can be handled as shared between all other threads that have access to it, but in multiprocessing that data is not so easily accessible. On linux data is available for reading, but cannot be written to. Instead when a write is done, the data is copied and then written to, meaning no other process can see those changes. However, if the memory being written to is shared memory, it is not copied. This means with shmarray we can do things similar to the way we would do threading, with the true parallelism of multiprocessing. One way to have access to the shared memory array is with a subclass. I know you are currently using Pool.map(), but I had felt limited by the way map worked, especially when dealing with n-dimensional arrays. Pool.map() is not really designed to work with numpy styled interfaces, at least I don't think it can easily. Here is a simple idea where you would spawn a process for each j in N:
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
def __init__(self, j, C, x):
multiprocessing.Process.__init__()
self.shared_x = x
self.C = C
self.j = j
def run(self):
#Your Stuff
#indx_j will have n_j <<N entries
indx_j = build_indices(self.C,self.j)
#x_j will be entries to be added to vector x at indices indx_j
x_j = build_x_j(indx_j,self.C)
#Add x_j into entries of x
self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#And then actually do the work
N = #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
for j in range(N):
proc = Worker(j, C, x)
procs.append(proc)
proc.start()
#And then join() the processes with the main process
for proc in procs:
proc.join()
Custom Process Pool and Queues
So this might work, but spawning several thousand processes is not really going to be of any use if you only have a few cores. The way I handled this was to implement a Queue system between my process. That is to say, we have a Queue that the main process fills with j's and then a couple worker processes get numbers from the Queue and do work with it, note that by implementing this, you are essentially doing exactly what Pool does. Also note we are actually going to use multiprocessing.JoinableQueue for this since it lets use join() to wait till a queue is emptied.
Its not hard to implement this at all really, simply we must modify our Subclass a bit and how we use it.
import numpy as np
import shmarray
import multiprocessing
class Worker(multiprocessing.Process):
def __init__(self, C, x, job_queue):
multiprocessing.Process.__init__()
self.shared_x = x
self.C = C
self.job_queue = job_queue
def run(self):
#New Queue Stuff
j = None
while j!='kill': #this is how I kill processes with queues, there might be a cleaner way.
j = self.job_queue.get() #gets a job from the queue if there is one, otherwise blocks.
if j!='kill':
#Your Stuff
indx_j = build_indices(self.C,j)
x_j = build_x_j(indx_j,self.C)
self.shared_x[indx_j] = self.shared_x[indx_j] + x_j
#This tells the queue that the job that was pulled from it
#Has been completed (we need this for queue.join())
self.job_queue.task_done()
#The way we interact has changed, now we need to define a job queue
job_queue = multiprocessing.JoinableQueue()
N = #What ever N should be
x = shmarray.zeros(shape=(N,1))
C = #What ever C is, doesn't need to be shared mem, since no writing is happening
procs = []
proc_count = multiprocessing.cpu_count() # create as many procs as cores
for _ in range(proc_count):
proc = Worker(C, x, job_queue) #now we pass the job queue instead
procs.append(proc)
proc.start()
#all the workers are just waiting for jobs now.
for j in range(N):
job_queue.put(j)
job_queue.join() #this blocks the main process until the queue has been emptied
#Now if you want to kill all the processes, just send a 'kill'
#job for each process.
for proc in procs:
job_queue.put('kill')
job_queue.join()
Finally, I really cannot say anything about how this will handle writing to overlapping indices at the same time. Worst case is that you could have a serious problem if two things attempt to write at the same time and things get corrupted/crash(I am no expert here so I really have no idea if that would happen). Best case since you are just doing addition, and order of operations doesn't matter, everything runs smoothly. If it doesn't run smoothly, my suggestion is to create a second custom Process subclass that specifically does the array assignment. To implement this you would need to pass both a job queue, and an 'output' queue to the Worker subclass. Within the while loop, you should have a `output_queue.put((indx_j, x_j)). NOTE: If you are putting these into a Queue they are being pickled, which can be slow. I recommend making them shared memory arrays if they can be before using put. It may be faster to just pickle them in some cases, but I have not tested this. To assign these as they are generated, you then need to have your Assigner process read these values from a queue as jobs and apply them, such that the work loop would essentially be:
def run(self):
job = None
while job!='kill':
job = self.job_queue.get()
if job!='kill':
indx_j, x_j = job
#Note this is the process which really needs access to the X array.
self.x[indx_j] += x_j
self.job_queue.task_done()
This last solution will likely be slower than doing the assignment within the worker threads, but if you are doing it this way, you have no worries about race conditions, and memory is still lighter since you can use up the indx_j and x_j values as you generate them, instead of waiting till all of them are done.
Note for Windows
I didn't do any of this work on windows, so I am not 100% certain, but I believe the code above will be very memory intensive since windows does not implement a copy-on-write system for spawning independent processes. Essentially windows will copy ALL information that a process needs when spawning a new one from the main process. To fix this, I think replacing all your x_j and C with shared memory arrays (anything you will be handing around to other processes) instead of normal arrays should cause windows to not copy the data, but I am not certain. You did not specify which platform you were on so I figured better safe than sorry, since multiprocessing is a different beast on windows than linux.