I am using pool.map from multiprocessing to run my custom function:
import pandas as pd
from multiprocessing import pool

def my_func(data):  # This is just a dummy function.
    data = data.assign(new_col=data.apply(lambda x: f(x), axis=1))
    return data

def main():
    mypool = pool.Pool(processes=16, maxtasksperchild=100)
    ret_list = mypool.map(my_func, (group for name, group in gpd))
    mypool.close()
    mypool.join()
    result = pd.concat(ret_list, axis=0)
Here gpd is a grouped DataFrame, so I am passing one DataFrame at a time to pool.map. I keep getting a memory error.
As I can see, VIRT grows many-fold and leads to this error.
Two questions:
How do I solve this growing memory (VIRT) issue? Maybe there is a way to play with the chunk size here?
Second, although it launches as many Python subprocesses as I specify in Pool(processes=...), I can see that the CPUs don't all hit 100%; it seems not all processes are used, and only one or two run at a time. Maybe that is because the same chunk size is applied to the differently sized DataFrames I pass each time (some DataFrames will be small)? How do I utilise every CPU process here?
Just for anyone looking for an answer in the future: I solved this by using imap instead of map, because map converts the whole iterable into a list up front (and collects all results at once), which is memory intensive.
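A minimal sketch of that switch (assuming gpd, f and my_func from the question; chunksize=10 is an arbitrary value you would tune for your data):
import pandas as pd
from multiprocessing import Pool

# Sketch only: gpd, f and my_func come from the question above.
with Pool(processes=16, maxtasksperchild=100) as mypool:
    # imap pulls groups from the generator lazily and yields results as
    # workers finish them, instead of building the full input list up front.
    ret_iter = mypool.imap(my_func, (group for name, group in gpd), chunksize=10)
    result = pd.concat(ret_iter, axis=0)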
Related
I've not used multiprocessing/multithreading in Python code before.
I have a long script (over 600 lines) and I need to run it using multiple CPUs.
I saw how to use multiprocessing/threading, but I could not find a way to apply it to the whole code.
The code has this form:
for loop:
    read csv
    do several preprocessing steps
    mean of the values
    compare it with other values
    ...
If I have to edit all of the code for multiprocessing it would take a long time, so could you please let me know if you know a way to apply multiprocessing to the whole code?
To parallelize a function across multiple CPU cores, it must generally avoid mutating global state and each function call must be independent of the others. Consider this hypothetical function which respects those conditions (the comparison with other values is removed):
def f(file: Path) -> Value:
    data = read_csv(file)
    processed = pre_processing(data)
    return mean(processed)
You can easily multithread it in Python using the built-in concurrent.futures package:
from concurrent.futures import ThreadPoolExecutor

files = ["/path/1/", ...]  # List of files

with ThreadPoolExecutor() as executor:
    values = executor.map(f, files)

# Compare values here
for value in values:
    ...
You can also use ProcessPoolExecutor for multiprocessing.
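For CPU-bound preprocessing, a process-based variant of the same sketch (reusing the hypothetical f and files from above) could look like this; the __main__ guard is needed so child processes can import the module safely:
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    files = ["/path/1/", ...]  # List of files (placeholder, as above)

    with ProcessPoolExecutor() as executor:
        # Each file is processed in a separate process, one result per file
        values = list(executor.map(f, files))

    # Compare values here
    for value in values:
        ...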
I have a very large Python program. The fundamental part of it is a function which takes a row of a DataFrame, applies some formulas, and saves the object I've created to my files with joblib. (I'm going to show a function that captures the essence of the script.)
import multiprocessing as multi
import joblib

def somefunct(DataFrame_row, some_parameter1, some_parameter2, sema):
    python_object = My_object(DataFrame_row['Column1'], DataFrame_row['Column2'])
    python_object.some_complicate_method(some_parameter1, some_parameter2)
    # for example, calculate an integral of the My_object data
    # takes 50-60 seconds approx. per row
    joblib.dump(python_object, path_save)
    # Before the function that saves the object, I tried a function that
    # saved the object in the DataFrame
    sema.release()

def apply_all_data_frame(df, n_processes):
    sema = multi.Semaphore(n_processes)
    procesos_list = []
    for index, row in df.iterrows():
        sema.acquire()
        p = multi.Process(target=somefunct,
                          args=(row, some_parameter1, some_parameter2, sema))
        procesos_list.append(p)
        p.start()
    for proceso in procesos_list:
        proceso.join()
So, the DataFrame contains 5000 rows and may contain more in the future. I tested the script with 100 rows on a computer with 16 cores and 32 logical processors. I chose 30 processes, and with 100 rows it used all 30 processes (100% CPU) and finished quickly. But when I tried again with all the data, the computer only used 3 or 4 processes (11% CPU), each one using 2.0 GB of RAM, and it took too long.
My first attempt was with Pool and Pool.map, but it had the same problem: it filled the RAM and broke everything, despite using fewer processes (16, I think).
As I commented in the script, my first version saved the object in the DataFrame, but when I saw the RAM reach 100% I decided to save the object to disk instead. In that case I tried Pool and everything froze, because it created Python processes doing 0% work on the CPU.
I also tried the function without the Semaphore.
I apologize for my English and for the explanation; this is my first question online.
(screenshot of how the processes behave on the computer)
I am using Dask for a complicated operation. First I do a reduction which produces a moderately sized df (a few MB), which I then need to pass to each worker to calculate the final result, so my code looks a bit like this:
intermediate_result = ddf.reduction().compute()
final_result = ddf.reduction(
chunk=function, chunk_kwargs={"intermediate_result": intermediate_result}
)
However I am getting a warning message that looks like this:
Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s)
I have tried doing this
intermediate_result = client.scatter(intermediate_result, broadcast=True)
But this isn't working, as the function now sees a Future object instead of the datatype it is supposed to be.
I can't seem to find any documentation on how to use scatter with reductions. Does anyone know how to do this? Or should I just ignore the warning message and pass the moderately sized df as I am?
Actually, the best solution is probably not to scatter your materialised result, but to avoid computing it in the first place. You can simply remove the .compute(), which means all the calculation gets done in one stage, with the results automatically moved to where you need them.
Alternatively, if you want to have a clear boundary between the stages, you can use
intermediate_result = ddf.reduction().persist()
which will kick off the reduction and store it on workers without pulling it to the client. You can choose to wait on this to finish before the next step or not.
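As a small sketch of that second option, assuming a dask.distributed client is active and ddf is the dask DataFrame from the question:
from dask.distributed import wait

# Kick off the first reduction and keep the pieces on the workers;
# nothing is pulled back to the client.
intermediate_result = ddf.reduction().persist()

# Optional: block until the persisted result is finished, if you want a
# clear boundary between the two stages before building the next graph.
wait(intermediate_result)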
I have a function which I will run using multiprocessing. However, the function returns a value and I do not know how to store that value once it's done.
I read somewhere online about using a queue, but I don't know how to implement it or whether that would even work.
cores = []
for i in range(os.cpu_count()):
    cores.append(Process(target=processImages, args=(dataSets[i],)))

for core in cores:
    core.start()

for core in cores:
    core.join()
Where the function 'processImages' returns a value. How do I save the returned value?
In your code fragment you have input dataSets which is a list of some unspecified size. You have a function processImages which takes a dataSet element and apparently returns a value you want to capture.
cpu_count == dataset length ?
The first problem I notice is that os.cpu_count() drives the range of values i which then determines which datasets you process. I'm going to assume you would prefer these two things to be independent. That is, you want to be able to crunch some X number of datasets and you want it to work on any machine, having anywhere from 1 - 1000 (or more...) cores.
An aside about CPU-bound work
I'm also going to assume that you have already determined that the task really is CPU-bound, thus it makes sense to split by core. If, instead, your task is disk io-bound, you would want more workers. You could also be memory bound or cache bound. If optimal parallelization is important to you, you should consider doing some trials to see which number of workers really gives you maximum performance.
Here's more reading if you like
Pool class
Anyway, as mentioned by Michael Butscher, the Pool class simplifies this for you. Yours is a standard use case. You have a set of work to be done (your list of datasets to be processed) and a number of workers to do it (in your code fragment, your number of cores).
TLDR
Use those simple multiprocessing concepts like this:
import os
from multiprocessing import Pool

# Renaming this variable just for clarity of the example here
work_queue = dataSets

# This is the number you might want to find experimentally. Or just run with cpu_count()
worker_count = os.cpu_count()

# This will create processes (fork) and join all for you behind the scenes
worker_pool = Pool(worker_count)

# Farm out the work, gather the results. Does not care whether dataset count equals cpu count
processed_work = worker_pool.map(processImages, work_queue)

# Do something with the result
print(processed_work)
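If you want to follow the earlier aside and find the worker count experimentally, a rough (hypothetical) timing sketch along these lines might help; it reuses processImages and work_queue from above and simply times the same map with a few pool sizes:
import os
import time
from multiprocessing import Pool

# Hypothetical benchmark: time the same workload with different pool sizes
# to see where the speedup levels off for your machine and data.
for worker_count in (1, 2, 4, os.cpu_count()):
    start = time.perf_counter()
    with Pool(worker_count) as trial_pool:
        trial_pool.map(processImages, work_queue)
    elapsed = time.perf_counter() - start
    print(f"{worker_count} workers: {elapsed:.2f} s")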
You cannot return a variable from another process. The recommended way is to create a Queue (multiprocessing.Queue), have your subprocesses put their results into that queue, and read them back once they're done -- this works well if you have a lot of results.
If you just need a single number, using Value or Array could be easier.
Just remember, you cannot use a simple variable for that; it has to be wrapped with the above-mentioned classes from the multiprocessing lib.
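A minimal sketch of the Queue approach, assuming the processImages and dataSets names from the question (the small worker wrapper is mine, added only to push the return value into the queue):
import os
from multiprocessing import Process, Queue

def worker(dataset, queue):
    # Run the original function and push its return value into the queue
    queue.put(processImages(dataset))

if __name__ == "__main__":
    queue = Queue()
    cores = []
    for i in range(os.cpu_count()):
        cores.append(Process(target=worker, args=(dataSets[i], queue)))

    for core in cores:
        core.start()

    # Drain the queue before joining so large results cannot block the children
    results = [queue.get() for _ in cores]

    for core in cores:
        core.join()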
If you want to use the result object returned by multiprocessing, try this:
from multiprocessing.pool import ThreadPool

def fun(fun_argument1, ..., fun_argumentn):
    <blabla>
    return object_1, object_2

pool = ThreadPool(processes=number_of_your_process)
async_num1 = pool.apply_async(fun, (fun_argument1, ..., fun_argumentn))
object_1, object_2 = async_num1.get()
then you can do whatever you want.
I have dask arrays that represent frames of a video, and I want to create multiple video files. I'm using the imageio library, which allows me to "append" frames to an ffmpeg subprocess. So I may have something like this:
my_frames = [[arr1f1, arr1f2, arr1f3], [arr2f1, arr2f2, arr2f3], ...]
So each internal list represents the frames for one video (or product). I'm looking for the best way to send/submit frames to be computed while also writing frames to imageio as they complete (in order). To make it more complicated, the internal lists above are actually generators and can hold hundreds or thousands of frames. Also keep in mind that, because of how imageio works, I think it needs to live in a single process. Here is a simplified version of what I have working so far:
for frame_arrays in frames_to_write:
    # 'frame_arrays' is [arr1f1, arr2f1, arr3f1, ...]
    future_list = _client.compute(frame_arrays)
    # key -> future
    future_dict = dict(zip(frame_keys, future_list))
    # future -> key
    rev_future_dict = {v: k for k, v in future_dict.items()}

    # write the current frame as each future completes
    result_iter = as_completed(future_dict.values(), with_results=True)
    for future, result in result_iter:
        frame_key = rev_future_dict[future]
        # get the writer for this specific video and add a new frame
        w = writers[frame_key]
        w.append_data(result)
This works, and my actual code is reorganized from the above to submit the next frame while writing the current frame, so I think there is some benefit. I'm thinking of a solution where the user says "I want to process X frames at a time", so I send 50 frames, write 50 frames, send 50 more frames, write 50 frames, etc.
My questions after working on this for a while:
When does a result's data live in local memory? When it is returned by the iterator or when it is completed?
Is it possible to do something like this with the dask-core threaded scheduler so a user doesn't have to have distributed installed?
Is it possible to adapt how many frames are sent based on the number of workers?
Is there a way to send a dictionary of dask arrays and/or use as_completed with the "frame_key" being included?
If I load the entire series of frames and submit them to the client/cluster, I would probably kill the scheduler, right?
Is using get_client() followed by Client() on ValueError the preferred way of getting the client (if not provided by the user)?
Is it possible to give dask/distributed one or more iterators that it pulls from as workers become available?
Am I being dumb? Overcomplicating this?
Note: This is kind of an extension to this issue that I made a while ago, but is slightly different.
After following a lot of the examples here I got the following:
try:
    # python 3
    from queue import Queue
except ImportError:
    # python 2
    from Queue import Queue
from threading import Thread

def load_data(frame_gen, q):
    for frame_arrays in frame_gen:
        future_list = client.compute(frame_arrays)
        for frame_key, arr_future in zip(frame_keys, future_list):
            q.put({frame_key: arr_future})
    q.put(None)

input_q = Queue(batch_size if batch_size is not None else 1)
load_thread = Thread(target=load_data, args=(frames_to_write, input_q,))
remote_q = client.gather(input_q)
load_thread.start()

while True:
    future_dict = remote_q.get()
    if future_dict is None:
        break

    # write the current frame
    # this should only be one element in the dictionary, but this is
    # also the easiest way to get access to the data
    for frame_key, result in future_dict.items():
        w = writers[frame_key]
        w.append_data(result)
    input_q.task_done()

load_thread.join()
This answers most of the questions I had and seems to work the way I want in general.