I have an interesting multiprocessing problem whose structure I might be able to exploit. It involves a largish Pandas DataFrame df with ~80 columns and a function func that operates on pairs of those columns (~80*79/2 ≈ 3,160 pairs) and takes a fairly short time on each call.
The code looks like this:

from itertools import combinations
from multiprocessing import Manager, Pool

mgr = Manager()
ns = mgr.Namespace()
ns.df = df
pool = Pool(processes=16)
args = [(ns, list(combo)) for combo in combinations(df.columns, 2)]
results = pool.map(func, args)
pool.close()
The above is faster than running without the pool, but only by a factor of 7 or so. I'm worried that the overhead from so many calls is the issue. Is there a good way to exploit the structure here for multiprocessing?
That is a fairly standard result. Nothing will scale perfectly linearly when run in parallel because of the overhead required to set up each process and pass data between processes. Keep in mind that (80 * 79) / 2 = 3,160 is actually a very small number assuming the function is not extremely computationally intensive (i.e. takes a really long time). All else equal, the faster the function the greater the overhead cost to using multiprocessing because the time to set up an additional process is relatively fixed.
The main overhead of multiprocessing is memory: if the function is poorly designed, each process ends up with its own duplicate of a large dataset, because processes do not share memory. Assuming your function is set up so that it parallelizes easily, adding more processes helps as long as you do not exceed the number of processors on your computer. Most home computers do not have 16 processors (8 at most is typical), and your result (7 times faster in parallel) is consistent with having fewer than 16 processors. You can check the number of processors on your machine with multiprocessing.cpu_count().
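For example, a minimal sketch of sizing the pool to the machine (not the asker's code, just an illustration):

import multiprocessing as mp

n_procs = mp.cpu_count()                    # number of logical processors on this machine
print(n_procs)
pool = mp.Pool(processes=min(16, n_procs))  # no benefit from asking for more workers than cores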
EDIT:
If you parallelize a function by passing it column names as strings, it will repeatedly copy the whole DataFrame. For example:
def StringPass(string1, string2):
    return df[string1] * df[string2]
If you parallelize StringPass it will copy the data frame at least once per process. In contrast:
def ColumnPass(column1, column2):
    return column1 * column2
If you pass just the necessary columns ColumnPass will only copy the columns necessary for each call to the function when run in parallel. So while StringPass(string1, string2) and ColumnPass(df[string1], df[string2]) will return the same result, in multiprocessing the former will make several inefficient copies of the global df, while the latter will only copy the necessary columns for each call to the function.
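For concreteness, here is a rough sketch of the column-passing approach using Pool.starmap; the toy DataFrame, the pool size, and the use of starmap are illustrative assumptions, not the asker's exact setup:

import numpy as np
import pandas as pd
from itertools import combinations
from multiprocessing import Pool

def ColumnPass(column1, column2):
    return column1 * column2

if __name__ == '__main__':
    # stand-in for the real ~80-column DataFrame
    df = pd.DataFrame(np.random.rand(1000, 10),
                      columns=[f"c{i}" for i in range(10)])
    # each argument tuple holds only the two Series that call needs,
    # so workers never receive the whole DataFrame
    args = [(df[c1], df[c2]) for c1, c2 in combinations(df.columns, 2)]
    with Pool(processes=8) as pool:
        results = pool.starmap(ColumnPass, args)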
Related
I am using Dask on an HPC cluster with 4 nodes, and each node has 12 cores.
My code is pure Python dealing with lists and sets, doing most of the computation in tight Python for loops. I read an answer here which suggests using more processes and fewer threads for such computations.
If I have

from dask.distributed import Client

client = Client(n_workers=24, threads_per_worker=2)

will a computation on a Python list using .map() and .compute() split the work into 48 chunks in parallel? Or would the GIL allow only one thread per worker to run Python code at a time, and hence only 24 computations in parallel?
EDIT: How is it that if I use the multiprocessing module and create a pool of threads, it is faster on a single node? Can I use Dask with 4 workers (1 worker per node) and a pool of 12 threads from the multiprocessing module on each node?
My stripped-down code looks like this:

import dask.bag as db

b = db.from_sequence([some_list], npartitions=48).map(my_func, g, k)
m_op = b.compute()

def my_func(g, k):
    # several for loops
    return
The data g is a pretty huge list, and it gets duplicated if I use more processes, so it becomes the bottleneck. I also tried using

gx = dask.delayed(g)

and passing gx to the function. This is also both memory and time consuming.
I understand (from answers on Stack Overflow) that I can use:

[future] = c.scatter([g])

but if all of my workers access the data g at random, I will have to broadcast it, and this will again be memory consuming.
Please note that I am not modifying g in my function.
What is the right approach to tackle this?
One other minor observation/question about dask is:
my_func is searching for something and returns a list of found elements. If a particular worker did not find an element, the return is an empty list. At the end, to concatenate the output, I have an ugly piece of code like the one below:

for sl in m_op:
    for item in sl:
        if item != []:
            nm_op.append(item)
Is there a better way to do this?
Thanks a lot for your time.
What I've tried
I have an embarrassingly parallel for loop in which I iterate over 90x360 values in two nested for loops and do some computation. I tried dask.delayed to parallelize the for loops as per this tutorial although it is demonstrated for a very small set of iterations.
Problem description
I'm surprised to find that the parallel code took 2 h 39 min compared to the non-parallel timing of 1 h 54 min, which suggests that I'm doing something fundamentally wrong, or maybe the task graphs are too big to handle?
Set-up info
This test was done on a subset of my iterations, that is, 10 x 360, but the optimized code should be able to handle 90 x 360 nested iterations. My mini-cluster has 66 cores and 256 GB of RAM, and the 2 data files are 4 GB and < 1 GB respectively. I'm also unsure about multi-processing vs multi-threading for this task. I thought running parallel loops in multiple processes, similar to joblib's default implementation, would be the way to go, since each loop works on independent grid points. But this suggests that multi-threading is faster and should be preferred if one doesn't have a GIL issue (which I don't). So, for the timing above, I used dask.delayed's default scheduling option, which uses multi-threading within a single process.
Simplified code
import numpy as np
import pandas as pd
import xarray as xr
from datetime import datetime
from dask import compute, delayed

def add_data_from_small_file(lat):
    """ For each grid point, get time steps from the big file as per the masks,
    and compute data from the small file for those time steps.
    Returns: an array per latitude, which is to be stacked.
    """
    # temp_var, temp_per_lon and the masks mask1/mask2 are set up in the
    # full code; only the loop structure is shown here
    for lon in range(0, 360):
        # get time steps from big file
        start_time = big_file.time.values[mask1[:, lat, lon]]
        end_time = big_file.time.values[mask2[:, lat, lon]]
        i = 0
        for t1, t2 in zip(start_time, end_time):
            # calculate value from small file for each time pair
            temp_var[i] = small_file.sel(t=slice(t1, t2)).median()
            i = i + 1
        temp_per_lon[:, lon] = temp_var
    return temp_per_lon

if __name__ == '__main__':
    t1 = datetime.now()
    small_file = xr.open_dataarray('small_file.nc')  # size < 1 GB, 10000x91
    big_file = xr.open_dataset('big_file.nc')        # size = 4 GB, 10000x91x360

    # 10 latitude loops for testing, to scale to 90 loops
    delayed_values = [delayed(add_data_from_small_file)(lat) for lat in range(0, 10)]

    # have to delay stacking to avoid a memory error
    stack_arr = delayed(np.stack)(delayed_values, axis=1)
    stack_arr = stack_arr.compute()

    print('Total run time: {}'.format(datetime.now() - t1))
Every delayed task adds about 1ms of overhead. So if your function is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
If you're curious about whether threads or processes are better for you, the easiest way to find out is just to try both:
dask.compute(*values, scheduler="processes")
dask.compute(*values, scheduler="threads")
It could be that even though you're using numpy arrays, most of your time is actually spent in Python for loops. If so, multithreading won't help you here, and the real solution is to stop using Python for loops, either by being clever with numpy/xarray, or by using a project like Numba.
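If the Numba route sounds interesting, here is a minimal sketch of the idea; the function below is a made-up stand-in for the per-grid-point work, not the actual computation above:

import numpy as np
from numba import njit

@njit
def per_gridpoint_sums(data):
    """Toy stand-in for a tight per-grid-point loop: compiled by Numba,
    so the Python-level loop overhead disappears."""
    out = np.empty(data.shape[1])
    for j in range(data.shape[1]):
        total = 0.0
        for i in range(data.shape[0]):
            total += data[i, j]
        out[j] = total
    return out

values = np.random.rand(10000, 360)
result = per_gridpoint_sums(values)   # first call compiles, later calls are fast

The first call pays the compilation cost; subsequent calls run at compiled-loop speed, which is often a bigger win than parallelizing interpreted loops.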
Let's say I have a huge list containing random numbers, for example:

import random
L = [random.randrange(0, 25000000000) for _ in range(1000000000)]
I need to get rid of the duplicates in this list. I wrote this code for lists containing a smaller number of elements:
def remove_duplicates(list_to_deduplicate):
    seen = set()
    result = []
    for i in list_to_deduplicate:
        if i not in seen:
            result.append(i)
            seen.add(i)
    return result
In the code above I create a set so I can memorize which numbers have already appeared in the list I'm working on. If a number is not in the set, I add it to the result list that I need to return, and save it in the set so it won't be added to the result list again.
Now for 1,000,000 numbers in a list all is good and I can get a result fast, but for anything above, let's say, 1,000,000,000 elements, problems arise: I need to use the different cores on my machine to break up the problem and then combine the results from multiple processes.
My first guess was to make a set accessible to all processes, but many complications arise:
How can one process read while another one might be adding to the set? And I don't even know if it is possible to share a set between processes. I know we can use a Queue or a Pipe, but I'm not sure how to use them.
Can someone give me advice on the best way to solve this problem?
I am open to any new ideas.
I'm skeptical that even your largest list is big enough for multiprocessing to improve timings. Using numpy and multithreading is probably your best bet.
Multiprocessing introduces quite some overhead and increases memory consumption, as @Frank Merrow rightly mentioned earlier.
That's not the case (to that extent) for multithreading, though. It's important not to mix these terms up, because processes and threads are not the same.
Threads within the same process share their memory, distinct processes do not.
The problem with going multi-core in Python is the GIL, which doesn't allow multiple threads (in the same process) to execute Python bytecode in parallel. Some C extensions like numpy can release the GIL, which makes it possible to profit from multi-core parallelism with multithreading. Here's your chance to get some speed-up on top of the big improvement you already get just by using numpy.
from multiprocessing.dummy import Pool # .dummy uses threads
import numpy as np
r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)
n_threads = 8
result = np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
Use numpy and a thread-pool, split up the array, make the sub-arrays unique in separate threads, then concatenate the sub-arrays and make the recombined array unique once more.
The final dropping of duplicates for the recombined array is necessary because within the sub-arrays only local duplicates can be identified.
For low entropy data (many duplicates) using pandas.unique instead of numpy.unique can be much faster. Unlike numpy.unique it also preserves order of appearance.
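A tiny illustration of that difference in behaviour:

import numpy as np
import pandas as pd

a = np.array([3, 1, 3, 2, 1])
print(np.unique(a))  # [1 2 3]  -> sorted, order of appearance is lost
print(pd.unique(a))  # [3 1 2]  -> order of first appearance is preserved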
Note that using a thread-pool like above only makes sense if the numpy function is not already multi-threaded under the hood by calling into low-level math libraries. So, always test to see if it actually improves performance and don't take it for granted.
Tested with 100M randomly generated integers in the ranges:
High entropy: 0 - 25_000_000_000 (199560 duplicates)
Low entropy: 0 - 1000
Code
import time
import timeit
from multiprocessing.dummy import Pool  # .dummy uses threads

import numpy as np
import pandas as pd


def time_stmt(stmt, title=None):
    t = timeit.repeat(
        stmt=stmt,
        timer=time.perf_counter_ns, repeat=3, number=1, globals=globals()
    )
    print(f"\t{title or stmt}")
    print(f"\t\t{min(t) / 1e9:.2f} s")


if __name__ == '__main__':

    n_threads = 8  # machine with 8 cores (4 physical cores)

    stmt_np_unique_pool = \
"""
np.unique(np.concatenate(
    Pool(n_threads).map(np.unique, np.array_split(r, n_threads)))
).tolist()
"""

    stmt_pd_unique_pool = \
"""
pd.unique(np.concatenate(
    Pool(n_threads).map(pd.unique, np.array_split(r, n_threads)))
).tolist()
"""
    # -------------------------------------------------------------------------

    print(f"\nhigh entropy (few duplicates) {'-' * 30}\n")

    r = np.random.RandomState(42).randint(0, 25000000000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")

    # ---
    print(f"\nlow entropy (many duplicates) {'-' * 30}\n")

    r = np.random.RandomState(42).randint(0, 1000, 100_000_000)

    r = list(r)
    time_stmt("list(set(r))")

    r = np.asarray(r)
    # numpy.unique
    time_stmt("np.unique(r).tolist()")
    # pandas.unique
    time_stmt("pd.unique(r).tolist()")
    # numpy.unique & Pool
    time_stmt(stmt_np_unique_pool, "numpy.unique() & Pool")
    # pandas.unique() & Pool
    time_stmt(stmt_pd_unique_pool, "pandas.unique() & Pool")
As you can see in the timings below, just using numpy without multithreading already accounts for the biggest performance improvement. Also note that pandas.unique() is faster than numpy.unique() only when there are many duplicates.
high entropy (few duplicates) ------------------------------
list(set(r))
32.76 s
np.unique(r).tolist()
12.32 s
pd.unique(r).tolist()
23.01 s
numpy.unique() & Pool
9.75 s
pandas.unique() & Pool
28.91 s
low entropy (many duplicates) ------------------------------
list(set(r))
5.66 s
np.unique(r).tolist()
4.59 s
pd.unique(r).tolist()
0.75 s
numpy.unique() & Pool
1.17 s
pandas.unique() & Pool
0.19 s
Can't say I like this, but it should work, after a fashion.
Divide the data into N read-only pieces. Distribute one piece per worker to search through. Everything is read-only, so it can all be shared. Each worker i in 1...N checks its list against all the other 'future' lists i+1...N.
Each worker i maintains a bit table for its i+1...N lists, noting whether any of its items hit any of the future items.
When everyone is done, worker i sends its bit table back to the master, where the tables can be ANDed; the zeroes then get deleted. No sorting, no sets. The checking is not fast, though.
If you don't want to bother with multiple bit tables, you can let every worker i write zeroes when it finds a dup above its own region of responsibility. HOWEVER, now you run into real shared-memory issues. For that matter, you could even let each worker just delete dups above its region, but ditto.
Even dividing up the work raises its own problem: it's expensive for each worker to walk through everyone else's list for each of its own entries, roughly (N-1) * len(region) / 2 comparisons per entry on average. Each worker could instead create a set of its region, or sort its region. Either would permit faster checks, but the costs add up.
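For what it's worth, a rough sketch of that bookkeeping (with each worker deduplicating its own region first and building a set of it, as suggested above). Here the regions are simply pickled to each worker, so treat this as an illustration of the bit-table idea rather than of the shared, read-only memory layout:

from multiprocessing import Pool
import numpy as np

def check_region(args):
    """Worker i: drop duplicates inside its own region, then clear the
    'keep' bit of any item in the later regions i+1..N-1 that also
    appears in region i."""
    i, regions = args
    seen = set()
    keep_own = np.ones(len(regions[i]), dtype=bool)
    for idx, item in enumerate(regions[i]):
        if item in seen:                       # local duplicate
            keep_own[idx] = False
        else:
            seen.add(item)
    keep_future = {}                           # one bit table per 'future' region
    for j in range(i + 1, len(regions)):
        bits = np.ones(len(regions[j]), dtype=bool)
        for idx, item in enumerate(regions[j]):
            if item in seen:                   # duplicate of something in region i
                bits[idx] = False
        keep_future[j] = bits
    return i, keep_own, keep_future

def parallel_dedupe(data, n_workers=4):
    regions = np.array_split(np.asarray(data), n_workers)
    keep = [np.ones(len(reg), dtype=bool) for reg in regions]
    with Pool(n_workers) as pool:
        results = pool.map(check_region, [(i, regions) for i in range(n_workers)])
    for i, keep_own, keep_future in results:
        keep[i] &= keep_own                    # AND the bit tables together
        for j, bits in keep_future.items():
            keep[j] &= bits
    return np.concatenate([reg[k] for reg, k in zip(regions, keep)]).tolist()

if __name__ == '__main__':
    print(parallel_dedupe([3, 1, 3, 2, 1, 2, 5]))  # -> [3, 1, 2, 5]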
I have written a little script to distribute the workload between 4 workers and to test whether the results stay ordered (with respect to the order of the input):
from multiprocessing import Pool
import numpy as np
import time
import random

rows = 16
columns = 1000000

vals = np.arange(rows * columns, dtype=np.int32).reshape(rows, columns)

def worker(arr):
    time.sleep(random.random())        # let the process sleep a random
    for idx in np.ndindex(arr.shape):  # amount of time to ensure that
        arr[idx] += 1                  # the processes finish at different
                                       # time steps
    return arr

# create the threadpool
with Pool(4) as p:
    # schedule one map/worker for each row in the original data
    q = p.map(worker, [row for row in vals])

for idx, row in enumerate(q):
    print("[{:0>2}]: {: >8} - {: >8}".format(idx, row[0], row[-1]))
For me this always results in:
[00]: 1 - 1000000
[01]: 1000001 - 2000000
[02]: 2000001 - 3000000
[03]: 3000001 - 4000000
[04]: 4000001 - 5000000
[05]: 5000001 - 6000000
[06]: 6000001 - 7000000
[07]: 7000001 - 8000000
[08]: 8000001 - 9000000
[09]: 9000001 - 10000000
[10]: 10000001 - 11000000
[11]: 11000001 - 12000000
[12]: 12000001 - 13000000
[13]: 13000001 - 14000000
[14]: 14000001 - 15000000
[15]: 15000001 - 16000000
Question: So, does Pool really keep the original input's order when storing the results of each map function in q?
Sidenote: I am asking this, because I need an easy way to parallelize work over several workers. In some cases the ordering is irrelevant. However, there are some cases where the results (like in q) have to be returned in the original order, because I'm using an additional reduce function that relies on ordered data.
Performance: On my machine this operation is about 4 times faster (as expected, since I have 4 cores) than normal execution on a single process. Additionally, all 4 cores are at 100% usage during the runtime.
Pool.map results are ordered. If you need order, great; if you don't, Pool.imap_unordered may be a useful optimization.
Note that while the order in which you receive the results from Pool.map is fixed, the order in which they are computed is arbitrary.
The documentation bills it as a "parallel equivalent of the map() built-in function". Since map is guaranteed to preserve order, multiprocessing.Pool.map makes that guarantee too.
Note that while the results are ordered, the execution isn't necessarily ordered.
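A small, contrived illustration of both points (the sleep times are made up so that later inputs finish first):

from multiprocessing import Pool
import time

def slow_square(x):
    time.sleep(0.5 - 0.1 * x)   # later inputs finish earlier
    return x * x

if __name__ == '__main__':
    with Pool(4) as pool:
        print(pool.map(slow_square, range(4)))                   # [0, 1, 4, 9]  (input order)
        print(list(pool.imap_unordered(slow_square, range(4))))  # completion order, e.g. [9, 4, 1, 0]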
From the documentation:
map(func, iterable[, chunksize])
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
This method chops the iterable into a number of chunks which it submits to the process pool as separate tasks. The (approximate) size of these chunks can be specified by setting chunksize to a positive integer.
In my experience, it often chunks the list into pairs, so that items #1 & #2 go to the first process/thread, #3 & #4 to the second, and so on. In this example, the order would be [#1, #3, #2, #4] -- but this can vary depending on the number and duration of each process/thread (for example, if #1 is a very long process, #2 could be delayed enough to be the very last process to run).
Obviously, if the order of execution matters to you (like it does for us -- more on this below), then this is highly undesirable.
Fortunately, there is a fairly simple solution: just set the chunksize to 1!
pool.map(func, my_list, 1)
The documentation states this parameter specifies an approximate chunksize, but in my experience, setting it to 1 works: it feeds the items to the pool one by one, rather than in chunks.
Edit: Our use case may not be very standard, so let me provide some details:
We have to process large numbers of fast & slow jobs in parallel (the degree of parallelism depends on the number of nodes, and cores per node).
We use a multiprocessing thread pool to kick off those jobs (in a separate process), and wait for them to complete (using ThreadPool.map), before doing other things.
These jobs can take minutes or hours to complete (it's not just some basic calculations).
These workflows happen all the time (usually daily or hourly).
The order of execution matters mostly in terms of compute time efficiency (which equals money, in the cloud). We want the slowest jobs to run first, while the faster jobs complete with the leftover parallelism at the end. It's like filling a suitcase -- if you start with all the small items, you're going to have a bad time.
Here's an example: let's say we have 20 jobs to run on 4 threads/processes -- the first two each take ~2 hours to run, and the other ones take a few minutes. Here are the two alternative scenarios:
With chunking (default behavior):
#1 & #2 will be chunked into the same thread/process (and hence run sequentially), while the other ones will be executed in similarly chunked order. All the other threads/processes will be idle while #2 completes. Total runtime: ~4 hours.
Without chunking (setting chunksize = 1):
#1 & #2 will not be chunked into the same thread/process, and hence run in parallel. The other ones will be executed in order as threads/processes become available. Total runtime: ~2 hours.
When you're paying for compute in the cloud, this makes a huge difference -- especially as the hourly & daily runs add up to monthly & yearly bills.
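A toy reproduction of that scenario, with sleeps standing in for the jobs (the durations and pool size are made up; the point is only the effect of chunking):

from multiprocessing.pool import ThreadPool
import time

def job(duration):
    time.sleep(duration)
    return duration

durations = [4, 4] + [0.2] * 18          # two "slow" jobs, then many fast ones

if __name__ == '__main__':
    for chunksize in (None, 1):          # default chunking vs. one job per task
        with ThreadPool(4) as pool:
            start = time.time()
            pool.map(job, durations, chunksize)
            print("chunksize={}: {:.1f}s".format(chunksize, time.time() - start))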
I am trying to get to grips with multiprocessing in Python. I started by creating this code. It simply computes cos(i) for integers i and measures the time taken when one uses multiprocessing and when one does not. I am not observing any time difference. Here is my code:
import multiprocessing
from multiprocessing import Pool
import numpy as np
import time

def tester(num):
    return np.cos(num)

if __name__ == '__main__':

    starttime1 = time.time()
    pool_size = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    pool_outputs = pool.map(tester, range(5000000))
    pool.close()
    pool.join()
    endtime1 = time.time()
    timetaken = endtime1 - starttime1

    starttime2 = time.time()
    for i in range(5000000):
        tester(i)
    endtime2 = time.time()
    timetaken2 = timetaken = endtime2 - starttime2

    print( 'The time taken with multiple processes:', timetaken)
    print( 'The time taken the usual way:', timetaken2)
I am observing no (or very minimal) difference between the two times measured. I am using a machine with 8 cores, so this is surprising. What have I done incorrectly in my code?
Note that I learned all of this from this:
http://pymotw.com/2/multiprocessing/communication.html
I understand that "joblib" might be more convenient for an example like this, but the ultimate thing that this needs to be applied to does not work with "joblib".
Each job here is just the computation of a single cos value. That is going to be basically unnoticeable compared to the time spent communicating with the worker process.
Try making 5 computations of 1,000,000 cos values each and you should see them running in parallel.
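For example, a quick sketch of that batching idea (the split into 5 chunks is arbitrary):

import numpy as np
from multiprocessing import Pool

def tester_batch(nums):
    # one task = one big batch of cos values, so the communication
    # overhead is amortised over a meaningful amount of work
    return np.cos(nums)

if __name__ == '__main__':
    chunks = np.array_split(np.arange(5000000), 5)
    with Pool() as pool:
        pool_outputs = np.concatenate(pool.map(tester_batch, chunks))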
First, you wrote:
timetaken2 = timetaken = endtime2 - starttime2
So it is normal that the same time is displayed twice. But this is not the important part.
I ran your code on my computer (i7, 4 cores), and I get:
('The time taken with multiple processes:', 14.95710802078247)
('The time taken the usual way:', 6.465447902679443)
The multiprocessed version is slower than the plain for loop. Why?
The multiprocessing module can use multiple processes, each with its own Python interpreter (and its own GIL), but separate processes do not share memory. So when you launch a Pool, the useful variables have to be copied into each process, the calculation runs, and then the results are retrieved. This costs a little time for every task and makes you less efficient.
But this happens because you do a very small computation: multiprocessing is only useful for larger calculations, where the memory copying and result retrieval are cheap (in time) compared to the calculation itself.
I tried with the following tester, which is much more expensive, on 2000 runs:

def expenser_tester(num):
    A = np.random.rand(10*num)        # creation of a random 1D array
    for k in range(0, len(A)-1):      # some useless but costly operation
        A[k+1] = A[k] * A[k+1]
    return A
('The time taken with multiple processes:', 4.030329942703247)
('The time taken the usual way:', 8.180987119674683)
You can see that for an expensive calculation, multiprocessing is more efficient, even if you don't always get what you might expect (I could have had a x4 speedup, but I only got x2).
Keep in mind that Pool has to duplicate every bit of memory used in the calculation, so it may be expensive in memory.
If you really want to improve a small calculation like your example, make it big by grouping and sending a list of variables to the pool instead of one variable per task.
You should also know that numpy and scipy have a lot of expensive functions written in C/Fortran that are already parallelized, so there is not much you can do to speed them up.
If the problem is CPU-bound, then you should see the expected speed-up (provided the operation is long enough and the overhead is not significant). But with multiprocessing (because memory is not shared between processes), it's easier to end up with a memory-bound problem.