Python multiprocessing Pool.map not faster than calling the function once

I have a very large list of strings (originally from a text file) that I need to process using python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
from multiprocessing import Pool

def initial_map(lines):
    results = []
    for line in lines:
        processed = # process line (an O(1) operation)
        results.append(processed)
    return results

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines) / 8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function makes a list of sublists of the original set of lines to give to the pool.map(), then it should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
    processed = # process line (an O(1) operation)
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?

If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))

def process_slice(data):
    return [initial_map(x) for x in data]

pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(I don't have my computer at hand, so I can't give you a fully working solution or verify what I said.)
Edit: or see the 3rd comment on your question by @ubomb.
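For the flattening step mentioned in the comment above, itertools.chain works; a minimal sketch, assuming results is the value returned by pool.map(process_slice, slices):

from itertools import chain

results = pool.map(process_slice, slices)
flat_results = list(chain.from_iterable(results))  # one flat list of processed items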

Related

Best way to simultaneously run this loop?

I have the following code:
data = [2, 5, 3, 16, 2, 5]

def f(x):
    return 2*x

f_total = 0
for x in data:
    f_total += f(x)
print(f_total/len(data))
and I want to speed up the for loop. (In reality the code is more complex, and I want to run it on a supercomputer with many, many processing cores.) I have read that I can do this with the multiprocessing library, which lets python3 run different chunks of the loop at the same time, but I am a bit lost with it.
Could you explain me how to do it with this minimal version of my program?
Thanks!
import multiprocessing
from numpy import random

"""
This sets the number of worker processes that you want to run in parallel.
Depending on the number of cores in your system you should choose the appropriate
number of workers. When you call the 'map' function it will distribute the input
values among them.
"""
NUM_CORES = 6
data = random.rand(100, 1)

"""
+2 so that the cores are not left idle in case a worker is waiting for I/O.
Choose by performing an empirical analysis depending on the function you are trying to compute.
It could match NUM_CORES as well. You can also vary the chunksize depending on the size of 'data' that you have.
"""
NUM_PROCESSES = NUM_CORES + 2
CHUNKSIZE = int(len(data) / NUM_PROCESSES)

def f(x):
    return 2*x

# This creates the pool of worker processes which will be assigned the jobs
pool = multiprocessing.Pool(NUM_PROCESSES)

# map vs imap: if the data is large go for imap, otherwise map is also fine.
it = pool.imap(f, data, chunksize=CHUNKSIZE)

f_total = 0
# Iterate over the results and sum them up
for value in it:
    f_total += sum(value)
print(f_total/len(data))
Why choose imap over map?

Sharing large objects in multiprocessing pools

I'm trying to revisit this slightly older question and see if there's a better answer these days.
I'm using python3 and I'm trying to share a large dataframe with the workers in a pool. My function reads the dataframe, generates a new array using data from the dataframe, and returns that array. Example code below (note: in the example below I do not actually use the dataframe, but in my code I do).
import numpy as np
import pandas as pd
from multiprocessing import Pool

def func(i):
    return i*2

def par_func_dict(mydict):
    values = mydict['values']
    df = mydict['df']
    return pd.Series([func(i) for i in values])

N = 10000
arr = list(range(N))
data_split = np.array_split(arr, 3)
df = pd.DataFrame(np.random.randn(10, 10))

pool = Pool(cores)
gen = ({'values': i, 'df': df} for i in data_split)
data = pd.concat(pool.map(par_func_dict, gen), axis=0)
pool.close()
pool.join()
I'm wondering if there's a way to avoid feeding the generator with copies of the dataframe, so it doesn't take up so much memory.
The answer to the link above suggests using multiprocessing.Process(), but from what I can tell, it's difficult to use that on top of functions that return things (you need to incorporate signals / events), and the comments indicate that each process still ends up using a large amount of memory.
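One common pattern (not from the original thread, just an illustrative sketch) is to hand the dataframe to each worker process once via the Pool initializer, so it is not pickled again for every task; the names init_worker and par_func below are assumptions, not from the question:

import numpy as np
import pandas as pd
from multiprocessing import Pool

_shared_df = None  # module-level slot, filled once per worker process

def init_worker(df):
    global _shared_df
    _shared_df = df

def func(i):
    return i * 2

def par_func(values):
    # _shared_df is available here without being sent along with every chunk
    return pd.Series([func(i) for i in values])

if __name__ == '__main__':
    data_split = np.array_split(list(range(10000)), 3)
    df = pd.DataFrame(np.random.randn(10, 10))
    with Pool(processes=3, initializer=init_worker, initargs=(df,)) as pool:
        data = pd.concat(pool.map(par_func, data_split), axis=0)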

Calculations are not stored in the passed arguments when processes are executed in parallel

I have a function which I am applying to different chunks of my data. Since each chunk is independent of the rest, I wish to execute the function for all chunks in parallel.
I have a result dictionary which should hold the output of calculations for each chunk.
Here is how I did it:
from joblib import Parallel, delayed
import multiprocessing

cpu_count = multiprocessing.cpu_count()

# I have 8 cores, so I divide the data into 8 chunks.
endIndeces = divideIndecesUniformly(myData.shape[0], cpu_count)  # e.g., [0, 125, 250, ..., 875, 1000]

# Initialize the result dictionary with empty lists.
result = dict()
for i in range(cpu_count):
    result[i] = []

# Parallel execution over the 8 chunks
Parallel(n_jobs=cpu_count)(
    delayed(myFunction)(myData, endIndeces[i], endIndeces[i+1] - 1, result, i)
    for i in range(cpu_count)
)
However, when the execution finishes, result still contains only the initial empty lists. I figured that if I execute the function serially over each chunk of data, it works just fine. For example, if I replace the last line with the following, result will have all the calculated values.
# Instead of parallel execution, call the function in a for-loop.
for i in range(cpu_count):
    myFunction(myData, endIndeces[i], endIndeces[i+1] - 1, result, i)
In this case, the result values are updated.
It seems that when the function is executed in parallel, it cannot write to the given dictionary (result). So, I was wondering how I can obtain the output of the function for each chunk of data?
joblib, by default, uses Python's multiprocessing module. According to this SO Answer, when arguments are passed to new Processes, a fork is created, which copies the memory space of the current process. This means that myFunction is essentially working on a copy of result and does not modify the original.
My suggestion is to have myFunction return the desired data as a list. The call to Parallel will then return a list of the lists generated by myFunction. From there, it is simple to add them to result. It could look something like this:
from joblib import Parallel, delayed
import multiprocessing

if __name__ == '__main__':
    cpu_count = multiprocessing.cpu_count()
    endIndeces = divideIndecesUniformly(myData.shape[0], cpu_count)

    # make sure myFunction returns the grouped results in a list
    r = Parallel(n_jobs=cpu_count)(
        delayed(myFunction)(myData, start_idx=endIndeces[i], end_idx=endIndeces[i+1] - 1)
        for i in range(cpu_count)
    )

    result = dict()
    for i, data in enumerate(r):  # cycles through each resultant chunk, numbered and in the original order
        result[i] = data
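For reference, a hypothetical myFunction rewritten along those lines might look like the following sketch (the per-row computation and the compute_row name are placeholders, not from the question):

def myFunction(myData, start_idx, end_idx):
    chunk_result = []
    for row in range(start_idx, end_idx + 1):
        # compute_row stands in for whatever the real per-chunk calculation is
        chunk_result.append(compute_row(myData[row]))
    return chunk_result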

List combinations in defined range

I am writing a parallel rainbow table generator using Parallel Python and multiple machines. So far, I have it working on a single machine. It creates all possible passwords, hashes them, and saves them to a file. It takes max_pw_len and file as arguments. The charset is predefined. Here's the code:
from crypt import crypt
from itertools import combinations_with_replacement

def hashAndSave(listOfComb, fileObject):
    for item in listOfComb:
        hashedVal = crypt(item, 'po')
        fileObject.write("%s:%s\n" % (hashedVal, item))

def gen_rt_save(max_pw_len, file):
    charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    for i in range(3, max_pw_len):
        lista = [''.join(j) for j in combinations_with_replacement(charset, i)]
        hashAndSave(lista, file)
In order to parallelize it I need to split the work between several machines. They need to know where to start and where to stop generating passwords.
I figured I need a function which takes two arguments as parameters - a start and a stop point. The charset is global and has to be used in full for those combinations.
The simplest way would be to pick a subset defined by two specific combinations from the list of all possible combinations for the given charset and length range. However, that takes time and space, and I need to avoid it.
Example:
charset = 'abcdefghi'  # minified charset, normally 62 characters
ranged_comb('abf', 'defg')
result ->  # it is not a combination between two lists! there are specific functions for that, and they would not use the full charset, only what's in the lists
abf
abg
abh
abi
aca
acb
...
defd
defe
deff
defg
I thought about using lists of indices of the charset letters as parameters and using them in for loops. Yet I can't really use nested for loops because their number might vary. How do I create such a function?
Since for brute-forcing passwords/generating rainbow tables you don't need a strict lexicographic order, as long as you go through all permutations (with repeats), this is quite simple:
def get_permutation_by_index(source, size, index):
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return result
Then all you need is the index of a permutation to get it from your iterable (strings of characters work, too). What it does is essentially loop through every element position for the given size, offset by the passed index, and store each element in the result list. You can use return "".join(result) instead if you're interested in getting a string out of it.
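For example, as a quick check of the indexing logic (this snippet is illustrative, not from the original answer):

print("".join(get_permutation_by_index("abc", 2, 5)))
# -> 'cb': 5 % 3 = 2 gives 'c', then 5 // 3 = 1 gives 'b'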
Now your workers can use this function to generate their 'password' range chunks. The simplest form of distribution would be for each worker to receive a single index from a distributor, perform its task and wait for the next index; however, unless your hashing function is excruciatingly slow, spawning your workers, data transfer and so on might end up slower than executing everything linearly in the main process. That's why you'd ideally want your workers to crunch over larger chunks at a time, to justify distributing the whole process. Thus, you'd want your worker to accept a range and do something like:
def worker(source, size, start, end):
    result = []
    for i in range(start, end):
        result.append(get_permutation_by_index(source, size, i))  # add to the result
    return result
Then all you need is a 'distributor' - a central dispatcher that will conduct the workers and split the workload among them. Since our workers don't accept varying sizes (that's an exercise I'll leave for you to figure out; you have all the ingredients), your 'distributor' will need to advance through the sizes and keep track of which chunks it sends to your workers. This means that, for smaller and edge chunks, your workers will receive a smaller workload than defined, but in the grand scheme of things this won't matter much for your use case. So, a simple distributor would look like:
def distributor(source, start, end, chunk_size=1000):
    result = []
    for size in range(start, end + 1):  # for each size in the given range...
        total = len(source) ** size  # max number of permutations for this size
        for chunk in range(0, total, chunk_size):  # for each chunk...
            data = worker(source, size, chunk, min(chunk + chunk_size, total))  # process...
            result.append(data)  # store the result...
    return result
Where start and end represent the permutation sizes (lengths) you want your workers to go through, and chunk_size represents the number of permutations each worker should process in the ideal case - as I mentioned, this won't hold if the total number of permutations for a given size is lower than chunk_size, or if there are fewer unprocessed permutations left for a given size than the chunk_size value, but those are edge cases and I'll leave it to you to figure out how to distribute the work even more evenly. Also, keep in mind that the returned result will be a list of lists returned from our workers - you'll have to flatten it if you want to treat all the results equally.
But wait, isn't this a linear execution using a single process? Well, of course it is! What we did here is effectively decoupled workers from the distributor so now we can add as many arbitrary levels of separation and/or parallelization in-between without affecting our execution. For example, here's how to make our workers run in parallel:
from multiprocessing import Pool
import time

def get_permutation_by_index(source, size, index):
    result = []
    for _ in range(size):
        result.append(source[index % len(source)])
        index = index // len(source)
    return result

# let's have our worker perform a naive ascii-shift Caesar cipher
def worker(source, size, start, end):
    result = []
    for i in range(start, end):
        time.sleep(0.2)  # simulate a long operation by adding 200 milliseconds of pause
        permutation = get_permutation_by_index(source, size, i)
        # naive Caesar cipher - simple ascii shift by +4 places
        result.append("".join([chr(ord(x) + 4) for x in permutation]))
    return result

def distributor(source, start, end, workers=10, chunk_size=10):
    pool = Pool(processes=workers)  # initiate our Pool with a specified number of workers
    jobs = set()  # store our worker result references
    for size in range(start, end + 1):  # for each size in the given range...
        total = len(source) ** size  # max number of permutations for this size
        for chunk in range(0, total, chunk_size):  # for each chunk...
            # add a call to the worker to our Pool
            r = pool.apply_async(worker,
                                 (source, size, chunk, min(chunk + chunk_size, total)))
            jobs.add(r)  # add our ApplyResult to the jobs set for a later checkup
    result = []
    while jobs:  # loop as long as we're waiting for results...
        for job in jobs:
            if job.ready():  # current worker finished processing...
                result.append(job.get())  # store our result...
                jobs.remove(job)
                break
        time.sleep(0.05)  # let other threads get a chance to breathe a little...
    return result  # keep in mind that this is NOT an ordered result

if __name__ == "__main__":  # important protection for cross-platform use
    # call 6 worker processes to sift through all 2 and 3-letter permutations
    # of "abcd", using the default chunk size ('ciphers per worker') of 10
    caesar_permutations = distributor("abcd", 2, 3, 6)
    print([perm for x in caesar_permutations for perm in x])  # print flattened results
    # ['gg', 'hg', 'eh', 'fh', 'gh', 'hh', 'eff', 'fff', 'gff', 'hff', 'egf', 'fgf', 'ggf',
    #  'hgf', 'ehf', 'fhf', 'ghf', 'hhf', 'eeg', 'feg', 'geg', 'heg', 'efg', 'ffg', 'gfg',
    #  'hfg', 'eee', 'fee', 'gee', 'hee', 'efe', 'ffe', 'gfe', 'hfe', 'ege', 'fge', 'ee',
    #  'fe', 'ge', 'he', 'ef', 'ff', 'gf', 'hf', 'eg', 'fg', 'gge', 'hge', 'ehe', 'fhe',
    #  'ghe', 'hhe', 'eef', 'fef', 'gef', 'hef', 'ehh', 'fhh', 'ghh', 'hhh', 'egg', 'fgg',
    #  'ggg', 'hgg', 'ehg', 'fhg', 'ghg', 'hhg', 'eeh', 'feh', 'geh', 'heh', 'efh', 'ffh',
    #  'gfh', 'hfh', 'egh', 'fgh', 'ggh', 'hgh']
Voila! Everything executed in parallel (and over multiple cores, if the underlying OS did its scheduling properly). This should be sufficient for your use case - all you need is to add your communication or I/O code in the worker function, let the actual work be executed by a receiver on the other side, and then, when you get the results, return them to the distributor. You can also write your tables directly in distributor() instead of waiting for everything to finish.
And if you're going to execute this exclusively over a network, you don't really need it in a multiprocess setting; threads would be sufficient to handle the I/O latency, so just replace the multiprocessing import with: from multiprocessing.pool import ThreadPool as Pool (don't let the module name fool you, this is a threading interface, not a multiprocessing one!).

multiprocessing.Pool.map() not working as expected

I understand from simple examples that Pool.map is supposed to behave identically to the 'normal' python code below except in parallel:
def f(x):
    # complicated processing
    return x + 1

y_serial = []
x = range(100)
for i in x: y_serial += [f(i)]
y_parallel = pool.map(f, x)
# y_serial == y_parallel!
However I have two bits of code that I believe should follow this example:
# Linear version
price_datas = []
for csv_file in loop_through_zips(data_directory):
    price_datas += [process_bf_data_csv(csv_file)]

# Parallel version
p = Pool()
price_data_parallel = p.map(process_bf_data_csv, loop_through_zips(data_directory))
However the Parallel code doesn't work whereas the Linear code does. From what I can observe, the parallel version appears to be looping through the generator (it's printing out log lines from the generator function) but then not actually performing the "process_bf_data_csv" function. What am I doing wrong here?
Pool.map tries to pull all values from your generator to build an iterable before actually starting the work.
Try waiting longer (until the generator runs out), or use multithreading and a queue instead.
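If the goal is to avoid materializing the whole generator before any work starts, Pool.imap is the lazy counterpart of Pool.map; a rough sketch, under the assumption that process_bf_data_csv and loop_through_zips behave as in the question:

p = Pool()
price_datas = []
for price_data in p.imap(process_bf_data_csv, loop_through_zips(data_directory)):
    price_datas.append(price_data)  # results arrive in order, as they finish
p.close()
p.join()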
