List combinations in defined range - python

I am writing parallel rainbow tables generator using parallel python and multiple machines. So far, I have it working on a single machine. It creates all possible passwords, hashes them, saves to file. It takes max_pass_len, file as arguments. Charset is predefined. Here's the code:
def hashAndSave(listOfComb, fileObject):
for item in listOfComb:
hashedVal = crypt(item, 'po')
fileObject.write("%s:%s\n" % (hashedVal, item))
def gen_rt_save(max_pw_len, file):
charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
for i in range(3, max_pw_len):
lista = [''.join(j) for j in combinations_with_replacement(charset, i)]
hashAndSave(lista, file)
In order to parallelize it I need to split work between several machines. They need to know where start and stop generating passwords.
I figured I need a function which takes two arguments as parameters - start and stop point. Charset is global and has to be fully used in that combinations.
Simplest way would be to pick subset defined by two specific combinations from the list of all possible combinations for given charset and length range. However it takes time and space, and I need to avoid that.
Example:
charset='abcdefghi' #minified charset, normally 62 characters
ranged_comb(abf,defg)
result -> # it is not combination between to lists! there are specific functions for that, and they would not use full charset, only whats in lists
abf
abg
abh
abi
aca
acb
...
defd
defe
deff
defg
I thought about using lists of indexes of charset letters as parameters to use it in for loops. Yet I can't really use fors because their number might vary. How to create such function?

Since for brute-forcing passwords/generating rainbow tables you don't need a strict lexicographic order, as long as you go through all permutations (with repeats), this is quite simple:
def get_permutation_by_index(source, size, index):
result = []
for _ in range(size):
result.append(source[index % len(source)])
index = index // len(source)
return result
Then all you need is an index of your permutation to get it from your iterable (strings of characters work, too). What it does is essentially looping through every possible element position for a given size, offset by the passed index, and stores it in the result list. You can use return "".join(result) instead if you're interested in getting a string out of it.
Now your workers can use this function to generate their 'password' range chunks. The simplest way of distribution would be if your workers would receive a single index from a distributor, perform their task and wait for the next index, however, unless your hashing function is excruciatingly slow spawning your workers, data transfer and such might end up slower than executing everything linearly in the main process. That's why you'd ideally want your workers to crunch over larger chunks at a time to justify distributing the whole process. Thus, you'd want your worker to accept a range and do something like:
def worker(source, size, start, end):
result = []
for i in range(start, end):
result.append(get_permutation_by_index(source, size, i)) # add to the result
return result
Then all you need is a 'distributor' - a central dispatcher that will conduct the workers and split the workload among them. Since our workers don't accept varying sizes (that's an exercise I'll leave for you to figure out, you have all the ingredients) your 'distributor' will need to advance through sizes and keep a track what chunks it sends to your workers. This means that, for smaller and edge chunks, your workers will be receiving lesser workload than defined, but in the grand scheme of things this won't matter much for your use case. So, a simple distributor would look like:
def distributor(source, start, end, chunk_size=1000):
result = []
for size in range(start, end + 1): # for each size in the given range...
total = len(source) ** size # max number of permutations for this size
for chunk in range(0, total, chunk_size): # for each chunk...
data = worker(source, size, chunk, min(chunk + chunk_size, total)) # process...
result.append(data) # store the result...
return result
Where start and end represent the number of source elements you want to permute through your workers, and chunk_size represents the number of permutations each worker should process in an ideal case - as I mentioned, this won't be the case if a total number of permutations for a given size is lower than the chunk_size or if there is less unprocessed permutations left for a given size than the chunk_size value, but those are edge cases and I'll leave it for you to figure out how to distribute even more evenly. Also, keep in mind that the returned result will be a list of lists returned from our workers - you'll have to flatten it if you want to treat all the results equally.
But wait, isn't this a linear execution using a single process? Well, of course it is! What we did here is effectively decoupled workers from the distributor so now we can add as many arbitrary levels of separation and/or parallelization in-between without affecting our execution. For example, here's how to make our workers run in parallel:
from multiprocessing import Pool
import time
def get_permutation_by_index(source, size, index):
result = []
for _ in range(size):
result.append(source[index % len(source)])
index = index // len(source)
return result
# let's have our worker perform a naive ascii-shift Caesar cipher
def worker(source, size, start, end):
result = []
for i in range(start, end):
time.sleep(0.2) # simulate a long operation by adding 200 milliseconds of pause
permutation = get_permutation_by_index(source, size, i)
# naive Caesar cipher - simple ascii shift by +4 places
result.append("".join([chr(ord(x) + 4) for x in permutation]))
return result
def distributor(source, start, end, workers=10, chunk_size=10):
pool = Pool(processes=workers) # initiate our Pool with a specified number of workers
jobs = set() # store our worker result references
for size in range(start, end + 1): # for each size in the given range...
total = len(source) ** size # max number of permutations for this size
for chunk in range(0, total, chunk_size): # for each chunk...
# add a call to the worker to our Pool
r = pool.apply_async(worker,
(source, size, chunk, min(chunk + chunk_size, total)))
jobs.add(r) # add our ApplyResult in the jobs set for a later checkup
result = []
while jobs: # loop as long as we're waiting for results...
for job in jobs:
if job.ready(): # current worker finished processing...
result.append(job.get()) # store our result...
jobs.remove(job)
break
time.sleep(0.05) # let other threads get a chance to breathe a little...
return result # keep in mind that this is NOT an ordered result
if __name__ == "__main__": # important protection for cross-platform use
# call 6 threaded workers to sift through all 2 and 3-letter permutations
# of "abcd", using the default chunk size ('ciphers per worker') of 10
caesar_permutations = distributor("abcd", 2, 3, 6)
print([perm for x in caesar_permutations for perm in x]) # print flattened results
# ['gg', 'hg', 'eh', 'fh', 'gh', 'hh', 'eff', 'fff', 'gff', 'hff', 'egf', 'fgf', 'ggf',
# 'hgf', 'ehf', 'fhf', 'ghf', 'hhf', 'eeg', 'feg', 'geg', 'heg', 'efg', 'ffg', 'gfg',
# 'hfg', 'eee', 'fee', 'gee', 'hee', 'efe', 'ffe', 'gfe', 'hfe', 'ege', 'fge', 'ee',
# 'fe', 'ge', 'he', 'ef', 'ff', 'gf', 'hf', 'eg', 'fg', 'gge', 'hge', 'ehe', 'fhe',
# 'ghe', 'hhe', 'eef', 'fef', 'gef', 'hef', 'ehh', 'fhh', 'ghh', 'hhh', 'egg', 'fgg',
# 'ggg', 'hgg', 'ehg', 'fhg', 'ghg', 'hhg', 'eeh', 'feh', 'geh', 'heh', 'efh', 'ffh',
# 'gfh', 'hfh', 'egh', 'fgh', 'ggh', 'hgh']
Voila! Everything executed in parallel (and over multiple cores if the underlying OS did its scheduling properly). This should be sufficient for your use case - all you need is to add your communication or I/O code in the worker function and let the actual code be executed by a receiver on the other side, then when you get the results return them back to the distributor. You can also directly write your tables in the distributor() instead of waiting for everything to finish.
And if you're going to execute this exclusively over a network, you don't really need it in a multiprocess setting, threads would be sufficient to handle the I/O latency, so just replace the multiprocess import with: from multiprocessing.pool import ThreadPool as Pool (don't let the module name fool you, this is a threading interface, not a multiprocessing one!).

Related

Parallelizing a function to write to global variables

I'm trying to figure out the best way to parallelize a program like this:
global_data = some data
global_data2 = some data
data_store1 = np.empty(n)
data_store2 = np.empty(n)
.
.
.
def simulation(global_data):
retrieve values from global datasets and set element of global datastores
such that I do something like pass list(enumerate(global_data)) to a multiprocessing function, and each process sets elements of the global data stores corresponding to the received (index, vlaue) pair. I'm running on a high performance cluster with 128 cores, so I think parallelization is preferable to threading.
If you use a multiprocessing pool (e.g. a multiprocessing.Pool instance) with its map method, then your worker function, simulation, just needs to return its result back to the main process which will end up with a list of results that will be in the correct order. This will be less expensive than using a managed list to which the worker function adds its result:
import multiprocessing
def simulation(global_data_elem):
# We are passed a single element of global_data
# Do calculation with global_data_elem and return result.
# The CPU resources required to do this calculation must be sufficiently
# high to justify the additional overhead of multiprocessing (which is
# not the case for this demo):
return global_data_elem * 2
def main():
# global_data is some data (not necesarilly at global scope)
global_data = ['a', 'b', 'c']
# create pool of the correct size
# (not larger than the number of cores we have nor the number of tasks being submitted):
pool = multiprocessing.Pool(min(len(global_data), multiprocessing.cpu_count()))
# results are returned in the correct order (task submission order):
results = pool.map(simulation, global_data)
print(results)
# Required for Windows:
if __name__ == '__main__':
main()
Prints:
['aa', 'bb', 'cc']

Improve performance of combinations

Hey guys I have a script that compares each possible user and checks how similar their text is:
dictionary = {
t.id: (
t.text,
t.set,
t.compare_string
)
for t in dataframe.itertuples()
}
highly_similar = []
for a, b in itertools.combinations(dictionary.items(), 2):
if a[1][2] == b[1][2] and not a[1][1].isdisjoint(b[1][1]):
similarity_score = fuzz.ratio(a[1][0], b[1][0])
if (similarity_score >= 95 and len(a[1][0]) >= 10) or similarity_score == 100:
highly_similar.append([a[0], b[0], a[1][0], b[1][0], similarity_score])
This script takes around 15 minutes to run, the dataframe contains 120k users, so comparing each possible combination takes quite a bit of time, if I just write pass on the for loop it takes 2 minutes to loop through all values.
I tried using filter() and map() for the if statements and fuzzy score but the performance was worse. I tried improving the script as much as I could but I don't know how I can improve this further.
Would really appreciate some help!
It is slightly complicated to reason about the data since you have not attached it, but we can see multiple places that might provide an improvement:
First, let's rewrite the code in a way which is easier to reason about than using the indices:
dictionary = {
t.id: (
t.text,
t.set,
t.compare_string
)
for t in dataframe.itertuples()
}
highly_similar = []
for a, b in itertools.combinations(dictionary.items(), 2):
a_id, (a_text, a_set, a_compre_string) = a
b_id, (b_text, b_set, b_compre_string) = b
if (a_compre_string == b_compre_string
and not a_set.isdisjoint(b_set)):
similarity_score = fuzz.ratio(a_text, b_text)
if (similarity_score >= 95 and len(a_text) >= 10)
or similarity_score == 100):
highly_similar.append(
[a_id, b_id, a_text, b_text, similarity_score])
You seem to only care about pairs having the same compare_string values. Therefore, and assuming this is not something that all pairs share, we can key by whatever that value is to cover much less pairs.
To put some numbers into it, let's say you have 120K inputs, and 1K values for each value of val[1][2] - then instead of covering 120K * 120K = 14 * 10^9 combinations, you would have 120 bins of size 1K (where in each bin we'd need to check all pairs) = 120 * 1K * 1K = 120 * 10^6 which is about 1000 times faster. And it would be even faster if each bin has less than 1K elements.
import collections
# Create a dictionary from compare_string to all items
# with the same compare_string
items_by_compare_string = collections.defaultdict(list)
for item in dictionary.items():
compare_string = item[1][2]
items_by_compare_string[compare_string].append(items)
# Iterate over each group of items that have the same
# compare string
for item_group in items_by_compare_string.values():
# Check pairs only within that group
for a, b in itertools.combinations(item_group, 2):
# No need to compare the compare_strings!
if not a_set.isdisjoint(b_set):
similarity_score = fuzz.ratio(a_text, b_text)
if (similarity_score >= 95 and len(a_text) >= 10)
or similarity_score == 100):
highly_similar.append(
[a_id, b_id, a_text, b_text, similarity_score])
But, what if we want more speed? Let's look at the remaining operations:
We have a check to find if two sets share at least one item
This seems like an obvious candidate for optimization if we have any knowledge about these sets (to allow us to determine which pairs are even relevant to compare)
Without additional knowledge, and just looking at every two pairs and trying to speed this up, I doubt we can do much - this is probably highly optimized using internal details of Python sets, I don't think it's likely to optimize it further
We a fuzz.ratio computation which is some external function, and I'm going to assume is heavy
If you are using this from the FuzzyWuzzy package, make sure to install python-Levenshtein to get the speedups detailed here
We have some comparisons which we are unlikely to be able to speed up
We might be able to cache the length of a_text by nesting the two loops, but that's negligible
We have appends to a list, which runs on average ("amortized") constant time per operation, so we can't really speed that up
Therefore, I don't think we can reasonably suggest any more speedups without additional knowledge. If we know something about the sets that can help optimize which pairs are relevant we might be able to speed things up further, but I think this is about it.
EDIT: As pointed out in other answers, you can obviously run the code in multi-threading. I assumed you were looking for an algorithmic change that would possibly reduce the number of operations significantly, instead of just splitting these over more CPUs.
Essentially, from python programming side, i see two things that can improve your processing time:
Multi-threads and Vectorized operations
From the fuzzy score side, here is a list of tips you can use to improve your processing time (new anonymous tab to avoid paywall):
https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
Using multi thread you can speed you operation up to N times, being N the number of threads in you CPU. You can check it with:
import multiprocessing
multiprocessing.cpu_count()
Using vectorized operations you can parallel process your operations in low level with SIMD (single instruction / multiple data) operations, or with gpu tensor operations (like those in tensorflow/pytorch).
Here is a small comparison of results for each case:
import numpy as np
import time
A = [np.random.rand(512) for i in range(2000)]
B = [np.random.rand(512) for i in range(2000)]
high_similarity = []
def measure(i,j,a,b,high_similarity):
d = ((a-b)**2).sum()
if d>12:
high_similarity.append((i,j,d))
start_single_thread = time.time()
for i in range(len(A)):
for j in range(len(B)):
if i<j:
measure(i,j,A[i],B[j],high_similarity)
finis_single_thread = time.time()
print("single thread time:",finis_single_thread-start_single_thread)
out[0] single thread time: 147.64517450332642
running on multi thread:
from threading import Thread
high_similarity = []
def measure(a = None,b= None,high_similarity = None):
d = ((a-b)**2).sum()
if d > 12:
high_similarity.append(d)
start_multi_thread = time.time()
for i in range(len(A)):
for j in range(len(B)):
if i<j:
thread = Thread(target=measure,kwargs= {'a':A[i],'b':B[j],'high_similarity':high_similarity} )
thread.start()
thread.join()
finish_multi_thread = time.time()
print("time to run on multi threads:",finish_multi_thread - start_multi_thread)
out[1] time to run on multi-threads: 11.946279764175415
A_array = np.array(A)
B_array = np.array(B)
start_vectorized = time.time()
for i in range(len(A_array)):
#vectorized distance operation
dists = (A_array-B_array)**2
high_similarity+= dists[dists>12].tolist()
aux = B_array[-1]
np.delete(B_array,-1)
np.insert(B_array, 0, aux)
finish_vectorized = time.time()
print("time to run vectorized operations:",finish_vectorized-start_vectorized)
out[2] time to run vectorized operations: 2.302949905395508
Note that you can't guarantee any order of execution, so will you also need to store the index of results. The snippet of code is just to illustrate that you can use parallel process, but i highly recommend to use a pool of threads and divide your dataset in N subsets for each worker and join the final result (instead of create a thread for each function call like i did).

Best way to simultaneously run this loop?

I have the following code:
data = [2,5,3,16,2,5]
def f(x):
return 2*x
f_total = 0
for x in data:
f_total += f(x)
print(f_total/len(data))
which I want to speed up the for loop. (In reality the code is more complex and I want to run it in a super computer with many many processing cores). I have read that I can do this with the multiprocessing library where I can get python3 to simultaneously run different chunks of the loop at the same time but I am a bit lost with it.
Could you explain me how to do it with this minimal version of my program?
Thanks!
import multiprocessing
from numpy import random
"""
This mentions the number of worker threads that you want to run in parallel.
Depending on the number of cores in your system you should choose the appropriate
number of threads. When you call 'map' function it will distribute the input
values in that many parts
"""
NUM_CORES = 6
data = random.rand(100, 1)
"""
+2 so that the cores are not left idle in case a thread is waiting for I/O.
Choose by performing an empirical analysis depending on the function you are trying to compute.
It could match up to NUM_CORES as well. You can vary the chunksize as well depending on the size of 'data' that you have.
"""
NUM_THREADS = NUM_CORES+2
CHUNKSIZE = int(len(data)/(NUM_THREADS))
def f(x):
return 2*x
# This takes care of creating pool of worker threads which will be assigned the jobs
pool = multiprocessing.Pool(NUM_THREADS)
# map vs imap. If the data is large go for imap else map is also good.
it = pool.imap(f, data, chunksize=CHUNKSIZE)
f_total = 0
# Iterate and sum up the result
for value in it:
f_total += sum(value)
print(f_total/len(data))
Why choose imap over map?

"Processes" or "Threads" strategy to fill independant blocks into global array?

In python2, I would like to fill a global array by filling with parallel processes (or threads) different sub-arrays (there is a total 16 blocks). I must precise that each block doesn't depend of the others, I mean when I perfom the assignement of each cells of the current block.
1) From what I have found, I would have a great benefit from a CPU multi-cores by using different "processes" but it seems a little bit complicated to share the global array by all others processes.
2) From another point of view, I can use "threads" instead of "processes" since the implementation is less hard. I have found out the libray "ThreadPool" from "multiprocessing.dummy" allows to share this global array by all others concurrent threads.
For example, with python2.7, the following code works :
from multiprocessing.dummy import Pool as ThreadPool
## discretization along x-axis and y-axis for each block
arrayCross_k = np.linspace(kMIN, kMAX, dimPoints)
arrayCross_mu = np.linspace(-1, 1, dimPoints)
# Build all big matrix with N total blocks = dimBlock*dimBlock = 16 here
arrayFullCross = np.zeros((dimBlocks, dimBlocks, arrayCross_k.size, arrayCross_mu.size))
dimBlocks = 4
# Size of dimension along k and mu axis
dimPoints = 100
# dimension along one dimension of global arrayFullCross
dimMatCovCross = dimBlocks*dimPoints
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
# rows indices
xb = params_array[0]
# columns indices
yb = params_array[1]
# Current redshift
z = zrange[params_array[2]]
# Loop inside block
for ub in range(dimPoints):
for vb in range(dimPoints):
# Diagonal blocs
if (xb == yb):
# Fill the (xb,yb) su-block of global array by
arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb] , z, 10**P_m(np.log10(arrayCross_k[ub])),
...
...
# End of function buildCrossMatrix_loop
# Main loop
while i < len(zrange):
def generatorCrossMatrix(index):
for igen in range(dimBlocks):
for lgen in range(dimBlocks):
yield igen, lgen, index
if __name__ == '__main__':
# Use 20 threads
pool = ThreadPool(20)
pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
# Increment index "i"
i = i+1
But unfortunately, even by using 20 threads, I realize that the cores of my CPU are not fully running (actually, with 'top' or 'htop' command, I only see a single process at 100%).
3) What is the strategy that I have to chose if I want to full exploit the 16 cores of my CPU (like this is the case with pool.map(function, generator)) but with also the sharing of global array ?
4) some people told me to do I/O for each sub-array (basically, write each block in a file and gather all sub-arrays by reading them and get the full array filled). This solution is handy but I would like to avoid I/O (unless there is really not other solutions).
5) I have practised MPI library with C language and the operation of filling sub-array and finally gather them to build a big array, is not very complicated. However, I wouldn't like to use MPI with Python language (I don't know even if it exists).
6) I tried also to use Process with target equal to my filling function (buildCrossMatrix_loop) like this into while Main loop above :
from multiprocessing import Process
# Main loop on z range
while i < len(zrange):
params_p = []
for ip in range(4):
for jp in range(4):
params_p.append(ip)
params_p.append(jp)
params_p.append(i)
p = Process(target=buildCrossMatrix_loop, args=(params_p,))
params_p = []
p.start()
# Finished : wait everybody
p.join()
...
...
i = i+1
# End of main while loop
But the final 2D global array is filled only of zeros. So I must deduce that Process function doesn't share the array to fill ?
7) So which strategy I have to look for ? :
1. The using of "pool processes" and find a way to share the global array knowing all my 16-cores will be running
2. The using of "Threads" and share the global array but performances, at first sight, seems to be less good than with "pool processes". Maybe there is a way to increase the power of each "Threads", I mean like with "pool processes" ?
I tried to follow the different examples on https://docs.python.org/2/library/multiprocessing.html but without success, this is to say, without relevant performances from a speed-up point of view.
I think that in my case, the major issue is the gathering of all sub-arrays OR the fact that the global array arrayFullCross is not shared by other processes or threads.
If someone had a simple example of the sharing of global variable in a multi-threading context (here an array), this would nice to put it here.
UPDATE 1: I made test with the Threading (and not multiprocessing) but performances remain rather bad. GIL is not apparently unlocked, i.e only one process appears in htop command (maybe the version of Threading library is not the right one).
So I am going to try to handle my issue with using the "return" method.
Naively, I tried to return the whole array at the end of the function on which I apply the map function, like this :
# Build cross-correlation matrix
def buildCrossMatrix_loop(params_array):
# rows indices
xb = params_array[0]
# columns indices
yb = params_array[1]
# Current redshift
z = zrange[params_array[2]]
# Loop inside block
for ub in range(dimPoints):
for vb in range(dimPoints):
# Diagonal blocs
if (xb == yb):
arrayFullCross[xb][xb][ub][vb] = 2*P_obs_cross(arrayCross_k[ub], arrayCross_mu[vb])
...
... #others assignments on arrayFullCross elements
# Return global array to main process
return arrayFullCross
Then, I tried to receive this global array from map like this :
if __name__ == '__main__':
pool = Pool(16)
outputArray = pool.map(buildCrossMatrix_loop, generatorCrossMatrix(i))
pool.terminate()
## Print outputArray
print 'outputArray = ', outputArray
## Reshape 4D outputArray to 2D array
arrayFullCross2D_swap = np.array(outputArray).swapaxes(1,2).reshape(dimMatCovCross,dimMatCovCross)
Unfortunately, when I print the outputArray, I get :
outputArray = [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
This is not the 4D outputArray expected, just a list of 16 None (I think that number of 16 correspond to the number of processes provided by generatorCrossMatrix(i)).
How could I get back the whole 4D array once map is launched and when it has finished ?
First of all I believe multiprocessing.ThreadPool is a private API so you should avoid it. Now multiprocessing.dummy is a useless module. It does not do any multithreading/processing that's why you don't see any benefit. You should use the "plain" multiprocessing module.
The second code does not work because it is using multiple processes. Processes do not share memory, so the changes you do in a subprocess are not reflected in the other subprocesses or the main process. You either want to:
Return the value and combine them together in the main process, for example using multiprocessing.Pool.map
Use threading instead of multiprocessing: just replaceimport multiprocessingwithimport threadingandmultiprocessing.Processwiththreading.Thread` and the code should work.
Note that the threading version will work only because numpy releases the GIL during computations, otherwise it would be stuck at 1 CPU.
You may want to look at this similar question which I answered a couple of minutes ago.

Python multiprocessing Pool.map not faster than calling the function once

I have a very large list of strings (originally from a text file) that I need to process using python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
def initial_map(lines):
results = []
for line in lines:
processed = # process line (O^(1) operation)
results.append(processed)
return results
def chunks(l, n):
for i in xrange(0, len(l), n):
yield l[i:i+n]
if __name__ == "__main__":
lines = list(open("../../log.txt", 'r'))
pool = Pool(processes=8)
partitions = chunks(lines, len(lines)/8)
results = pool.map(initial_map, partitions, 1)
So the chunks function makes a list of sublists of the original set of lines to give to the pool.map(), then it should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simple run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(results)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
processed = # process line (O^(1) operation)
return processed
if __name__ == "__main__":
lines = list(open("../../log.txt", 'r'))
pool = Pool(processes=8)
pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100)
def process_slice(data):
return [initial_data(x) for x in data]
pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(don't have my comp. so can't give you a full working solution nor verify what I said)
Edit: or see the 3rd comment on your question by #ubomb.

Categories

Resources