I have an iterator function that yields an infinite stream of integers:
def all_ints(start=0):
    n = start
    while True:
        yield n
        n += 1
I want to have a pool of threads or processes do calculations on these up to $POOLSIZE at a time. Each process will possibly save a result to some shared data structure so I do not need the return value from the process/thread function. It seems to me this use of the python3 Pool would achieve this:
from multiprocessing import Pool

# dummy example functions
def check_prime(n):
    return n % 2 == 0

def store_prime(p):
    '''synchronize, write to some shared structure'''
    pass

p = Pool()
for n in all_ints():
    p.apply_async(check_prime, (n,), callback=store_prime)
But when I run this I get a Python process that just keeps using more and more memory (and not because of the iterator, which can run for days on its own). I would expect this behavior if I were storing the results of all the apply_async calls, but I am not.
What am I doing wrong here? Or is there another API from the thread pool I should be using?
The memory growth comes from the submission loop itself: apply_async enqueues work (and keeps an AsyncResult for every call) far faster than the workers can drain it, so the pool's internal task queue and result cache grow without bound. I think you are looking for Pool.imap_unordered instead, which uses the pooled processes to apply a function to the elements yielded by an iterator. Its chunksize parameter lets you specify how many items from the iterator are passed to the pool in each step.
Also I would avoid using any shared memory structures for IPC. Just let the "expensive" function sent to the pool return the information you need, and process it in the main process.
Here's an example (where I abort after 200,000 primes; if you remove that part, you'll see the processes happily work in a fixed amount of RAM "forever"):
from multiprocessing import Pool
from math import sqrt
import itertools
import time

def check_prime(n):
    if n == 2: return (n, True)
    if n % 2 == 0 or n < 2: return (n, False)
    for i in range(3, int(sqrt(n)) + 1, 2):
        if n % i == 0: return (n, False)
    return (n, True)

def main():
    L = 200000  # limit for performance timing
    p = Pool()
    n_primes = 0
    before = time.time()
    for (n, is_prime) in p.imap_unordered(check_prime, itertools.count(1), 1000):
        if is_prime:
            n_primes += 1
        if n_primes >= L:
            break
    print("Computed %d primes in %.1fms" % (n_primes, (time.time() - before) * 1000.0))

if __name__ == "__main__":
    main()
Output on my Intel Core i5 (2 Core, 4 Threads):
Computed 200000 primes in 15167.9ms
Output if I change it to Pool(1), so using just 1 subprocess:
Computed 200000 primes in 37909.2ms
HTH!
I'm using python's multiprocessing module for some analyses. Specifically, I use a pool and the apply_async method to compute some results from a large number of variables. The time required to compute the results varies drastically with the input, which got me wondering whether the pool will either A) immediately assign a process to each input when I call apply_async, or B) assign processes as they become ready. My worry is that A) would imply that one of the pool's processes might be assigned several of the 'heavy' inputs and so take longer than necessary to complete.
I tried running the below code and got similar timings in each run (< 2 seconds), indicating that B) is the case, at least here. Is that always what should be expected to happen?
import multiprocessing as mp
import numpy as np
import time

def foo(x):
    if x % 4 == 0:
        time.sleep(1)    # Simulating a large workload
    else:
        time.sleep(.1)   # Simulating a small workload
    return 2 * x

if __name__ == '__main__':
    for _ in range(20):
        with mp.Pool(processes=4) as pool:
            jobs = []
            now = time.time()
            vals = list(range(1, 9))
            np.random.shuffle(vals)
            for x in vals:
                jobs.append(pool.apply_async(foo, args=(x,)))
            results = [j.get() for j in jobs]
            elapsed = time.time() - now
            print(elapsed)
I'm trying to use multiprocessing in Python for the first time. I wrote a basic prime-searching program and I want to run it simultaneously on each core. The problem is: when the program does the multiprocessing, it not only runs the 'primesearch' function but also re-executes the beginning of the code. The expected output would be a list of prime numbers between 0 and a limit, but it prints "Enter a limit: " 16 times (I have 16 cores and 16 processes).
Here is my code:
import time
import os
from multiprocessing import Process

# Defining lists
primes = []
processes = []
l = [0]

limit = int(input('Enter a limit: '))

def primesearch(lower, upper):
    global primes
    for num in range(lower, upper):
        if num > 1:
            for i in range(2, num):
                if (num % i) == 0:
                    break
            else:
                primes.append(num)

# Start the clock
starter = time.perf_counter()

# Dividing data
step = limit // os.cpu_count()
for x in range(os.cpu_count()):
    l.append(step * (x + 1))
l[-1] = limit

# Multiprocessing
for init in range(os.cpu_count()):
    processes.append(Process(target=primesearch, args=[l[init], l[init + 1]]))

for process in processes:
    process.start()

for process in processes:
    process.join()

# End clock
finish = time.perf_counter()

print(primes)
print(f'Finished in {round(finish-starter, 2)} second')
What could be the problem?
You are using Windows. If you read the Python documentation for multiprocessing, it will tell you that you should protect your main code with if __name__ == "__main__": . This is because on Windows each child process re-executes the complete main .py file.
This is used in pretty much every example in the documentation, and is explained in the 'Programming guidelines' section at the end.
See https://docs.python.org/3/library/multiprocessing.html
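For example, the structure would look roughly like this (a sketch only; the worker body and the timing code from the question are elided):

from multiprocessing import Process

def primesearch(lower, upper):
    # worker code goes here; on Windows, module-level code runs again
    # in every child, so keep only definitions at the top level
    pass

if __name__ == "__main__":
    # runs only in the parent process, so the prompt appears once
    limit = int(input('Enter a limit: '))
    p = Process(target=primesearch, args=(0, limit))
    p.start()
    p.join()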
Apart from the __main__ issue, your way of using primes as a global list doesn't work either: each child process gets its own copy of the list, so whatever the children append never makes it back to the parent.
I imported Queue from multiprocessing, used primes = Queue() and
size = primes.qsize()
print([primes.get() for _ in range(size)])
primes.close()
in the main code, and primes.put(num) in your function. I don't know if it's the best way; it works for me, but if the limit is greater than about 12000 the console freezes. Also, in this case, using multiprocessing is actually slightly slower than a single process.
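Here is a minimal sketch of that Queue-based variant (the fixed limit, the two-process split, and the sentinel-based draining are illustrative choices, not part of the original code; draining before join() also avoids the pipe-buffer deadlock that most likely causes the freeze mentioned above):

from multiprocessing import Process, Queue

def primesearch(lower, upper, primes):
    for num in range(max(lower, 2), upper):
        for i in range(2, num):
            if num % i == 0:
                break
        else:
            primes.put(num)      # results travel back through the queue, not a global
    primes.put(None)             # sentinel: this worker is finished

if __name__ == "__main__":
    primes = Queue()
    limit = 1000                 # illustrative; the question reads this from input()
    procs = [Process(target=primesearch, args=(0, limit // 2, primes)),
             Process(target=primesearch, args=(limit // 2, limit, primes))]
    for p in procs:
        p.start()
    results, done = [], 0
    while done < len(procs):     # drain before join() so children never block on a full pipe
        item = primes.get()
        if item is None:
            done += 1
        else:
            results.append(item)
    for p in procs:
        p.join()
    print(sorted(results))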
If you aim for speed, you can test divisors only up to the square root of num, which cuts the work per number dramatically. There are many optimizations you can do. If you are testing huge numbers, you can use the Rabin-Miller algorithm.
http://inventwithpython.com/cracking/chapter22.html
I need, in a data analysis python project, to use both classes and multiprocessing features, and I haven't found a good example of it on Google.
My basic idea - which is probably wrong - is to create a class holding a large piece of data (a pandas DataFrame in my case), and then to define a method that computes an operation on it (a sum in this case).
import multiprocessing
import time

class C:

    def __init__(self):
        self.__data = list(range(0, 10**7))

    def func(self, nums):
        return sum(nums)

    def start_multi(self):
        for n_procs in range(1, 4):
            print()
            time_start = time.clock()
            chunks = [self.__data[(i - 1) * len(self.__data) // n_procs:
                                  i * len(self.__data) // n_procs]
                      for i in range(1, n_procs + 1)]
            pool = multiprocessing.Pool(processes=n_procs)
            results = pool.map_async(self.func, chunks)
            results.wait()
            pool.close()
            results = results.get()
            print(sum(results))
            print("n_procs", n_procs, "total time: ", time.clock() - time_start)

print('sum(list(range(0, 10**7)))', sum(list(range(0, 10**7))))
c = C()
c.start_multi()
The code doesn't work as expected: I get the following output
sum(list(range(0, 10**7))) 49999995000000
49999995000000
n_procs 1 total time: 0.45133500000000026
49999995000000
n_procs 2 total time: 0.8055279999999954
49999995000000
n_procs 3 total time: 1.1330870000000033
that is, the computation time increases instead of decreasing. So what is the error in this code?
I'm also worried about RAM usage since, when the chunks variable is created, the memory held by self.__data is effectively doubled. Is it possible, when dealing with multiprocessing code, and more specifically in this code, to avoid this memory waste? (I promise I'll put everything on Spark in the future :) )
It looks like there are a few things at play here:
The chunking operation is pretty slow. On my computer the generation of the chunks was taking about 16% of the time for the cases with multiple processes. The single process, non-pool, version doesn't have that overhead.
You are sending a lot of data into your processes. The chunks array is all the raw data for the ranges which needs to get pickled and sent over to the new processes. It would be much easier to, instead of sending all the raw data, just send the start and end indices.
In general, if you put timers in your func you'll see that most of the time is not being spent there. That's why you aren't seeing a speedup. Most of the time is spent on the chunking, pickling, forking, and other overhead.
As an alternative, you should try switching the chunking technique to just compute the start and end numbers and to avoid sending over so much data.
Next, I would recommend doing something a little more computationally demanding than computing the sum. For example, you can try counting primes. Here is an example that uses simple prime checking from here together with a modified chunking technique. Otherwise, I tried to keep the code the same.
import multiprocessing
import time
from math import sqrt
from itertools import count, islice

# credit to https://stackoverflow.com/a/27946768
def isPrime(n):
    return n > 1 and all(n % i for i in islice(count(2), int(sqrt(n) - 1)))

limit = 6

class C:

    def __init__(self):
        pass

    def func(self, start_end_tuple):
        start, end = start_end_tuple
        primes = []
        for x in range(start, end):
            if isPrime(x):
                primes.append(x)
        return len(primes)

    def get_chunks(self, total_size, n_procs):
        # start and end value tuples
        chunks = []
        # Example: (10, 5) -> (2, 0), so 2 numbers per process
        #          (10, 3) -> (3, 1), here the first process does 4 and the others do 3
        quotient, remainder = divmod(total_size, n_procs)
        current_start = 0
        for i in range(0, n_procs):
            my_amount = quotient
            if i == 0:
                # somebody needs to do the extra work
                my_amount += remainder
            chunks.append((current_start, current_start + my_amount))
            current_start += my_amount
        return chunks

    def start_multi(self):
        for n_procs in range(1, 4):
            time_start = time.clock()
            # chunk the start and end indices instead
            chunks = self.get_chunks(10**limit, n_procs)
            pool = multiprocessing.Pool(processes=n_procs)
            results = pool.map_async(self.func, chunks)
            results.wait()
            results = results.get()
            print(sum(results))
            time_delta = time.clock() - time_start
            print("n_procs {} time {}".format(n_procs, time_delta))

c = C()
time_start = time.clock()
print("serial func(...) = {}".format(c.func((1, 10**limit))))
print("total time {}".format(time.clock() - time_start))
c.start_multi()
This should result in a speedup for the multiple processes. Assuming you have the cores for it.
Questions
Why is the CPU usage of my threaded python merge sort only 50% for each core?
Why does this result in "cannot create new thread" errors for relatively small inputs (100000)?
How can I make this more pythonic? (It's very ugly.)
Linux/Ubuntu 12.4 64-bit i5 mobile (quad)
from random import shuffle
from threading import *
import time
import Queue

q = Queue.LifoQueue()

def merge_sort(num, q):
    end = len(num)
    if end > 1:
        mid = end / 2
        thread = Thread(target=merge_sort, args=(num[0:mid], q,))
        thread1 = Thread(target=merge_sort, args=(num[mid:end], q,))
        thread.setDaemon(True)
        thread1.setDaemon(True)
        thread.start()
        thread1.start()
        thread.join()
        thread1.join()
        return merge(q.get(num), q.get(num))
    else:
        if end != 0:
            q.put(num)
        else:
            print "?????"
        return

def merge(num, num1):
    a = []
    while len(num) is not 0 and len(num1) is not 0:
        if num[0] < num1[0]:
            a.append(num.pop(0))
        else:
            a.append(num1.pop(0))
    if len(num) is not 0:
        for i in range(0, len(num)):
            a.append(num.pop(0))
    if len(num1) is not 0:
        for i in range(0, len(num1)):
            a.append(num1.pop(0))
    q.put(a)
    return a

def main():
    val = long(raw_input("Please enter the maximum value of the range:")) + 1
    start_time = time.time()
    numbers = xrange(0, val)
    shuffle(numbers)
    numbers = merge_sort(numbers[0:val], q)
    # print "Sorted list is: \n"
    # for number in numbers:
    #     print number
    print str(time.time() - start_time) + " seconds to run.\n"

if __name__ == "__main__":
    main()
For the 100000 input your code tries to create ~200000 threads. Python threads are real OS threads so the 50% CPU load that you are seeing is probably the system busy handling the threads. On my system the error happens around ~32000 threads.
Your code as written can't possibly work:
from random import shuffle
#XXX won't work
numbers = xrange(0, val)
shuffle(numbers)
xrange() is not a mutable sequence.
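A minimal fix for that part, keeping the question's Python 2 style, is to build a real list before shuffling (the fixed val is illustrative; the original reads it from raw_input()):

from random import shuffle

val = 100000           # illustrative; the question reads this from raw_input()
numbers = range(val)   # in Python 2 this is already a mutable list
                       # (in Python 3 you would need list(range(val)))
shuffle(numbers)       # shuffle() can now rearrange the list in place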
Note: the sorting takes much less time than the random shuffling of the array:
import numpy as np
numbers = np.random.permutation(10000000) # here spent most of the time
numbers.sort()
If you want to sort parts of the array using different threads, you can do it:
from multiprocessing.dummy import Pool  # use threads
N = len(numbers)
Pool(2).map(lambda a: a.sort(), [numbers[:N//2], numbers[N//2:]])
a.sort() releases the GIL, so the code uses 2 CPUs.
If you include the time it takes to merge the sorted parts, it may be faster to just sort the whole array at once (numbers.sort()) in a single thread.
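For reference, such a final merge could look like this (a sketch assuming numbers has been half-sorted by the snippet above; heapq.merge does a linear merge of the two already-sorted halves, but its pure-Python iteration is often slower than a single numbers.sort()):

import heapq
import numpy as np

# numbers[:N//2] and numbers[N//2:] are each sorted after the Pool(2).map call above
merged = np.fromiter(heapq.merge(numbers[:N//2], numbers[N//2:]),
                     dtype=numbers.dtype, count=len(numbers))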
You may want to look into using Parallel Python, as by default CPython is restricted to one core because of the Global Interpreter Lock (GIL). This is why CPython cannot run CPU-bound code concurrently on multiple cores. But CPython is still great at carrying out I/O-bound tasks.
There is a good article that describes the threading limitations of CPython here.
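If you would rather stay in the standard library, a process pool side-steps the GIL for CPU-bound work. A minimal sketch (plain multiprocessing, not Parallel Python's API; the sorted halves would still need a final merge):

from multiprocessing import Pool

def sort_chunk(chunk):
    return sorted(chunk)

if __name__ == "__main__":
    data = [5, 3, 8, 1, 9, 2, 7, 4]
    mid = len(data) // 2
    with Pool(2) as pool:              # two worker processes, two cores
        left, right = pool.map(sort_chunk, [data[:mid], data[mid:]])
    print(left, right)                 # each half is sorted independently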
Suppose you have a list comprehension in Python, like
Values = [f(x) for x in range(0, 1000)]
with f being just a function without side effects, so all the entries can be computed independently.
Is Python able to speed up this list comprehension compared with the "obvious" sequential implementation, e.g. by shared-memory parallelization on multicore CPUs?
In Python 3.2 they added concurrent.futures, a nice library for solving problems concurrently. Consider this example:
import math, time
from concurrent import futures
PRIMES = [112272535095293, 112582705942171, 112272535095293, 115280095190773,
          115797848077099, 1099726899285419, 112272535095293, 112582705942171,
          112272535095293, 115280095190773, 115797848077099, 1099726899285419]

def is_prime(n):
    if n % 2 == 0:
        return False
    sqrt_n = int(math.floor(math.sqrt(n)))
    for i in range(3, sqrt_n + 1, 2):
        if n % i == 0:
            return False
    return True

def bench(f):
    start = time.time()
    f()
    elapsed = time.time() - start
    print("Completed in {} seconds".format(elapsed))

def concurrent():
    with futures.ProcessPoolExecutor() as executor:
        values = list(executor.map(is_prime, PRIMES))

def listcomp():
    values = [is_prime(x) for x in PRIMES]
Results on my quad core:
>>> bench(listcomp)
Completed in 14.463825941085815 seconds
>>> bench(concurrent)
Completed in 3.818351984024048 seconds
No, Python will not magically parallelize this for you. In fact, it can't, since it cannot prove the independence of the entries; that would require a great deal of program inspection/verification, which is impossible to get right in the general case.
If you want quick coarse-grained multicore parallelism, I recommend joblib instead:
from joblib import delayed, Parallel
values = Parallel(n_jobs=NUM_CPUS)(delayed(f)(x) for x in range(1000))
Not only have I witnessed near-linear speedups using this library, it also has the great feature of forwarding signals such as Ctrl-C to its worker processes, which cannot be said of all multiprocessing libraries.
Note that joblib doesn't really support shared-memory parallelism: it spawns worker processes, not threads, so it incurs some communication overhead from sending data to workers and results back to the master process.
Try whether the following is faster:
Values = map(f, range(0, 1000))
That's a more functional way to write it.
Another idea is to replace all occurrences of Values in the code with the lazy equivalent:
itertools.imap(f, range(0, 1000))   # Python 2
map(f, range(0, 1000))              # Python 3
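One caveat: in Python 3, map() already returns a lazy iterator, so if the surrounding code needs the materialized list that the original comprehension produced, wrap it in list():
Values = list(map(f, range(0, 1000)))   # equivalent to the original list comprehension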