I'm having the hardest time trying to figure out the difference in usage between multiprocessing.Pool and multiprocessing.Queue.
To help, this bit of code is a barebones example of what I'm trying to do.
def update():
    def _hold(url):
        soup = BeautifulSoup(url)
        return soup

    def _queue(url):
        soup = BeautifulSoup(url)
        li = [l for l in soup.find('li')]
        return True if li else False

    url = 'www.ur_url_here.org'
    _hold(url)
    _queue(url)
I'm trying to run _hold() and _queue() at the same time. I'm not trying to have them communicate with each other so there is no need for a Pipe. update() is called every 5 seconds.
I can't really wrap my head around the difference between creating a pool of workers, or creating a queue of functions. Can anyone assist me?
The real _hold() and _queue() functions are much more elaborate than the example so concurrent execution actually is necessary, I just thought this example would suffice for asking the question.
The Pool and the Queue belong to two different levels of abstraction.
The Pool of Workers is a concurrent design paradigm which aims to abstract a lot of logic you would otherwise need to implement yourself when using processes and queues.
The multiprocessing.Pool actually uses a Queue internally for operating.
If your problem is simple enough, you can easily rely on a Pool. In more complex cases, you might need to deal with processes and queues yourself.
For your specific example, the following code should do the trick.
import multiprocessing

def hold(url):
    ...
    return soup

def queue(url):
    ...
    return bool(li)

def update(url):
    with multiprocessing.Pool(2) as pool:
        hold_job = pool.apply_async(hold, args=[url])
        queue_job = pool.apply_async(queue, args=[url])

        # block until hold_job is done
        soup = hold_job.get()

        # block until queue_job is done
        li = queue_job.get()
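For comparison, here is a rough sketch of the lower-level approach mentioned above, managing the Process and Queue objects yourself. It reuses the hold and queue functions from the Pool example and is only an illustration of what the Pool hides from you, not something you need for this problem.

import multiprocessing

def hold_worker(url, results):
    # each worker puts a (name, value) pair on the shared queue
    results.put(('hold', hold(url)))

def queue_worker(url, results):
    results.put(('queue', queue(url)))

def update_manual(url):
    results = multiprocessing.Queue()
    jobs = [
        multiprocessing.Process(target=hold_worker, args=(url, results)),
        multiprocessing.Process(target=queue_worker, args=(url, results)),
    ]
    for job in jobs:
        job.start()

    # drain the queue before joining so the child processes can exit cleanly
    collected = dict(results.get() for _ in jobs)

    for job in jobs:
        job.join()

    return collected['hold'], collected['queue']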
I'd also recommend you take a look at the concurrent.futures module. As the name suggests, it is the future-proof implementation of the Pool of Workers paradigm in Python.
You can easily re-write the example above with that library as what really changes is just the API names.
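For instance, a minimal sketch of the same update() written with concurrent.futures, reusing the hold and queue functions from above:

import concurrent.futures

def update(url):
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        hold_job = executor.submit(hold, url)
        queue_job = executor.submit(queue, url)

        # result() blocks until the corresponding job is done
        soup = hold_job.result()
        li = queue_job.result()
        return soup, li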
Related
Suppose I have the following in Python
# A loop
for i in range(10000):
    Do Task A

# B loop
for i in range(10000):
    Do Task B
How do I run these loops simultaneously in Python?
If you want concurrency, here's a very simple example:
from multiprocessing import Process

def loop_a():
    while 1:
        print("a")

def loop_b():
    while 1:
        print("b")

if __name__ == '__main__':
    Process(target=loop_a).start()
    Process(target=loop_b).start()
This is just the most basic example I could think of. Be sure to read http://docs.python.org/library/multiprocessing.html to understand what's happening.
If you want to send data back to the program, I'd recommend using a Queue (which in my experience is easiest to use).
You can use a thread instead if you don't mind the global interpreter lock. Processes are more expensive to instantiate but they offer true concurrency.
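For example, a minimal sketch of sending a value back through a Queue, using a simplified, hypothetical loop_a that computes one result instead of looping forever:

from multiprocessing import Process, Queue

def loop_a(q):
    # hand the result back to the parent process through the queue
    q.put("result from a")

if __name__ == '__main__':
    q = Queue()
    p = Process(target=loop_a, args=(q,))
    p.start()
    print(q.get())  # blocks until loop_a has put something on the queue
    p.join()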
There are many possible options for what you wanted:
use loop
As many people have pointed out, this is the simplest way.
for i in xrange(10000):  # use xrange instead of range
    taskA()
    taskB()
Merits: easy to understand and use, no extra library needed.
Drawbacks: taskB must be done after taskA, or the other way around; they can't run simultaneously.
multiprocessing
Another thought would be to run two processes at the same time; Python provides the multiprocessing library. The following is a simple example:
from multiprocessing import Process

p1 = Process(target=taskA, args=args, kwargs=kwargs)
p2 = Process(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
Merits: tasks can run simultaneously in the background; you can control them (end or stop them, etc.); tasks can exchange data and can be synchronized if they compete for the same resources, etc.
Drawbacks: too heavy! The OS will frequently switch between them, and each process has its own data space even if the data is redundant. If you have a lot of tasks (say 100 or more), it's not what you want.
threading
A thread is like a process, just lightweight. Check out this post. Their usage is quite similar:
import threading

p1 = threading.Thread(target=taskA, args=args, kwargs=kwargs)
p2 = threading.Thread(target=taskB, args=args, kwargs=kwargs)
p1.start()
p2.start()
coroutines
Libraries like greenlet and gevent provide something called coroutines, which are supposed to be faster than threading. Please google how to use them if you're interested; a minimal gevent sketch follows after the merits and drawbacks below.
Merits: more flexible and lightweight.
Drawbacks: an extra library is needed, and there is a learning curve.
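A minimal gevent sketch, using hypothetical task_a/task_b functions that just print and yield control:

import gevent

def task_a():
    for _ in range(3):
        print("a")
        gevent.sleep(0)  # yield control to the other greenlet

def task_b():
    for _ in range(3):
        print("b")
        gevent.sleep(0)

# spawn both greenlets and wait for them to finish
gevent.joinall([gevent.spawn(task_a), gevent.spawn(task_b)])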
Why do you want to run the two processes at the same time? Is it because you think they will go faster (there is a good chance that they won't)? Why not run the tasks in the same loop, e.g.
for i in range(10000):
    doTaskA()
    doTaskB()
The obvious answer to your question is to use threads - see the python threading module. However threading is a big subject and has many pitfalls, so read up on it before you go down that route.
Alternatively you could run the tasks in separate processes, using the Python multiprocessing module. If both tasks are CPU-intensive this will make better use of multiple cores on your computer.
There are other options such as coroutines, stackless tasklets, greenlets, CSP, etc., but without knowing more about Task A and Task B, and why they need to run at the same time, it is impossible to give a more specific answer.
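As a minimal sketch of the multiprocessing route (assuming doTaskA and doTaskB are defined at module level):

from multiprocessing import Process

def run_task_a():
    for i in range(10000):
        doTaskA()

def run_task_b():
    for i in range(10000):
        doTaskB()

if __name__ == '__main__':
    a = Process(target=run_task_a)
    b = Process(target=run_task_b)
    a.start()
    b.start()
    a.join()  # wait for both loops to finish
    b.join()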
from threading import Thread

def loopA():
    for i in range(10000):
        # Do task A
        pass

def loopB():
    for i in range(10000):
        # Do task B
        pass

threadA = Thread(target=loopA)
threadB = Thread(target=loopB)
threadA.start()
threadB.start()

# Do work independent of loopA and loopB

threadA.join()
threadB.join()
You could use threading or multiprocessing.
How about a single loop: for i in range(10000): do Task A, do Task B? Without more information I don't have a better answer.
I find that using the "pool" submodule within "multiprocessing" works amazingly well for executing multiple processes at once within a Python script.
See Section: Using a pool of workers
Look carefully at "# launching multiple evaluations asynchronously may use more processes" in the example. Once you understand what those lines are doing, the following example I constructed will make a lot of sense.
import numpy as np
from multiprocessing import Pool

def desired_function(option, processes, data, etc...):
    # your code will go here. option allows you to make choices within your script
    # to execute desired sections of code for each pool or subprocess.
    return result_array  # "for example"

result_array = np.zeros("some shape")  # This is normally populated by 1 loop, lets try 4.
processes = 4
pool = Pool(processes=processes)
args = (processes, data, etc...)  # Arguments to be passed into desired function.

multiple_results = []
for i in range(processes):  # Executes each pool w/ option (1-4 in this case).
    multiple_results.append(pool.apply_async(desired_function, (i+1,) + args))

# Retrieves results after every pool is finished!
results = np.array([res.get() for res in multiple_results])

for i in range(processes):
    result_array = result_array + results[i]  # Combines all datasets!
The code will basically run the desired function for a set number of processes. You will have to carefully make sure your function can distinguish between each process (hence why I added the variable "option".) Additionally, it doesn't have to be an array that is being populated in the end, but for my example, that's how I used it. Hope this simplifies or helps you better understand the power of multiprocessing in Python!
I have a python generator that returns lots of items, for example:
import itertools

def generate_random_strings():
    chars = "ABCDEFGH"
    for item in itertools.product(chars, repeat=10):
        yield "".join(item)
I then iterate over this and perform various tasks; the issue is that I'm only using one thread/process for this:
my_strings = generate_random_strings()
for string in my_strings:
    # do something with string...
    print(string)
This works great, I'm getting all my strings, but it's slow. I would like to harness the power of Python multiprocessing to "divide and conquer" this for loop. However, of course, I want each string to be processed only once. While I've found much documentation on multiprocessing, I'm trying to find the most simple solution for this with the least amount of code.
I'm assuming each thread should take a big chunk of items every time and process them before coming back and getting another big chunk etc...
Many thanks,
Simplest solution with the least code? The multiprocessing context manager.
I assume that you can put "do something with string" into a function called "do_something"
from multiprocessing import Pool as ProcessPool

number_of_processes = 4
with ProcessPool(number_of_processes) as pool:
    pool.map(do_something, my_strings)
If you want to get the results of "do_something" back again, easy!
with ProcessPool(number_of_processes) as pool:
    results = pool.map(do_something, my_strings)
You'll get them in a list.
multiprocessing.dummy is a syntactic wrapper for thread pools that lets you use the multiprocessing syntax. If you want threads instead of processes, just do this:
from multiprocessing.dummy import Pool as ThreadPool
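If, as the question guesses, you want each worker to pull a bigger chunk of strings at a time, map and imap_unordered also accept a chunksize argument. A minimal sketch, assuming do_something and my_strings are defined as above:

from multiprocessing import Pool as ProcessPool

with ProcessPool(4) as pool:
    # chunksize controls how many items each worker grabs per trip to the queue;
    # bigger chunks mean less dispatch overhead when each task is cheap.
    for result in pool.imap_unordered(do_something, my_strings, chunksize=1000):
        pass  # handle each result as it arrives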
You may use multiprocessing.
import multiprocessing

def string_fun(string):
    # do something with string...
    print(string)

my_strings = generate_random_strings()
num_of_threads = 7
pool = multiprocessing.Pool(num_of_threads)
pool.map(string_fun, my_strings)
Assuming you're using the latest version of Python, you may want to read something about the asyncio module. Multithreading is not easy to implement due to the GIL: "In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once. This lock is necessary mainly because CPython's memory management is not thread-safe."
So you can switch to multiprocessing, or, as reported above, take a look at the asyncio module.
asyncio — Asynchronous I/O > https://docs.python.org/3/library/asyncio.html
I'll integrate this answer with some code as soon as possible.
Hope it helps,
Hele
As @Hele mentioned, asyncio is best of all; here is an example.
Code
#!/usr/bin/python3
# -*- coding: utf-8 -*-
# python 3.7.2
from asyncio import ensure_future, gather, run
import random

alphabet = 'ABCDEFGH'
size = 1000

async def generate():
    tasks = list()
    result = None
    for el in range(1, size):
        task = ensure_future(generate_one())
        tasks.append(task)
    result = await gather(*tasks)
    return list(set(result))

async def generate_one():
    return ''.join(random.choice(alphabet) for i in range(8))

if __name__ == '__main__':
    my_strings = run(generate())
    print(my_strings)
Output
['CHABCGDD', 'ACBGAFEB', ...
Of course, you need to improve generate_one; this variant is very slow.
You can see source code here.
I have some split files on a remote server.
I have tried downloading them one by one and join them. But it takes a lot of time. I googled and found that simultaneous download might speed up things. The script is on Python.
My pseudo is like this:
url1 = something
url2 = something
url3 = something
data1 = download(url1)
data2 = download(url2)
data3 = download(url3)
wait for all download to complete
join all data and save
Could anyone point me in a direction by which I can download all the files simultaneously and wait till they are done?
I have tried by creating a class. But again I can't figure out how to wait till all complete.
I am more interested in the Threading and Queue features, and I can import them on my platform.
I have tried Thread and Queue with an example found on this site. Here is the code: pastebin.com/KkiMLTqR . But it either does not wait or waits forever... not sure.
There are 2 ways to do things simultaneously. Or, really, 2-3/4 or so:
Multiple threads
Or multiple processes, especially if the "things" take a lot of CPU power
Or coroutines or greenlets, especially if there are thousands of "things"
Or pools of one of the above
Event loops (either coded manually or provided by a framework)
Or hybrid greenlet/event loop systems like gevent.
If you have 1000 URLs, you probably don't want to do 1000 requests at the same time. For example, web browsers typically only do something like 8 requests at a time. A pool is a nice way to do only 8 things at a time, so let's do that.
And, since you're only doing 8 things at a time, and those things are primarily I/O bound, threads are perfect.
I'll implement it with futures. (If you're using Python 2.x, or 3.0-3.1, you will need to install the backport, futures.)
import concurrent.futures

urls = ['http://example.com/foo',
        'http://example.com/bar']

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    result = b''.join(executor.map(download, urls))

with open('output_file', 'wb') as f:
    f.write(result)
Of course you need to write the download function, but that's exactly the same function you'd write if you were doing these one at a time.
For example, using urlopen (if you're using Python 2.x, use urllib2 instead of urllib.request):
import urllib.request

def download(url):
    with urllib.request.urlopen(url) as f:
        return f.read()
If you want to learn how to build a thread pool executor yourself, the source is actually pretty simple, and multiprocessing.pool is another nice example in the stdlib.
However, both of those have a lot of excess code (handling weak references to improve memory usage, shutting down cleanly, offering different ways of waiting on the results, propagating exceptions properly, etc.) that may get in your way.
If you look around PyPI and ActiveState, you will find simpler designs like threadpool that you may find easier to understand.
But here's the simplest joinable threadpool:
import queue
import threading

class ThreadPool(object):
    def __init__(self, max_workers):
        self.queue = queue.Queue()
        self.workers = [threading.Thread(target=self._worker)
                        for _ in range(max_workers)]

    def start(self):
        for worker in self.workers:
            worker.start()

    def stop(self):
        # one sentinel per worker tells each thread to exit
        for _ in range(len(self.workers)):
            self.queue.put(None)
        for worker in self.workers:
            worker.join()

    def submit(self, job):
        self.queue.put(job)

    def _worker(self):
        while True:
            job = self.queue.get()
            if job is None:
                break
            job()
Of course the downside of a dead-simple implementation is that it's not as friendly to use as concurrent.futures.ThreadPoolExecutor:
import functools
import threading
import urllib.request

urls = ['http://example.com/foo',
        'http://example.com/bar']

results = [None] * len(urls)
results_lock = threading.Lock()

def download(url, i):
    with urllib.request.urlopen(url) as f:
        result = f.read()
    with results_lock:
        results[i] = result

pool = ThreadPool(max_workers=8)
pool.start()
for i, url in enumerate(urls):
    pool.submit(functools.partial(download, url, i))
pool.stop()

result = b''.join(results)
with open('output_file', 'wb') as f:
    f.write(result)
You can use an async framework like Twisted.
Alternatively, this is one thing that Python's threads do OK at, since you are mostly I/O bound.
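A rough sketch of the thread-based version, assuming a download(url) function that returns the file's bytes and the url1/url2/url3 names from the pseudocode (the output filename here is made up):

import threading

urls = [url1, url2, url3]          # the URLs from the pseudocode
data = [None] * len(urls)

def worker(i, url):
    data[i] = download(url)        # each thread fills in its own slot

threads = [threading.Thread(target=worker, args=(i, url))
           for i, url in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # wait for all downloads to complete

with open('joined_file', 'wb') as f:   # join all data and save
    f.write(b''.join(data))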
I've got a piece of code:
for url in get_lines(file):
    visit(url, timeout=timeout)
It gets URLs from a file and visits each one (via urllib2) in a for loop.
Is it possible to do this in a few threads? For example, 10 visits at the same time.
I've tried:
for url in get_lines(file):
    Thread(target=visit, args=(url,), kwargs={"timeout": timeout}).start()
But it does not work - no effect, URLs are visited normally.
The simplified version of function visit:
def visit(url, proxy_addr=None, timeout=30):
    (...)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()
To expand on senderle's answer, you can use the Pool class in multiprocessing to do this easily:
from multiprocessing import Pool

pool = Pool(processes=5)
pages = pool.map(visit, get_lines(file))
When the map function returns, "pages" will be a list of the contents of the URLs. You can adjust the number of processes to whatever is suitable for your system.
I suspect that you've run into the Global Interpreter Lock. Basically, threading in Python can't achieve true parallelism, which seems to be your goal. You need to use multiprocessing instead.
multiprocessing is designed to have a roughly analogous interface to threading, but it has a few quirks. Your visit function as written above should work correctly, I believe, because it's written in a functional style, without side effects.
In multiprocessing, the Process class is the equivalent of the Thread class in threading. It has all the same methods, so it's a drop-in replacement in this case. (Though I suppose you could use pool as JoeZuntz suggests -- but I would test with the basic Process class first, to see if it fixes the problem.)
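For example, the question's loop rewritten with Process instead of Thread (visit, get_lines, file and timeout are the names from the question) might look like this sketch:

from multiprocessing import Process

processes = []
for url in get_lines(file):
    p = Process(target=visit, args=(url,), kwargs={"timeout": timeout})
    p.start()
    processes.append(p)

# wait for every visit to finish; note this starts one process per URL,
# so a Pool is a better fit when the file contains many URLs
for p in processes:
    p.join()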
I'm a multiprocessing newbie,
I know something about threading but I need to increase the speed of this calculation, hopefully with multiprocessing:
Example description: send a string to a thread, alter the string + run a benchmark test,
then send the result back for printing.
from threading import Thread

class Alter(Thread):
    def __init__(self, word):
        Thread.__init__(self)
        self.word = word
        self.word2 = ''

    def run(self):
        # Alter string + test processing speed
        for i in range(80000):
            self.word2 = self.word2 + self.word

# Send a string to be altered
thread1 = Alter('foo')
thread2 = Alter('bar')
thread1.start()
thread2.start()

# wait for both to finish
while thread1.is_alive() == True: pass
while thread2.is_alive() == True: pass

print(thread1.word2)
print(thread2.word2)
This currently takes about 6 seconds and I need it to go faster.
I have been looking into multiprocessing and cannot find something equivalent to the above code. I think what I am after is pooling, but the examples I have found have been hard to understand. I would like to take advantage of all cores (8 cores, per multiprocessing.cpu_count()), but I really just have scraps of useful information on multiprocessing and not enough to duplicate the above code. If anyone can point me in the right direction or, better yet, provide an example, that would be greatly appreciated. Python 3 please.
Just replace threading with multiprocessing and Thread with Process. Threads in Python are (almost) never used to gain performance because of the big bad GIL! I explained it in another SO post with some links to documentation and a great talk about threading in Python.
But the multiprocessing module is intentionally very similar to the threading module. You can almost use it as a drop-in replacement!
The multiprocessing module doesn't, AFAIK, offer functionality to enforce the use of a specific number of cores; it relies on the OS implementation. You could use the Pool object and limit the number of worker objects to the core count, or you could look at another MPI library like pypar. Under Linux you could use a pipe in the shell to start multiple instances on different cores.
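As a rough sketch of that swap (not a literal drop-in for the Alter class, since a child Process cannot hand results back through instance attributes the way a Thread can), the example could use a Pool and return values instead:

from multiprocessing import Pool

def alter(word):
    # Alter string + test processing speed (same work as Alter.run above)
    word2 = ''
    for i in range(80000):
        word2 = word2 + word
    return word2

if __name__ == '__main__':
    with Pool(2) as pool:  # Pool() with no argument uses os.cpu_count() workers
        results = pool.map(alter, ['foo', 'bar'])
    print(results[0])
    print(results[1])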