Multiple urllib2 connections - python

I want to download multiple images at the same time. To do that I'm using threads, each one downloading an image with the urllib2 module. My problem is that even though the threads start (almost) simultaneously, the images are downloaded one by one, as if in a single-threaded environment.
Here is the threaded function:
def updateIcon(self, iter, imageurl):
    req = urllib2.Request('http://site.com/' + imageurl)
    response = urllib2.urlopen(req)
    imgdata = response.read()
    gobject.idle_add(self.setIcon, iter, imgdata)
Debugging my code, I found that the downloads seem to get stuck at the "response = urllib2.urlopen(req)" line. What's the problem? Is it because of the threading module or urllib2? How can I fix that?
Thank you in advance

Consider using urllib3. It supports connection pooling, so multiple concurrent requests can reuse connections. It should solve this problem. Be careful to garbage-collect connection pools if you contact many different sites, since each site gets its own pool.
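For a rough idea, a minimal urllib3 sketch (the image path is a placeholder on the question's site):
import urllib3

# One PoolManager reused for all requests; connections to the same host are pooled.
http = urllib3.PoolManager(num_pools=10)
response = http.request('GET', 'http://site.com/image1.png')  # placeholder path
imgdata = response.data  # raw bytes of the image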

In my experience, multiple threads in CPython do perform a bit better than a single thread, because CPython's threads are implemented on top of kernel threads. But the difference is small because of the GIL (Global Interpreter Lock). Substitute multiprocessing for multithreading; it's easy, since both have a similar interface.
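To show how similar the interfaces are, here is a minimal sketch (a plain download function stands in for the GTK-bound method from the question, and 'icon.png' is a placeholder):
import threading
import multiprocessing
import urllib2

def fetch(imageurl):
    return urllib2.urlopen('http://site.com/' + imageurl).read()

if __name__ == '__main__':
    t = threading.Thread(target=fetch, args=('icon.png',))         # thread-based
    p = multiprocessing.Process(target=fetch, args=('icon.png',))  # process-based, same interface
    t.start(); p.start()
    t.join(); p.join()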

Related

grequests alternative - python

I am developing a program that downloads multiple pages. I used grequests to minimize the download time, and also because it supports requests sessions, since the program requires a login. grequests is based on gevent, which gave me a hard time when compiling the program (py2exe, bbfreeze). Is there any alternative that can use requests sessions? Or are there any tips on compiling a program with gevent?
I can't use pyinstaller: I have to use esky which allows updates.
Sure, there are plenty of alternatives. There's absolutely no reason you have to use gevent—or greenlets at all—to download multiple pages.
If you're trying to handle thousands of connections, that's one thing, but normally a parallel downloader only wants 4-16 simultaneous connections, and any modern OS can run 4-16 threads just fine. Here's an example using Python 3.2+. If you're using 2.x or 3.1, download the futures backport from PyPI—it's pure Python, so you should have no trouble building and packaging it.
import concurrent.futures
import requests

def get_url(url):
    # your existing requests-based code here, e.g.:
    return requests.get(url).text

urls = [your, list, of, page, urls, here]

with concurrent.futures.ThreadPoolExecutor() as pool:
    pool.map(get_url, urls)
If you have some simple post-processing to do after each of the downloads on the main thread, the example in the docs shows how to do exactly that.
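Roughly, that pattern looks like this (process_page is a made-up callback; get_url and urls are from the snippet above):
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = dict((pool.submit(get_url, url), url) for url in urls)
    for future in concurrent.futures.as_completed(futures):
        process_page(futures[future], future.result())  # runs on the main thread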
If you've heard that "threads are bad in Python because of the GIL", you've heard wrong. Threads that do CPU-bound work in Python are bad because of the GIL. Threads that do I/O-bound work, like downloading a web page, are perfectly fine. And that's exactly the same restriction as when using greenlets, like your existing grequests code, which works.
As I said, this isn't the only alternative. For example, curl (with any of its various Python bindings) is a pain to get the hang of in the first place compared to requests—but once you do, having it multiplex multiple downloads for you isn't much harder than doing one at a time. But threading is the easiest alternative, especially if you've already written code around greenlets.
* In 2.x and 3.1, it can be a problem to have a single thread doing significant CPU work while background threads are doing I/O. In 3.2+, it works the way it should.

Multiprocessing useless with urllib2?

I recently tried to speed up a little tool (which uses urllib2 to send requests to the (unofficial) Twitter button-count URL (> 2000 URLs) and parses the results) with the multiprocessing module (and its worker pools). I read several discussions here about multithreading (which slowed the whole thing down compared to a standard, non-threaded version) and multiprocessing, but I couldn't find an answer to a (probably very simple) question:
Can you speed up URL calls with multiprocessing, or isn't the bottleneck something like the network adapter? I don't see which part of, for example, the urllib2 open method could be parallelized, or how that should work...
EDIT: This is the request I want to speed up, and the current multiprocessing setup:
urls=["www.foo.bar", "www.bar.foo",...]
tw_url='http://urls.api.twitter.com/1/urls/count.json?url=%s'
def getTweets(self,urls):
for i in urls:
try:
self.tw_que=urllib2.urlopen(tw_url %(i))
self.jsons=json.loads(self.tw_que.read())
self.tweets.append({'url':i,'date':today,'tweets':self.jsons['count']})
except ValueError:
print ....
continue
return self.tweets
if __name__ == '__main__':
pool = multiprocessing.Pool(processes=4)
result = [pool.apply_async(getTweets(i,)) for i in urls]
[i.get() for i in result]
Ah, here comes yet another discussion about the GIL. Well, here's the thing: fetching content with urllib2 is going to be mostly I/O-bound. Native threading AND multiprocessing will both have about the same performance when the task is I/O-bound (threading only becomes a problem when it's CPU-bound). Yes, you can speed it up; I've done it myself using Python threads with something like 10 downloader threads.
Basically you use a producer-consumer model with one thread (or process) producing urls to download, and N threads (or processes) consuming from that queue and making requests to the server.
Here's some pseudo-code:
# Make sure that the queue is thread-safe!! (Queue.Queue is.)
import urllib2

def producer(self):
    # Only need one producer, although you could have multiple
    with open('urllist.txt', 'r') as fh:
        for line in fh:
            self.queue.put(line.strip())

def consumer(self):
    # Fire up N of these babies for some speed
    while True:
        url = self.queue.get()
        dh = urllib2.urlopen(url)
        with open('/dev/null', 'w') as fh:  # gotta put it somewhere
            fh.write(dh.read())
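And a minimal sketch of wiring it up (the Downloader class name is made up; it just holds self.queue plus the two methods above):
import threading
import Queue

d = Downloader()                 # hypothetical class containing producer()/consumer()
d.queue = Queue.Queue()          # thread-safe
workers = [threading.Thread(target=d.consumer) for _ in range(10)]
for w in workers:
    w.daemon = True              # consumers loop forever; daemonize so the program can exit
    w.start()
d.producer()                     # fill the queue from the main thread
# In a real program, have consumers call task_done() and then call d.queue.join() here.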
Now if you're downloading very large chunks of data (hundreds of MB) and a single request completely saturates the bandwidth, then yes running multiple downloads is pointless. The reason you run multiple downloads (generally) is because requests are small and have a relatively high latency / overhead.
Take a look at gevent and specifically at this example: concurrent_download.py. It will be reasonably faster than multiprocessing and multithreading, plus it can handle thousands of connections easily.
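For a rough feel of the gevent style (this is not the linked example, and the URLs are placeholders):
import gevent
from gevent import monkey
monkey.patch_all()   # make the standard socket/urllib2 calls cooperative
import urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

urls = ['http://www.example.com/1', 'http://www.example.com/2']
jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs)
results = [job.value for job in jobs]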
It depends! Are you contacting different servers? Are the transferred files small or big? Do you lose most of the time waiting for the server to reply, or transferring data? ...
Generally, multiprocessing involves some overhead and as such you want to be sure that the speedup gained by parallelizing the work is larger than the overhead itself.
Another point: network and thus I/O bound applications work – and scale – better with asynchronous I/O and an event driven architecture instead of threading or multiprocessing, as in such applications much of the time is spent waiting on I/O and not doing any computation.
For your specific problem, I would try to implement a solution by using Twisted, gevent, Tornado or any other networking framework which does not use threads to parallelize connections.
What you do when you split web requests over several processes is to parallelize the network latencies (i.e. the waiting for responses). So you should normally get a good speedup, since most of the processes should sleep most of the time, waiting for an event.
Or use Twisted. ;)
Nothing is useful if your code is broken: f() (with parentheses) calls a function in Python immediately; you should pass just f (no parentheses) to be executed in the pool instead. Your code from the question:
#XXX BROKEN, DO NOT USE
result = [pool.apply_async(getTweets(i,)) for i in urls]
[i.get() for i in result]
Notice the parentheses after getTweets: that means all the code is executed serially in the main thread.
Delegate the call to the pool instead:
all_tweets = pool.map(getTweets, urls)
Also, you don't need separate processes here unless json.loads() is expensive (CPU-wise) in your case. You could use threads: replace multiprocessing.Pool with multiprocessing.pool.ThreadPool -- the rest is identical. The GIL is released during I/O in CPython, and therefore threads should speed up your code if most of the time is spent in urlopen().read().
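For illustration, a rough ThreadPool sketch (it assumes getTweets is reworked as a plain function handling a single URL; the endpoint is the one from the question):
import json
import urllib2
from multiprocessing.pool import ThreadPool

tw_url = 'http://urls.api.twitter.com/1/urls/count.json?url=%s'

def get_tweet_count(url):
    response = urllib2.urlopen(tw_url % url)   # GIL is released while waiting on the network
    return {'url': url, 'tweets': json.loads(response.read())['count']}

if __name__ == '__main__':
    urls = ["www.foo.bar", "www.bar.foo"]
    pool = ThreadPool(processes=20)            # threads, despite living in multiprocessing.pool
    all_tweets = pool.map(get_tweet_count, urls)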
Here's a complete code example.

Python threads with urllib

I use Python to make many requests to a web service at the same time. To do so I create threads and use urllib (the first version; I'm on Python 2.6).
When I start the threads, all goes well until one reaches urllib.urlopen(). The second thread has to wait until the first one ends before getting through the urllib.urlopen() call. As I do a lot of work after having retrieved the JSON from the remote web service, I want the second thread to "urlopen" at the same time, or just after the first one closes its socket.
I tried closing the socket just after collecting the returned JSON, but it changes nothing. The second thread still has to wait for the first one to finish. I verified this with prints.
I can understand that urllib may not be thread-safe (googling this doesn't give clear answers), but why does the second thread have to wait for the first one to end completely (and not just for its socket work to end)?
Thanks for your help and hints
PS: I do not use Python 3 for compatibility with modules / packages I require
This does not sound like intended behavior, as two parallel urllib requests should be possible. Are you sure your remote server can handle two parallel requests (e.g. that it is not in debug mode with a single thread)?
In any case, threading is not the preferred approach for parallel programming with Python. Either use processes or async I/O, especially on the server side (you didn't mention the use case, or your platform, which may also be buggy).
I have had very good experiences processing and transforming JSON/XML with Spawning and Eventlet, which patch Python socket code to be asynchronous.
http://pypi.python.org/pypi/Spawning/
http://eventlet.net/
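As a rough illustration of that style (Eventlet is assumed to be installed; the URLs are placeholders):
import eventlet
from eventlet.green import urllib2   # cooperative, non-blocking drop-in for urllib2

def fetch(url):
    return urllib2.urlopen(url).read()

urls = ['http://www.example.com/a.json', 'http://www.example.com/b.json']
pool = eventlet.GreenPool(size=10)
for body in pool.imap(fetch, urls):
    print len(body)                  # Python 2, as in the question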

Threaded vs. asynchronous image processing?

I have a Python function which generates an image once it is accessed. I can either invoke it directly upon an HTTP request, or do it asynchronously using Gearman. There are a lot of requests.
Which way is better:
Inline - create the image inline; this will result in many images being generated at once
Asynchronous - queue jobs (with Gearman) and generate images in a worker
Which option is better?
In this case "better" would mean the best speed / load combinations. The image generation example is symbolical, as this can also be applied to Database connections and other things.
I have a Python function which generates an image once it is accessed. I can either invoke it directly upon an HTTP request, or do it asynchronously using Gearman. There are a lot of requests.
You should not do it inside your request, because then you can't throttle (your server could get overloaded). All big sites use a message queue to do the processing offline.
Which option is better?
In this case "better" would mean the best speed/load combination. The image generation example is symbolic, as this can also be applied to database connections and other things.
You should do it asynchronously. Besides speeding up your website, the most compelling reason is that you can throttle your queue when you are under high load and execute the highest-priority tasks first.
I believe forking processes is expensive. I would create a couple of worker processes (maybe with a little threading inside each process) to handle the load. I would probably use Redis, because it is fast, actively developed (antirez/pietern commit almost every day) and has a very good, stable Python client library. BLPOP/RPUSH could be used to simulate a job queue.
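A minimal sketch of that Redis-backed queue idea (it assumes the redis-py client and a local Redis server; the queue name and generate_image() are made up):
import json
import redis

r = redis.Redis()

def enqueue(image_id):
    # Called from the web request: push the job and return immediately.
    r.rpush('image_jobs', json.dumps({'image_id': image_id}))

def worker_loop():
    # Run inside a worker process: block until a job arrives, then do the work.
    while True:
        _, payload = r.blpop('image_jobs')
        job = json.loads(payload)
        generate_image(job['image_id'])   # made-up image generator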
If your program is CPU bound in the interpreter then spawning multiple threads will actually slow down the result even if there are enough processors to run them all. This happens because the GIL (global interpreter lock) only allows one thread to run in the interpreter at a time.
If most of the work happens in a C library it's likely the lock is not held and you can productively use multiple threads.
If you are spawning threads yourself, you'll need to make sure not to create too many - 10K threads at once would be bad news - so you'd need to set up a work queue that the threads read from, instead of just spawning them in a loop.
If I was doing this I'd just use the standard multiprocessing module.

CherryPy and concurrency

I'm using CherryPy in order to serve a python application through WSGI.
I tried benchmarking it, but it seems as if CherryPy can only handle exactly 10 req/sec, no matter what I do.
I built a simple app with a 3-second pause, in order to accurately determine what is going on, and I can confirm that the 10 req/sec limit has nothing to do with the resources used by the Python script.
Any ideas?
By default, CherryPy's builtin HTTP server will use a thread pool with 10 threads. If you are still using the defaults, you could try increasing this in your config file.
[global]
server.thread_pool = 30
See the cpserver documentation
Or the archive.org copy of the old documentation
This was extremely confounding for me too. The documentation says that CherryPy will automatically scale its thread pool based on observed load, but my experience is that it will not. If you have tasks which might take a while and may also use hardly any CPU in the meantime, then you will need to estimate a thread_pool size based on your expected load and target response time.
For instance, if the average request will take 1.5 seconds to process and you want to handle 50 requests per second, then you will need 75 threads in your thread_pool to handle your expectations.
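In other words (just back-of-the-envelope arithmetic, not a CherryPy API):
avg_seconds_per_request = 1.5
target_requests_per_second = 50
thread_pool = int(target_requests_per_second * avg_seconds_per_request)  # 75 threads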
In my case, I delegated the heavy lifting out to other processes via the multiprocessing module. This leaves the main CherryPy process and threads at idle. However, the CherryPy threads will still be blocked awaiting output from the delegated multiprocessing processes. For this reason, the server needs enough threads in the thread_pool to have available threads for new requests.
My initial thinking was that the thread_pool would not need to be larger than the multiprocessing pool's worker count. But this also turns out to be a mistaken assumption: somehow, the CherryPy threads will remain blocked even when there is available capacity in the multiprocessing pool.
Another mistaken assumption is that the blocking and poor performance have something to do with the Python GIL. They do not. In my case I was already farming the work out via multiprocessing and still needed a thread_pool sized on the average request time and the desired requests-per-second target. Raising the thread_pool size addressed the issue, although it looks like an incorrect fix.
Simple fix for me:
cherrypy.config.update({
    'server.thread_pool': 100
})
Your client needs to actually READ the server's response. Otherwise the socket/thread will stay open/running until it times out and is garbage collected.
Use a client that behaves correctly and you'll see that your server behaves too.
