Best strategy to download many images with Python?

Best strategy to download many images with Python? - python

I'm trying to download many images from a list of URL. By many, I mean in the vicinity of 10 000. The images vary in size, from a few hundreds of KB to 15 MB.
I wonder what would be the best strategy to go about this task, trying to minimize the total time to finish, and to avoid freezing.
I use this function to save each image :
def save_image(name, base_dir, data):
with open(base_dir + name, "wb+") as destination:
for chunk in data:
destination.write(chunk)
I take the file extension from the URL with this function :
def get_ext(url):
"""Return the filename extension from url, or ''."""
""" From : https://stackoverflow.com/questions/28288987/identify-the-file-extension-of-a-url """
parsed = urlparse(url)
root, ext = splitext(parsed.path)
return ext # or ext[1:] if you don't want the leading '.'
And to get the images I just do :
for image in listofimages:
r = requests.get(image["url"], timeout=5)
extension = get_ext(image["url"])
name = str( int(image['ID']) ) + "AA" + extension
save_image( name, "images/", r )
Now putting it all together is quite slow. Hence my question.

One, as hinted in the above comments, you probably want to parallelize work. Multiprocessing and multithreading will work, but with a relatively high overhead. Alternatively, you could use an asynchronous approach such as patching your network libraries with Gevent, or using asyncio together with an async-aware HTTP client, for instance, httpx would do.
Regardless of the approach you take to parallelize I/O, you might find the queue paradigm convenient to work with -- put all your URLs into a queue, and let your workers consume them.
Two, to deal with non-responsive web servers blocking your workers from scraping, you'll probably need to set socket timeouts, check the chosen HTTP client library how to do it. For instance, the popular requests simply takes the timeout parameter.

Related

Asynchronous download of files with twisted and (tx)requests

I'm trying to download file(s) from the internet from within a twisted application. I'd like to do this using requests due to the other features it provides directly or has well maintained libraries to provide (retries, proxies, cachecontrol, etc.). I am open to a twisted only solution which does not have these features, but I can't seem to find one anyway.
The files should be expected to be fairly large and will be downloaded on slow connections. I'm therefore using requests' stream=True interface and the response's iter_content. A more or less complete code fragment is listed at the end of this question. The entry point for this would be http_download function, called with a url, a dst to write the file to, and a callback and an optional errback to handle a failed download. I've stripped away some of the code involved in preparing the destination (create folders, etc) and code to close the session during reactor exit but I think it should still work as is.
This code works. The file is downloaded, the twisted reactor continues to operate. However, I seem to have a problem with this bit of code :
def _stream_download(r, f):
for chunk in r.iter_content(chunk_size=128):
f.write(chunk)
yield None
cooperative_dl = cooperate(_stream_download(response, filehandle))
Because iter_content returns only when it has a chunk to return, the reactor handles a chunk, runs other bits of code, then returns to waiting for the next chunk instead of keeping itself busy updating a spinning wait animation on the GUI (code not actually posted here).
Here's the question -
Is there a way to get twisted to operate on this generator in such a way that it yields control when the generator itself is not prepared to yield something? I came across some docs for twisted.flow which seemed appropriate, but this does not seem to have made it into twisted or no longer exists today. This question can be read independent of the specifics, i.e., with respect to any arbitrary blocking generator, or can be read in the immediate context of the question.
Is there a way to get twisted to download files asynchronously using something full-featured like requests? Is there an existing twisted module which just does this which I can just use?
What would the basic approach be to such a problem with twisted, independent of the http features I want to use from requests. Let's assume I'm prepared to ditch them or otherwise implement them. How would I download a file asynchronously over HTTP.
import os
import re
from functools import partial
from six.moves.urllib.parse import urlparse
from requests import HTTPError
from twisted.internet.task import cooperate
from txrequests import Session
class HttpClientMixin(object):
def __init__(self, *args, **kwargs):
self._http_session = None
def http_download(self, url, dst, callback, errback=None, **kwargs):
dst = os.path.abspath(dst)
# Log request
deferred_response = self.http_session.get(url, stream=True, **kwargs)
deferred_response.addCallback(self._http_check_response)
deferred_response.addCallbacks(
partial(self._http_download, destination=dst, callback=callback),
partial(self._http_error_handler, url=url, errback=errback)
)
def _http_download(self, response, destination=None, callback=None):
def _stream_download(r, f):
for chunk in r.iter_content(chunk_size=128):
f.write(chunk)
yield None
def _rollback(r, f, d):
if r:
r.close()
if f:
f.close()
if os.path.exists(d):
os.remove(d)
filehandle = open(destination, 'wb')
cooperative_dl = cooperate(_stream_download(response, filehandle))
cooperative_dl.whenDone().addCallback(lambda _: response.close)
cooperative_dl.whenDone().addCallback(lambda _: filehandle.close)
cooperative_dl.whenDone().addCallback(
partial(callback, url=response.url, destination=destination)
)
cooperative_dl.whenDone().addErrback(
partial(_rollback, r=response, f=filehandle, d=destination)
)
def _http_error_handler(self, failure, url=None, errback=None):
failure.trap(HTTPError)
# Log error message
if errback:
errback(failure)
#staticmethod
def _http_check_response(response):
response.raise_for_status()
return response
#property
def http_session(self):
if not self._http_session:
# Log session start
self._http_session = Session()
return self._http_session

Is there a way to get twisted to operate on this generator in such a way that it yields control when the generator itself is not prepared to yield something?
No. All Twisted can do is invoke the code. If the code blocks indefinitely, then the calling thread is blocked indefinitely. This is a basic premise of the Python runtime.
Is there a way to get twisted to download files asynchronously using something full-featured like requests?
There's treq. You didn't say what "full-featured" means here but earlier you mentioned "retries", "proxies", and "cachecontrol". I don't believe treq currently has these features. You can find some kind of feature matrix in the treq docs (though I notice it doesn't include any of the features you mentioned - even for requests). I expect implementations of such features would be welcome as treq contributions.
Is there a way to get twisted to download files asynchronously using something full-featured like requests?
Run it in a thread - probably using Twisted's threadpool APIs.
What would the basic approach be to such a problem with twisted, independent of the http features I want to use from requests.
treq.

Lost HTTPS requests with parallel processing

I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols where I request the Questrade API with batches of 100 symbols.
import requests
from joblib import Parallel, delayed
def parallel_request(self, elem, result, url, key):
response = requests.get(''.join((url, elem)), headers=self.headers)
result.extend(response.json().get(key))
Parallel(n_jobs=-1, backend="threading")(
delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')\
for elem in self.batch_result
)
If I make over 110 HTTPS requests with Parallel class, then instead of getting 11,000 output I got 10,500 or 10,600. So I lost data with parallel processing. Be aware that I used two python module here, i.e. joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).
The following for loop worked perfectly, so I know my problem is with the Parallel class.
for elem in self.batch_result:
response = requests.get(''.join((self.uri, elem)), headers=self.headers)
self.symbol_ids_list.extend(response.json().get('symbols'))
How could I increase the performance of the last for loop without losing data?
UPDATE
A sample of self.batch_result (simplified result) could be ['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI', 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI', 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']
and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=' as seen in the above Questrade API link.
UPDATE 2
The Marat's answer was a good try but didn't give me a better result. The first test gave me 31,356 (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0 or the process block completely.
I found out that the Maximum allowed requests per second is 20. Link : http://www.questrade.com/api/documentation/rate-limiting. How could I increase the performance of the last for loop without losing data in considering that new information?

If you are not stuck with using joblib you could try some standard library parallel processing modules. In python2/3 multiprocessing.Pool is available and provides functions for mapping a task across parallel threads. A simplified version would look like this:
from multiprocessing import Pool
import requests
HEADERS = {} # define headers here
def parallel_request(symbols):
response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
return response.json()
if __name__ == '__main__':
p = Pool()
batch_result = ['AAME,ABAC,ABIL,...',
'BLDP,BLIN,BLNK,...',
'CUR,CVONW,CXDC,...',
...]
p.map(parallel_request, batch_result) # will return a list of len(batch_result) responses
There are asynchronous and iterable versions of map that you would probably want for larger sized jobs, and of course you could add parameters to your parallel_requests task to avoid hard coding things like I did. A caveat with using Pool is that any arguments passed to it have to be picklable.
In python3 the concurrent.futures module actually has a nice example of multithreaded url retrieval in the docs. With a little effort you could replace load_url in that example with your parallel_request function. There is a version of concurrent.futures backported to python2 as the futures module, as well.
These might require a bit more work in refactoring, so if there is a solution that sticks with joblib feel free to prefer that. On the off-chance that your problem is a bug in joblib, there are plenty of ways you could do this in a multithreaded fashion with standard library (albeit with some added boilerplate).

Most likely, it happens because some of HTTP calls fail due to network load. To test, change parallel_request:
def parallel_request(self, elem, result, url, key):
for i in range(3): # 3 retries
try:
response = requests.get(''.join((url, elem)), headers=self.headers)
except IOError:
continue
result.extend(response.json().get(key))
return
Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding extend with a lock:
import threading
...
lock = threading.Lock()
def parallel_request(self, elem, result, url, key):
response = requests.get(''.join((url, elem)), headers=self.headers)
lock.acquire()
result.extend(response.json().get(key))
lock.release()

Web crawler returning list vs generator vs producer/consumer

I want to recursively crawl a web-server that hosts thousands of files and then check if they are different from what's in the local repository (this is a part of checking the delivery infrastructure for bugs).
So far I've been playing around with various prototypes and here is what I noticed. If I do a straightforward recursion and put all the files into a list, the operation completes in around 230 seconds. Note that I make only one request per directory, so it makes sense to actually download the files I'm interested in elsewhere:
def recurse_links(base):
result = []
try:
f = urllib.request.urlopen(base)
soup = BeautifulSoup(f.read(), "html.parser")
for anchor in soup.find_all('a'):
href = anchor.get('href')
if href.startswith('/') or href.startswith('..'):
pass
elif href.endswith('/'):
recurse_links(base + href)
else:
result.append(base + href)
except urllib.error.HTTPError as httperr:
print('HTTP Error in ' + base + ': ' + str(httperr))
I figured, if I could start processing the files I'm interested in while the crawler is still working, I could save time. So the next thing I tried was a generator that could be further used as a coroutine. The generator took 260 seconds, slightly more, but still acceptable. Here's the generator:
def recurse_links_gen(base):
try:
f = urllib.request.urlopen(base)
soup = BeautifulSoup(f.read(), "html.parser")
for anchor in soup.find_all('a'):
href = anchor.get('href')
if href.startswith('/') or href.startswith('..'):
pass
elif href.endswith('/'):
yield from recurse_links_gen(base + href)
else:
yield base + href
except urllib.error.HTTPError as http_error:
print(f'HTTP Error in {base}: {http_error}')
Update
Answering some questions that came up in the comments section:
I've got roughly 370k files, but not all of them will make it to the next step. I will check them against a set or dictionary (to get O(1) lookup) before going ahead and compare them to local repo
After more tests it looks like sequential crawler takes less time in roughly 4 out of 5 attempts. And generator took less time once. So at this point is seems like generator is okay
At this point consumer doesn't do anything other than get an item from queue, since it's a concept. However I have flexibility in what I will do with the file URL I get from producer. I can for instance, download only first 100KB of file, calculate it's checksum while in memory and then compare to a pre-calculated local version. What's clear though is that if simply adding thread creation bumps my execution time by a factor of 4 to 5, adding work on consumer thread will not make it any faster.
Finally I decided to give producer/consumer/queue a shot and a simple PoC ran 4 times longer while loading 100% of one CPU core. Here is the brief code (the crawler is the same generator-based crawler from above):
class ProducerThread(threading.Thread):
def __init__(self, done_event, url_queue, crawler, name):
super().__init__()
self._logger = logging.getLogger(__name__)
self.name = name
self._queue = url_queue
self._crawler = crawler
self._event = done_event
def run(self):
for file_url in self._crawler.crawl():
try:
self._queue.put(file_url)
except Exception as ex:
self._logger.error(ex)
So here are my questions:
Are the threads created with threading library actually threads and is there a way for them to be actually distributed between various CPU cores?
I believe the great deal of performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?
Is the generator slower because it has to save the function context and then load it again over and over?
What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever and thus make the whole program faster?

1) Are the threads created with threading library actually threads and is there a way for them to be actually distributed between various CPU cores?
Yes, these are the threads, but to utilize multiple cores of your CPU, you need to use multiprocessing package.
2) I believe the great deal of performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?
It depends on the number of threads you are created, one reason may be due to context switches, your threads are making. The optimum value for thread should be 2/3, i.e create 2/3 threads and check the performance again.
3) Is the generator slower because it has to save the function context and then load it again over and over?
Generators are not slow, it is rather good for the problem you are working on, as you find a url , you put that into queue.
4) What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever and thus make the whole program faster?
Create a ConsumerThread class, which fetches the data(url in your case) from the queue and start working on it.

gevent / requests hangs while making lots of head requests

I need to make 100k head requests, and I'm using gevent on top of requests. My code runs for a while, but then eventually hangs. I'm not sure why it's hanging, or whether it's hanging inside requests or gevent. I'm using the timeout argument inside both requests and gevent.
Please take a look at my code snippet below, and let me know what I should change.
import gevent
from gevent import monkey, pool
monkey.patch_all()
import requests
def get_head(url, timeout=3):
try:
return requests.head(url, allow_redirects=True, timeout=timeout)
except:
return None
def expand_short_urls(short_urls, chunk_size=100, timeout=60*5):
chunk_list = lambda l, n: ( l[i:i+n] for i in range(0, len(l), n) )
p = pool.Pool(chunk_size)
print 'Expanding %d short_urls' % len(short_urls)
results = {}
for i, _short_urls_chunked in enumerate(chunk_list(short_urls, chunk_size)):
print '\t%d. processing %d urls # %s' % (i, chunk_size, str(datetime.datetime.now()))
jobs = [p.spawn(get_head, _short_url) for _short_url in _short_urls_chunked]
gevent.joinall(jobs, timeout=timeout)
results.update({_short_url:job.get().url for _short_url, job in zip(_short_urls_chunked, jobs) if job.get() is not None and job.get().status_code==200})
return results
I've tried grequests, but it's been abandoned, and I've gone through the github pull requests, but they all have issues too.

The RAM usage you are observing mainly stems from all the data that piles up while storing 100.000 response objects, and all the underlying overhead. I have reproduced your application case, and fired off HEAD requests against 15000 URLS from the top Alexa ranking. It did not really matter
whether I used a gevent Pool (i.e. one greenlet per connection) or a fixed set of greenlets, all requesting multiple URLs
how large I set the pool size
In the end, the RAM usage grew over time, to considerable amounts. However, I noticed that changing from requests to urllib2 already lead to a reduction in RAM usage, by about factor two. That is, I replaced
result = requests.head(url)
with
request = urllib2.Request(url)
request.get_method = lambda : 'HEAD'
result = urllib2.urlopen(request)
Some other advice: do not use two timeout mechanisms. Gevent's timeout approach is very solid, and you can easily use it like this:
def gethead(url):
result = None
try:
with Timeout(5, False):
result = requests.head(url)
except Exception as e:
result = e
return result
Might look tricky, but either returns None (after quite precisely 5 seconds, and indicates timeout), any exception object representing a communication error, or the response. Works great!
Although this likely is not part of the issue, in such cases I recommend to keep workers alive and let them work on multiple items each! The overhead of spawning greenlets is small, indeed. Still, this would be a very simple solution with a set of long-lived greenlets:
def qworker(qin, qout):
while True:
try:
qout.put(gethead(qin.get(block=False)))
except Empty:
break
qin = Queue()
qout = Queue()
for url in urls:
qin.put(url)
workers = [spawn(qworker, qin, qout) for i in xrange(POOLSIZE)]
joinall(workers)
returnvalues = [qout.get() for _ in xrange(len(urls))]
Also, you really need to appreciate that this is a large-scale problem you are tackling there, yielding non-standard issues. When I reproduced your scenario with a timeout of 20 s and 100 workers and 15000 URLs to be requested, I easily got a large number of sockets:
# netstat -tpn | wc -l
10074
That is, the OS had more than 10000 sockets to manage, most of them in TIME_WAIT state. I also observed "Too many open files" errors, and tuned the limits up, via sysctl. When you request 100.000 URLs you will probably hit such limits, too, and you need to come up with measures to prevent system starving.
Also note the way you are using requests, it automatically follows redirects from HTTP to HTTPS, and automatically verifies the certificate, all of which surely costs RAM.
In my measurements, when I divided the number of requested URLs by the runtime of the program, I almost never passed 100 responses/s, which is the result of the high-latency connections to foreign servers all over the world. I guess you also are affected by such a limit. Adjust the rest of the architecture to this limit, and you will probably be able to generate a data stream from the Internet to disk (or database) with not so large RAM usage inbetween.
I should address your two main questions, specifically:
I think gevent/the way you are using it is not your problem. I think you are just underestimating the complexity of your task. It comes along with nasty problems, and drives your system to its limits.
your RAM usage issue: Start off by using urllib2, if you can. Then, if things accumulate still too high, you need to work against accumulation. Try to produce a steady state: you might want to start writing off data to disk and generally work towards the situation where objects can become garbage collected.
your code "eventually hangs": probably this is as of your RAM issue. If it is not, then do not spawn so many greenlets, but reuse them as indicated. Also, further reduce concurrency, monitor the number of open sockets, increase system limits if necessary, and try to find out exactly where your software hangs.

I'm not sure if this will resolve your issue, but you are not using pool.Pool() correctly.
Try this:
def expand_short_urls(short_urls, chunk_size=100):
# Pool() automatically limits your process to chunk_size greenlets running concurrently
# thus you don't need to do all that chunking business you were doing in your for loop
p = pool.Pool(chunk_size)
print 'Expanding %d short_urls' % len(short_urls)
# spawn() (both gevent.spawn() and Pool.spawn()) returns a gevent.Greenlet object
# NOT the value your function, get_head, will return
threads = [p.spawn(get_head, short_url) for short_url in short_urls]
p.join()
# to access the returned value of your function, access the Greenlet.value property
results = {short_url: thread.value.url for short_url, thread in zip(short_urls, threads)
if thread.value is not None and thread.value.status_code == 200}
return results

how to download multiple file simultaneously and join them in python?

I have some split files on a remote server.
I have tried downloading them one by one and join them. But it takes a lot of time. I googled and found that simultaneous download might speed up things. The script is on Python.
My pseudo is like this:
url1 = something
url2 = something
url3 = something
data1 = download(url1)
data2 = download(url2)
data3 = download(url3)
wait for all download to complete
join all data and save
Could anyone point me to a direction by which I can load files all simultaneously and wait till they are done.
I have tried by creating a class. But again I can't figure out how to wait till all complete.
I am more interested in Threading and Queue feature and I can import them in my platform.
I have tried with Thread and Queue with an example found on this site. Here is the code pastebin.com/KkiMLTqR . But it does not wait or waits forever..not sure

There are 2 ways to do things simultaneously. Or, really, 2-3/4 or so:
Multiple threads
Or multiple processes, especially if the "things" take a lot of CPU power
Or coroutines or greenlets, especially if there are thousands of "things"
Or pools of one of the above
Event loops (either coded manually)
Or hybrid greenlet/event loop systems like gevent.
If you have 1000 URLs, you probably don't want to do 1000 requests at the same time. For example, web browsers typically only do something like 8 requests at a time. A pool is a nice way to do only 8 things at a time, so let's do that.
And, since you're only doing 8 things at a time, and those things are primarily I/O bound, threads are perfect.
I'll implement it with futures. (If you're using Python 2.x, or 3.0-3.1, you will need to install the backport, futures.)
import concurrent.futures
urls = ['http://example.com/foo',
'http://example.com/bar']
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
result = b''.join(executor.map(download, urls))
with open('output_file', 'wb') as f:
f.write(result)
Of course you need to write the download function, but that's exactly the same function you'd write if you were doing these one at a time.
For example, using urlopen (if you're using Python 2.x, use urllib2 instead of urllib.request):
def download(url):
with urllib.request.urlopen(url) as f:
return f.read()
If you want to learn how to build a thread pool executor yourself, the source is actually pretty simple, and multiprocessing.pool is another nice example in the stdlib.
However, both of those have a lot of excess code (handling weak references to improve memory usage, shutting down cleanly, offering different ways of waiting on the results, propagating exceptions properly, etc.) that may get in your way.
If you look around PyPI and ActiveState, you will find simpler designs like threadpool that you may find easier to understand.
But here's the simplest joinable threadpool:
class ThreadPool(object):
def __init__(self, max_workers):
self.queue = queue.Queue()
self.workers = [threading.Thread(target=self._worker) for _ in range(max_workers)]
def start(self):
for worker in self.workers:
worker.start()
def stop(self):
for _ in range(self.workers):
self.queue.put(None)
for worker in self.workers:
worker.join()
def submit(self, job):
self.queue.put(job)
def _worker(self):
while True:
job = self.queue.get()
if job is None:
break
job()
Of course the downside of a dead-simple implementation is that it's not as friendly to use as concurrent.futures.ThreadPoolExecutor:
urls = ['http://example.com/foo',
'http://example.com/bar']
results = [list() for _ in urls]
results_lock = threading.Lock()
def download(url, i):
with urllib.request.urlopen(url) as f:
result = f.read()
with results_lock:
results[i] = url
pool = ThreadPool(max_workers=8)
pool.start()
for i, url in enumerate(urls):
pool.submit(functools.partial(download, url, i))
pool.stop()
result = b''.join(results)
with open('output_file', 'wb') as f:
f.write(result)

You can use an async framwork like twisted.
Alternatively this is one thing that Python's threads do ok at. Since you are mostly IO bound

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Best strategy to download many images with Python? - python

Related

Asynchronous download of files with twisted and (tx)requests

Lost HTTPS requests with parallel processing

Web crawler returning list vs generator vs producer/consumer

gevent / requests hangs while making lots of head requests

how to download multiple file simultaneously and join them in python?

Categories

Resources