Asynchronous download of files with twisted and (tx)requests - python

I'm trying to download file(s) from the internet from within a twisted application. I'd like to do this using requests because of the features it provides directly or through well-maintained companion libraries (retries, proxies, cache control, etc.). I am open to a twisted-only solution which does not have these features, but I can't seem to find one anyway.
The files are expected to be fairly large and will be downloaded over slow connections. I'm therefore using requests' stream=True interface and the response's iter_content. A more or less complete code fragment is listed at the end of this question. The entry point is the http_download function, called with a url, a dst to write the file to, a callback, and an optional errback to handle a failed download. I've stripped away some of the code involved in preparing the destination (creating folders, etc.) and the code to close the session on reactor exit, but I think it should still work as is.
This code works. The file is downloaded, and the twisted reactor continues to operate. However, I seem to have a problem with this bit of code:
def _stream_download(r, f):
    for chunk in r.iter_content(chunk_size=128):
        f.write(chunk)
        yield None

cooperative_dl = cooperate(_stream_download(response, filehandle))
Because iter_content returns only when it has a chunk to return, the reactor handles a chunk, runs a few other bits of code, and then sits waiting for the next chunk instead of keeping itself busy, for example updating a spinning wait animation in the GUI (code not actually posted here).
Here are the questions -
Is there a way to get twisted to operate on this generator in such a way that it yields control when the generator itself is not prepared to yield something? I came across some docs for twisted.flow which seemed appropriate, but this does not seem to have made it into twisted, or it no longer exists today. This question can be read independently of the specifics, i.e., with respect to any arbitrary blocking generator, or in the immediate context of the question.
Is there a way to get twisted to download files asynchronously using something full-featured like requests? Is there an existing twisted module which just does this which I can just use?
What would the basic approach be to such a problem with twisted, independent of the HTTP features I want to use from requests? Let's assume I'm prepared to ditch them or otherwise implement them. How would I download a file asynchronously over HTTP?
import os
import re
from functools import partial
from six.moves.urllib.parse import urlparse
from requests import HTTPError
from twisted.internet.task import cooperate
from txrequests import Session


class HttpClientMixin(object):
    def __init__(self, *args, **kwargs):
        self._http_session = None

    def http_download(self, url, dst, callback, errback=None, **kwargs):
        dst = os.path.abspath(dst)
        # Log request
        deferred_response = self.http_session.get(url, stream=True, **kwargs)
        deferred_response.addCallback(self._http_check_response)
        deferred_response.addCallbacks(
            partial(self._http_download, destination=dst, callback=callback),
            partial(self._http_error_handler, url=url, errback=errback)
        )

    def _http_download(self, response, destination=None, callback=None):
        def _stream_download(r, f):
            # Blocks inside iter_content until the next chunk arrives.
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
                yield None

        def _rollback(failure, r, f, d):
            if r:
                r.close()
            if f:
                f.close()
            if os.path.exists(d):
                os.remove(d)

        filehandle = open(destination, 'wb')
        cooperative_dl = cooperate(_stream_download(response, filehandle))
        cooperative_dl.whenDone().addCallback(lambda _: response.close())
        cooperative_dl.whenDone().addCallback(lambda _: filehandle.close())
        cooperative_dl.whenDone().addCallback(
            partial(callback, url=response.url, destination=destination)
        )
        cooperative_dl.whenDone().addErrback(
            partial(_rollback, r=response, f=filehandle, d=destination)
        )

    def _http_error_handler(self, failure, url=None, errback=None):
        failure.trap(HTTPError)
        # Log error message
        if errback:
            errback(failure)

    @staticmethod
    def _http_check_response(response):
        response.raise_for_status()
        return response

    @property
    def http_session(self):
        if not self._http_session:
            # Log session start
            self._http_session = Session()
        return self._http_session

Is there a way to get twisted to operate on this generator in such a way that it yields control when the generator itself is not prepared to yield something?
No. All Twisted can do is invoke the code. If the code blocks indefinitely, then the calling thread is blocked indefinitely. This is a basic premise of the Python runtime.
Is there a way to get twisted to download files asynchronously using something full-featured like requests?
There's treq. You didn't say what "full-featured" means here but earlier you mentioned "retries", "proxies", and "cachecontrol". I don't believe treq currently has these features. You can find some kind of feature matrix in the treq docs (though I notice it doesn't include any of the features you mentioned - even for requests). I expect implementations of such features would be welcome as treq contributions.
Is there a way to get twisted to download files asynchronously using something full-featured like requests?
Run it in a thread - probably using Twisted's threadpool APIs.
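For the "run it in a thread" route, here is a minimal sketch using twisted.internet.threads.deferToThread with a plain requests.Session. The function names, chunk size, and parameters are just illustrative assumptions, not an established API:
import requests
from twisted.internet import threads

def _blocking_download(url, destination, session):
    # Runs entirely in a worker thread, so iter_content may block freely
    # without stalling the reactor.
    response = session.get(url, stream=True)
    response.raise_for_status()
    with open(destination, 'wb') as f:
        for chunk in response.iter_content(chunk_size=65536):
            f.write(chunk)
    return destination

def http_download_threaded(url, destination, callback, errback=None, session=None):
    session = session or requests.Session()
    d = threads.deferToThread(_blocking_download, url, destination, session)
    d.addCallback(callback)
    if errback is not None:
        d.addErrback(errback)
    return d
The reactor stays free while the thread blocks on the network; the callback runs back in the reactor thread, so keep it cheap or have it schedule further work.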
What would the basic approach be to such a problem with twisted, independent of the http features I want to use from requests.
treq.
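As a rough illustration of the treq approach, a minimal streaming download might look like the following. This is a sketch only; the URL and filename are placeholders, and it assumes treq is installed:
import treq
from twisted.internet import defer
from twisted.internet.task import react

@defer.inlineCallbacks
def download(reactor, url, destination):
    response = yield treq.get(url)
    with open(destination, 'wb') as f:
        # treq.collect delivers the body chunk by chunk, so the whole
        # file is never held in memory.
        yield treq.collect(response, f.write)

react(download, ['https://example.invalid/big.iso', 'big.iso'])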

Related

Best strategy to download many images with Python?

I'm trying to download many images from a list of URLs. By many, I mean in the vicinity of 10,000. The images vary in size, from a few hundred KB to 15 MB.
I wonder what would be the best strategy to go about this task, trying to minimize the total time to finish, and to avoid freezing.
I use this function to save each image:
def save_image(name, base_dir, data):
    with open(base_dir + name, "wb+") as destination:
        for chunk in data:
            destination.write(chunk)
I take the file extension from the URL with this function:
from os.path import splitext
from urllib.parse import urlparse  # Python 3

def get_ext(url):
    """Return the filename extension from url, or ''.

    From: https://stackoverflow.com/questions/28288987/identify-the-file-extension-of-a-url
    """
    parsed = urlparse(url)
    root, ext = splitext(parsed.path)
    return ext  # or ext[1:] if you don't want the leading '.'
And to get the images I just do:
for image in listofimages:
    r = requests.get(image["url"], timeout=5)
    extension = get_ext(image["url"])
    name = str(int(image['ID'])) + "AA" + extension
    save_image(name, "images/", r)
Now putting it all together is quite slow. Hence my question.
One, as hinted in the comments above, you probably want to parallelize the work. Multiprocessing and multithreading will both work, but with relatively high overhead. Alternatively, you could use an asynchronous approach, such as patching your network libraries with gevent, or using asyncio together with an async-aware HTTP client; httpx, for instance, would do.
Regardless of the approach you take to parallelize I/O, you might find the queue paradigm convenient to work with -- put all your URLs into a queue, and let your workers consume them.
Two, to deal with non-responsive web servers blocking your workers, you'll probably need to set socket timeouts; check how to do that in the HTTP client library you choose. For instance, the popular requests simply takes a timeout parameter.
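As a concrete sketch of the thread-pool option, here is one way to do it with concurrent.futures, reusing save logic and the get_ext helper from the question. The worker count, chunk size, and output directory are arbitrary assumptions:
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def download_one(image, timeout=5):
    url = image["url"]
    r = requests.get(url, stream=True, timeout=timeout)
    r.raise_for_status()
    name = str(int(image["ID"])) + "AA" + get_ext(url)
    # Stream to disk so a 15 MB image is never fully buffered in memory.
    with open("images/" + name, "wb") as f:
        for chunk in r.iter_content(chunk_size=65536):
            f.write(chunk)
    return name

def download_all(listofimages, workers=16):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(download_one, img) for img in listofimages]
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as exc:
                print("download failed:", exc)  # log and move on
Increasing workers helps until you saturate your bandwidth or the servers start pushing back, so it is worth experimenting with the pool size.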

Lost HTTPS requests with parallel processing

I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols, which I request from the Questrade API in batches of 100 symbols.
import requests
from joblib import Parallel, delayed

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    result.extend(response.json().get(key))

Parallel(n_jobs=-1, backend="threading")(
    delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')
    for elem in self.batch_result
)
If I make over 110 HTTPS requests with the Parallel class, then instead of getting 11,000 outputs I get 10,500 or 10,600. So I lose data with parallel processing. Be aware that I used two Python modules here, i.e. joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).
The following for loop worked perfectly, so I know my problem is with the Parallel class.
for elem in self.batch_result:
    response = requests.get(''.join((self.uri, elem)), headers=self.headers)
    self.symbol_ids_list.extend(response.json().get('symbols'))
How could I increase the performance of the last for loop without losing data?
UPDATE
A sample of self.batch_result (simplified result) could be ['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI', 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI', 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']
and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=' as seen in the above Questrade API link.
UPDATE 2
Marat's answer was a good try, but it didn't give me a better result. The first test gave me 31,356 (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0, or the process blocked completely.
I found out that the maximum allowed number of requests per second is 20. Link: http://www.questrade.com/api/documentation/rate-limiting. How could I increase the performance of the last for loop without losing data, considering that new information?
If you are not stuck with using joblib, you could try some standard-library parallel processing modules. In Python 2/3, multiprocessing.Pool is available and provides functions for mapping a task across a pool of worker processes. A simplified version would look like this:
from multiprocessing import Pool
import requests

HEADERS = {}  # define headers here

def parallel_request(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
    return response.json()

if __name__ == '__main__':
    p = Pool()
    batch_result = ['AAME,ABAC,ABIL,...',
                    'BLDP,BLIN,BLNK,...',
                    'CUR,CVONW,CXDC,...',
                    ...]
    p.map(parallel_request, batch_result)  # will return a list of len(batch_result) responses
There are asynchronous and iterable versions of map that you would probably want for larger jobs, and of course you could add parameters to your parallel_request task to avoid hard-coding things like I did. A caveat with using Pool is that any arguments passed to it have to be picklable.
In Python 3 the concurrent.futures module has a nice example of multithreaded URL retrieval in its docs. With a little effort you could replace load_url in that example with your parallel_request function. There is also a version of concurrent.futures backported to Python 2 as the futures module.
These might require a bit more refactoring work, so if there is a solution that sticks with joblib, feel free to prefer that. On the off chance that your problem is a bug in joblib, there are plenty of ways you could do this in a multithreaded fashion with the standard library (albeit with some added boilerplate); one such sketch follows.
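A minimal concurrent.futures version of the batched requests might look like this. It is a sketch only; HEADERS and the base URI are placeholders, and the batches are what self.batch_result holds in the question:
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

HEADERS = {}  # auth headers go here
URI = 'https://api01.iq.questrade.com/v1/symbols?names='

def fetch(batch):
    response = requests.get(URI + batch, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json().get('symbols', [])

def fetch_all(batches, max_workers=8):
    symbols = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, batch) for batch in batches]
        for future in as_completed(futures):
            symbols.extend(future.result())  # raises if that batch failed
    return symbols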
Most likely, it happens because some of the HTTP calls fail due to network load. To test this, change parallel_request:
def parallel_request(self, elem, result, url, key):
    for i in range(3):  # 3 retries
        try:
            response = requests.get(''.join((url, elem)), headers=self.headers)
        except IOError:
            continue
        result.extend(response.json().get(key))
        return
Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding extend with a lock:
import threading
...
lock = threading.Lock()

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    lock.acquire()
    result.extend(response.json().get(key))
    lock.release()
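Given the 20 requests per second limit mentioned in the update, you may also need to pace the workers, whichever parallel approach you use. One simple scheme (a sketch only, names hypothetical) is to reserve a start time for each request under a lock and combine that with the retry loop above:
import threading
import time

import requests

RATE_LIMIT = 20             # requests per second allowed by the API
_rate_lock = threading.Lock()
_next_slot = [time.time()]  # next moment a request may start

def rate_limited_get(url, headers, retries=3):
    for _ in range(retries):
        # Reserve a time slot so that, across all threads, at most
        # RATE_LIMIT requests start in any one-second window.
        with _rate_lock:
            start_at = max(time.time(), _next_slot[0])
            _next_slot[0] = start_at + 1.0 / RATE_LIMIT
        time.sleep(max(0.0, start_at - time.time()))
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return response
        except (IOError, requests.HTTPError):
            continue
    return None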

Python Twisted multithreaded TCP proxy

I am trying to write a TCP proxy using Python's twisted framework. I started with Twisted's port-forwarding example and it seems to do the job in a standard scenario. The problem is that I have a rather peculiar scenario: we need to process each TCP data packet and look for a certain pattern.
If the pattern matches, we need to run a certain process. This process takes anywhere between 30-40 seconds (I know it's not a good design, but currently that's how things stand). The trouble is that once this process starts, all other packets get held up until it completes. So if there are 100 live connections, and even one of them triggers the process, the remaining 99 are stuck.
Is there a standard 'twisted' way wherein each connection/session is handled in a separate thread so that the 'blocking process' does not intervene with the other live connections?
Example Code:
from time import sleep

from twisted.internet import reactor
from twisted.protocols import portforward
from twisted.internet import threads

def processingOperation(data):
    # doing the processing operation here
    sleep(30)
    return data

def server_dataReceived(self, data):
    if data.find("pattern we need to test") != -1:
        data = processingOperation(data)
    portforward.Proxy.dataReceived(self, data)

portforward.ProxyServer.dataReceived = server_dataReceived

def client_dataReceived(self, data):
    portforward.Proxy.dataReceived(self, data)

portforward.ProxyClient.dataReceived = client_dataReceived

reactor.listenTCP(8383, portforward.ProxyFactory('xxx.yyy.uuu.iii', 80))
reactor.run()
Of course there is. You defer the processing to a thread. For example:
def render_POST(self, request):
    # some code you may have to run before processing
    d = threads.deferToThread(method_that_does_the_processing, request)
    return ''
There is a catch: this will return before the processing is done, so the client gets its answer back right away. You might therefore want to return 202/Accepted instead of 200/OK (or my dummy '').
If you need to return only after the processing is complete, you can use inlineCallbacks (http://twistedmatrix.com/documents/10.2.0/api/twisted.internet.defer.inlineCallbacks.html).
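Applied to the port-forwarding proxy above, that idea looks roughly like this. It is a sketch, and it assumes processingOperation is safe to run in a worker thread:
from twisted.internet import threads
from twisted.protocols import portforward

def server_dataReceived(self, data):
    if data.find("pattern we need to test") != -1:
        # Run the 30-40 second processing off the reactor thread and
        # forward the processed data once it is ready.
        d = threads.deferToThread(processingOperation, data)
        d.addCallback(lambda processed: portforward.Proxy.dataReceived(self, processed))
    else:
        portforward.Proxy.dataReceived(self, data)

portforward.ProxyServer.dataReceived = server_dataReceived
Other connections keep flowing while the thread works, but data arriving on the same connection during processing will be forwarded ahead of the processed packet, so this only fits if that reordering is acceptable.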

Tornado async call to a function

I am making a web application using Python + Tornado which basically serves files to users. I have no database.
The files are either directly picked up and served if they are available, or generated on the fly if not.
I want the clients to be served in an async manner, because some files may already be available, while others need to be generated (thus they need to wait, and I don't want them to block other users).
I have a class that manages the picking or generation of files, and I just need to call it from Tornado.
What is the best way (most efficient on CPU and RAM) to achieve that? Should I use a thread? A sub process? A simple gen.Task like this one?
Also, I would like my implementation to work on Google App Engines (I think they do not allow sub processes to be spawned?).
I'm relatively new to the async web servicing, so any help is welcome.
I've found the answers to my questions: the gen.Task example is indeed the best way to implement an async call, and it works because the example uses a Python coroutine, which I didn't understand at first glance because I thought yield was only used to return values from generators.
Concrete example:
import tornado.web
from tornado import gen
from tornado.web import asynchronous

class MyHandler(tornado.web.RequestHandler):
    @asynchronous
    @gen.engine
    def get(self):
        response = yield gen.Task(self.dosomething, 'argument')
What is important here is the combination of two things:
yield, which in fact creates a coroutine (a kind of pseudo-thread, which is very lightweight and designed to be highly concurrency-friendly).
http://www.python.org/dev/peps/pep-0342/
gen.Task(), which is a non-blocking (async) function; if you run a coroutine over a blocking function, it won't be async. gen.Task() is provided by Tornado specifically to work with Python's coroutine syntax. More info:
http://www.tornadoweb.org/documentation/gen.html
So a canonical example of an async call in Python using coroutines:
response = yield non_blocking_func(**kwargs)
Now the documentation has a solution.
Simple example:
import os.path

import tornado.web
from tornado import gen

class MyHandler(tornado.web.RequestHandler):
    @gen.coroutine
    def get(self, filename):
        result = yield self.some_usefull_process(filename)
        self.write(result)

    @gen.coroutine
    def some_usefull_process(self, filename):
        if not os.path.exists(filename):
            status = yield self.generate_file(filename)
            result = 'File created'
        else:
            result = 'File exists'
        raise gen.Return(result)

    @gen.coroutine
    def generate_file(self, filename):
        fd = open(filename, 'w')
        fd.write('created')
        fd.close()
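If the file generation itself is blocking (as the open/write above is), a thread pool keeps the IOLoop responsive. Here is a hedged sketch assuming a Tornado version that provides tornado.concurrent.run_on_executor:
from concurrent.futures import ThreadPoolExecutor

import tornado.web
from tornado import gen
from tornado.concurrent import run_on_executor

class MyHandler(tornado.web.RequestHandler):
    executor = ThreadPoolExecutor(max_workers=4)  # shared by all requests

    @gen.coroutine
    def get(self, filename):
        result = yield self.generate_file(filename)
        self.write(result)

    @run_on_executor
    def generate_file(self, filename):
        # Runs in a worker thread, so the blocking file I/O does not
        # stall the IOLoop.
        with open(filename, 'w') as fd:
            fd.write('created')
        return 'File created'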

Python 2.5 - multi-threaded for loop

I've got a piece of code:
for url in get_lines(file):
    visit(url, timeout=timeout)
It gets URLs from a file and visits them (via urllib2) in a for loop.
Is it possible to do this in a few threads? For example, 10 visits at the same time.
I've tried:
for url in get_lines(file):
    Thread(target=visit, args=(url,), kwargs={"timeout": timeout}).start()
But it does not work: there is no effect, and the URLs are visited as before.
A simplified version of the visit function:
def visit(url, proxy_addr=None, timeout=30):
    (...)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()
To expand on senderle's answer, you can use the Pool class in multiprocessing to do this easily:
from multiprocessing import Pool

pool = Pool(processes=5)
pages = pool.map(visit, get_lines(file))
When the map call returns, "pages" will be a list of the contents of the URLs. You can adjust the number of processes to whatever is suitable for your system.
I suspect that you've run into the Global Interpreter Lock. Basically, threading in Python can't achieve concurrency, which seems to be your goal. You need to use multiprocessing instead.
multiprocessing is designed to have a roughly analogous interface to threading, but it has a few quirks. Your visit function as written above should work correctly, I believe, because it's written in a functional style, without side effects.
In multiprocessing, the Process class is the equivalent of the Thread class in threading. It has all the same methods, so it's a drop-in replacement in this case. (Though I suppose you could use Pool as JoeZuntz suggests -- but I would test with the basic Process class first, to see if it fixes the problem.)
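A minimal sketch of that drop-in swap, reusing get_lines and visit from the question:
from multiprocessing import Process

processes = []
for url in get_lines(file):
    p = Process(target=visit, args=(url,), kwargs={"timeout": timeout})
    p.start()
    processes.append(p)

# Wait for every visit to finish before moving on.
for p in processes:
    p.join()
For a long list of URLs you would still want to cap the number of simultaneous processes, which is exactly what the Pool approach above does.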
