Lost HTTPS requests with parallel processing - python
I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols, which I request from the Questrade API in batches of 100 symbols.
import requests
from joblib import Parallel, delayed

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    result.extend(response.json().get(key))

Parallel(n_jobs=-1, backend="threading")(
    delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')
    for elem in self.batch_result
)
If I make over 110 HTTPS requests with the Parallel class, then instead of getting 11,000 outputs I get only 10,500 or 10,600, so I lose data with the parallel processing. Be aware that I use two Python modules here, i.e. joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).
The following for loop worked perfectly, so I know my problem is with the Parallel class.
for elem in self.batch_result:
    response = requests.get(''.join((self.uri, elem)), headers=self.headers)
    self.symbol_ids_list.extend(response.json().get('symbols'))
How could I increase the performance of the last for loop without losing data?
UPDATE
A sample of self.batch_result (simplified result) could be ['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI', 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI', 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']
and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=' as seen in the above Questrade API link.
UPDATE 2
Marat's answer was a good try but didn't give me a better result. The first test gave me 31,356 results (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0, or the process blocked completely.
I found out that the maximum number of allowed requests per second is 20 (link: http://www.questrade.com/api/documentation/rate-limiting). How could I increase the performance of the last for loop without losing data, considering that new information?
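For example, I imagine that pacing the submissions could keep me under that limit while still running the requests in parallel. An untested sketch of what I have in mind, inside the same class and reusing the names from the code above:

import time
from multiprocessing.pool import ThreadPool

MAX_PER_SECOND = 20  # Questrade's documented limit

pool = ThreadPool(10)
for elem in self.batch_result:
    # parallel_request extends self.symbol_ids_list itself, as in the code above
    pool.apply_async(self.parallel_request,
                     args=(elem, self.symbol_ids_list, self.uri, 'symbols'))
    time.sleep(1.0 / MAX_PER_SECOND)  # never start more than 20 requests per second
pool.close()
pool.join()

Is something like that the right approach?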
If you are not stuck with using joblib, you could try some standard-library parallel processing modules. In Python 2 and 3, multiprocessing.Pool is available and provides functions for mapping a task across parallel worker processes. A simplified version would look like this:
from multiprocessing import Pool
import requests

HEADERS = {}  # define headers here

def parallel_request(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
    return response.json()

if __name__ == '__main__':
    p = Pool()
    batch_result = ['AAME,ABAC,ABIL,...',
                    'BLDP,BLIN,BLNK,...',
                    'CUR,CVONW,CXDC,...',
                    ...]
    p.map(parallel_request, batch_result)  # will return a list of len(batch_result) responses
There are asynchronous and iterable versions of map (such as map_async and imap_unordered, sketched below) that you would probably want for larger jobs, and of course you could add parameters to your parallel_request task to avoid hard-coding things like I did. A caveat with using Pool is that any arguments passed to it have to be picklable.
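For example, the iterable variant could replace the p.map call above with something like this (a sketch reusing parallel_request and batch_result from the snippet above; imap_unordered hands back each result as soon as its worker finishes):

p = Pool()
symbol_ids_list = []
# each payload is the parsed JSON dict returned by parallel_request
for payload in p.imap_unordered(parallel_request, batch_result):
    symbol_ids_list.extend(payload.get('symbols'))
p.close()
p.join()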
In Python 3, the concurrent.futures module actually has a nice example of multithreaded URL retrieval in the docs. With a little effort you could replace load_url in that example with your parallel_request function. There is also a version of concurrent.futures backported to Python 2 as the futures module.
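For illustration, a minimal sketch along those lines (Python 3, reusing the parallel_request function and batch_result list from the Pool example above; max_workers is an arbitrary choice):

import concurrent.futures

symbol_ids_list = []
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # submit one future per batch of symbols; as_completed yields each future
    # as soon as its request has finished
    futures = [executor.submit(parallel_request, symbols) for symbols in batch_result]
    for future in concurrent.futures.as_completed(futures):
        symbol_ids_list.extend(future.result().get('symbols'))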
These might require a bit more refactoring work, so if there is a solution that sticks with joblib, feel free to prefer that. On the off-chance that your problem is a bug in joblib, there are plenty of ways you could do this in a multithreaded fashion with the standard library (albeit with some added boilerplate).
Most likely, it happens because some of the HTTP calls fail due to network load. To test this, change parallel_request:
def parallel_request(self, elem, result, url, key):
    for i in range(3):  # 3 retries
        try:
            response = requests.get(''.join((url, elem)), headers=self.headers)
        except IOError:
            continue
        result.extend(response.json().get(key))
        return
Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding extend with a lock:
import threading
...

lock = threading.Lock()

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    lock.acquire()
    result.extend(response.json().get(key))
    lock.release()
Related
How to do Multithreading Put Request in Python
What is the best and fastest pythonic way to program multithreading for a PUT request that is within a for loop? Right now, as it is synchronous, it takes too long to run the code. Therefore, we would like to include multithreading to improve the run time.

Synchronous:

def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            rp = requests.put(url=self.url, headers=self.headers, params=self.params, data=data)
    except StopIteration:
        pass

We attempted threading, but starting a thread per iteration just seems unnecessary; we have thousands of iterations and might end up with many more, so that would become a big mess of threads. Maybe pools would solve the problem, but this is where I am stuck. Does anyone have an idea how to solve this?

Parallel:

def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            threading.Thread(target=lambda: request_put(url, self.headers, self.params, data)).start()
    except StopIteration:
        pass

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)

Any help is highly appreciated. Thank you for your time!
If you want to use multithreading, then the following should work. However, I am a bit confused about a few things. You seem to be doing PUT requests in a loop, but all with the exact same arguments, and I don't quite see how you can get a StopIteration exception in the code you posted. Also, using a lambda expression as your target argument, rather than just specifying the function name and passing the arguments as a separate tuple or list (as is done below), is a bit unusual. Assuming that the loop variable i is in reality being used to index one value that actually varies in the call to request_put, then function map could be a better choice than apply_async (a rough sketch of that follows the code below). It probably does not matter significantly for multithreading, but it could make a performance difference for multiprocessing if you had a very large list of elements to loop over.

from multiprocessing.pool import ThreadPool

def econ_post_customers(self, file, data):
    MAX_THREADS = 100  # some suitable value
    n_tasks = len(file['collection'])
    pool_size = min(MAX_THREADS, n_tasks)
    pool = ThreadPool(pool_size)
    for i in range(n_tasks):
        pool.apply_async(request_put, args=(url, self.headers, self.params, data))
    # wait for all tasks to complete:
    pool.close()
    pool.join()

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)
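As a rough illustration of that map-based alternative, assuming the per-request value really is taken from file['collection'] (functools.partial pins the arguments that do not vary, and the url/headers/params names follow the code above):

from functools import partial
from multiprocessing.pool import ThreadPool

def econ_post_customers(self, file, data):
    items = file['collection']  # assumption: one PUT per element of the collection
    pool = ThreadPool(min(100, len(items)))
    put_one = partial(request_put, url, self.headers, self.params)
    results = pool.map(put_one, items)  # blocks until every PUT has completed
    pool.close()
    pool.join()
    return results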
Do try the grequests module, which works with gevent (requests is not designed for async use); you should get great results. (If this is not working, please do say so.)
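A rough sketch of what that might look like for the PUT loop in the question above (grequests.put builds the requests lazily and grequests.map sends them concurrently via gevent; url, headers, params, data and file stand for the same values as in the question):

import grequests

# one pending PUT per element of the collection, mirroring the question's loop
pending = (grequests.put(url, headers=headers, params=params, data=data)
           for _ in range(len(file['collection'])))
responses = grequests.map(pending, size=20)  # at most 20 requests in flight at once
# failed requests show up as None entries in responses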
How to execute requests.get without attachment Python
Right now I am trying to execute asynchronous requests without any related tie-in to each other, similar to how FTP can upload/download more than one file at once. I am using the following code:

rec = requests.get("https://url", stream=True)

with

rec.raw.read()

to get responses. But I wish to be able to execute this same piece of code much faster, with no need to wait for the server to respond, which takes about 2 seconds each time.
The easiest way to do something like that is to use threads. Here is a rough example of one of the ways you might do this:

import requests
from multiprocessing.dummy import Pool  # the exact import depends on your python version

pool = Pool(4)  # the number represents how many jobs you want to run in parallel

def get_url(url):
    rec = requests.get(url, stream=True)
    return rec.raw.read()

for result in pool.map(get_url, ["http://url/1", "http://url/2"]):
    do_things(result)
Python3 can't pickle _thread.RLock objects on list with multiprocessing
I'm trying to parse websites that contain cars' properties (154 kinds of properties). I have a huge list (named liste_test) that consists of 280,000 used car announcement URLs.

def araba_cekici(liste_test, headers, engine):
    for link in liste_test:
        try:
            page = requests.get(link, headers=headers)
            .....
            .....

When I start my code like this:

araba_cekici(liste_test, headers, engine)

it works and gets results. But in approximately one hour I could only obtain 1,500 URLs' properties. It is very slow, and I must use multiprocessing. I found a multiprocessing example here and applied it to my code, but unfortunately it is not working:

import sys
import numpy as np
import multiprocessing as multi

def chunks(n, page_list):
    """Splits the list into n chunks"""
    return np.array_split(page_list, n)

cpus = multi.cpu_count()
workers = []
page_bins = chunks(cpus, liste_test)
for cpu in range(cpus):
    sys.stdout.write("CPU " + str(cpu) + "\n")
    # Process that will send the corresponding list of pages
    # to the function perform_extraction
    worker = multi.Process(name=str(cpu), target=araba_cekici, args=(page_bins[cpu], headers, engine))
    worker.start()
    workers.append(worker)

for worker in workers:
    worker.join()

And it gives:

TypeError: can't pickle _thread.RLock objects

I found some responses with respect to this error, but none of them works (at least I can't apply them to my code). I also tried Python's multiprocessing Pool, but unfortunately it gets stuck in the Jupyter notebook and seems to run forever.
Late answer, but since this question turns up when searching on Google: multiprocessing sends the data to the worker processes via a multiprocessing.Queue, which requires all data/objects sent to be picklable. In your code, you try to pass headers and engine, whose implementations you don't show. (Since headers holds the HTTP request headers, I suspect that engine is the issue here.) To solve your issue, you either have to make engine picklable, or only instantiate engine within the worker process.
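For illustration, a minimal sketch of the second option, i.e. creating the unpicklable object inside the worker instead of passing it in. The question never shows what engine is, so a SQLAlchemy engine is used here purely as a placeholder:

import multiprocessing as multi
import requests
from sqlalchemy import create_engine  # illustrative stand-in for whatever engine really is

def araba_cekici(liste_test, headers):
    # build the unpicklable resource inside the worker process,
    # so it never has to be pickled and sent across the process boundary
    engine = create_engine("sqlite:///cars.db")
    for link in liste_test:
        page = requests.get(link, headers=headers)
        # ... parse the page and store the results via engine ...

# in the question's loop, engine is then simply dropped from args:
# worker = multi.Process(name=str(cpu), target=araba_cekici, args=(page_bins[cpu], headers))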
Asynchronous download of files with twisted and (tx)requests
I'm trying to download file(s) from the internet from within a twisted application. I'd like to do this using requests due to the other features it provides directly or has well maintained libraries to provide (retries, proxies, cachecontrol, etc.). I am open to a twisted only solution which does not have these features, but I can't seem to find one anyway.

The files should be expected to be fairly large and will be downloaded on slow connections. I'm therefore using requests' stream=True interface and the response's iter_content. A more or less complete code fragment is listed at the end of this question. The entry point for this would be the http_download function, called with a url, a dst to write the file to, and a callback and an optional errback to handle a failed download. I've stripped away some of the code involved in preparing the destination (create folders, etc.) and code to close the session during reactor exit, but I think it should still work as is.

This code works. The file is downloaded and the twisted reactor continues to operate. However, I seem to have a problem with this bit of code:

def _stream_download(r, f):
    for chunk in r.iter_content(chunk_size=128):
        f.write(chunk)
        yield None

cooperative_dl = cooperate(_stream_download(response, filehandle))

Because iter_content returns only when it has a chunk to return, the reactor handles a chunk, runs other bits of code, then returns to waiting for the next chunk instead of keeping itself busy updating a spinning wait animation on the GUI (code not actually posted here).

Here's the question - is there a way to get twisted to operate on this generator in such a way that it yields control when the generator itself is not prepared to yield something? I came across some docs for twisted.flow which seemed appropriate, but this does not seem to have made it into twisted, or it no longer exists today. This question can be read independent of the specifics, i.e., with respect to any arbitrary blocking generator, or it can be read in the immediate context of the question.

Is there a way to get twisted to download files asynchronously using something full-featured like requests? Is there an existing twisted module which just does this which I can just use? What would the basic approach be to such a problem with twisted, independent of the http features I want to use from requests? Let's assume I'm prepared to ditch them or otherwise implement them. How would I download a file asynchronously over HTTP?
import os
import re
from functools import partial

from six.moves.urllib.parse import urlparse
from requests import HTTPError
from twisted.internet.task import cooperate
from txrequests import Session


class HttpClientMixin(object):
    def __init__(self, *args, **kwargs):
        self._http_session = None

    def http_download(self, url, dst, callback, errback=None, **kwargs):
        dst = os.path.abspath(dst)
        # Log request
        deferred_response = self.http_session.get(url, stream=True, **kwargs)
        deferred_response.addCallback(self._http_check_response)
        deferred_response.addCallbacks(
            partial(self._http_download, destination=dst, callback=callback),
            partial(self._http_error_handler, url=url, errback=errback)
        )

    def _http_download(self, response, destination=None, callback=None):
        def _stream_download(r, f):
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
                yield None

        def _rollback(r, f, d):
            if r:
                r.close()
            if f:
                f.close()
            if os.path.exists(d):
                os.remove(d)

        filehandle = open(destination, 'wb')
        cooperative_dl = cooperate(_stream_download(response, filehandle))
        cooperative_dl.whenDone().addCallback(lambda _: response.close)
        cooperative_dl.whenDone().addCallback(lambda _: filehandle.close)
        cooperative_dl.whenDone().addCallback(
            partial(callback, url=response.url, destination=destination)
        )
        cooperative_dl.whenDone().addErrback(
            partial(_rollback, r=response, f=filehandle, d=destination)
        )

    def _http_error_handler(self, failure, url=None, errback=None):
        failure.trap(HTTPError)
        # Log error message
        if errback:
            errback(failure)

    @staticmethod
    def _http_check_response(response):
        response.raise_for_status()
        return response

    @property
    def http_session(self):
        if not self._http_session:
            # Log session start
            self._http_session = Session()
        return self._http_session
Is there a way to get twisted to operate on this generator in such a way that it yields control when the generator itself is not prepared to yield something?

No. All Twisted can do is invoke the code. If the code blocks indefinitely, then the calling thread is blocked indefinitely. This is a basic premise of the Python runtime.

Is there a way to get twisted to download files asynchronously using something full-featured like requests?

There's treq. You didn't say what "full-featured" means here, but earlier you mentioned "retries", "proxies", and "cachecontrol". I don't believe treq currently has these features. You can find some kind of feature matrix in the treq docs (though I notice it doesn't include any of the features you mentioned - even for requests). I expect implementations of such features would be welcome as treq contributions.

Is there a way to get twisted to download files asynchronously using something full-featured like requests?

Run it in a thread - probably using Twisted's threadpool APIs.

What would the basic approach be to such a problem with twisted, independent of the http features I want to use from requests?

treq.
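For illustration, a minimal treq-based download might look like the sketch below. The URL and filename are placeholders; treq.collect streams the body to the supplied writer as chunks arrive, so the reactor stays free in between:

from twisted.internet.defer import inlineCallbacks
from twisted.internet.task import react
import treq

@inlineCallbacks
def download(url, dst):
    # treq.get returns a Deferred that fires with a response object
    response = yield treq.get(url)
    with open(dst, 'wb') as f:
        # treq.collect calls f.write for each chunk of the body as it arrives
        yield treq.collect(response, f.write)

def main(reactor):
    return download('https://example.com/file.bin', 'file.bin')

react(main)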
Python 2.5 - multi-threaded for loop
I've got a piece of code:

for url in get_lines(file):
    visit(url, timeout=timeout)

It gets URLs from a file and visits them (via urllib2) in a for loop. Is it possible to do this in a few threads? For example, 10 visits at the same time.

I've tried:

for url in get_lines(file):
    Thread(target=visit, args=(url,), kwargs={"timeout": timeout}).start()

But it does not work - no effect, URLs are visited normally.

The simplified version of the function visit:

def visit(url, proxy_addr=None, timeout=30):
    (...)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()
To expand on senderle's answer, you can use the Pool class in multiprocessing to do this easily:

from multiprocessing import Pool

pool = Pool(processes=5)
pages = pool.map(visit, get_lines(file))

When the map function returns, "pages" will be a list of the contents of the URLs. You can adjust the number of processes to whatever is suitable for your system.
I suspect that you've run into the Global Interpreter Lock. Basically, threading in Python can't achieve true parallel execution, which seems to be your goal. You need to use multiprocessing instead. multiprocessing is designed to have a roughly analogous interface to threading, but it has a few quirks. Your visit function as written above should work correctly, I believe, because it's written in a functional style, without side effects.

In multiprocessing, the Process class is the equivalent of the Thread class in threading. It has all the same methods, so it's a drop-in replacement in this case. (Though I suppose you could use Pool as JoeZuntz suggests - but I would test with the basic Process class first, to see if it fixes the problem.)
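For illustration, that drop-in swap might look like the sketch below (keeping the question's visit, get_lines, file and timeout names):

from multiprocessing import Process

processes = []
for url in get_lines(file):
    # same call pattern as the Thread attempt in the question, but with Process
    p = Process(target=visit, args=(url,), kwargs={"timeout": timeout})
    p.start()
    processes.append(p)

for p in processes:
    p.join()  # wait for every visit to finish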