How to do multithreaded PUT requests in Python
What is the best and fastest Pythonic way to add multithreading to a PUT request that runs inside a for loop? Right now the code is synchronous, so it takes far too long to run. We would therefore like to introduce multithreading to speed it up.
Synchronous:
def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            rp = requests.put(url=self.url, headers=self.headers, params=self.params, data=data)
    except StopIteration:
        pass
We attempted to add threading, but starting a thread per iteration seems excessive: we have thousands of iterations and might end up with many more, so that would become a big mess of threads. Maybe a pool would solve the problem, but this is where I am stuck.
Does anyone have an idea of how to solve this?
Parallel:
def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            threading.Thread(target=lambda: request_put(url, self.headers, self.params, data)).start()
    except StopIteration:
        pass

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)
Any help is highly appreciated. Thank you for your time!
If you want to use multithreading, then the following should work. However, I am a bit confused about a few things. You seem to be doing PUT requests in a loop, but all with exactly the same arguments, and I don't quite see how you could get a StopIteration exception in the code you posted. Also, using a lambda expression as your target argument, rather than just specifying the function name and passing the arguments as a separate tuple or list (as is done below), is a bit unusual.

Assuming that the loop variable i is in reality used to index one value that actually varies in the call to request_put, then map could be a better choice than apply_async. It probably does not matter much for multithreading, but it could make a performance difference for multiprocessing if you had a very large list of elements to loop over. (A map-based sketch follows the code below.)
from multiprocessing.pool import ThreadPool

def econ_post_customers(self, file, data):
    MAX_THREADS = 100  # some suitable value
    n_tasks = len(file['collection'])
    pool_size = min(MAX_THREADS, n_tasks)
    pool = ThreadPool(pool_size)
    for i in range(n_tasks):
        pool.apply_async(request_put, args=(url, self.headers, self.params, data))
    # wait for all tasks to complete:
    pool.close()
    pool.join()

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)
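For illustration, here is a minimal sketch of the map-based variant mentioned above. It assumes that each element of file['collection'] is the payload that varies per request, which is not confirmed by the original code:

from multiprocessing.pool import ThreadPool
import requests

MAX_THREADS = 100  # some suitable cap on concurrency

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)

def econ_post_customers(url, headers, params, file):
    # Assumption: each element of file['collection'] is the data payload for one PUT.
    items = file['collection']
    pool_size = min(MAX_THREADS, len(items))
    with ThreadPool(pool_size) as pool:
        # map blocks until every PUT has completed and returns the responses in order
        return pool.map(lambda item: request_put(url, headers, params, item), items)

Because ThreadPool runs tasks in threads rather than processes, passing a lambda here is fine; with a process Pool the callable would have to be picklable.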
Do try the grequests module, which works with gevent (requests is not designed for async use). Have a look at its examples and you should get good results.
(If this does not work, please say so.)
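A minimal sketch of what that could look like for the PUT loop above, assuming grequests is installed and that each element of file['collection'] is the payload to send (both are assumptions, not details from the original post):

import grequests  # pip install grequests; uses gevent under the hood

def econ_post_customers(url, headers, params, file):
    # build unsent request objects, one per item in the collection
    reqs = (
        grequests.put(url, headers=headers, params=params, data=item)
        for item in file['collection']
    )
    # send them concurrently; size caps how many are in flight at once
    return grequests.map(reqs, size=20)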
Related
Lost HTTPS requests with parallel processing
I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols and I query the Questrade API in batches of 100 symbols.

import requests
from joblib import Parallel, delayed

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    result.extend(response.json().get(key))

Parallel(n_jobs=-1, backend="threading")(
    delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')
    for elem in self.batch_result
)

If I make over 110 HTTPS requests with the Parallel class, then instead of getting 11,000 outputs I get 10,500 or 10,600, so I lose data with parallel processing. Be aware that I am using two Python modules here, joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).

The following for loop worked perfectly, so I know my problem is with the Parallel class.

for elem in self.batch_result:
    response = requests.get(''.join((self.uri, elem)), headers=self.headers)
    self.symbol_ids_list.extend(response.json().get('symbols'))

How could I increase the performance of the last for loop without losing data?

UPDATE

A sample of self.batch_result (simplified result) could be

['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI',
 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI',
 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']

and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=' as seen in the above Questrade API link.

UPDATE 2

Marat's answer was a good try but didn't give me a better result. The first test gave me 31,356 (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0, or the process blocked completely. I found out that the maximum allowed number of requests per second is 20 (link: http://www.questrade.com/api/documentation/rate-limiting). How could I increase the performance of the last for loop without losing data, taking that new information into account?
If you are not stuck with joblib, you could try some standard-library parallel processing modules. In Python 2/3, multiprocessing.Pool is available and provides functions for mapping a task across a pool of parallel workers. A simplified version would look like this:

from multiprocessing import Pool
import requests

HEADERS = {}  # define headers here

def parallel_request(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
    return response.json()

if __name__ == '__main__':
    p = Pool()
    batch_result = ['AAME,ABAC,ABIL,...',
                    'BLDP,BLIN,BLNK,...',
                    'CUR,CVONW,CXDC,...',
                    ...]
    p.map(parallel_request, batch_result)  # will return a list of len(batch_result) responses

There are asynchronous and iterable versions of map that you would probably want for larger jobs, and of course you could add parameters to your parallel_request task to avoid hard-coding things as I did. A caveat with using Pool is that any arguments passed to it have to be picklable.

In Python 3 the concurrent.futures module actually has a nice example of multithreaded URL retrieval in the docs. With a little effort you could replace load_url in that example with your parallel_request function. There is a version of concurrent.futures backported to Python 2 as the futures module as well.

These might require a bit more work in refactoring, so if there is a solution that sticks with joblib, feel free to prefer that. On the off-chance that your problem is a bug in joblib, there are plenty of ways you could do this in a multithreaded fashion with the standard library (albeit with some added boilerplate).
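For reference, a minimal sketch of what a concurrent.futures version might look like, adapted to the parallel_request function above (max_workers, HEADERS and URI here are illustrative placeholders, not values from the original post):

import concurrent.futures
import requests

HEADERS = {}  # placeholder: real auth headers go here
URI = 'https://api01.iq.questrade.com/v1/symbols?names='

def parallel_request(symbols):
    response = requests.get(URI + symbols, headers=HEADERS)
    return response.json().get('symbols')

def fetch_all(batch_result):
    symbol_ids = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(parallel_request, elem) for elem in batch_result]
        for future in concurrent.futures.as_completed(futures):
            # results are collected in the main thread only, so no lock is needed
            symbol_ids.extend(future.result())
    return symbol_ids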
Most likely, it happens because some of the HTTP calls fail due to network load. To test this, change parallel_request:

def parallel_request(self, elem, result, url, key):
    for i in range(3):  # 3 retries
        try:
            response = requests.get(''.join((url, elem)), headers=self.headers)
        except IOError:
            continue
        result.extend(response.json().get(key))
        return

Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding the extend with a lock:

import threading
...
lock = threading.Lock()

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    lock.acquire()
    result.extend(response.json().get(key))
    lock.release()
Multiprocessing Queue vs. Pool
I'm having the hardest time trying to figure out the difference in usage between multiprocessing.Pool and multiprocessing.Queue. To help, this bit of code is a barebones example of what I'm trying to do.

def update():
    def _hold(url):
        soup = BeautifulSoup(url)
        return soup

    def _queue(url):
        soup = BeautifulSoup(url)
        li = [l for l in soup.find('li')]
        return True if li else False

    url = 'www.ur_url_here.org'
    _hold(url)
    _queue(url)

I'm trying to run _hold() and _queue() at the same time. I'm not trying to have them communicate with each other, so there is no need for a Pipe. update() is called every 5 seconds.

I can't really wrap my head around the difference between creating a pool of workers and creating a queue of functions. Can anyone assist me?

The real _hold() and _queue() functions are much more elaborate than in the example, so concurrent execution actually is necessary; I just thought this example would suffice for asking the question.
The Pool and the Queue belong to two different levels of abstraction. The Pool of Workers is a concurrent design paradigm that aims to abstract away a lot of the logic you would otherwise need to implement yourself when using processes and queues. The multiprocessing.Pool actually uses a Queue internally to operate.

If your problem is simple enough, you can easily rely on a Pool. In more complex cases, you might need to deal with processes and queues yourself. For your specific example, the following code should do the trick.

def hold(url):
    ...
    return soup

def queue(url):
    ...
    return bool(li)

def update(url):
    with multiprocessing.Pool(2) as pool:
        hold_job = pool.apply_async(hold, args=[url])
        queue_job = pool.apply_async(queue, args=[url])

        # block until hold_job is done
        soup = hold_job.get()

        # block until queue_job is done
        li = queue_job.get()

I'd also recommend you take a look at the concurrent.futures module. As the name suggests, it is the future-proof implementation of the Pool of Workers paradigm in Python. You can easily rewrite the example above with that library, as what really changes is just the API names.
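As a rough illustration of that last point, here is how the same update() might look with concurrent.futures; this is only a sketch and assumes hold and queue are the real, fleshed-out functions:

import concurrent.futures

def update(url):
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        hold_job = executor.submit(hold, url)
        queue_job = executor.submit(queue, url)

        # block until each job is done, mirroring apply_async(...).get()
        soup = hold_job.result()
        li = queue_job.result()
    return soup, li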
Web crawler returning list vs generator vs producer/consumer
I want to recursively crawl a web server that hosts thousands of files and then check whether they are different from what's in the local repository (this is part of checking the delivery infrastructure for bugs). So far I've been playing around with various prototypes, and here is what I noticed.

If I do a straightforward recursion and put all the files into a list, the operation completes in around 230 seconds. Note that I make only one request per directory, so it makes sense to actually download the files I'm interested in elsewhere:

def recurse_links(base):
    result = []
    try:
        f = urllib.request.urlopen(base)
        soup = BeautifulSoup(f.read(), "html.parser")
        for anchor in soup.find_all('a'):
            href = anchor.get('href')
            if href.startswith('/') or href.startswith('..'):
                pass
            elif href.endswith('/'):
                recurse_links(base + href)
            else:
                result.append(base + href)
    except urllib.error.HTTPError as httperr:
        print('HTTP Error in ' + base + ': ' + str(httperr))

I figured that if I could start processing the files I'm interested in while the crawler is still working, I could save time. So the next thing I tried was a generator that could be further used as a coroutine. The generator took 260 seconds, slightly more, but still acceptable. Here's the generator:

def recurse_links_gen(base):
    try:
        f = urllib.request.urlopen(base)
        soup = BeautifulSoup(f.read(), "html.parser")
        for anchor in soup.find_all('a'):
            href = anchor.get('href')
            if href.startswith('/') or href.startswith('..'):
                pass
            elif href.endswith('/'):
                yield from recurse_links_gen(base + href)
            else:
                yield base + href
    except urllib.error.HTTPError as http_error:
        print(f'HTTP Error in {base}: {http_error}')

Update

Answering some questions that came up in the comments section:

I've got roughly 370k files, but not all of them will make it to the next step. I will check them against a set or dictionary (to get O(1) lookup) before going ahead and comparing them to the local repo.

After more tests it looks like the sequential crawler takes less time in roughly 4 out of 5 attempts, and the generator took less time only once. So at this point it seems like the generator is okay.

At this point the consumer doesn't do anything other than get an item from the queue, since it's a proof of concept. However, I have flexibility in what I will do with the file URL I get from the producer. I could, for instance, download only the first 100 KB of the file, calculate its checksum while in memory, and then compare it to a pre-calculated local version. What's clear, though, is that if simply adding thread creation bumps my execution time up by a factor of 4 to 5, adding work on the consumer thread will not make it any faster.

Finally I decided to give producer/consumer/queue a shot, and a simple PoC ran 4 times longer while loading one CPU core at 100%. Here is the brief code (the crawler is the same generator-based crawler from above):

class ProducerThread(threading.Thread):
    def __init__(self, done_event, url_queue, crawler, name):
        super().__init__()
        self._logger = logging.getLogger(__name__)
        self.name = name
        self._queue = url_queue
        self._crawler = crawler
        self._event = done_event

    def run(self):
        for file_url in self._crawler.crawl():
            try:
                self._queue.put(file_url)
            except Exception as ex:
                self._logger.error(ex)

So here are my questions:

1) Are the threads created with the threading library actual threads, and is there a way for them to be distributed between various CPU cores?

2) I believe a great deal of the performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?

3) Is the generator slower because it has to save the function context and then load it again over and over?

4) What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever, and thus make the whole program faster?
1) Are the threads created with the threading library actual threads, and is there a way for them to be distributed between various CPU cores?

Yes, these are real threads, but to utilize multiple cores of your CPU you need to use the multiprocessing package.

2) I believe a great deal of the performance degradation comes from the producer waiting to put an item into the queue. But can this be avoided?

It depends on the number of threads you have created; one reason may be the context switches your threads are making. The optimal number of threads is small, around 2 to 3; create 2-3 threads and check the performance again.

3) Is the generator slower because it has to save the function context and then load it again over and over?

Generators are not slow, and a generator is a good fit for the problem you are working on: as you find a URL, you put it into the queue.

4) What's the best way to start actually doing something with those files while the crawler is still populating the queue/list/whatever, and thus make the whole program faster?

Create a ConsumerThread class which fetches the data (URLs in your case) from the queue and starts working on it; a sketch of such a consumer is below.
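A minimal sketch of such a consumer, assuming the same url_queue that ProducerThread fills and a hypothetical process_file(file_url) callback (the callback name and the None sentinel are illustrative assumptions, not from the original post):

import threading

class ConsumerThread(threading.Thread):
    def __init__(self, url_queue, process_file, name):
        super().__init__()
        self.name = name
        self._queue = url_queue
        self._process_file = process_file  # callback that does the real work on one URL

    def run(self):
        while True:
            file_url = self._queue.get()
            if file_url is None:  # sentinel put by the producer when crawling is done
                self._queue.task_done()
                break
            self._process_file(file_url)
            self._queue.task_done()

The producer would put(None) once the crawl finishes so the consumer can exit cleanly.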
How to execute a function several times asynchronously and get first result
I have a function get_data(request) that requests some data from a server. Every time this function is called, it requests data from a different server, and all of them should return the same response. I would like to get the response as soon as possible, so I need to create a function that calls get_data several times and returns the first response it gets.

EDIT: I came up with the idea of using multiprocessing.Pipe(), but I have the feeling this is a very bad way to solve it. What do you think?

from multiprocessing import Pipe
from threading import Thread

def get_data(request, pipe):
    data = ...  # makes the request to a server; this can take a random amount of time
    pipe.send(data)

def multiple_requests(request, num_servers):
    my_pipe, his_pipe = Pipe()
    for i in range(num_servers):
        Thread(target=get_data, args=(request, his_pipe)).start()
    return my_pipe.recv()

multiple_requests("the_request_string", 6)

I think this is a bad way of doing it because I'm passing the same pipe to all threads, and I don't really know, but I guess that has to be very unsafe.
I think Redis RQ would be a good fit for this. get_data is a job that you put in the queue six times. Jobs execute asynchronously, and the docs also explain how to work with the results.
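A minimal sketch of what that could look like with RQ. It assumes a Redis server and several rq worker processes are running, and that get_data(request) simply returns the data instead of writing to a pipe; these are assumptions on my part, not details from the original answer:

import time
from redis import Redis
from rq import Queue

def multiple_requests(request, num_servers=6):
    q = Queue(connection=Redis())
    # enqueue the same job once per server; the rq workers execute them in parallel
    jobs = [q.enqueue(get_data, request) for _ in range(num_servers)]
    # poll until any one job has finished, then return its result
    while True:
        for job in jobs:
            if job.is_finished:
                return job.result
        time.sleep(0.1)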
Is this a good use case for ndb async urlfetch tasklets?
I want to move to ndb, and have been wondering whether to use async urlfetch tasklets. I'm not sure I fully understand how it works, as the documentation is somewhat poor, but it seems quite promising for this particular use case.

Currently I use async urlfetch like this. It is far from actual threading or parallel code, but it has still improved performance quite significantly compared to just sequential requests.

def http_get(url):
    rpc = urlfetch.create_rpc(deadline=3)
    urlfetch.make_fetch_call(rpc, url)
    return rpc

rpcs = []
urls = [...]  # hundreds of urls

while len(rpcs) < 10:
    rpcs.append(http_get(urls.pop()))

while rpcs:
    rpc = rpcs.pop(0)
    result = rpc.get_result()
    if result.status_code == 200:
        # append another item to rpcs
        # process result
        pass
    else:
        # re-append the same item to rpcs
        pass

Please note that this code is simplified. The actual code catches exceptions, has some additional checks, and only tries to re-append the same item a few times. It makes no difference for this case.

I should add that processing the result does not involve any db operations.
Actually yes, it's a good idea to use async urlfetch here. How it works (rough explanation):

- your code reaches the point of the async call. It triggers a long background task and doesn't wait for its result, but continues to execute.
- the task works in the background, and when the result is ready it stores the result somewhere until you ask for it.

Simple example:

def get_fetch_all():
    urls = ["http://www.example.com/", "http://mirror.example.com/"]

    ctx = ndb.get_context()
    futures = [ctx.urlfetch(url) for url in urls]
    results = ndb.Future.wait_all(futures)

    # do something with results here

If you want to store the result in ndb and make it more optimal, it's a good idea to write a custom tasklet for this.

@ndb.tasklet
def get_data_and_store(url):
    ctx = ndb.get_context()
    # until we receive the result here, this function is "paused", allowing other
    # parallel tasks to work. when the data has been fetched, control is returned
    result = yield ctx.urlfetch("http://www.google.com/")

    if result.status_code == 200:
        store = Storage(data=result.content)
        # async job to put data
        yield store.put_async()
        raise ndb.Return(True)
    else:
        raise ndb.Return(False)

And you can use this tasklet combined with the loop in the first sample. You should get a list of True/False values indicating the success of each fetch.

I'm not sure how much this will boost overall productivity (it depends on the Google side), but it should help.
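A minimal sketch of that combination, assuming get_data_and_store is changed to fetch the url it is given instead of the hard-coded one (an assumption on my part):

def fetch_and_store_all(urls):
    # start one tasklet per url; their urlfetch calls run in parallel
    futures = [get_data_and_store(url) for url in urls]
    ndb.Future.wait_all(futures)
    # each future resolves to the True/False value raised via ndb.Return
    return [f.get_result() for f in futures]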