Is this a good use case for ndb async urlfetch tasklets? - python
I want to move to ndb, and have been wondering whether to use async urlfetch tasklets. I'm not sure I fully understand how it works, as the documentation is somewhat poor, but it seems quite promising for this particular use case.
Currently I use async urlfetch like this. It is far from actual threading or parallel code, but it has still improved performance quite significantly, compared to just sequential requests.
from google.appengine.api import urlfetch

def http_get(url):
    rpc = urlfetch.create_rpc(deadline=3)
    urlfetch.make_fetch_call(rpc, url)
    return rpc

rpcs = []
urls = [...]  # hundreds of urls

# keep at most 10 fetches in flight at a time
while len(rpcs) < 10:
    rpcs.append(http_get(urls.pop()))

while rpcs:
    rpc = rpcs.pop(0)
    result = rpc.get_result()
    if result.status_code == 200:
        # process result, then append another item to rpcs
        pass
    else:
        # re-append the same item to rpcs
        pass
Please note that this code is simplified. The actual code catches exceptions, has some additional checks, and only tries to re-append the same item a few times. It makes no difference for this case.
I should add that processing the result does not involve any db operations.
Actually yes, it's a good idea to use async urlfetch here. Roughly, it works like this:
- Your code reaches the point of the async call. It triggers a long-running background task and, instead of waiting for its result, continues to execute.
- The task works in the background, and when the result is ready it is stored somewhere until you ask for it.
Simple example:
from google.appengine.ext import ndb

def get_fetch_all():
    urls = ["http://www.example.com/", "http://mirror.example.com/"]
    ctx = ndb.get_context()
    # start all fetches; nothing blocks here
    futures = [ctx.urlfetch(url) for url in urls]
    # wait until every fetch has finished, then collect the results
    ndb.Future.wait_all(futures)
    results = [f.get_result() for f in futures]
    # do something with results here
If you want to store the result in ndb and make the whole thing more efficient, it's a good idea to write a custom tasklet for this.
@ndb.tasklet
def get_data_and_store(url):
    ctx = ndb.get_context()
    # Until the result arrives, this function is "paused", allowing other
    # parallel tasks to run. When the data has been fetched, control returns here.
    result = yield ctx.urlfetch(url)
    if result.status_code == 200:
        store = Storage(data=result.content)
        # async job to put the data
        yield store.put_async()
        raise ndb.Return(True)
    else:
        raise ndb.Return(False)
And you can use this tasklet combined with the loop from the first sample. You should get back a list of True/False values indicating whether each fetch succeeded.
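For example, a minimal sketch of driving the tasklet over a list of URLs and collecting the flags (it assumes the Storage model above and a urls list, and omits the "at most 10 in flight" limit from the first sample for brevity):

def fetch_and_store_all(urls):
    # Start one tasklet per URL; they all run concurrently on the event loop.
    futures = [get_data_and_store(url) for url in urls]
    # Block until every tasklet has finished, then collect the True/False flags.
    ndb.Future.wait_all(futures)
    return [f.get_result() for f in futures]

ok_flags = fetch_and_store_all(urls)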
I'm not sure how much this will boost overall throughput (it depends on Google's side), but it should help.
Related
Trying to use async to shorten computation time but failed... any tips?
I have a function that looks for relationships 2 levels deep. First, the function gets a company from the database and then looks for people related to it, then parses through the people to schedule asyncio.create_task(async_func()) as such:

async def get_company_related_data(self, bno: str, uuid: str = ""):
    people = []
    base_company = self.get_company_by_bno(bno, get_dict_objects=False)[0]
    people.extend(await base_company.get_related_people(convert_data=False))
    ...
    task_list = []
    for person in people:
        task_list.append(process_for_bubble_chart(person))
    results = await asyncio.gather(*task_list)

Here, the idea is to grab a company's related people first through the base_company.get_related_people() method. Once I've gotten those people, I iterate through those people and:

1 - Set up tasks to process_for_bubble_chart() so that they can run at the same time (there could be 20+ people and each of them could be related to multiple companies).
2 - I await ALL results (at least I think I am...) by inserting all tasks into the asyncio.gather() function.
3 - Below you can see I do the same thing for each person.

The process_for_bubble_chart() function:

async def process_for_bubble_chart(person: GcisCompanyInfoPerson or GcisLimitedPartnerPerson, convert_to_data: bool = True):
    """
    Function that fetches related entities from the database
    based on the people objects within the 'people' list.
    """
    related_entities = []
    try:
        task_list = [
            person.get_related_companies(),
            person.get_related_businesses(),
            person.get_related_limited_partners(),
            person.get_related_factories(),
            person.get_related_stockcompanies()
        ]
        results = asyncio.gather(*task_list)
    except Exception as err:
        # Exception stuff
        pass
    else:
        for task_res in await results:
            related_entities.extend(task_res)
    if convert_to_data:
        data = person.to_relation_data_object()
        data.update({"related": related_entities})
        return data
    return related_entities

And the get_related_XXX() methods look like this (more or less the same code returning different objects):

async def get_related_companies(self, exclude_bno: bool = True):
    sql = """
    SELECT * FROM Companies WHERE ...
    """
    # SQL fetch logic
    return [GcisCompanyInfo1(row) for row in query_db(sql)]

Where query_db() is just a wrapper function for querying the database. Before I implemented async, the full queries took too long (~20 sec.), so I looked into how to use the asyncio module to make things go quicker, but the computation time stayed about the same (if not even slightly longer!). How do I improve this? This code runs as a FastAPI backend.
async functions don't magically run in parallel - they only parallelize when you ultimately use await on some operation which waits for a specific event to occur (common examples are things like socket reads or timed sleeps). For example, if you have an async query_db function that can query the database asynchronously, then that may allow you to parallelize the operation. In the absence of such an async operation, you may consider standard threads instead, using e.g. asyncio.get_running_loop().run_in_executor(None, process_for_bubble_chart, person) to run a non-async function in a thread.
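For instance, a minimal sketch of the thread-pool route, reusing names from the question (query_db is assumed to be the blocking call; exact signatures may differ):

import asyncio

async def get_related_companies(self, exclude_bno: bool = True):
    sql = "SELECT * FROM Companies WHERE ..."
    loop = asyncio.get_running_loop()
    # Run the blocking DB call in the default thread pool so other
    # coroutines can make progress while this query waits on the database.
    rows = await loop.run_in_executor(None, query_db, sql)
    return [GcisCompanyInfo1(row) for row in rows]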
How to do Multithreading Put Request in Python
What is the best and fastest pythonic way to program multithreading for a put request that is within a for loop? Now, as it is synchronous, it takes too long to run the code. Therefore, we would like to include multithreading to improve the time.

Synchronous:

def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            rp = requests.put(url=self.url, headers=self.headers, params=self.params, data=data)
    except StopIteration:
        pass

We attempted to make threading work, but starting threads on every iteration just seems unnecessary; we have 1000's of iterations, and we might run up on much more, so that would become a big mess with threads. Maybe including pools would solve the problem, but this is where I am stuck. Anyone who has an idea on how to solve this?

Parallel:

def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            threading.Thread(target=lambda: request_put(url, self.headers, self.params, data)).start()
    except StopIteration:
        pass

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)

Any help is highly appreciated. Thank you for your time!
If you want to use multithreading, then the following should work. However, I am a bit confused about a few things. You seem to be doing PUT requests in a loop but all with the same exact arguments. And I don't quite see how you can get a StopIteration exception in the code you posted. Also using a lambda expression as your target argument rather than just specifying the function name and then passing the arguments as a separate tuple or list (as is done below) is a bit unusual.

Assuming that loop variable i in reality is being used to index one value that actually varies in the call to request_put, then function map could be a better choice than apply_async. It probably does not matter significantly for multithreading, but could make a performance difference for multiprocessing if you had a very large list of elements on which you were looping.

from multiprocessing.pool import ThreadPool

def econ_post_customers(self, file, data):
    MAX_THREADS = 100  # some suitable value
    n_tasks = len(file['collection'])
    pool_size = min(MAX_THREADS, n_tasks)
    pool = ThreadPool(pool_size)
    for i in range(n_tasks):
        pool.apply_async(request_put, args=(url, self.headers, self.params, data))
    # wait for all tasks to complete:
    pool.close()
    pool.join()

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)
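A sketch of the map variant mentioned above, under the assumption that each iteration really does send a different payload (here a hypothetical per-item payload taken from file['collection']):

from multiprocessing.pool import ThreadPool

import requests

def request_put(args):
    url, headers, params, single = args
    return requests.put(url=url, headers=headers, params=params, data=single)

def econ_post_customers(self, file, data):
    items = file['collection']  # assumed: one payload per element
    arg_list = [(self.url, self.headers, self.params, item) for item in items]
    pool_size = min(100, len(arg_list))
    with ThreadPool(pool_size) as pool:
        # map blocks until every PUT has completed and returns the responses in order
        return pool.map(request_put, arg_list)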
Do try the grequests module, which works with gevent (requests is not designed for async). If you try it you should get great results. (If this is not working, please do say.)
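A rough sketch of what that could look like for the PUT loop above (untested; it assumes grequests mirrors the requests API with put() and map(), and reuses the url/headers/params from the question):

import grequests

def econ_post_customers(self, file, data):
    # build one unsent async PUT per element of the collection
    reqs = (grequests.put(self.url, headers=self.headers, params=self.params, data=data)
            for _ in file['collection'])
    # send them concurrently; size caps how many run at once
    return grequests.map(reqs, size=20)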
How to easily find a coroutine that has timed out?
Key problem: asyncio.wait(aws, timeout=1, return_when=FIRST_COMPLETED). Is there a simple way to check if the returned task has timed out?

This is an extended question. The scene is like this:

- Total number of coroutines is unknown
- The server only allows 10 links
- The server will return a seemingly correct result (e.g. returning an incorrect page)
- The server sometimes does not return any data
- Maximum possible access to all data

So in order to get data faster, I need to limit the number of coroutines, check the returned page, and handle timeouts. There are two simple methods at present:

1. Similar to threads, use a queue to build a coroutine pool plus 10 infinite-loop coroutines. I don't really like it. In fact, this method works very fast.
2. I tried to use the high-level API of async Python 3.7 to simplify the structure of the program, using while tasks & asyncio.wait & return_when.

Here I came across the problem of how to find timeouts for coroutines. I built a simple demo:

import asyncio

async def test(delaytime):
    print(f"begin {delaytime}")
    await asyncio.sleep(delaytime)
    print(f"finish {delaytime} ")

async def main():
    # the number of tasks is unknown, range(10) is just a demo
    allts = list(range(10))
    ts = []
    while len(ts) < 5:
        arg = allts.pop()
        t = asyncio.create_task(test(arg))
        t.arg = arg
        ts.append(t)
    while ts:
        dones, pendings = await asyncio.wait(ts, timeout=2, return_when=asyncio.FIRST_COMPLETED)
        for t in dones:
            # if check t.result() is error, i can append ts again
            print(t.arg, "is done")
            ts.remove(t)
        while len(ts) < 5:
            if len(allts):
                arg = allts.pop()
                t = asyncio.create_task(test(arg))
                t.arg = arg
                ts.append(t)
            else:
                break
        # for t in pendings:
        #     # if can check t is timeout, i can append ts again
        #     pass

if __name__ == "__main__":
    asyncio.run(main())

After debugging, I know that with return_when=asyncio.FIRST_COMPLETED, the tasks returned by asyncio.wait are in pendings, except for the completed tasks. However, I can't tell which task has timed out. I thought about using wait_for, but wait_for has no return_when argument. Is there a simple way to determine the timed-out task in order to re-join ts?
The issue is that the approach of using wait(return_when=FIRST_COMPLETED) is fundamentally incompatible with the use of timeout. Since different tasks have started at different times, a single timeout argument obviously can't apply to all tasks. If you want to use return_when=FIRST_COMPLETED, wrap each task in asyncio.wait_for: t = asyncio.create_task(asyncio.wait_for(test(arg), 2)) Then, when the task is done, you can use t.exception() to test if it has timed out, in which case it will return asyncio.TimeoutError. This check should only be performed among the done tasks.
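A minimal runnable sketch of that pattern, based on the test() coroutine from the demo in the question (the 2-second deadline matches the demo):

import asyncio

async def test(delaytime):
    await asyncio.sleep(delaytime)
    return delaytime

async def main():
    # wrap each coroutine in wait_for so it carries its own 2-second deadline
    ts = [asyncio.create_task(asyncio.wait_for(test(arg), 2)) for arg in (1, 3, 5)]
    while ts:
        dones, pendings = await asyncio.wait(ts, return_when=asyncio.FIRST_COMPLETED)
        for t in dones:
            ts.remove(t)
            if isinstance(t.exception(), asyncio.TimeoutError):
                print("timed out, could be re-queued here")
            else:
                print("finished with", t.result())

asyncio.run(main())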
Lost HTTPS requests with parallel processing
I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols where I request the Questrade API with batches of 100 symbols.

import requests
from joblib import Parallel, delayed

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    result.extend(response.json().get(key))

Parallel(n_jobs=-1, backend="threading")(
    delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')
    for elem in self.batch_result
)

If I make over 110 HTTPS requests with the Parallel class, then instead of getting 11,000 outputs I get 10,500 or 10,600. So I lose data with parallel processing. Be aware that I used two python modules here, i.e. joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).

The following for loop worked perfectly, so I know my problem is with the Parallel class:

for elem in self.batch_result:
    response = requests.get(''.join((self.uri, elem)), headers=self.headers)
    self.symbol_ids_list.extend(response.json().get('symbols'))

How could I increase the performance of the last for loop without losing data?

UPDATE

A sample of self.batch_result (simplified result) could be:

['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI',
 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI',
 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']

and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=', as seen in the above Questrade API link.

UPDATE 2

Marat's answer was a good try but didn't give me a better result. The first test gave me 31,356 (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0, or the process blocked completely.

I found out that the maximum allowed requests per second is 20. Link: http://www.questrade.com/api/documentation/rate-limiting. How could I increase the performance of the last for loop without losing data, considering that new information?
If you are not stuck with using joblib you could try some standard library parallel processing modules. In python2/3 multiprocessing.Pool is available and provides functions for mapping a task across parallel threads. A simplified version would look like this:

from multiprocessing import Pool
import requests

HEADERS = {}  # define headers here

def parallel_request(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
    return response.json()

if __name__ == '__main__':
    p = Pool()
    batch_result = ['AAME,ABAC,ABIL,...',
                    'BLDP,BLIN,BLNK,...',
                    'CUR,CVONW,CXDC,...',
                    ...]
    p.map(parallel_request, batch_result)  # will return a list of len(batch_result) responses

There are asynchronous and iterable versions of map that you would probably want for larger sized jobs, and of course you could add parameters to your parallel_request task to avoid hard coding things like I did. A caveat with using Pool is that any arguments passed to it have to be picklable.

In python3 the concurrent.futures module actually has a nice example of multithreaded url retrieval in the docs. With a little effort you could replace load_url in that example with your parallel_request function. There is a version of concurrent.futures backported to python2 as the futures module, as well.

These might require a bit more work in refactoring, so if there is a solution that sticks with joblib feel free to prefer that. On the off-chance that your problem is a bug in joblib, there are plenty of ways you could do this in a multithreaded fashion with standard library (albeit with some added boilerplate).
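For reference, a rough sketch of the concurrent.futures route just mentioned, adapted to the batch strings from the question (HEADERS and the URL prefix are placeholders to fill in):

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

HEADERS = {}  # fill in your auth headers here
URI = 'https://api01.iq.questrade.com/v1/symbols?names='

def fetch_symbols(batch):
    response = requests.get(URI + batch, headers=HEADERS)
    return response.json().get('symbols')

def fetch_all(batch_result, max_workers=8):
    symbol_ids = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_symbols, batch) for batch in batch_result]
        for future in as_completed(futures):
            # result() re-raises any exception from the worker, so failed calls are visible
            symbol_ids.extend(future.result())
    return symbol_ids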
Most likely, it happens because some of HTTP calls fail due to network load. To test, change parallel_request:

def parallel_request(self, elem, result, url, key):
    for i in range(3):  # 3 retries
        try:
            response = requests.get(''.join((url, elem)), headers=self.headers)
        except IOError:
            continue
        result.extend(response.json().get(key))
        return

Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding extend with a lock:

import threading
...

lock = threading.Lock()

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    lock.acquire()
    result.extend(response.json().get(key))
    lock.release()
How to execute a function several times asynchronously and get first result
I have a function get_data(request) that requests some data from a server. Every time this function is called, it requests data from a different server. All of them should return the same response. I would like to get the response as soon as possible. I need to create a function that calls get_data several times and returns the first response it gets.

EDIT: I came up with the idea of using multithreading.Pipe(), but I have the feeling this is a very bad way to solve it. What do you think?

def get_data(request, pipe):
    data = # makes the request to a server, this can take a random amount of time
    pipe.send(data)

def multiple_requests(request, num_servers):
    my_pipe, his_pipe = multithreading.Pipe()
    for i in range(num_servers):
        Thread(target=get_data, args=(request, his_pipe)).start()
    return my_pipe.recv()

multiple_requests("the_request_string", 6)

I think this is a bad way of doing it because you are passing the same pipe to all threads, and I don't really know, but I guess that has to be very unsafe.
I think Redis RQ would be a good fit for this. get_data is a job that you put in the queue six times. Jobs execute asynchronously, and the docs also explain how to work with results.
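A rough sketch of what that could look like with RQ (it assumes a running Redis server and at least one rq worker, and that get_data lives in a module the worker can import; the polling loop is just one simple way to take the first finished result):

import time

from redis import Redis
from rq import Queue

from mymodule import get_data  # hypothetical module holding the real get_data(request)

def multiple_requests(request, num_servers=6):
    q = Queue(connection=Redis())
    # enqueue the same request once per server; the workers execute them concurrently
    jobs = [q.enqueue(get_data, request) for _ in range(num_servers)]
    # poll until any job has finished, then return its result
    while True:
        for job in jobs:
            if job.is_finished:
                return job.result
        time.sleep(0.1)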