I'm currently ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I have some code that roughly functions as follows:
while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # DO SOMETHING HERE
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is really slow, and I was wondering if there is any way to loop through the pages asynchronously. I have been recommended to take a look at Twisted, but I am not entirely sure how to proceed.
What you have here is a situation where you do not know what to read next until you have called the API. So think of it in terms of: what can you actually do in parallel?
I do not know how much of this can run in parallel, or which tasks, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or limits
- processing of one page/batch can be done independently of the others
What is slow is the I/O, so you can immediately split your code into two tasks running in parallel: one that reads data, puts it in a queue, and keeps reading until it hits the limit or an empty response (pausing when the queue is full),
and a second task that takes data from the queue and does something with it,
so one task feeds the other, as in the sketch below.
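A minimal sketch of that first approach with just the standard library, assuming a hypothetical fetch_page(url) helper that returns (items, next_url) and a hypothetical process(item) function for the actual work:

import queue
import threading

PAGE_QUEUE = queue.Queue(maxsize=10)   # bounded, so the reader pauses when the queue is full
SENTINEL = object()

def reader(start_url, fetch_page):
    # fetch_page(url) -> (items, next_url) is an assumed helper, not part of the original code
    url = start_url
    while url:
        items, url = fetch_page(url)
        PAGE_QUEUE.put(items)          # blocks while the queue is full
    PAGE_QUEUE.put(SENTINEL)           # signal that reading is finished

def worker(process):
    while True:
        items = PAGE_QUEUE.get()
        if items is SENTINEL:
            break
        for item in items:
            process(item)              # the CPU-bound work happens here

# usage (with the hypothetical helpers):
# threading.Thread(target=reader, args=(api_url, fetch_page)).start()
# threading.Thread(target=worker, args=(process,)).start()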
Another approach is to have one task that kicks off the next one immediately after its data is read, so their execution runs in parallel, just slightly shifted.
How would I implement it? As Celery tasks, and yes, with requests.
For example, the second approach:
from celery import Celery
import requests

app = Celery('tasks', broker='amqp://localhost')   # broker URL here is just an example

@app.task
def do_data_process(data):
    # do something with data
    pass

@app.task
def parse_one_page(url):
    response = requests.get(url)
    data = response.json()
    if 'more_url' in data:
        parse_one_page.delay(data['more_url'])
    # and here do the data processing in this task
    do_data_process(data)
    # or call a worker and do it in another process
    # do_data_process.delay(data)
It is up to you how many tasks you run in parallel, as long as you add limits to your code; you can even have workers on multiple machines and separate queues for parse_one_page and do_data_process (a routing sketch follows below).
Why this approach, and not twisted or async?
Because you have CPU-bound data processing (parsing the JSON, then working on the data), and for that it is better to have separate processes, which Celery handles very well.
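If you do go with separate queues, a minimal routing sketch, assuming the tasks above live in a module called tasks.py (queue names are just examples):

# route each task to its own queue; queue names are illustrative
app.conf.task_routes = {
    'tasks.parse_one_page': {'queue': 'fetch'},
    'tasks.do_data_process': {'queue': 'process'},
}

You would then start dedicated workers per queue, e.g. celery -A tasks worker -Q fetch and celery -A tasks worker -Q process, possibly on different machines.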
I'm trying to send simultaneous GET requests with the Python requests module.
While searching for a solution I've come across lots of different approaches, including grequests, gevent.monkey, requests-futures, threading, multiprocessing...
I'm a little overwhelmed and not sure which one to pick with regard to speed and code readability.
The task is to download < 400 files as fast as possible, all from the same server. Ideally it should output the status of the downloads in the terminal, e.g. print an error or success message per request.
import threading
import requests

def download(webpage):
    requests.get(webpage)
    # Whatever else you need to do to download your resource, put it in here

urls = ['https://www.example.com', 'https://www.google.com', 'https://yahoo.com']  # Populate with resources you wish to download
threads = {}

if __name__ == '__main__':
    for i in urls:
        print(i)
        threads[i] = threading.Thread(target=download, args=(i,))
    for i in threads:
        threads[i].start()
    for i in threads:
        threads[i].join()
    print('successfully done.')
The above code contains a function called download that represents whatever code you need to run to download the resource you're after. Then a list is populated with the URLs you wish to download; change these values as you please. For each URL a separate thread is created and stored in a dictionary, so you can have as many URLs in the list as you want. The threads are each started, then joined.
I would use threading, as it is not necessary to run the downloads on multiple cores the way multiprocessing does.
So write a function that wraps requests.get() and start it as a thread, as sketched below.
But remember that your internet connection has to be fast enough, otherwise it won't be worth it.
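A minimal sketch along those lines using a thread pool, which also prints a success or error message per request as asked for; the URL list and worker count are placeholders:

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

urls = ['https://www.example.com/file1', 'https://www.example.com/file2']  # placeholder URLs

def fetch(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()              # turn HTTP errors into exceptions
    return url, len(response.content)

with ThreadPoolExecutor(max_workers=20) as executor:   # 20 workers is an arbitrary choice
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            _, size = future.result()
            print(f'OK    {url} ({size} bytes)')
        except Exception as exc:
            print(f'ERROR {url}: {exc}')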
I am trying to fetch data of all transactions for several addresses from an API. Each address can have several pages of transactions, which I find out only when I ask for first page.
I have methods api.get_address_data(address, page) and api.get_transaction_data(tx).
Synchronous code for what I want to do would look like this:
def all_transaction_data(addresses):
    for address in addresses:
        data = api.get_address_data(address, page=0)
        transactions = data.transactions
        for n in range(1, data.total_pages):
            next_page = api.get_address_data(address, page=n)
            transactions += next_page.transactions
        for tx in transactions:
            yield api.get_transaction_data(tx)
I don't care about the order of transactions received (I will have to reorder them when I have all of them anyway). I can fit all the data in memory, but it is a lot of very small requests, so I'd like to do as much in parallel as possible.
What is the best way to accomplish this? I was playing around with asyncio (the API calls are under my control, so I can convert them to async), but I have trouble interleaving the layers: my best solution fetches all the addresses first, lists all the pages second, and finally gets all the transactions in one big batch. I would like each processing step to be scheduled as soon as its input data is ready, and the results collected into one big list (or yielded from a single generator).
It seems that I need some sort of open-ended task queue, where task "get-address" fetches the data and enqueues a bunch of "get-pages" tasks, which in turn enqueue "get-transaction" tasks, and only these are then collected into a list of results?
Can this be done with asyncio? Would something like gevent be more suitable, or maybe just a plain ThreadPoolExecutor? Is there a better approach than what I outlined so far?
Note that I'd like to avoid inversion of control flow, or at least hide it as an implementation detail; i.e., the caller of this code should be able to just write for tx in all_transaction_data(...), or at worst async for.
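For what it's worth, a minimal sketch of that open-ended fan-out with plain asyncio, assuming the API methods have been converted to coroutines (async_get_address_data and async_get_transaction_data are hypothetical names for those async variants). Each page schedules its transaction fetches as soon as it arrives, while other addresses and pages are still in flight:

import asyncio

async def page_transactions(address, page_no):
    # fetch one page, then immediately fan out to its transactions
    page = await api.async_get_address_data(address, page=page_no)
    return await asyncio.gather(
        *(api.async_get_transaction_data(tx) for tx in page.transactions)
    )

async def address_transactions(address):
    # the first page tells us how many pages there are
    first = await api.async_get_address_data(address, page=0)
    first_txs = asyncio.gather(
        *(api.async_get_transaction_data(tx) for tx in first.transactions)
    )
    other_pages = asyncio.gather(
        *(page_transactions(address, n) for n in range(1, first.total_pages))
    )
    txs, rest = await asyncio.gather(first_txs, other_pages)
    return list(txs) + [tx for page in rest for tx in page]

async def all_transaction_data(addresses):
    per_address = await asyncio.gather(
        *(address_transactions(a) for a in addresses)
    )
    return [tx for txs in per_address for tx in txs]

# usage: transactions = asyncio.run(all_transaction_data(addresses))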
I'm writing a data processing pipeline using Celery because this speeds things up considerably.
Consider the following pseudo-code:
from celery.result import ResultSet
from some_celery_app import processing_task  # of type @app.task

def crunch_data():
    results = ResultSet([])
    for document in mongo.find():  # Around 100K - 1M documents
        job = processing_task.delay(document)
        results.add(job)
    return results.get()

collected_data = crunch_data()
# Do some stuff with this collected data
I successfully spawn four workers with concurrency enabled and when I run this script, the data is processed accordingly and I can do whatever I want.
I'm using RabbitMQ as message broker and rpc as backend.
What I see when I open the RabbitMQ management UI:
First, all the documents are processed;
then, and only then, are the results retrieved by the collective results.get() call.
My question: Is there a way to do the processing and subsequent retrieval simultaneously? In my case, as all documents are atomic entities that do not rely on each other, there seems to be no need to wait for the job to be processed completely.
You could try the callback parameter of ResultSet.get(callback=cbResult) and process each result in the callback as it arrives.
def cbResult(task_id, value):
    print(value)

results.get(callback=cbResult)
I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea of running multiple jobs asynchronously is to create/prepare as many jobs as you need and kick them off using the jobs.insert API (important: either collect all the respective job IDs or set your own; they just need to be unique). That API returns immediately, so you can kick them all off "very quickly" in one loop.
Meanwhile, you need to check the status of those jobs repeatedly (in a loop), and as soon as a job is done you can kick off processing of its result as needed.
You can find the details in Running asynchronous queries.
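A rough sketch of that pattern with the google-cloud-bigquery client (which issues jobs.insert under the hood when a query is started); the SQL statements here are placeholders:

import time
from google.cloud import bigquery

client = bigquery.Client()

statements = ['SELECT 1 AS x', 'SELECT 2 AS x']   # placeholder queries

# starting a query returns immediately with a job object (and a unique job id)
jobs = [client.query(sql) for sql in statements]

# poll until every job has finished, then collect the results
pending = list(jobs)
while pending:
    pending = [job for job in pending if not job.done()]
    time.sleep(1)

for job in jobs:
    for row in job.result():   # result() returns right away once the job is done
        print(job.job_id, dict(row))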
BigQuery jobs are always async by default; that being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, making it impossible to use with a single-threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client: bigquery.Client = bigquery.Client()

def run(name, statement):
    return name, client.query(statement).result()  # blocks the thread

def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result
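For example, with placeholder statements:

results = run_all({'q1': 'SELECT 1 AS x', 'q2': 'SELECT 2 AS x'})

Each value in results is then the corresponding RowIterator, keyed by the query name.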
P.S.: Some credits are due to @Fredrik Håård for this answer :)
My problem is how best to release the memory that the responses of asynchronous URL fetches need on App Engine. Here is what I basically do in Python:
from google.appengine.api import urlfetch

rpcs = []
for event in event_list:
    url = 'http://someurl.com'
    rpc = urlfetch.create_rpc()
    rpc.callback = create_callback(rpc)
    urlfetch.make_fetch_call(rpc, url)
    rpcs.append(rpc)

for rpc in rpcs:
    rpc.wait()
In my test scenario it does that for 1,500 requests. But I need an architecture that can handle even more within a short amount of time.
Then there is a callback function, which adds a task to a queue to process the results:
def event_callback(rpc):
    result = rpc.get_result()
    data = json.loads(result.content)
    taskqueue.add(queue_name='name', url='url', params={'data': data})
My problem is that I make so many concurrent RPC calls that the memory of my instance is exhausted: "Exceeded soft private memory limit with 159.234 MB after servicing 975 requests total"
I already tried three things:
del result
del data
and
result = None
data = None
and I ran the garbage collector manually after the callback function.
gc.collect()
But nothing seems to release the memory directly after a callback function has added its task to the queue, and so the instance crashes. Is there any other way to do it?
That is the wrong approach: put these URLs into a task queue instead, increase its rate to the desired value (default: 5/sec), and let each task handle one URL fetch (or a small group of them). Please note that there is a safety limit of 3,000 URL-fetch API calls per minute (and one URL fetch might use more than one API call).
Use the task queue for the urlfetch calls as well: fan out to avoid exhausting memory, register named tasks, and provide the event_list cursor to the next task. You might want to fetch and process in the same task in such a scenario, instead of registering a new task for every processing step, especially if processing also includes datastore writes, as sketched below.
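A rough sketch of that fan-out, assuming a push queue named 'fetch' and a /fetch_worker handler (both names are illustrative), where each task fetches one URL and processes it within the same request, so nothing large outlives that request:

from google.appengine.api import taskqueue, urlfetch
import json
import webapp2

def enqueue_fetches(urls):
    # one small task per URL instead of 1,500 RPCs in a single request
    for u in urls:
        taskqueue.add(queue_name='fetch', url='/fetch_worker', params={'target': u})

class FetchWorker(webapp2.RequestHandler):
    def post(self):
        target = self.request.get('target')
        result = urlfetch.fetch(target, deadline=30)
        data = json.loads(result.content)
        process(data)   # process() is a placeholder for your own handling

app = webapp2.WSGIApplication([('/fetch_worker', FetchWorker)])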
I also find that ndb makes these async solutions more elegant.
Check out Brett Slatkin's talk on scalable apps and perhaps Pipelines.