I've been working off of Google Cloud Platform's Python API library. I've had much success with these API samples out-of-the-box, but I'd like to streamline it a bit further by combining the three queries I need to run (and subsequent tables that will be created) into a single file. Although the documentation mentions being able to run multiple jobs asynchronously, I've been having trouble figuring out the best way to accomplish that.
Thanks in advance!
The idea behind running multiple jobs asynchronously is to create/prepare as many jobs as you need and kick them all off with the jobs.insert API (important: either collect the respective job IDs or set your own - they just need to be unique). That API call returns immediately, so you can kick them all off "very quickly" in one loop.
Meanwhile, you need to check the status of those jobs repeatedly (in a loop), and as soon as a job is done you can kick off processing of its result as needed.
You can check Running asynchronous queries for details.
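Here is a minimal sketch of that insert-then-poll pattern with the Python client library (the statements and table names below are placeholders, not your actual three queries):

from google.cloud import bigquery
import time

client = bigquery.Client()

# Placeholder statements - swap in your three real queries / destination tables.
statements = {
    "table_one": "SELECT 1 AS x",
    "table_two": "SELECT 2 AS x",
    "table_three": "SELECT 3 AS x",
}

# client.query() inserts the job and returns immediately, so all jobs start right away.
pending = {name: client.query(sql) for name, sql in statements.items()}

# Poll until every job reports done, then process its result.
while pending:
    for name, job in list(pending.items()):
        if job.done():           # checks job state; does not wait
            rows = job.result()  # the job is already finished, so this returns quickly
            print(name, "finished with", rows.total_rows, "row(s)")
            del pending[name]
    time.sleep(1)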
BigQuery jobs are always async by default; this being said, requesting the result of the operation isn't. As of Q4 2021, the Python API does not support a proper async way to collect results. Each call to job.result() blocks the thread, hence making it impossible to use with a single threaded event loop like asyncio. Thus, the best way to collect multiple job results is by using multithreading:
from typing import Dict
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

client: bigquery.Client = bigquery.Client()

def run(name, statement):
    return name, client.query(statement).result()  # blocks the thread

def run_all(statements: Dict[str, str]):
    with ThreadPoolExecutor() as executor:
        jobs = []
        for name, statement in statements.items():
            jobs.append(executor.submit(run, name, statement))
        result = dict([job.result() for job in jobs])
    return result
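A hypothetical usage, with placeholder statements standing in for your real queries:

results = run_all({
    "daily": "SELECT 1 AS x",
    "weekly": "SELECT 2 AS x",
    "monthly": "SELECT 3 AS x",
})
for name, rows in results.items():
    print(name, rows.total_rows)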
P.S.: Some credits are due to @Fredrik Håård for this answer :)
I have a library that makes calls to smart contracts on the Ethereum chain to read data.
So for simplicity, my code is like this:
import library

items = [
    "address1",
    "address2",
    "address3",
]

def main():
    for item in items:
        data = library.get_smartcontractinfo(item)
        print(data)

if __name__ == '__main__':
    main()
I am new to concurrency, and it's a topic I need to explore further; there are many options for doing concurrency, but asyncio seems to be the one most people go for.
The library I am using is not built with asyncio or any sort of concurrency in mind. That means each call to library.get_smartcontractinfo() has to finish before the next iteration can start, which is what limits the speed.
Let's say I cannot modify the library (although maybe I will in the future), but I want to get something done asap with the existing code.
What would be the easiest way to do simultaneous queries so I can get the info as fast as I can in an efficient way?
What about being rate limited? And would it be possible to group these calls into one without rewriting the library code?
Thank you.
Assuming that library.get_smartcontractinfo() does a lot of network I/O, you could use a ThreadPoolExecutor from concurrent.futures to run more of them in parallel.
The documentation has a good example.
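A minimal sketch, assuming library.get_smartcontractinfo() is thread-safe and spends most of its time waiting on the network (the addresses and the max_workers value are placeholders):

from concurrent.futures import ThreadPoolExecutor, as_completed
import library  # the blocking library from the question

items = ["address1", "address2", "address3"]

# Keep max_workers modest if the endpoint rate-limits you.
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(library.get_smartcontractinfo, item): item for item in items}
    for future in as_completed(futures):
        print(futures[future], future.result())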
Assuming the function library.get_smartcontractinfo() is I/O bound, you have multiple options with asyncio. If you want to use pure asyncio, you can go with something like
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    all_runs = [loop.run_in_executor(None, library.get_smartcontractinfo, item) for item in items]
    results = await asyncio.gather(*all_runs)
Basically, this runs the sync function in a thread. To run them concurrently, you first create all the awaitables without awaiting them, and finally pass them into gather.
If you want to use an additional library, I can recommend anyio or asyncer, which is basically a nice wrapper around anyio. With `asyncer`, you basically change the one line where you turn the sync function into an async one to
from asyncer import asyncify
...
all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
the rest stays the same.
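Put together, a minimal sketch might look like this (assuming, as above, that library.get_smartcontractinfo() is a blocking, thread-safe function):

import asyncio
from asyncer import asyncify
import library  # the blocking library from the question

items = ["address1", "address2", "address3"]

async def main():
    # asyncify() runs each blocking call in a worker thread; gather awaits them all concurrently.
    all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
    results = await asyncio.gather(*all_runs)
    for item, result in zip(items, results):
        print(item, result)

asyncio.run(main())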
My employer uses Box. Its API is very slow. Fortunately our files are largely static. Nightly I can iterate (recursively) over Box folders and store the URLs in a local file. Using the local file during the day substantially improves the performance of our scripts that read from and write to Box.
Starting the recursive search (spider) at level 0 includes folders we don't care about. So we have a named list of starting points from level 1. I'd like to recursively search them in parallel.
When I observe the code below (via logging/print statements I have hidden), it does not seem to search under the starting points in parallel. It instead searches the entire tree under starting point 1, then the tree under starting point 2, and so on.
My question is: why does the code below not execute the spider method concurrently for each pair in zip(cache_dict_list, starting_dir_list)?
import asyncio

@asyncio.coroutine
def spider(storage_dict, dir_list):
    """Recursive storage of Box information in storage_dict."""
    storage_dict = {"key": "value"}

cache_dict_list = [dict() for x in starting_dir_list]
task_list = list()

async def main():
    for storage_dict, starting_point in zip(cache_dict_list, starting_dir_list):
        task_list.append(asyncio.create_task(spider(storage_dict, [starting_point])))
    await asyncio.gather(*task_list)

asyncio.run(main())

total_dict = dict()
for cache_dict in cache_dict_list:
    total_dict.update(cache_dict)
The reason is basically that async isn't multithreading (more on threading later). Async queues up tasks which are executed by the event loop. So when you await asyncio.gather(*task_list) you are saying "put all these tasks in the queue(ish) and wait until they are done." If you used more async and await statements within spider() you could split it up further in the queue, but ultimately it would still take about as long, since only one item in the queue is processed at a time.
Then there is threading. This (kinda) allows for concurrency. However, it isn't much better if you are resource-capped, because CPython uses a global interpreter lock (GIL). The GIL means that a single Python process can only utilize one core at a time, which avoids issues that can happen when multiple cores try to access and modify data at the same time.
However, if you want true parallelism, you can use the multiprocessing module. How you implement this probably depends on exactly how you want to get and store your data (in order to avoid the issues with multiple cores that are the reason for the GIL), but basically it will allow you to use multiple cores concurrently.
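For illustration, a minimal multiprocessing sketch (the spider body and folder names here are placeholders; note that processes don't share memory, so each spider should return its results rather than mutate a shared dict):

from multiprocessing import Pool

def spider(starting_point):
    """Placeholder: walk the Box tree under starting_point and return {path: url}."""
    return {starting_point: "https://app.box.com/placeholder"}

starting_dir_list = ["Folder A", "Folder B", "Folder C"]  # placeholder starting points

if __name__ == "__main__":
    with Pool(processes=len(starting_dir_list)) as pool:
        # Each starting point is crawled in its own process, in parallel.
        partial_dicts = pool.map(spider, starting_dir_list)

    total_dict = {}
    for partial in partial_dicts:
        total_dict.update(partial)
    print(total_dict)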
I'm writing a data processing pipeline using Celery because this speeds things up considerably.
Consider the following pseudo-code:
from celery.result import ResultSet
from some_celery_app import processing_task  # of type @app.task

def crunch_data():
    results = ResultSet([])
    for document in mongo.find():  # Around 100K - 1M documents
        job = processing_task.delay(document)
        results.add(job)
    return results.get()

collected_data = crunch_data()
# Do some stuff with this collected data
I successfully spawn four workers with concurrency enabled and when I run this script, the data is processed accordingly and I can do whatever I want.
I'm using RabbitMQ as message broker and rpc as backend.
What I see when I open the RabbitMQ management UI:
First, all the documents are processed
then, and only then, are the results retrieved by the collective results.get() call.
My question: Is there a way to do the processing and subsequent retrieval simultaneously? In my case, as all documents are atomic entities that do not rely on each other, there seems to be no need to wait for the job to be processed completely.
You could try the callback parameter in ResultSet.get(callback=cbResult) and then you could process the result in the callback.
def cbResult(task_id, value):
    print(value)

results.get(callback=cbResult)
I have created a module that does some heavy computations and returns some data to be stored in a NoSQL database. The computation process is started via a POST request in my Flask application. The Flask function executes the computation code, and then the returned results are stored in the db. I was thinking of Celery. But I am wondering, and haven't found any clear info on it, whether it would be possible to use Python threading instead, e.g.
from mysci_module import heavy_compute

@route('/initiate_task/', methods=['POST',])
def run_computation():
    import thread
    thread.start_new_thread(heavy_compute, post_data)
    return response
It's very abstract, I know. The only problem I see in this method is that my function will have to know about, and be responsible for, storing data in the database, so it is not very independent of the database used. Correct? Why is Celery better (is it really?) than the method above?
Since CPython's global interpreter lock (GIL) prevents true concurrency with threads, all the computations will in fact happen serially. Instead, you could use the Python multiprocessing module and create a pool of processes to complete your heavy computation task (see the sketch below).
There are a few microframeworks, such as Twisted Klein, apart from Celery, that can also help achieve the concurrency and independence you're looking for. They aren't necessarily better, but they are available for those who don't want to get their hands messy with the various issues that are likely to come up when synchronizing Flask with the actual business logic, especially when the response depends on that activity.
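A rough sketch of the multiprocessing route (the heavy_compute and save_to_db bodies below are stand-ins, not your actual module, and a production WSGI setup may want to create the pool somewhere else):

from multiprocessing import Pool
from flask import Flask, request, jsonify

def heavy_compute(data):
    return {"input": data, "answer": sum(range(10_000_000))}  # stand-in computation

def save_to_db(result):
    pass  # stand-in: write the result to your NoSQL store here

app = Flask(__name__)
pool = None  # created in the main process only, see below

@app.route("/initiate_task/", methods=["POST"])
def run_computation():
    post_data = request.get_json(silent=True)
    # apply_async returns immediately; the callback persists the result once the worker finishes.
    pool.apply_async(heavy_compute, (post_data,), callback=save_to_db)
    return jsonify({"status": "started"}), 202

if __name__ == "__main__":
    pool = Pool(processes=4)  # one long-lived pool instead of forking per request
    app.run()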
I would suggest the following method: start a thread for the long procedure first, then leave Flask to communicate with the procedure from time to time, depending on your requirements:
from mysci_module import heavy_compute
import thread

thread.start_new_thread(heavy_compute, post_data)

@route('/initiate_task/', methods=['POST',])
def check_computation():
    response = heavy_compute.status
    return response
The best part of this method is that you always have a thread running in the background, while Flask can still fetch the necessary results (and even pass some parameters to the task) whenever it needs to.
If this is an idiotic question, I apologize and will go hide my head in shame, but:
I'm using rq to queue jobs in Python. I want it to work like this:
Job A starts. Job A grabs data via web API and stores it.
Job A runs.
Job A completes.
Upon completion of A, job B starts. Job B checks each record stored by job A and adds some additional response data.
Upon completion of job B, user gets a happy e-mail saying their report's ready.
My code so far:
from redis import Redis
from rq import Queue, Worker, use_connection

redis_conn = Redis()
use_connection(redis_conn)
q = Queue('normal', connection=redis_conn)  # this is terrible, I know - fixing later
w = Worker(q)
job = q.enqueue(getlinksmod.lsGet, theURL, total, domainid)
w.work()
I assumed my best solution was to have two workers, one for job A and one for job B. The job B worker could monitor job A and, when job A was done, get started on job B.
What I can't figure out to save my life is how to get one worker to monitor the status of another. I can grab the job ID from job A with job.id. I can grab the worker name with w.name. But I haven't the foggiest idea how to pass any of that information to the other worker.
Or, is there a much simpler way to do this that I'm totally missing?
Update January 2015: this pull request has now been merged, and the parameter has been renamed to depends_on, i.e.:
second_job = q.enqueue(email_customer, depends_on=first_job)
The original post left intact for people running older versions and such:
I have submitted a pull request (https://github.com/nvie/rq/pull/207) to handle job dependencies in RQ. When this pull request gets merged in, you'll be able to do:
def generate_report():
    pass

def email_customer():
    pass

first_job = q.enqueue(generate_report)
second_job = q.enqueue(email_customer, after=first_job)
# In the second enqueue call, job is created,
# but only moved into queue after first_job finishes
For now, I suggest writing a wrapper function to sequentially run your jobs. For example:
def generate_report():
    pass

def email_customer():
    pass

def generate_report_and_email():
    generate_report()
    email_customer()  # You can also enqueue this function, if you really want to

# Somewhere else
q.enqueue(generate_report_and_email)
From this page on the rq docs, it looks like each job object has a result attribute, accessible as job.result, which you can check. If the job hasn't finished, it'll be None, but if you make sure your job returns some value (even just "Done"), then you can have your other worker check the result of the first job and begin working only when job.result has a value, meaning the first worker has completed.
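For example, a minimal polling sketch (the two job functions here are stand-ins for your real ones, and in practice they would have to live in a module the workers can import):

import time
from redis import Redis
from rq import Queue

def job_a(url):
    return "Done"  # stand-in for getlinksmod.lsGet

def job_b(previous_result):
    print("Job A said:", previous_result)  # stand-in for the follow-up processing

q = Queue('normal', connection=Redis())
first = q.enqueue(job_a, "http://example.com")

# job.result stays None until a worker has finished the job, so poll it.
while first.result is None:
    time.sleep(5)

q.enqueue(job_b, first.result)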
You are probably too deep into your project to switch, but if not, take a look at Twisted (http://twistedmatrix.com/trac/). I am using it right now for a project that hits APIs, scrapes web content, etc. It runs multiple jobs in parallel, and can also organize certain jobs to run in order, so Job B doesn't execute until Job A is done.
This is the best tutorial for learning Twisted if you want to attempt it: http://krondo.com/?page_id=1327
Combine the things that job A and job B do into one function, and then use e.g. multiprocessing.Pool (its map_async method) to farm that out over different processes.
I'm not familiar with rq, but multiprocessing is part of the standard library. By default it uses as many processes as your CPU has cores, which in my experience is usually enough to saturate the machine.
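A minimal sketch of that idea (the fetch/enrich body is a placeholder for whatever jobs A and B actually do):

from multiprocessing import Pool

def fetch_and_enrich(record_id):
    """Do job A's work for one record, then job B's enrichment on the same record."""
    data = {"id": record_id}                 # placeholder for the web-API fetch (job A)
    data["extra"] = f"enriched-{record_id}"  # placeholder for the added response data (job B)
    return data

if __name__ == "__main__":
    record_ids = range(100)
    with Pool() as pool:                     # defaults to one process per CPU core
        async_result = pool.map_async(fetch_and_enrich, record_ids)
        results = async_result.get()         # blocks until every record has been processed
    print(len(results), "records processed")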