I'm using RPC to fetch multiple URLs asynchronously. I'm using a global variable to track completion, and I notice that the contents of that global are radically different before and after the RPC calls complete.
Feels like I'm missing something obvious... Is it possible for rpc.wait() to result in the app context being loaded on a new instance when the callbacks are made?
Here's the basic pattern...
aggregated_results = {}

def aggregateData(sid):
    # local variable tracking results
    aggregated_results[sid] = []

    # create a bunch of asynchronous url fetches to get all of the route data
    rpcs = []
    for r in routes:
        rpc = urlfetch.create_rpc()
        rpc.callback = create_callback(rpc,sid)
        urlfetch.make_fetch_call(rpc, url)
        rpcs.append(rpc)

    # all of the schedule URLs have been fetched. now wait for them to finish
    for rpc in rpcs:
        rpc.wait()

    # look at results
    try:
        if len(aggregated_results[sid]) == 0:
            logging.debug("We couldn't find results for transaction")
    except KeyError as e:
        logging.error('aggregation error: %s' % e.message)
        logging.debug(aggregated_results)

    return aggregated_results[sid]

def magic_callback(rpc, sid):
    # do some work to parse the result
    # of the urlfetch call...
    # <hidden>
    #
    try:
        if len(aggregated_results[sid]) == 0:
            aggregated_results[sid] = [stop]
        else:
            done = False
            for i, s in enumerate(aggregated_results[sid]):
                if stop.time <= s.time:
                    aggregated_results[sid].insert(i,stop)
                    done = True
                    break
            if not done:
                aggregated_results[sid].append(stop)
    except KeyError as e:
        logging.error('aggregation error: %s' % e.message)
The KeyError is thrown both inside the callback and at the end of processing all of the results. Neither of those should happen.
When I print out the contents of the dictionary, the sid is in fact gone, but there are other entries for other requests that are being processed. In some cases, there are more entries than I see when the respective request starts.
This pattern is called from a web request handler, not in the background.
It's as if the callbacks occur on a different instance.
The sid key in this case is a combination of strings that includes a time string and I'm confident it is unique.
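One way to rule out interference between requests is to keep the result list local to the request and hand it to the callback, instead of keying into a module-level dictionary that every request served by the same instance can see. The sketch below only illustrates that idea: `Stop`, `insert_sorted` and `make_callback` are hypothetical names, the parsing step is replaced by a ready-made stop, and no urlfetch call is made.

```python
import bisect
from collections import namedtuple

# Hypothetical stand-in for the parsed result of one fetch.
Stop = namedtuple("Stop", ["time", "name"])


def insert_sorted(results, stop):
    """Insert stop into results, keeping the list ordered by time."""
    times = [s.time for s in results]
    # bisect_left places the new stop before existing stops with an equal
    # time, matching the `stop.time <= s.time` test in the original loop.
    results.insert(bisect.bisect_left(times, stop.time), stop)


def make_callback(results, stop):
    """Return a zero-argument callback bound to this request's own list."""
    return lambda: insert_sorted(results, stop)


# Each request builds its own list, so nothing is shared between requests.
request_results = []
callbacks = [make_callback(request_results, Stop(t, n))
             for t, n in [(5, "c"), (1, "a"), (3, "b")]]
for callback in callbacks:
    callback()
print([s.name for s in request_results])
```

Whether shared state is the actual cause here is not confirmed by the question; the sketch only shows a shape in which the KeyError cannot occur.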
I have a Google Cloud Function triggered by PubSub. The documentation states that messages are acknowledged when the function ends successfully.
But randomly, the function retries (same execution ID) exactly 10 minutes after execution, which is the maximum PubSub ack deadline.
I also tried to get the message ID and acknowledge it programmatically in the function code, but the PubSub API responds that there is no message to ack with that ID.
In StackDriver monitoring, I see some messages not being acknowledged.
Here is my code: main.py
import base64
import logging
import traceback

from google.api_core import exceptions
from google.cloud import bigquery, error_reporting, firestore, pubsub

from sql_runner.runner import orchestrator

logging.getLogger().setLevel(logging.INFO)


def main(event, context):
    bigquery_client = bigquery.Client()
    firestore_client = firestore.Client()
    publisher_client = pubsub.PublisherClient()
    subscriber_client = pubsub.SubscriberClient()
    logging.info(
        'event=%s',
        event
    )
    logging.info(
        'context=%s',
        context
    )
    try:
        query_id = base64.b64decode(event.get('data',b'')).decode('utf-8')
        logging.info(
            'query_id=%s',
            query_id
        )
        # inject dependencies
        orchestrator(
            query_id,
            bigquery_client,
            firestore_client,
            publisher_client
        )
        sub_path = (context.resource['name']
                    .replace('topics', 'subscriptions')
                    .replace('function-sql-runner', 'gcf-sql-runner-europe-west1-function-sql-runner')
                    )
        # explicitly ack message to avoid duplicates invocations
        try:
            subscriber_client.acknowledge(
                sub_path,
                [context.event_id] # message_id to ack
            )
            logging.warning(
                'message_id %s acknowledged (FORCED)',
                context.event_id
            )
        except exceptions.InvalidArgument as err:
            # google.api_core.exceptions.InvalidArgument: 400 You have passed an invalid ack ID to the service (ack_id=982967258971474).
            logging.info(
                'message_id %s already acknowledged',
                context.event_id
            )
            logging.debug(err)
    except Exception as err:
        # catch all exceptions and log to prevent cold boot
        # report with error_reporting
        error_reporting.Client().report_exception()
        logging.critical(
            'Internal error : %s -> %s',
            str(err),
            traceback.format_exc()
        )


if __name__ == '__main__': # for testing
    from collections import namedtuple # use namedtuple to avoid Class creation
    Context = namedtuple('Context', 'event_id resource')
    context = Context('666', {'name': 'projects/my-dev/topics/function-sql-runner'})
    script_to_start = b' ' # launch the 1st script
    script_to_start = b'060-cartes.sql'
    main(
        event={"data": base64.b64encode(script_to_start)},
        context=context
    )
Here is my code : runner.py
import logging
import os

from retry import retry

PROJECT_ID = os.getenv('GCLOUD_PROJECT') or 'my-dev'


def orchestrator(query_id, bigquery_client, firestore_client, publisher_client):
    """
    if query_id empty, start the first sql script
    else, call the given query_id.
    Anyway, call the next script.
    If the sql script is the last, no call
    retrieve SQL queries from FireStore
    run queries on BigQuery
    """
    docs_refs = [
        doc_ref.get() for doc_ref in
        firestore_client.collection(u'sql_scripts').list_documents()
    ]
    sorted_queries = sorted(docs_refs, key=lambda x: x.id)
    if not bool(query_id.strip()) : # first execution
        current_index = 0
    else:
        # find the query to run
        query_ids = [ query_doc.id for query_doc in sorted_queries]
        current_index = query_ids.index(query_id)
    query_doc = sorted_queries[current_index]
    bigquery_client.query(
        query_doc.to_dict()['request'], # sql query
    ).result()
    logging.info(
        'Query %s executed',
        query_doc.id
    )
    # exit if the current query is the last
    if len(sorted_queries) == current_index + 1:
        logging.info('All scripts were executed.')
        return
    next_query_id = sorted_queries[current_index+1].id.encode('utf-8')
    publish(publisher_client, next_query_id)


@retry(tries=5)
def publish(publisher_client, next_query_id):
    """
    send a message in pubsub to call the next query
    this mechanism allows running one sql script per Function instance
    so as to not exceed the 9min deadline limit
    """
    logging.info('Calling next query %s', next_query_id)
    future = publisher_client.publish(
        topic='projects/{}/topics/function-sql-runner'.format(PROJECT_ID),
        data=next_query_id
    )
    # ensure publish is successful
    message_id = future.result()
    logging.info('Published message_id = %s', message_id)
It looks like the PubSub message is not acknowledged on success.
I do not think I have background activity in my code.
My question: why does my function randomly retry even after it succeeds?
Cloud Functions does not guarantee that your functions will run exactly once. According to the documentation, background functions, including pubsub functions, are given an at-least-once guarantee:
Background functions are invoked at least once. This is because of the
asynchronous nature of handling events, in which there is no caller
that waits for the response. The system might, in rare circumstances,
invoke a background function more than once in order to ensure
delivery of the event. If a background function invocation fails with
an error, it will not be invoked again unless retries on failure are
enabled for that function.
Your code will need to expect that it could possibly receive an event more than once. As such, your code should be idempotent:
To make sure that your function behaves correctly on retried execution
attempts, you should make it idempotent by implementing it so that an
event results in the desired results (and side effects) even if it is
delivered multiple times. In the case of HTTP functions, this also
means returning the desired value even if the caller retries calls to
the HTTP function endpoint. See Retrying Background Functions for more
information on how to make your function idempotent.
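A minimal sketch of that idempotency advice, assuming the event ID stays the same across redeliveries (the question reports the same execution ID on the retry). The `processed_ids` set and the `handle_once` helper are hypothetical; a real function would need a durable store such as a database record, because in-memory state is not shared between function instances.

```python
processed_ids = set()  # hypothetical stand-in for a durable store


def handle_once(event_id, work):
    """Run work() only the first time event_id is seen; return True if it ran."""
    if event_id in processed_ids:
        return False  # duplicate delivery: skip the side effects
    work()
    # Record the ID only after the work succeeded, so a failed attempt
    # can still be retried later.
    processed_ids.add(event_id)
    return True


runs = []
first = handle_once("982967258971474", lambda: runs.append("query executed"))
second = handle_once("982967258971474", lambda: runs.append("query executed"))
print(first, second, len(runs))
```

With this shape, a second delivery of the same event becomes a no-op instead of running the SQL script and publishing the next message again.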
I have a large table in BigQuery that I have to go through, fetching all the data and processing it in my GAE app. Since my table is going to be about 4 million rows, I decided to get the data via the pagination mechanism implemented in the code examples here: https://cloud.google.com/bigquery/querying-data
import time
import uuid

from google.cloud import bigquery


def async_query(query):
    client = bigquery.Client()
    query_job = client.run_async_query(str(uuid.uuid4()), query)
    query_job.use_legacy_sql = False
    query_job.use_query_cache = False
    query_job.begin()
    wait_for_job(query_job)
    query_results = query_job.results()
    page_token = None
    output_rows = []
    while True:
        rows, total_rows, page_token = query_results.fetch_data(max_results = 200, page_token = page_token)
        output_rows = output_rows + rows
        if not page_token:
            break


def wait_for_job(job):
    while True:
        job.reload() # Refreshes the state via a GET request.
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)
But when I execute it, I receive an error:
DeadlineExceededError: The overall deadline for responding to the HTTP request was exceeded.
When the max_results parameter is larger than the table size, it works fine. When max_results is smaller than the table size and pagination is required, I get this error.
Am I missing something?
The error indicates that your overall request handler processing takes too long, very likely because of the multiple query_results.fetch_data iterations needed for pagination.
You may want to check:
- Dealing with DeadlineExceededErrors
- Deadline errors: 60 seconds or less in Google App Engine
You'll probably have to re-think your app a bit, maybe try to not get the whole result immediately and instead either:
- get just a portion of the result
- get the result on a separate request, later on, after obtaining it in the background, for example:
  - via a single task queue request (10 min instead of 60s deadline)
  - by assembling it from multiple chunks collected in separate task queue requests to really make it scalable (not sure if this actually works with bigquery, I only tried it with the datastore)
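The chunked approach can be sketched as follows: each invocation handles one page and returns the token for the next one, which the caller would hand to a follow-up task. Everything here is an assumption made for illustration: `fetch_page` is a fake in-memory stand-in for the BigQuery call, `process_one_page` is a hypothetical helper, and the plain loop at the end takes the place of enqueuing task queue requests.

```python
ROWS = list(range(10))  # fake table contents


def fetch_page(page_token, max_results):
    """Fake paged fetch: returns (rows, next_page_token)."""
    start = page_token or 0
    end = start + max_results
    next_token = end if end < len(ROWS) else None
    return ROWS[start:end], next_token


def process_one_page(page_token, max_results, sink):
    """Handle a single page, then report where the next invocation resumes."""
    rows, next_token = fetch_page(page_token, max_results)
    sink.extend(rows)  # stand-in for the real per-row processing
    return next_token


processed = []
token = None
invocations = 0
while True:
    # In the task queue variant, each iteration would be its own short request.
    token = process_one_page(token, 4, processed)
    invocations += 1
    if token is None:
        break
print(invocations, processed == ROWS)
```

Because each step only needs the page token to resume, no single request has to stay alive for the whole table.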
I have a method that pulls the information about a specific gerrit revision, and there are two possible endpoints that it could live under. Given the following sample values:
revision_id = rev1
change_id = chg1
proj_branch_id = pbi1
The revision can either live under
/changes/chg1/revisions/rev1/commit
or
/changes/pbi1/revisions/rev1/commit
I'm trying to handle this cleanly with minimal code duplication as follows:
@retry(
    wait_exponential_multiplier=1000,
    stop_max_delay=3000
)
def get_data(
    self,
    api_obj: GerritAPI,
    writer: cu.LockingWriter = None,
    test: bool = False
):
    """
    Make an HTTP Request using a pre-authed GerritAPI object to pull the
    JSON related to itself. It checks the endpoint using just the change
    id first, and if that doesn't return anything, will try the endpoint
    using the full id. It also cleans the data and adds ETL tracking
    fields.

    :param api_obj: An authed GerritAPI object.
    :param writer: A LockingWriter object.
    :param test: If True then the data will be returned instead of written
        out.
    :raises: If an HTTP Error occurs, an alternate endpoint will be
        attempted. If this also raises an HTTP Error then the error is
        re-raised and propagated to the caller.
    """
    logging.debug('Getting data for a commit for {f_id}'.format(
        f_id=self.proj_branch_id
    ))
    endpoint = (r'/changes/{c_id}/revisions/{r_id}/commit'.format(
        c_id=self.change_id,
        r_id=self.revision_id
    ))
    try:
        data = api_obj.get(endpoint)
    except HTTPError:
        try:
            endpoint = (r'/changes/{f_id}/revisions/{r_id}/commit'.format(
                f_id=self.proj_branch_id,
                r_id=self.revision_id
            ))
            data = api_obj.get(endpoint)
        except HTTPError:
            logging.debug('Neither endpoint returned data: {ep}'.format(
                ep=endpoint
            ))
            raise
        else:
            data['name'] = data.get('committer')['name']
            data['email'] = data.get('committer')['email']
            data['date'] = data.get('committer')['date']
            mu.clean_data(data)
    except ReadTimeout:
        logging.warning('Read Timeout occurred for a commit. Endpoint: '
                        '{ep}'.format(ep=endpoint))
    else:
        data['name'] = data.get('committer')['name']
        data['email'] = data.get('committer')['email']
        data['date'] = data.get('committer')['date']
        mu.clean_data(data)
    finally:
        try:
            data.update(self.__dict__)
        except NameError:
            data = self.__dict__
        if test:
            return data
        mu.add_etl_fields(data)
        self._write_data(data, writer)
I don't much like that I'm repeating the portion under the else, so I'm wondering if there is a cleaner way to handle this. As a side note, as it currently stands my program will write out up to 3 times for every commit if it hits an HTTPError. Is creating an instance variable self.written, which tracks whether the data has already been written out, a best-practice way to do this?
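One possible way to remove the duplicated else branch is to loop over the candidate endpoints and run the post-processing once, after the first successful request. This is only a sketch: `HTTPError` and `FakeAPI` below are local stand-ins so the example runs without a Gerrit server, and `fetch_first_available` is a hypothetical helper, not part of any library.

```python
class HTTPError(Exception):
    """Local stand-in for the HTTP error raised by the real client."""


class FakeAPI:
    """Hypothetical API object that only knows some endpoints."""

    def __init__(self, known):
        self.known = known

    def get(self, endpoint):
        if endpoint not in self.known:
            raise HTTPError(endpoint)
        return dict(self.known[endpoint])


def fetch_first_available(api_obj, endpoints):
    """Return data from the first endpoint that answers; re-raise the last error."""
    last_error = None
    for endpoint in endpoints:
        try:
            data = api_obj.get(endpoint)
        except HTTPError as err:
            last_error = err
            continue
        # Post-processing lives in one place, whichever endpoint succeeded.
        committer = data["committer"]
        data.update(name=committer["name"], email=committer["email"],
                    date=committer["date"])
        return data
    raise last_error


template = "/changes/{}/revisions/{}/commit"
endpoints = [template.format("chg1", "rev1"), template.format("pbi1", "rev1")]
api = FakeAPI({endpoints[1]: {"committer": {
    "name": "A. Dev", "email": "dev@example.com", "date": "2020-01-01"}}})
data = fetch_first_available(api, endpoints)
print(data["name"])
```

Writing the data once, after such a helper returns, may also make a separate self.written flag unnecessary, since the write no longer sits in a finally block that runs on every attempt.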
I have a simple long poll thing using python3 and the requests package. It currently looks something like:
def longpoll():
    session = requests.Session()
    while True:
        try:
            fetched = session.get(MyURL)
            data = base64.b64decode(fetched.content)
            output = process(data)
            session.put(MyURL, data=base64.b64encode(output))
        except Exception as e:
            print(e)
            time.sleep(10)
There is a case where, instead of processing the input and putting the result, I'd like to raise an HTTP error. Is there a simple way to do this from the high-level Session interface? Or do I have to drill down to use the lower-level objects?
Since you have control over the server, you may want to reverse the second call.
Here is an example using bottle to receive the second poll:
def longpoll():
    session = requests.Session()
    while True: #I'm guessing that the server does not care that we call him a lot of times ...
        try:
            session.post(MyURL, {"ip_address": my_ip_address}) # request work or I'm alive
            #input = base64.b64decode(fetched.content)
            #output = process(data)
            #session.put(MyURL, data=base64.b64encode(response))
        except Exception as e:
            print(e)
            time.sleep(10)

@bottle.post("/process")
def process_new_work():
    data = bottle.request.json # parsed JSON body (a property, not a method)
    output = process(data) #if an error is thrown an HTTP error will be returned by the framework
    return output
This way the server will get either the output or a bad HTTP status.
I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the urls that I am fetching are not related or dependent on each other, performing this asynchronously would be the most ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to read the contents for each url. I've also searched the web for a small example (which is really what I need). I have seen this SO question, but again, it doesn't mention anything about reading the contents of these individual asynchronous url fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
    total_facebook_photo_count = total_facebook_photo_count + album['count']
    facebook_albumid_array.append(album['id'])
    #Get the photos in the photo album
    facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, facebook_photos_url)
    rpcs.append(rpc)
for rpc in rpcs:
    result = rpc.get_result()
    self.response.out.write(result.content)
However, it still looks like the line result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results in variables as they are received?
Thanks!
In the example, text = result.content is where you get the content (body).
To do url fetches in parallel, you could set them up, add them to a list and check the results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch

futures = []
for url in urls:
    rpc = urlfetch.create_rpc()
    urlfetch.make_fetch_call(rpc, url)
    futures.append(rpc)

contents = []
for rpc in futures:
    try:
        result = rpc.get_result()
        if result.status_code == 200:
            contents.append(result.content)
            # ...
    except urlfetch.DownloadError:
        # Request timed out or failed.
        # ...
        pass

concatenated_result = '\n'.join(contents)
In this example, we collect the bodies of all the requests that returned status code 200 and concatenate them with a line break between each.
Or with ndb, my personal preference for anything async on GAE, something like:
@ndb.tasklet
def get_urls(urls):
    ctx = ndb.get_context()
    result = yield map(ctx.urlfetch, urls)
    contents = [r.content for r in result if r.status_code==200]
    raise ndb.Return('\n'.join(contents))
I use this code (implemented before I learned about ndb tasklets):
while rpcs:
    rpc = UserRPC.wait_any(rpcs)
    result = rpc.get_result()
    # process result here
    rpcs.remove(rpc)
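For readers without the App Engine SDK at hand, the same "handle whichever request finishes first" idea can be illustrated with the standard library's concurrent.futures. This is an analogue, not the urlfetch API: `fake_fetch` is a hypothetical stand-in that sleeps instead of making a network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(name, delay):
    """Hypothetical stand-in for a URL fetch: sleep, then return a body."""
    time.sleep(delay)
    return "body of " + name


jobs = [("slow", 0.3), ("fast", 0.05)]
completed = []
with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
    futures = [pool.submit(fake_fetch, name, delay) for name, delay in jobs]
    # as_completed yields each future as soon as it finishes,
    # regardless of the order in which the jobs were submitted.
    for future in as_completed(futures):
        completed.append(future.result())
print(completed)
```

As with the wait_any loop, results are consumed in completion order, so a slow fetch does not hold up the processing of the faster ones.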