I have some trouble understanding a sequence of events causing a bug in my application, which can only be seen intermittently in the app deployed on GAE and never when running with the local dev_appserver.py.
All the related code snippets below (trimmed down to a minimal example, hopefully without losing anything significant) are executed while handling the same task queue request.
The entry point:
def job_completed_task(self, _):
    # running outside transaction as query is made
    if not self.all_context_jobs_completed(self.context.db_key, self):
        # this will transactionally enqueue another task
        self.trigger_job_mark_completed_transaction()
    else:
        # this is transactional
        self.context.jobs_completed(self)
The corresponding self.context.jobs_completed(self) is:
@ndb.transactional(xg=True)
def jobs_completed(self, job):
    if self.status == QAStrings.status_done:
        logging.debug('%s jobs_completed %s NOP' % (self.lid, job.job_id))
        return
    # some logic computing step_completed here
    if step_completed:
        self.status = QAStrings.status_done  # includes self.db_data.put()
        # this will transactionally enqueue another task
        job.trigger_job_mark_completed_transaction()
The self.status setter, hacked to obtain a traceback for debugging this scenario:
@status.setter
def status(self, new_status):
    assert ndb.in_transaction()
    status = getattr(self, self.attr_status)
    if status != new_status:
        traceback.print_stack()
        logging.info('%s status change %s -> %s' % (self.name, status, new_status))
    setattr(self, self.attr_status, new_status)
The job.trigger_job_mark_completed_transaction() eventually enqueues a new task like this:
task = taskqueue.add(queue_name=self.task_queue_name, url=url, params=params,
transactional=ndb.in_transaction(), countdown=delay)
The GAE log for the occurrence, split as it doesn't fit into a single screen:
My expectation for the jobs_completed transaction is to either see the ... jobs_completed ... NOP debug message and no task enqueued, or to see the status change running -> done info message and a task enqueued by job.trigger_job_mark_completed_transaction().
What I'm actually seeing is both messages and no task enqueued.
The logs appear to indicate the transaction is attempted twice:
the 1st time it finds the status not done, so it executes the logic, sets the status to done (displaying the traceback and the info message) and should transactionally enqueue the new task - but it doesn't
the 2nd time it finds the status done and just prints the debug message
My question is: if the 1st transaction attempt fails, shouldn't the status change be rolled back as well? What am I missing?
I found a workaround: specifying no retries for the jobs_completed() transaction:
@ndb.transactional(xg=True, retries=0)
def jobs_completed(self, job):
This prevents the automatic repeated execution, instead causing an exception:
TransactionFailedError(The transaction could not be committed. Please
try again.)
Which is acceptable, as I already have a back-off/retry safety net in place for the entire job_completed_task(). Things are OK now.
As for why the rollback didn't happen, the only thing that crosses my mind is that somehow the entity was read (and cached in my object attribute) prior to entering the transaction, thus not being considered part of the (same) transaction. But I couldn't find a code path that would do that, so it's just speculation.
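If it helps to picture that, here is a minimal, hypothetical sketch (made-up names, assuming ndb's documented behaviour of re-running the transactional function on a commit conflict): datastore puts and transactionally enqueued tasks from the failed attempt are discarded, but plain Python state mutated during that attempt is not:

from google.appengine.ext import ndb

cached = {'status': 'running'}   # plain Python state, not a datastore entity

@ndb.transactional(xg=True)
def jobs_completed_sketch(entity_key):
    if cached['status'] == 'done':
        return                       # a 2nd attempt sees the change made by the failed 1st one
    cached['status'] = 'done'        # survives a failed commit; nothing rolls this back
    entity = entity_key.get()
    entity.status = 'done'
    entity.put()                     # this put (and any transactional task enqueue) IS discarded on conflict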
In the following code, an API gives a task to a task broker, which puts it in a queue, where it is picked up by a worker. The worker then executes the task and notifies the task broker (using a Redis message channel) that it is done, after which the task broker removes it from its queue. This works.
What I'd like is for the task broker to then be able to return the result of the task to the API. But I'm unsure how to do so, since this is asynchronous code and I'm having difficulty figuring it out. Can you help?
Simplified, the code is roughly as follows, but incomplete.
The API code:
@router.post('', response_model=BaseDocument)
async def post_document(document: BaseDocument):
    """Create the document with a specific type and an optional name given in the payload"""
    task = DocumentTask({ <SNIP>
    })
    task_broker.give_task(task)
    result = await task_broker.get_task_result(task)
    return result
The task broker code: the first part gives the task, the second part removes the task, and the final part is what I assume should be a blocking call on the status of the removed task.
def give_task(self, task_obj):
    self.add_task_to_queue(task_obj)
    <SNIP>
    self.message_channel.publish(task_obj)

# ...

def remove_task_from_queue(self, task):
    id_task_to_remove = task.id
    for i in range(len(task_queue)):
        if task_queue[i]["id"] == id_task_to_remove:
            removed_task = task_queue.pop(i)
            logger.debug(
                f"[TaskBroker] Task with id '{id_task_to_remove}' successfully removed!"
            )
            removed_task["status"] = "DONE"
            return

# ...

async def get_task_result(self, task):
    return task.result
My intuition is to implement get_task_result so that it blocks on task.result until it is modified, and to modify it in remove_task_from_queue when the task is removed from the queue (and thus done).
Any idea how to do this asynchronously?
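One way to get that blocking behaviour (a minimal sketch only; the task_queue, _done_events and _results attributes are assumptions, not your actual broker) is to keep an asyncio.Event per task, let get_task_result await it, and have the completion path store the result and set the event:

import asyncio

class TaskBroker:
    def __init__(self):
        self.task_queue = []
        self._done_events = {}   # task id -> asyncio.Event
        self._results = {}       # task id -> result

    def give_task(self, task_obj):
        self._done_events[task_obj.id] = asyncio.Event()
        self.task_queue.append({"id": task_obj.id, "task": task_obj})
        # ... publish on the message channel as before ...

    def remove_task_from_queue(self, task, result=None):
        self.task_queue = [t for t in self.task_queue if t["id"] != task.id]
        self._results[task.id] = result
        self._done_events[task.id].set()   # wakes up anyone awaiting get_task_result()

    async def get_task_result(self, task, timeout=30):
        # "Blocks" only this coroutine; the event loop keeps serving other requests.
        await asyncio.wait_for(self._done_events.pop(task.id).wait(), timeout)
        return self._results.pop(task.id)

One caveat: if the Redis listener that ends up calling remove_task_from_queue runs in a different thread than the event loop, signal the event with loop.call_soon_threadsafe(event.set) instead of calling event.set() directly.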
I have a Google Cloud Function triggered by Pub/Sub. The docs state that messages are acknowledged when the function ends with success.
But randomly, the function retries (same execution ID) exactly 10 minutes after execution, which is the Pub/Sub ack max timeout.
I also tried to get the message ID and acknowledge it programmatically in the function code, but the Pub/Sub API responds that there is no message to ack with that ID.
In Stackdriver monitoring, I see some messages not being acknowledged.
Here is my code : main.py
import base64
import logging
import traceback

from google.api_core import exceptions
from google.cloud import bigquery, error_reporting, firestore, pubsub

from sql_runner.runner import orchestrator

logging.getLogger().setLevel(logging.INFO)


def main(event, context):
    bigquery_client = bigquery.Client()
    firestore_client = firestore.Client()
    publisher_client = pubsub.PublisherClient()
    subscriber_client = pubsub.SubscriberClient()

    logging.info(
        'event=%s',
        event
    )
    logging.info(
        'context=%s',
        context
    )

    try:
        query_id = base64.b64decode(event.get('data', b'')).decode('utf-8')
        logging.info(
            'query_id=%s',
            query_id
        )
        # inject dependencies
        orchestrator(
            query_id,
            bigquery_client,
            firestore_client,
            publisher_client
        )

        sub_path = (context.resource['name']
                    .replace('topics', 'subscriptions')
                    .replace('function-sql-runner', 'gcf-sql-runner-europe-west1-function-sql-runner')
                    )
        # explicitly ack the message to avoid duplicate invocations
        try:
            subscriber_client.acknowledge(
                sub_path,
                [context.event_id]  # message_id to ack
            )
            logging.warning(
                'message_id %s acknowledged (FORCED)',
                context.event_id
            )
        except exceptions.InvalidArgument as err:
            # google.api_core.exceptions.InvalidArgument: 400 You have passed an invalid ack ID to the service (ack_id=982967258971474).
            logging.info(
                'message_id %s already acknowledged',
                context.event_id
            )
            logging.debug(err)

    except Exception as err:
        # catch all exceptions and log to prevent cold boot
        # report with error_reporting
        error_reporting.Client().report_exception()
        logging.critical(
            'Internal error : %s -> %s',
            str(err),
            traceback.format_exc()
        )


if __name__ == '__main__':  # for testing
    from collections import namedtuple  # use namedtuple to avoid Class creation
    Context = namedtuple('Context', 'event_id resource')
    context = Context('666', {'name': 'projects/my-dev/topics/function-sql-runner'})

    script_to_start = b' '  # launch the 1st script
    script_to_start = b'060-cartes.sql'

    main(
        event={"data": base64.b64encode(script_to_start)},
        context=context
    )
Here is my code : runner.py
import logging
import os

from retry import retry

PROJECT_ID = os.getenv('GCLOUD_PROJECT') or 'my-dev'


def orchestrator(query_id, bigquery_client, firestore_client, publisher_client):
    """
    if query_id is empty, start the first sql script
    else, call the given query_id.
    Anyway, call the next script.
    If the sql script is the last, no call

    retrieve SQL queries from Firestore
    run queries on BigQuery
    """
    docs_refs = [
        doc_ref.get() for doc_ref in
        firestore_client.collection(u'sql_scripts').list_documents()
    ]
    sorted_queries = sorted(docs_refs, key=lambda x: x.id)

    if not bool(query_id.strip()):  # first execution
        current_index = 0
    else:
        # find the query to run
        query_ids = [query_doc.id for query_doc in sorted_queries]
        current_index = query_ids.index(query_id)

    query_doc = sorted_queries[current_index]

    bigquery_client.query(
        query_doc.to_dict()['request'],  # sql query
    ).result()

    logging.info(
        'Query %s executed',
        query_doc.id
    )

    # exit if the current query is the last
    if len(sorted_queries) == current_index + 1:
        logging.info('All scripts were executed.')
        return

    next_query_id = sorted_queries[current_index + 1].id.encode('utf-8')
    publish(publisher_client, next_query_id)


@retry(tries=5)
def publish(publisher_client, next_query_id):
    """
    send a message on pubsub to call the next query
    this mechanism allows running one sql script per Function instance
    so as to not exceed the 9 min deadline limit
    """
    logging.info('Calling next query %s', next_query_id)
    future = publisher_client.publish(
        topic='projects/{}/topics/function-sql-runner'.format(PROJECT_ID),
        data=next_query_id
    )
    # ensure publish is successful
    message_id = future.result()
    logging.info('Published message_id = %s', message_id)
It looks like the Pub/Sub message is not acked on success.
I do not think I have background activity in my code.
My question: why is my function randomly retrying even on success?
Cloud Functions does not guarantee that your functions will run exactly once. According to the documentation, background functions, including pubsub functions, are given an at-least-once guarantee:
Background functions are invoked at least once. This is because of the
asynchronous nature of handling events, in which there is no caller
that waits for the response. The system might, in rare circumstances,
invoke a background function more than once in order to ensure
delivery of the event. If a background function invocation fails with
an error, it will not be invoked again unless retries on failure are
enabled for that function.
Your code will need to expect that it could possibly receive an event more than once. As such, your code should be idempotent:
To make sure that your function behaves correctly on retried execution
attempts, you should make it idempotent by implementing it so that an
event results in the desired results (and side effects) even if it is
delivered multiple times. In the case of HTTP functions, this also
means returning the desired value even if the caller retries calls to
the HTTP function endpoint. See Retrying Background Functions for more
information on how to make your function idempotent.
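For example, one way to make this particular function idempotent (a sketch only; the processed_events collection and the already_processed() helper are made-up names, not part of your code) is to use the Pub/Sub-assigned event_id as an idempotency key in Firestore and skip redelivered messages:

from google.cloud import firestore

firestore_client = firestore.Client()

def already_processed(event_id):
    """Record event_id in Firestore; return True if it was seen before."""
    marker_ref = firestore_client.collection('processed_events').document(event_id)

    @firestore.transactional
    def claim(transaction):
        if marker_ref.get(transaction=transaction).exists:
            return True                          # redelivered message
        transaction.set(marker_ref, {'done': True})
        return False

    return claim(firestore_client.transaction())

# at the top of main(event, context):
#     if already_processed(context.event_id):
#         logging.info('event_id %s already processed, skipping', context.event_id)
#         return

Duplicates can still slip through if the function dies between doing the work and the marker write committing, but this narrows the window considerably.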
In my simple webapp I have a model called Document. When the document is created it is empty. The user can then request to generate it, which means that its content is filled with data. Since this generating step can take some time, it is an asynchronous request: the server starts a thread to generate the document, the user obtains a quick response saying that the generation process started, and after some time the generation is over and the database is updated.
This is the code that describes the model:
import time
from threading import Thread

from django.db import models

STATE_EMPTY = 0
STATE_GENERATING = 1
STATE_READY = 2


class Document(models.Model):
    text = models.TextField(blank=True, null=True)
    state = models.IntegerField(default=STATE_EMPTY, choices=(
        (STATE_EMPTY, 'empty'),
        (STATE_GENERATING, 'generating'),
        (STATE_READY, 'ready'),
    ))

    def generate(self):
        def generator():
            time.sleep(5)
            self.state = STATE_READY
            self.text = 'This is the content of the document'

        self.state = STATE_GENERATING
        self.save()

        t = Thread(target=generator, name='GeneratorThread')
        t.start()
As you can see, the generate function changes the state, saves the document and spawns a thread. The thread works for a while (well... sleeps for a while), then changes the state and the content.
This is the corresponding test:
def test_document_can_be_generated_asynchronously(self):
    doc = Document()
    doc.save()
    self.assertEqual(STATE_EMPTY, doc.state)
    doc.generate()
    self.assertEqual(STATE_GENERATING, doc.state)
    time.sleep(8)
    self.assertEqual(STATE_READY, doc.state)
    self.assertEqual('This is the content of the document', doc.text)
This test passes. The document object correctly undergoes all expected changes.
Unfortunately, the code is wrong: after changing the content of the document, it is never saved, so the changes are not persistent. This can be verified by adding the following line to the test:
self.assertEqual(STATE_READY, Document.objects.first().state)
This assertion fails:
self.assertEqual(STATE_READY, Document.objects.first().state)
AssertionError: 2 != 1
The solution is simple: just add self.save() at the end of the generator function. But this results in a different kind of problem:
Destroying test database for alias 'default'...
Traceback (most recent call last):
File ".../virtualenvs/DjangoThreadTest-elBGAiyX/lib/python3.7/site-packages/django/db/backends/utils.py", line 82, in _execute
return self.cursor.execute(sql)
psycopg2.errors.ObjectInUse: database "test_postgres" is being accessed by other users
DETAIL: There is 1 other session using the database.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
...
File ".../virtualenvs/DjangoThreadTest-elBGAiyX/lib/python3.7/site-packages/django/db/backends/utils.py", line 82, in _execute
return self.cursor.execute(sql)
django.db.utils.OperationalError: database "test_postgres" is being accessed by other users
DETAIL: There is 1 other session using the database.
The problem seems related to the save() call placed in a different thread. The database engine does not seem to affect the result: I obtain almost identical error messages with PostgreSQL (as shown) and SQLite (in that case the error is along the lines of "the database table is locked").
Some similar questions get replies such as "Just use Celery to manage heavy processing tasks". I would rather understand what I'm doing wrong and how to solve it using Django tools. In fact, there is no heavy processing, nor any need to scale to many users (the webapp is to be used by one user at a time).
When you spawn a new thread, Django creates a new connection to the database for that thread. Normally, all connections are closed at the start/end of the request cycle and at the end of a test run. But when a thread is spawned manually, there is no code to close the connection: the thread ends and its local data is destroyed, but the connection is not closed properly on the database side (connections are stored in a thread-local object, if you are interested).
So, to solve the issue you have to manually close the connection at the end of the thread.
from django.db import connection

def generate(self):
    def generator():
        time.sleep(5)
        self.state = STATE_READY
        self.text = 'This is the content of the document'
        self.save()
        connection.close()

    self.state = STATE_GENERATING
    self.save()

    t = Thread(target=generator, name='GeneratorThread')
    t.start()
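A slightly more defensive variant of the same idea (just a sketch, not required by the fix above) closes the connection even if the generation fails part-way:

def generator():
    try:
        time.sleep(5)
        self.state = STATE_READY
        self.text = 'This is the content of the document'
        self.save()
    finally:
        # always release this thread's DB connection, even on errors
        connection.close()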
PostgreSQL and the SQL standard define a Serializable transaction isolation level. If you isolate transactions at this level, conflicting concurrent transactions abort and need to be retried.
I am familiar with the concept of transaction retries from the Plone / Zope world, where the entire HTTP request can be replayed if there is a transaction conflict. How could similar functionality be achieved with SQLAlchemy (and potentially with zope.sqlalchemy)? I tried to read the documentation of zope.sqlalchemy and the Zope transaction manager, but it is not obvious to me.
Specifically, I want something like this:
# Try to do the stuff, if it fails because of transaction conflict do again until retry count is exceeded
with transaction.manager(retries=3):
    do_stuff()
# If we couldn't get the transaction through even after 3 attempts, fail with a horrible exception
So, after poking around for two weeks and finding no off-the-shelf solution, I came up with my own.
Here is a ConflictResolver class which provides a managed_transaction function decorator. You can use the decorator to mark functions as retryable, i.e. if there is a database conflict error when running the function, the function is run again, now with better hopes that the db transaction which caused the conflict error has finished.
The source code is here: https://bitbucket.org/miohtama/cryptoassets/src/529c50d74972ff90fe5b61dfbfc1428189cc248f/cryptoassets/core/tests/test_conflictresolver.py?at=master
The unit tests to cover it are here: https://bitbucket.org/miohtama/cryptoassets/src/529c50d74972ff90fe5b61dfbfc1428189cc248f/cryptoassets/core/tests/test_conflictresolver.py?at=master
Python 3.4+ only.
"""Serialized SQL transaction conflict resolution as a function decorator."""
import warnings
import logging
from collections import Counter
from sqlalchemy.orm.exc import ConcurrentModificationError
from sqlalchemy.exc import OperationalError
UNSUPPORTED_DATABASE = "Seems like we might know how to support serializable transactions for this database. We don't know or it is untested. Thus, the reliability of the service may suffer. See transaction documentation for the details."
#: Tuples of (Exception class, test function). Behavior copied from _retryable_errors definitions copied from zope.sqlalchemy
DATABASE_COFLICT_ERRORS = []
try:
import psycopg2.extensions
except ImportError:
pass
else:
DATABASE_COFLICT_ERRORS.append((psycopg2.extensions.TransactionRollbackError, None))
# ORA-08177: can't serialize access for this transaction
try:
import cx_Oracle
except ImportError:
pass
else:
DATABASE_COFLICT_ERRORS.append((cx_Oracle.DatabaseError, lambda e: e.args[0].code == 8177))
if not DATABASE_COFLICT_ERRORS:
# TODO: Do this when cryptoassets app engine is configured
warnings.warn(UNSUPPORTED_DATABASE, UserWarning, stacklevel=2)
#: XXX: We need to confirm is this the right way for MySQL, SQLIte?
DATABASE_COFLICT_ERRORS.append((ConcurrentModificationError, None))
logger = logging.getLogger(__name__)
class CannotResolveDatabaseConflict(Exception):
"""The managed_transaction decorator has given up trying to resolve the conflict.
We have exceeded the threshold for database conflicts. Probably long-running transactions or overload are blocking our rows in the database, so that this transaction would never succeed in error free manner. Thus, we need to tell our service user that unfortunately this time you cannot do your thing.
"""
class ConflictResolver:
def __init__(self, session_factory, retries):
"""
:param session_factory: `callback()` which will give us a new SQLAlchemy session object for each transaction and retry
:param retries: The number of attempst we try to re-run the transaction in the case of transaction conflict.
"""
self.retries = retries
self.session_factory = session_factory
# Simple beancounting diagnostics how well we are doing
self.stats = Counter(success=0, retries=0, errors=0, unresolved=0)
#classmethod
def is_retryable_exception(self, e):
"""Does the exception look like a database conflict error?
Check for database driver specific cases.
:param e: Python Exception instance
"""
if not isinstance(e, OperationalError):
# Not an SQLAlchemy exception
return False
# The exception SQLAlchemy wrapped
orig = e.orig
for err, func in DATABASE_COFLICT_ERRORS:
# EXception type matches, now compare its values
if isinstance(orig, err):
if func:
return func(e)
else:
return True
return False
def managed_transaction(self, func):
"""SQL Seralized transaction isolation-level conflict resolution.
When SQL transaction isolation level is its highest level (Serializable), the SQL database itself cannot alone resolve conflicting concurrenct transactions. Thus, the SQL driver raises an exception to signal this condition.
``managed_transaction`` decorator will retry to run everyhing inside the function
Usage::
# Create new session for SQLAlchemy engine
def create_session():
Session = sessionmaker()
Session.configure(bind=engine)
return Session()
conflict_resolver = ConflictResolver(create_session, retries=3)
# Create a decorated function which can try to re-run itself in the case of conflict
#conflict_resolver.managed_transaction
def myfunc(session):
# Both threads modify the same wallet simultaneously
w = session.query(BitcoinWallet).get(1)
w.balance += 1
# Execute the conflict sensitive code inside a managed transaction
myfunc()
The rules:
- You must not swallow all exceptions within ``managed_transactions``. Example how to handle exceptions::
# Create a decorated function which can try to re-run itself in the case of conflict
#conflict_resolver.managed_transaction
def myfunc(session):
try:
my_code()
except Exception as e:
if ConflictResolver.is_retryable_exception(e):
# This must be passed to the function decorator, so it can attempt retry
raise
# Otherwise the exception is all yours
- Use read-only database sessions if you know you do not need to modify the database and you need weaker transaction guarantees e.g. for displaying the total balance.
- Never do external actions, like sending emails, inside ``managed_transaction``. If the database transaction is replayed, the code is run twice and you end up sending the same email twice.
- Managed transaction section should be as small and fast as possible
- Avoid long-running transactions by splitting up big transaction to smaller worker batches
This implementation heavily draws inspiration from the following sources
- http://stackoverflow.com/q/27351433/315168
- https://gist.github.com/khayrov/6291557
"""
def decorated_func():
# Read attemps from app configuration
attempts = self.retries
while attempts >= 0:
session = self.session_factory()
try:
result = func(session)
session.commit()
self.stats["success"] += 1
return result
except Exception as e:
if self.is_retryable_exception(e):
session.close()
self.stats["retries"] += 1
attempts -= 1
if attempts < 0:
self.stats["unresolved"] += 1
raise CannotResolveDatabaseConflict("Could not replay the transaction {} even after {} attempts".format(func, self.retries)) from e
continue
else:
session.rollback()
self.stats["errors"] += 1
# All other exceptions should fall through
raise
return decorated_func
Postgres and Oracle conflict errors are marked as retryable by zope.sqlalchemy. Set your isolation level in the engine configuration, and the transaction retry logic in pyramid_tm or Zope will work.
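For reference, a minimal sketch of that setup (the connection string is a placeholder, and register() assumes a reasonably recent zope.sqlalchemy; older versions use ZopeTransactionExtension instead):

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from zope.sqlalchemy import register

# Ask the database for serializable isolation on every connection of this engine
engine = create_engine(
    'postgresql+psycopg2://user:password@localhost/mydb',  # placeholder DSN
    isolation_level='SERIALIZABLE',
)

Session = sessionmaker(bind=engine)
session = Session()
register(session)  # join the session to the zope `transaction` machinery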
I have trouble with long calculations in Django. I am not able to install Celery because of the idiocy of my company, so I have to "reinvent the wheel". I am trying to do all calculations in a TaskQueue class, which stores all calculations in a "results" dictionary. I am also trying to make a "Please wait" page, which asks this TaskQueue whether the task with the provided key is ready.
And the problem is that the results somehow disappear.
I have some view with long calculations.
def some_view(request):
    ...
    uuid = task_queue.add_task(method_name, params)  # method_name(params) returns HttpResponse
    return redirect('/please_wait/?uuid={0}'.format(uuid))
And please_wait view:
def please_wait(request):
    uuid = request.GET.get('uuid', '0')
    ready = task_queue.task_ready(uuid)
    if ready:
        return task_queue.task_result(uuid)
    elif ready is None:
        return render_to_response('admin/please_wait.html', {'not_found': True})
    else:
        return render_to_response('admin/please_wait.html', {'not_found': False})
And last code, my TaskQueue:
import uuid
from multiprocessing.pool import ThreadPool
from threading import Lock


class TaskQueue:
    def __init__(self):
        self.pool = ThreadPool()
        self.results = {}
        self.lock = Lock()

    def add_task(self, method, params):
        self.lock.acquire()
        new_uuid = self.generate_new_uuid()
        while self.results.has_key(new_uuid):
            new_uuid = self.generate_new_uuid()
        self.results[new_uuid] = self.pool.apply_async(func=method, args=params)
        self.lock.release()
        return new_uuid

    def generate_new_uuid(self):
        return uuid.uuid1().hex[0:8]

    def task_ready(self, task_id):
        if self.results.has_key(task_id):
            return self.results[task_id].ready()
        else:
            return None

    def task_result(self, task_id):
        if self.task_ready(task_id):
            return self.results[task_id].get()
        else:
            return None


task_queue = TaskQueue()  # module-level singleton
After adding a task I could log the result by providing its uuid for a few seconds, and then it says the task isn't ready. Here is my log (I am outputting task_queue.results):
[INFO] 2013-10-01 16:04:52,782 logger: {'ade5d154': <multiprocessing.pool.ApplyResult object at 0x1989906c>}
[INFO] 2013-10-01 16:05:05,740 logger: {}
Help me, please! Why the hell does the result disappear?
UPD: @freakish helped me to find out some new information. The result doesn't disappear forever; it only disappears sometimes when I repeat my attempts to log it.
[INFO] 2013-10-01 16:52:41,743 logger: {}
[INFO] 2013-10-01 16:52:45,775 logger: {}
[INFO] 2013-10-01 16:52:48,855 logger: {'ade5d154': <multiprocessing.pool.ApplyResult object at 0x1989906c>}
OK, so we've established that you are running 4 processes of Django. In that case your queue won't be shared between them. Actually there are two possible solutions AFAIK:
Use a shared queueing server. You can write your own (see for example this entry) but using a proper one (like Celery) will be a lot easier (if you can't convince your employer to install it, then quit the job ;)).
Use a database to store the results and let each server do the calculations (via processes or threads). It does not have to be a proper database server; you can use sqlite3, for example. This is a more secure and reliable way, but less efficient. I think it is also easier than a queueing mechanism. You simply create a table with columns: id, state, result. When you create a job you create an entry with state=processing; when you finish the job you update the entry with state=done and result=result (for example as a JSON string). This is easy and reliable (you actually don't need a queue here at all; the order of jobs doesn't matter, unless I'm missing something).
Of course you won't be able to use the .ready() functions with this (you should store results inside these storages), unless you pickle the results, but that is unnecessary overhead.
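A minimal sketch of the database-backed option (the model and field names here are only illustrative):

import json

from django.db import models

class Job(models.Model):
    PROCESSING, DONE = 'processing', 'done'

    uuid = models.CharField(max_length=32, unique=True)
    state = models.CharField(max_length=16, default=PROCESSING)
    result = models.TextField(null=True, blank=True)  # JSON-encoded result

# worker side, in whichever process runs the calculation:
def finish_job(job_uuid, result_data):
    Job.objects.filter(uuid=job_uuid).update(state=Job.DONE, result=json.dumps(result_data))

# "please wait" view side: every Django process sees the same row
def get_result_if_ready(job_uuid):
    try:
        job = Job.objects.get(uuid=job_uuid, state=Job.DONE)
    except Job.DoesNotExist:
        return None
    return json.loads(job.result)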