How can I get a task by name?
from google.appengine.api import taskqueue
taskqueue.add(name='foobar', url='/some-handler', params={'foo': 'bar'})
task_queue = taskqueue.Queue('default')
task_queue.delete_tasks_by_name('foobar') # would work
# looking for a method like this:
foobar_task = task_queue.get_task_by_name('foobar')
It should be possible with the REST API (https://developers.google.com/appengine/docs/python/taskqueue/rest/tasks/get). But I would prefer something like task_queue.get_task_by_name('foobar'). Any ideas? Did I miss something?
There is no guarantee that the task with this name exists - it may have been already executed. And even if you manage to get a task, it may be executed while you are trying to do something with it. So when you try to put it back, you have no idea if you are adding it for the first time or for the second time.
Because of this uncertainty, I can't see any use case where getting a task by name may be useful.
EDIT:
You can give a name to your task in order to ensure that a particular task only executes once. When you add a task with a name to a queue, App Engine checks whether a task with that name already exists. If it does, the subsequent attempt will fail.
For example, you can have many instances running, and each instance may need to insert an entity in the Datastore. Your first option is to check whether the entity already exists in the Datastore. This is a relatively slow operation, and by the time you receive the response (entity does not exist) and decide to insert it, another instance could have already inserted it. So you end up with two entities instead of one.
Your second option is to use tasks. Instead of inserting a new entity directly into the Datastore, an instance creates a task to insert it, and it gives this task a name. If another instance tries to add a task with the same name, that attempt will fail and the existing task stays in place. As a result, you are guaranteed that the entity will be inserted only once.
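A minimal sketch of the named-task approach, assuming the Python 2.7 App Engine SDK, where adding a task whose name is already taken raises TaskAlreadyExistsError (or TombstonedTaskError if a task with that name ran recently). The helper enqueue_once and the FakeTaskQueue stand-in are hypothetical names made up for this illustration; in a real app you would pass in google.appengine.api.taskqueue itself.

```python
def enqueue_once(taskqueue, name, url, params):
    """Add a named task; return False if that name was already used.

    `taskqueue` is expected to be google.appengine.api.taskqueue (passed in
    as an argument so the sketch can be exercised without the SDK).
    """
    try:
        taskqueue.add(name=name, url=url, params=params)
        return True
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # Another instance already enqueued (or already ran) this task.
        return False


class FakeTaskQueue(object):
    """Hypothetical stand-in that mimics only the duplicate-name behaviour."""

    class TaskAlreadyExistsError(Exception):
        pass

    class TombstonedTaskError(Exception):
        pass

    def __init__(self):
        self._names = set()

    def add(self, name, url, params):
        if name in self._names:
            raise self.TaskAlreadyExistsError(name)
        self._names.add(name)
```

Whichever instance calls enqueue_once first wins; every later caller with the same name gets False and can simply move on.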
Reading this documentation https://docs.djangoproject.com/en/4.0/topics/db/transactions/#django.db.transaction.on_commit
This is the use case for on_commit
with transaction.atomic():  # Outer atomic, start a new transaction
    transaction.on_commit(foo)
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        transaction.on_commit(bar)
        # Do more things...
# foo() and then bar() will be called when leaving the outermost block
But why not just write the code like normal without on_commit hooks? Like this:
with transaction.atomic():  # Outer atomic, start a new transaction
    # Do things...
    with transaction.atomic():  # Inner atomic block, create a savepoint
        # Do more things...
foo()
bar()
# foo() and then bar() will be called when leaving the outermost block
It's easier to read since it doesn't require more knowledge of the Django APIs and the statements are put in the order of when they are executed. It's easier to test since you don't have to use any special test classes for Django.
So what is the use-case for the on_commit hook?
The example code given in the Django docs is transaction.on_commit(lambda: some_celery_task.delay('arg1')) and it's probably specifically because this comes up a lot with celery tasks.
Imagine if you do the following within a transaction:
my_object = MyObject.objects.create()
some_celery_task.delay(my_object.pk)
Then in your celery task you try doing this:
@app.task
def some_celery_task(object_pk):
    my_object = MyObject.objects.get(pk=object_pk)
This may work a lot of the time, but randomly you'll get errors where it's not able to find the object (it's a race condition that depends on how quickly the worker picks up the task). This is because you created a MyObject record within a transaction, but it isn't actually available in the database until a COMMIT is run. Celery has no access to that open transaction, so the task needs to be run after the COMMIT. There's also the very real possibility that something later on causes a ROLLBACK, in which case that celery task should never be called at all.
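The visibility problem can be reproduced without Django or Celery. Here is a rough sketch using only the standard-library sqlite3 module, with a second connection standing in for the worker; the table and function names are made up for the illustration, and other databases differ in the details but behave the same way for uncommitted rows under their default isolation levels.

```python
import os
import sqlite3
import tempfile


def visible_rows(db_path):
    """Count rows as seen from a brand-new connection (the 'worker')."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM my_object").fetchone()[0]
    finally:
        conn.close()


def demo():
    db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
    writer = sqlite3.connect(db_path)
    writer.execute("CREATE TABLE my_object (id INTEGER PRIMARY KEY)")
    writer.commit()

    # sqlite3 implicitly opens a transaction before the INSERT.
    writer.execute("INSERT INTO my_object DEFAULT VALUES")
    before_commit = visible_rows(db_path)  # the worker runs too early

    writer.commit()
    after_commit = visible_rows(db_path)  # the worker runs after COMMIT
    writer.close()
    return before_commit, after_commit
```

Calling demo() returns the row count the "worker" saw before and after the commit; the row only shows up in the second reading.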
So... You need to do:
my_object = MyObject.objects.create()
transaction.on_commit(lambda: some_celery_task.delay(my_object.pk))
Now, the celery task won't be called until the MyObject has actually been saved to the database after the COMMIT was called.
I should note, though, that this is primarily a concern when you aren't in AUTOCOMMIT mode (AUTOCOMMIT is actually the default). If you're in AUTOCOMMIT mode then you can be certain that a commit has been finished as part of a .create() or .save(). However, if your code base has any possibility of being called within a @transaction.atomic() then it's no longer AUTOCOMMIT and you're back to needing .on_commit(), so it's best/safest to always use it.
Django documentation:
Django provides the on_commit() function to register callback functions that should be executed after a transaction is successfully committed
That is its main purpose. A transaction is a unit of work that you want to treat atomically. It either happens completely or not at all. The same applies to your code. If something goes wrong during the DB operations, there may be follow-up actions that you should not perform.
Let's consider some business logic flow:
1. The user sends his registration data to our endpoint, we validate it, etc.
2. We save the new user to our DB.
3. We send him a "hello" email with a link for confirming his account.
If something goes wrong during step 2, we shouldn't go to step 3.
You might think: well, I'll get an exception and that code won't execute anyway, so why do we still need on_commit?
Sometimes you take actions in your code based on the assumption that the transaction will succeed, before potentially dangerous DB operations. For example, you first want to check whether you can send an email to your user, because you know that your third-party email provider often returns a 500. In that case, you want to return a 500 to the user and ask him to register later (a very bad idea, btw, but it's just a synthetic example).
When your function (e.g. with the @atomic decorator) contains a lot of DB operations, you surely don't want to keep track of all the variables' states in order to use them after all the DB-related code. Like this:
1. Validation of user's order.
2. Checking at DB if it could be completed.
3. If it could be done we need to send a request to 3rd-party CRM with the order's details.
4. If it couldn't, then we should create a support ticket in another 3rd-party.
5. Saving user's order to DB, updating user's model.
6. Sending a messenger notification to the employee who is responsible for the order.
7. Saving information, that notification for employee was sent successfully to the DB.
You can imagine what a mess we would have if we didn't have on_commit in this situation and had to wrap all of this in one really big try/except.
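To make the idea concrete, here is a toy model of commit hooks in plain Python. It is not Django's implementation, and ToyTransaction and register_user are invented names; it only shows the contract: callbacks are remembered while the block runs, executed if the block finishes cleanly, and dropped if it raises.

```python
class ToyTransaction(object):
    """Toy model of commit hooks; NOT Django's implementation."""

    def __init__(self):
        self._callbacks = []

    def on_commit(self, func):
        # Only remember the callback; nothing runs yet.
        self._callbacks.append(func)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # "COMMIT" succeeded: run the hooks in registration order.
            for func in self._callbacks:
                func()
        # On an exception ("ROLLBACK") the hooks are simply dropped.
        self._callbacks = []
        return False  # let any exception propagate


def register_user(sent, fail=False):
    with ToyTransaction() as tx:
        # Registered early, but only executed if the block succeeds.
        tx.on_commit(lambda: sent.append("hello email"))
        if fail:
            raise ValueError("saving the user failed")
```

The side effect is declared right next to the DB work it depends on, and there is no need to carry state to the end of the function just to decide whether to send the email.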
I'm working on a multiprocessed application, and each process sometimes executes the following code:
db_cursor.execute("SELECT MAX(id) FROM prqueue;")
for record in db_cursor.fetchall():
    if record[0]:
        db_cursor.execute("DELETE FROM prqueue WHERE id='%s'" % record[0])
        db_connector.commit()
And I'm facing the following problem: two processes may read the same maximum value and both try to delete it. This is not acceptable in the context of my application; each value must be taken (and deleted) by only one process.
How can I achieve this? Is locking the table while taking the maximum and deleting it absolutely necessary, or is there another, nicer way to do that?
Thank you.
Consider simulating record locks with GET_LOCK():
Choose a name specific to the operation you want to lock, e.g. 'prqueue_max_del'.
Call SELECT GET_LOCK('prqueue_max_del',30) to lock the name 'prqueue_max_del'. It will return 1 and set the lock if the name becomes available, or return 0 if the lock is not available after 30 seconds (the second parameter is the timeout).
Use SELECT RELEASE_LOCK('prqueue_max_del') when you are finished.
You will have to use the same name in every process. Note that in MySQL versions before 5.7.5, calling GET_LOCK() again on the same connection releases the previously held lock.
Beware: as only the abstract name is locked, all other processes not using this method and abstract name will be able to savage your table independently.
GET_LOCK() docs
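Wrapped up in Python, the steps above might look roughly like this. It is a sketch that assumes a DB-API cursor from a MySQL driver using the %s parameter style (MySQLdb and PyMySQL both do); named_lock and pop_max_id are hypothetical helper names, not part of any library.

```python
from contextlib import contextmanager


@contextmanager
def named_lock(cursor, name, timeout=30):
    """Hold a MySQL named lock for the duration of the with-block."""
    cursor.execute("SELECT GET_LOCK(%s, %s)", (name, timeout))
    if cursor.fetchone()[0] != 1:
        # 0 means the timeout expired; NULL (None) means an error occurred.
        raise RuntimeError("could not acquire lock %r" % name)
    try:
        yield
    finally:
        cursor.execute("SELECT RELEASE_LOCK(%s)", (name,))
        cursor.fetchone()  # consume the result row


def pop_max_id(cursor, connection):
    """Take and delete the highest id; returns None if the table is empty."""
    with named_lock(cursor, "prqueue_max_del"):
        cursor.execute("SELECT MAX(id) FROM prqueue")
        max_id = cursor.fetchone()[0]
        if max_id is not None:
            cursor.execute("DELETE FROM prqueue WHERE id = %s", (max_id,))
            connection.commit()
        return max_id
```

Because the SELECT and the DELETE both happen while the name is held, two processes calling pop_max_id cannot take the same row, provided every process goes through this helper.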
I have tens (potentially hundreds) of thousands of persistent objects that I want to generate in a multithreaded fashion due to the processing required.
While the creation of the objects happens in separate threads (using Flask-SQLAlchemy extension btw with scoped sessions) the call to write the generated objects to the DB happens in 1 place after the generation has completed.
The problem, I believe, is that the objects being created are part of several existing relationships-- thereby triggering the automatic addition to the identity map despite being created in separate, concurrent, threads with no explicit session in any of the threads.
I was hoping to contain the generated objects in a single list, and then write the whole list (using a single session object) to the database. This results in an error like this:
AssertionError: A conflicting state is already present in the identity map for key (<class 'app.ModelObject'>, (1L,))
This is why I believe the identity map has already been populated: the assertion error is triggered when I try to add and commit using the global session outside of the concurrent code.
The final detail is that I don't know how to get a reference to whatever session object(s) are involved (scoped or otherwise, as I don't fully understand how automatic addition to the identity map works in the case of multithreading), so even if I wanted to deal with a separate session per process, I couldn't.
Any advice is greatly appreciated. The only reason I am not posting code (yet) is because it's difficult to abstract a working example immediately out of my app. I will post if somebody really needs to see it though.
Each session is thread-local; in other words there is a separate session for each thread. If you decide to pass some instances to another thread, they will become "detached" from the session. Use db.session.add_all(objects) in the receiving thread to put them all back.
For some reason, it looks like you're creating objects with the same identity (primary key columns) in different threads, then trying to send them both to the database. One option is to fix why this is happening, so that identities will be guaranteed unique. You may also try merging: merged_object = db.session.merge(other_object, load=False).
Edit: zzzeek's comment clued me in on something else that may be going on:
With Flask-SQLAlchemy, the session is tied to the app context. Since that is thread local, spawning a new thread will invalidate the context; there will be no database session in the threads. All the instances are detached there, and cannot properly track relationships. One solution is to pass app to each thread and perform everything within a with app.app_context(): block. Inside the block, first use db.session.add to populate the local session with the passed instances. You should still merge in the master task afterwards to ensure consistency.
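If the "thread local" part is unclear, here is a standard-library analogy. It is not Flask-SQLAlchemy code; a threading.local object merely plays the role of the per-thread session/app context, and all names are made up. A freshly spawned thread sees none of the state the main thread set up and has to create its own.

```python
import threading

# Stand-in for per-thread state such as a scoped session or an app context.
_local = threading.local()


def current_session():
    """Return this thread's 'session' (a plain list here), or None."""
    return getattr(_local, "session", None)


def worker(results):
    # A new thread starts with empty thread-local state ...
    results["before"] = current_session()
    # ... so it has to set up its own, much like entering app.app_context().
    _local.session = ["worker session"]
    results["after"] = current_session()


def demo():
    _local.session = ["main session"]
    results = {}
    t = threading.Thread(target=worker, args=(results,))
    t.start()
    t.join()
    # The main thread's state is untouched by what the worker did.
    results["main"] = current_session()
    return results
```

This is why objects created in worker threads end up detached, and why they need to be added or merged into a session explicitly.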
I just want to clarify the problem and the solution with some pseudo-code in case somebody has this problem / wants to do this in the future.
import functools
import multiprocessing

class ObjA(object):
    obj_c = relationship('ObjC', backref='obj_a')

class ObjB(object):
    obj_c = relationship('ObjC', backref='obj_b')

class ObjC(object):
    obj_a_id = Column(Integer, ForeignKey('obj_a.id'))
    obj_b_id = Column(Integer, ForeignKey('obj_b.id'))

    def __init__(self, obj_a, obj_b):
        self.obj_a = obj_a
        self.obj_b = obj_b

def make_a_bunch_of_c(obj_a, list_of_b=None):
    return [ObjC(obj_a, obj_b) for obj_b in list_of_b]

def parallel_generate():
    list_of_a = session.query(ObjA).all()  # assume there are 1000 of these
    list_of_b = session.query(ObjB).all()  # and 30 of these
    fxn = functools.partial(make_a_bunch_of_c, list_of_b=list_of_b)
    pool = multiprocessing.Pool(10)
    batches = pool.map(fxn, list_of_a)  # one list of ObjC per ObjA
    # Flatten so the caller gets a single list of ObjC instances
    return [obj_c for batch in batches for obj_c in batch]
Now let's stop here a second. The original problem was that attempting to ADD the list of ObjC's caused the error message in the original question:
session.add_all(all_the_things)
AssertionError: A conflicting state is already present in the identity map for key [...]
Note: As far as I could tell, the error occurs during the adding phase; the commit attempt never even happens because the assertion fires before the commit.
Solution:
all_the_things = parallel_generate()
for thing in all_the_things:
    session.merge(thing)
session.commit()
The details of session utilization when dealing with automatically added objects (via relationship cascading) are still beyond me, and I cannot explain why the conflict originally occurred. All I know is that using the merge function will cause SQLAlchemy to sort all of the child objects that were created across 10 different processes into a single session in the master process.
I would be curious about the why, if anyone happens across this.
I have a _createAccount() method that takes two parameters, as below:
def _createAccount(self, username, emailID):
    <statements to create account with respect to received emailID>
I need to test this method using unittest, i.e. within a single test method I want to send two requests at the same time with the same emailID and different usernames.
One of the two requests must get a response saying that an account has already been created with this emailID.
How can I send parallel createAccount requests in a unit test?
I guess this code will run within a web application, so that multiple requests can be handled at a time.
One way could be to create threads in the test, run the method in different threads and check the results, but that comes with a lot of caveats. As soon as you execute threads in parallel, the order of execution stops being deterministic and instead depends on the scheduler, which can be considered more or less random. That means that even if your method were to fail under certain circumstances (with a precise order of execution), there is no way to make sure that you will be able to recreate those circumstances. In other words, a passing test won't tell you anything about the validity of the method.
For this kind of synchronization problem, you have to write the code so as to make sure that the thing you don't want to happen twice can't. For that, you need to make sure that your creation/verification code is atomic.
For example, if you're working with an SQL database, you could specify a uniqueness constraint on the username or emailID column, so that the second request will fail (SQL transactions are atomic). In other cases you'd want to use a lock to make sure only one thread is executing the "check if exists, and if not create" part.
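Here is a small sketch of the uniqueness-constraint approach, using the standard-library sqlite3 module as a stand-in for whatever database the application really uses. The account table and the function names are assumptions made for the example. The result no longer depends on which thread gets there first, because the database itself rejects the second insert.

```python
import os
import sqlite3
import tempfile
import threading


def init_db(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE account (username TEXT NOT NULL, email TEXT NOT NULL UNIQUE)"
    )
    conn.commit()
    conn.close()


def create_account(db_path, username, email):
    """Return True if created, False if the email is already registered."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO account (username, email) VALUES (?, ?)",
                (username, email),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a second account for this email.
        return False
    finally:
        conn.close()


def create_in_parallel(db_path, usernames, email):
    """Try to create one account per username, all with the same email."""
    results = []  # list.append is thread-safe in CPython

    def attempt(username):
        results.append(create_account(db_path, username, email))

    threads = [threading.Thread(target=attempt, args=(u,)) for u in usernames]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

A test can then assert that exactly one of the attempts succeeded without caring which one, which sidesteps the scheduling problem described above.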
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: Add a 'last snapshot' field, and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), such that whenever you update a record, it checks if it's been more than an hour since the last snapshot, and if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
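A rough sketch of the two pieces of logic involved, kept independent of the datastore API. Truncating the timestamp to the hour is an assumption on my part about how the key name would be derived, and both function names are made up.

```python
from datetime import datetime


def snapshot_key_name(user_id, when):
    """Derive a key name from the user and the hour of the snapshot.

    Two updates within the same hour map to the same key name, so the
    second write overwrites the first instead of creating a duplicate.
    """
    hour = when.replace(minute=0, second=0, microsecond=0)
    return "%s:%s" % (user_id, hour.strftime("%Y%m%d%H"))


def needs_snapshot(last_snapshot, now):
    """True if no snapshot exists yet or the last one is over an hour old."""
    return last_snapshot is None or (now - last_snapshot).total_seconds() > 3600
```

The overridden put() would call needs_snapshot with the 'last snapshot' field and, if it returns True, write a snapshot entity under snapshot_key_name(...) before saving the record itself.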
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of Cron jobs and a looping url fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article, the cron job calls your initial process, you catch the timeout error and call the process again, masked as a second URL. You have to ping between two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful that you do not loop infinitely. Make sure that there is an end state for your updating loop, since this would put you over your quotas pretty quickly if it never ended.
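The control flow can be sketched without any App Engine specifics. Below, each call to process_chunk stands for one request that works until its time budget is used up and reports where the next request should resume; run_chain simulates the chain of requests and enforces an end state. All names and the max_hops limit are assumptions for illustration, and a real handler would trigger the next URL instead of looping locally.

```python
import time


def process_chunk(items, start, handle, budget_seconds, clock=time.time):
    """Process items[start:] until the time budget runs out.

    Returns the index to resume from, or None when everything is done.
    """
    deadline = clock() + budget_seconds
    index = start
    while index < len(items):
        if clock() >= deadline:
            return index  # hand the rest over to the next request
        handle(items[index])
        index += 1
    return None  # end state: nothing left, stop chaining


def run_chain(items, handle, budget_seconds, max_hops=100, clock=time.time):
    """Simulate the chain of requests; max_hops guards against endless loops."""
    resume_at, hops = 0, 0
    while resume_at is not None:
        if hops >= max_hops:
            raise RuntimeError("too many hops; aborting to protect quota")
        resume_at = process_chunk(items, resume_at, handle, budget_seconds, clock)
        hops += 1
    return hops
```

Checking the clock before each item, rather than waiting for the timeout exception, leaves room to schedule the follow-up request cleanly.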