Prevent duplicate DynamoDB transaction

Prevent duplicate DynamoDB transaction - python

Hi I've a Lambda function treated as a webhook. The webhook may be called multiple time simultaneously with same data. In the lambda function, I check if the transaction record is present in the DynamoDB or not. If it's present in db the Lambda simply returns otherwise it execute further. The problem arises here that when checking if a record in db the Lambda get called again and that check fails because the previous transaction still not inserted in db. and transaction can get executed multiple times.
My question is how to handle this situation. will SQS be helpful in this situation?

You can use optimistic locking for this. I've written a more detailed blog about implementing it, but here are the core ideas.
For each item you track a version number that always gets incremented. Each update to the item will increment the version number by one.
When you want to perform an update, you first read the old item and store its version number locally. Then you change the item locally and increment its version number. When you write it back to the table in your transaction, you add a conditional write. The condition being that the current version number of the item is still the same it was when you read it.
This means the transaction will fail if the item has been update in the mean time. Optimistic Locking helps you with collision detection and is a good solution under the assumption that such collisions are relatively rare. You'd be better served with different locking strategies if they're more frequent.
Optimistic Locking will help you identify the cases you're worried about. It doesn't resolve them, you'll have to implement that yourself. A common conflict resolution approach would be to read the item again and check if your changes have already been applied.

" if it's present in db the lambda simply return otherwise it execute further" given that, is it possible to use FIFO queue and use some "key" from the data as deduplication id (fifo) and that would mean all duplicate messages would never make it to your logic and then you would also need
dynamodb's "strongly consistent" option.

Related

id of my user table is increased unnaturally [duplicate]

I must / have to create unique ID for invoices. I have a table id and another column for this unique number. I use serialization isolation level. Using
var seq = #"SELECT invoice_serial + 1 FROM invoice WHERE ""type""=#type ORDER BY invoice_serial DESC LIMIT 1";
Doesn't help because even using FOR UPDATE it wont read correct value as in serialization level.
Only solution seems to put some retry code.

Sequences do not generate gap-free sets of numbers, and there's really no way of making them do that because a rollback or error will "use" the sequence number.
I wrote up an article on this a while ago. It's directed at Oracle but is really about the fundamental principles of gap-free numbers, and I think the same applies here.
Well, it’s happened again. Someone has asked how to implement a requirement to generate a gap-free series of numbers and a swarm of nay-sayers have descended on them to say (and here I paraphrase slightly) that this will kill system performance, that’s it’s rarely a valid requirement, that whoever wrote the requirement is an idiot blah blah blah.
As I point out on the thread, it is sometimes a genuine legal requirement to generate gap-free series of numbers. Invoice numbers for the 2,000,000+ organisations in the UK that are VAT (sales tax) registered have such a requirement, and the reason for this is rather obvious: that it makes it more difficult to hide the generation of revenue from tax authorities. I’ve seen comments that it is a requirement in Spain and Portugal, and I’d not be surprised if it was not a requirement in many other countries.
So, if we accept that it is a valid requirement, under what circumstances are gap-free series* of numbers a problem? Group-think would often have you believe that it always is, but in fact it is only a potential problem under very particular circumstances.
The series of numbers must have no gaps.
Multiple processes create the entities to which the number is associated (eg. invoices).
The numbers must be generated at the time that the entity is created.
If all of these requirements must be met then you have a point of serialisation in your application, and we’ll discuss that in a moment.
First let’s talk about methods of implementing a series-of-numbers requirement if you can drop any one of those requirements.
If your series of numbers can have gaps (and you have multiple processes requiring instant generation of the number) then use an Oracle Sequence object. They are very high performance and the situations in which gaps can be expected have been very well discussed. It is not too challenging to minimise the amount of numbers skipped by making design efforts to minimise the chance of a process failure between generation of the number and commiting the transaction, if that is important.
If you do not have multiple processes creating the entities (and you need a gap-free series of numbers that must be instantly generated), as might be the case with the batch generation of invoices, then you already have a point of serialisation. That in itself may not be a problem, and may be an efficient way of performing the required operation. Generating the gap-free numbers is rather trivial in this case. You can read the current maximum value and apply an incrementing value to every entity with a number of techniques. For example if you are inserting a new batch of invoices into your invoice table from a temporary working table you might:
insert into
invoices
(
invoice#,
...)
with curr as (
select Coalesce(Max(invoice#)) max_invoice#
from invoices)
select
curr.max_invoice#+rownum,
...
from
tmp_invoice
...
Of course you would protect your process so that only one instance can run at a time (probably with DBMS_Lock if you're using Oracle), and protect the invoice# with a unique key contrainst, and probably check for missing values with separate code if you really, really care.
If you do not need instant generation of the numbers (but you need them gap-free and multiple processes generate the entities) then you can allow the entities to be generated and the transaction commited, and then leave generation of the number to a single batch job. An update on the entity table, or an insert into a separate table.
So if we need the trifecta of instant generation of a gap-free series of numbers by multiple processes? All we can do is to try to minimise the period of serialisation in the process, and I offer the following advice, and welcome any additional advice (or counter-advice of course).
Store your current values in a dedicated table. DO NOT use a sequence.
Ensure that all processes use the same code to generate new numbers by encapsulating it in a function or procedure.
Serialise access to the number generator with DBMS_Lock, making sure that each series has it’s own dedicated lock.
Hold the lock in the series generator until your entity creation transaction is complete by releasing the lock on commit
Delay the generation of the number until the last possible moment.
Consider the impact of an unexpected error after generating the number and before the commit is completed — will the application rollback gracefully and release the lock, or will it hold the lock on the series generator until the session disconnects later? Whatever method is used, if the transaction fails then the series number(s) must be “returned to the pool”.
Can you encapsulate the whole thing in a trigger on the entity’s table? Can you encapsulate it in a table or other API call that inserts the row and commits the insert automatically?
Original article

You could create a sequence with no cache , then get the next value from the sequence and use that as your counter.
CREATE SEQUENCE invoice_serial_seq START 101 CACHE 1;
SELECT nextval('invoice_serial_seq');
More info here

You either lock the table to inserts, and/or need to have retry code. There's no other option available. If you stop to think about what can happen with:
parallel processes rolling back
locks timing out
you'll see why.

In 2006, someone posted a gapless-sequence solution to the PostgreSQL mailing list: http://www.postgresql.org/message-id/44E376F6.7010802#seaworthysys.com

Let the SQL engine do the constraint check or execute a query to check the constraint beforehand

Python application, standard web app.
If a particular request gets executed twice by error the second request will try to insert a row with an already existing primary key.
What is the most sensible way to deal with it.
a) Execute a query to check if the primary key already exists and do the checking and error handling in the python app
b) Let the SQL engine reject the insertion with a constraint failure and use exception handling to handle it back in the app
From a speed perspective it might seem that a failed request will take the same amount of time as a successful one, making b faster because its only one request and not two.
However, when you take things in account like read-only db slaves and table write-locks and stuff like that things get fuzzy for my experience on scaling standard SQL databases.

The best option is (b), from almost any perspective. As mentioned in a comment, there is a multi-threading issue. That means that option (a) doesn't even protect data integrity. And that is a primary reason why you want data integrity checks inside the database, not outside it.
There are other reasons. Consider performance. Passing data into and out of the database takes effort. There are multiple levels of protocol and data preparation, not to mention round trip, sequential communication from the database server. One call has one such unit of overhead. Two calls have two such units.
It is true that under some circumstances, a failed query can have a long clean-up period. However, constraint checking for unique values is a single lookup in an index, which is both fast and has minimal overhead for cleaning up. The extra overhead for handling the error should be tiny in comparison to the overhead for running the queries from the application -- both are small, one is much smaller.
If you had a query load where the inserts were really rare with respect to the comparison, then you might consider doing the check in the application. It is probably a tiny bit faster to check to see if something exists using a SELECT rather than using INSERT. However, unless your query load is many such checks for each actual insert, I would go with checking in the database and move on to other issues.

The latter one you need to do and handle in any case, thus I do not see there is much value in querying for duplicates, except to show the user information beforehand - e.g. report "This username has been taken already, please choose another" when the user is still filling in the form.

Penalties for INSERT with existing primary key?

I'm trying to insert a row if the same primary key does not exist yet (ignore in that case). Doing this from Python, using psycopg2 and Postgres version 9.3.
There are several options how to do this: 1) use subselect, 2) use transaction, 3) let it fail.
It seems easiest to do something like this:
try:
cursor.execute('INSERT...')
except psycopg2.IntegrityError:
pass
Are there any drawbacks to this approach? Is there any performance penalty with the failure?

The foolproof way to do it at the moment is try the insert and let it fail. You can do that at the app level or at the Postgres level; assuming it's not part of a procedure being executed on the server, it doesn't materially matter if it's one or the other when it comes to performance, since either way you're sending a request to the server and retrieving the result. (Where it may matter is in your need to define a save point if you're trying it from within a transaction, for the same reason. Or, as highlighted in Craig's answer, if you've many failed statements.)
In future releases, a proper merge and upsert are on the radar, but as the near-decade long discussion will suggest implementing them properly is rather thorny:
https://wiki.postgresql.org/wiki/SQL_MERGE
https://wiki.postgresql.org/wiki/UPSERT
With respect to the other options you mentioned, the above wiki pages and the links within them should highlight the difficulties. Basically though, using a subselect is cheap, as noted by Erwin, but isn't concurrency-proof (unless you lock properly); using locks basically amounts to locking the entire table (trivial but not great) or reinventing the wheel that's being forged in core (trivial for existing rows, less so for potentially new ones which are inserted concurrently if seek to use predicates instead of a table-level lock); and using a transaction and catching the exception is what you'll end up doing anyway.

Work is ongoing to add a native upsert to PostgreSQL 9.5, which will probably take the form of an INSERT ... ON CONFLICT UPDATE ... statement.
In the mean time, you must attempt the update and if it fails, retry. There's no safe alternative, though you can loop within a PL/PgSQL function to hide this from the application.
Re trying and letting it fail:
Are there any drawbacks to this approach?
It creates a large volume of annoying noise in the log files. It also burns through transaction IDs very rapidly if the conflict rate is high, potentially requiring more frequent VACUUM FREEZE to be run by autovacuum, which can be an issue on large databases.
Is there any performance penalty with the failure?
If the conflict rate is high, you'll be doing a bunch of extra round trips to the database. Otherwise not much really.

Find out what's going to be deleted

When using more complex, hierarchical models with differing settings on how cascade deletes are handled it gets quite hard to figure out beforehand what a delete() will exactly do with the database.
I couldn't find any way to get this piece of information ("Hey SQLAlchemy, what will be deleted if I delete that object over there?") from SQLAlchemy. Implementing this by myself doesn't really seem like an option since this would result sooner or later in situations where my prediction and the actual consequences of the delete() differ, which would be very… unpleasant for the user.

I asked this question on the SQLAlchemy mailing list to
and Michael Bayer explained the possible options (thanks a lot again! :-):
The only deletes that aren't present in session.deleted before the flush are those that will occur because a particular object is an "orphan", and the objects which would be deleted as a result of a cascade on that orphan.
So without orphans taken into account, session.deleted tells you everything that is to be deleted.
To take orphans into account requires traversing through all the relationships as the unit of work does, looking for objects that are currently orphans (there's an API function that will tell you this - if the object is considered an "orphan" by any attribute that refers to it with delete-orphan cascade, it's considered an "orphan"), and then traversing through the relationships of those orphans, considering them to be marked as "deleted", and then doing all the rules again for those newly-deleted objects.
The system right now is implemented by orm/dependency.py. It is probably not hard to literally run a unit of work process across the session normally, but just not emit the SQL, this would give you the final flush plan. But this is an expensive process that I wouldn't want to be calling all the time.
A feature add is difficult here because the use case is not clear. Knowing what will be deleted basically requires half the flush process actually proceed. But you can already implement events inside the flush process itself, most directly the before_delete() and after_delete() events that will guaranteed catch everything. So the rationale for a new feature that basically runs half the flush, before you just do the flush anyway and could just put events inside of it, isn't clear.
I guess the big question is, "when are you calling this".
An easy system would be to add a new event "flush_plan_complete" which will put you into a flush() right as the full plan has been assembled, but before any SQL occurs. It could allow you to register more objects for activity, and it would then rerun the flush plan to consider any new changes (since that's how it works right now anyway). How this registration would proceed is tricky, since it would be nice to use the Session normally there, but that makes this more complicated to implement. But then it could iterate again through the new changes and find any remaining steps to take before proceeding normally.

Per-session transactions in Django

I'm making a Django web-app which allows a user to build up a set of changes over a series of GETs/POSTs before committing them to the database (or reverting) with a final POST. I have to keep the updates isolated from any concurrent database users until they are confirmed (this is a configuration front-end), ruling out committing after each POST.
My preferred solution is to use a per-session transaction. This keeps all the problems of remembering what's changed (and how it affects subsequent queries), together with implementing commit/rollback, in the database where it belongs. Deadlock and long-held locks are not an issue, as due to external constraints there can only be one user configuring the system at any one time, and they are well-behaved.
However, I cannot find documentation on setting up Django's ORM to use this sort of transaction model. I have thrown together a minimal monkey-patch (ew!) to solve the problem, but dislike such a fragile solution. Has anyone else done this before? Have I missed some documentation somewhere?
(My version of Django is 1.0.2 Final, and I am using an Oracle database.)

Multiple, concurrent, session-scale transactions will generally lead to deadlocks or worse (worse == livelock, long delays while locks are held by another session.)
This design is not the best policy, which is why Django discourages it.
The better solution is the following.
Design a Memento class that records the user's change. This could be a saved copy of their form input. You may need to record additional information if the state changes are complex. Otherwise, a copy of the form input may be enough.
Accumulate the sequence of Memento objects in their session. Note that each step in the transaction will involve fetches from the data and validation to see if the chain of mementos will still "work". Sometimes they won't work because someone else changed something in this chain of mementos. What now?
When you present the 'ready to commit?' page, you've replayed the sequence of Mementos and are pretty sure they'll work. When the submit "Commit", you have to replay the Mementos one last time, hoping they're still going to work. If they do, great. If they don't, someone changed something, and you're back at step 2: what now?
This seems complex.
Yes, it does. However it does not hold any locks, allowing blistering speed and little opportunity for deadlock. The transaction is confined to the "Commit" view function which actually applies the sequence of Mementos to the database, saves the results, and does a final commit to end the transaction.
The alternative -- holding locks while the user steps out for a quick cup of coffee on step n-1 out of n -- is unworkable.
For more information on Memento, see this.

In case anyone else ever has the exact same problem as me (I hope not), here is my monkeypatch. It's fragile and ugly, and changes private methods, but thankfully it's small. Please don't use it unless you really have to. As mentioned by others, any application using it effectively prevents multiple users doing updates at the same time, on penalty of deadlock. (In my application, there may be many readers, but multiple concurrent updates are deliberately excluded.)
I have a "user" object which persists across a user session, and contains a persistent connection object. When I validate a particular HTTP interaction is part of a session, I also store the user object on django.db.connection, which is thread-local.
def monkeyPatchDjangoDBConnection():
import django.db
def validConnection():
if django.db.connection.connection is None:
django.db.connection.connection = django.db.connection.user.connection
return True
def close():
django.db.connection.connection = None
django.db.connection._valid_connection = validConnection
django.db.connection.close = close
monkeyPatchDBConnection()
def setUserOnThisThread(user):
import django.db
django.db.connection.user = user
This last is called automatically at the start of any method annotated with #login_required, so 99% of my code is insulated from the specifics of this hack.

I came up with something similar to the Memento pattern, but different enough that I think it bears posting. When a user starts an editing session, I duplicate the target object to a temporary object in the database. All subsequent editing operations affect the duplicate. Instead of saving the object state in a memento at each change, I store operation objects. When I apply an operation to an object, it returns the inverse operation, which I store.
Saving operations is much cheaper for me than mementos, since the operations can be described with a few small data items, while the object being edited is much bigger. Also I apply the operations as I go and save the undos, so that the temporary in the db always corresponds to the version in the user's browser. I never have to replay a collection of changes; the temporary is always only one operation away from the next version.
To implement "undo," I pop the last undo object off the stack (as it were--by retrieving the latest operation for the temporary object from the db) apply it to the temporary and return the transformed temporary. I could also push the resultant operation onto a redo stack if I cared to implement redo.
To implement "save changes," i.e. commit, I de-activate and time-stamp the original object and activate the temporary in it's place.
To implement "cancel," i.e. rollback, I do nothing! I could delete the temporary, of course, because there's no way for the user to retrieve it once the editing session is over, but I like to keep the canceled edit sessions so I can run stats on them before clearing them out with a cron job.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.