When using more complex, hierarchical models with differing settings for how cascade deletes are handled, it gets quite hard to figure out beforehand what exactly a delete() will do to the database.
I couldn't find any way to get this piece of information ("Hey SQLAlchemy, what will be deleted if I delete that object over there?") from SQLAlchemy. Implementing this by myself doesn't really seem like an option since this would result sooner or later in situations where my prediction and the actual consequences of the delete() differ, which would be very… unpleasant for the user.
I asked this question on the SQLAlchemy mailing list, and Michael Bayer explained the possible options (thanks a lot again! :-):
The only deletes that aren't present in session.deleted before the flush are those that will occur because a particular object is an "orphan", and the objects which would be deleted as a result of a cascade on that orphan.
So without orphans taken into account, session.deleted tells you everything that is to be deleted.
To take orphans into account requires traversing through all the relationships as the unit of work does, looking for objects that are currently orphans (there's an API function that will tell you this - if the object is considered an "orphan" by any attribute that refers to it with delete-orphan cascade, it's considered an "orphan"), and then traversing through the relationships of those orphans, considering them to be marked as "deleted", and then doing all the rules again for those newly-deleted objects.
The system right now is implemented by orm/dependency.py. It is probably not hard to literally run a unit of work process across the session normally, but just not emit the SQL, this would give you the final flush plan. But this is an expensive process that I wouldn't want to be calling all the time.
A feature add is difficult here because the use case is not clear. Knowing what will be deleted basically requires half the flush process actually proceed. But you can already implement events inside the flush process itself, most directly the before_delete() and after_delete() events that will guaranteed catch everything. So the rationale for a new feature that basically runs half the flush, before you just do the flush anyway and could just put events inside of it, isn't clear.
I guess the big question is, "when are you calling this".
An easy system would be to add a new event "flush_plan_complete" which will put you into a flush() right as the full plan has been assembled, but before any SQL occurs. It could allow you to register more objects for activity, and it would then rerun the flush plan to consider any new changes (since that's how it works right now anyway). How this registration would proceed is tricky, since it would be nice to use the Session normally there, but that makes this more complicated to implement. But then it could iterate again through the new changes and find any remaining steps to take before proceeding normally.
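For reference, here is a minimal sketch of the two pieces mentioned above: inspecting session.deleted before the flush, and a before_delete mapper event that fires inside the flush and catches cascade deletes as well. The model and setup are invented for illustration and use the modern declarative API.

from sqlalchemy import Column, Integer, create_engine, event
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Widget(Base):  # hypothetical model, just for illustration
    __tablename__ = "widget"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = Session(engine)

w = Widget()
session.add(w)
session.commit()

# Everything the session already knows it will delete; orphan cascades are
# only discovered once the flush plan actually runs.
session.delete(w)
print(list(session.deleted))

# Events inside the flush are guaranteed to catch everything, cascades included.
@event.listens_for(Widget, "before_delete")
def log_delete(mapper, connection, target):
    print("about to delete:", target)

session.commit()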
Hi, I have a Lambda function that serves as a webhook. The webhook may be called multiple times simultaneously with the same data. In the Lambda function, I check whether the transaction record is already present in DynamoDB. If it is present, the Lambda simply returns; otherwise it continues processing. The problem is that while one invocation is still checking the database, the Lambda gets called again; that check fails to find the record because the previous invocation has not yet inserted it, so the transaction can end up being executed multiple times.
My question is how to handle this situation. Would SQS be helpful here?
You can use optimistic locking for this. I've written a more detailed blog about implementing it, but here are the core ideas.
For each item you track a version number that always gets incremented. Each update to the item will increment the version number by one.
When you want to perform an update, you first read the old item and store its version number locally. Then you change the item locally and increment its version number. When you write it back to the table in your transaction, you add a conditional write. The condition is that the current version number of the item is still the same as it was when you read it.
This means the transaction will fail if the item has been updated in the meantime. Optimistic Locking helps you with collision detection and is a good solution under the assumption that such collisions are relatively rare. You'd be better served with different locking strategies if they're more frequent.
Optimistic Locking will help you identify the cases you're worried about. It doesn't resolve them, you'll have to implement that yourself. A common conflict resolution approach would be to read the item again and check if your changes have already been applied.
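For illustration, here is a rough boto3 sketch of that version-number scheme, assuming a hypothetical "transactions" table whose items carry a numeric version attribute (all names are made up):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("transactions")

def update_with_version_check(transaction_id, new_status):
    # Read the item and remember the version we saw.
    item = table.get_item(Key={"transaction_id": transaction_id})["Item"]
    expected_version = item["version"]
    try:
        # Write back only if nobody bumped the version in the meantime.
        table.update_item(
            Key={"transaction_id": transaction_id},
            UpdateExpression="SET #s = :status, version = version + :one",
            ConditionExpression="version = :expected",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={
                ":status": new_status,
                ":one": 1,
                ":expected": expected_version,
            },
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # collision detected; resolve it yourself
        raise
    return True

If the conditional write fails, you know another invocation got there first and can decide whether to retry, merge, or simply return.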
" if it's present in db the lambda simply return otherwise it execute further" given that, is it possible to use FIFO queue and use some "key" from the data as deduplication id (fifo) and that would mean all duplicate messages would never make it to your logic and then you would also need
dynamodb's "strongly consistent" option.
I'm selecting an object FOR UPDATE and then performing an operation on it:
obj = Model.objects.select_for_update().get(id=someID)
obj.somefield = 1
obj.save()
But I still need to keep the FOR UPDATE lock on this object.
The PostgreSQL documentation says that a FOR UPDATE lock lives until the end of the transaction, which will end here because save() triggers a commit.
Even if I manage commits manually, I still need to save some information to the database (and to do that I need to commit).
So what can I do in this situation?
If I select the object again, some other process may change it before I acquire a new lock.
(I'm using Django 1.7 and PostgreSQL 9.3.)
You can't hold a row lock after the transaction commits or rolls back.
This might be a reasonable use case for advisory locks, though it's hard to say with the limited detail provided. Advisory locks can operate at the session level, crossing transaction boundaries.
You can't (ab)use a WITH HOLD cursor:
test=> DECLARE test_curs CURSOR WITH HOLD FOR SELECT * FROM dummy FOR UPDATE;
ERROR: DECLARE CURSOR WITH HOLD ... FOR UPDATE is not supported
DETAIL: Holdable cursors must be READ ONLY.
so I think an advisory lock is pretty much your only option.
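For illustration, a rough sketch of taking a session-level advisory lock from Django with raw SQL. The pg_advisory_lock/pg_advisory_unlock functions are PostgreSQL's; the wrappers and the choice of lock key are made up, and this only works for as long as you keep using the same database connection.

from django.db import connection

def advisory_lock(lock_id):
    with connection.cursor() as cursor:
        cursor.execute("SELECT pg_advisory_lock(%s)", [lock_id])

def advisory_unlock(lock_id):
    with connection.cursor() as cursor:
        cursor.execute("SELECT pg_advisory_unlock(%s)", [lock_id])

# usage, keyed on the row's primary key:
# advisory_lock(obj.id)
# ... save(), other commits ...
# advisory_unlock(obj.id)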
The usual way of handling this, by the way, is to let the other task make changes in between. Make sure that each transaction leaves the object in a sensible, consistent state. If someone else makes a change between your two changes, design things so that's OK.
To avoid overwriting the changes made in between you can use optimistic concurrency control, otherwise known as optimistic locking, using a row-version column. If you see someone snuck in and made a change you reload the object to get the new version, repeat your change, and try saving again. Unfortunately, unlike more sophisticated ORMs like Hibernate, it doesn't look like Django has built-in support for optimistic concurrency control, but there seem to be extensions that add it.
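For illustration, a minimal sketch of that idea with a version column, reusing the question's Model name (the extra field and the helper function are hypothetical):

from django.db import models

class Model(models.Model):  # mirrors the question's model, plus a version column
    somefield = models.IntegerField()
    version = models.IntegerField(default=0)

def save_with_version_check(obj):
    # Atomically update the row only if its version is still what we read.
    updated = Model.objects.filter(id=obj.id, version=obj.version).update(
        somefield=obj.somefield,
        version=obj.version + 1,
    )
    if not updated:
        # Someone else changed the row; reload, reapply the change, retry.
        raise RuntimeError("concurrent modification detected")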
I'm trying to develop a system that will allow users to update local, offline databases on their laptops and, upon reconnection to the network, synchronize their dbs with the main, master db.
I looked at MySQL replication, but that documentation focuses on unidirectional syncing. So I think I'm going to build a custom app in python for doing this (bilateral syncing), and I have a couple of questions.
I've read a couple of posts regarding this issue, and one of the items that has been mentioned in passing is serialization (which I would implement through the pickle and cPickle modules in Python). Could someone please tell me whether this is necessary, and what the advantages of serializing data are in the context of syncing databases?
One of the uses in wikipedia's entry on serialization states it can be used as "a method for detecting changes in time-varying data." This sounds really important, because my application will be looking at timestamps to determine which records have precedence when updating the master database. So, I guess the thing I don't really get is how pickling data in python can be used to "detect changes in time-varying data", and whether or not this would supplement using timestamps in the database to determine precedence or replace this method entirely.
Anyways, high level explanations or code examples are both welcome. I'm just trying to figure this out.
Thanks
how pickling data in python can be used to "detect changes in time-varying data."
Bundling data in an opaque format tells you absolutely nothing about time-varying data, except that it might have possibly changed (but you'd need to check that manually by unwrapping it). What the article is actually saying is...
To quote the actual relevant section (link to article at this moment in time):
Since both serializing and deserializing can be driven from common code, (for example, the Serialize function in Microsoft Foundation Classes) it is possible for the common code to do both at the same time, and thus 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy, since differences can be detected "on the fly". This is a way to understand the technique called differential execution[a link which does not exist]. It is useful in the programming of user interfaces whose contents are time-varying — graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.
The term "differential execution" seems to be a neologism coined by this person, where he described it in another StackOverflow answer: How does differential execution work?. Reading over that answer, I think I understand what he's trying to say. He seems to be using "differential execution" as a MVC-style concept, in the context where you have lots of view widgets (think a webpage) and you want to allow incremental changes to update just those elements, without forcing a global redraw of the screen. I would not call this "serialization" in the classic sense of the word (not by any stretch, in my humble opinion), but rather "keeping track of the past" or something like that. Because this basically has nothing to do with serialization, the rest of this answer (my interpretation of what he is describing) is probably not worth your time unless you are interested in the topic.
In general, avoiding a global redraw is impossible. Global redraws must sometimes happen: for example in HTML, if you increase the size of an element, you need to reflow lower elements, triggering a repaint. In 3D, you need to redraw everything behind what you update. However, if you follow this technique, you can reduce (though not eliminate) the number of redraws. He claims this technique avoids the use of most events, avoids OOP, and uses only imperative procedures and macros. My interpretation goes as follows:
Your drawing functions must know, somehow, how to "erase" themselves and anything they do which may affect the display of unrelated functions.
Write a side-effect-free paintEverything() script that imperatively displays everything (e.g. using functions like paintButton() and paintLabel()), using nothing but IF macros/functions. The IF macro works just like an if-statement, except...
Whenever you encounter an IF branch, keep track of both which IF statement this was, and the branch you took. "Which IF statement this was" is sort of a vague concept. For example you might decide to implement a FOR loop by combining IFs with recursion, in which case I think you'd need to keep track of the IF statement as a tree (whose nodes are either function calls or IF statements). You ensure the structure of that tree corresponds to the precedence rule "child layout choices depend on this layout choice".
Every time a user input event happens, rerun your paintEverything() script. However because we have kept track of which part of the code depends on which other parts, we can automatically skip anything which did not depend on what was updated. For example if paintLabel() did not depend on the state of the button, we can avoid rerunning that part of the paintEverything() script.
The "serialization" (not really serialization, more like naturally-serialized data structure) comes from the execution history of the if-branches. Except, serialization here is not necessary at all; all you needed was to keep track of which part of the display code depends on which others. It just so happens that if you use this technique with serially-executed "smart-if"-statements, it makes sense to use a lazily-evaluated diff of execution history to determine what you need to update.
However this technique does have useful takeaways. I'd say the main takeaway is: it is also a reasonable thing to keep track of dependencies not just in an OOP-style (e.g. not just widget A depends on widget B), but dependencies of the basic combinators in whatever DSL you are programming in. Also dependencies can be inferred from the structure of your program (e.g. like HTML does).
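To make that takeaway concrete, here is a tiny Python sketch of the "skip what didn't change" idea, with all names invented. It just remembers which inputs each paint call used last time and skips the call when none of them changed:

state = {"button_pressed": False, "label_text": "hello"}
last_seen = {}  # widget name -> snapshot of the inputs it used last time

def changed(widget, *keys):
    snapshot = {k: state[k] for k in keys}
    if last_seen.get(widget) == snapshot:
        return False
    last_seen[widget] = snapshot
    return True

def paint_everything():
    if changed("button", "button_pressed"):
        print("repaint button:", state["button_pressed"])
    if changed("label", "label_text"):
        print("repaint label:", state["label_text"])

paint_everything()              # first run repaints everything
state["button_pressed"] = True
paint_everything()              # only the button is repainted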
A user can perform an action, which may trigger dependent actions (which themselves may have dependent actions) and I want to be able to cancel the whole thing if the user cancels a dependent action.
The typical way I've seen this done is some variant of an undo stack and each action will need to know how to undo itself, and then if a child action is cancelled the undo's cascade their way up. Sometimes writing undo methods are tricky and there isn't always enough information in context to properly know how to undo an action in an isolated manner.
I just thought of a (potentially) easier way, which is to just pickle the state of the (relevant parts of the) program; the cancel would then simply restore its former state, without needing to create separate undo logic for each action.
Has anyone tried this? Any gotchas to watch out for? Any reason not to do this?
Edit: The dependent actions must happen after the parent action (and even whether there are dependent actions may depend on the result of the parent action), so just checking all the dependencies before doing anything isn't an option. I guess you could say an action triggers other actions, but if one of the triggered actions cannot be performed, then none of it happened.
Well, as you mentioned the data design is loosely coupled, so I don't think you need to pickle it if it's in memory. Just take a copy of all the relevant variables; transaction.abort() would then just copy them back, and transaction.commit() would just discard the copy of the data.
There are issues, but none that you don't have with the pickle solution.
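As a rough sketch of that suggestion, assuming the relevant state lives in a dict (class and names are invented):

import copy

class InMemoryTransaction:
    def __init__(self, relevant_state):
        self.state = relevant_state                    # the variables your actions mutate
        self._snapshot = copy.deepcopy(relevant_state)

    def abort(self):
        # Copy the saved values back over the live state.
        self.state.clear()
        self.state.update(self._snapshot)

    def commit(self):
        # Keep the changes and drop the saved copy.
        self._snapshot = None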
You can use pickle to store your state if all elements of state are serializable (usually they are). The only reasons for not doing so:
if you have to store pointers to any objects that are not saved in the state, you will have problems with these pointers after performing an undo operation.
this method could be expensive, depending on the size of your state.
You can also compress the pickled state (e.g. with zlib) to lower memory usage at the cost of more CPU time.
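A minimal sketch of the pickle-based snapshot/restore, with optional zlib compression; the program_state dict here is just a stand-in for the relevant parts of your program:

import pickle
import zlib

def snapshot(program_state):
    # Everything in program_state must be picklable; live references to
    # objects outside the snapshot will not be fixed up on restore.
    return zlib.compress(pickle.dumps(program_state))

def restore(blob):
    return pickle.loads(zlib.decompress(blob))

saved = snapshot({"documents": ["a", "b"], "cursor": 2})
# ... run the action and its dependent actions; if the user cancels:
program_state = restore(saved)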
I'm making a Django web-app which allows a user to build up a set of changes over a series of GETs/POSTs before committing them to the database (or reverting) with a final POST. I have to keep the updates isolated from any concurrent database users until they are confirmed (this is a configuration front-end), ruling out committing after each POST.
My preferred solution is to use a per-session transaction. This keeps all the problems of remembering what's changed (and how it affects subsequent queries), together with implementing commit/rollback, in the database where it belongs. Deadlock and long-held locks are not an issue, as due to external constraints there can only be one user configuring the system at any one time, and they are well-behaved.
However, I cannot find documentation on setting up Django's ORM to use this sort of transaction model. I have thrown together a minimal monkey-patch (ew!) to solve the problem, but dislike such a fragile solution. Has anyone else done this before? Have I missed some documentation somewhere?
(My version of Django is 1.0.2 Final, and I am using an Oracle database.)
Multiple, concurrent, session-scale transactions will generally lead to deadlocks or worse (worse == livelock, long delays while locks are held by another session.)
This design is not the best policy, which is why Django discourages it.
The better solution is the following.
Design a Memento class that records the user's change. This could be a saved copy of their form input. You may need to record additional information if the state changes are complex. Otherwise, a copy of the form input may be enough.
Accumulate the sequence of Memento objects in their session. Note that each step will involve fetches from the database and validation to see whether the chain of mementos will still "work". Sometimes they won't work because someone else changed something in that chain of mementos. What now?
When you present the 'ready to commit?' page, you've replayed the sequence of Mementos and are pretty sure they'll work. When the user submits "Commit", you have to replay the Mementos one last time, hoping they're still going to work. If they do, great. If they don't, someone changed something, and you're back at step 2: what now?
This seems complex.
Yes, it does. However it does not hold any locks, allowing blistering speed and little opportunity for deadlock. The transaction is confined to the "Commit" view function which actually applies the sequence of Mementos to the database, saves the results, and does a final commit to end the transaction.
The alternative -- holding locks while the user steps out for a quick cup of coffee on step n-1 out of n -- is unworkable.
For more information on Memento, see this.
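As a very loose sketch of that flow, using the modern transaction.atomic() API for brevity; apply_memento, the session key, and the view skeleton are invented:

from django.db import transaction

def remember(request, form_data):
    # Accumulate one memento per confirmed step in the user's session.
    mementos = request.session.setdefault("mementos", [])
    mementos.append(form_data)
    request.session.modified = True

def commit_view(request):
    mementos = request.session.get("mementos", [])
    with transaction.atomic():       # the only real database transaction
        for data in mementos:
            apply_memento(data)      # hypothetical: re-validate and apply each step
    request.session["mementos"] = []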
In case anyone else ever has the exact same problem as me (I hope not), here is my monkeypatch. It's fragile and ugly, and changes private methods, but thankfully it's small. Please don't use it unless you really have to. As mentioned by others, any application using it effectively prevents multiple users doing updates at the same time, on penalty of deadlock. (In my application, there may be many readers, but multiple concurrent updates are deliberately excluded.)
I have a "user" object which persists across a user session, and contains a persistent connection object. When I validate a particular HTTP interaction is part of a session, I also store the user object on django.db.connection, which is thread-local.
def monkeyPatchDjangoDBConnection():
    import django.db

    def validConnection():
        # Reuse the per-user persistent connection stored on the thread-local
        # django.db.connection instead of letting Django open a new one.
        if django.db.connection.connection is None:
            django.db.connection.connection = django.db.connection.user.connection
        return True

    def close():
        # Detach rather than really closing, so the user's connection survives.
        django.db.connection.connection = None

    django.db.connection._valid_connection = validConnection
    django.db.connection.close = close

monkeyPatchDjangoDBConnection()

def setUserOnThisThread(user):
    import django.db
    django.db.connection.user = user
This last is called automatically at the start of any method annotated with @login_required, so 99% of my code is insulated from the specifics of this hack.
I came up with something similar to the Memento pattern, but different enough that I think it bears posting. When a user starts an editing session, I duplicate the target object to a temporary object in the database. All subsequent editing operations affect the duplicate. Instead of saving the object state in a memento at each change, I store operation objects. When I apply an operation to an object, it returns the inverse operation, which I store.
Saving operations is much cheaper for me than mementos, since the operations can be described with a few small data items, while the object being edited is much bigger. Also I apply the operations as I go and save the undos, so that the temporary in the db always corresponds to the version in the user's browser. I never have to replay a collection of changes; the temporary is always only one operation away from the next version.
To implement "undo," I pop the last undo object off the stack (as it were--by retrieving the latest operation for the temporary object from the db) apply it to the temporary and return the transformed temporary. I could also push the resultant operation onto a redo stack if I cared to implement redo.
To implement "save changes," i.e. commit, I de-activate and time-stamp the original object and activate the temporary in it's place.
To implement "cancel," i.e. rollback, I do nothing! I could delete the temporary, of course, because there's no way for the user to retrieve it once the editing session is over, but I like to keep the canceled edit sessions so I can run stats on them before clearing them out with a cron job.