Automatically merging arbitrary Django models - python

I have two Django-ORM managed databases that I'd like to merge. Both have a very similar schema, and both have the standard auth_users table, along with a few other shared tables that reference each other as well as auth_users, which I'd like to merge into a single database automatically.
Understandably, this could be very non-trivial depending upon the foreign-key relationships, and what constitutes a "unique" record in each table.
Does anyone know if there exists a tool to do this merge operation?
If nothing like this currently exists, I was considering writing my own management command, based on the standard loaddata command. Essentially, you'd use the standard dumpdata command to export tables from a source database, and then use a modified version of loaddata to "merge" them into the destination database.
For example, if I have databases A and B, and I want to merge database B into database A, then I'd want to follow a procedure according to the pseudo-code:
merge_database_dst = A
merge_database_src = B

# Maps (table, source ID) -> destination ID. Kept across tables so that
# child tables can translate foreign keys to parents merged earlier.
src_id_to_dst_id = {}

for table in sorted(merge_database_dst.get_redundant_tables(merge_database_src),
                    key=acyclic_dependency):
    key = table.get_unique_column_key()
    for record_src in merge_database_src.table.objects.all():
        src_key_value = record_src.get_key_value(key)
        try:
            record_dst = merge_database_dst.table.objects.get(**{key: src_key_value})
        except merge_database_dst.table.DoesNotExist:
            record_dst = merge_database_dst.table(
                **{f.name: convert_fk(f, getattr(record_src, f.name))
                   for f in record_src._meta.fields if not f.primary_key})
            record_dst.save()
        src_id_to_dst_id[(table, record_src.id)] = record_dst.id
The convert_fk() function would use the src_id_to_dst_id index to convert foreign key references in the source table to the equivalent IDs in the destination table.
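A minimal sketch of what convert_fk() might look like, assuming Django field introspection and the src_id_to_dst_id mapping above (the key used for the lookup must match whatever the outer loop records):

def convert_fk(field, value):
    # Plain (non-relational) columns pass through untouched.
    if not field.is_relation:
        return value
    # Foreign-key values arrive as raw source IDs; look up the ID the
    # referenced record was given (or matched to) in the destination.
    parent_table = field.related_model
    try:
        return src_id_to_dst_id[(parent_table, value)]
    except KeyError:
        # The referenced record was never merged; exactly the case a
        # dry-run pass should surface before anything is committed.
        raise ValueError('unresolved FK %s -> %s' % (field.name, value))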
To summarize, the algorithm would iterate over the table to be merged in the order of dependency, with parents iterated over first. So if we wanted to merge tables auth_users and mycustomprofile, which is dependent on auth_users, we'd iterate ['auth_users','mycustomprofile'].
Each merged table would need some sort of indicator documenting the combination of columns that denotes a universally unique record (i.e. the "key"). For auth_users, that might be the "username" and/or "email" column.
If the value of the key in database B already exists in A, then the record is not imported from B, but the ID of the existing record in A is recorded.
If the value of the key in database B does not exist in A, then the record is imported from B, and the ID of the new record is recorded.
Using the previously recorded ID, a mapping is created, explaining how to map foreign-key references to that specific record in B to the new merged/pre-existing record in A. When future records are merged into A, this mapping would be used to convert the foreign keys.
I could still envision some cases where an imported record references a table not included in the dumpdata, which might cause the entire import to fail, therefore some sort of "dryrun" option would be needed to simulate the import to ensure all FK references can be translated.
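One cheap way to get that dry run, assuming the destination database is transactional (merge_tables() is a hypothetical wrapper around the loop sketched above, and dst.alias stands in for the destination database alias):

from django.db import transaction

def merge(src, dst, dry_run=False):
    # Run the whole merge in a transaction; when only simulating, force a
    # rollback at the end so unresolved FK references surface as errors
    # without the destination data ever being modified.
    with transaction.atomic(using=dst.alias):
        merge_tables(src, dst)
        if dry_run:
            transaction.set_rollback(True, using=dst.alias)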
Does this seem like a practical approach? Is there a better way?
EDIT: This isn't exactly what I'm looking for, but I thought others might find it interesting. The Turbion project has a mechanism for copying changes between equivalent records in different Django models within the same database. It works by defining a translation layer (i.e. merging.ModelLayer) between two Django models, so, say, if you update the "www" field in user bob@bob.com's profile, it'll automatically update the "url" field in user bob@bob.com's otherprofile.
The functionality I'm looking for is a bit different, in that I want to merge an entire (or partial) database snapshot at infrequent intervals, sort of the way the loaddata management command does.

Wow. This is going to be a complex job regardless. That said:
If I understand the needs of your project correctly, this can be done with a data migration in South. Even so, I'd be lying if I said it was going to be easy.
My recommendation is -- and this is mostly a parrot of an assumption in your question, but I want to make it clear -- that you have one "master" table that is the base, and which has records from the other table added to it. So, table A keeps all of its existing records, and only gets additions from B. B feeds additions into A, and once done, B is deleted.
I'm hesitant to write you sample code because your actual job will be so much more complex than this, but I will anyway to try and point you in the right direction. Consider something like...
import datetime
from south.db import db
from south.v2 import DataMigration
from django.db import models

class Migration(DataMigration):

    def forwards(self, orm):
        for b in orm.B.objects.all():
            # sanity check: does this item get copied into A at all?
            if orm.A.objects.filter(username=b.username):
                continue

            # make an A record with the properties of my B record
            a = orm.A(
                first_name=b.first_name,
                last_name=b.last_name,
                email_address=b.email_address,
                [...]
            )

            # save the new A record, and delete the B record
            a.save()
            b.delete()

    def backwards(self, orm):
        # backwards method, if you write one
        pass
This would end up migrating all of the Bs not in A to A, and leave you a table of Bs that are expected duplicates, which you could then check by some other means before deleting.
Like I said, this sample isn't meant to be complete. If you decide to go this route, spend time in the South documentation, and particularly make sure you look at data migrations.
That's my 2¢. Hope it helps.

Related

What is the difference between with_entities and load_only in SQLAlchemy?

When querying my database, I only want to load specified columns. Creating a query with with_entities requires a reference to the model column attribute, while creating a query with load_only requires a string corresponding to the column name. I would prefer to use load_only because it is easier to create a dynamic query using strings. What is the difference between the two?
load_only documentation
with_entities documentation
There are a few differences. The most important one when discarding unwanted columns (as in the question) is that using load_only will still result in creation of an object (a Model instance), while using with_entities will just get you tuples with values of chosen columns.
>>> query = User.query
>>> query.options(load_only('email', 'id')).all()
[<User 1 using e-mail: n@d.com>, <User 2 using e-mail: n@d.org>]
>>> query.with_entities(User.email, User.id).all()
[('n@d.org', 1), ('n@d.com', 2)]
load_only
load_only() tells the query to load only the given columns of a model and defers the rest.
The deferred columns are removed from the SELECT, but you can still access them later; an additional query will be performed (in the background) the moment you try to access one of them.
"Load only" is useful when you store things like pictures of users in your database but you do not want to waste time transferring the images when not needed. For example, when displaying a list of users this might suffice:
User.query.options(load_only('name', 'fullname'))
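To make the deferred behaviour concrete, a small sketch (the picture column is a made-up example of a large deferred attribute):

user = User.query.options(load_only('name', 'fullname')).first()
user.name      # already loaded, no extra SQL
user.picture   # deferred column: silently triggers a second SELECT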
with_entities
with_entities() can either add or remove (simply: replace) models or columns; you can even use it to modify the query, to replace selected entities with your own function like func.count():
query = User.query
count_query = query.with_entities(func.count(User.id))
count = count_query.scalar()
Note that the resulting query is not the same as that of query.count(), which would probably be slower - at least in MySQL (as it generates a subquery).
Another example of the extra capabilities of with_entities would be:
query = (
    Page.query
    .filter(<a lot of page filters>)
    .join(Author).filter(<some author filters>)
)
pages = query.all()
# ok, I got the pages. Wait, what? I want the authors too!
# how to do it without generating the query again?
pages_and_authors = query.with_entities(Page, Author).all()

Effective batch "update-or-insert" in SqlAlchemy

There exists a table Users, and in my code I have a big list of User objects. To insert them I can use:
session.add_all(user_list)
session.commit()
The problem is that there can be several duplicates, which I want to update, but the database won't allow duplicate entries to be inserted. Of course, I can iterate over user_list, try to insert each user into the database, and update it if the insert fails:
for u in users:
    q = session.query(T).filter(T.fullname == u.fullname).first()
    if q:
        session.query(T).filter_by(index=q.index).update(
            {column: getattr(u, column)
             for column in Users.__table__.columns.keys()
             if column != 'id'})
        session.commit()
    else:
        session.add(u)
        session.commit()
but I find this solution quite inefficient: first, I am making a separate query to retrieve each object q, and instead of batch-inserting the new items I insert them one by one. I wonder if there exists a better solution for this task.
UPD better version:
for u in users:
    q = session.query(T).filter(Users.fullname == u.fullname).first()
    if q:
        for column in Users.__table__.columns.keys():
            if column != 'index':
                setattr(q, column, getattr(u, column))
        session.add(q)
    else:
        session.add(u)
session.commit()
A better solution would be to use the bulk MySQL construct
INSERT ... ON DUPLICATE KEY UPDATE ...
(I assume you're using MySQL because your post is tagged with 'mysql'). This way you're both inserting new entries and updating existing ones in one statement/transaction; see http://dev.mysql.com/doc/refman/5.6/en/insert-on-duplicate.html
It's not ideal if you have multiple unique indexes and, depending on your schema, you'll have to fill in all NOT NULL values (hence issuing one bulk SELECT before calling it), but it's definitely the most efficient option and we use it a lot. The bulk version will look something like (let's assume name is a unique key):
INSERT INTO User (name, phone, ...) VALUES
    ('ksmith', '111-11-11', ...),
    ('jford', '222-22-22', ...),
    ...
ON DUPLICATE KEY UPDATE
    phone = VALUES(phone),
    ... ;
Unfortunately, INSERT ... ON DUPLICATE KEY UPDATE is not supported natively by SQLAlchemy, so you'll have to implement a little helper function which will build the query for you.
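A minimal sketch of such a helper (my own, not from the answer): it renders the statement as raw SQL with numbered bind parameters, and it assumes every dict in rows has the same keys and that the unique key is among those columns. Note also that newer SQLAlchemy releases (1.2+) ship sqlalchemy.dialects.mysql.insert() with an on_duplicate_key_update() method, which removes the need for string building.

from sqlalchemy import text

def upsert(session, table, rows, update_cols):
    # Build and run one INSERT ... ON DUPLICATE KEY UPDATE for a list of dicts.
    if not rows:
        return
    cols = list(rows[0].keys())
    # one "(:name_0, :phone_0, ...)" tuple of bind parameters per row
    values = ", ".join(
        "(" + ", ".join(":%s_%d" % (c, i) for c in cols) + ")"
        for i in range(len(rows)))
    updates = ", ".join("%s = VALUES(%s)" % (c, c) for c in update_cols)
    sql = "INSERT INTO %s (%s) VALUES %s ON DUPLICATE KEY UPDATE %s" % (
        table, ", ".join(cols), values, updates)
    params = {"%s_%d" % (c, i): row[c]
              for i, row in enumerate(rows) for c in cols}
    session.execute(text(sql), params)

# e.g. upsert(session, "User",
#             [{"name": u.fullname, "phone": u.phone} for u in users],
#             update_cols=["phone"])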

SQLAlchemy: Flush-Order for Inserts wrong?

I have 2 tables (contracts and contract_items) where the latter has a foreign key set to the first one.
When using SQLAlchemy for inserting new data from a list into my Postgres database I'm basically doing the following:
for row in list:
    # Get contract and item from row
    ...
    session.add(contract)
    session.add(contract_item)
    # Do some select statements (which will raise an auto-flush)
    ...
    session.commit()
Now... this works for maybe 2-3 runs, sometimes more, sometimes less. Then the part where an auto-flush is executed ends in an exception telling me that contract_item could not be inserted because it has a foreign key to contract and the contract row does not exist yet.
Is the order in which I pass the data to the session's add function not the order in which the data will be flushed? I actually hoped SQLAlchemy would find the right order in which to flush statements on its own, based on the dependencies. It should be clear that the contract_item row must not be inserted before the contract row when contract_item has a foreign key to contract. Yet the order seems to be random.
I then tried to flush the contract manually before adding contract_item:
for row in list:
    # Getting contract and item from row
    ...
    session.add(contract)
    session.flush()  # Flushing manually
    session.add(contract_item)
    # Do some select statements (which will raise an auto-flush)
    ...
    session.commit()
This worked without any problems and the rows got inserted into the database.
Is there any way to set the order in which statements will be flushed for the session? Does SQLAlchemy really not care about dependencies such as foreign keys, or am I making a mistake when adding the data? I'd rather not manage the flushes manually if at all possible.
Is there a way to make SQLAlchemy get the order right?
Had the same problem. What solved it in my case was creating a bidirectional relationship - you need to add a relationship from contracts to contract_items, as described HERE.
UPD: actually you can do it more simply: just add a relationship from the contract_items table to the contracts table and that should do the trick.
The way the session handles related objects is defined by cascades. Use the "save-update" cascade on a relationship (it is enabled by default) to automatically add related objects, so that you only have to make a single add() call. The documentation I linked contains a code example.
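A sketch of what that looks like; the model and column names below are guesses that mirror the question, not the asker's actual code:

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Contract(Base):
    __tablename__ = 'contracts'
    id = Column(Integer, primary_key=True)

class ContractItem(Base):
    __tablename__ = 'contract_items'
    id = Column(Integer, primary_key=True)
    contract_id = Column(Integer, ForeignKey('contracts.id'), nullable=False)
    # The relationship lets the unit of work see the dependency, so the
    # contract row is always flushed before the contract_item row.
    contract = relationship(Contract, backref='items')

# contract_item.contract = contract   # wire the objects together
# session.add(contract_item)          # "save-update" cascade pulls in the
# session.commit()                    # related contract automatically

With the relationship in place the unit of work orders the INSERTs itself, so the manual flush() is no longer needed.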

How to generate a random Id?

I have a Postgres 9.3 table that has a column called id as its PKEY; id is char(9) and only allows lowercase a-z and 0-9. I use Python with psycopg to insert into this table.
When I need to insert into this table, I call a Python function get_new_id(). My question is: how do I make get_new_id() efficient?
I have the following solutions, none of them satisfy me.
a) Pre-generate a lot of ids and store them in some table; when I need a new id, SELECT one from this table, delete it from the table, and return it. The downside of this solution is that the table has to be maintained, and each get_new_id() call also performs a SELECT COUNT to find out whether more ids need to be generated and put into this table.
b) When get_new_id() gets called, it generates a random id, then passes this id to a stored procedure to check whether this id is already in use; if not, we are good, if yes, do b) again. The downside of this solution is that as the table gets bigger the failure rate may become high, and there is also a chance that two get_new_id() calls in two processes generate the same id, say 1234567, which is not used as a PKEY yet, so when inserting, one of the processes will fail.
I think this is a pretty old problem, what's the perfect solution?
Edit
I think this has been answered, see Jon Clements' comment.
Offtopic because you already have a char(9) datatype:
I would use a UUID when a random string is needed; it's a standard and almost any programming language (including Python) can generate UUIDs for you.
PostgreSQL can also do it for you, using the uuid-ossp extension.
select left(md5(random()::text || now()), 9);

    left
-----------
 c4c384561
Make the id the primary key and try the insert. If an exception is thrown, catch it and retry. Nothing fancy about it. Why only 9 characters? Make it the full 32.
Check this answer for how to make it smaller: https://stackoverflow.com/a/15982876/131874
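A minimal sketch of that insert-and-retry approach with psycopg2 (the table and column names are placeholders; psycopg2 >= 2.8 exposes UniqueViolation, on older versions catch psycopg2.IntegrityError instead):

import random
import string
import psycopg2

ALPHABET = string.ascii_lowercase + string.digits

def random_id(length=9):
    return ''.join(random.choice(ALPHABET) for _ in range(length))

def insert_with_new_id(conn, name):
    # Keep trying until the generated id does not collide with an existing row.
    while True:
        new_id = random_id()
        try:
            with conn:                       # commits on success, rolls back on error
                with conn.cursor() as cur:
                    cur.execute(
                        "INSERT INTO mytable (id, name) VALUES (%s, %s)",
                        (new_id, name))
            return new_id
        except psycopg2.errors.UniqueViolation:
            continue                         # collision: generate another id and retry

With 36^9 possible ids, collisions stay rare until the table is very large, so the loop almost always succeeds on the first try.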

Python script to diff same table in two different databases

I am about to write a python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar, and I can either use, or use as a starting point for rolling my own at least. The idea is to diff the data between specific tables, and then to store the diff as SQL INSERT statements to be applied to the earlier version database.
Note: This script is not robust in the face of schema changes
Generally the logic would be something along the lines of
def diff_table(table1, table2):
    # return all rows in table2 that are not in table1
    pass

def persist_rows_tofile(rows, tablename):
    # save rows to file
    pass

dbnames = ('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')

for table in tables_to_process:
    table1 = dbnames[0] + '.' + table
    table2 = dbnames[1] + '.' + table
    rows = diff_table(table1, table2)
    if len(rows):
        persist_rows_tofile(rows, table)
Is this a good way to write such a script, or could it be improved? I suspect it could be improved by caching database connections etc. (which I have left out because I am not too familiar with SQLAlchemy).
Any tips on how to add SQLAlchemy and generally improve such a script?
To move data between two databases I use pg_comparator. It's like diff and patch for SQL! You can use it to swap the order of columns, but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron job runs every five minutes and pushes all changes on the "master" database to the "slave" databases. It's especially handy if you only need to distribute a single table, or not all columns per table, etc.
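As for the SQLAlchemy part of the question, one possible way to flesh out diff_table() is with table reflection; the connection URLs, the shared-schema assumption and the in-memory set difference are all assumptions of this sketch, not something from the thread:

from sqlalchemy import MetaData, Table, create_engine, select

def diff_table(table_name, url_old, url_new):
    # Return all rows present in the table at url_new but missing from url_old.
    eng_old, eng_new = create_engine(url_old), create_engine(url_new)
    t_old = Table(table_name, MetaData(), autoload_with=eng_old)
    t_new = Table(table_name, MetaData(), autoload_with=eng_new)
    with eng_old.connect() as c_old, eng_new.connect() as c_new:
        # Plain set difference on whole rows; assumes both versions of the
        # table have identical columns (the question rules out schema changes).
        old_rows = {tuple(row) for row in c_old.execute(select(t_old))}
        return [row for row in c_new.execute(select(t_new))
                if tuple(row) not in old_rows]

# rows = diff_table('foo', 'postgresql:///db_v1', 'postgresql:///db_v2')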
