Python script to diff same table in two different databases

Python script to diff same table in two different databases - python

I am about to write a python script to help me migrate data between different versions of the same application.
Before I get started, I would like to know if there is a script or module that does something similar, and I can either use, or use as a starting point for rolling my own at least. The idea is to diff the data between specific tables, and then to store the diff as SQL INSERT statements to be applied to the earlier version database.
Note: This script is not robust in the face of schema changes
Generally the logic would be something along the lines of
def diff_table(table1, table2):
# return all rows in table 2 that are not in table1
pass
def persist_rows_tofile(rows, tablename):
# save rows to file
pass
dbnames=('db.v1', 'db.v2')
tables_to_process = ('foo', 'foobar')
for table in tables_to_process:
table1 = dbnames[0]+'.'+table
table2 = dbnames[1]+'.'+table
rows = diff_table(table1, table2)
if len(rows):
persist_rows_tofile(rows, table)
Is this a good way to write such a script or could it be improved?. I suspect it could be improved by cacheing database connections etc (which I have left out - because I am not too familiar with SqlAlchemy etc).
Any tips on how to add SqlAlchemy and to generally improve such a script?

To move data between two databases I use pg_comparator. It's like diff and patch for sql! You can use it to swap the order of columns but if you need to split or merge columns you need to use something else.
I also use it to duplicate a database asynchronously. A cron-job runs every five minutes and pushes all changes on the "master"-database to the "slave"-databases. Especially handy if you only need distribute a single table, or a not all columns per table etc.

Related

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has an 'date' column containing DateTime object. I want to get a list containing every date that exists within all of the tables. I'm sure there should be an easy way to do this, but I've very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in python with sqlalchemy. For each table, I did created a select distinct for the 'date' column, then set that list of selectes that to the selects property of a CompoundSelect object, and executed. As one might expect from an ugly brute force query, it has ben running now for an hour or so, and I am unsure if it has broken silently somewhere and will never return.
Is there a clean and better way to do this?

You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl name;
BEGIN
FOR tbl IN SELECT 'public.' || tablename FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function creates a result set in memory, but if the number of distinct dates in the rows in the 1,000+ tables is very large, the results will be written to disk. If you expect this to happen, then you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.

Ended up reverting back to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things with the dataset that helped with this query- I only wanted distinct dates from each table, and that the dates were the PK in my set. I ended up using the approach from this wiki page. Code being sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT knowledge_date FROM schema.table WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.

How to read a csv using sql

I would like to know how to read a csv file using sql. I would like to use group by and join other csv files together. How would i go about this in python.
example:
select * from csvfile.csv where name LIKE 'name%'

SQL code is executed by a database engine. Python does not directly understand or execute SQL statements.
While some SQL database store their data in csv-like files, almost all of them use more complicated file structures. Therefore, you're required to import each csv file into a separate table in the SQL database engine. You can then use Python to connect to the SQL engine and send it SQL statements (such as SELECT). The engine will perform the SQL, extract the results from its data files, and return them to your Python program.
The most common lightweight engine is SQLite.

littletable is a Python module I wrote for working with lists of objects as if they were database tables, but using a relational-like API, not actual SQL select statements. Tables in littletable can easily read and write from CSV files. One of the features I especially like is that every query from a littletable Table returns a new Table, so you don't have to learn different interfaces for Table vs. RecordSet, for instance. Tables are iterable like lists, but they can also be selected, indexed, joined, and pivoted - see the opening page of the docs.
# print a particular customer name
# (unique indexes will return a single item; non-unique
# indexes will return a Table of all matching items)
print(customers.by.id["0030"].name)
print(len(customers.by.zipcode["12345"]))
# print all items sold by the pound
for item in catalog.where(unitofmeas="LB"):
print(item.sku, item.descr)
# print all items that cost more than 10
for item in catalog.where(lambda o : o.unitprice>10):
print(item.sku, item.descr, item.unitprice)
# join tables to create queryable wishlists collection
wishlists = customers.join_on("id") + wishitems.join_on("custid") + catalog.join_on("sku")
# print all wishlist items with price > 10
bigticketitems = wishlists().where(lambda ob : ob.unitprice > 10)
for item in bigticketitems:
print(item)
Columns of Tables are inferred from the attributes of the objects added to the table. namedtuples are good also, as well as a types.SimpleNamespaces. You can insert dicts into a Table, and they will be converted to SimpleNamespaces.
littletable takes a little getting used to, but it sounds like you are already thinking along a similar line.

You can easily query an SQL Database using PHP script. PHP runs serverside, so all your code will have to be on a webserver (the one with the database). You could make a function to connect to the database like this:
$con= mysql_connect($hostname, $username, $password)
or die("An error has occured");
Then use the $con to accomplish other tasks such as looping through data and creating a table, or even adding rows and columns to an existing table.
EDIT: I noticed you said .CSV file. You can upload a CSV file into a SQL database and create a table out of it. If you are using a control panel service such as phpMyAdmin, you can simply import a CSV file into your database like this:
If you are looking for a free web host to test your SQL and PHP files on, check out x10 hosting.

Automatically merging arbitrary Django models

I have two Django-ORM managed databases that I'd like to merge. Both have a very similar schema, and both have the standard auth_users table, along with a few other shared tables that reference each other as well as auth_users, which I'd like to merge into a single database automatically.
Understandably, this could be very non-trivial depending upon the foreign-key relationships, and what constitutes a "unique" record in each table.
Does anyone know if there exists a tool to do this merge operation?
If nothing like this currently exists, I was considering writing my own management command, based on the standard loaddata command. Essentially, you'd use the standard dumpdata command to export tables from a source database, and then use a modified version of loaddata to "merge" them into the destination database.
For example, if I have databases A and B, and I want to merge database B into database A, then I'd want to follow a procedure according to the pseudo-code:
merge_database_dst = A
merge_database_src = B
for table in sorted(merge_database_dst.get_redundant_tables(merge_database_src), key=acyclic_dependency):
key = table.get_unique_column_key()
src_id_to_dst_id = {}
for record_src in merge_database_src.table.objects.all():
src_key_value = record_src.get_key_value(key)
try:
record_dst = merge_database_dst.table.objects.get(key)
dst_key_value = record_dst.get_key_value(key)
except merge_database_dst.table.DoesNotExist:
record_dst = merge_database_dst.table(**[(k,convert_fk(v)) for k,v in record_src._meta.fields])
record_dst.save()
dst_key_value = record_dst.get_key_value(key)
src_id_to_dst_id[(table,record_src.id)] = record_dst.id
The convert_fk() function would use the src_id_to_dst_id index to convert foreign key references in the source table to the equivalent IDs in the destination table.
To summarize, the algorithm would iterate over the table to be merged in the order of dependency, with parents iterated over first. So if we wanted to merge tables auth_users and mycustomprofile, which is dependent on auth_users, we'd iterate ['auth_users','mycustomprofile'].
Each merged table would need some sort of indicator documenting the combination of columns that denotes a universally unique record (i.e. the "key"). For auth_users, that might be the "username" and/or "email" column.
If the value of the key in database B already exists in A, then the record is not imported from B, but the ID of the existing record in A is recorded.
If the value of the key in database B does not exist in A, then the record is imported from B, and the ID of the new record is recorded.
Using the previously recorded ID, a mapping is created, explaining how to map foreign-key references to that specific record in B to the new merged/pre-existing record in A. When future records are merged into A, this mapping would be used to convert the foreign keys.
I could still envision some cases where an imported record references a table not included in the dumpdata, which might cause the entire import to fail, therefore some sort of "dryrun" option would be needed to simulate the import to ensure all FK references can be translated.
Does this seem like a practical approach? Is there a better way?
EDIT: This isn't exactly what I'm looking for, but I thought others might find it interesting. The Turbion project has a mechanism for copying changes between equivalent records in different Django models within the same database. It works by defining a translation layer (i.e. merging.ModelLayer) between two Django models, so, say if you update the "www" field in user bob#bob.com's profile, it'll automatically update the "url" field in user bob#bob.com's otherprofile.
The functionality I'm looking for is a bit different, in that I want to merge an entire (or partial) database snapshot at infrequent intervals, sort of the way the loaddata management command does.

Wow. This is going to be a complex job regardless. That said:
If I understand the needs of your project correctly, this can be something that can be done using a data migration in South. Even so, I'd be lying if I said it was going to be a joke.
My recommendation is -- and this is mostly a parrot of an assumption in your question, but I want to make it clear -- that you have one "master" table that is the base, and which has records from the other table added to it. So, table A keeps all of its existing records, and only gets additions from B. B feeds additions into A, and once done, B is deleted.
I'm hesitant to write you sample code because your actual job will be so much more complex than this, but I will anyway to try and point you in the right direction. Consider something like...
import datetime
from south.db import db
from south.v2 import DataMigration
from django.db import models
class Migration(DataMigration):
def forwards(self, orm):
for b in orm.B.objects.all():
# sanity check: does this item get copied into A at all?
if orm.A.objects.filter(username=b.username):
continue
# make an A record with the properties of my B record
a = orm.A(
first_name=b.first_name,
last_name=b.last_name,
email_address=b.email_address,
[...]
)
# save the new A record, and delete the B record
a.save()
b.delete()
def backwards(self, orm):
# backwards method, if you write one
This would end up migrating all of the Bs not in A to A, and leave you a table of Bs that are expected duplicates, which you could then check by some other means before deleting.
Like I said, this sample isn't meant to be complete. If you decide to go this route, spend time in the South documentation, and particularly make sure you look at data migrations.
That's my 2¢. Hope it helps.

Join with Pythons SQLite module is slower than doing it manually

I am using pythons built-in sqlite3 module to access a database. My query executes a join between a table of 150000 entries and a table of 40000 entries, the result contains about 150000 entries again. If I execute the query in the SQLite Manager it takes a few seconds, but if I execute the same query from Python, it has not finished after a minute. Here is the code I use:
cursor = self._connection.cursor()
annotationList = cursor.execute("SELECT PrimaryId, GOId " +
"FROM Proteins, Annotations " +
"WHERE Proteins.Id = Annotations.ProteinId")
annotations = defaultdict(list)
for protein, goterm in annotationList:
annotations[protein].append(goterm)
I did the fetchall just to measure the execution time. Does anyone have an explanation for the huge difference in performance? I am using Python 2.6.1 on Mac OS X 10.6.4.
I implemented the join manually, and this works much faster. The code looks like this:
cursor = self._connection.cursor()
proteinList = cursor.execute("SELECT Id, PrimaryId FROM Proteins ").fetchall()
annotationList = cursor.execute("SELECT ProteinId, GOId FROM Annotations").fetchall()
proteins = dict(proteinList)
annotations = defaultdict(list)
for protein, goterm in annotationList:
annotations[proteins[protein]].append(goterm)
So when I fetch the tables myself and then do the join in Python, it takes about 2 seconds. The code above takes forever. Am I missing something here?
I tried the same with apsw, and it works just fine (the code does not need to be changed at all), the performance it great. I'm still wondering why this is so slow with the sqlite3-module.

There is a discussion about it here: http://www.mail-archive.com/python-list#python.org/msg253067.html
It seems that there is a performance bottleneck in the sqlite3 module. There is an advice how to make your queries faster:
make sure that you do have indices on the join columns
use pysqlite

You haven't posted the schema of the tables in question, but I think there might be a problem with indexes, specifically not having an index on Proteins.Id or Annotations.ProteinId (or both).
Create the SQLite indexes like this
CREATE INDEX IF NOT EXISTS index_Proteins_Id ON Proteins (Id)
CREATE INDEX IF NOT EXISTS index_Annotations_ProteinId ON Annotations (ProteinId)

I wanted to update this because I am noticing the same issue and we are now 2022...
In my own application I am using python3 and sqlite3 to do some data wrangling on large databases (>100000 rows * >200 columns). In particular, I have noticed that my 3 table inner join clocks in around ~12 minutes of run time in python, whereas running the same join query in sqlite3 from the CLI runs in ~100 seconds. All the join predicates are properly indexed and the EXPLAIN QUERY PLAN indicates that the added time is most likely because I am using SELECT *, which is a necessary evil in my particular context.
The performance discrepancy caused me to pull my hair out all night until I realized there is a quick fix from here: Running a Sqlite3 Script from Command Line. This is definitely a workaround at best, but I have research due so this is my fix.
Write out the query to an .sql file (I am using f-strings to pass variables in so I used an example with {foo} here)
fi = open("filename.sql", "w")
fi.write(f"CREATE TABLE {Foo} AS SELECT * FROM Table1 INNER JOIN Table2 ON Table2.KeyColumn = Table1.KeyColumn INNER JOIN Table3 ON Table3.KeyColumn = Table1.KeyColumn;")
fi.close()
Run os.system from inside python and send the .sql file to sqlite3
os.system(f"sqlite3 {database} < filename.sql")
Make sure you close any open connection before running this so you don't end up locked out and you'll have to re-instantiate any connection objects afterward if you're going back to working in sqlite within python.
Hope this helps and if anyone has figured the source of this out, please link to it!

What's the most efficient way to insert thousands of records into a table (MySQL, Python, Django)

I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.
Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.
[("hello", 3, 4), ("cat", 5, 3), ...]
I need to insert all these tuples to the table (assume I verified neither of these strings appear in the database). For clarification, I'm using InnoDB, and I have an auto-incremental primary key for this table, the string is not the PK.
My code currently iterates through this list, for each tuple creates a Python module object with the appropriate values, and calls ".save()", something like so:
#transaction.commit_on_success
def save_data_elements(input_list):
for (s, i1, i2) in input_list:
entry = DataElement(string=s, number1=i1, number2=i2)
entry.save()
This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.
For example, I could generate SQL codes each containing an INSERT command for 100 tuples ("hard-coded" into the SQL) and execute it, but I don't know if it will improve anything.
Do you have any suggestion to optimize such a process?
Thanks

You can write the rows to a file in the format
"field1", "field2", .. and then use LOAD DATA to load them
data = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
f= open('data.txt', 'w')
f.write(data)
f.close()
Then execute this:
LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table;
Reference

For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you could convert the data into the format that expects, it'll probably be the fastest way to get it into the table.

If you don't LOAD DATA INFILE as some of the other suggestions mention, two things you can do to speed up your inserts are :
Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
Do all of your inserts in a single transaction - this would require using a DB engine that supports transactions (like InnoDB)

If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple value clauses is much much faster than lots of individual INSERT statements.

Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.

what format do you receive? if it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html

This is unrelated to the actual load of data into the DB, but...
If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.
Just something else to consider.

I donot know the exact details, but u can use json style data representation and use it as fixtures or something. I saw something similar on Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop. and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this one helps.
Hope you can work it out. I just started learning django, so I can just point you to resources.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.