How to compare data in tables across SQL Server and Postgres?


I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.
I want to compare the data consistency between SQL Server and Postgres databases after the data migration.
The only approach I can think of is to use pandas in Python: load each table from SQL Server and from Postgres into data frames and compare them.
But the data is around 6 GB, so loading a table into a data frame takes a long time, and the databases are hosted on a server that is not local to the machine running the Python script. Is there an efficient way to compare the data across SQL Server and Postgres?

Yes, you can order the data by primary key and write it to a JSON or XML file.
Then you can run diff over the two files.
You can also do this in chunks by primary key, so that you don't have to work with one huge file.
Log any chunk that doesn't compare as equal.
If it doesn't matter what the difference is, you could also just run MD5/SHA1 on the two file chunks: if the hashes match, there is no difference; if they don't, there is.
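A minimal Python sketch of the chunked-hash idea, assuming integer primary keys in the first column and rows that have already been normalised to the same Python types on both sides. The helper names (`bucket_digests`, `differing_buckets`) and the sample rows are made up for illustration; in practice the rows would come from cursors ordered by primary key on each database.

```python
import hashlib
import json
from itertools import groupby


def bucket_digests(rows, span=1000):
    """Map each primary-key bucket (key // span) to a SHA-1 of its rows.

    `rows` must be ordered by primary key, with the key in position 0.
    """
    digests = {}
    for bucket, group in groupby(rows, key=lambda row: row[0] // span):
        # default=str turns dates, Decimals etc. into text on both sides
        payload = json.dumps(list(group), default=str, separators=(",", ":"))
        digests[bucket] = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    return digests


def differing_buckets(rows_a, rows_b, span=1000):
    """Return the bucket numbers whose digests differ or exist on one side only."""
    a = bucket_digests(rows_a, span)
    b = bucket_digests(rows_b, span)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Stand-in data; the second row differs between the two sources.
sqlserver_rows = [(1, "alice", 10.5), (2, "bob", 7.0), (1001, "carol", 3.25)]
postgres_rows = [(1, "alice", 10.5), (2, "bob", 7.5), (1001, "carol", 3.25)]
mismatched = differing_buckets(sqlserver_rows, postgres_rows)
```

Bucketing by key range rather than by row count keeps a single missing row from shifting every later chunk, so only the affected range needs a row-by-row look.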
Speaking from experience with nhibernate, what you need to watch out for is:
bit fields
text, ntext, varchar(MAX), nvarchar(MAX) fields (they map to varchar with no length, by the way - encoding UTF8)
varbinary, varbinary(MAX), image (bytea[] vs. LOB)
xml
that the serial generator (sequence) behind every primary-key id is reset after you have inserted all the data in pgsql.
Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:
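using System.Collections.Generic;
using System.Data;
using System.Linq;
static void Main(string[] args)
{
    // LoadSqlServerTable/LoadPostgresTable are placeholders for however you fill the two DataTables.
    DataTable dt1 = LoadSqlServerTable();
    DataTable dt2 = LoadPostgresTable();
    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();
    // Without a comparer, Except() compares row references; DataRowComparer.Default compares column values.
    IEnumerable<DataRow> results = idr1.Except(idr2, DataRowComparer.Default);
}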
Then you write all non-matching DataRows into a logfile, with one directory per table (if there are differences).
Don't know what Python uses in place of System.Data.DataRowComparer, though.
Since this would be a one-time task, you could also opt to not do it in Python, and use C# instead (see above code sample).
Also, if you have large tables, you could use a DataReader with sequential access to do the comparison. But if the DataTable approach above is good enough, it reduces the required work considerably.
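In Python, a rough equivalent of the sequential-access comparison is to walk two key-ordered result streams in step, so that neither table has to fit in memory. This is only a sketch: `diff_sorted_rows` is a hypothetical helper, it assumes the primary key is the first column and sorts the same way on both servers (string keys may not, because of collation), and the lists below stand in for cursors running `SELECT ... ORDER BY <pk>`.

```python
_END = object()  # sentinel marking an exhausted stream


def diff_sorted_rows(rows_a, rows_b):
    """Yield (kind, row_a, row_b) for every difference between two key-ordered streams."""
    it_a, it_b = iter(rows_a), iter(rows_b)
    a, b = next(it_a, _END), next(it_b, _END)
    while a is not _END or b is not _END:
        if b is _END or (a is not _END and a[0] < b[0]):
            yield ("only_in_a", a, None)  # key missing on side B
            a = next(it_a, _END)
        elif a is _END or b[0] < a[0]:
            yield ("only_in_b", None, b)  # key missing on side A
            b = next(it_b, _END)
        else:
            if a != b:
                yield ("changed", a, b)  # same key, different values
            a, b = next(it_a, _END), next(it_b, _END)


# Stand-in data instead of real cursors.
source = [(1, "x"), (2, "y"), (4, "z")]
target = [(1, "x"), (2, "Y"), (3, "w"), (4, "z")]
diffs = list(diff_sorted_rows(source, target))
```

Each yielded tuple can be written straight to the per-table log file.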

Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?
https://github.com/tds-fdw/tds_fdw
I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively, etc. Depending on your setup, you would have to check if performance is adequate.
Please report back!

Related

How To Store Query Results (Using Python)

Background:
I have an application written in Python to monitor the status of tools. The tools send their data from specific runs and it all gets stored in an Oracle database as JSON files.
My Problem/Solution:
Instead of connecting to the DB and then querying it repeatedly when I want to compare the current run data to the previous run's data, I want to make a copy of the database query so that I can compare the new run data to the copy that I made instead of to the results of the query.
The reason I want to do this is because constantly querying the server for the previous run's data is slow and puts unwanted load/usage on the server.
For the previous run's data there are multiple files associated with it (because there are multiple tools), and therefore each query has more than one file that would need to be copied. Locally storing copies of the files in the query is what I intended to do, but I was wondering what the best way to go about this is, since I am relatively new to doing something like this.
So any help and suggestions on how to efficiently store the results of a query, which are multiple JSON files, would be greatly appreciated!

As you described, querying the db too many times is not an option. In that case I would do it the following way:
When your program starts, you get the data for all tools as a set of JSON files per tool, right? I am not sure whether you get the data by querying the tools directly or by querying the db, but it does not matter.
You check whether you have old data in the "cache dictionary" for that tool. If yes, do your comparison and then store the "new data" as the "previous data" in the cache, ready for the next run. Do this for all tools. This loops forever :-)
This "cache dictionary" can be implemented in memory or on disk. For your amount of data I think memory is just fine.
With that approach you do not have to query the db for the old data. If there is no old data in the "cache" at program start, so that the comparison cannot be done, you could try to get it from the db (risking long query times, but what can you do :-)
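A small in-memory sketch of that cache dictionary, assuming each tool's run data has already been parsed from JSON into a dict. The function name, the tool id and the field names are invented for the example.

```python
_MISSING = object()  # distinguishes "key absent" from "value is None"


def compare_with_previous(cache, tool_id, new_data):
    """Return the keys that changed since the last run, then remember the new run.

    Returns None when the cache holds no previous run for this tool.
    """
    previous = cache.get(tool_id)
    cache[tool_id] = new_data  # the new run becomes the "previous" one
    if previous is None:
        return None
    keys = previous.keys() | new_data.keys()
    return sorted(
        k for k in keys
        if previous.get(k, _MISSING) != new_data.get(k, _MISSING)
    )


cache = {}
first = compare_with_previous(cache, "tool_a", {"temp": 20, "status": "ok"})
second = compare_with_previous(cache, "tool_a", {"temp": 22, "status": "ok"})
```

On the first call there is nothing to compare against, which is the point where a one-off query to the db could fill the cache.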

Python, computationally efficient data storage methods

I am retrieving structured numerical data (floats with 2-3 decimal places) via HTTP requests from a server. The data comes in as sets of numbers which are then converted into an array/list. I then want to store each set of data locally on my computer so that I can operate on it further.
Since there are very many of these data sets to collect, simply writing each incoming data set to a .txt file does not seem very efficient. On the other hand, I am aware that there are various solutions such as MongoDB, Python-to-SQL interfaces, etc., but I'm unsure which one I should use and which would be the most appropriate and efficient for this scenario.
Also the database that is created must be able to interface and be queried from different languages such as MATLAB.

If you just want to store it somewhere so MATLAB can work with it, take your pick from the databases supported by MATLAB and then install the appropriate Python driver for that database.
All databases in Python have a standard API (called the DB-API), so there is a uniform way of dealing with databases.
As you haven't told us how you intend to work with this data later on, it is difficult to provide any further specifics.
the idea is that i wish to essentially download all of the data onto
my machine so that i can operate on it locally later (run analytics
and perform certain mathematical operations on it) instead of having
to call it from the server constantly.
For that purpose you can use any storage mechanism from text files to any of the databases supported by MATLAB - as all databases supported by MATLAB are supported by Python.
You can choose to store the data as "text" and then do the numeric calculations on the application side (ie, MATLAB side). Or you can choose to store the data as numbers/float/decimal (depending on the precision you need) and this will allow you to do some calculations on the database side.
If you just want to store it as text and do calculations on the application side, then the easiest option is MongoDB, as it is schema-less. You would be storing the data as JSON, which may be the format in which it is being retrieved from the web.
If you wish to take advantage of some math functions or other capabilities (for example, geospatial calculations), then a better choice is a traditional database that you are familiar with. You'll have to create a schema and define the datatypes for each of your incoming data objects, and then store them appropriately in order to take advantage of the database's query features.
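To illustrate the "store as numbers, calculate on the database side" option, here is a sketch using the DB-API with `sqlite3`. SQLite is used only because it ships with Python; whether it suits the MATLAB side is something to check. The table name `readings`, its columns and the sample values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
conn.execute("CREATE TABLE readings (set_id INTEGER, value REAL)")

# Pretend these sets arrived from the HTTP requests.
incoming = {1: [1.25, 2.5, 3.75], 2: [10.0, 20.0]}
conn.executemany(
    "INSERT INTO readings (set_id, value) VALUES (?, ?)",
    [(set_id, v) for set_id, values in incoming.items() for v in values],
)
conn.commit()

# Because the column is numeric, the database can do the arithmetic itself.
averages = dict(
    conn.execute("SELECT set_id, AVG(value) FROM readings GROUP BY set_id")
)
```

Had the values been stored as text, the averaging would have to happen in the application after fetching every row.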

I can recommend using a lightweight ORM like peewee, which can use a number of SQL databases as the storage backend. Then it becomes a matter of choosing the database you want. The simplest database to use is SQLite, but should you decide that's not fast enough, switching to another database like PostgreSQL or MySQL is trivial.
The advantage of the ORM is that you can use Python syntax to interact with the SQL database and don't have to learn any SQL.

Have you considered HDF5? It's very efficient for numerical data, and is supported by both Python and MATLAB.

How to export a large table (100M+ rows) to a text file?

I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?

Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() on the C level, which corresponds to having a Cursor or DictCursor on the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can consume a lot of memory.
In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python). The limitation is that you have to fetch the whole result set and cannot do anything else with the database connection in the meantime. But it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.

I don't know exactly what query you have used because you have not given it here, but I suppose you're specifying a limit and an offset. Such queries are quick at the beginning of the data but get very slow as the offset grows.
If you have a unique column such as ID, you can fetch only the first N rows each time, but modify the WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
However, it should generally be faster to simply do
SELECT * FROM table
and open a cursor for that query with a reasonably big fetch size.
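The `WHERE ID > (last_id)` approach can be sketched like this. `sqlite3` stands in for MySQL so the example runs anywhere; the table `big_table`, its columns and the helper name are hypothetical, and positive integer ids are assumed.

```python
import sqlite3


def iter_by_key(conn, batch_size=10000):
    """Yield all rows in id order, fetching one index-backed batch at a time."""
    last_id = 0  # assumes ids are positive integers
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM big_table "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]  # resume after the last key seen


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO big_table VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 26)],
)
exported = list(iter_by_key(conn, batch_size=10))
```

Unlike LIMIT with a growing OFFSET, every batch starts from an index lookup, so the cost per batch stays roughly constant as the export progresses.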

SQLite table in RAM instead of FLASH

We're currently working on a python project that basically reads and writes M2M data into/from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit since it's being written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought of converting the critical table into a virtual one and then linking its contents to a real file (XML or JSON) stored in RAM (/tmp, for example, in Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex and I think that this is not very doable using Python. Maybe we need to develop our own sqlite extension, I don't know...
Any idea about how to "place" our conflicting table in RAM whilst the rest of the database stays in FLASH? Is there any better/simpler approach to taking the virtual table route in Python?

A very simple, SQL-only solution to create an in-memory table is using SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ':memory:' AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be:
Open your main SQLite database file
Attach an in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table; COMMIT;)
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
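The steps above could look roughly like this from Python's `sqlite3` module. It is a sketch under assumptions: the table name `current_values`, its columns and both helper functions are invented, and the copy made with `CREATE TABLE ... AS SELECT` keeps the data but not the original constraints or indexes.

```python
import os
import sqlite3
import tempfile


def load_into_memory(conn, table):
    """Attach an in-memory database and duplicate `table` into it."""
    conn.execute("ATTACH DATABASE ':memory:' AS memdb")
    conn.execute(f"CREATE TABLE memdb.{table} AS SELECT * FROM main.{table}")


def flush_to_disk(conn, table):
    """Write the in-memory copy back to the on-disk table in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(f"DELETE FROM main.{table}")
        conn.execute(f"INSERT INTO main.{table} SELECT * FROM memdb.{table}")


# Demo with a throwaway database file standing in for the one on flash.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE current_values (sensor TEXT, value REAL)")
conn.execute("INSERT INTO current_values VALUES ('t1', 20.5)")
conn.commit()

load_into_memory(conn, "current_values")
# Frequent writes now touch RAM only.
conn.execute("UPDATE memdb.current_values SET value = 21.0 WHERE sensor = 't1'")
conn.commit()

flush_to_disk(conn, "current_values")
flushed = conn.execute("SELECT sensor, value FROM main.current_values").fetchall()
conn.close()
```

How often to call the flush step is the trade-off: anything written since the last flush is lost if the process dies.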

Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.

Django with huge mysql database

What would be the best way to import multi-million-record CSV files into Django?
Currently, using the Python csv module, it takes 2-4 days to process a 1-million-record file. It checks whether the record already exists, and does a few other things.
Can this process be made to run in a few hours?
Can memcache be used somehow?
Update: There are Django ManyToManyField fields that get processed as well. How would these be handled with a direct load?

I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python we made a radical change of strategy and did the import (only) with Java and JDBC (plus some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).

I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.

Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
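A sketch of the batching advice using the plain DB-API, bypassing the ORM. `sqlite3` is only a stand-in so the example runs without a MySQL server; the `records` table, its columns and `insert_in_batches` are hypothetical, and the batch size of 1000 is just the figure mentioned above, not a tuned value.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert `rows` using one transaction per batch; return the number of batches."""
    it = iter(rows)
    batches = 0
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return batches
        with conn:  # one commit per batch keeps each transaction small
            conn.executemany(
                "INSERT INTO records (id, name) VALUES (?, ?)", batch
            )
        batches += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
# A generator stands in for rows read from the CSV file.
rows = ((i, f"name {i}") for i in range(2500))
batches = insert_in_batches(conn, rows, batch_size=1000)
total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Because the rows are consumed lazily, the whole CSV never has to sit in memory at once.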
And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.

As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for MySQL you can use the LOAD DATA INFILE statement. I'm sure there's something similar for Postgres as well.
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.

Like Craig said, you'd better fill the db directly first.
This implies creating Django models that just fit the CSV cells (you can then create better models and scripts to move the data).
Then, feeding the db: a tool of choice for this is Navicat; you can grab a functional 30-day demo on their site. It allows you to import CSV into MySQL, save the import profile as XML...
Then I would launch the data control scripts from within Django and, when you're done, migrate your model with South to get what you want or, like I said earlier, create another set of models within your project and use scripts to convert/copy the data.
