Looks like there are several options:
mysqldump + rsync - can this be done for only specific data from an existing table, and not a whole table?
An insert to a federated table - this seems pretty untested and unknown at this point...
A Python script (pull into memory from A, then insert into B) - this would probably be pretty slow...
What kind of data warrants what method?
You also have another option -- mysql replication!
Can you not extract that single table into its own database and replicate just that to the second server (or as many servers as you like)?
Related
I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.
I want to compare the data consistency between SQL Server and Postgres databases after the data migration.
All I can think of now is using Python Pandas and loading the tables into data frames from SQL Server and also Postgres and compare the data frames.
But the data is around 6 GB which takes much time for loading table into the data frame and also hosted on a server which is not local to where I'm running the Python script. Is there any way to efficiently compare the data consistency across SQL Server and Postgres?
Yes, you can order the data by primary key, and then write the data to a json or xml file.
Then you can run diff over the two files.
You can also run this chunked by primary-key, that way you don't have to work with a huge file.
Log any diff that doesn't show as equal.
If it doesn't matter what the difference is, you could also just run MD5/SHA1 on the two file chunks, and if the hash machtches, there is no difference, if it doesn't, there is.
Speaking from experience with nhibernate, what you need to watch out for is:
bit fields
text, ntext, varchar(MAX), nvarchar(MAX) fields (they map to varchar with no length, by the way - encoding UTF8)
varbinary, varbinary(MAX), image (bytea[] vs. LOB)
xml
that all primary-key's id serial generator is reset after you inserted all data in pgsql.
Another thing to watch out is which time zone CURRENT_TIMESTAMP uses.
Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:
static void Main(string[] args)
{
DataTable dt1 = dt1();
DataTable dt2= dt2();
IEnumerable<DataRow> idr1 = dt1.Select();
IEnumerable<DataRow> idr2 = dt2.Select();
// MyDataRowComparer MyComparer = new MyDataRowComparer();
// IEnumerable<DataRow> Results = idr1.Except(idr2, MyComparer);
IEnumerable<DataRow> results = idr1.Except(idr2);
}
Then you write all non-matching DataRows into a logfile, for each table one directory (if there are differences).
Don't know what Python uses in place of System.Data.DataRowComparer, though.
Since this would be a one-time task, you could also opt to not do it in Python, and use C# instead (see above code sample).
Also, if you had large tables, you could use DataReader with sequential access to do the comparison. But if the other way cuts it, it reduces the required work considerably.
Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?
https://github.com/tds-fdw/tds_fdw
I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively, etc. Depending on your setup, you would have to check if performance is adequate.
Please report back!
Ok, I can load records into a table using to_sql in pandas. (Unfortunately, I cannot use bcp or bulk insert, because my (SQL Server) database server is remote). How about a table with (Type 2) slowly changing dimensions? In SSIS, I would use SCD Wizard - what is the Python alternative? I would like to avoid coding up SCD logic from scratch - hopefully, there is a package (perhaps one of these) which does that? Simply loading a table is fine and dandy, but in the world of dimensional datamarts one surely has to support SCD?
I have a database with a large table containing more that a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it int a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query machanism.
Normally, it is advisable to use mysql_store_result() on C level, which corresponds to having a Cursor or DictCursor on Python level. This ensures that the database is free again as soon as possible and the client can do with thedata whatever he wants.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory consuming.
In this case, it may be better to use mysql_use_result() (C) resp. SSCursor / SSDictCursor (Python). This limits you to have to take the whole result set and doing nothing else with the database connection in the meanwhile. But it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
I don't know what query exactly you have used because you have not given it here, but I suppose you're specifying the limit and offset. This are quite quick queries at begin of data, but are going very slow.
If you have unique column such as ID, you can fetch only the first N row, but modify the query clause:
WHERE ID > (last_id)
This would use index and would be acceptably fast.
However, it should be generally faster to do simply
SELECT * FROM table
and open cursor for such query, with reasonable big fetch size.
We're currently working on a python project that basically reads and writes M2M data into/from a SQLite database. This database consists of multiple tables, one of them storing current values coming from the cloud. This last table is worrying me a bit since it's being written very often and the application runs on a flash drive.
I've read that virtual tables could be the solution. I've thought in converting the critical table into a virtual one and then link its contents to a real file (XML or JSON) stored in RAM (/tmp for example in Debian). I've been reading this article:
http://drdobbs.com/database/202802959?pgno=1
that explains more or less how to do what I want. It's quite complex and I think that this is not very doable using Python. Maybe we need to develop our own sqlite extension, I don't know...
Any idea about how to "place" our conflicting table in RAM whilst the rest of the database stays in FLASH? Any better/simpler approach about how take the virtual table way under Python?
A very simple, SQL-only solution to create a in-memory table is using SQLite's ATTACH command with the special ":memory:" pseudo-filename:
ATTACH DATABASE ":memory:" AS memdb;
CREATE TABLE memdb.my_table (...);
Since the whole database "memdb" is kept in RAM, the data will be lost once you close the database connection, so you will have to take care of persistence by yourself.
One way to do it could be:
Open your main SQLite database file
Attach a in-memory secondary database
Duplicate your performance-critical table in the in-memory database
Run all queries on the duplicate table
Once done, write the in-memory table back to the original table (BEGIN; DELETE FROM real_table; INSERT INTO real_table SELECT * FROM memory_table;)
But the best advice I can give you: Make sure that you really have a performance problem, the simple solution could just as well be fast enough!
Use an in-memory data structure server. Redis is a sexy option, and you can easily implement a table using lists. Also, it comes with a decent python driver.
I facing an atypical conversion problem. About a decade ago I coded up a large site in ASP. Over the years this turned into ASP.NET but kept the same database.
I've just re-done the site in Django and I've copied all the core data but before I cancel my account with the host, I need to make sure I've got a long-term backup of the data so if it turns out I'm missing something, I can copy it from a local copy.
To complicate matters, I no longer have Windows. I moved to Ubuntu on all my machines some time back. I could ask the host to send me a backup but having no access to a machine with MSSQL, I wouldn't be able to use that if I needed to.
So I'm looking for something that does:
db = {}
for table in database:
db[table.name] = [row for row in table]
And then I could serialize db off somewhere for later consumption... But how do I do the table iteration? Is there an easier way to do all of this? Can MSSQL do a cross-platform SQLDump (inc data)?
For previous MSSQL I've used pymssql but I don't know how to iterate the tables and copy rows (ideally with column headers so I can tell what the data is). I'm not looking for much code but I need a poke in the right direction.
Have a look at the sysobjects and syscolumns tables. Also try:
SELECT * FROM sysobjects WHERE name LIKE 'sys%'
to find any other metatables of interest. See here for more info on these tables and the newer SQL2005 counterparts.
I've liked the ADOdb python module when I've needed to connect to sql server from python. Here is a link to a simple tutorial/example: http://phplens.com/lens/adodb/adodb-py-docs.htm#tutorial
I know you said JSON, but it's very simple to generate a SQL script to do an entire dump in XML:
SELECT REPLACE(REPLACE('SELECT * FROM {TABLE_SCHEMA}.{TABLE_NAME} FOR XML RAW', '{TABLE_SCHEMA}',
QUOTENAME(TABLE_SCHEMA)), '{TABLE_NAME}', QUOTENAME(TABLE_NAME))
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA
,TABLE_NAME
As an aside to your coding approach - I'd say :
set up a virtual machine with an eval on windows
put sql server eval on it
restore your data
check it manually or automatically using the excellent db scripting tools from red-gate to script the data and the schema
if fine then you have (a) a good backup and (b) a scripted output.