I have a CSV file and want to generate dumps of the data for sqlite, mysql, postgres, oracle, and mssql.
Is there a common API (ideally Python based) to do this?
I could use an ORM to insert the data into each database and then export dumps, however that would require installing each database. It also seems a waste of resources - these CSV files are BIG.
I am wary of trying to craft the SQL myself because of the variations with each database. Ideally someone has already done this hard work, but I haven't found it yet.
SQLAlchemy is a database library that (as well as ORM functionality) supports SQL generation in the dialects of all the different databases you mention (and more).
In normal use, you could create a SQL expression / instruction (using a schema.Table object), create a database engine, and then bind the instruction to the engine, to generate the SQL.
However, the engine is not strictly necessary; the dialects each have a compiler that can generate the SQL without a connection; the only caveat being that you need to stop it from generating bind parameters as it does by default:
from sqlalchemy.sql import expression, compiler
from sqlalchemy import schema, types
import csv

# example for mssql
from sqlalchemy.dialects.mssql import base
dialect = base.dialect()
compiler_cls = dialect.statement_compiler

class NonBindingSQLCompiler(compiler_cls):
    def _create_crud_bind_param(self, col, value, required=False):
        # Render a literal value instead of a bind parameter
        return self.render_literal_value(value, col.type)

recipe_table = schema.Table(
    "recipe", schema.MetaData(),
    schema.Column("name", types.String(50), primary_key=True),
    schema.Column("culture", types.String(50)),
)

for row in [{"name": "fudge", "culture": "america"}]:  # csv.DictReader(open("x.csv", "r")):
    insert = expression.insert(recipe_table, row, inline=True)
    c = NonBindingSQLCompiler(dialect, insert)
    c.compile()
    sql = str(c)
    print(sql)
The above example actually works; it assumes you know the target database table schema; it should be easily adaptable to import from a CSV and generate for multiple target database dialects.
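If you are on a newer SQLAlchemy release, a rough sketch of that adaptation might look like the following; it uses the documented literal_binds compile option instead of subclassing the compiler. The CSV file name, table schema and dialect list are assumptions carried over from the example above, and some DBAPI-specific dialects may complain if their driver is missing, depending on your version:
import csv

from sqlalchemy import schema, types
from sqlalchemy.dialects import mssql, mysql, oracle, postgresql, sqlite
from sqlalchemy.sql import expression

recipe_table = schema.Table(
    "recipe", schema.MetaData(),
    schema.Column("name", types.String(50), primary_key=True),
    schema.Column("culture", types.String(50)),
)

dialects = {
    "sqlite": sqlite.dialect(),
    "mysql": mysql.dialect(),
    "postgresql": postgresql.dialect(),
    "oracle": oracle.dialect(),
    "mssql": mssql.dialect(),
}

with open("x.csv", newline="") as f:   # hypothetical CSV whose headers match the columns
    rows = list(csv.DictReader(f))

for name, dialect in dialects.items():
    for row in rows:
        stmt = expression.insert(recipe_table).values(**row)
        sql = stmt.compile(dialect=dialect,
                           compile_kwargs={"literal_binds": True})
        print(f"-- {name}\n{sql};")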
I am no database wizard, but AFAIK there is no common Python API that would do out of the box what you ask for. There is PEP 249, which defines the API that modules accessing databases should use and which, AFAIK, is implemented at least by the MySQL and PostgreSQL Python modules; perhaps that could be a starting point.
The road I would attempt to follow myself - however - would be another one:
Import the CSV into MySQL (this is just because MySQL is the one I know best and there is tons of material on the net, such as this very easy recipe, but you could do the same procedure starting from another database).
Generate the MySQL dump.
Process the MySQL dump file in order to modify it to meet SQLite (and others) syntax.
The scripts for processing the dump file could be very compact, although they might be somewhat tricky if you use regexes for parsing the lines. Here's an example MySQL → SQLite script that I simply pasted from this page:
#!/bin/sh
mysqldump --compact --compatible=ansi --default-character-set=binary mydbname |
grep -v ' KEY "' |
grep -v ' UNIQUE KEY "' |
perl -e 'local $/;$_=<>;s/,\n\)/\n\)/gs;print "begin;\n";print;print "commit;\n"' |
perl -pe '
if (/^(INSERT.+?)\(/) {
$a=$1;
s/\\'\''/'\'\''/g;
s/\\n/\n/g;
s/\),\(/\);\n$a\(/g;
}
' |
sqlite3 output.db
You could write your script in Python (in which case you should have a look at re.compile for performance).
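For example, a minimal Python sketch of part of that shell pipeline (file names are assumptions, and it only reproduces the KEY-stripping and transaction-wrapping steps, not the INSERT rewriting):
import re

# Matches the KEY / UNIQUE KEY index lines that SQLite won't accept
KEY_LINE = re.compile(r'^\s*(UNIQUE )?KEY ".*\n', re.MULTILINE)
# Matches the trailing comma those deletions leave before a closing paren
TRAILING_COMMA = re.compile(r',\n\)')

def mysql_dump_to_sqlite_sql(in_path="mysql_dump.sql", out_path="sqlite_dump.sql"):
    with open(in_path) as src:
        text = src.read()
    text = KEY_LINE.sub("", text)
    text = TRAILING_COMMA.sub("\n)", text)
    with open(out_path, "w") as dst:
        dst.write("BEGIN;\n" + text + "COMMIT;\n")

mysql_dump_to_sqlite_sql()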
The rationale behind my choice would be:
I get the heavy lifting [importing, and therefore data consistency checks, plus generating the starting SQL file] done for me by MySQL.
I only have to have one database installed.
I have full control on what is happening and the possibility to fine-tune the process.
I can structure my script in such a way that it will be very easy to extend it for other databases (basically I would structure it like a parser that recognises individual fields + a set of grammars - one for each database - that I can select via command-line option)
There is much more documentation on the differences between SQL flavours than on single DB import/export libraries.
EDIT: A template-based approach
If for any reason you don't feel confident enough to write the SQL yourself, you could use a sort of template-based script. Here's how I would do it:
Import the table and generate a dump in all 4 DBs you are planning to use.
For each DB save the initial part of the dump (with the schema declaration and all the rest) and a single insert instruction.
Write a Python script that, for each DB export, will output the "header" of the dump plus the same "saved line", into which you will programmatically substitute the values from each line of your CSV file (see the sketch below).
The obvious drawback of this approach is that your "template" will only work for one table. The strongest point of it is that writing such script would be extremely easy and quick.
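A rough sketch of such a script, assuming a simple {placeholder}-style template and hypothetical file names:
import csv

def render_dump(header_path, insert_template_path, csv_path, out_path):
    """Write the saved header plus one filled-in INSERT per CSV row."""
    with open(insert_template_path) as f:
        # e.g. "INSERT INTO recipe (name, culture) VALUES ({name}, {culture});"
        template = f.read().strip()
    with open(header_path) as f:
        header = f.read()
    with open(csv_path, newline="") as rows, open(out_path, "w") as out:
        out.write(header)
        for row in csv.DictReader(rows):
            # naive quoting; real escaping rules differ per database
            quoted = {k: "'" + v.replace("'", "''") + "'" for k, v in row.items()}
            out.write(template.format(**quoted) + "\n")

render_dump("mysql_header.sql", "mysql_insert.tmpl", "data.csv", "mysql_dump.sql")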
HTH at least a bit!
You could do this - Create SQL tables from CSV files
or Generate Insert Statements from CSV file
or try this Generate .sql from .csv python
Of course you might need to tweak the scripts mentioned to suit your needs.
Related
I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.
I want to compare the data consistency between SQL Server and Postgres databases after the data migration.
All I can think of right now is using Python Pandas: loading the tables into data frames from both SQL Server and Postgres and comparing the data frames.
But the data is around 6 GB, which takes a long time to load into a data frame, and the databases are hosted on a server that is not local to where I'm running the Python script. Is there any way to efficiently compare data consistency across SQL Server and Postgres?
Yes, you can order the data by primary key, and then write the data to a json or xml file.
Then you can run diff over the two files.
You can also run this chunked by primary-key, that way you don't have to work with a huge file.
Log any diff that doesn't show as equal.
If it doesn't matter what the difference is, you could also just run MD5/SHA1 on the two file chunks: if the hashes match, there is no difference; if they don't, there is.
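A hedged sketch of that chunked comparison in Python (connection strings, table and key names, key range and chunk size are all assumptions, and both sides must render values identically for the hashes to be comparable):
import hashlib
import json

import pandas as pd
from sqlalchemy import create_engine

mssql = create_engine("mssql+pyodbc://...")      # placeholder connection strings
pg = create_engine("postgresql://...")

def chunk_hash(engine, table, key, lo, hi):
    # table/key/bounds are internal, trusted values in this sketch
    query = (f"SELECT * FROM {table} WHERE {key} >= {lo} AND {key} < {hi} "
             f"ORDER BY {key}")
    df = pd.read_sql(query, engine)
    # serialize deterministically so both sides hash the same way
    payload = json.dumps(df.astype(str).values.tolist()).encode()
    return hashlib.md5(payload).hexdigest()

CHUNK = 50_000
for lo in range(0, 1_000_000, CHUNK):            # assumed key range and chunk size
    hi = lo + CHUNK
    if chunk_hash(mssql, "orders", "id", lo, hi) != chunk_hash(pg, "orders", "id", lo, hi):
        print(f"difference in id range {lo}..{hi}; dump and diff this range")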
Speaking from experience with nhibernate, what you need to watch out for is:
bit fields
text, ntext, varchar(MAX), nvarchar(MAX) fields (they map to varchar with no length, by the way - encoding UTF8)
varbinary, varbinary(MAX), image (bytea[] vs. LOB)
xml
that every primary key's id (serial) generator is reset after you have inserted all the data into pgsql.
Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:
static void Main(string[] args)
{
    // Load the two tables to compare (GetDataTable1/GetDataTable2 are
    // hypothetical helpers that fill a DataTable from each database)
    DataTable dt1 = GetDataTable1();
    DataTable dt2 = GetDataTable2();

    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();

    // MyDataRowComparer MyComparer = new MyDataRowComparer();
    // IEnumerable<DataRow> Results = idr1.Except(idr2, MyComparer);
    IEnumerable<DataRow> results = idr1.Except(idr2);
}
Then you write all non-matching DataRows into a logfile, one directory per table (if there are differences).
Don't know what Python uses in place of System.Data.DataRowComparer, though.
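In Python, a possible stand-in is a pandas outer merge with an indicator column; a small sketch, with placeholder connection strings and a hypothetical orders table:
import pandas as pd
from sqlalchemy import create_engine

mssql = create_engine("mssql+pyodbc://...")   # placeholder connection strings
pg = create_engine("postgresql://...")

df1 = pd.read_sql("SELECT * FROM orders ORDER BY id", mssql).astype(str)
df2 = pd.read_sql("SELECT * FROM orders ORDER BY id", pg).astype(str)

# outer-merge on all columns; the indicator column shows which side a row came from
diff = df1.merge(df2, how="outer", indicator=True)
mismatches = diff[diff["_merge"] != "both"]
mismatches.to_csv("orders_diff.csv", index=False)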
Since this would be a one-time task, you could also opt to not do it in Python, and use C# instead (see above code sample).
Also, if you had large tables, you could use DataReader with sequential access to do the comparison. But if the other way cuts it, it reduces the required work considerably.
Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?
https://github.com/tds-fdw/tds_fdw
I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively, etc. Depending on your setup, you would have to check if performance is adequate.
Please report back!
I am building a database interface using Python's peewee module. I am trying to figure out how to insert data into an existing database where I do not know the schema.
My idea is to use playhouse.reflection.Introspector to find out the database schema, then use that information to create class objects which can then be inserted into the existing database.
So far I've gotten to:
introspector = Introspector.from_database(database)
models = introspector.generate_models()
I don't know where to go from there.
1) Can I create database objects in this manner? What is the next step?
2) Is there an easier way to do this?
peewee includes an introspection tool called pwiz that can (basically) introspect a database and produce model definitions. It is run as a command line script and dumps the model definitions to stdout, so invocation is like any other unix tool. Here is an example from the docs:
python -m pwiz -e postgresql my_postgres_db > mymodels.py
From there edit mymodels.py to get what you need.
You could do this on the fly, but it would require a few steps and is hackish (not to mention pointless if you really don't know anything about the schema); see the sketch after this list:
Run pwiz as an os command
Read it to pick out the model names
Import whatever you find
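A sketch of those steps, purely for illustration (it assumes mymodels.py lands on the import path and reuses the database name from the example above):
import importlib
import subprocess

import peewee

# 1. Run pwiz as an OS command and capture the generated model definitions
source = subprocess.check_output(
    ["python", "-m", "pwiz", "-e", "postgresql", "my_postgres_db"]
)
with open("mymodels.py", "wb") as f:
    f.write(source)

# 2. Import whatever it produced and pick out the peewee Model subclasses
mymodels = importlib.import_module("mymodels")
models = {name: obj for name, obj in vars(mymodels).items()
          if isinstance(obj, type) and issubclass(obj, peewee.Model)}
print(sorted(models))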
BUT
If you really don't know the schema to start with then you have no idea what the semantics of the database are anyway, which means whatever you find is literally meaningless. Unless you at least know some schema/table/column names you are hunting for (in which case you do know something about the schema) there isn't really much you can do with regard to inserting data (not in a sane way), though you could certainly dump data from the db. But if you just wanted a database dump then pg_dump would have been easier.
I suspect this is actually an X-Y problem. What problem is it you are trying to solve by using this technique? What effect is it supposed to achieve within the context of your system?
If you want to create a GUI, check out the sqlite_web project. It uses Peewee to create a web-based SQLite database manager.
My company has decided to implement a datamart using [Greenplum] and I have the task of figuring out how to go on about it. A ballpark figure of the amount of data to be transferred from the existing [DB2] DB to the Greenplum DB is about 2 TB.
I would like to know :
1) Is the Greenplum DB the same as vanilla [PostgreSQL]? (I've worked on Postgres AS 8.3.)
2) Are there any (free) tools available for this task (extract and import)
3) I have some knowledge of Python. Is it feasible, or even easy, to do this in a reasonable amount of time?
I have no idea how to do this. Any advice, tips and suggestions will be hugely welcome.
1) Greenplum is not vanilla postgres, but it is similar. It has some new syntax, but in general, is highly consistent.
2) Greenplum itself provides something called "gpfdist" which lets you listen on a port that you specify in order to bring in a file (but the file has to be split up). You want readable external tables. They are quite fast. Syntax looks like this:
CREATE READABLE EXTERNAL TABLE schema.ext_table
( thing int, thing2 int )
LOCATION (
'gpfdist://server:port1/path/to/filep1.txt',
'gpfdist://server:port2/path/to/filep2.txt',
'gpfdist://server:port3/path/to/filep3.txt'
) FORMAT 'text' (delimiter E'\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
If you play by their rules and your data is clean, the loading can be blazing fast.
3) You don't need python to do this, although you could automate it by using python to kick off the gpfdist processes, and then sending a command to psql that creates the external table and loads the data. Depends on what you want to do though.
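For illustration only, a loose sketch of that automation (the directory, ports, file names and SQL are assumptions lifted from the example above, and error handling is omitted):
import subprocess
import time

# start one gpfdist per file chunk (directory and ports are assumptions)
gpfdist_procs = [
    subprocess.Popen(["gpfdist", "-d", "/data", "-p", str(port)])
    for port in (8081, 8082, 8083)
]
time.sleep(2)  # crude: give the gpfdist listeners a moment to come up

load_sql = """
CREATE READABLE EXTERNAL TABLE schema.ext_table (thing int, thing2 int)
LOCATION ('gpfdist://server:8081/filep1.txt',
          'gpfdist://server:8082/filep2.txt',
          'gpfdist://server:8083/filep3.txt')
FORMAT 'text' (delimiter E'\\t' null 'null' escape 'off') ENCODING 'UTF8';
CREATE TEMP TABLE import AS SELECT * FROM schema.ext_table DISTRIBUTED RANDOMLY;
"""
subprocess.run(["psql", "-d", "mydb", "-c", load_sql], check=True)

for proc in gpfdist_procs:
    proc.terminate()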
Many of Greenplum's utilities are written in python and the current DBMS distribution comes with python 2.6.2 installed, including the pygresql module which you can use to work inside the GPDB.
For data transfer into Greenplum, I've written Python scripts that connect to the source (Oracle) DB using cx_Oracle and then dump the output either to flat files or named pipes. gpfdist can read from either sort of source and load the data into the system.
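A bare-bones sketch of that kind of extraction script (the connect string, output path, delimiter and table name are placeholders):
import csv
import cx_Oracle

conn = cx_Oracle.connect("user/password@source-db")   # placeholder connect string
cur = conn.cursor()
cur.arraysize = 10000                                  # fetch in large batches

with open("/data/export.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    cur.execute("SELECT * FROM source_table")          # hypothetical source table
    for row in cur:
        writer.writerow(row)

cur.close()
conn.close()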
Generally, it is really slow if you use SQL insert or merge to import big bulk data.
The recommended way is to use external tables defined over file-based, web-based or gpfdist-protocol hosted files.
Greenplum also has a utility named gpload, which can be used to define your transfer jobs: source, output, and mode (insert, update or merge).
1) It's not vanilla postgres
2) I have used pentaho data integration with good success in various types of data transfer projects.
It allows for complex transformations and multi-threaded, multi-step loading of data if you design your steps carefully.
Also, I believe Pentaho supports Greenplum specifically, though I have no experience of this.
I'm facing an atypical conversion problem. About a decade ago I coded up a large site in ASP. Over the years this turned into ASP.NET but kept the same database.
I've just re-done the site in Django and I've copied all the core data but before I cancel my account with the host, I need to make sure I've got a long-term backup of the data so if it turns out I'm missing something, I can copy it from a local copy.
To complicate matters, I no longer have Windows. I moved to Ubuntu on all my machines some time back. I could ask the host to send me a backup but having no access to a machine with MSSQL, I wouldn't be able to use that if I needed to.
So I'm looking for something that does:
db = {}
for table in database:
db[table.name] = [row for row in table]
And then I could serialize db off somewhere for later consumption... But how do I do the table iteration? Is there an easier way to do all of this? Can MSSQL do a cross-platform SQLDump (inc data)?
Previously, for MSSQL, I've used pymssql, but I don't know how to iterate the tables and copy rows (ideally with column headers so I can tell what the data is). I'm not looking for much code, but I need a poke in the right direction.
Have a look at the sysobjects and syscolumns tables. Also try:
SELECT * FROM sysobjects WHERE name LIKE 'sys%'
to find any other metatables of interest. See here for more info on these tables and the newer SQL2005 counterparts.
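Building on that, a hedged pymssql sketch (server, credentials, database and output file are placeholders) that walks the user tables from sysobjects and dumps every row with its column headers:
import json
import pymssql

conn = pymssql.connect(server="host", user="user", password="pw", database="mydb")
cur = conn.cursor()

# user tables only (xtype = 'U'), per the sysobjects query above
cur.execute("SELECT name FROM sysobjects WHERE xtype = 'U'")
tables = [row[0] for row in cur.fetchall()]

db = {}
for table in tables:
    # table names come straight from sysobjects, so this is internal data
    cur.execute("SELECT * FROM [%s]" % table)
    columns = [col[0] for col in cur.description]       # column headers
    db[table] = [dict(zip(columns, row)) for row in cur.fetchall()]

with open("backup.json", "w") as f:
    json.dump(db, f, default=str)   # default=str copes with dates, decimals, ...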
I've liked the ADOdb python module when I've needed to connect to sql server from python. Here is a link to a simple tutorial/example: http://phplens.com/lens/adodb/adodb-py-docs.htm#tutorial
I know you said JSON, but it's very simple to generate a SQL script to do an entire dump in XML:
SELECT REPLACE(REPLACE('SELECT * FROM {TABLE_SCHEMA}.{TABLE_NAME} FOR XML RAW', '{TABLE_SCHEMA}',
QUOTENAME(TABLE_SCHEMA)), '{TABLE_NAME}', QUOTENAME(TABLE_NAME))
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA
,TABLE_NAME
As an aside to your coding approach, I'd say:
set up a virtual machine with an eval on windows
put sql server eval on it
restore your data
check it manually or automatically using the excellent db scripting tools from red-gate to script the data and the schema
if fine then you have (a) a good backup and (b) a scripted output.
I'm working on a small app which will help browse the data generated by vim-logging, and I'd like to allow people to run arbitrary SQL queries against the datasets.
How can I do that safely?
For example, I'd like to let someone enter, say, SELECT file_type, count(*) FROM commands GROUP BY file_type, then send the result back to their web browser.
Do this:
cmd = "update people set name=%s where id=%s"
curs.execute(cmd, (name, id))
Note that the placeholder syntax depends on the database you are using.
Source and more info here:
http://bobby-tables.com/python.html
Allowing expressive power while preventing destruction is a difficult job. If you let them enter "SELECT .." themselves, you need to prevent them from entering "DELETE .." instead. You can require the statement to begin with "SELECT", but then you also have to be sure it doesn't contain "; DELETE" somewhere in the middle.
The safest thing to do might be to connect to the database with read-only user credentials.
In MySQL, you can create a limited user (create a new user and grant limited access) which can only access certain tables.
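If the vim-logging data lives in a SQLite file (an assumption), the same read-only idea is available in the standard library; a small sketch with a hypothetical database path:
import sqlite3

# open read-only via a URI so even a malicious statement cannot modify data
conn = sqlite3.connect("file:vim_logging.db?mode=ro", uri=True)

query = "SELECT file_type, count(*) FROM commands GROUP BY file_type"
try:
    result = conn.execute(query).fetchall()
except sqlite3.Error as exc:        # bad SQL, or an attempted write
    result = [("error", str(exc))]
print(result)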
Consider using SQLAlchemy. While SQLAlchemy is arguably the greatest Object Relational Mapper ever, you certainly don't need to use any of the ORM stuff to take advantage of all of the great Python/SQL work that's been done.
As the introductory documentation suggests:
Most importantly, SQLAlchemy is not just an ORM. Its data abstraction layer allows construction and manipulation of SQL expressions in a platform agnostic way, and offers easy to use and superfast result objects, as well as table creation and schema reflection utilities. No object relational mapping whatsoever is involved until you import the orm package. Or use SQLAlchemy to write your own!
Using SQLAlchemy will give you input sanitation "for free" and let you use standard Python logic to analyze statements for safety without having to do any messy text-parsing/pattern-matching.
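As a hedged illustration of that idea with a recent SQLAlchemy (the database path and table name are assumptions): accept column names rather than raw SQL, and build the statement with the expression language, which handles quoting and binding for you.
from sqlalchemy import MetaData, Table, create_engine, func, select

engine = create_engine("sqlite:///vim_logging.db")        # assumed database path
commands = Table("commands", MetaData(), autoload_with=engine)

def grouped_count(column_name):
    col = commands.c[column_name]       # raises KeyError for unknown columns
    stmt = select(col, func.count()).group_by(col)
    with engine.connect() as conn:
        return conn.execute(stmt).fetchall()

print(grouped_count("file_type"))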