I have a Python project that makes a variety of MySQL queries, which take some variables and insert them into the query itself. So for example:
number_of_rows = 50
query1 = '''select *
from some_db
limit %s''' % (number_of_rows)
However, since I have a lot of long queries, within a script that manipulates and cleans the data, it's making my script less readable. What is a reasonable way to structure my program so that it is both readable and makes these query calls? One way that has worked so far is to have another python file, let's call it my_query_file.py, with something along the lines of
def my_first_query(number_of_rows):
query1 = '''select *
from some_db
limit %s''' % (number_of_rows)
return query
and then importing my_query_file from within my main project file, and calling my_first_query. Is there a better way to do this, though?
consider using existing query builder like python-sql:
https://code.google.com/p/python-sql/
or if the complexity of your application justifies that you might try full blown ORM like SQL Alchemy:
http://www.sqlalchemy.org/
Have you thought about using a template system like Jinja2? I've used it to store and use templates for a variety of situations (ranging from web development to automatic code generation) and it works very well.
You could have your SQL queries in template files, and load and fill them as you wish.
Related
I have a Python script to import data from raw csv/xlsx files. For these I use Pandas to load the files, do some light transformation, and save to an sqlite3 database. This is fast (as fast as any method). After this, I run some queries against these to make some intermediate datasets. These I run through a function (see below).
More information: I am using Anaconda/Python3 (3.9) on Windows 10 Enterprise.
UPDATE:
Just as information for anybody reading this, I ended up going back to
just using standalone python (still using JupyterLab though)... I no
longer have this issue. So not sure if it is a problem with something
Anaconda does or just the versions of various libraries being used for
that particular Anaconda distribution (latest available). My script
runs more or less in the time that I would expect using Python 3.11
and the versions pulled in by pip for Pandas and sqlite (1.5.3 and
3.38.4).
Python function for running sqlite3 queries:
def runSqliteScript(destConnString, queryString):
'''Runs an sqlite script given a connection string and a query string
'''
try:
print('Trying to execute sql script: ')
print(queryString)
cursorTmp = destConnString.cursor()
cursorTmp.executescript(queryString)
except Exception as e:
print('Error caught: {}'.format(e))
Because somebody asked, here is the function that creates the "destConnString", though it's called something else in the actual function call, but is the same type.
def createSqliteDb(db_file):
''' Creates an sqlite database at direct/file name specified
'''
conSqlite = None
try:
conSqlite = sqlite3.connect(db_file)
return conSqlite
except Error as e:
print('Error {} when trying to create {}'.format(e, db_file))
Example of one of the queries (I commented out journal mode/synchronous pragmas after it didn't seem to help at all):
-- PRAGMA journal_mode = WAL;
-- PRAGMA synchronous = NORMAL;
BEGIN;
drop table if exists tbl_1110_cop_omd_fmd;
COMMIT;
BEGIN;
create table tbl_1110_cop_omd_fmd as
select
siteId,
orderNumber,
familyGroup01, familyGroup02,
count(*) as countOfLines
from tbl_0000_ob_trx_for_frazelle
where 1 = 1
-- and dateCreated between datetime('now', '-365 days') and datetime('now', 'localtime') -- temporarily commented due to no date in file
group by siteId,
orderNumber,
familyGroup01, familyGroup02
order by dateCreated asc
;
COMMIT
;
Here is a list of things that I have tried. Unfortunately, no matter what combination of things I have tried, it has ended up having one bottleneck or another. It seems there is some kind of write bottleneck from python to sqlite3, yet the pandas to_sql method doesn't seem to be affected by it. Complete list of all combinations of things that I have tried.
I tried wrapping all my queries in begin/commit statements. I put these in-line with the query, though I'd be interested in knowing if this is the correct way to do this. This seemed to have no effect.
I tried setting the journal mode to WAL and synchronous to normal, again to no effect.
I tried running the queries in an in-memory database.
Firstly, I tried creating everything from scratch in the in-memory database. The tables didn't create any faster. Saving this in-memory database seems to be a bottleneck (backup method).
Next, I tried creating views instead of tables (again, creating everything from scratch in the in-memory database). This created really quickly. Weirdly, querying these views was very fast. Saving this in-memory database seems to be a bottleneck (backup method).
I tried just writing views to the database file (not in-memory). Unfortunately, the views take as long as the make tables when running from Python/sqlite.
I don't really want to do anything strictly in-memory for the database creation, as this python script is used for different sets of data, some which could have too many rows for an in-memory setup. The only thing I have left to try is to take the in-memory from scratch setup, make views instead of tables, read ALL the in-memory db tables with pandas (from_sql), then write ALL the tables to a file db with pandas (to_sql)... Hoping there is something easy to try to resolve this problem.
connOBData = sqlite3.connect('file:cachedb?mode=memory?cache=shared')
These take approximately 1,000 times or more longer than if I run these queries directly in DB Browser (an sqlite frontend). These queries aren't that complex and run fine (in ~2-4 seconds) in DB Browser. All told, if I run all the queries in a row in DB Browser they'd run in 1-2 minutes. If I let them run through the Python script, it literally takes close to 20 hours. I'd expect the queries to finish in approximately the same time that they run in DB Browser.
I am currently porting some code from Spark to the MSSQL Analytical Services with Python. Everything is nice and dandy, but I am not sure if my solution is the correct one for multiple inputs for the scripts.
Consider the following code snippet:
DROP PROCEDURE IF EXISTS SampleModel;
GO
CREATE PROCEDURE SampleModel
AS
BEGIN
exec sp_execute_external_script
#language =N'Python',
#script=N'
import sys
sys.path.append("C:\path\to\custom\package")
from super_package.sample_model import run_model
OutputDataSet = run_model()'
WITH RESULT SETS ((Score float));
END
GO
INSERT INTO [dbo].[SampleModelPredictions] (prediction) EXEC [dbo].[SampleModel]
GO
I have a custom package called super_package and a sample model called sample_model. Since this model uses multiple database tables as input, and I would rather have everything in one place I have a module which connects to the database and fetches the data directly:
def go_go_get_data(query, config):
return rx_data_step(RxSqlServerData(
sql_query=query,
connection_string=config.connection_string,
user=config.user,
password=config.password))
Inside the run_model() function I fetch all necessary data from the database with the go_go_get_data function.
If the data is too big to handle in one go I would to some pagination.
In general I cannot join the tables so this solution doesn't work.
The questions is: Is this the right approach to tackle this problem? Or did I miss something? For now this works, but as I am still in the development / tryout phase I cannot be certain that this will scale. I would rather use the parameters for the stored procedure than fetching inside the Python context.
As you've already figured out, sp_execucte_external_script only allows one result set to be passed in. :-(
You can certainly query from inside the script to fetch data as long as your script is okay with the fact that it's not executing under the current SQL session's user's permissions.
If pagination is important and one data set is significantly larger than the others and you're using Enterprise Edition, you might consider passing the largest data set into the script in chunks using sp_execute_external_script's streaming feature.
If you'd like all of your data to be assembled in SQL Server (vs. fetched by queries in your script), you could try to serialize the result sets and then pass them in as parameters (link describes how to do this in R but something similar should be possible with Python).
I have been using cx_Oracle to perform SQL queries on an Oracle database in Python. So far I have been pasting these queries into strings and then running them using the cursor.execute() function that comes with cx_Oracle:
#simple example
query = """SELECT *
FROM my_table"""
cursor.execute(query)
However, my select queries have gotten quite complex, and the code is starting to look a bit messy. I was wondering if there were any way to simply save the SQL code into a .sql file and for Python or cx_Oracle to call that file? I thought something like that might be simple to find using Google, but my searches are oddly coming up dry.
Well, you can certainly save SQL code to a file and load it:
query = open('foo.sql', 'r').read()
cursor.execute(query)
I can't find any reference to saved queries in cx_Oracle, so that may be your best bet.
I am trying to analyse the SQL performance of our Django (1.3) web application. I have added a custom log handler which attaches to django.db.backends and set DEBUG = True, this allows me to see all the database queries that are being executed.
However the SQL is not valid SQL! The actual query is select * from app_model where name = %s with some parameters passed in (e.g. "admin"), however the logging message doesn't quote the params, so the sql is select * from app_model where name = admin, which is wrong. This also happens using django.db.connection.queries. AFAIK the django debug toolbar has a complex custom cursor to handle this.
Update For those suggesting the Django debug toolbar: I am aware of that tool, it is great. However it does not do what I need. I want to run a sample interaction of our application, and aggregate the SQL that's used. DjDT is great for showing and shallow learning. But not great for aggregating and summarazing the interaction of dozens of pages.
Is there any easy way to get the real, legit, SQL that is run?
Check out django-debug-toolbar. Open a page, and a sidebar will be displayed with all SQL queries plus other information.
select * from app_model where name = %s is a prepared statement. I would recommend you to log the statement and the parameters separately. In order to get a wellformed query you need to do something like "select * from app_model where name = %s" % quote_string("user") or more general query % map(quote_string, params).
Please note that quote_string is DB specific and the DB 2.0 API does not define a quote_string method. So you need to write one yourself. For logging purposes I'd recommend keeping the queries and parameters separate as it allows for far better profiling as you can easily group the queries without taking the actual values into account.
The Django Docs state that this incorrect quoting only happens for SQLite.
https://docs.djangoproject.com/en/dev/ref/databases/#sqlite-connection-queries
Have you tried another Database Engine?
Every QuerySet object has a 'query' attribute. One way to do what you want (I accept perhaps not an ideal one) is to chain the lookups each view is producing into a kind of scripted user-story, using Django's test client. For each lookup your user story contains just append the query to a file-like object that you write at the end, for example (using a list instead for brevity):
l = []
o = Object.objects.all()
l.append(o.query)
I have a CSV file and want to generate dumps of the data for sqlite, mysql, postgres, oracle, and mssql.
Is there a common API (ideally Python based) to do this?
I could use an ORM to insert the data into each database and then export dumps, however that would require installing each database. It also seems a waste of resources - these CSV files are BIG.
I am wary of trying to craft the SQL myself because of the variations with each database. Ideally someone has already done this hard work, but I haven't found it yet.
SQLAlchemy is a database library that (as well as ORM functionality) supports SQL generation in the dialects of the all the different databases you mention (and more).
In normal use, you could create a SQL expression / instruction (using a schema.Table object), create a database engine, and then bind the instruction to the engine, to generate the SQL.
However, the engine is not strictly necessary; the dialects each have a compiler that can generate the SQL without a connection; the only caveat being that you need to stop it from generating bind parameters as it does by default:
from sqlalchemy.sql import expression, compiler
from sqlalchemy import schema, types
import csv
# example for mssql
from sqlalchemy.dialects.mssql import base
dialect = base.dialect()
compiler_cls = dialect.statement_compiler
class NonBindingSQLCompiler(compiler_cls):
def _create_crud_bind_param(self, col, value, required=False):
# Don't do what we're called; return a literal value rather than binding
return self.render_literal_value(value, col.type)
recipe_table = schema.Table("recipe", schema.MetaData(), schema.Column("name", types.String(50), primary_key=True), schema.Column("culture", types.String(50)))
for row in [{"name": "fudge", "culture": "america"}]: # csv.DictReader(open("x.csv", "r")):
insert = expression.insert(recipe_table, row, inline=True)
c = NonBindingSQLCompiler(dialect, insert)
c.compile()
sql = str(c)
print sql
The above example actually works; it assumes you know the target database table schema; it should be easily adaptable to import from a CSV and generate for multiple target database dialects.
I am no database wizard, but AFAIK in Python there's not a common API that would do out-of-the-box what you ask for. There is PEP 249 that defines an API that should be used by modules accessing DB's and that AFAIK is used at least by the MySQL and Postgre python modules (here and here) and that perhaps could be a starting point.
The road I would attempt to follow myself - however - would be another one:
Import the CVS nto MySQL (this is just because MySQL is the one I know best and there are tons of material on the net, as for example this very easy recipe, but you could do the same procedure starting from another database).
Generate the MySQL dump.
Process the MySQL dump file in order to modify it to meet SQLite (and others) syntax.
The scripts for processing the dump file could be very compact, although they might somehow be tricky if you use regex for parsing the lines. Here's an example script MySQL → SQLite that I simply pasted from this page:
#!/bin/sh
mysqldump --compact --compatible=ansi --default-character-set=binary mydbname |
grep -v ' KEY "' |
grep -v ' UNIQUE KEY "' |
perl -e 'local $/;$_=<>;s/,\n\)/\n\)/gs;print "begin;\n";print;print "commit;\n"' |
perl -pe '
if (/^(INSERT.+?)\(/) {
$a=$1;
s/\\'\''/'\'\''/g;
s/\\n/\n/g;
s/\),\(/\);\n$a\(/g;
}
' |
sqlite3 output.db
You could write your script in python (in which case you should have a look to re.compile for performance).
The rationale behind my choice would be:
I get the heavy-lifting [importing and therefore data consistency checks + generating starting SQL file] done for me by mysql
I only have to have one database installed.
I have full control on what is happening and the possibility to fine-tune the process.
I can structure my script in such a way that it will be very easy to extend it for other databases (basically I would structure it like a parser that recognises individual fields + a set of grammars - one for each database - that I can select via command-line option)
There is much more documentation on the differences between SQL flavours than on single DB import/export libraries.
EDIT: A template-based approach
If for any reason you don't feel confident enough to write the SQL yourself, you could use a sort of template-based script. Here's how I would do it:
Import and generate a dump of the table in all the 4 DB you are planning to use.
For each DB save the initial part of the dump (with the schema declaration and all the rest) and a single insert instruction.
Write a python script that - for each DB export - will output the "header" of the dump plus the same "saved line" into which you will programmatically replace the values for each line in your CVS file.
The obvious drawback of this approach is that your "template" will only work for one table. The strongest point of it is that writing such script would be extremely easy and quick.
HTH at least a bit!
You could do this - Create SQL tables from CSV files
or Generate Insert Statements from CSV file
or try this Generate .sql from .csv python
Of course you might need to tweak the scripts mentioned to suite your needs.