Stored Procedures in Python for PostgreSQL

We are still pretty new to Postgres, having come from Microsoft SQL Server.
We want to write some stored procedures now. After struggling to get anything more complicated than a hello world working in PL/pgSQL, we decided that if we are going to learn a new language anyway, we might as well learn Python, since we got the same query working in it in about 15 minutes (note: none of us actually knows Python).
So I have some questions about it in comparison to PL/pgSQL.
Is PL/PythonU slower than PL/pgSQL?
Is there any kind of "good" reference for how to write good stored procedures using it? Five short pages in the Postgres documentation doesn't really tell us enough.
What about query preparation? Should it always be used?
If we use the SD and GD dictionaries for a lot of query plans, will they ever get too full or have a negative impact on the server? Will they automatically delete old values if they get too full?
Is there any hope of it becoming a trusted language?
Also, our stored procedure usage is extremely light. Right now we only have 4, but we are still trying to convert little bits of code over from SQL Server-specific syntax (such as variables, which can't be used in Postgres outside of stored procedures).

Depends on what operations you're doing.
Well, combine that with the general Python documentation, and that's about all there is.
No. Again, it depends on what you're doing. If you're only going to run a query once, there's no point in preparing it separately.
If you are using persistent connections, it might. But they get cleared out whenever a connection is closed.
Not likely. Sandboxing is broken in Python and AFAIK nobody is really interested in fixing it. I heard someone say that python-on-parrot may be the most viable way, once we have pl/parrot (which we don't yet).
Bottom line though - if your stored procedures are going to do database work, use pl/pgsql. Only use pl/python if you are going to do non-database stuff, such as talking to external libraries.
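On the preparation and SD questions specifically, here is a minimal sketch of what plan caching looks like in PL/PythonU; the function, table, and column names are made up for illustration. The plan is prepared once per session and reused on later calls in that connection.

    -- Hypothetical table and function; SD caches the prepared plan per session,
    -- so plpy.prepare() only runs on the first call in each connection.
    CREATE OR REPLACE FUNCTION get_balance(acct_id integer)
    RETURNS numeric AS $$
        if "balance_plan" not in SD:
            SD["balance_plan"] = plpy.prepare(
                "SELECT balance FROM accounts WHERE id = $1", ["integer"])
        rows = plpy.execute(SD["balance_plan"], [acct_id], 1)
        if rows.nrows() == 0:
            return None
        return rows[0]["balance"]
    $$ LANGUAGE plpythonu;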

Related

Suggestions on workflow between SQL Server and Python

I'm currently working on something where data is produced, with a fair amount of processing, within SQL Server, that I ultimately need to manipulate within Python to complete the task. It seems to me I have a couple of different options:
(1) Run the SQL code from within Python, manipulate output
(2) Create an SP in SSMS, run the SP from within Python, manipulate output
(3) ?
The second seems cleanest, but I wonder if there's a better way to achieve my objective without needing to create a stored procedure every time I need SQL data in Python. Copying the entirety of the SQL code into Python seems similarly kludgy, particularly for larger or complex queries.
For those with more experience working between the two: can you offer any suggestions on workflow?
There is no silver bullet.
It really depends on the specifics of what you're doing. What amount of data are we talking? Is it even feasible to stream it all over the network, through Python, and back? How much more load can the database server handle? How complex are the manipulations you consider doing in Python? How proficient are you and your team in SQL, and in Python?
I've seen both approaches in production, and one slight advantage that sometimes gets overlooked is that when you have all the SQL nicely formatted inside your Python program, it's automatically under some Version Control, and you can check who edited what last and is thus to blame for the latest SNAFU ;-)
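For what it's worth, the Python side of option (2) is small. A rough sketch with pyodbc (the connection string, procedure name, and parameter are placeholders, and pyodbc is just one common driver choice):

    import pyodbc

    # Placeholder connection string and procedure name -- adjust for your setup.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()

    # Option (2): call a stored procedure and pull its result set into Python.
    cursor.execute("EXEC dbo.GetProcessedData @since = ?", ("2024-01-01",))
    rows = cursor.fetchall()

    # Manipulate `rows` in Python from here on.
    conn.close()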

Python prepared statement security

I took a slight peek behind the curtain at the MySQLdb python driver, and to my horror I saw it was simply escaping the parameters and putting them directly into the query string. I realize that escaping inputs should be fine in most cases, but coming from PHP, I have seen bugs where, given certain database character sets and versions of the MySQL driver, SQL injection was still possible.
This question had some incredibly detailed responses regarding the edge cases of string escaping in PHP, and has led me to the belief that prepared statements should be used whenever possible.
So then my questions are: Are there any known cases where the MySQLdb driver has been successfully exploited due to this? When a query needs to be run in a loop, say in the case of an incremental DB migration script, will this degrade performance? Are my concerns regarding escaped input fundamentally flawed?
I can't point to any known exploit cases, but I can say that yes, this is terrible.
The Python project calling itself MySQLdb is no longer maintained. It's been piling up unresolved GitHub issues since 2014, and just quickly looking at the source code, I can find more bugs not yet reported - for example, it uses a regex to parse queries in executemany, leading it to mishandle any query containing the string " values()" where values isn't the keyword.
Instead of MySQLdb, you should be using MySQL's Connector/Python. That is still maintained, and it's hopefully less terrible than MySQLdb. (Hopefully. I didn't check that one.)
...prepared statements should be used whenever possible.
Yes. That's the best advice. If the prepared statement system is broken there will be klaxons blaring from the rooftops and everyone in the Python world will pounce on the problem to fix it. If there's a mistake in your own code that doesn't use prepared statements you're on your own.
What you're seeing in the driver is probably prepared statement emulation; that is, the driver is responsible for inserting data into the placeholders and forwarding the final, composed statement to the server. This is done for various reasons, some historical, some to do with compatibility.
Drivers are generally given a lot of serious scrutiny as they're the foundation of most systems. If there is a security bug in there then there's a lot of people that are going to be impacted by it, the stakes are very high.
The difference between using prepared statements with placeholder values and your own interpolated code is massive even if behind the scenes the same thing happens. This is because the driver, by design, always escapes your data. Your code might not, you may omit the escaping on one value and then you have a catastrophic hole.
Use placeholder values like your life depends on it, because it very well might. You do not want to wake up to a phone call or email one day saying your site got hacked and now your database is floating around on the internet.
Using prepared statements is faster than concatenating a query. The database can precompile the statement, so only the parameters will be changed when iterating in a loop.
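To make the difference concrete, here is a minimal sketch with MySQL Connector/Python; the credentials, table, and data are made up. The prepared=True cursor asks for a true server-side prepared statement, which is the shape you want for a migration loop: the statement is parsed once and only the parameters change per iteration.

    import mysql.connector

    # Placeholder credentials, table, and data.
    cnx = mysql.connector.connect(user="app", password="secret", database="mydb")

    # NEVER build the statement with string interpolation -- that is where
    # injection holes come from:
    #   cursor.execute("UPDATE users SET email = '%s' WHERE id = %d" % (email, uid))

    # Instead: placeholders on a prepared-statement cursor.
    rows_to_migrate = [(1, "a@example.com"), (2, "b@example.com")]
    cursor = cnx.cursor(prepared=True)
    for uid, email in rows_to_migrate:
        cursor.execute("UPDATE users SET email = %s WHERE id = %s", (email, uid))

    cnx.commit()
    cursor.close()
    cnx.close()

Even if your driver only emulates prepared statements, the placeholder form is still the right call for the injection reasons above.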

Python - Multiprocessing and database entries

I'm working on a framework for Digital Forensic Investigators to use to compare files with each other for my Master's capstone project. However, I ran into a bit of a snag...
I'm trying to implement multiprocessing on the comparisons since using a single core seems to be really slow. The trouble I'm having, however, is when the code goes to enter information into an SQLite database. It will occasionally get a "Database is locked" error when two cores complete at nearly the same time.
So, simple side of my question, is it unsafe to operate database functions within a multiprocessing environment due to the errors I'm encountering? If not, is there a method of going about this that is safe and won't result in random errors?
Thanks!
Your problem is that you are trying to have multiple writers access a toy database -- i.e. sqlite -- which is stored in a single file. Using Lock might help, but it's going to kill your multiprocess throughput because of all the waiting-for-the-lock time. In essence, the lock choke point will serialize your program.
Setting up either MySQL or Postgres on almost any platform is straightforward, and there are several excellent Python modules for accessing them. Using one of those will completely eliminate this problem.
Update for an extended response to comment:
I always ask clients / students, "What problem are you trying to solve?" I'm assuming that you are not trying to create a database system, simply to use one. SQLite3 is fine for a well-defined set of problems, but multiprocess access is not one of them. I could veer off into asking what aspect of your project requires multiprocess access, but I'll assume that you have already determined that this is needed. I don't know either your programming skills or your understanding of how a database works, so forgive me if the following is a bit basic.
Normally you need a database (my preference is Postgres), and a Python module that understands all of the fiddly details of how to talk to that database. Then you need to know what it is you want the DBMS to do for you. The Good News is that you are hardly the first to go down this path.
The Postgres Wiki is full of good stuff. See their page on Python Drivers. Psycopg2 is the category leader and runs on Win/Linux/Mac. Also check out PyPI, the Python Package Index, for many well-written extensions.
If you want to stay more object-oriented, as opposed to writing straight SQL, you might want to look at an ORM like SQLAlchemy. This is another category leader that is well-maintained and widely deployed.
The value of using an ORM is that you can (mostly) keep your head in ObjectLand, where most of your problem lives, and not get tangled up in the cognitive dissonance created by object-oriented programming vs. relational database management, which are two very different views of the world of data.
If you need more help, email me. My address is in my profile.
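A rough sketch of that pattern with psycopg2 and multiprocessing; the DSN, table, comparison routine, and work list are all placeholders. The key detail is that each worker opens its own connection rather than sharing one across processes.

    import multiprocessing
    import psycopg2

    # Placeholder comparison routine -- substitute your real one.
    def compare_files(path_a, path_b):
        return 1.0 if path_a == path_b else 0.0

    def compare_and_store(pair):
        # Each worker process opens its own connection; PostgreSQL handles
        # concurrent writers, so there is no "database is locked" problem.
        conn = psycopg2.connect(dbname="forensics", user="app")  # placeholder DSN
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO comparisons (file_a, file_b, score) VALUES (%s, %s, %s)",
                (pair[0], pair[1], compare_files(*pair)),
            )
        conn.close()

    if __name__ == "__main__":
        pairs = [("a.img", "b.img"), ("a.img", "c.img")]  # placeholder work list
        with multiprocessing.Pool() as pool:
            pool.map(compare_and_store, pairs)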
You can make use of Lock. Take a look at https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
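If you do stay on SQLite, a minimal sketch of the Lock approach (the database path, schema, and data are made up); note that this serializes all writes, which is exactly the throughput cost described above.

    import multiprocessing
    import sqlite3

    def init_worker(shared_lock):
        # Make the shared lock visible inside each worker process.
        global write_lock
        write_lock = shared_lock

    def store_result(row):
        # Only one process writes at a time; the others wait on the lock
        # instead of failing with "database is locked".
        with write_lock:
            conn = sqlite3.connect("results.db", timeout=30)  # placeholder path
            with conn:
                conn.execute(
                    "INSERT INTO comparisons (file_a, file_b, score) VALUES (?, ?, ?)",
                    row,
                )
            conn.close()

    if __name__ == "__main__":
        # Create the (placeholder) table up front so workers only insert.
        with sqlite3.connect("results.db") as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS comparisons "
                         "(file_a TEXT, file_b TEXT, score REAL)")
        rows = [("a.img", "b.img", 0.97), ("a.img", "c.img", 0.42)]  # placeholder data
        with multiprocessing.Pool(initializer=init_worker,
                                  initargs=(multiprocessing.Lock(),)) as pool:
            pool.map(store_result, rows)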

Pros and cons of using sqlite3 vs custom table implementation

I noticed that a significant part of my (pure Python) code deals with tables. Of course, I have class Table which supports the basic functionality, but I end up adding more and more features to it, such as queries, validation, sorting, indexing, etc.
I am starting to wonder whether it's a good idea to remove my class Table and refactor the code to use a regular relational database that I will instantiate in-memory.
Here's my thinking so far:
Performance of queries and indexing would improve, but communication between Python code and a separate database process might be less efficient than calls between Python functions. I assume that would be too much overhead, so I would have to go with SQLite, which comes with Python and lives in the same process. I hope this means it's a pure performance gain (at the cost of SQLite's non-standard SQL dialect and limited feature set).
With SQL, I will get a lot more powerful features than I would ever want to code myself. Seems like a clear advantage (even with sqlite).
I won't need to debug my own implementation of tables, but debugging mistakes in SQL is hard since I can't set breakpoints or easily print out intermediate state. I don't know how to judge the overall impact on my code's reliability and debugging time.
The code will be easier to read, since instead of calling my own custom methods I would write SQL (everyone who needs to maintain this code knows SQL). However, the Python code to deal with the database might be uglier and more complex than the code that uses the pure Python class Table. Again, I don't know which is better on balance.
Any corrections to the above, or anything else I should think about?
SQLite does not run in a separate process. So you don't actually have any extra overhead from IPC. But IPC overhead isn't that big, anyway, especially over e.g., UNIX sockets. If you need multiple writers (more than one process/thread writing to the database simultaneously), the locking overhead is probably worse, and MySQL or PostgreSQL would perform better, especially if running on the same machine. The basic SQL supported by all three of these databases is the same, so benchmarking isn't that painful.
You generally don't have to do the same type of debugging on SQL statements as you do on your own implementation. SQLite works, and is fairly well debugged already. It is very unlikely that you'll ever have to debug "OK, that row exists, why doesn't the database find it?" and track down a bug in index updating. Debugging SQL is completely different than procedural code, and really only ever happens for pretty complicated queries.
As for debugging your code, you can fairly easily centralize your SQL calls and add tracing to log the queries you are running, the results you get back, etc. The Python SQLite interface may already have this (not sure, I normally use Perl). It'll probably be easiest to just make your existing Table class a wrapper around SQLite.
I would strongly recommend not reinventing the wheel. SQLite will have far fewer bugs, and save you a bunch of time. (You may also want to look into Firefox's fairly recent switch to using SQLite to store history, etc., I think they got some pretty significant speedups from doing so.)
Also, SQLite's well-optimized C implementation is probably quite a bit faster than any pure Python implementation.
You could try to make an SQLite wrapper with the same interface as your class Table, so that you keep your code clean and you get SQLite's performance.
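For example, a bare-bones sketch of wrapping an in-memory SQLite database behind a Table-style interface (the method names and columns are illustrative, not your actual API):

    import sqlite3

    class Table:
        """Illustrative wrapper: Table-like interface, SQLite underneath."""

        def __init__(self, name, columns):
            self.name = name
            self.conn = sqlite3.connect(":memory:")
            self.conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")

        def insert(self, *values):
            placeholders = ", ".join("?" for _ in values)
            with self.conn:
                self.conn.execute(
                    f"INSERT INTO {self.name} VALUES ({placeholders})", values)

        def query(self, where_clause, params=()):
            # Callers still pass parameters separately, never interpolated.
            cur = self.conn.execute(
                f"SELECT * FROM {self.name} WHERE {where_clause}", params)
            return cur.fetchall()

    # Usage:
    # t = Table("people", ["name TEXT", "age INTEGER"])
    # t.insert("Ada", 36)
    # print(t.query("age > ?", (30,)))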
If you're doing database work, use a database; if you're not, don't. Since you're working with tables, it sounds like you are. I'd recommend using an ORM to make it more Pythonic. SQLAlchemy is the most flexible (though it's not strictly just an ORM).

Using CSV as a mutable database?

Yes, this is as stupid a situation as it sounds like. Due to some extremely annoying hosting restrictions and unresponsive tech support, I have to use a CSV file as a database.
While I can use MySQL with PHP, I can't use it with the Python backend of my program because of install issues with the host. I can't use SQLite with PHP because of more install issues, but I can use it from Python since it's a Python builtin.
Anyways, now, the question: is it possible to update values SQL-style in a CSV database? Or should I keep on calling the help desk?
Don't walk, run to get a new host immediately. If your host won't even get you the most basic of free databases, it's time for a change. There are many fish in the sea.
At the very least I'd recommend an XML data store rather than a CSV. My blog uses an XML data provider and I haven't had any issues with performance at all.
Take a look at this: http://www.netpromi.com/kirbybase_python.html
Keep calling the help desk.
While you can use a CSV as a database, it's generally a bad idea. You would have to implement your own locking, searching, and updating, and be very careful with how you write it out to make sure it isn't erased in case of a power outage or other abnormal shutdown. There will be no transactions and no query language unless you write your own, etc.
I couldn't imagine this ever being a good idea. The current mess I've inherited writes vital billing information to CSV and updates it after projects are complete. It runs horribly, and thousands of dollars are missed each month. Given the restrictions you have, I'd consider finding better hosting.
You can probably use sqlite3 as a more real database. It's hard to imagine hosting that won't allow you to use it as a Python module.
Don't even think of using CSV; your data will be corrupted and lost faster than you can say "s#&t".
"Anyways, now, the question: is it possible to update values SQL-style in a CSV database?"
Technically, it's possible. However, it can be hard.
If both PHP and Python are writing the file, you'll need to use OS-level locking to assure that they don't overwrite each other. Each part of your system will have to lock the file, rewrite it from scratch with all the updates, and unlock the file.
This means that PHP and Python must load the entire file into memory before rewriting it.
There are a couple of ways to handle the OS locking.
Use the same file and actually use some OS lock module. Both processes have the file open at all times.
Write to a temp file and do a rename. This means each program must open and read the file for each transaction. Very safe and reliable. A little slow.
Or.
You can rearchitect it so that only Python writes the file. The front-end reads the file when it changes, and drops off little transaction files to create a work queue for Python. In this case, you don't have multiple writers -- you have one reader and one writer -- and life is much, much simpler.
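A rough sketch of the single-writer side of that design (the file names and CSV layout are made up); the important part is the write-to-temp-then-rename step, which means readers never see a half-written data file.

    import csv
    import glob
    import os
    import tempfile

    DATA_FILE = "data.csv"   # placeholder file names and layout
    QUEUE_DIR = "queue"

    def apply_pending_transactions():
        # Load the whole CSV into memory, keyed by the first column.
        with open(DATA_FILE, newline="") as f:
            rows = {row[0]: row for row in csv.reader(f)}

        # Each transaction file dropped off by the front end holds one or more rows.
        for path in sorted(glob.glob(os.path.join(QUEUE_DIR, "*.csv"))):
            with open(path, newline="") as f:
                for row in csv.reader(f):
                    rows[row[0]] = row        # insert or update by key
            os.remove(path)

        # Write everything to a temp file, then atomically rename over the original.
        fd, tmp = tempfile.mkstemp(dir=".", suffix=".csv")
        with os.fdopen(fd, "w", newline="") as f:
            csv.writer(f).writerows(rows.values())
        os.replace(tmp, DATA_FILE)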
I'd keep calling the help desk. You don't want to use CSV for data if it's relational at all. It's going to be a nightmare.
I agree. Tell them that 5 random strangers agree that you being forced into a corner to use CSV is absurd and unacceptable.
If I understand you correctly: you need to access the same database from both python and php, and you're screwed because you can only use mysql from php, and only sqlite from python?
Could you further explain this? Maybe you could use xml-rpc or plain http requests with xml/json/... to get the php program to communicate with the python program (or the other way around?), so that only one of them directly accesses the db.
If this is not the case, I'm not really sure what the problem is.
It's technically possible. For example, Perl has DBD::CSV that provides a driver that runs SQL queries on the CSV file.
That being said, why not run off a SQLite database on your server?
What about postgresql? I've found that quite nice to work with, and python supports it well.
But I really would look for another provider unless it's really not an option.
Disagreeing with the noble colleagues, I often use DBD::CSV from Perl. There are good reasons to do it. Foremost is that data updates are made simple using a spreadsheet. As a bonus, since I am using SQL queries, the application can easily be upgraded to a real database engine. Bear in mind these were extremely small databases in single-user applications.
So rephrasing the question: is there a Python module equivalent to Perl's DBD::CSV?
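I'm not aware of a drop-in DBD::CSV counterpart in the standard library, but a common workaround is to load the CSV into an in-memory SQLite database and run SQL there. A hedged sketch, assuming the file has a header row with SQL-safe column names:

    import csv
    import sqlite3

    def csv_to_sqlite(path, table="data"):
        # Assumes the first CSV row is a header with SQL-safe column names.
        conn = sqlite3.connect(":memory:")
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            conn.execute(f"CREATE TABLE {table} ({', '.join(header)})")
            placeholders = ", ".join("?" for _ in header)
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", reader)
        return conn

    # Usage (hypothetical file and columns):
    # conn = csv_to_sqlite("records.csv")
    # for row in conn.execute("SELECT * FROM data LIMIT 5"):
    #     print(row)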
