I'm quite new to Python and Flask, and while working through the examples, I couldn't help noticing cursors. Before this I programmed in PHP, where I never needed cursors. So I got to wondering: what are cursors, and why are they used so much in these code examples?
But no matter where I turned, I saw no clear verdict and lots of warnings:
Wikipedia: "Fetching a row from the cursor may result in a network round trip each time", and "Cursors allocate resources on the server, such as locks, packages, processes, and temporary storage."
StackOverflow: See the answer by AndreasT.
The Island of Misfit Cursors: "A good developer is never reluctant to use a tool only because it's often misused by others."
And to top it all, I learned that MySQL does NOT support cursors!
It looks like the only code that doesn't use cursors in the mysqlclient library is the _mysql module, and the author repeatedly warns not to use it for compatibility reasons: "If you want to write applications which are portable across databases, use MySQLdb, and avoid using this module directly."
Well, I hope I have explained and supported my dilemma sufficiently well. Here are two big questions troubling me:
Since MySQL doesn't support cursors, what's the whole point of building the entire thing on a Cursor class hierarchy?
Why aren't cursors optional in mysqlclient?
You are confusing database-engine-level cursors with Python db-api cursors. The latter exist only at the Python code level and are not necessarily tied to database-level ones.
At the Python level, a cursor is a way to encapsulate a query and its results. This abstraction gives you a simple, usable, common API across different vendors. Whether the actual implementation for a given vendor relies on database-level cursors or not is a totally different problem.
To make a long story short: there are two distinct concepts here:
database (server) cursors, a feature that exists in some but not all SQL engines
db api (client) cursors (as defined in PEP 249), which are used to execute a query and eventually fetch the results.
db api cursors are named so because they conceptually have some similarity with database cursors but are technically totally unrelated.
As to why mysqlclient works this way, the answer is simple: it implements PEP 249, the community-defined API for Python SQL database clients.
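To make the distinction concrete, here is a minimal sketch of a db-api (client) cursor in action, using the standard library's sqlite3 module; nothing below implies a server-side cursor, the cursor is just the client-side object you execute and fetch through:

    # Sketch of PEP 249 usage: the cursor wraps a query and its result set,
    # regardless of what the engine does underneath.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()                      # db-api cursor, purely client-side
    cur.execute("CREATE TABLE users (name TEXT)")
    cur.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    cur.execute("SELECT name FROM users")
    print(cur.fetchall())                    # [('alice',)]
    conn.commit()
    conn.close()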
Related
I have a painted-myself-into-a-corner question that hopefully has a sensible solution I'm overlooking. I had a Python project using sqlite3, which I like a lot and use all the time, and I wanted to try to also support running it on postgres, in case scaling becomes an issue.
Some initial research suggested that there wasn't really a single de facto Python database abstraction layer (hopefully I didn't get this wrong), but psycopg2 fortunately seemed to have very similar structure and methods to sqlite3, and I was able to get away with only adding a couple helper functions and switch cases to my existing code to allow it to support both database libraries with the same queries.
The only exception, unbelievably enough, is the placeholder character for query parameters; sqlite3 needs ? and psycopg2 needs %s. These are probably inherent to sqlite and postgres themselves for all I know.
This means that a function like this:
cur.execute("INSERT INTO repositories (repository_url, repository_name, repository_type, repository_thumbnail, last_crawl_timestamp, item_url_pattern) VALUES (%s,%s,%s,%s,%s,%s)", (repository_url, repository_name, repository_type, repository_thumbnail, time.time(), item_url_pattern))
Will only work for postgres, and if I change the %s's to ?'s, it'll only work for sqlite. This defies any kind of elegant solution -- I don't really want to rig up some kind of string replacement to construct my queries, as that'll get dumb pretty quickly -- and mostly I'm just astonished that this has turned out to be my blocker.
Any thoughts?
The API in use by both implementations is the Python Database API Specification v2.0, documented in PEP 249. The module global paramstyle tells you what style of parameters a particular implementation expects. The possible values and meanings are documented here.
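As a rough sketch of how you could use that (the helper below is made up, not part of either library), branch on the module's declared paramstyle instead of hard-coding the placeholder:

    # Sketch: pick the right placeholder from the driver's declared paramstyle.
    # Only 'qmark' (sqlite3) and 'format'/'pyformat' (psycopg2) are handled here.
    import sqlite3

    def placeholder(dbmodule):
        style = dbmodule.paramstyle
        if style == "qmark":
            return "?"
        if style in ("format", "pyformat"):
            return "%s"
        raise ValueError("unsupported paramstyle: %s" % style)

    ph = placeholder(sqlite3)  # or placeholder(psycopg2)
    query = ("INSERT INTO repositories (repository_url, repository_name) "
             "VALUES ({0}, {0})").format(ph)
    # cur.execute(query, (repository_url, repository_name))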
I'm working on a framework for Digital Forensic Investigators to use to compare files with each other for my Master's capstone project. However, I ran into a bit of a snag...
I'm trying to implement multiprocessing on the comparisons since using a single core seems to be really slow. The trouble I'm having, however, is when the code goes to enter information into an SQLite database. It will occasionally get a "Database is locked" error when two cores complete at nearly the same time.
So, simple side of my question, is it unsafe to operate database functions within a multiprocessing environment due to the errors I'm encountering? If not, is there a method of going about this that is safe and won't result in random errors?
Thanks!
Your problem is that you are trying to have multiple writers access a toy database -- i.e. sqlite -- which is stored in a single file. Using Lock might help, but it's going to kill your multiprocess throughput because of all the waiting-for-the-lock time. In essence, the lock choke point will serialize your program.
Setting up either MySQL or Postgres on almost any platform is straightforward, and there are several excellent Python modules for accessing them. Using one of those will completely eliminate this problem.
Update for an extended response to comment:
I always ask clients / students, "What problem are you trying to solve?" I'm assuming that you are not trying to create a database system, simply to use one. SQLite3 is fine for a well-defined set of problems, but multiprocess access is not one of them. I could veer off into asking what aspect of your project requires multiprocess access, but I'll assume that you have already determined that this is needed. I don't know either your programming skills or your understanding of how a database works, so forgive me if the following is a bit basic.
Normally you need a database (my preference is Postgres), and a Python module that understands all of the fiddly details of how to talk to that database. Then you need to know what it is you want the DBMS to do for you. The Good News is that you are hardly the first to go down this path.
The Postgres Wiki is full of good stuff. See their page on Python Drivers. Psycopg2 is the category leader and runs on Win/Linux/Mac. Also check out PyPi, the Python Package Index, for many well-written extensions.
If you want to stay more object-oriented, as opposed to writing straight SQL, you might want to look at an ORM like SQLAlchemy. This is another category leader that is well-maintained and widely deployed.
The value of using an ORM is that you can (mostly) keep your head in ObjectLand, where most of your problem lives, and not get tangled up in the cognitive dissonance created by object-oriented programming vs. relational database management, which are two very different views of the world of data.
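As a taste of what that looks like, here is a minimal sketch (assuming SQLAlchemy 1.4+, and using an in-memory SQLite URL so it runs anywhere; swap in your Postgres DSN):

    # Sketch only: declare a mapped class once, then work with objects
    # instead of hand-written SQL. Table/column names are made up.
    from sqlalchemy import Column, Integer, Text, create_engine
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class Comparison(Base):
        __tablename__ = "comparisons"
        id = Column(Integer, primary_key=True)
        file_hash = Column(Text)
        score = Column(Integer)

    engine = create_engine("sqlite://")  # e.g. "postgresql://user:pass@localhost/forensics"
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    session = Session()
    session.add(Comparison(file_hash="abc123", score=9))
    session.commit()
    for row in session.query(Comparison).filter(Comparison.score > 5):
        print(row.file_hash, row.score)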
If you need more help, email me. My address is in my profile.
You can make use of Lock. Take a look at https://docs.python.org/2/library/multiprocessing.html#synchronization-between-processes
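For example, here is a minimal sketch of that approach (file and table names are made up): share one Lock among the worker processes and take it around every write.

    # Sketch: serialize SQLite writes across processes with a shared Lock.
    import sqlite3
    from multiprocessing import Lock, Process

    def store_result(lock, db_path, file_hash, score):
        with lock:  # only one process touches the database at a time
            conn = sqlite3.connect(db_path, timeout=30)
            with conn:  # commits on success
                conn.execute("CREATE TABLE IF NOT EXISTS comparisons "
                             "(file_hash TEXT, score REAL)")
                conn.execute("INSERT INTO comparisons (file_hash, score) VALUES (?, ?)",
                             (file_hash, score))
            conn.close()

    if __name__ == "__main__":
        lock = Lock()
        procs = [Process(target=store_result, args=(lock, "results.db", "abc123", 0.9)),
                 Process(target=store_result, args=(lock, "results.db", "def456", 0.7))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()

As the other answer notes, though, the waiting-for-the-lock time serializes your writers, so this only treats the symptom.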
I noticed that a significant part of my (pure Python) code deals with tables. Of course, I have class Table which supports the basic functionality, but I end up adding more and more features to it, such as queries, validation, sorting, indexing, etc.
I started to wonder if it's a good idea to remove my class Table and refactor the code to use a regular relational database that I would instantiate in-memory.
Here's my thinking so far:
Performance of queries and indexing would improve, but communication between the Python code and a separate database process might be less efficient than calls between Python functions. I assume that overhead would be too high, so I would have to go with sqlite, which comes with Python and lives in the same process. I hope this means it's a pure performance gain (at the cost of sqlite's non-standard SQL definitions and limited feature set).
With SQL, I will get a lot more powerful features than I would ever want to code myself. Seems like a clear advantage (even with sqlite).
I won't need to debug my own implementation of tables, but debugging mistakes in SQL is hard since I can't set breakpoints or easily print out intermediate state. I don't know how to judge the overall impact on code reliability and debugging time.
The code will be easier to read, since instead of calling my own custom methods I would write SQL (everyone who needs to maintain this code knows SQL). However, the Python code that deals with the database might be uglier and more complex than the code that uses the pure Python class Table. Again, I don't know which is better on balance.
Any corrections to the above, or anything else I should think about?
SQLite does not run in a separate process. So you don't actually have any extra overhead from IPC. But IPC overhead isn't that big, anyway, especially over e.g., UNIX sockets. If you need multiple writers (more than one process/thread writing to the database simultaneously), the locking overhead is probably worse, and MySQL or PostgreSQL would perform better, especially if running on the same machine. The basic SQL supported by all three of these databases is the same, so benchmarking isn't that painful.
You generally don't have to do the same type of debugging on SQL statements as you do on your own implementation. SQLite works, and is fairly well debugged already. It is very unlikely that you'll ever have to debug "OK, that row exists, why doesn't the database find it?" and track down a bug in index updating. Debugging SQL is completely different from debugging procedural code, and really only ever happens for pretty complicated queries.
As for debugging your code, you can fairly easily centralize your SQL calls and add tracing to log the queries you are running, the results you get back, etc. The Python SQLite interface may already have this (not sure, I normally use Perl). It'll probably be easiest to just make your existing Table class a wrapper around SQLite.
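For instance, a rough sketch of such a wrapper (the method names are made up, not any particular API), logging every statement it runs:

    # Sketch: a thin Table-like wrapper over in-memory SQLite that logs each
    # query it executes, which makes SQL-level debugging much easier.
    import logging
    import sqlite3

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("table")

    class Table:
        def __init__(self, name, columns):
            self.name = name
            self.conn = sqlite3.connect(":memory:")
            self._run("CREATE TABLE %s (%s)" % (name, ", ".join(columns)))

        def _run(self, sql, params=()):
            log.debug("SQL: %s %r", sql, params)
            return self.conn.execute(sql, params).fetchall()

        def insert(self, values):
            marks = ", ".join("?" for _ in values)
            self._run("INSERT INTO %s VALUES (%s)" % (self.name, marks), values)

        def select(self, where="1=1", params=()):
            return self._run("SELECT * FROM %s WHERE %s" % (self.name, where), params)

    t = Table("people", ["name TEXT", "age INTEGER"])
    t.insert(("alice", 30))
    print(t.select("age > ?", (20,)))   # [('alice', 30)]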
I would strongly recommend not reinventing the wheel. SQLite will have far fewer bugs, and save you a bunch of time. (You may also want to look into Firefox's fairly recent switch to using SQLite to store history, etc., I think they got some pretty significant speedups from doing so.)
Also, SQLite's well-optimized C implementation is probably quite a bit faster than any pure Python implementation.
You could try to make a sqlite wrapper with the same interface as your class Table, so that you keep your code clean and get sqlite's performance.
If you're doing database work, use a database; if you're not, then don't. Using tables, it sounds like you are. I'd recommend using an ORM to make it more Pythonic. SQLAlchemy is the most flexible (though it's not strictly just an ORM).
So I'm writing yet another Twisted based daemon. It'll have an xmlrpc interface as usual so I can easily communicate with it and have other processes interchange data with it as needed.
This daemon needs to access a database. We've been using SQL Alchemy in place of hard coding SQL strings for our latest projects - those mostly done for web apps in Pylons.
We'd like to do the same for this app and re-use library code that makes use of SQL Alchemy. So what to do? Well of course since that library was written for use in a Pylons app it's all the straight-forward blocking style code that everyone is accustomed to and all of the non-blocking is magically handled by Pylons via threading, thread locals, scoped sessions and so on.
So now for Twisted I guess I'm a bit stuck. I could:
Just write the sql I need directly if it's minimal and use the dbapi pool in twisted to do runInteractions etc when I need to hit the db.
Use the objects and inherently blocking methods in our library and block now and then in my Twisted daemon. Bah.
Use sAsync, which was last updated in 2008, and kind of reuse the models we have defined already (but not really); this doesn't address the fact that the library code needs to work in Pylons too. Does that even work with the latest version of SQLAlchemy? Who knows. That project looked great though -- why was it apparently abandoned?
Spawn a separate subprocess and have it deal with the library code and all its blocking, with the results returned to my daemon when ready as objects marshalled via YAML over xmlrpc.
Use deferToThread and then expunge the objects returned having made sure to do eager loads so that I have all my stuff that I might need. Seems kind of ugha to me.
I'm also stuck using Python 2.5.4 atm so no 2.6 yet and I don't think I can just do an import from future to get access to the cool new multiprocessing module stuff in there. That's OK though I guess as we've got dealing with interprocess communication down pretty well.
So I'm leaning towards option 4 mostly as that would avoid the mortal sin of logic duplication with option 1 while also staying the heck away from threads.
My first attempt though will be option 2 to just get the thing going and then separate out the calls to the library code perhaps into a separate process if it looks like there's a good chance that something might take a bit too long to block on. Sad. Maybe a combination of Stackless Python and Twisted would be interesting here.
Any better ideas?
In the intervening couple of years, Alex Gaynor created https://github.com/alex/alchimia which may be a better central repository for doing integration with SQLAlchemy and Twisted.
Firstly, I can unfortunately only second your opinion that twisted and SQLAlchemy don't play along very well. I have worked some with both and would be somewhat afraid of the complexity that would arise from putting them together.

All the database integration layers that I know of to date use twisted's threading integration layer, and if you want to avoid that at all costs you are pretty much stuck with point 4 in your list.

On the other hand, I have seen examples of database connecting code using deferToThread() and friends that worked very well.
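For illustration, a minimal sketch of that pattern (the blocking function is a made-up stand-in for whatever ORM/session code your library exposes; return plain data from it so nothing lazy-loads after the thread is done):

    # Sketch: run a blocking call in Twisted's thread pool, get a Deferred back.
    from twisted.internet import reactor
    from twisted.internet.threads import deferToThread

    def load_repositories():
        # Imagine a blocking SQLAlchemy session.query(...).all() here;
        # convert the results to plain tuples/dicts before returning.
        return [("repo-1", "http://example.org/repo-1")]

    def on_result(rows):
        print("got %d rows" % len(rows))
        reactor.stop()

    def on_error(failure):
        failure.printTraceback()
        reactor.stop()

    d = deferToThread(load_repositories)
    d.addCallbacks(on_result, on_error)
    reactor.run()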
Anyway, some pointers if you'd be ready to consider other frameworks than SQLAlchemy:

The DivMod guys have been doing some tentative work on twisted-database integration based on the Storm ORM (google for "storm orm"). See this link for an example: http://divmod.readthedocs.org/en/latest/products/nevow/storm-approach.html

Also, head over to DivMod's site and have a look at the sources of their Axiom db layer (probably not of any use to you directly since it's Sqlite only, but its principles might be useful).
There's a storm branch that you can use with twisted directly (internally it does the defer to thread stuff) on launchpad https://code.launchpad.net/~therve/storm/twisted-integration. I've used it nicely.
Sadly, sqlalchemy is significantly more complex in implementation to audit for async usage. If you really want to use it, I'd recommend an out-of-process approach with a storage rpc layer.
Alternatively, if you're feeling adventurous and using postgresql, the latest psycopg2 supports true async usage (https://launchpad.net/txpostgres), and the storm source is pretty simple to hack on ;-)
Incidentally, the storm you tried last year may not have had the C extension on by default (it is in the latest releases), which might account for your speed issues.
Perhaps twistar is what you're looking for. It's a native active record (aka ORM) implementation for twisted, working on top of twisted.enterprise.adbapi.
http://findingscience.com/twistar/
I want to use sqlite memory database for all my testing and Postgresql for my development/production server.
But the SQL syntax is not the same in both databases. For example, SQLite has AUTOINCREMENT and PostgreSQL has SERIAL.
Is it easy to port SQL scripts from sqlite to postgresql... what are your solutions?
If you want me to use standard SQL, how should I go about generating primary keys in both databases?
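To make the difference concrete, here's roughly what I mean (table and column names made up):

    # Sketch: the same auto-incrementing primary key, per backend.
    SQLITE_DDL = "CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
    POSTGRES_DDL = "CREATE TABLE items (id SERIAL PRIMARY KEY, name TEXT)"

    def create_items_table(conn, backend):
        ddl = SQLITE_DDL if backend == "sqlite" else POSTGRES_DDL
        conn.cursor().execute(ddl)
        conn.commit()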
My suggestion would be: don't. The capabilities of Postgresql are far beyond what SQLite can provide, particularly in the areas of date/numeric support, functions and stored procedures, ALTER support, constraints, sequences, other types like UUID, etc., and even using various SQLAlchemy tricks to try to smooth that over will only get you a slight bit further. In particular date and interval arithmetic are totally different beasts on the two platforms, and SQLite has no support for precision decimals (non floating-point) the way PG does. PG is very easy to install on every major OS and life is just easier if you go that route.
Don't do it. Don't test in one environment and develop and release in another. You're asking for buggy software with this process.
Although we started with sqlite for our testing environment, we are seriously looking at having postgres running for each developer. We have scripts that build the test database that our unit tests run against, and we have a 'development' version that the devs use.
We investigated running postgres 'in memory' on ramdisk, but this discussion: http://dbaspot.com/forums/postgresql/395602-memory-postgresql-database.html suggests that it isn't necessary.
We haven't run into any problems yet, but it is still early in the development process and we haven't had to do anything too fancy yet.
zzzeek points out some items that will probably trip us up soon :(
Best make the move now....