I'm writing a basic membership web app in Python.
Is it always bad practice to abandon databases completely and simply pickle a python dictionary to a file (http://docs.python.org/2/library/pickle.html)? The program should never have to deal with more than ca. 500 members, and will only keep a few fields about each member, so I don't see scaling being an issue. And as this app may need to be run locally as well, it's easier to make it run on various machines if things are kept as simple as possible.
So the options are to set up a mysql db, or to simply pickle to a file. I would prefer the 2nd, but would like to know if this is a terrible idea.
No, if you can keep all the data in memory a database is not necessary, and just pickling everything could work.
Notable drawbacks of pickling are that it is not secure (somebody can replace your data with something else, including executable code) and that it's Python-only. With a database you typically update the data and write to disk in the same step, while with pickling you have to remember to save the data to disk yourself. Since this takes time, as you write all the data at once, it's usually done only when you exit, so a crash means you lose all your changes.
There is a middle ground though: sqlite. It's a lightweight, simple SQL database included in Python (since Python 2.5), so you don't have to install any extra software to use it. It is often a quick solution. Since SQLAlchemy has SQLite support, you can even use SQLAlchemy with SQLite as the default database, and hence provide an "upgrade path" to more serious databases.
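A minimal sketch of the sqlite route, assuming a made-up members table with just a couple of fields:

    import sqlite3

    conn = sqlite3.connect("members.db")  # example filename; ":memory:" also works
    conn.execute(
        "CREATE TABLE IF NOT EXISTS members (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    conn.execute("INSERT INTO members (name, email) VALUES (?, ?)",
                 ("Alice", "alice@example.com"))
    conn.commit()

    for row in conn.execute("SELECT name, email FROM members ORDER BY name"):
        print(row)
    conn.close()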
Related
Flask example applications Flasky and Flaskr create, drop, and re-seed their entire database between each test. Even if this doesn't make the test suite run slowly, I wonder if there is a way to accomplish the same thing while not being so "destructive". I'm surprised there isn't a "softer" way to roll back any changes. I've tried a few things that haven't worked.
For context, my tests call endpoints through the Flask test_client using something like self.client.post('/things'), and within the endpoints session.commit() is called.
I've tried making my own "commit" function that actually only flushes during tests, but then if I make two sequential requests like self.client.post('/things') and self.client.get('/things'), the newly created item is not present in the result set because the new request has a new request context with a new DB session (and transaction) which is not aware of changes that are merely flushed, not committed. This seems like an unavoidable problem with this approach.
I've tried using subtransactions with db.session.begin(subtransactions=True), but then I run into an even worse problem. Because I have autoflush=False, nothing actually gets committed OR flushed until the outer transaction is committed. So again, any requests that rely on data modified by earlier requests in the same test will fail. Even with autoflush=True, the earlier problem would occur for sequential requests.
I've tried nested transactions with the same result as subtransactions, and apparently they don't do what I was hoping they would do. I saw that nested transactions issue a SAVEPOINT command to the DB. I hoped that would allow commits to happen, visible to other sessions, and then be able to rollback to that save point at an arbitrary time, but that's not what they do. They're used within transactions, and have the same issues as the previous approach.
Update: Apparently there is a way of using nested transactions on a Connection rather than a Session, which might work but requires some restructuring of an application to use a Connection created by the test code. I haven't tried this yet; I'll get around to it eventually, but meanwhile I hope there's another way. Some say this approach may not work with MySQL due to a distinction between "real nested transactions" and savepoints, but the Postgres documentation also says to use SAVEPOINT rather than attempting to nest transactions. Since both databases actually implement "nested" behaviour with savepoints, I suspect the warning can be disregarded: if it works on one it will probably work on the other.
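For reference, a rough sketch of that Connection-based pattern, based on the "joining a Session into an external transaction" recipe in the SQLAlchemy docs; the engine URL is a placeholder, and wiring the app to use this session is the restructuring mentioned above:

    from sqlalchemy import create_engine
    from sqlalchemy.orm import Session

    engine = create_engine("postgresql:///testdb")  # placeholder URL

    def run_test_in_rolled_back_transaction(test_body):
        connection = engine.connect()
        trans = connection.begin()          # outermost transaction, never committed
        session = Session(bind=connection)  # the app under test must use this session
        try:
            test_body(session)              # endpoints may call session.commit() freely
        finally:
            session.close()
            trans.rollback()                # everything the test did disappears here
            connection.close()

Note that if the code under test also calls rollback(), the full recipe in the SQLAlchemy docs additionally starts a SAVEPOINT and uses an event listener to restart it.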
Another option that avoids a DB drop_all, create_all, and re-seeding with data, is to manually un-do the changes that a test introduces. But when testing an endpoint, many rows could be inserted into many tables, and reliably undoing this manually would be both exhausting and bug prone.
After trying all those things, I start to see the wisdom in dropping and creating between tests. However, is there something I've tried above that SHOULD work, but I'm simply doing something incorrectly? Or is there yet another method that someone is aware of that I haven't tried yet?
Update: Another method I just found on StackOverflow is to truncate all the tables instead of dropping and creating them. This is apparently about twice as fast, but it still seems heavy-handed and isn't as convenient as a rollback (which would not delete any sample data placed in the DB prior to the test case).
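For completeness, the truncate approach can be written generically against SQLAlchemy's metadata rather than issuing raw TRUNCATE statements; a sketch, assuming Flask-SQLAlchemy's db object:

    def truncate_all_tables(db):
        # Delete child tables before parents so foreign keys don't get in the way.
        for table in reversed(db.metadata.sorted_tables):
            db.session.execute(table.delete())
        db.session.commit()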
For unit tests I think the standard approach of regenerating the entire database is what makes the most sense, as you've seen in my examples and many others. But I agree, for large applications this can take a lot of time during your test run.
Thanks to SQLAlchemy you can get away with writing a lot of generic database code that runs on your production database, which might be MySQL, Postgres, etc., and at the same time runs on sqlite for tests. It is not possible for every application out there to use 100% generic SQLAlchemy, since sqlite has some important differences from the others, but in many cases this works well.
So whenever possible, I set up a sqlite database for my tests. Even for large databases, using an in-memory sqlite database should be pretty fast. Another very fast alternative is to generate your tables once, make a backup of your sqlite file with all the empty tables, then before each test restore the file instead of doing a create_all().
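A sketch of the in-memory variant, assuming the usual Flask-SQLAlchemy setup with an application factory (create_app and db stand in for the names your project already has):

    import unittest

    # In the test configuration; "sqlite://" with no path means an in-memory database.
    SQLALCHEMY_DATABASE_URI = "sqlite://"

    class BaseTestCase(unittest.TestCase):
        def setUp(self):
            self.app = create_app("testing")          # hypothetical app factory
            self.app_context = self.app.app_context()
            self.app_context.push()
            db.create_all()                           # cheap: the tables live in memory
            self.client = self.app.test_client()

        def tearDown(self):
            db.session.remove()
            db.drop_all()
            self.app_context.pop()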
I have not explored the idea of doing an initial backup of the database with empty tables and then using file-based restores between tests for MySQL or Postgres, but in theory that should work as well, so I guess that is one solution you haven't mentioned in your list. You will need to stop and restart the db service between tests, though.
I noticed that a significant part of my (pure Python) code deals with tables. Of course, I have class Table which supports the basic functionality, but I end up adding more and more features to it, such as queries, validation, sorting, indexing, etc.
I am starting to wonder whether it would be a good idea to remove my class Table and refactor the code to use a regular relational database that I will instantiate in memory.
Here's my thinking so far:
Performance of queries and indexing would improve, but communication between Python code and a separate database process might be less efficient than calls between Python functions. I assume that would be too much overhead, so I would have to go with sqlite, which comes with Python and lives in the same process. I hope this means it's a pure performance gain (at the cost of non-standard SQL definitions and the limited features of sqlite).
With SQL, I will get a lot more powerful features than I would ever want to code myself. Seems like a clear advantage (even with sqlite).
I won't need to debug my own implementation of tables, but debugging mistakes in SQL is hard since I can't put breakpoints or easily print out interim state. I don't know how to judge the overall impact on code reliability and debugging time.
The code will be easier to read, since instead of calling my own custom methods I would write SQL (everyone who needs to maintain this code knows SQL). However, the Python code to deal with the database might be uglier and more complex than the code that uses the pure Python class Table. Again, I don't know which is better on balance.
Any corrections to the above, or anything else I should think about?
SQLite does not run in a separate process. So you don't actually have any extra overhead from IPC. But IPC overhead isn't that big, anyway, especially over e.g., UNIX sockets. If you need multiple writers (more than one process/thread writing to the database simultaneously), the locking overhead is probably worse, and MySQL or PostgreSQL would perform better, especially if running on the same machine. The basic SQL supported by all three of these databases is the same, so benchmarking isn't that painful.
You generally don't have to do the same type of debugging on SQL statements as you do on your own implementation. SQLite works, and is fairly well debugged already. It is very unlikely that you'll ever have to debug "OK, that row exists, why doesn't the database find it?" and track down a bug in index updating. Debugging SQL is completely different than procedural code, and really only ever happens for pretty complicated queries.
As for debugging your code, you can fairly easily centralize your SQL calls and add tracing to log the queries you are running, the results you get back, etc. The Python SQLite interface may already have this (not sure, I normally use Perl). It'll probably be easiest to just make your existing Table class a wrapper around SQLite.
I would strongly recommend not reinventing the wheel. SQLite will have far fewer bugs, and save you a bunch of time. (You may also want to look into Firefox's fairly recent switch to using SQLite to store history, etc., I think they got some pretty significant speedups from doing so.)
Also, SQLite's well-optimized C implementation is probably quite a bit faster than any pure Python implementation.
You could try to make a sqlite wrapper with the same interface as your class Table, so that you keep your code clean and you get the sqlite performance.
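A rough sketch of such a wrapper, keeping a Table-like interface while delegating storage and queries to sqlite (the method names are just illustrative, and table/column names are assumed to come from trusted code, not user input):

    import sqlite3

    class Table(object):
        """Same interface as before, but rows are stored and queried via sqlite."""

        def __init__(self, name, columns, db_path=":memory:"):
            self.name = name
            self.conn = sqlite3.connect(db_path)
            self.conn.execute("CREATE TABLE IF NOT EXISTS %s (%s)"
                              % (name, ", ".join(columns)))

        def insert(self, row):
            placeholders = ", ".join("?" for _ in row)
            self.conn.execute("INSERT INTO %s VALUES (%s)"
                              % (self.name, placeholders), row)
            self.conn.commit()

        def select(self, where="1=1", params=()):
            return self.conn.execute(
                "SELECT * FROM %s WHERE %s" % (self.name, where), params).fetchall()

    # Usage: Table("people", ["name TEXT", "age INTEGER"]).insert(("Alice", 30))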
If you're doing database work, use a database; if you're not, then don't. Since you're using tables, it sounds like you are. I'd recommend using an ORM to make it more Pythonic. SQLAlchemy is the most flexible (though it's not strictly just an ORM).
What would be the best way to handle lightweight crash recovery for my program?
I have a Python program that runs a number of test cases and the results are stored in a dictionary which serves as a cache. If I could save (and then restore) each item that is added to the dictionary, I could simply run the program again and the caching would provide suitable crash recovery.
You may assume that the keys and values in the dictionary are easily convertible to strings ie. using either str or the pickle module.
I want this to be completely cross platform - well at least as cross platform as Python is
I don't want to simply write out each value to a file and load it back in, because my program might crash while I am writing the file
UPDATE: This is intended to be a lightweight module so a DBMS is out of the question.
UPDATE: Alex is correct in that I don't actually need to protect against crashes while writing out, but there are circumstances where I would like to be able to manually terminate it in a recoverable state.
UPDATE: Added a highly limited solution using standard input below
There's no good way to guard against "your program crashing while writing a checkpoint to a file", but why should you worry so much about that?! What ELSE is your program doing at that time BESIDES "saving checkpoint to a file", that could easily cause it to crash?!
It's hard to beat pickle (or cPickle) for portability of serialization in Python, but that's just about "turning your keys and values to strings". For saving key-value pairs (once stringified), few approaches are safer than just appending to a file (don't pickle to files if your crashes are far, far more frequent than normal, as you suggest they are).
If your environment is incredibly crash-prone for whatever reason (very cheap HW?-), just make sure you close the file (and flush, plus os.fsync, if the OS is also crash-prone;-), then reopen it for append. This way, the worst that can happen is that the very latest append is incomplete (due to a crash in the middle of things) -- then you just catch the exception raised by unpickling that incomplete record and redo only the things that weren't saved (because they weren't completed due to a crash, OR because they were completed but not fully saved due to a crash, which comes to much the same thing in the end).
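A sketch of that append-and-reload pattern (the file name and the record format are just an example):

    import pickle

    CHECKPOINT = "results.pck"   # example filename

    def save_result(key, value):
        # Append one pickled (key, value) record and push it towards disk right away.
        with open(CHECKPOINT, "ab") as f:
            pickle.dump((key, value), f)
            f.flush()
            # os.fsync(f.fileno())   # add (and import os) if the OS itself is crash-prone

    def load_results():
        # Rebuild the cache; a final record cut short by a crash is simply skipped.
        results = {}
        try:
            with open(CHECKPOINT, "rb") as f:
                while True:
                    try:
                        key, value = pickle.load(f)
                    except Exception:    # clean EOF, or a truncated last record
                        break
                    results[key] = value
        except IOError:
            pass                         # no checkpoint file yet
        return results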
If you have the option of checkpointing to a database engine (instead of just doing so to files), consider it seriously! The DB engine will keep transaction logs and ensure ACID properties, making your application-side programming much easier IF you can count on that!-)
The pickle module supports serializing objects to a file (and loading from file):
http://docs.python.org/library/pickle.html
One possibility would be to create a number of smaller files ... each representing a subset of the state that you're trying to preserve and each with a checksum or tag indicating that it's complete as the last line/datum of the file (just before the file is closed).
If the checksum/tag is good then the rest of the data can be considered valid ... though the program would then have to find all of these files, open and read all of them, and use the metadata you've provided (in their headers or their names?) to determine which ones constitute the most recent cohesive state representation (or checkpoint) from which you can continue processing.
Without knowing more about the nature of the data that you're working with it's impossible to be more specific.
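Still, as a rough illustration of the idea, assuming the state is a picklable dictionary and using an md5 of the payload as the completeness tag:

    import hashlib
    import os
    import pickle
    import time

    CKPT_DIR = "checkpoints"   # example directory

    def write_checkpoint(state):
        if not os.path.isdir(CKPT_DIR):
            os.makedirs(CKPT_DIR)
        payload = pickle.dumps(state)
        tag = hashlib.md5(payload).hexdigest().encode("ascii")
        name = "state-%d.ckpt" % int(time.time())   # the name doubles as "how recent"
        with open(os.path.join(CKPT_DIR, name), "wb") as f:
            f.write(payload)
            f.write(b"\n" + tag)                    # the tag is always the last datum

    def load_latest_checkpoint():
        # The newest file whose trailing tag matches its payload wins.
        for name in sorted(os.listdir(CKPT_DIR), reverse=True):
            with open(os.path.join(CKPT_DIR, name), "rb") as f:
                data = f.read()
            payload, _, tag = data.rpartition(b"\n")
            if tag == hashlib.md5(payload).hexdigest().encode("ascii"):
                return pickle.loads(payload)
        return None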
You can use files, of course, or you could use a DBMS system just about as easily. Any decent DBMS (PostgreSQL, MySQL if you're using the proper storage back-ends) can give you ACID guarantees and transactional support. So the data you read back should always be consistent with the constraints that you put in your schema and/or with the transactions (BEGIN, COMMIT, ROLLBACK) that you processed.
A possible advantage of posting your serialized data to a DBMS is that you can host the DBMS on a separate system (which is unlikely to suffer the same instabilities as your test host at the same times).
Pickle/cPickle have problems.
I use the JSON module to serialize objects out. I like it because not only does it work on any OS, but it will work fine in other programming languages, too; many other languages and platforms have readily-accessible JSON deserialization support, which makes it easy to use the same objects in different programs.
Solution with severe restrictions
If I don't worry about it crashing while writing out and I only want to allow manual termination, I can use standard input to control this. Unfortunately, this can only terminate the program when a control point is reached. This could be solved by creating a new thread to read standard input. This thread could use a global lock to check if the main thread is inside a critical section (writing to a file) and terminate the program if this is not the case (a sketch follows after the list of downsides).
Downsides:
This is reasonably complex
It adds an extra thread
It stops me using standard input for anything else
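A sketch of that thread-plus-lock idea (entirely illustrative; "quit" is an arbitrary sentinel):

    import os
    import sys
    import threading

    write_lock = threading.Lock()   # the main thread holds this while writing a file

    def watch_stdin():
        for line in sys.stdin:
            if line.strip() == "quit":
                with write_lock:    # wait for any in-progress write to finish
                    os._exit(0)     # terminate at a recoverable point

    watcher = threading.Thread(target=watch_stdin)
    watcher.daemon = True           # don't keep the process alive just for this thread
    watcher.start()

    # In the main loop, every checkpoint write takes the same lock:
    #     with write_lock:
    #         save_result(key, value)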
Yes, this is as stupid a situation as it sounds. Due to some extremely annoying hosting restrictions and unresponsive tech support, I have to use a CSV file as a database.
While I can use MySQL with PHP, I can't use it with the Python backend of my program because of install issues with the host. I can't use SQLite with PHP because of more install issues, but can use it as it's a Python builtin.
Anyways, now, the question: is it possible to update values SQL-style in a CSV database? Or should I keep on calling the help desk?
Don't walk, run to get a new host immediately. If your host won't even get you the most basic of free databases, it's time for a change. There are many fish in the sea.
At the very least I'd recommend an xml data store rather than a csv. My blog uses an xml data provider and I haven't had any issues with performance at all.
Take a look at this: http://www.netpromi.com/kirbybase_python.html
Keep calling on the help desk.
While you can use a CSV as a database, it's generally a bad idea. You would have to implement your own locking, searching, and updating, and be very careful with how you write it out to make sure that it isn't erased in case of a power outage or other abnormal shutdown. There will be no transactions, no query language unless you write your own, etc.
I couldn't imagine this ever being a good idea. The current mess I've inherited writes vital billing information to CSV and updates it after projects are complete. It runs horribly and thousands of dollars go missing each month. Given your current restrictions, I'd consider finding better hosting.
You can probably use sqlite3 as a more real database. It's hard to imagine hosting that won't allow you to install it as a Python module.
Don't even think of using CSV; your data will be corrupted and lost faster than you can say "s#&t"
"Anyways, now, the question: is it possible to update values SQL-style in a CSV database?"
Technically, it's possible. However, it can be hard.
If both PHP and Python are writing the file, you'll need to use OS-level locking to assure that they don't overwrite each other. Each part of your system will have to lock the file, rewrite it from scratch with all the updates, and unlock the file.
This means that PHP and Python must load the entire file into memory before rewriting it.
There are a couple of ways to handle the OS locking.
Use the same file and actually use some OS lock module. Both processes have the file open at all times.
Write to a temp file and do a rename (sketched below). This means each program must open and read the file for each transaction. Very safe and reliable. A little slow.
Or.
You can rearchitect it so that only Python writes the file. The front-end reads the file when it changes, and drops off little transaction files to create a work queue for Python. In this case, you don't have multiple writers -- you have one reader and one writer -- and life is much, much simpler.
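For the temp-file-and-rename option above, a sketch of what the rewrite step could look like (the locking around it is left out; the column layout and the update rule are made up):

    import csv
    import os
    import tempfile

    def rewrite_csv(path, update_row):
        # Apply update_row to every row and atomically replace the file.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as tmp, open(path) as src:
            writer = csv.writer(tmp)
            for row in csv.reader(src):
                writer.writerow(update_row(row))
        # rename() is atomic on POSIX: readers see the old file or the new one,
        # never a half-written file.
        os.rename(tmp_path, path)

    # "UPDATE members SET paid='yes' WHERE name='Alice'" then becomes:
    # rewrite_csv("members.csv",
    #             lambda row: [row[0], row[1], "yes"] if row[0] == "Alice" else row)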
I'd keep calling the help desk. You don't want to use CSV for data if it's relational at all. It's going to be a nightmare.
I agree. Tell them that 5 random strangers agree that you being forced into a corner to use CSV is absurd and unacceptable.
If I understand you correctly: you need to access the same database from both python and php, and you're screwed because you can only use mysql from php, and only sqlite from python?
Could you further explain this? Maybe you could use xml-rpc or plain http requests with xml/json/... to get the php program to communicate with the python program (or the other way around?), so that only one of them directly accesses the db.
If this is not the case, I'm not really sure what the problem is.
It's technically possible. For example, Perl has DBD::CSV that provides a driver that runs SQL queries on the CSV file.
That being said, why not run off a SQLite database on your server?
What about postgresql? I've found that quite nice to work with, and python supports it well.
But I really would look for another provider unless it's really not an option.
Disagreeing with the noble colleagues, I often use DBD::CSV from Perl. There are good reasons to do it. Foremost is that data updates are made simple using a spreadsheet. As a bonus, since I am using SQL queries, the application can be easily upgraded to a real database engine. Bear in mind these were extremely small databases in single-user applications.
So rephrasing the question: is there a Python module equivalent to Perl's DBD::CSV?
I'm writing a reasonably complex web application. The Python backend runs an algorithm whose state depends on data stored in several interrelated database tables which does not change often, plus user specific data which does change often. The algorithm's per-user state undergoes many small changes as a user works with the application. This algorithm is used often during each user's work to make certain important decisions.
For performance reasons, re-initializing the state on every request from the (semi-normalized) database data quickly becomes non-feasible. It would be highly preferable, for example, to cache the state's Python object in some way so that it can simply be used and/or updated whenever necessary. However, since this is a web application, there are several processes serving requests, so using a global variable is out of the question.
I've tried serializing the relevant object (via pickle) and saving the serialized data to the DB, and am now experimenting with caching the serialized data via memcached. However, this still has the significant overhead of serializing and deserializing the object often.
I've looked at shared memory solutions but the only relevant thing I've found is POSH. However POSH doesn't seem to be widely used and I don't feel easy integrating such an experimental component into my application.
I need some advice! This is my first shot at developing a web application, so I'm hoping this is a common enough issue that there are well-known solutions to such problems. At this point solutions which assume the Python back-end is running on a single server would be sufficient, but extra points for solutions which scale to multiple servers as well :)
Notes:
I have this application working, currently live and with active users. I started out without doing any premature optimization, and then optimized as needed. I've done the measuring and testing to make sure the above-mentioned issue is the actual bottleneck. I'm pretty sure I could squeeze more performance out of the current setup, but I wanted to ask if there's a better way.
The setup itself is still a work in progress; assume that the system's architecture can be whatever suits your solution.
Be cautious of premature optimization.
Addition: The "Python backend runs an algorithm whose state..." is the session in the web framework. That's it. Let the Django framework maintain session state in cache. Period.
"The algorithm's per-user state undergoes many small changes as a user works with the application." Most web frameworks offer a cached session object. Often it is very high performance. See Django's session documentation for this.
Advice. [Revised]
It appears you have something that works. Leverage it to learn your framework, learn the tools, and learn what knobs you can turn without breaking a sweat. Specifically, use session state.
Second, fiddle with caching, session management, and things that are easy to adjust, and see if you have enough speed. Find out whether the MySQL socket or a named pipe is faster by trying them out. These are the no-programming optimizations.
Third, measure performance to find your actual bottleneck. Be prepared to provide (and defend) measurements that are fine-grained enough to be useful and stable enough to provide a meaningful comparison of alternatives.
For example, show the performance difference between persistent sessions and cached sessions.
I think that the multiprocessing framework has what might be applicable here - namely the shared ctypes module.
Multiprocessing is fairly new to Python, so it might have some oddities. I am not quite sure whether the solution works with processes not spawned via multiprocessing.
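A small sketch of the shared-ctypes idea, for processes started via multiprocessing itself (whether it helps under your web server's process model is exactly the open question above):

    from multiprocessing import Array, Process, Value

    def worker(counter, scores):
        with counter.get_lock():       # Value carries its own lock
            counter.value += 1
        scores[0] = 3.14               # shared array of C doubles

    if __name__ == "__main__":
        counter = Value("i", 0)             # shared C int
        scores = Array("d", [0.0] * 10)     # shared array of 10 doubles
        procs = [Process(target=worker, args=(counter, scores)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(counter.value, scores[0])     # expect 4 and 3.14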
I think you can give ZODB a shot.
"A major feature of ZODB is transparency. You do not need to write any code to explicitly read or write your objects to or from a database. You just put your persistent objects into a container that works just like a Python dictionary. Everything inside this dictionary is saved in the database. This dictionary is said to be the "root" of the database. It's like a magic bag; any Python object that you put inside it becomes persistent."
Initially it was an integral part of Zope, but a standalone package is also available.
It has the following limitation:
"Actually there are a few restrictions on what you can store in the ZODB. You can store any objects that can be "pickled" into a standard, cross-platform serial format. Objects like lists, dictionaries, and numbers can be pickled. Objects like files, sockets, and Python code objects, cannot be stored in the database because they cannot be pickled."
I have read about it but haven't given it a shot myself.
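From the documentation, using it looks roughly like this (a sketch with the standalone ZODB package; the file name and keys are just examples):

    from ZODB import DB, FileStorage
    import transaction

    storage = FileStorage.FileStorage("appstate.fs")   # example file name
    db = DB(storage)
    connection = db.open()
    root = connection.root()          # the "magic bag" dictionary

    root["state_for_user_42"] = {"scores": [1, 2, 3]}  # any picklable object
    transaction.commit()

    connection.close()
    db.close()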
Another possibility could be an in-memory sqlite db. That may speed up the process a bit, being an in-memory db, but you would still have to do the serialization and all.
Note: an in-memory db is expensive on resources.
Here is a link: http://www.zope.org/Documentation/Articles/ZODB1
First of all, your approach is not a common web development practice. Even when multithreading is used, web applications are designed to be able to run in multi-process environments, for both scalability and easier deployment.
If you just need to initialize a large object and do not need to change it later, you can do so easily by using a global variable that is initialized while your WSGI application is being created, or when the module containing the object is loaded; multi-processing will do fine for you.
If you need to change the object and access it from every thread, you need to be sure your object is thread-safe; use locks to ensure that. And use a single server context, i.e. one process. Any multi-threaded Python server will serve you well; FCGI is also a good choice for this kind of design.
But if multiple threads are accessing and changing your object, the locks may have a really bad effect on your performance gain, which is likely to make all the benefits go away.
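A sketch of that single-process, multi-threaded variant (the WSGI wiring is omitted; the state object, its methods, and the loader are hypothetical):

    import threading

    # Initialized once, when the (single) server process imports this module.
    algorithm_state = load_initial_state_from_db()   # hypothetical loader
    state_lock = threading.Lock()

    def handle_request(user_id, update):
        # Every thread that touches the shared object goes through the same lock.
        with state_lock:
            algorithm_state.apply(user_id, update)   # hypothetical method
            return algorithm_state.decide(user_id)   # hypothetical method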
This is Durus, a persistent object system for applications written in the Python programming language. Durus offers an easy way to use and maintain a consistent collection of object instances used by one or more processes. Access and change of persistent instances is managed through a cached Connection instance which includes commit() and abort() methods so that changes are transactional.
http://www.mems-exchange.org/software/durus/
I've used it before in some research code, where I wanted to persist the results of certain computations. I eventually switched to pytables as it met my needs better.
Another option is to review the requirement for state; it sounds like, if the serialisation is the bottleneck, then the object is very large. Do you really need an object that large?
I know that in Stack Overflow podcast 27 the Reddit guys discuss what they use for state, so that may be useful to listen to.