I intend to have a Python script do many UPDATEs per second on 2,433,000 rows. I am currently trying to keep the dynamic column in Python, as a value in a Python dict, but keeping that dict synchronized with changes in the other columns is becoming more and more difficult, or simply nonviable.
I know I could put autovacuum on overdrive, but I wonder whether that would be enough to keep up with the sheer volume of UPDATEs. If only I could associate a Python variable with each row...
I fear that the VACUUM and disk-write overhead will kill my server.
Any suggestions on how to associate extremely dynamic variables with rows/keys?
Thx
PostgreSQL supports asynchronous notifications using the LISTEN and NOTIFY commands. An application (client) LISTENs for notifications on a channel name (e.g. "table_updated"). The database itself can be made to issue notifications either manually, i.e. in the code that performs the insertions or modifications (useful when a large number of updates are made, since it allows batch notifications), or automatically inside a row-update TRIGGER.
You could use such notifications to keep your data structures up to date.
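As a rough illustration, here is what the listening side might look like with the psycopg2 driver (an assumption on my part); the table, column and channel names below are placeholders:

```python
# Minimal sketch, assuming psycopg2 and a channel named "table_updated".
# The updating side (application code or a trigger) would execute something like:
#   NOTIFY table_updated, '<row id>';   -- or pg_notify('table_updated', id::text)
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=mydb user=me")       # connection details are placeholders
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

cur = conn.cursor()
cur.execute("LISTEN table_updated;")

cache = {}                                            # the Python dict being kept in sync

while True:
    # Wait up to 5 seconds for the connection's socket to become readable
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        row_id = notify.payload                       # e.g. the id of the changed row
        # Re-read just that row and refresh the cached value
        cur.execute("SELECT dyn_col FROM my_table WHERE id = %s", (row_id,))
        row = cur.fetchone()
        if row is not None:
            cache[row_id] = row[0]
```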
Alternatively (or in combination with the above), you can customize your Python dictionary by overriding the __getitem__() and __contains__() methods (plus has_key() on Python 2) and have them perform lookups as needed, caching the results with timeouts etc.
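A minimal sketch of that lazy-dictionary idea; fetch_row is a placeholder callable standing in for your actual database query:

```python
# Lookups fall through to the database and are cached with a timeout.
import time

class CachingDict(dict):
    def __init__(self, fetch_row, ttl=5.0):
        super().__init__()
        self._fetch_row = fetch_row      # callable: key -> current value from the DB
        self._ttl = ttl                  # seconds before a cached value is considered stale
        self._stamp = {}                 # key -> time the value was cached

    def __getitem__(self, key):
        now = time.time()
        if key not in self._stamp or now - self._stamp[key] > self._ttl:
            value = self._fetch_row(key)     # refresh from the database (raises KeyError if absent)
            super().__setitem__(key, value)
            self._stamp[key] = now
        return super().__getitem__(key)

    def __contains__(self, key):         # covers "key in d"; replaces has_key() on Python 3
        try:
            self[key]
            return True
        except KeyError:
            return False
```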
Related
Is there any way to maintain a variable that is accessible and mutable across processes?
Example
User A makes a request to a view called make_foo, and the operation within that view takes time. We want a flag variable, making_foo = True, that is visible to User B when they make a request (and to any other user or service within that Django app), and that can be set back to False when the work is done.
Don't take the example too seriously; I know about task queues, but what I am trying to understand is the idea of having a shared mutable variable across processes without needing a database.
Is there any best practice to achieve that?
One thing you need to be aware of is that when your Django server is running in production, there is not just one Django process; there will be several worker processes and threads running at the same time.
If you want to share data between processes, even internally, you will need some kind of database to do so, whether that's with SQLite3 or Redis (which I recommend for stuff like this).
I won't go into the details because it's already been said by other people, but Redis is an in-memory database that uses key-value storage (unlike Django's model layer, Redis is essentially a giant dictionary). Redis is fast, and most operations are atomic, which means you are unlikely to encounter race conditions.
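For the making_foo example above, a minimal sketch with redis-py might look like this (host, expiry time and view names are assumptions):

```python
import time
import redis
from django.http import HttpResponse

r = redis.Redis(host="localhost", port=6379, db=0)

def make_foo(request):
    # ex=600 makes the flag expire after 10 minutes, so a crashed worker
    # cannot leave it stuck at True forever.
    r.set("making_foo", 1, ex=600)
    try:
        time.sleep(30)                          # stand-in for the slow operation
    finally:
        r.delete("making_foo")
    return HttpResponse("done")

def status(request):
    busy = r.get("making_foo") is not None      # visible from every worker process
    return HttpResponse("making_foo" if busy else "idle")
```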
I'd like people's views on current design I'm considering for a tornado app. Although I'm using mongoDB to store permanent information I currently have the session information as a python data structure that I've simply added within the Application object at initialisation.
I will need to perform some iteration and manipulation of the sessions while the server is running. I keep debating whether to move these to another mongoDB or just keep it as a python structure.
Is there anything wrong with keeping session information this way?
If you store session data in Python, your application will:
lose it if you stop the Python process;
likely consume more memory, as Python isn't very efficient at memory management (and you will have to keep all the sessions in memory, not just the ones you need right now).
If these are not problems for you, you can go with Python structures. But usually these are serious concerns, and most projects use some external storage for sessions.
I am creating a GUI that depends on information from a MySQL table, and what I want is to display a message every time the table is updated with new data. I am not sure how to do this, or even if it is possible. I have code that retrieves the newest MySQL update, but I don't know how to show a message every time new data comes into the table. Thanks!
Quite a simple and straightforward solution would be to poll the latest auto-increment id from your table and compare it with the value you saw at the previous poll. If it is greater, you have new data. This is called 'active polling'; it's simple to implement and will suffice if you don't do it too often. So you have to store the last id value somewhere in your GUI, and note that this stored value will reset when you restart the GUI application, so be sure to think about what to do at startup. Probably you only need to track insertions that occur while the GUI is running: at startup, poll and store the current id value, then poll periodically and react to its changes.
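A bare-bones sketch of that active-polling idea, assuming an AUTO_INCREMENT column named id and the mysql.connector driver (table name, credentials and the polling interval are all placeholders):

```python
import time
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="me",
                               password="secret", database="mydb")

def latest_id():
    cur = conn.cursor()
    cur.execute("SELECT MAX(id) FROM my_table")
    (value,) = cur.fetchone()
    cur.close()
    return value or 0

last_seen = latest_id()          # snapshot taken at GUI startup

while True:
    time.sleep(2)                # poll every couple of seconds
    current = latest_id()
    if current > last_seen:
        print("New rows arrived: ids %d..%d" % (last_seen + 1, current))
        last_seen = current
```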
@spacediver gives some good advice about the active polling approach. I wanted to post some other options as well.
You could use some type of message passing to communicate notifications between clients; ZeroMQ, Twisted, etc. offer these features. One way to do it is to have the updating client issue the message along with its successful database insert. Clients can all listen on a channel for notifications instead of always polling the db.
If you can't control adding an update message to the client doing the insertions, you could also look at this link for using a database trigger to call a script which would simply issue an update message to your messaging framework. It explains installing a UDF extension that lets you run a sys_exec command in a trigger and call a simple script.
This way clients simply respond to a notification instead of all checking regularly.
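As a rough sketch of the messaging approach (using pyzmq here purely as an example; Twisted or any broker would serve the same purpose), the inserting client publishes after each successful INSERT and every GUI subscribes instead of polling. Port, topic name and the row id are placeholders:

```python
import zmq

def notify_after_insert():
    """Run by the client that inserts rows: publish after a successful INSERT."""
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://*:5556")
    # ... do the INSERT and commit, then:
    pub.send_string("table_updated 42")      # topic plus the new row id, for example

def listen_for_updates():
    """Run by each GUI: block on notifications instead of polling the database."""
    ctx = zmq.Context()
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://localhost:5556")
    sub.setsockopt_string(zmq.SUBSCRIBE, "table_updated")
    while True:
        message = sub.recv_string()          # blocks until a notification arrives
        print("table changed:", message)
```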
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so a connection ends up being used by multiple threads. I tried the check_same_thread=False option, but the number of insertions is pretty random: sometimes everything is included, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack the outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
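One pattern that respects that rule is to give every thread its own connection, for example via threading.local(). A small sketch (the database name and schema are made up):

```python
import sqlite3
import threading

# One-off setup so the inserts below have a table to write into.
setup = sqlite3.connect("data.db")
setup.execute("CREATE TABLE IF NOT EXISTS items (value INTEGER)")
setup.commit()
setup.close()

_local = threading.local()

def get_conn():
    if not hasattr(_local, "conn"):
        # Created inside the thread that will use it, never shared.
        _local.conn = sqlite3.connect("data.db", timeout=30)
    return _local.conn

def insert_row(value):
    conn = get_conn()
    with conn:                               # commits (or rolls back) the transaction
        conn.execute("INSERT INTO items (value) VALUES (?)", (value,))

threads = [threading.Thread(target=insert_row, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```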
I've actually just been working on something very similar:
multiple processes (for me a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack (a rough sketch of this pattern is shown below)
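A rough sketch of that pattern, with dummy data standing in for the web call (file name, schema and pool size are placeholders):

```python
import sqlite3
from multiprocessing import Pool

DB_FILE = "results.db"

def init_db():
    conn = sqlite3.connect(DB_FILE)
    conn.execute("CREATE TABLE IF NOT EXISTS results (task_id INTEGER, payload TEXT)")
    conn.commit()
    conn.close()

def worker(task_id):
    payload = "dummy data for %d" % task_id    # replace with the real web fetch
    # Each worker opens its own connection; nothing is shared between processes.
    conn = sqlite3.connect(DB_FILE, timeout=30)
    with conn:
        conn.execute("INSERT INTO results (task_id, payload) VALUES (?, ?)",
                     (task_id, payload))
    conn.close()
    return task_id

if __name__ == "__main__":
    init_db()
    with Pool(processes=8) as pool:
        pool.map(worker, range(100))
```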
At first I thought I was seeing the same issue as you, but then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)". Perhaps you could try replacing that data fetching with some dummy data to confirm that it really is the sqlite3 connection causing your problems. At least in my tested case (running right now in another window), multiple processes were all able to add rows through their own connections without issues. Your description, though, exactly matches the problem I had when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was a personal hack).
My apologies if this doesn't apply to your case; without code it's hard to know what you're facing, but the description makes me wonder whether you might widen your considerations.
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
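Typical usage looks roughly like this (the file name is a placeholder; values are pickled transparently):

```python
from sqlitedict import SqliteDict

db = SqliteDict("cache.sqlite", autocommit=True)   # the file can be shared across threads
db["some_key"] = {"fetched": True, "value": 42}    # stored like an ordinary dict entry
print(db["some_key"])
db.close()
```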
If I had to build a system like the one you describe using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When only one process accesses the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if/then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app closes the database after every transaction, then you don't need to worry about sessions.
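This is not the asynchat server described above, but a simplified sketch of the same single-writer idea using a dedicated process and a multiprocessing.Queue, where each submitted item is an independent transaction (a list of statements); file name and schema are placeholders:

```python
import sqlite3
from multiprocessing import Process, Queue

def writer(queue, db_file):
    conn = sqlite3.connect(db_file)
    conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    while True:
        transaction = queue.get()          # a list of (sql, params) tuples, or None to stop
        if transaction is None:
            break
        with conn:                         # each list is applied atomically
            for sql, params in transaction:
                conn.execute(sql, params)
    conn.close()

if __name__ == "__main__":
    q = Queue()
    w = Process(target=writer, args=(q, "shared.db"))
    w.start()
    # Any worker process can now submit an independent, self-contained transaction:
    q.put([("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", ("score", "10"))])
    q.put(None)                            # shut the writer down
    w.join()
```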
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
I'm developing software using the Google App Engine.
I have some considerations about the optimal design regarding the following issue: I need to create and save snapshots of some entities at regular intervals.
In the conventional relational db world, I would create db jobs which would insert new summary records.
For example, a job would insert a record for every active user that would contain his current score to the "userrank" table, say, every hour.
I'd like to know what's the best method to achieve this in Google App Engine. I know that there is the Cron service, but does it allow us to execute jobs which will insert/update thousands of records?
I think you'll find that snapshotting every user's state every hour isn't something that will scale well no matter what your framework. A more ordinary environment will disguise this by letting you have longer running tasks, but you'll still reach the point where it's not practical to take a snapshot of every user's data, every hour.
My suggestion would be this: add a 'last snapshot' field and override the put() method of your model (assuming you're using Python; the same is possible in Java, but I don't know the syntax), so that whenever you update a record, it checks whether it's been more than an hour since the last snapshot and, if so, creates and writes a snapshot record.
In order to prevent concurrent updates creating two identical snapshots, you'll want to give the snapshots a key name derived from the time at which the snapshot was taken. That way, if two concurrent updates try to write a snapshot, one will harmlessly overwrite the other.
To get the snapshot for a given hour, simply query for the oldest snapshot newer than the requested period. As an added bonus, since inactive records aren't snapshotted, you're saving a lot of space, too.
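A sketch of that override using the legacy Python db API (the model, property names and key-name scheme below are made up for illustration):

```python
import datetime
from google.appengine.ext import db

class ScoreSnapshot(db.Model):
    user_id = db.StringProperty()
    score = db.IntegerProperty()
    taken_at = db.DateTimeProperty()

class UserScore(db.Model):
    user_id = db.StringProperty()
    score = db.IntegerProperty(default=0)
    last_snapshot = db.DateTimeProperty()

    def put(self, **kwargs):
        now = datetime.datetime.utcnow()
        if (self.last_snapshot is None
                or now - self.last_snapshot >= datetime.timedelta(hours=1)):
            hour = now.strftime("%Y%m%d%H")
            # Key name derived from user and hour: concurrent updates within the
            # same hour harmlessly overwrite the same snapshot entity.
            ScoreSnapshot(key_name="%s-%s" % (self.user_id, hour),
                          user_id=self.user_id,
                          score=self.score,
                          taken_at=now).put()
            self.last_snapshot = now
        return super(UserScore, self).put(**kwargs)
```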
Have you considered using the remote api instead? This way you could get a shell to your datastore and avoid the timeouts. The Mapper class they demonstrate in that link is quite useful and I've used it successfully to do batch operations on ~1500 objects.
That said, cron should work fine too. You do have a limit on the time of each individual request so you can't just chew through them all at once, but you can use redirection to loop over as many users as you want, processing one user at a time. There should be an example of this in the docs somewhere if you need help with this approach.
I would use a combination of cron jobs and the looping URL fetch method detailed here: http://stage.vambenepe.com/archives/549. In this way you can catch your timeouts and begin another request.
To summarize the article: the cron job calls your initial process; you catch the timeout error and call the process again, masked as a second URL. You have to ping-pong between the two URLs to keep App Engine from thinking you are in an accidental loop. You also need to be careful not to loop infinitely: make sure there is an end state for your updating loop, since it would put you over your quotas pretty quickly if it never ended.
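A very rough sketch of that ping-pong pattern with the old webapp framework (handler names, URLs, the app hostname and the per-batch work are all placeholders):

```python
from google.appengine.api import urlfetch
from google.appengine.ext import webapp
from google.appengine.runtime import DeadlineExceededError

NEXT_URL = {"/update_a": "/update_b", "/update_b": "/update_a"}

def process_some_users(cursor):
    """Placeholder for the real batch update; returns the next cursor, or None when done."""
    return None

class UpdateHandler(webapp.RequestHandler):
    def get(self):
        cursor = self.request.get("cursor") or None
        try:
            cursor = process_some_users(cursor)
            if cursor is None:
                return                       # end state reached: stop the loop
        except DeadlineExceededError:
            pass                             # ran out of time, hand off to the other URL
        other = NEXT_URL[self.request.path]
        urlfetch.fetch("http://myapp.appspot.com%s?cursor=%s" % (other, cursor or ""))

app = webapp.WSGIApplication([("/update_a", UpdateHandler),
                              ("/update_b", UpdateHandler)])
```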