I am using PostgreSQL 8.4. I really like the new unnest() and array_agg() features; it is about time they realized the dynamic processing potential of arrays!
Anyway, I am working on web server back ends that use long arrays a lot. There will be two successive processes, each running on a different physical machine. Each process is a light Python application that "manages" SQL queries to the database on its own machine as well as requests from the front ends.
The first process generates an array, which is buffered in an SQL table and accessible via a primary key. When it's done, the first Python app sends the key to the second Python app. The second app, running on a different machine, uses the key to fetch the referenced array from the first machine, then feeds it to its own db to generate a final result.
The reason I send a key is that I am hoping this will make the two processes go faster. But what I would really like is a way for the second database to send a query to the first database, in the hope of minimizing serialization delay and the like.
Any help/advice would be appreciated.
Thanks
Sounds like you want dblink from contrib. This allows some inter-db postgres communication. The pg docs are great and should provide the needed examples.
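For instance, a minimal sketch of how the second machine could pull a buffered array straight from the first machine via dblink, called from Python with psycopg2; the table name buffered_arrays, its columns, and the connection strings are made up for illustration:

```python
# A minimal sketch, assuming psycopg2 and a hypothetical table "buffered_arrays"
# with columns (id integer primary key, arr integer[]) on the first machine.
import psycopg2

# Connect to the *second* machine's database; dblink (from contrib) must be
# installed there, since that is the side issuing the remote query.
conn = psycopg2.connect("dbname=second_db user=app")
cur = conn.cursor()

remote_conninfo = "host=first-machine dbname=first_db user=app"
cur.execute(
    """
    SELECT arr
    FROM dblink(%s, 'SELECT arr FROM buffered_arrays WHERE id = 42')
        AS t(arr integer[])
    """,
    (remote_conninfo,),
)
row = cur.fetchone()  # row[0] comes back as a plain Python list
```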
Not sure I totally understand, but have you looked at NOTIFY/LISTEN? http://www.postgresql.org/docs/8.1/static/sql-listen.html
I am thinking either LISTEN/NOTIFY or something with a cache such as memcache. You would send the key to memcache and have the second Python app retrieve it from there. You could even combine the two: send the key to memcache and NOTIFY your second app that the key is waiting there to be retrieved.
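A rough sketch of the listening side with psycopg2 and python-memcached; the channel name "array_ready" and the cache key "latest_array_key" are made up, and on 8.4 NOTIFY carries no payload, which is why the key itself travels via memcache:

```python
# A rough sketch: block until a NOTIFY arrives, then pick the key up from memcache.
import select
import psycopg2
import psycopg2.extensions
import memcache

conn = psycopg2.connect("dbname=second_db user=app")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN array_ready;")

mc = memcache.Client(["127.0.0.1:11211"])

while True:
    # Wait until the connection's socket becomes readable, i.e. a NOTIFY arrived.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timed out, keep waiting
    conn.poll()
    while conn.notifies:
        conn.notifies.pop()
        key = mc.get("latest_array_key")  # the first app put the key here
        # ... go fetch the referenced array and process it ...
```

The producer side would simply do mc.set("latest_array_key", key) followed by executing "NOTIFY array_ready;" on its own connection.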
I have a very long list of objects that I would like to load from the db into memory only once (meaning not once per session). This list WILL change its values and grow over time from user input. The reason I need it in memory is that I am doing some complex searches on it and want to give a quick answer back.
My question is how do I load a list on the start of the server and keep it alive through sessions letting them all READ/WRITE to it.
Will it be better to do a heavy SQL search instead of keeping the list alive through my server?
The answer is that this is a bad idea; you are opening a Pandora's box, especially since you need write access as well. However, all is not lost. You can quite easily use Redis for this task.
Redis is a persistent data store, but at the same time everything is held in memory. If the Redis server runs on the same device as the web server, access is almost instantaneous.
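A minimal sketch with the redis-py client; the key name "shared_objects" and the JSON-encoded objects are arbitrary stand-ins for your list:

```python
# A minimal sketch: a Redis list shared by every session/process.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Any session can append to the shared list as user input arrives...
r.rpush("shared_objects", json.dumps({"name": "foo", "score": 12}))

# ...and any other session can read the whole list back and search it.
items = [json.loads(raw) for raw in r.lrange("shared_objects", 0, -1)]
matches = [obj for obj in items if obj["score"] > 10]
```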
I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the MySQLdb library, catch exceptions and reconnect when necessary? Or is there an already-made solution as part of SQLAlchemy or another package? Any ideas appreciated!
I think the right answer is to try to handle the connection errors; it sounds like you'd be pulling in a much larger library just for this one feature, and try/except is probably how it's done at whatever level of the stack handles it. If necessary, you could also multithread these tasks, since they're probably IO-bound (i.e. suitable for Python GIL threading as opposed to multiprocessing), and decouple the production from the consumption with a queue, which might take some of the load off the database connection.
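A rough sketch of the catch-and-reconnect approach with MySQLdb; the table "tweets", its columns, the connection details, and tweet_iterator() (standing in for your Twitter API iterator) are all placeholders:

```python
# A rough sketch: re-open the connection and retry the insert when it drops.
import time
import MySQLdb

def get_connection():
    return MySQLdb.connect(host="localhost", user="app", passwd="secret", db="twitter")

conn = get_connection()
for row in tweet_iterator():  # your iterator that yields one list per db row
    while True:
        try:
            cur = conn.cursor()
            cur.execute("INSERT INTO tweets (user_id, text) VALUES (%s, %s)", row)
            conn.commit()
            break
        except MySQLdb.OperationalError:
            # Connection was lost (e.g. MySQL restarted): wait, reconnect, retry.
            time.sleep(5)
            conn = get_connection()
```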
I'd like to do the following:
the queries on a Django site (first server) are sent to a second server (for performance and security reasons)
the query is processed on the second server using sqlite
the Python search function has to keep a lot of data in memory. A simple CGI would always have to reread data from disk, which would further slow down the search process, so I guess I need some daemon to run on the second server.
the search process is slow and I'd like to send partial results back and show them as they arrive.
This looks like a common task, but somehow I don't get it.
I tried Pyro first, which exposes the search class (and then I needed a workaround to avoid SQLite threading issues). I managed to get the complete search results onto the first server, but only as a whole. I don't know how to "yield" the results one by one (as generators cannot be pickled), and I wouldn't know how to write them one by one onto the search result page anyway.
I may need some "push technology", says this thread: https://stackoverflow.com/a/5346075/1389074, which talks about a different framework. But which one?
I don't seem to be searching for the right terms. Maybe someone can point me to some discussions or frameworks that address this task?
Thanks a lot in advance!
You can use Python Tornado WebSockets. This will allow you to establish a two-way connection from the client to the server and return data as it comes. Tornado is an async framework built in Python.
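A minimal sketch of streaming partial results over a Tornado WebSocket; run_search() is a stand-in for your actual search function:

```python
# A minimal sketch: each partial result is pushed to the client as a message.
import tornado.ioloop
import tornado.web
import tornado.websocket

def run_search(query):
    # Placeholder for your real search: yield partial results one by one.
    for i in range(5):
        yield "partial result %d for %r" % (i, query)

class SearchHandler(tornado.websocket.WebSocketHandler):
    def on_message(self, message):
        # The client sends the query as a message; stream results back.
        for partial in run_search(message):
            self.write_message(partial)

app = tornado.web.Application([(r"/search", SearchHandler)])

if __name__ == "__main__":
    app.listen(8888)
    tornado.ioloop.IOLoop.instance().start()
```

Note that if the search itself is slow, you would want to hand the work off to a thread or coroutine rather than looping inside on_message, so the IOLoop is not blocked between partial results.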
I am a self taught Python programmer and I have an idea for a project that I'd like to use to better my understanding of Socket programming and networking in general. I was hoping that someone would be able to tell me if I am on the right path, or point me in another direction.
The general idea is to be able to update a database via a website UI; a Python server would then be constantly checking that database for changes. If it notices changes, it would hand out a set of instructions to the first available connected client.
The objective is to
A - Create a server that reads from a database.
B - Create clients that connect to said server from a remote machine.
C - The server would then constantly read the database and look for changes, for instance a Boolean column that signifies Run/Don't Run.
D - If, say, the run Boolean is true, the server would then hand the instructions off to the first available client.
E - The client itself would then handle updating the database of certain runtime occurrences
Questions/Concerns
A - My first concern is the resources it would take to constantly read the database and search for changes. Is there a better way of doing this? Or could I write a loop to do this and not worry much about it?
B - I have been reading the documentation/tutorials on Twisted, and at the moment it looks like a viable option for handling the connections of multiple clients (20-30, for argument's sake). From what I've read, threading looks to be more of a hassle than it's worth.
Am I on the right track? Any suggestions? Or reading material worth looking at?
Thank you in advance
The Django web framework implements something called Signals, which at the ORM level lets you detect changes to specific database objects and attach handlers to those changes. Since it is an open source project, you might want to check the source code to understand how Django does it while supporting multiple database backends (MySQL, Postgres, Oracle, SQLite).
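A minimal sketch of a post_save signal handler; the Job model, its boolean "run" field, and dispatch_to_first_available_client() are all hypothetical names standing in for your own app:

```python
# A minimal sketch: react whenever a hypothetical Job row is saved via the ORM.
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Job  # hypothetical app and model


def dispatch_to_first_available_client(job):
    pass  # placeholder: hand the instructions to a connected client


@receiver(post_save, sender=Job)
def job_saved(sender, instance, created, **kwargs):
    # Fires whenever a Job is created or updated through the Django ORM.
    if instance.run:
        dispatch_to_first_available_client(instance)
```

One caveat: signals only fire for changes made through the ORM in the same project, so they won't catch rows modified directly in the database by other clients.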
If you want to listen for database changes directly, most databases have some kind of log of every transactional change. For example, MySQL has the binary log, which you can keep reading from to detect changes to the database.
Also, while Twisted is great, I would recommend you use Gevent/Greenlet + this. If you want to integrate with Django, "how to combine django plus gevent the basics?" would help.
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so a connection ends up being used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes it includes everything, sometimes not. Should I parallel-process only parts of the function (fetching data from the web), stack their outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
"Do not use the same database connection at the same time in more than one thread."
"On some operating systems, a database connection should always be used in the same thread in which it was originally created."
See here for more detailed information: Is SQLite thread-safe?
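In practice the safe pattern is to give each thread its own connection rather than sharing one; a minimal sketch (the file name "results.db" and the results table are made up):

```python
# A minimal sketch: every worker thread opens and closes its own connection.
import sqlite3
import threading

def worker(rows):
    conn = sqlite3.connect("results.db", timeout=30)  # wait if the db is locked
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO results (value) VALUES (?)", rows)
    conn.close()

threads = [threading.Thread(target=worker, args=([(i,)],)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```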
I've actually just been working on something very similar:
multiple processes (for me, a processing pool of 4 to 32 workers)
each process worker does some stuff that includes getting information from the web (a call to the Alchemy API for mine)
each process opens its own sqlite3 connection, all to a single file, and each process adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, then I traced it to overlapping and conflicting issues with retrieving the information from the web. Since I was right there I did some torture testing on sqlite and multiprocessing and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)" - perhaps you could try replacing that data fetching with some dummy data to ensure that it is really the sqlite3 connection causing you problems. At least in my tested case (running right now in another window) I found that multiple processes were able to all add through their own connection without issues but your description exactly matches the problem I'm having when two processes step on each other while going for the web API (very odd error actually) and sometimes don't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case, without code it's hard to know what you're facing, but the description makes me wonder if you might widen your considerations.
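If it does turn out to be the web fetch rather than sqlite, a simple retry wrapper is one way to paper over the flaky calls; fetch_from_api() below stands in for whatever web call your worker makes, and the retry count and delay are arbitrary:

```python
# A rough sketch: retry a flaky web call so a transient failure doesn't
# leave an empty slot in the database.
import time

def fetch_with_retry(url, attempts=3, delay=2):
    for attempt in range(attempts):
        try:
            return fetch_from_api(url)  # your actual web/API call
        except Exception:
            time.sleep(delay)  # back off a little before retrying
    return None  # give up after the last attempt; the caller decides what to store
```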
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
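Roughly, usage looks like this; the file name and keys are arbitrary, and values are pickled transparently:

```python
# A minimal sketch of sqlitedict: a dict-like view onto a SQLite file.
from sqlitedict import SqliteDict

db = SqliteDict("results.sqlite", autocommit=True)
db["some_key"] = {"fetched": True, "value": 42}
print(db["some_key"])
db.close()
```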
If I had to build a system like the one you describe, using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then I would write the other processes to use that server. When only one process accesses the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements, not just one, and it might even require some if/then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app is closing the database after every transaction, then you don't need to worry about sessions.
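In the same spirit, though using a multiprocessing.Queue rather than asynchat, a sketch of funnelling every write through one dedicated process might look like this; "results.db", the table, and the queued statements are placeholders:

```python
# A sketch of the single-writer idea: exactly one process touches the SQLite file,
# and workers submit whole transactions as messages on a queue.
import sqlite3
from multiprocessing import Process, Queue

def db_writer(queue):
    conn = sqlite3.connect("results.db")  # assumes the results table already exists
    while True:
        transaction = queue.get()       # a list of (sql, params) pairs
        if transaction is None:
            break                       # sentinel: shut down
        with conn:                      # one atomic transaction per message
            for sql, params in transaction:
                conn.execute(sql, params)
    conn.close()

if __name__ == "__main__":
    q = Queue()
    writer = Process(target=db_writer, args=(q,))
    writer.start()
    # Any worker can now submit a whole transaction as one message:
    q.put([("INSERT INTO results (value) VALUES (?)", (123,))])
    q.put(None)
    writer.join()
```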
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/