I am using Python to stream large amounts of Twitter data into a MySQL database, and I anticipate the job running over a period of several weeks. I have code that interacts with the Twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a way to maintain a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when the connection is lost, sometimes because MySQL itself has been restarted.
Does it make the most sense to use the MySQLdb library, catch exceptions and reconnect when necessary? Or is there a ready-made solution in SQLAlchemy or another package? Any ideas appreciated!
I think the right answer is to try and handle the connection errors yourself; it sounds like you'd be pulling in a much larger library just for this one feature, and a try/except with a reconnect is probably how it's done at whatever level of the stack it lives. If necessary, you could also multithread this work, since it's probably I/O-bound (i.e. suitable for threading under the Python GIL, as opposed to multiprocessing), and decouple production from consumption with a queue, which might take some of the load off the database connection.
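A rough sketch of the catch-and-reconnect idea with MySQLdb (the table layout, credentials, and the stub iterator below are placeholders standing in for the Twitter code described in the question):

    import time
    import MySQLdb


    def tweet_rows():
        """Stand-in for the iterator of row lists produced by the Twitter code."""
        yield ["123", "someuser", "hello world"]


    def connect():
        return MySQLdb.connect(host="localhost", user="streamer",
                               passwd="secret", db="tweets")


    conn = connect()
    for row in tweet_rows():
        while True:
            try:
                cur = conn.cursor()
                cur.execute("INSERT INTO tweets VALUES (%s, %s, %s)", row)
                conn.commit()
                break
            except MySQLdb.OperationalError:
                # Connection lost (e.g. MySQL restarted): back off, reconnect,
                # then retry the same row instead of restarting the whole script.
                time.sleep(5)
                try:
                    conn = connect()
                except MySQLdb.OperationalError:
                    continue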
Related
I've got an sqlite3 database and I want to write to it from multiple threads. I've got several ideas but I'm not sure which I should implement.
create multiple connections, then detect and wait if the DB is locked
use one connection and try to make use of serialized connections (which don't seem to be implemented in Python)
have a background process with a single connection, which collects the queries from all threads and then executes them on their behalf
forget about SQLite and use something like PostgreSQL
What are the advantages of these different approaches, and which is most likely to be fruitful? Are there any other possibilities?
Try to use https://pypi.python.org/pypi/sqlitedict
A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
But take into account its own caveat: "Concurrent requests are still serialized internally, so this 'multithreaded support' doesn't give you any performance benefits. It is a work-around for sqlite limitations in Python."
PostgreSQL, MySQL, etc. will give you better performance when several connections are open at once.
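If you go the sqlitedict route, a minimal sketch of the usage (the file name and keys here are invented):

    from sqlitedict import SqliteDict

    # autocommit=True persists each assignment immediately; the object can be
    # shared between threads, but writes are serialized internally.
    db = SqliteDict('./results.sqlite', autocommit=True)
    db['job-42'] = {'status': 'done', 'rows': 10}   # values are pickled
    print(db['job-42'])
    db.close()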
I used method 1 before. It is the easiest to code. Since that project was a small website, each query took only a few milliseconds, so all user requests could be processed promptly.
I also used method 3 before. When queries take longer, it is better to queue them, since frequent "detect and wait" makes no sense there; it calls for a classic producer-consumer model and takes more time to code (see the sketch below).
But if the queries are really heavy and frequent, I would suggest looking at another database such as MS SQL or MySQL.
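For method 3, a rough sketch of that producer-consumer idea using only the standard library: one background thread owns the sqlite3 connection and drains a queue of (sql, params) items posted by the other threads. Table and column names are invented for illustration.

    import queue
    import sqlite3
    import threading

    write_q = queue.Queue()

    def db_writer(path):
        # The only thread that ever touches the connection.
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE IF NOT EXISTS results (name TEXT, score INTEGER)")
        conn.commit()
        while True:
            item = write_q.get()
            if item is None:          # sentinel: shut down
                break
            sql, params = item
            conn.execute(sql, params)
            conn.commit()
        conn.close()

    writer = threading.Thread(target=db_writer, args=("app.db",))
    writer.start()

    # Any thread can enqueue writes without touching the connection itself:
    write_q.put(("INSERT INTO results (name, score) VALUES (?, ?)", ("alice", 10)))
    write_q.put(None)
    writer.join()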
I am working on an online judge. I am using Python 2.7 and MySQL (I am working on the back-end part).
My Method:
I create a main thread which pulls submissions from the database (10 at a time) and puts them in a queue. Then I have multiple threads that take submissions from the queue, evaluate them, and write the results back to the database.
Now I have some doubts (I know they touch on different topics, but an approach to any of them is highly appreciated).
Currently, when I start the threads, I give each of them its own DB connection, which it uses. Is one connection per thread good practice? Does sharing a connection between threads create problems? How should I go about this?
My main thread uses a single connection, since its only work is to pull submissions from the DB and put them in the queue (and update their status in the DB to "Assessing Submission"). But sometimes I get the error "Lost connection to MySQL server during query". I keep getting it even when I stop the program and start it again. What can I do about it? Also, should I implement a pool of connections just for the main thread?
Also, does a DB connection stay alive forever? What should I do when its session memory, etc., gets exhausted? How do I handle that?
Use a connection pool. Sharing the database connection is not always bad but you have to be careful about it. You can try SQLAlchemy to manage a lot of this for you: http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#unitofwork-contextual
The server might be out of connections, or your connection might have been killed because it uses too many resources, etc. A connection pool could help you solve this.
It all depends; in theory a connection could stay alive indefinitely, but usually there is a timeout somewhere.
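A sketch of what that can look like with a pooled SQLAlchemy engine (the URL, table, and pool settings are placeholders, and pool_pre_ping requires a newer SQLAlchemy than the 0.8 docs linked above):

    from sqlalchemy import create_engine, text

    engine = create_engine(
        "mysql+pymysql://judge:secret@localhost/judge_db",
        pool_size=10,          # roughly one connection per worker thread
        pool_recycle=3600,     # recycle connections before MySQL's wait_timeout
        pool_pre_ping=True,    # test connections on checkout to avoid "Lost connection"
    )

    def mark_assessing(submission_id):
        # Checks a connection out of the pool and returns it when the block exits.
        with engine.begin() as conn:
            conn.execute(
                text("UPDATE submissions SET status = 'assessing' WHERE id = :id"),
                {"id": submission_id},
            )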
If you give the same connection to every thread, the threads will step on each other's queries and race conditions will occur. So you do need to provide a separate connection to every thread, and it is indeed a good idea. Use a connection pool for the purpose; it will help you obtain the separate connections.
A connection pool will surely help.
Release each connection once your work is over. There is a limit on how long a connection can sit open, termed the connection timeout, so you may want a third-party library to handle that; c3p0 is a good library that can help you with this.
Please refer to the link below for configuring it:
Best configuration of c3p0
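If you don't want a full pooling library, a bare-bones illustration of "one connection per thread" might look like this (credentials and table names are invented):

    import threading
    import MySQLdb

    def worker(submission_id):
        # Each thread opens, uses and closes its own connection.
        conn = MySQLdb.connect(host="localhost", user="judge",
                               passwd="secret", db="judge_db")
        try:
            cur = conn.cursor()
            cur.execute("SELECT code FROM submissions WHERE id = %s", (submission_id,))
            # ... evaluate the submission and write the verdict back here ...
            conn.commit()
        finally:
            conn.close()   # release it so it never idles past the server timeout

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()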
I noticed that sqlite3 isn't really capable or reliable when I use it inside a multiprocessing environment. Each process tries to write some data into the same database, so that a connection is used by multiple threads. I tried it with the check_same_thread=False option, but the number of insertions is pretty random: sometimes everything gets inserted, sometimes not. Should I parallel-process only part of the function (fetching data from the web), collect the outputs into a list and put them into the table all together, or is there a reliable way to handle multiple connections with sqlite?
First of all, there's a difference between multiprocessing (multiple processes) and multithreading (multiple threads within one process).
It seems that you're talking about multithreading here. There are a couple of caveats that you should be aware of when using SQLite in a multithreaded environment. The SQLite documentation mentions the following:
Do not use the same database connection at the same time in more than one thread.
On some operating systems, a database connection should always be used in the same thread in which it was originally created.
See here for more detailed information: Is SQLite thread-safe?
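A minimal illustration of following that rule: create the table once up front, then let every thread make its own connection rather than sharing one (file and table names are invented):

    import sqlite3
    import threading

    setup = sqlite3.connect("shared.db")
    setup.execute("CREATE TABLE IF NOT EXISTS items (thread_id INTEGER, value INTEGER)")
    setup.commit()
    setup.close()

    def worker(n):
        # Created and used entirely within this thread.
        conn = sqlite3.connect("shared.db", timeout=30)
        with conn:   # commits on success, rolls back on error
            conn.execute("INSERT INTO items VALUES (?, ?)", (n, n * n))
        conn.close()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()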
I've actually just been working on something very similar:
multiple processes (for me, a processing pool of 4 to 32 workers)
each worker does some work that includes getting information from the web (a call to the Alchemy API, in my case)
each process opens its own sqlite3 connection, all to a single file, and adds one entry before getting the next task off the stack
At first I thought I was seeing the same issue as you, but then I traced it to overlapping and conflicting problems with retrieving the information from the web. Since I was right there, I did some torture testing on sqlite and multiprocessing, and found I could run MANY process workers, all connecting and adding to the same sqlite file without coordination, and it was rock solid when I was just putting in test data.
So now I'm looking at your phrase "(fetching data from the web)": perhaps you could try replacing that data fetching with some dummy data to make sure it really is the sqlite3 connection causing your problems. At least in my tested case (running right now in another window), I found that multiple processes were all able to add rows through their own connections without issues. Your description exactly matches the problem I was having when two processes stepped on each other while going for the web API (a very odd error, actually) and sometimes didn't get the expected data, which of course leaves an empty slot in the database. My eventual solution was to detect this failure within each worker and retry the web API call when it happened (it could have been more elegant, but this was for a personal hack).
My apologies if this doesn't apply to your case; without code it's hard to know what you're facing, but the description makes me wonder whether you might widen your considerations.
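For reference, the retry looked roughly like the sketch below; fetch_from_api is a hypothetical stand-in for the real web call (the Alchemy API, in my case):

    import time
    import urllib.request

    def fetch_from_api(url):
        # Stand-in for the real web call.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()

    def fetch_with_retry(url, attempts=3, delay=2):
        for attempt in range(attempts):
            try:
                return fetch_from_api(url)
            except Exception:
                # The web call failed; back off and retry rather than writing
                # an empty slot into the sqlite table.
                time.sleep(delay * (attempt + 1))
        return None   # let the caller decide how to record the miss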
sqlitedict: A lightweight wrapper around Python's sqlite3 database, with a dict-like interface and multi-thread access support.
If I had to build a system like the one you describe using SQLite, I would start by writing an async server (using the asynchat module) to handle all of the SQLite database access, and then write the other processes to use that server. When only one process accesses the db file directly, it can enforce a strict sequence of queries so that there is no danger of two processes stepping on each other's toes. It is also faster than continually opening and closing the db.
In fact, I would also try to avoid maintaining sessions; in other words, I would try to write all the other processes so that every database transaction is independent. At minimum this would mean allowing a transaction to contain a list of SQL statements rather than just one, and it might even require some if-then capability so that you could SELECT a record, check that a field is equal to X, and only then UPDATE that field. If your existing app closes the database after every transaction, then you don't need to worry about sessions.
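Here is a stripped-down sketch of that single-writer idea, using a dedicated process and a multiprocessing.Queue rather than an asynchat server; each work item is a list of (sql, params) statements applied as one transaction, as suggested above. All names are illustrative.

    import sqlite3
    from multiprocessing import Process, Queue

    def db_server(path, q):
        # The only process that ever opens the database file.
        conn = sqlite3.connect(path)
        while True:
            statements = q.get()
            if statements is None:      # sentinel: shut down
                break
            try:
                for sql, params in statements:
                    conn.execute(sql, params)
                conn.commit()           # the whole list succeeds or fails together
            except sqlite3.Error:
                conn.rollback()
        conn.close()

    if __name__ == "__main__":
        q = Queue()
        server = Process(target=db_server, args=("app.db", q))
        server.start()
        # Any client process holding q can submit an independent transaction:
        q.put([("CREATE TABLE IF NOT EXISTS kv (k TEXT, v TEXT)", ()),
               ("INSERT INTO kv VALUES (?, ?)", ("answer", "42"))])
        q.put(None)
        server.join()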
You might be able to use something like nosqlite http://code.google.com/p/nosqlite/
You can see the combination of software components I'm using in the title of the question.
I have a simple 10-table database running on a Postgres server (Win 7 Pro). I have client apps (Python, using psycopg to connect to Postgres) that connect to the database at random intervals to conduct relatively light transactions. There is only one client app at a time doing any kind of heavy transaction, and those are typically < 500 ms. The rest of them spend more time connecting than actually waiting for the database to execute the transaction. The point is that the database is under light load, but the load is evenly split between reads and writes.
My client apps run as servers/services themselves. I've found that it is pretty common for me to be able to (1) take the Postgres server completely down, and (2) ruin the database by killing the client app with a keyboard interrupt.
By (1), I mean that the Postgres process on the server aborts and the service needs to be restarted.
By (2), I mean that the database crashes again whenever a client tries to access the database after it has restarted and (presumably) finished "recovery mode" operations. I need to delete the old database/schema from the database server, then rebuild it each time to return it to a stable state. (After recovery mode, I have tried various combinations of Vacuums to see whether that improves stability; the vacuums run, but the server will still go down quickly when clients try to access the database again.)
I don't recall seeing the same effect when I kill the client app using a "taskkill" - only when using a keyboard interrupt to take the python process down. It doesn't happen all the time, but frequently enough that it's a major concern (25%?).
I'm really surprised that anything on a client would actually be able to take down an "enterprise class" database. Can anyone share tips on how to improve robustness, and hopefully help me understand why this is happening in the first place? Thanks, M
If you're having problems with postgresql acting up like this, you should read this page:
http://wiki.postgresql.org/wiki/Guide_to_reporting_problems
For an example of a real bug, and how to ask a question that gets action and answers, read this thread.
http://archives.postgresql.org/pgsql-general/2010-12/msg01030.php
I am using PostgreSQL 8.4. I really like the new unnest() and array_agg() features; it is about time they realized the dynamic processing potential of their arrays!
Anyway, I am working on web server back ends that use long arrays a lot. There will be two successive processes, each of which will run on a different physical machine. Each such process is a light Python application which "manages" SQL queries to the database on its own machine as well as requests from the front ends.
The first process generates an array which is buffered into an SQL table. Each such generated array is accessible via a primary key. When it's done, the first Python app sends the key to the second Python app. The second app, running on a different machine, then uses the key to go and get the referenced array from the first machine. It then sends it to its own DB for generating a final result.
The reason why I send a key is that I am hoping this will make the two processes go faster. But really, what I would like is a way to have the second database send a query to the first database, in the hope of minimizing serialization delays and the like.
Any help/advice would be appreciated.
Thanks
Sounds like you want dblink from contrib. This allows some inter-database Postgres communication. The pg docs are great and should provide the needed examples.
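As a hedged sketch, pulling the referenced array over dblink from the second machine's Python app might look something like this (connection strings, table, and column names are invented, and dblink must be installed in the second database):

    import psycopg2

    conn = psycopg2.connect("dbname=second_db user=app")
    cur = conn.cursor()

    # Ask the second database to reach over to the first one and fetch the row
    # for a given key (42 here) in a single round trip.
    cur.execute("""
        SELECT id, payload
        FROM dblink('host=first-machine dbname=first_db user=app',
                    'SELECT id, payload FROM array_buffer WHERE id = 42')
             AS t(id integer, payload integer[]);
    """)
    row = cur.fetchone()
    conn.close()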
Not sure I totally understand, but have you looked at NOTIFY/LISTEN? http://www.postgresql.org/docs/8.1/static/sql-listen.html
I am thinking either LISTEN/NOTIFY or something with a cache such as memcache. You would send the key to memcache and have the second Python app retrieve it from there. You could even combine this with LISTEN/NOTIFY, e.g. send the key to memcache and notify your second app that the key is waiting there to be retrieved.
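A rough psycopg2 sketch of the LISTEN side (channel and connection details are invented; note that NOTIFY payloads require PostgreSQL 9.0+, so on 8.4 the key itself would come from memcache or a table, as suggested above):

    import select
    import psycopg2
    import psycopg2.extensions

    conn = psycopg2.connect("dbname=second_db user=app")
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)

    cur = conn.cursor()
    cur.execute("LISTEN array_ready;")

    while True:
        # Block for up to 5 seconds waiting for a notification.
        if select.select([conn], [], [], 5) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            notify = conn.notifies.pop(0)
            # On 8.4 there is no payload, so look the key up in memcache here;
            # on 9.0+ the first app could send it directly as notify.payload.
            print("woken up on channel", notify.channel)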