Managing multiple Twisted client connections - python

I'm trying to use Twisted in a sort of spidering program that manages multiple client connections. I'd like to maintain a pool of about 5 clients working at one time. Each client's job is to connect to a specified IRC server that it gets from a list, enter a specific channel, and then save the list of the users in that channel to a database.
The problem I'm having is more architectural than anything. I'm fairly new to Twisted and I don't know what options are available for managing multiple clients. I'm assuming the easiest way is to simply have each ClientCreator instance die off once it's completed its work and have a central loop that can check whether there's room to add a new client. I would think this isn't a particularly unusual problem, so I'm hoping to glean some information from other people's experiences.

The best option is really just to do the obvious thing here. Don't have a loop, or a repeating timed call; just have handlers that do the right thing.
Keep a central connection-management object around, and make event-handling methods feed it the information it needs to keep going. When it starts, make 5 outgoing connections. Keep track of how many are in progress by maintaining a list of them. When a connection succeeds (in connectionMade), update the list to remember the connection's new state. When a connection completes (in connectionLost), tell the connection manager; its response should be to remove that connection and make a new connection somewhere else. In the middle, it should be fairly obvious how to fire off a request for the names you need and stuff them into a database (most likely waiting for the database insert to complete before dropping your IRC connection, by waiting for the Deferred to come back from adbapi).
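Concretely, here is a minimal sketch of that manager pattern. The server list is a placeholder, and the IRC registration and the database write are elided; treat it as a skeleton, not a finished implementation:
from twisted.internet import reactor
from twisted.internet.protocol import ClientFactory, Protocol

class SpiderProtocol(Protocol):
    def connectionMade(self):
        self.factory.manager.started(self)
        # ... register with the IRC server, JOIN, request NAMES here ...

    def connectionLost(self, reason):
        self.factory.manager.finished(self)

class SpiderFactory(ClientFactory):
    protocol = SpiderProtocol

    def __init__(self, manager):
        self.manager = manager

    def clientConnectionFailed(self, connector, reason):
        # a failed connect never reaches connectionMade/Lost, so refill here too
        self.manager.connect_next()

class ConnectionManager(object):
    pool_size = 5

    def __init__(self, servers):
        self.servers = iter(servers)
        self.active = set()

    def start(self):
        for _ in range(self.pool_size):
            self.connect_next()

    def connect_next(self):
        try:
            host, port = next(self.servers)
        except StopIteration:
            return  # list exhausted; the pool simply drains from here
        reactor.connectTCP(host, port, SpiderFactory(self))

    def started(self, proto):
        self.active.add(proto)

    def finished(self, proto):
        self.active.discard(proto)
        self.connect_next()  # keep the pool topped up at five connections

manager = ConnectionManager([("irc.example.org", 6667)])  # placeholder list
reactor.callWhenRunning(manager.start)
reactor.run()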

Since each of your clients needs to update a database, instinctively I think I'd piggyback off Twisted's database connection pool (adbapi) -- see here for more (the whole doc is recommended for some important design patterns that often emerge when using Twisted).
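For instance, a hedged sketch with twisted.enterprise.adbapi; the connection parameters and table layout here are invented:
from twisted.enterprise import adbapi

# one pool for the whole process; adbapi runs the queries in threads for you
dbpool = adbapi.ConnectionPool("MySQLdb", db="irc", user="bot", passwd="secret")

def save_names(channel, names):
    def insert(cursor):
        cursor.executemany(
            "INSERT INTO channel_users (channel, nick) VALUES (%s, %s)",
            [(channel, name) for name in names])
    # returns a Deferred; wait on it before dropping the IRC connection
    return dbpool.runInteraction(insert)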

I don't know if you are forced to use Twisted, otherwise you might want to give Gevent a try.

Related

Is it thread-safe to use SQLAlchemy with engine/connections instead of sessions?

I've got a simple webservice which uses SQLAlchemy to connect to a database using the pattern
engine = create_engine(database_uri)
connection = engine.connect()
In each endpoint of the service, I then use the same connection, in the following fashion:
for result in connection.execute(query):
    <do something fancy>
Since Sessions are not thread-safe, I'm afraid that connections aren't either.
Can I safely keep doing this? If not, what's the easiest way to fix it?
Minor note -- I don't know if the service will ever run multithreaded, but I'd rather be sure that I don't get into trouble when it does.
Short answer: you should be fine.
There is a difference between a connection and a Session. The short description is that connections represent just that… a connection to a database. What you pass into it goes to the database more or less as-is. It won't keep track of your transactions unless you tell it to, and it won't care about what order you send it data. So if it matters that you create your Widget object before you create your Sprocket object, then you'd better make those calls in a thread-safe context. The same generally goes for keeping track of a database transaction.
Session, on the other hand, keeps track of data and transactions for you. If you check out the source code, you'll notice quite a bit of back and forth over database transactions. Without a way to know that you have everything you want in a transaction, you could very well end up committing in one thread while you expect to still be able to add another object (or several) in another.
In case you don't know what a transaction is, Wikipedia covers it, but the short version is that transactions help make sure your data stays consistent. If you have 15 inserts and updates and insert 15 fails, you might not want to keep the other 14. A transaction lets you cancel the entire operation in bulk.
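For what it's worth, a common arrangement that sidesteps the question entirely is to share the Engine (whose pool is thread-safe) and check out a short-lived connection wherever one is needed. A minimal sketch, with a placeholder URI:
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder

def handle_request(query):
    # each call checks a connection out of the engine's thread-safe pool
    # (on SQLAlchemy 1.4+ wrap raw SQL strings in sqlalchemy.text())
    with engine.connect() as connection:
        return list(connection.execute(query))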

Reusing database connection for multiple requests

If I don't need transactions, can I reuse the same database connection for multiple requests?
Flask documentation says:
Because database connections encapsulate a transaction, we also need to make sure that only one request at a time uses the connection.
Here's how I understand the meaning of the above sentence:
Python DB-API connection can only handle one transaction at a time; to start a new transaction, one must first commit or roll back the previous one. So if each of our requests needs its own transaction, then of course each request needs its own database connection.
Please let me know if I got it wrong.
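To make sure I understand, here is the DB-API behaviour in a minimal example (sqlite3 used purely for illustration; any DB-API driver behaves the same way):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")  # joins the current transaction
conn.commit()                             # ends it; the next statement starts a new one
conn.execute("INSERT INTO t VALUES (2)")
conn.rollback()                           # undoes only the second insert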
But let's say I set autocommit mode, and handle each request in a single SQL statement. Or, alternatively, let's say I only read from - not write to - the database. In either case, it seems I could just reuse the same database connection for all my requests to save the overhead of multiple connections. But I'm not sure if there's any downside to this approach.
Edit: I can see one issue with what I'm proposing: each request might be handled by a different process. Since connections should probably not be reused across processes, let me clarify my question: I mean creating one connection per process, and using it for all requests that happen to be handled by this process.
On the other hand, the whole point of (green or native) threads is usually to serve one request per thread, so my proposed approach implies sharing connection across threads. It seems one connection can be used concurrently in multiple native threads, but not in multiple green threads.
So let's say for concreteness my environment is flask + gunicorn with multiple multi-threaded sync workers.
Based on @Craig Ringer's comment on a different question, I think I know the answer.
The only possible advantage of connection sharing is performance (other factors - like transaction encapsulation and simplicity - favor a separate connection per request). And since a connection can't be shared across processes or green threads, it only has a chance with native threads. But psycopg2 (and presumably other drivers) doesn't allow concurrent query execution on the same connection. So unless each request spends very little time talking to the database, connection sharing is likely to cost performance rather than gain it.
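If you do want one connection per native thread (rather than per request), a hedged sketch with threading.local and psycopg2; the DSN is a placeholder:
import threading

import psycopg2

_local = threading.local()

def get_connection(dsn="dbname=app user=app"):  # placeholder DSN
    # each native thread lazily creates and reuses its own connection
    conn = getattr(_local, "conn", None)
    if conn is None or conn.closed:
        _local.conn = conn = psycopg2.connect(dsn)
    return conn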

Persistent MySQL connection in Python for social media harvesting

I am using Python to stream large amounts of Twitter data into a MySQL database. I anticipate my job running over a period of several weeks. I have code that interacts with the twitter API and gives me an iterator that yields lists, each list corresponding to a database row. What I need is a means of maintaining a persistent database connection for several weeks. Right now I find myself having to restart my script repeatedly when my connection is lost, sometimes as a result of MySQL being restarted.
Does it make the most sense to use the MySQLdb library, catch exceptions and reconnect when necessary? Or is there an already-made solution as part of SQLAlchemy or another package? Any ideas appreciated!
I think the right answer is to try to handle the connection errors; it sounds like you'd be pulling in a much larger library just for this one feature, while try-and-catch is probably how it's done at whatever level of the stack it lives. If necessary, you could multithread these things since they're probably IO-bound (i.e. suitable for Python GIL threading, as opposed to multiprocessing) and decouple the production from the consumption with a queue, too, which would maybe take some of the load off of the database connection.
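A hedged sketch of the catch-and-reconnect approach; the table layout and backoff values are invented:
import time

import MySQLdb

def insert_rows(rows, retries=5):
    delay = 1
    for attempt in range(retries):
        try:
            conn = MySQLdb.connect(host="localhost", user="bot",
                                   passwd="secret", db="tweets")
            try:
                cursor = conn.cursor()
                cursor.executemany(
                    "INSERT INTO tweets (user, text) VALUES (%s, %s)", rows)
                conn.commit()
                return
            finally:
                conn.close()
        except MySQLdb.OperationalError:
            # server restarted / connection dropped: back off and retry
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("gave up after %d attempts" % retries)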

(py)zmq/PUB: Is it possible to call connect() then send() immediately and not lose the message?

With this code, I always lose the message:
def publish(frontend_url, message):
    context = zmq.Context()
    socket = context.socket(zmq.PUB)
    socket.connect(frontend_url)
    socket.send(message)
However, if I introduce a short sleep(), I can get the message:
def publish(frontend_url, message):
    context = zmq.Context()
    socket = context.socket(zmq.PUB)
    socket.connect(frontend_url)
    time.sleep(0.1)  # wait for the connection to be established
    socket.send(message)
Is there a way to ensure the message will be delivered without sleeping between the calls to connect() and send()?
I'm afraid I can't predict the sleep duration (network latencies, etc.)
UPDATE:
Context: I'd like to publish data updates from a Flask REST application to a message broker (e.g. on resource creation/update/deletion).
Currently, the message broker is drafted using the 0mq FORWARDER device
I understand 0mq is designed to abstract the TCP sockets and message passing complexities.
In a context where connections are long-lived, I could use it.
However, when running my Flask app in an app container like gunicorn or uwsgi, I have N worker processes and I can't expect the connection nor the process to be long-lived.
As I understand the issue, I should use a real message broker (like RabbitMQ) and use a synchronous client to publish the messages there.
You can't do this exactly, but there may be other solutions that would solve your problem.
Why are you using PUB/SUB sockets? The nature of pub/sub is more suited to long-running sockets, and typically you will bind() on the PUB socket and connect() on the SUB socket. What you're doing here, spinning up a socket to send one message, presumably to a "server" of some sort, doesn't really fit the PUB/SUB paradigm very well.
If you instead choose some variation of REQ or DEALER to REP or ROUTER, then things might go smoother for you. A REQ socket will hold a message until its pair is ready to receive it. If you don't particularly care about the response from the "server", then you can just discard it.
Is there any particular reason you aren't just leaving the socket open, instead of building a whole new context and socket, and re-connecting each time you want to send a message? I can think of some limited scenarios where this might be the preferred behavior, but generally it's a better idea to just leave the socket up. If you wanted to stick with PUB/SUB, then just spin the socket up at the start of your app, sleep some safe period of time that covers any reasonable latency scenario, and then start sending your messages without worrying about re-connecting every time. If you'll leave this socket up for long periods of time without any new messages you'll probably want to use heart-beating to make sure the connection stays open.
From the ZMQ Guide:
There is one more important thing to know about PUB-SUB sockets: you do not know precisely when a subscriber starts to get messages. Even if you start a subscriber, wait a while, and then start the publisher, the subscriber will always miss the first messages that the publisher sends. This is because as the subscriber connects to the publisher (something that takes a small but non-zero time), the publisher may already be sending messages out.
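For instance, a hedged sketch of the REQ variant suggested above; the endpoint is a placeholder, and note that recv() will block if no REP peer ever appears:
import zmq

context = zmq.Context.instance()

def publish(frontend_url, message):
    socket = context.socket(zmq.REQ)
    socket.connect(frontend_url)
    socket.send(message)  # REQ holds the message until a peer is ready
    socket.recv()         # discard the acknowledgement from the REP side
    socket.close()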
Many posts here start with:
"I used .PUB/.SUB and it did not the job I wanted it to do ... Anyone here, do help me make it work like I think it shall work out of the box."
This approach does not work in real world, the less in distributed systems design, the poorer in systems, where near-real-time scheduling and/or tight resources-management is simply un-avoid-able.
Inter-process / inter-platform messaging is not "just another" simple-line-of-code (SLOC)
# A sample demo-code snippet            # Issues the demo-code has left to be resolved
#-------------------------------------- #------------------------------------------------
def publish( frontend_url, message ):   # what is a benefit of a per-call OneStopPUBLISH function?
    context = zmq.Context()             # .Context() has to be .Terminate()-d (!)
    socket = context.socket(zmq.PUB)    # is this indeed "a disposable" for each call?
    socket.connect(frontend_url)        # what transport-class used for .connect()/.bind()?
    time.sleep(0.1)                     # wait for the connection to be established
    socket.send(message)                # ^ has no control over low-level "connection" handshaking
Anybody may draft a few one-liners and put in a decent effort ( their own or community-outsourced ) to make it finally work ( at least somehow ).
However, this is a field of vast capabilities, and as such it requires a bit of reshaping of one's mind to allow its potential to become unlocked and fully utilised.
Sketching a need for a good solution but on wrong grounds or with mis-understood SLOC-s ( be they copy/paste-d or not ) typically does not yield anything reasonable for the near future, the less for the farther one.
Messaging simply introduces a new paradigm -- a new macro-COSMOS -- of building automation on a wider scale. Surprisingly, your (deterministic) code becomes a member of a more complex set of Finite State Automata ( FSA ) that -- not so surprisingly, as we intend to do some "MESSAGING" -- speak among each other.
For that, there needs to be some [local-resource-management], some "outer" [transport], and some "formal behaviour model etiquette" ( so as not to shout one over another ) for the [communication-primitives].
This is typically in-built into ZeroMQ, nanomsg and other libraries.
However, there are two important things that remain hidden.
The micro-cosmos of how the things work internally ( many, if not all, attempts to tweak this, instead of doing one's best to make proper use of it, are typically a waste of time )
The macro-cosmos of how to orchestrate a non-trivial herd of otherwise trivial elements [communication-primitives] into a ROBUST, SCALEABLE messaging ARCHITECTURE, that co-operates across process/localhost/network boundaries and that meets the overall design needs.
Failure to understand the distance between these two worlds typically causes a poor use of the greatest strengths we have received pre-cooked in the messaging libraries.
Simply the best thing to do is to forget the one-liner tweaking approaches. They are not productive.
Understanding the global view first allows you to harness the powers that will work best for your goals.
Why is it so complex?
( courtesy nanomsg.org )
Any non-trivial system is complex, both in the TimeDOMAIN and in the ResourcesDOMAIN. All the more if one strives to create a stable, smart, High-performance, Low-latency, transport-class-agnostic Universal Communication Framework.
The good news is, this has already been elaborated and built into the micro-cosmos architecture.
The bad news is, it does not solve your needs right out of the box ( except for some really trivial cases ).
Here we come with the macro-COSMOS design.
It is your responsibility to design a higher-space algorithm for how to make many isolated FSA-primitives converse and reach agreement in accord with the evolving many-to-many conversation. Yes. The library gives you "just" primitive building blocks (very powerful, out of doubt). But it is your responsibility to make the "outer-space" work for your needs.
And this can and typically is complex.
Well, if that were trivial, then it most probably would have already been included "inside" the library, wouldn't it?
Where to go next?
Perhaps the best next step is IMHO to move towards a bit more global view, which may sound complicated for the first few things one tries to code with ZeroMQ; at the very least, jump to page 265 of Pieter Hintjens' book, Code Connected, Volume 1, if reading it step by step is not an option.
One can then start to realise how it is possible to "program" the macro-COSMOS of FSA-primitives so as to form a higher-order FSA-of-FSAs that can and will solve all the ProblemDOMAIN-specific issues.
First take an unbiased look at the Fig. 60 Republishing Updates and Fig. 62 HA Clone Server pair, and only after that go back to the roots, elements and details.

How to manage DB connections, especially in the case of multithreading

I am working on an online judge. I am using Python 2.7 and MySQL (as I am working on the back-end part).
My Method:
I create a main thread which pulls submissions from the database (10 at a time) and puts them in a queue. Then I have multiple threads that take submissions from the queue, evaluate them and write the results back to the database.
Now I have some doubts (I know they are doubts on different topics, but approaches to any of them are also highly appreciated).
Currently, when I start the threads, I give them their own DB connections, which they use. Is it good practice to give one connection per thread? Does sharing connections between threads create problems? How do I go about this?
My main thread uses a single connection, as its only work is to pull submissions from the DB and put them in the queue (and also update their status in the DB to "Assessing Submission"). But sometimes I get the error: "Lost connection to MySQL server while querying". I keep getting it even when I stop the program and start it again. What do I do about it? Also, should I implement a pool of connections for just the main thread?
Also, does a DB connection stay alive forever? What should I do when its session memory etc. gets exhausted, and how do I handle that?
Use a connection pool. Sharing a database connection is not always bad, but you have to be careful about it. You can try SQLAlchemy to manage a lot of this for you: http://docs.sqlalchemy.org/en/rel_0_8/orm/session.html#unitofwork-contextual
The server might be out of connections, or your connection might have been killed because it used too many resources, etc. A connection pool can help you solve this.
It all depends; theoretically a connection could stay alive indefinitely, but usually there is a timeout somewhere.
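A hedged sketch of the thread-local (scoped) session pattern from the SQLAlchemy link above; the URI is a placeholder:
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine("mysql://judge:secret@localhost/oj")  # placeholder
Session = scoped_session(sessionmaker(bind=engine))

def worker_task(obj):
    session = Session()      # each thread transparently gets its own session
    try:
        session.add(obj)
        session.commit()
    finally:
        Session.remove()     # return the thread's connection to the pool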
If you give the same connection to every thread, the threads will not be able to query the database safely and race conditions will occur. So you need to provide a separate connection to every thread, and indeed that is a good idea. Use a connection pool for this purpose; it will help you hand out distinct connections.
A connection pool will surely help.
Release the connection once your work is over. There is a limit on how long a connection lives, termed the connection timeout, so you may need a third-party library to handle that; c3p0 is a good library which can help you with this.
Please refer to the link below to configure it:
Best configuration of c3p0
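Putting the pieces together, a hedged sketch of the queue plus connection-per-thread layout described in the question (Python 2.7; evaluate() and the table schema are hypothetical):
import threading
import Queue  # "queue" on Python 3

import MySQLdb

jobs = Queue.Queue(maxsize=10)

def worker():
    # each worker owns its connection; nothing is shared between threads
    conn = MySQLdb.connect(host="localhost", user="judge",
                           passwd="secret", db="oj")
    while True:
        submission_id, source = jobs.get()
        verdict = evaluate(source)  # hypothetical judging routine
        cursor = conn.cursor()
        cursor.execute("UPDATE submissions SET verdict = %s WHERE id = %s",
                       (verdict, submission_id))
        conn.commit()
        jobs.task_done()

for _ in range(4):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()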
