Under what conditions is 'kazoo.exceptions.ConnectionLoss' raised? - python

I am using Apache ZooKeeper and the Kazoo framework for one of my requirements. I have a simple ZooKeeper cluster set up and a few clients connecting to the server cluster to read node information. I am hitting kazoo.exceptions.ConnectionLoss randomly (roughly once in fifty times).
My concern is: in which situations is this exception raised? Below are the cases I thought of.
Connection to the server was lost
The server didn't respond within the timeout set in the server configuration
Can there be any other reasons for this exception? I don't see documentation explaining this in detail.

I'm afraid I don't have a ready answer, but looking at the Kazoo code I think this exception can occur under the following conditions:
Socket read timeout
Socket write timeout
Deserialization failure caused by timeout issues
Client creation when a node's initial bytes value is too high
I tried to gather this from the Kazoo unit test code (test_connection, test_client).
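In practice, ConnectionLoss is generally treated as a retryable condition. A minimal sketch, assuming a local ensemble at 127.0.0.1:2181 and a placeholder node path, using Kazoo's built-in retry helpers:

from kazoo.client import KazooClient
from kazoo.exceptions import ConnectionLoss
from kazoo.retry import KazooRetry

# Placeholder connection string and node path.
zk = KazooClient(
    hosts="127.0.0.1:2181",
    connection_retry=KazooRetry(max_tries=3, delay=0.5),  # retries the connection itself
)
zk.start()

try:
    # retry() re-runs the operation if it fails with a connection-related error.
    data, stat = zk.retry(zk.get, "/my/node")
except ConnectionLoss:
    # Still failing after the retries: the link to the ensemble is gone.
    pass
finally:
    zk.stop()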

Related

How do I keep a FTP connection alive?

I used ftputil to download a batch of files from an FTP server. It raised the error ftputil.error.FTPIOError: [Errno 60] Operation timed out.
As described in the ftputil documentation,
keep_alive() attempts to keep the connection to the remote server active in order to prevent timeouts from happening. This method is primarily intended to keep the underlying FTP connection of an FTPHost object alive while a file is uploaded or downloaded. This will require either an extra thread while the upload or download is in progress or calling keep_alive from a callback function.
I called keep_alive from a callback function with,
ftp_host.download(source, target, callback=ftp_host.keep_alive)
but it raised: ERROR __main__ keep_alive() takes 1 positional argument but 2 were given.
How do I keep a FTP connection alive?
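A likely reason for that error: ftputil invokes the download callback with each transferred data chunk, while keep_alive() takes no arguments. A minimal sketch that at least matches the expected callback signature (host, credentials and paths are placeholders; whether calling keep_alive from a callback actually prevents the timeout is a separate question, as discussed below):

import ftputil

# Placeholder host, credentials and paths.
ftp_host = ftputil.FTPHost("ftp.example.com", "user", "password")
source = "remote/file.bin"
target = "local/file.bin"

# The callback receives the transferred data chunk; drop it and just
# ping the main connection.
ftp_host.download(source, target, callback=lambda chunk: ftp_host.keep_alive())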
This isn't directly an answer to your question, but it may help finding an answer for your particular problem yourself. Also, a ticket on the ftputil website is better for help with debugging a problem. That said, I think it's fine to ask on StackOverflow first since you don't know in advance if the problem is a simple one or not. :-)
Since FTP is a stateful protocol, client and server can't send arbitrary commands at any given time. The allowed commands and possible replies are determined by the state the connection is in. See also the state diagrams in RFC 959.
To work around this limitation, ftputil creates a new FTP connection behind the scenes for each remote file object [1]. With this approach, you can still send commands like chdir or start a download while another is still in progress. However, this means that from the perspective of the server, all these FTP connections that come from a single FTPHost object are independent connections, so each of them can time out at a different time, depending on the usage pattern of the respective connection.
For example, there was ftputil ticket 141, where presumably the main connection initiated by the FTPHost object timed out while a connection used for downloading was still usable.
In your case, it might be helpful to find out which of the underlying connections is timing out (the initial connection or a connection for a remote file). You can use ftputil.session.session_factory to create factories that have FTP debugging enabled (see the documentation).
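A minimal sketch of what that could look like, with placeholder host, credentials, paths and debug level (check the ftputil documentation for the exact options your version supports):

import ftputil
import ftputil.session

# Build a session factory with ftplib debugging enabled, so the FTP
# conversation (and which connection eventually times out) is printed.
debugging_session_factory = ftputil.session.session_factory(debug_level=2)

with ftputil.FTPHost("ftp.example.com", "user", "password",
                     session_factory=debugging_session_factory) as ftp_host:
    ftp_host.download("remote/file.bin", "local/file.bin")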
Unfortunately, a timeout of 60 seconds is quite short, so there are relatively many chances for timeouts.
Especially given the possibility of timeouts in FTP connections, my advice is to write software for FTP transfers in a way so that you can restart the operation (ideally with a new FTPHost object for robustness) where it was interrupted by the timeout. So far I haven't been able to come up with a way to universally work around timeouts. In simple cases you may actually be better off using ftplib directly, although ftputil has robustness and latency improvements that ftplib doesn't have. Using ftplib doesn't save you from timeouts, but at least you don't have any "hidden" connections that may make debugging more difficult.
[1] That said, if you close a remote file in ftputil, the underlying FTP connection can be reused, provided it hasn't timed out. The library checks for a timeout before it reuses the connection.
The picture regarding timeouts is even more complicated by ftputil caching a lot of information from the server to reduce latency. For example, if you call FTPHost.getcwd(), the current directory is retrieved from a cached attribute, not by sending a PWD command to the server and thereby resetting the timeout. Stat information from directory listings is also usually cached.
After a couple of hours looking for solutions, I got it running without '421 Timeout' errors by calling keep_alive from a separate thread. However, your I/O timeout error was probably caused by connection problems.
import ftputil
from threading import Thread
from time import sleep

fhandle = ftputil.FTPHost('host', 'user', 'pwd')
quitThread = 0

def _thread_keep_alive():
    # Ping the main FTP connection every 25 seconds until the main
    # thread signals that the transfer is finished.
    while quitThread == 0:
        print("KEEPALIVE!")
        fhandle.keep_alive()
        sleep(25)

thread = Thread(target=_thread_keep_alive)
thread.start()

# some downloading...

quitThread = 1
fhandle.close()

Consumers do not reconnect after Rabbit MQ restart

Sometimes our RabbitMQ messaging server requires a restart. Afterwards, however, some consumers that are listening via a blocking basic_consume call do not consume any messages until they themselves are restarted, and they don't raise any exception either.
What is the reason for this, and how might I fix it?
In the ConnectionFactory, please ensure the following property is set to true:
factory.setAutomaticRecoveryEnabled(true);
For more details, please refer to the documentation here.
As I mentioned in my comment, every AMQP client library has a different way to recover connections, and some depend on the developer to do that. There is NO canonical method.
Pika has this example as a starting point for connection recovery. Note that the code is for the unreleased version of Pika (1.0.0). If you're on 0.12.0 you will have to adjust the parameters to the method calls.
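For illustration, a hedged sketch of such a reconnect loop against the Pika 1.x blocking API (queue name, host and the message handler are placeholders; 0.12.0 uses different parameter names):

import time
import pika

def handle_message(channel, method, properties, body):
    # Placeholder handler: print and acknowledge.
    print("received:", body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

def consume_forever(queue_name):
    while True:
        try:
            connection = pika.BlockingConnection(
                pika.ConnectionParameters(host="localhost", heartbeat=60))
            channel = connection.channel()
            channel.basic_consume(queue=queue_name,
                                  on_message_callback=handle_message)
            channel.start_consuming()
        except pika.exceptions.ConnectionClosedByBroker:
            # Broker closed the connection (e.g. restart); back off and retry.
            time.sleep(5)
        except pika.exceptions.AMQPConnectionError:
            # Network-level failure; back off and retry.
            time.sleep(5)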
The best way to test and implement connection recovery is to simulate failure conditions and then code for them. Run your application, then kill the beam.smp process (RabbitMQ) to see what happens. If you have a RabbitMQ cluster, use firewall rules to simulate a network partition. Can your application handle that? What happens when you run rabbitmqctl stop_app; sleep 10; rabbitmqctl start_app? Can your app handle that?
Run your application through a TCP proxy like toxiproxy and introduce latency and other non-optimal conditions. Shut down the proxy to simulate a sudden TCP connection close. In each case, code for that failure condition and log the event so that someone can later diagnose what has happened.
I have seen too many developers code for the "happy path" only to have their applications fail spectacularly in production with zero ability to determine the source of the failure.

Python RabbitMQ and kombu: heartbeats

I have a Twisted application (Python 2.7) that uses the kombu module to communicate with a RabbitMQ message server.
We're having problems with connections being closed (probably firewall related) and I'm trying to implement the heartbeat_check() method to handle this. I've got a heartbeat value of 10 set on the connection and a Twisted LoopingCall that calls heartbeat_check(rate=2) every 5 seconds.
However, once things get rolling I'm getting an exception thrown on every other call to heartbeat_check() (based on the logging I've got in the function the LoopingCall invokes, which includes the call to heartbeat_check). I've tried all kinds of variations of heartbeat and rate values and nothing seems to help. When I debug into heartbeat_check() it looks like some minor message traffic is being sent back and forth; am I supposed to respond to that in my message consumer?
Any hints or pointers would be very helpful, thanks in advance!!
Doug
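For context, a minimal sketch of the setup being described, assuming a kombu Connection with heartbeat=10 and a Twisted LoopingCall firing every 5 seconds (the broker URL is a placeholder):

from kombu import Connection
from twisted.internet import reactor
from twisted.internet.task import LoopingCall

# Placeholder broker URL; heartbeat negotiated at 10 seconds.
conn = Connection("amqp://guest:guest@localhost:5672//", heartbeat=10)
conn.connect()

def check_heartbeat():
    # Sends/validates heartbeat frames; raises if the broker has gone silent.
    conn.heartbeat_check(rate=2)

LoopingCall(check_heartbeat).start(5.0)
reactor.run()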

Message queue proxy in Python + Twisted

I want to implement a lightweight message queue proxy. Its job is to receive messages from a web application (PHP) and send them to the message queue server asynchronously. The reason for this proxy is that the MQ isn't always available and is sometimes lagging, or even down, but I want to make sure the messages are delivered and that the web application returns immediately.
So, PHP would send the message to the MQ proxy running on the same host. That proxy would save the messages to SQLite for persistence, in case of crashes. At the same time it would send the messages from SQLite to the MQ in batches when the connection is available, and delete them from SQLite.
Now, the way I understand it, there are these components in this service:
message listener (listens to the messages from PHP and writes them to a Incoming Queue)
DB flusher (reads messages from the Incoming Queue and saves them to a database; needed because of SQLite's single-threadedness)
MQ connection handler (keeps the connection to the MQ server online by reconnecting)
message sender (collects messages from the SQLite db and sends them to the MQ server, then removes them from the db)
I was thinking of using Twisted for #1 (TCPServer), but I'm having problems integrating it with the other components, which aren't event-driven. Intuition tells me that each of these components should run in a separate thread, because all are IO-bound and independent of each other, but I could easily put them in a single thread. Even so, I couldn't find any good and clear (to me) examples of how to implement such a worker thread alongside Twisted's main loop.
The example I started with is chatserver.py, which uses service.Application and internet.TCPServer objects. If I start my own thread prior to creating the TCPServer service, it runs a few times, but then it stops and never runs again. I'm not sure why this is happening, but it's probably because I don't use threads with Twisted correctly.
Any suggestions on how to implement a separate worker thread and keep Twisted? Do you have any alternative architectures in mind?
You're basically considering writing an ad-hoc extension to your messaging server, whose job is to provide whatever reliability guarantees you've asked of it.
Instead, perhaps you should take the hardware where you were planning to run this new proxy and run another MQ node on it. The new node should take care of persisting and relaying messages that you deliver to it while the other nodes are overloaded or offline.
Maybe it's not the best bang for your buck to use a separate thread in Twisted to get around a blocking call, but sometimes the least evil solution is the best. Here's a link that shows you how to integrate threading into Twisted:
http://twistedmatrix.com/documents/10.1.0/core/howto/threading.html
Sometimes in a pinch easy-to-implement is faster than hours/days of research which may all turn out to be for nought.
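As an illustration of the approach in that howto, a minimal sketch using deferToThread to run a blocking call in Twisted's thread pool (send_batch_to_mq is a hypothetical placeholder for whatever blocking work you need to do):

from twisted.internet import reactor
from twisted.internet.threads import deferToThread

def send_batch_to_mq(messages):
    # Hypothetical blocking call against the MQ server.
    return len(messages)

def on_success(result):
    print("batch delivered:", result)

def on_failure(failure):
    print("delivery failed:", failure)

# Runs the blocking function in the reactor's thread pool and returns a Deferred.
d = deferToThread(send_batch_to_mq, ["msg1", "msg2"])
d.addCallbacks(on_success, on_failure)
d.addBoth(lambda _: reactor.stop())   # stop the reactor once the call finishes
reactor.run()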
A neat solution to this problem would be to use the key-value store Redis. It's a high-speed persistent data store with plenty of clients, including a PHP and a Python client. If you want to use a timed/batch process to process messages, it saves you from creating a database and also covers your persistence story. It runs fine on Cygwin/Windows and POSIX environments.
PHP Redis client is here.
Python client is here.
Both have a very clean and simple API. Redis also offers a publish/subscribe mechanism, should you need it, although it sounds like it would be of limited value if you're publishing to a queue that isn't consistently available.
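For illustration, a minimal sketch of using a Redis list as the persistent buffer between the web application and the sender process, via the redis-py client (key name, connection details and the forwarding step are placeholders):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Producer side: append each outgoing message to a Redis list.
r.rpush("outgoing_messages", json.dumps({"type": "signup", "user": 42}))

# Sender side: block until a message is available, then forward it to the MQ.
while True:
    item = r.blpop("outgoing_messages", timeout=5)
    if item is None:
        continue  # nothing queued yet; keep waiting
    _key, payload = item
    message = json.loads(payload)
    # forward `message` to the MQ server here; on failure, rpush it back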

MySQLdb execute timeout

Sometimes in our production environment a situation occurs where the connection between a service (a Python program that uses MySQLdb) and the MySQL server is flaky: some packets are lost, some black magic happens, and .execute() on a MySQLdb.Cursor object never returns (or takes a huge amount of time to return).
This is very bad because it wastes service worker threads. Sometimes it leads to exhaustion of the worker pool and the service stops responding altogether.
So the question is: is there a way to interrupt a MySQLdb.Connection.execute operation after a given amount of time?
If the communication is such a problem, consider writing a 'proxy' that receives your SQL commands over the flaky connection and relays them to the MySQL server over a reliable channel (maybe running on the same box as the MySQL server). This way you have total control over failure detection and retrying.
You need to analyse exactly what the problem is. MySQL connections should eventually timeout if the server is gone; TCP keepalives are generally enabled. You may be able to tune the OS-level TCP timeouts.
If the database connection is "flaky", then you definitely need to investigate how. It seems unlikely that the database itself really is the problem; more likely the networking in between is.
If you are using (some) stateful firewalls of any kind, it's possible that they're losing some of the state, thus causing otherwise good long-lived connections to go dead.
You might want to consider changing the idle timeout parameter in MySQL; otherwise, a long-lived, unused connection may go "stale", where the server and client both think it's still alive, but some stateful network element in between has "forgotten" about the TCP connection. An application trying to use such a "stale" connection will have a long wait before receiving an error (but it should eventually).
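One possible mitigation for those long waits, offered as a hedged sketch: recent MySQLdb/mysqlclient releases accept connect_timeout, read_timeout and write_timeout options, which turn an indefinite hang inside .execute() into an OperationalError after a bounded time (host, credentials and query below are placeholders; check whether your installed version supports these options):

import MySQLdb

# Placeholder connection details; timeouts are in seconds.
conn = MySQLdb.connect(
    host="db.example.com",
    user="service",
    passwd="secret",
    db="app",
    connect_timeout=5,
    read_timeout=30,
    write_timeout=30,
)
cur = conn.cursor()
cur.execute("SELECT 1")   # raises OperationalError instead of hanging forever on a dead link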
