I am totally new to Kafka and Docker, and have been handed a problem to fix. Our Continuous Integration tests for Apache Kafka queues run just fine on local machines, but when run on the Jenkins CI server they occasionally fail with this sort of error:
%3|1508247800.270|FAIL|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: localhost:9092/bootstrap: Connect to ipv4#127.0.0.1:9092 failed: Connection refused
%3|1508247800.270|ERROR|art#producer-1| [thrd:localhost:9092/bootstrap]: 1/1 brokers are down
The working theory is that the Docker image takes time to get started, by which time the Kafka producer has given up. The offending code is
producer_properties = {
    'bootstrap.servers': self._job_queue.bootstrap_server,
    'client.id': self._job_queue.client_id
}
try:
    self._producer = kafka.Producer(**producer_properties)
except:
    print("Bang!")
with the error lines above appearing during the creation of the producer. However, no exception is raised, and the call returns an otherwise valid-looking producer, so I can't programmatically test for the presence of the broker endpoint. Is there an API to check the status of a broker?
It seems the client doesn't throw an exception if the connection to the broker fails. It only attempts to connect to the bootstrap servers the first time the producer tries to send a message. If the connection fails, it repeatedly retries against any of the brokers passed in the bootstrap list. Eventually, if the brokers come up, the send will happen (and we can check its status in the delivery callback).
The confluent-kafka-python library is built on librdkafka, and that client doesn't seem to have thorough documentation. Some of the producer options specified by the Kafka protocol don't appear to be supported by librdkafka.
Here is the sample code with a delivery callback that I used:
from confluent_kafka import Producer

def notifyme(err, msg):
    print(err, msg.key(), msg.value())

p = Producer({'bootstrap.servers': '127.0.0.1:9092', 'retry.backoff.ms': 100,
              'message.send.max.retries': 20,
              'reconnect.backoff.jitter.ms': 2000})
try:
    p.produce(topic='sometopic', value='this is data', on_delivery=notifyme)
except Exception as e:
    print(e)
p.flush()
Also, to check for the presence of the broker, you can simply telnet to the broker IP on its port (in this example, 9092). On the ZooKeeper used by the Kafka cluster, you can also check the contents of the znodes under /brokers/ids. For a programmatic check, see the sketch below.
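A minimal programmatic sketch, assuming confluent-kafka is installed and using the broker address from the example (127.0.0.1:9092); list_topics() is expected to raise within the timeout if no broker answers, and the socket check is the "telnet" test in code. The helper names are invented for illustration:
import socket

from confluent_kafka import KafkaException
from confluent_kafka.admin import AdminClient

def broker_reachable(bootstrap_server: str, timeout_s: float = 5.0) -> bool:
    """Return True if at least one broker answers a metadata request."""
    try:
        metadata = AdminClient({'bootstrap.servers': bootstrap_server}).list_topics(timeout=timeout_s)
        return len(metadata.brokers) > 0
    except KafkaException:
        return False

def port_open(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Cruder check: can a TCP socket be opened to the broker port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

print(broker_reachable('127.0.0.1:9092'), port_open('127.0.0.1', 9092))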
Here is the code that seems to work for me. If it looks a bit Frankenstein, then you are right, it is! If there is a clean solution, I look forward to seeing it:
import time
import uuid
from threading import Event
from typing import Dict

import confluent_kafka as kafka
# pylint: disable=no-name-in-module
from confluent_kafka.cimpl import KafkaError
# more imports...

LOG = # ...

# Default number of times to retry connection to Kafka Broker
_DEFAULT_RETRIES = 3

# Default time in seconds to wait between connection attempts
_DEFAULT_RETRY_DELAY_S = 5.0

# Number of times to scan for an error after initiating the connection. It appears that calling
# flush() once on a producer after construction isn't sufficient to catch the 'broker not available'
# error. At least twice seems to work.
_NUM_ERROR_SCANS = 2


class JobProducer(object):
    def __init__(self, connection_retries: int = _DEFAULT_RETRIES,
                 retry_delay_s: float = _DEFAULT_RETRY_DELAY_S) -> None:
        """
        Constructs a producer.
        :param connection_retries: how many times to retry the connection before raising a
                                   RuntimeError. If 0, retry forever.
        :param retry_delay_s: how long to wait between retries in seconds.
        """
        self.__error_event = Event()
        self._job_queue = JobQueue()
        self._producer = self.__wait_for_broker(connection_retries, retry_delay_s)
        self._topic = self._job_queue.topic

    def produce_job(self, job_definition: Dict) -> None:
        """
        Produce a job definition on the queue
        :param job_definition: definition of the job to be executed
        """
        value = ...  # Conversion to JSON
        key = str(uuid.uuid4())
        LOG.info('Produced message: %s', value)
        self.__error_event.clear()
        self._producer.produce(self._topic,
                               value=value,
                               key=key,
                               on_delivery=self._on_delivery)
        self._producer.flush(self._job_queue.flush_timeout)

    @staticmethod
    def _on_delivery(error, message):
        if error:
            LOG.error('Failed to produce job %s, with error: %s', message.key(), error)

    def __create_producer(self) -> kafka.Producer:
        producer_properties = {
            'bootstrap.servers': self._job_queue.bootstrap_server,
            'error_cb': self.__on_error,
            'client.id': self._job_queue.client_id,
        }
        return kafka.Producer(**producer_properties)

    def __wait_for_broker(self, retries: int, delay: float) -> kafka.Producer:
        retry_count = 0
        while True:
            self.__error_event.clear()
            producer = self.__create_producer()
            # Need to call flush() several times with a delay between to ensure errors are caught.
            if not self.__error_event.is_set():
                for _ in range(_NUM_ERROR_SCANS):
                    producer.flush(0.1)
                    if self.__error_event.is_set():
                        break
                    time.sleep(0.1)
                else:
                    # Success: no errors.
                    return producer
            # If we get to here, the error callback was invoked.
            retry_count += 1
            if retries == 0:
                msg = '({})'.format(retry_count)
            else:
                if retry_count <= retries:
                    msg = '({}/{})'.format(retry_count, retries)
                else:
                    raise RuntimeError('JobProducer timed out')
            LOG.warning('JobProducer: could not connect to broker, will retry %s', msg)
            time.sleep(delay)

    def __on_error(self, error: KafkaError) -> None:
        LOG.error('KafkaError: %s', error.str())
        self.__error_event.set()
My microservice uses confluent-kafka-python. Once in a while it fails with this error:
%4|1654121013.314|MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (300000ms) exceeded by 67ms (adjust max.poll.interval.ms for long-running message processing): leaving group
Whenever it hits this error, it goes idle instead of terminating.
Snippet of the consumer code:
def start(self, callback: Callable[[List[Message]], None]) -> None:
    '''
    start consuming
    '''
    try:
        while self.__terminate_event is None or not self.__terminate_event.is_set():
            if not self.__consumer.assignment():
                self.__consumer.subscribe(
                    [self.__topic],
                    on_assign=_on_assign,
                    on_revoke=_on_revoke,
                    on_lost=_on_lost
                )
                log.info("subscribed to topic: %s", self.__topic)
            message_list = self.__consumer.consume(
                num_messages=self.__num_messages_per_poll, timeout=KAFKA_TIMEOUT
            )
How can I capture this error so that the service terminates and another instance can be respawned in its place?
Thanks
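Not part of the original post, but one way this error can be caught: librdkafka reports the max.poll.interval.ms violation as a client-level error, so registering an error_cb on the Consumer and checking for KafkaError._MAX_POLL_EXCEEDED (an assumption to verify against your client version) lets the loop decide to terminate instead of idling. A minimal sketch with placeholder broker and group settings:
from threading import Event

from confluent_kafka import Consumer, KafkaError

terminate_event = Event()

def on_client_error(err: KafkaError) -> None:
    # Client-level errors arrive here rather than as exceptions from consume().
    if err.code() == KafkaError._MAX_POLL_EXCEEDED:
        terminate_event.set()

consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',  # placeholder
    'group.id': 'my-service',               # placeholder
    'max.poll.interval.ms': 300000,
    'error_cb': on_client_error,
})
In the class from the question, the same callback could set self.__terminate_event so the while loop exits, the process terminates, and an orchestrator can respawn the service.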
I would like to write an integration test checking the connection of a Python script with an Azure Service Bus queue. The test should:
send a message to a queue,
confirm that the message landed in the queue.
The test looks like this:
import pytest
from azure.servicebus import ServiceBusClient, ServiceBusMessage, ServiceBusSender

CONNECTION_STRING = <some connection string>
QUEUE = <queue name>

def send_message_to_service_bus(sender: ServiceBusSender, msg: str) -> None:
    message = ServiceBusMessage(msg)
    sender.send_message(message)

class TestConnectionWithQueue:
    def test_message_is_sent_to_queue_and_received(self):
        msg = "test message sent to queue"
        expected_message = ServiceBusMessage(msg)
        servicebus_client = ServiceBusClient.from_connection_string(conn_str=CONNECTION_STRING, logging_enable=True)
        with servicebus_client:
            sender = servicebus_client.get_queue_sender(queue_name=QUEUE)
            with sender:
                send_message_to_service_bus(sender, expected_message)
            receiver = servicebus_client.get_queue_receiver(queue_name=QUEUE)
            with receiver:
                messages_in_queue = receiver.receive_messages(max_message_count=10, max_wait_time=20)
                assert any(expected_message == str(actual_message) for actual_message in messages_in_queue)
The test occasionally works, but more often than not it doesn't. There are no other messages sent to the queue at the same time. When I debug the code, if the test fails, the variable messages_in_queue is just an empty list.
Why doesn't the code work every time, and what should be done to fix it?
Are you sure you don't have another process that receives your messages? Maybe you are sharing your queue connection string with other colleagues, build machines...
To troubleshoot, keep an eye on the queue monitoring in the Azure Portal. Debug your test and check whether the incoming message count increments by 1, then continue debugging and check whether it decrements by 1.
Also, are you sure this test is useful? It looks like you are testing your infrastructure instead of testing your code.
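If the receive side also needs to be more patient, here is a sketch (not from the answer above; it reuses the question's connection string and queue placeholders) that polls until a deadline, compares message bodies as strings, and completes what it received so stale messages do not pollute the next run:
import time

from azure.servicebus import ServiceBusClient

def message_arrived(connection_string: str, queue: str, expected_body: str,
                    deadline_s: float = 60.0) -> bool:
    client = ServiceBusClient.from_connection_string(conn_str=connection_string)
    deadline = time.monotonic() + deadline_s
    with client:
        receiver = client.get_queue_receiver(queue_name=queue)
        with receiver:
            while time.monotonic() < deadline:
                for message in receiver.receive_messages(max_message_count=10, max_wait_time=5):
                    body = str(message)
                    receiver.complete_message(message)  # settle so reruns start clean
                    if body == expected_body:
                        return True
    return False
Note that the question's assert compares a ServiceBusMessage object against str(actual_message), which can never be equal; comparing the plain string body is what makes the check meaningful.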
Using Celery 4.4.2, I have a group of tasks that connect to remote devices and gather data; when that is completed, the results are collated and then emailed by the callback task.
However, if gathering the data from one of the remote devices fails, the callback also fails. I've read that using link_error should resolve this, but I'm not sure about my implementation; I have executed the below but it still failed:
for device in device_ids:
    task_count += 1
    tasks.append(config_auth.s(device.id, device.hostname, device.get_scripts_ip()))

callback = email_auth_results.s().set(link_error=error_handler.s())
tasks = group(tasks)
r = chord(tasks)(callback)
return '{} tasks sent to queue'.format(task_count)

@app.task
def error_handler(uuid):
    result = AsyncResult(uuid)
    exc = result.get(propagate=False)
    print('Task {0} raised exception: {1!r}\n{2!r}'.format(
        uuid, exc, result.traceback))
original error:
celery.exceptions.ChordError: Dependency 5ffc10c9-edc7-4b91-a660-08c372c60ab2 raised NetmikoTimeoutException('Connection to device timed-out')
I still want to log that a task failed so I can see the failures in Flower, but I want the chord to either ignore the failures or append them to the results so they just say 'failure' and I can see them in the emailed results.
Thanks
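A chord callback only runs when every header task succeeds, so one approach (a sketch, not from the original post; gather_device_data is a hypothetical helper and app is the Celery application from the question) is to catch the failure inside the task and return a failure marker. The chord then always completes, and email_auth_results can report the failed devices:
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

@app.task
def config_auth(device_id, hostname, ip):
    try:
        result = gather_device_data(device_id, hostname, ip)  # hypothetical helper
        return {'device': hostname, 'status': 'ok', 'data': result}
    except Exception as exc:  # e.g. NetmikoTimeoutException
        # Log so the failure is still visible, then return normally so the
        # chord callback runs and can include the failure in the email.
        logger.exception('Gathering data from %s failed', hostname)
        return {'device': hostname, 'status': 'failure', 'error': str(exc)}
The trade-off is that Flower will show the task as succeeded; the logged exception and the 'failure' entry in the emailed results are what preserve visibility.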
I am trying to connect with the IB API to download some historical data. I have noticed that my client connects to the API but then disconnects automatically after a very short period (a few seconds).
Here's the log in the server:
socket connection for client{10} has closed.
Connection terminated.
Here's my main code for starting the app:
class TestApp(TestWrapper, TestClient):
    def __init__(self):
        TestWrapper.__init__(self)
        TestClient.__init__(self, wrapper=self)
        self.connect(config.ib_hostname, config.ib_port, config.ib_session_id)
        self.session_id = int(config.ib_session_id)
        self.thread = Thread(target=self.run)
        self.thread.start()
        setattr(self, "_thread", self.thread)
        self.init_error()

    def reset_connection(self):
        pass

    def check_contract(self, name, exchange_name, security_type, currency):
        self.reset_connection()
        ibcontract = IBcontract()
        ibcontract.secType = security_type
        ibcontract.symbol = name
        ibcontract.exchange = exchange_name
        ibcontract.currency = currency
        return self.resolve_ib_contract(ibcontract)

    def resolve_contract(self, security):
        self.reset_connection()
        ibcontract = IBcontract()
        ibcontract.secType = security.security_type()
        ibcontract.symbol = security.name()
        ibcontract.exchange = security.exchange()
        ibcontract.currency = security.currency()
        return self.resolve_ib_contract(ibcontract)

    def get_historical_data(self, security, duration, bar_size, what_to_show):
        self.reset_connection()
        resolved_ibcontract = self.resolve_contract(security)
        data = test_app.get_IB_historical_data(resolved_ibcontract.contract, duration, bar_size, what_to_show)
        return data


def create_app():
    test_app = TestApp()
    return test_app
Any suggestions on what could be the problem? I can show more error messages from the debug if needed.
If you can connect without issue only by changing the client ID, that typically indicates that the previous connection was not properly closed and TWS thinks it is still open. To disconnect an API client you should call the EClient.disconnect function explicitly, which in your example would be:
test_app.disconnect()
Though it's not necessary to disconnect/reconnect after every task; you can just leave the connection open for extended periods.
You may sometimes encounter problems if an API function, such as reqHistoricalData, is called immediately after connection. It's best to have a small pause after initiating a connection and to wait for a callback such as nextValidId to ensure the connection is complete before proceeding; see the sketch after the link below.
http://interactivebrokers.github.io/tws-api/connection.html#connect
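A minimal sketch of that pause, assuming the ibapi package's EWrapper/EClient pair (the Event-based wait and the host/port/clientId values are illustrative, not from the original answer): block after connect() until nextValidId fires, or give up after a timeout.
import threading

from ibapi.client import EClient
from ibapi.wrapper import EWrapper

class App(EWrapper, EClient):
    def __init__(self):
        EWrapper.__init__(self)
        EClient.__init__(self, wrapper=self)
        self.connected_event = threading.Event()

    def nextValidId(self, orderId: int):
        # Called by TWS once the connection handshake is complete.
        self.connected_event.set()

app = App()
app.connect('127.0.0.1', 7497, clientId=10)             # placeholders
threading.Thread(target=app.run, daemon=True).start()   # network message loop

if not app.connected_event.wait(timeout=10):
    raise RuntimeError('TWS connection did not complete')
# ... safe to call reqHistoricalData etc. from here ...
app.disconnect()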
I'm not sure what the function init_error() is intended for in your example since it would always be called when a TestApp object is created (whether or not there is an error).
Installing the latest version of TWS API (v 9.76) solved the problem.
https://interactivebrokers.github.io/#
I am trying to use robinhood/faust, but without success.
I have already created a producer that successfully inserts into the source topic on my local confluent-kafka instance, but Faust is unable to connect to localhost.
My app.py:
import faust
import base64
import random
from datetime import datetime

SOURCE_TOPIC = "input_msgs"
TARGET_TOPIC = "output_msgs"

app = faust.App("messages-stream",
                broker="kafka://" + 'localhost:9092',
                topic_partitions=1,
                store="memory://")

class OriginalMessage(faust.Record):
    msg: str

class TransformedMessage(faust.Record):
    msg_id: int
    msg_data: str
    msg_base64: str
    created_at: float
    source_topic: str
    target_topic: str
    deleted: bool

topic = app.topic(SOURCE_TOPIC, value_type=OriginalMessage)
out_topic = app.topic(TARGET_TOPIC, partitions=1)

table = app.Table(
    "output_msgs",
    default=TransformedMessage,
    partitions=1,
    changelog_topic=out_topic,
)

print('Initializing Thread Processor...')

@app.agent(topic)
async def transformedmessage(messageevents):
    async for transfmessage in messageevents:
        try:
            table[transfmessage.msg_id] = random.randint(1, 999999)
            table[transfmessage.msg_data] = transfmessage.msg
            table[transfmessage.msg_base64] = base64.b64encode(transfmessage.msg)
            table[transfmessage.created_at] = datetime.now().isoformat()
            table[transfmessage.source_topic] = SOURCE_TOPIC
            table[transfmessage.target_topic] = TARGET_TOPIC
            table[transfmessage.deleted] = False
        except Exception as e:
            print(f"Error: {e}")

if __name__ == "__main__":
    app.main()
Error
[2020-01-24 18:05:36,910] [55712] [ERROR] Unable connect to node with id 1: [Errno 8] nodename nor servname provided, or not known
[2020-01-24 18:05:36,910] [55712] [ERROR] [^Worker]: Error: ConnectionError('No connection to node with id 1')
"No connection to node with id {}".format(node_id))
kafka.errors.ConnectionError: ConnectionError: No connection to node with id 1
I'm running with: faust -A app worker -l debug
I encountered this error, and luckily I was somewhat anticipating it so it wasn't too hard to figure out the problem.
In Confluent, you have to configure the domain that should be used to reach all the Kafka brokers that are bootstrapped for you. I didn't really know how important the domain was going to be so I just put something random in until I got stuck.
Of course I got stuck here like you did, so I fired up Wireshark to see what was happening between Faust and the bootstrap server. It turns out a bootstrap conversation goes something like this:
..............faust-1.10.4..PLAIN <-- client name
................PLAIN <-- authentication protocol
....foobar.foobar.foobar2000. <-- credentials!
b0.svs.cluster.local..#.......b1.svs.cluster.local..# <-- individual Kafka brokers
These follow the pattern of the domain names I chose, which are described in the Confluent documentation: https://docs.confluent.io/current/installation/operator/co-endpoints.html
If these names do not resolve, you get the vague error above because, despite bootstrapping successfully, the Kafka client cannot actually connect to the broker endpoints. So pick a domain you actually control, or put the entries you need in your local /etc/hosts or equivalent file:
123.111.222.1 b0.svs.cluster.local
123.111.222.2 b1.svs.cluster.local
After restarting Faust, the bootstrap and Kafka connection succeeded.
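To verify the advertised broker names before pointing Faust at them, a small check along these lines can help (a sketch, not from the original answer; it assumes confluent-kafka is installed alongside Faust and uses the localhost bootstrap address from the question):
import socket

from confluent_kafka.admin import AdminClient

def check_advertised_brokers(bootstrap: str = 'localhost:9092') -> None:
    # Ask the bootstrap server for cluster metadata, then try to resolve each
    # advertised broker hostname the same way the Kafka client will.
    metadata = AdminClient({'bootstrap.servers': bootstrap}).list_topics(timeout=10)
    for broker_id, broker in metadata.brokers.items():
        try:
            socket.gethostbyname(broker.host)
            print(f'broker {broker_id}: {broker.host}:{broker.port} resolves')
        except socket.gaierror:
            print(f'broker {broker_id}: {broker.host} does NOT resolve; add it to /etc/hosts')

check_advertised_brokers()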