I have a problem with Kafka message deserialization. I use confluent-kafka.
There is no schema registry - schemas are hardcoded.
I can connect a consumer to any topic and receive messages, but I can't deserialize those messages.
The output after "deserialization" looks something like this (the print(reader) line):
<avro.io.DatumReader object at 0x000002354235DBB0>
I think my deserialization code is wrong - how do I solve this problem?
In the end I want to extract the deserialized key and value.
from confluent_kafka import Consumer, KafkaException, KafkaError
import sys
import time
import avro.schema
from avro.io import DatumReader, DatumWriter
def kafka_conf():
conf = {''' MY CONFIGURATION'''
}
return conf
if __name__ == '__main__':
conf = kafka_conf()
topic = """MY TOPIC"""
c = Consumer(conf)
c.subscribe([topic])
try:
while True:
msg = c.poll(timeout=200.0)
if msg is None:
continue
if msg.error():
# Error or event
if msg.error().code() == KafkaError._PARTITION_EOF:
# End of partition event
sys.stderr.write('%% %s [%d] reached end at offset %d\n' %
(msg.topic(), msg.partition(), msg.offset()))
else:
# Error
raise KafkaException(msg.error())
else:
print("key: ", msg.key())
print("value: ", msg.value())
print("offset: ", msg.offset())
print("topic: ", msg.topic())
print("timestamp: ", msg.timestamp())
print("headers: ", msg.headers())
print("partition: ", msg.partition())
print("latency: ", msg.latency())
schema = avro.schema.parse(open("MY_AVRO_SCHEMA.avsc", "rb").read())
print(schema)
reader = DatumReader(msg.value, reader_schema=schema)
print(reader)
time.sleep(5) # only on test
except KeyboardInterrupt:
print('\nAborted by user\n')
finally:
c.close()
You're printing the reader object itself, not deserializing data - that is done with reader.read().
You need a BinaryDecoder as well.
The DeserializingConsumer in the Confluent library source code does exactly the same thing, except that it fetches the schema from the registry rather than the local filesystem, so I suggest you follow what it does.
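For what it's worth, a minimal sketch of that approach, assuming the messages are plain Avro binary with no Confluent wire-format header (there is no schema registry here) and reusing the schema-file placeholder from the question:
import io

import avro.schema
from avro.io import BinaryDecoder, DatumReader

# Parse the hardcoded writer schema once, outside the poll loop.
value_schema = avro.schema.parse(open("MY_AVRO_SCHEMA.avsc", "rb").read())
value_reader = DatumReader(value_schema)


def decode_value(raw_bytes):
    """Decode one Avro-encoded message value into a Python dict."""
    decoder = BinaryDecoder(io.BytesIO(raw_bytes))
    return value_reader.read(decoder)


# Inside the poll loop, after the msg.error() checks:
#     record = decode_value(msg.value())   # note: msg.value() is a method call
#     print("decoded value:", record)
# The key can be handled the same way with its own schema and DatumReader,
# assuming it is Avro-encoded at all.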
Related
I'm using confluent_kafka package for working with Kafka.
I create topic in this way:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
def my_producer():
bootstrap_servers=['my_adress.com:9092',
'my_adress.com:9092']
value_schema = avro.load('/home/ValueSchema.avsc')
avroProducer = AvroProducer({
'bootstrap.servers': bootstrap_servers[0]+','+bootstrap_servers[1],
'schema.registry.url':'http://my_adress.com:8081',
},
default_value_schema=value_schema
)
for i in range(0, 25000):
value = {"name":"Yuva","favorite_number":10,"favorite_color":"green","age":i*2}
avroProducer.produce(topic='my_topik14', value=value)
avroProducer.flush(0)
print('Finished!')
if __name__ == '__main__':
my_producer()
It works. (This gets 24820 messages instead of 25000, by the way...)
We can check it:
kafka-run-class kafka.tools.GetOffsetShell --broker-list my_adress.com:9092 --topic my_topik14
my_topik14:0:24819
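As an aside on the missing messages: flush(0) returns immediately instead of waiting for delivery, so anything still queued when my_producer() exits can be dropped, which would explain the 24820-of-25000 count. A minimal sketch of a blocking flush at the end of the loop (same avroProducer object as above):
# Wait until every queued message has been delivered (or has failed) before
# the process exits; flush() with no timeout blocks as long as needed,
# whereas flush(0) returns right away.
avroProducer.flush()
print('Finished!')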
Now I want to consume:
from confluent_kafka import KafkaError
from confluent_kafka.avro import AvroConsumer
from confluent_kafka.avro.serializer import SerializerError
bootstrap_servers=['my_adress.com:9092',
'my_adress.com:9092']
c = AvroConsumer(
{'bootstrap.servers': bootstrap_servers[0]+','+bootstrap_servers[1],
'group.id': 'avroneversleeps',
'schema.registry.url': 'http://my_adress.com:8081',
'api.version.request': True,
'fetch.min.bytes': 100000,
'consume.callback.max.messages':1000,
'batch.num.messages':2
})
c.subscribe(['my_topik14'])
running = True
while running:
msg = None
try:
msg = c.poll(0.1)
if msg:
if not msg.error():
print(msg.value())
c.commit(msg)
elif msg.error().code() != KafkaError._PARTITION_EOF:
print(msg.error())
running = False
else:
print("No Message!! Happily trying again!!")
except SerializerError as e:
print("Message deserialization failed for %s: %s" % (msg, e))
running = False
c.commit()
c.close()
But there is a problem: I read messages only one by one.
My question is: how do I read a batch of messages?
I tried different parameters in the consumer config but they didn't change anything!
I also found this question on SO and tried the same parameters - it still doesn't work.
I also read this, but it contradicts the previous link...
You can do it using the consume([num_messages=1][, timeout=-1]) method. API reference here:
For Consumer:
https://docs.confluent.io/current/clients/confluent-kafka-python/index.html#confluent_kafka.Consumer.consume
For AvroConsumer:
https://docs.confluent.io/current/clients/confluent-kafka-python/index.html?highlight=avroconsumer#confluent_kafka.Consumer.consume
More about the issue here:
https://github.com/confluentinc/confluent-kafka-python/issues/252
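For reference, a minimal sketch of the plain Consumer.consume() call (broker and topic names are reused from the question; the group id is made up):
from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers': 'my_adress.com:9092',
    'group.id': 'batch-demo',          # hypothetical group id
})
consumer.subscribe(['my_topik14'])

# Pull up to 100 messages in a single call, waiting at most 1 second.
batch = consumer.consume(num_messages=100, timeout=1.0)
for msg in batch:
    if msg.error():
        print(msg.error())
    else:
        print(msg.value())             # raw bytes - no Avro decoding here

consumer.close()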
AvroConsumer has no consume method. But it is easy to make my own implementation of this method, like the one in the Consumer class (the parent of AvroConsumer).
Here is the code:
def consume_batch(self, num_messages=1, timeout=None):
"""
This is an overridden method from the confluent_kafka.Consumer class. It handles batch message
deserialization using the Avro schema.
:param int num_messages: number of messages to read in one batch (default=1)
:param float timeout: Poll timeout in seconds (default: indefinite)
:returns: list of message objects with deserialized key and value as dict objects
:rtype: list(Message)
"""
messages_out = []
if timeout is None:
timeout = -1
messages = super(AvroConsumer, self).consume(num_messages=num_messages, timeout=timeout)
if messages is None:
return None
else:
for m in messages:
if not m.value() and not m.key():
return messages
if not m.error():
if m.value() is not None:
decoded_value = self._serializer.decode_message(m.value())
m.set_value(decoded_value)
if m.key() is not None:
decoded_key = self._serializer.decode_message(m.key())
m.set_key(decoded_key)
messages_out.append(m)
#print(len(message))
return messages_out
But after running tests, this method gives no performance increase. So it looks like it is just for better usability. Or I need to do some additional work on deserializing not a single message, but the whole batch.
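For completeness, a rough sketch of how the override might be wired up (BatchAvroConsumer is a hypothetical name; the broker, registry, group and topic values are just the placeholders used earlier):
from confluent_kafka.avro import AvroConsumer


class BatchAvroConsumer(AvroConsumer):
    # The consume_batch() method shown above belongs in this class body;
    # binding a module-level copy of the function works just as well.
    consume_batch = consume_batch


consumer = BatchAvroConsumer({
    'bootstrap.servers': 'my_adress.com:9092',
    'group.id': 'avroneversleeps',
    'schema.registry.url': 'http://my_adress.com:8081',
})
consumer.subscribe(['my_topik14'])

while True:
    batch = consumer.consume_batch(num_messages=100, timeout=1.0)
    if not batch:
        continue
    for m in batch:
        print(m.key(), m.value())      # key and value are already decoded
    consumer.commit()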
I am using paho-mqtt in Django to receive messages. Everything works fine, but the on_message() function is executed twice.
I tried debugging; it seems like the function is called once, yet the database insertion happens twice, the message is printed twice - everything within on_message() happens twice, and my data is inserted twice for each publish.
I suspected it was happening in a parallel thread and installed a Celery Redis backend to queue the insertions and avoid duplicates, but the data is still inserted twice.
I also tried locking the variables to avoid problems with parallel threading, but the data is still inserted twice.
I am using a Postgres DB.
How do I solve this issue? I want on_message() to execute only once for each publish.
my __init__.py
from . import mqtt
mqtt.client.loop_start()
my mqtt.py
import ast
import json
import paho.mqtt.client as mqtt
# Broker CONNACK response
from datetime import datetime
from raven.utils import logger
from kctsmarttransport import settings
def on_connect(client, userdata, flags, rc):
# Subscribing in on_connect() so the subscription is renewed after a reconnect
client.subscribe("data/gpsdata/server/#")
print 'subscribed to data/gpsdata/server/#'
# Receive message
def on_message(client, userdata, msg):
# from kctsmarttransport.celery import bus_position_insert_task
# bus_position_insert_task.delay(msg.payload)
from Transport.models import BusPosition
from Transport.models import Student, SpeedWarningLog, Bus
from Transport.models import Location
from Transport.models import IdleTimeLog
from pytz import timezone
try:
dumpData = json.dumps(msg.payload)
rawGpsData = json.loads(dumpData)
jsonGps = ast.literal_eval(rawGpsData)
bus = Bus.objects.get(bus_no=jsonGps['Busno'])
student = None
stop = None
if jsonGps['card'] is not False:
try:
student = Student.objects.get(rfid_value=jsonGps['UID'])
except Student.DoesNotExist:
student = None
if 'stop_id' in jsonGps:
stop = Location.objects.get(pk=jsonGps['stop_id'])
dates = datetime.strptime(jsonGps['Date&Time'], '%Y-%m-%d %H:%M:%S')
tz = timezone('Asia/Kolkata')
dates = tz.localize(dates)
lat = float(jsonGps['Latitude'])
lng = float(jsonGps['Longitude'])
speed = float(jsonGps['speed'])
# print msg.topic + " " + str(msg.payload)
busPosition = BusPosition.objects.filter(bus=bus, created_at=dates,
lat=lat,
lng=lng,
speed=speed,
geofence=stop,
student=student)
if busPosition.count() == 0:
busPosition = BusPosition.objects.create(bus=bus, created_at=dates,
lat=lat,
lng=lng,
speed=speed,
geofence=stop,
student=student)
if speed > 60:
SpeedWarningLog.objects.create(bus=busPosition.bus, speed=busPosition.speed,
lat=lat, lng=lng, created_at=dates)
sendSMS(settings.TRANSPORT_OFFICER_NUMBER, jsonGps['Busno'], jsonGps['speed'])
if speed <= 2:
try:
old_entry_query = IdleTimeLog.objects.filter(bus=bus, done=False).order_by('idle_start_time')
if old_entry_query.count() > 0:
old_entry = old_entry_query.reverse()[0]
old_entry.idle_end_time = dates
old_entry.save()
else:
new_entry = IdleTimeLog.objects.create(bus=bus, idle_start_time=dates, lat=lat, lng=lng)
except IdleTimeLog.DoesNotExist:
new_entry = IdleTimeLog.objects.create(bus=bus, idle_start_time=dates, lat=lat, lng=lng)
else:
try:
old_entry_query = IdleTimeLog.objects.filter(bus=bus, done=False).order_by('idle_start_time')
if old_entry_query.count() > 0:
old_entry = old_entry_query.reverse()[0]
old_entry.idle_end_time = dates
old_entry.done = True
old_entry.save()
except IdleTimeLog.DoesNotExist:
pass
except Exception, e:
logger.error(e.message, exc_info=True)
client = mqtt.Client()
client.on_connect = on_connect
client.on_message = on_message
client.connect("10.1.75.106", 1883, 60)
As someone mentioned in the comments, run your server using --noreload, e.g.:
python manage.py runserver --noreload
The development server's autoreloader runs your startup code in a second process, which is the likely reason every publish is handled twice.
(Posted here for better visibility.)
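If disabling the autoreloader is not an option, one possible guard in __init__.py is sketched below. RUN_MAIN is an implementation detail of Django's autoreloader (only the process that actually serves requests sets it to 'true'), so treat this as an assumption rather than a documented API:
import os
import sys

from . import mqtt

# Under `manage.py runserver` the autoreloader imports this module twice: once
# in the watcher (parent) process and once in the serving child, which sets
# RUN_MAIN='true'.  Two imports mean two MQTT clients, so every publish is
# handled twice.  Skip only the watcher process; start the loop everywhere else.
in_autoreload_watcher = (
    'runserver' in sys.argv
    and '--noreload' not in sys.argv
    and os.environ.get('RUN_MAIN') != 'true'   # only the serving child sets this
)
if not in_autoreload_watcher:
    mqtt.client.loop_start()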
I had the same problem!
Try using:
def on_disconnect(client, userdata, rc):
client.loop_stop(force=False)
if rc != 0:
print("Unexpected disconnection.")
else:
print("Disconnected")
I'm trying to modify a ZeroMQ example for processing background tasks and get it working. In particular, I have an XPUB/XSUB socket setup, and a client would subscribe to the publisher to receive progress updates and results from the worker.
worker_server.py
proxy = zmq.devices.ThreadDevice(zmq.QUEUE, zmq.XSUB, zmq.XPUB)
proxy.bind_in('tcp://127.0.0.1:5002')
proxy.bind_out('tcp://127.0.0.1:5003')
proxy.start()
client.py
ctx = zmq.Context()
socket = server.create_socket(ctx, 'sub')
socket.setsockopt(zmq.SUBSCRIBE, '')
poller = zmq.Poller()
print 'polling'
poller.register(socket, zmq.POLLIN)
ready = dict(poller.poll())
print 'polling done'
if ready and ready.has_key(socket):
job_id, code, result = socket.recv_multipart()
return {'status': code, 'data': result}
So far, the code works for small messages. However, when the worker tries to publish the task results, which are large (35393030 bytes), the client does not receive the message and the code hangs at ready = dict(poller.poll()). Now, I just started learning to use zmq, but isn't send_multipart supposed to chunk the message? What is causing the client to not receive the results?
worker.py
def worker(logger_name, method, **task_kwargs):
job_id = os.getpid()
ctx = zmq.Context()
socket = create_socket(ctx, 'pub')
time.sleep(1)
logger = logging.getLogger(logger_name)
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
sh = WSLoggingHandler(socket, job_id)
fh = logging.FileHandler(filename=os.path.join(tmp_folder, 'classifier.log.txt'), encoding='utf-8')
logger.addHandler(ch)
logger.addHandler(sh)
logger.addHandler(fh)
modules_arr = method.split('.')
m = __import__(".".join(modules_arr[:-1]), globals(), locals(), -1)
fn = getattr(m, modules_arr[-1])
try:
results = fn(**task_kwargs)
print 'size of data file %s' %len(results)
data = [
str(job_id),
SUCCESS_CODE,
results
]
tracker = socket.send_multipart(data)
print 'sent!!!'
except Exception, e:
print traceback.format_exc()
socket.send_multipart((
str(job_id),
ERROR_CODE,
str(e)
))
finally:
socket.close()
EDIT:
I tried manually splitting the results into smaller chunks, but haven't had success.
results = fn(**task_kwargs)
print 'size of data file %s' %len(results)
data = [
str(job_id),
SUCCESS_CODE,
] + [results[i: i + 20] for i in xrange(0, len(results), 20)]
print 'list size %s' %len(data)
tracker = socket.send_multipart(data)
print 'sent!!!'
From the pyzmq documentation:
https://zeromq.github.io/pyzmq/api/zmq.html#zmq.Socket.send_multipart
msg_parts : iterable
A sequence of objects to send as a multipart message. Each element can be any sendable object (Frame, bytes, buffer-providers)
The message doesn't get chunked automatically; each element in the iterable you pass in is a chunk. So the way you have it set up, all of your result data will be one chunk. You'll need to use an iterator that splits your results into chunks of an appropriate size.
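A rough sketch of that idea, reusing job_id, SUCCESS_CODE, results and the sockets from the code above; the 512 KiB chunk size is an arbitrary assumption:
CHUNK_SIZE = 512 * 1024  # bytes per frame; pick whatever suits your transport


def chunked(data, size=CHUNK_SIZE):
    """Yield successive slices of `data`, each at most `size` bytes long."""
    for i in range(0, len(data), size):
        yield data[i:i + size]


# Worker side: metadata frames first, then one frame per payload chunk.
frames = [str(job_id), SUCCESS_CODE] + list(chunked(results))
socket.send_multipart(frames)

# Client side: the first two frames are metadata, the rest is the payload.
parts = socket.recv_multipart()
job_id, code, result = parts[0], parts[1], b''.join(parts[2:])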
I've read many answers, but I have not found a proper solution.
The problem: I'm reading mixed/replace HTTP streams that will not expire or end by default.
You can try it by yourself using curl:
curl http://agent.mtconnect.org/sample\?interval\=0
So, now I'm using Python threads and requests to read data from multiple streams.
import requests
import uuid
from threading import Thread
tasks = ['http://agent.mtconnect.org/sample?interval=5000',
'http://agent.mtconnect.org/sample?interval=10000']
thread_id = []
def http_handler(thread_id, url, flag):
print 'Starting task %s' % thread_id
try:
requests_stream = requests.get(url, stream=True, timeout=2)
for line in requests_stream.iter_lines():
if line:
print line
if flag and line.endswith('</MTConnectStreams>'):
# Wait until XML message end is reached to receive the full message
break
except requests.exceptions.RequestException as e:
print('error: ', e)
except BaseException as e:
print e
if __name__ == '__main__':
for task in tasks:
uid = str(uuid.uuid4())
thread_id.append(uid)
t = Thread(target=http_handler, args=(uid, task, False), name=uid)
t.start()
print thread_id
# Wait Time X or until user is doing something
# Send flag = to desired thread to indicate the loop should stop after reaching the end.
Any suggestions? What is the best solution? I don't want to kill the thread, because I would like to read to the end so I have a full XML message.
I found a solution using the threading module and threading.Event. Maybe not the best solution, but it works fine for now.
import logging
import threading
import time
import uuid
import requests
logging.basicConfig(level=logging.DEBUG, format='(%(threadName)-10s) %(message)s', )
tasks = ['http://agent.mtconnect.org/sample?interval=5000',
'http://agent.mtconnect.org/sample?interval=10000']
d = dict()
def http_handler(e, url):
logging.debug('wait_for_event starting')
message_buffer = []
filter_namespace = True
try:
requests_stream = requests.get(url, stream=True, timeout=2)
for line in requests_stream.iter_lines():
if line:
message_buffer.append(line)
if e.isSet() and line.endswith('</MTConnectStreams>'):
logging.debug(len(message_buffer))
break
except requests.exceptions.RequestException as e:
print('error: ', e)
except BaseException as e:
print e
if __name__ == '__main__':
logging.debug('Waiting before calling Event.set()')
for task in tasks:
uid = str(uuid.uuid4())
e = threading.Event()
d[uid] = {"stop_event": e}
t = threading.Thread(name=uid,
target=http_handler,
args=(e, task))
t.start()
logging.debug('Waiting 3 seconds before calling Event.set()')
for key in d:
time.sleep(3)
logging.debug(threading.enumerate())
logging.debug(d[key])
d[key]['stop_event'].set()
logging.debug('bye')
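A small follow-up sketch of the same main block that also keeps the Thread objects, so the main thread can join() them after setting the stop events and be sure the closing </MTConnectStreams> tag has been read before exiting (names reused from the code above):
if __name__ == '__main__':
    for task in tasks:
        uid = str(uuid.uuid4())
        stop_event = threading.Event()
        t = threading.Thread(name=uid, target=http_handler, args=(stop_event, task))
        d[uid] = {"stop_event": stop_event, "thread": t}
        t.start()

    time.sleep(3)  # or wait for user input / some other trigger

    for key in d:
        d[key]['stop_event'].set()
    for key in d:
        d[key]['thread'].join()   # blocks until that stream handler returns
    logging.debug('bye')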
I'm currently in the process of learning SSH via the brute-force, "just keep hacking until I understand it" approach. After some trial and error I've been able to successfully send a "pty-req" followed by a "shell" request: I can get the login preamble, send commands, and receive stdout, but I'm not exactly sure how to tell the SSH service that I want to receive stderr and status messages. Reading through other SSH implementations (Paramiko, Net::SSH) hasn't been much of a guide so far.
That said, looking at one of the RFCs for SSH, I believe one of the listed channel requests might be what I am looking for: https://www.rfc-editor.org/rfc/rfc4250#section-4.9.3
#!/usr/bin/env python
from twisted.conch.ssh import transport
from twisted.conch.ssh import userauth
from twisted.conch.ssh import connection
from twisted.conch.ssh import common
from twisted.conch.ssh.common import NS
from twisted.conch.ssh import keys
from twisted.conch.ssh import channel
from twisted.conch.ssh import session
from twisted.internet import defer
from twisted.internet import defer, protocol, reactor
from twisted.python import log
import struct, sys, getpass, os
log.startLogging(sys.stdout)
USER = 'dward'
HOST = '192.168.0.19' # pristine.local
PASSWD = "password"
PRIVATE_KEY = "~/id_rsa"
class SimpleTransport(transport.SSHClientTransport):
def verifyHostKey(self, hostKey, fingerprint):
print 'host key fingerprint: %s' % fingerprint
return defer.succeed(1)
def connectionSecure(self):
self.requestService(
SimpleUserAuth(USER,
SimpleConnection()))
class SimpleUserAuth(userauth.SSHUserAuthClient):
def getPassword(self):
return defer.succeed(PASSWD)
def getGenericAnswers(self, name, instruction, questions):
print name
print instruction
answers = []
for prompt, echo in questions:
if echo:
answer = raw_input(prompt)
else:
answer = getpass.getpass(prompt)
answers.append(answer)
return defer.succeed(answers)
def getPublicKey(self):
path = os.path.expanduser(PRIVATE_KEY)
# this works with rsa too
# just change the name here and in getPrivateKey
if not os.path.exists(path) or self.lastPublicKey:
# the file doesn't exist, or we've tried a public key
return
return keys.Key.fromFile(filename=path+'.pub').blob()
def getPrivateKey(self):
path = os.path.expanduser(PRIVATE_KEY)
return defer.succeed(keys.Key.fromFile(path).keyObject)
class SimpleConnection(connection.SSHConnection):
def serviceStarted(self):
self.openChannel(SmartChannel(2**16, 2**15, self))
class SmartChannel(channel.SSHChannel):
name = "session"
def getResponse(self, timeout = 10):
self.onData = defer.Deferred()
self.timeout = reactor.callLater( timeout, self.onData.errback, Exception("Timeout") )
return self.onData
def openFailed(self, reason):
print "Failed", reason
@defer.inlineCallbacks
def channelOpen(self, ignoredData):
self.data = ''
self.oldData = ''
self.onData = None
self.timeout = None
term = os.environ.get('TERM', 'xterm')
#winsz = fcntl.ioctl(fd, tty.TIOCGWINSZ, '12345678')
winSize = (25,80,0,0) #struct.unpack('4H', winsz)
ptyReqData = session.packRequest_pty_req(term, winSize, '')
try:
result = yield self.conn.sendRequest(self, 'pty-req', ptyReqData, wantReply = 1 )
except Exception as e:
print "Failed with ", e
try:
result = yield self.conn.sendRequest(self, "shell", '', wantReply = 1)
except Exception as e:
print "Failed shell with ", e
#fetch preamble
data = yield self.getResponse()
"""
Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-8-server x86_64)
* Documentation: http://www.ubuntu.com/server/doc
System information as of Sat Oct 29 13:09:50 MDT 2011
System load: 0.0 Processes: 111
Usage of /: 48.0% of 6.62GB Users logged in: 1
Memory usage: 39% IP address for eth1: 192.168.0.19
Swap usage: 3%
Graph this data and manage this system at https://landscape.canonical.com/
New release 'oneiric' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Sat Oct 29 01:23:16 2011 from 192.168.0.17
"""
print data
while data != "" and data.strip().endswith("~$") == False:
try:
data = yield self.getResponse()
print repr(data)
"""
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
"""
except Exception as e:
print e
break
self.write("false\n")
#fetch response
try:
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
else:
print data
"""
false
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
"""
self.write("true\n")
#fetch response
try:
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
else:
print data
"""
true
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
"""
self.write("echo Hello World\n\x00")
try:
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
else:
print data
"""
echo Hello World
Hello World
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
"""
#Close up shop
self.loseConnection()
dbgp = 1
def request_exit_status(self, data):
status = struct.unpack('>L', data)[0]
print 'status was: %s' % status
def dataReceived(self, data):
self.data += data
if self.onData is not None:
if self.timeout and self.timeout.active():
self.timeout.cancel()
if self.onData.called == False:
self.onData.callback(data)
def extReceived(self, dataType, data):
dbgp = 1
print "Extended Data recieved! dataType = %s , data = %s " % ( dataType, data, )
self.extendData = data
def closed(self):
print 'got data : %s' % self.data.replace("\\r\\n","\r\n")
self.loseConnection()
reactor.stop()
protocol.ClientCreator(reactor, SimpleTransport).connectTCP(HOST, 22)
reactor.run()
Additionally I tried adding in an explicit bad command to the remote shell:
self.write("ls -alF badPathHere\n\x00")
try:
data = yield self.getResponse()
except Exception as e:
print "Failed to catch response?", e
else:
print data
"""
ls -alF badPathHere
ls: cannot access badPathHere: No such file or directory
\x1B]0;dward@pristine: ~\x07dward@pristine:~$
"""
And it looks like stderr is being mixed into stdout.
Digging through the OpenSSH source code, channel session logic is handled in session.c, around line 2227, in the function session_input_channel_req. Given a "pty-req" followed by a "shell" request, this leads to do_exec_pty, which ultimately calls session_set_fds(s, ptyfd, fdout, -1, 1, 1). The fourth argument would normally be a file descriptor responsible for handling stderr, but since none is supplied, there won't be any extended data for stderr.
Ultimately, even if I modified OpenSSH to provide a stderr FD, the problem resides with the shell. Complete guesswork at this point, but I believe that, just as when logging into an SSH service via a terminal like xterm or PuTTY, stderr and stdout are sent together unless explicitly redirected via something like "2> someFile", which is beyond the scope of an SSH service provider.
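For what it's worth, a hedged sketch of the alternative: if separated stderr and an exit status are the goal, skipping the pty and sending a single "exec" request leaves stderr on its own descriptor on the server side, so it arrives via extReceived as extended data. This is a different approach from the pty-req/shell flow above, not a fix for it, and the class below is hypothetical:
import struct

from twisted.conch.ssh import channel, common, connection


class ExecChannel(channel.SSHChannel):
    """Session channel that runs one command without allocating a pty."""
    name = 'session'
    command = 'ls -alF badPathHere'   # example command from above

    def channelOpen(self, ignoredData):
        # 'exec' instead of 'pty-req' + 'shell': with no pty the server keeps
        # stderr separate and delivers it as extended data.
        d = self.conn.sendRequest(self, 'exec', common.NS(self.command), wantReply=1)
        d.addErrback(lambda reason: self.loseConnection())

    def dataReceived(self, data):
        print 'stdout: %r' % data

    def extReceived(self, dataType, data):
        if dataType == connection.EXTENDED_DATA_STDERR:
            print 'stderr: %r' % data

    def request_exit_status(self, data):
        # "exit-status" channel request (RFC 4254, section 6.10)
        print 'exit status: %d' % struct.unpack('>L', data)[0]

    def closed(self):
        print 'exec channel closed'
It would be opened from SimpleConnection.serviceStarted() the same way SmartChannel is opened above.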