Differentiating between binary encoded Avro and JSON messages

Differentiating between binary encoded Avro and JSON messages - python

I'm using python to read messages coming from various topics. Some topics have got their messages encoded in plain JSON, while others are using Avro binary serialization, with confluent schema registry.
When I receive a message I need to know if it has to be decoded. At the moment I'm only relying on the fact that the binary encoded messages are starting with a MAGIC_BYTE which value is zero:
from confluent_kafka.cimpl import Consumer
consumer = Consumer(config)
consumer.subsrcibe(...)
msg = consumer.poll()
# check the msg is not null or error etc
if msg.values()[0] == 0:
# It is binary encoded
else:
# It is json
I was wondering is there's a better way to do that?

You could get bytes 0-5 of your message, then
magic_byte = message_bytes[0]
schema_id = message_bytes[1:5]
Then, perform a lookup against your registry for GET /schemas/{schema_id}, and cache the ID + schema (if needed) when you get a 200 response code.
Otherwise, the message is either JSON, or the producer had sent its data to a different registry (if there is more than one in your environment). Note: this means the data could still be Avro

You can simply query the schema registry through REST first, and build a local cache of the topics that are registered there. Then, when you're trying to decode a message from a particular topic, simply compare the topic to the contents of that list. If it's there, you know it has be decoded.
Of course, this only works if all the topics that are Avro encoded are using Schema Registry. If you ever receive an Avro-encoded message that is not registered with Schema Registry, then it won't work.

Related

Send already serialized message inside message

I'm using Protobuf with the C++ API and I have a standart message I send between 2 different softwares and I want to add a raw nested message as data.
So I added a message like this:
Message main{
string id=1;
string data=2;
}
I tried to serialize some nested messages I made to a string and send it as "data" with "main" message but it doesn't work well on the parser side.
How can I send nested serialized message inside a message using c++ and python api.

Basically, use bytes:
message main {
string id=1;
bytes data=2;
}
In addition to not corrupting the data (string is strictly UTF-8), as long as the payload is a standard message, this is also compatible with changing it later (at either end, or both) to the known type:
``` proto
message main {
string id=1;
TheOtherMessageType data=2;
}
message TheOtherMessageType {...}
(or even using both versions at different times depending on which is most convenient)

How to hash a CAN message?

I want to Hash a CAN message received from a vehicle. The following code written in Python is used to receive the CAN message(dev.recv()) from the vehicle and print the received message (dev.send()).I want to hash the CAN message present in dev.recv()function before sending the message using dev.send().Is this possible? If so how can it be done?
from canard.hw import socketcan
dev = socketcan.SocketCanDev(’can0’)
dev.start()
while True:
f = dev.recv()
dev.send(f)
`

I am not sure what the data type is for "f", the data you receive from the function recv.
I am guessing that SocketCanDev is just a wrapper for the device, and recv acts very similar to the function, read().
So, "f" in your code might be interpreted as an array of bytes, or chars.
Hashing is done to an array of bytes, regardless of the format of the
string.
And, the result of the hashing does not depend on the input format or data type.
Therefore, in your case,
while True:
f = dev.recv()
result = (hashFunction)(f)
dev.send(result) // result should in the data type that the function send can accept as a parameter
(hashFunction) can be replaced with an actual function from a hashing library, such as "hashlib".

If you are interested in cryptographic hashing, you should take a look at hashlib
With it you should be able to hash the message and send the hash like this
H = hashlib.new('sha256', message)
dev.send(H.digest())
If you want to still send the original message besides the hash, you could make two calls to send.

Confluent-Kafka: Avro Serialization Confusions with Schema Handling in Python Consumers

I am try to understand Avro Serialization on Confluent Kafka along with Schema Registry usage. It was all going well till the end but the final expectations from AVRO made lots of Confusions to me. As per my reading and understanding, Avro Serialization gives us the flexibility that when we have a change in schema, we can simply manage that without impacting the older producer/consumer.
Following the same, I have developed a python producer which will Check for a Schema existence in Schema-Registry, if absent, create it and start Producing the json messages show below. When I need to change schema, I simply update it in my producer and this produces messages with new schema.
My Old Schema :
data = '{"schema":"{\\"type\\":\\"record\\",\\"name\\":\\"value\\",\\"namespace\\":\\"my.test\\",\\"fields\\":[{\\"name\\":\\"fname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"lname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"email\\",\\"type\\":\\"string\\"},{\\"name\\":\\"principal\\",\\"type\\":\\"string\\"},{\\"name\\":\\"ipaddress\\",\\"type\\":\\"string\\"},{\\"name\\":\\"mobile\\",\\"type\\":\\"long\\"},{\\"name\\":\\"passport_make_date\\",\\"type\\":[\\"string\\",\\"null\\"],\\"logicalType\\":\\"timestamp\\",\\"default\\":\\"None\\"},{\\"name\\":\\"passport_expiry_date\\",\\"type\\":\\"string\\",\\"logicalType\\":\\"date\\"}]}"}'
Sample Data from Producer-1 :
{u'mobile': 9819841242, u'lname': u'Rogers', u'passport_expiry_date': u'2026-05-21', u'passport_make_date': u'2016-05-21', u'fname': u'tom', u'ipaddress': u'208.103.236.60', u'email': u'tom_Rogers#TEST.co.nz', u'principal': u'tom#EXAMPLE.COM'}
My New Schema:
data = '{"schema":"{\\"type\\":\\"record\\",\\"name\\":\\"value\\",\\"namespace\\":\\"my.test\\",\\"fields\\":[{\\"name\\":\\"fname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"lname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"email\\",\\"type\\":\\"string\\"},{\\"name\\":\\"principal\\",\\"type\\":\\"string\\"},{\\"name\\":\\"ipaddress\\",\\"type\\":\\"string\\"},{\\"name\\":\\"mobile\\",\\"type\\":\\"long\\"},{\\"name\\":\\"new_passport_make_date\\",\\"type\\":[\\"string\\",\\"null\\"],\\"logicalType\\":\\"timestamp\\",\\"default\\":\\"None\\"},{\\"name\\":\\"new_passport_expiry_date\\",\\"type\\":\\"string\\",\\"logicalType\\":\\"date\\"}]}"}'
Sample Data from Producer-2 :
{u'mobile': 9800647004, u'new_passport_make_date': u'2011-05-22', u'lname': u'Reed', u'fname': u'Paul', u'new_passport_expiry_date': u'2021-05-22', u'ipaddress': u'134.124.7.28', u'email': u'Paul_Reed#nbc.com', u'principal': u'Paul#EXAMPLE.COM'}
Case 1: when I have 2 producers with above 2 schemas running together, I can successfully consume message with below code. All is well till here.
while True:
try:
msg = c.poll(10)
except SerializerError as e:
xxxxx
break
print msg.value()
Case 2: When I go little deeper in JSON fields, things mixes up and breaks.
At first, say I have one producer running with ‘My Old Schema’ above and one consumer consuming these messages successfully.
print msg.value()["fname"] , msg.value()["lname"] , msg.value()["passport_make_date"], msg.value()["passport_expiry_date"]
When I run 2nd producer with ‘My New Schema’ mentioned above, my Old Consumers breaks as there is No Field passport_expiry_date and passport_make_date which is True.
Question:
Sometime I think, this is expected as it’s me(Developer) who is using the field names which are Not in the Message. But how Avro can help here? Shouldn't the missing field be handled by Avro? I saw examples in JAVA where this situation was handled properly but did not find any example in Python. For example, below github has perfect example of handling this scenario. When the field is not present, Consumer simply prints 'None'.
https://github.com/LearningJournal/ApacheKafkaTutorials
Case 3: When I run the combinations like Old Producer with Old Consumer and then in another terminals New Producer with New Consumer, Producers/Consumers mixes up and things break saying no json field.
Old Consumer ==>
print msg.value()["fname"] , msg.value()["lname"] , msg.value()["passport_make_date"], msg.value()["passport_expiry_date"]
New Consumer ==>
print msg.value()["fname"] , msg.value()["lname"] , msg.value()["new_passport_make_date"], msg.value()["new_passport_expiry_date"]
Question:
Again I think, this is expected. But, then Avro makes me think the right Consumer should get the right message with right schema. If I use msg.value() and always parse the fields at consumer side using programming without any role of Avro, then Where is the benefit of using avro? What is the benefit of sending schema with the messages/storing in SR?
Lastly, is there any way to check the schema attached to a message? I understand, in Avro, schema ID is attached with the message which is used further with Schema Registry while Reading and Writing messages. But I never see it with the messages.
Thanks much in Advance.

It's not clear what compatibility setting you're using on the registry, but I will assume backwards, which means you would have needed to add a field with a default.
Sounds like you're getting a Python KeyError because those keys don't exist.
Instead of msg.value()["non-existing-key"], you can try
option 1: treat it like a dict()
msg.value().get("non-existing-key", "Default value")
option 2: check individually for all the keys that might not be there
some_var = None # What you want to parse
val = msg.value()
if "non-existing-key" not in val:
some_var = "Default Value"
Otherwise, you must "project" the newer schema over the older data, which is what the Java code is doing by using a SpecificRecord subclass. That way, the older data would be parsed with the newer schema, which has the newer fields with their defaults.
If you used GenericRecord in Java instead, you would have similar problems. I'm not sure in Python there is an equivalent to Java's SpecificRecord.
By the way, I don't think the string "None" can be applied for a logicalType=timestamp

data type of message in volttron pubsub

What is the data type for "message" in pubsub used by volttron? I have checked the documentation but there is nothing mentioned about this. When checking the source I found this function comment source :
param headers: header info for the message,
type headers: None or dict,
param message: actual message,
type message: None or any
Are the above info correct? Does that "any" type refer to this: typing.Any?

The message can be any Python object that can be serialized into JSON. Typically this will be something specifically defined by the Agent publishing the message that aligns with the purpose of the message. Usually this will be a dictionary or list, but occasionally messages will be numbers or strings. VOLTTRON does not place any restrictions on the structure of the data as long as it can be serialized.
It is up to agents define the datatype of the message and document it for use by other agents.
Nested data structures are allowed as they are in JSON.

Unable to extract the body of the email file in python

I am reading an email file stored in my machine,able to extract the headers of the email, but unable to extract the body.
# The following part is working , opening a file and reading the header .
import email
from email.parser import HeaderParser
with open(passedArgument1+filename,"r",encoding="ISO-8859-1") as f:
msg=email.message_from_file(f)
print('message',msg.as_string())
parser = HeaderParser()
h = parser.parsestr(msg.as_string())
print (h.keys())
# The following snippet gives error
msgBody=msg.get_body('text/plain')
Is there any proper way to extract only the body message.Stuck at this point.
For reference the email file can be downloaded from
https://drive.google.com/file/d/0B3XlF206d5UrOW5xZ3FmV3M3Rzg/view

The 3.6 email lib uses an API that is compatible with Python 3.2 by default and that is what is causing you this problem.
Note the default policy in the declaration below from the docs:
email.message_from_file(fp, _class=None, *, policy=policy.compat32)
If you want to use the "new" API that you see in the 3.6 docs, you have to create the message with a different policy.
import email
from email import policy
...
msg=email.message_from_file(f, policy=policy.default)
will give you the new API that you see in the docs which will include the very useful: get_body()

Update
If you are having the AttributeError: 'Message' object has no attribute 'get_body' error, you might want to read what follows.
I did some tests, and it seems the doc is indeed erroneous compared to the current library implementation (July 2017).
What you might be looking for is actually the function get_payload() it seems to do what you want to achieve:
The conceptual model provided by an EmailMessage object is that of an
ordered dictionary of headers coupled with a payload that represents
the RFC 5322 body of the message, which might be a list of
sub-EmailMessage objects
get_payload() is not in current July 2017 Documentation, but the help() says the following:
get_payload(i=None, decode=False) method of email.message.Message instance
Return a reference to the payload.
The payload will either be a list object or a string. If you mutate
the list object, you modify the message's payload in place. Optional
i returns that index into the payload.
Optional decode is a flag indicating whether the payload should be decoded or not, according to the Content-Transfer-Encoding
header (default is False).
When True and the message is not a multipart, the payload will be decoded if this header's value is 'quoted-printable' or 'base64'. If some other encoding is used, or the header is missing, or if the payload has bogus data (i.e. bogus base64 or uuencoded data), the payload is returned as-is.
If the message is a multipart and the decode flag is True, then None is returned.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Differentiating between binary encoded Avro and JSON messages - python

Related

Send already serialized message inside message

How to hash a CAN message?

Confluent-Kafka: Avro Serialization Confusions with Schema Handling in Python Consumers

data type of message in volttron pubsub

Unable to extract the body of the email file in python

Categories

Resources