Confluent-Kafka: Avro Serialization Confusions with Schema Handling in Python Consumers

I am trying to understand Avro serialization on Confluent Kafka together with Schema Registry. It was all going well until the end, but my final expectations of Avro left me confused. As I understand it, Avro serialization gives us the flexibility to change a schema without impacting older producers and consumers.
Following that, I developed a Python producer that checks whether a schema exists in Schema Registry, creates it if absent, and then starts producing the JSON messages shown below. When I need to change the schema, I simply update it in my producer, which then produces messages with the new schema.
My Old Schema :
data = '{"schema":"{\\"type\\":\\"record\\",\\"name\\":\\"value\\",\\"namespace\\":\\"my.test\\",\\"fields\\":[{\\"name\\":\\"fname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"lname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"email\\",\\"type\\":\\"string\\"},{\\"name\\":\\"principal\\",\\"type\\":\\"string\\"},{\\"name\\":\\"ipaddress\\",\\"type\\":\\"string\\"},{\\"name\\":\\"mobile\\",\\"type\\":\\"long\\"},{\\"name\\":\\"passport_make_date\\",\\"type\\":[\\"string\\",\\"null\\"],\\"logicalType\\":\\"timestamp\\",\\"default\\":\\"None\\"},{\\"name\\":\\"passport_expiry_date\\",\\"type\\":\\"string\\",\\"logicalType\\":\\"date\\"}]}"}'
Sample Data from Producer-1 :
{u'mobile': 9819841242, u'lname': u'Rogers', u'passport_expiry_date': u'2026-05-21', u'passport_make_date': u'2016-05-21', u'fname': u'tom', u'ipaddress': u'208.103.236.60', u'email': u'tom_Rogers#TEST.co.nz', u'principal': u'tom#EXAMPLE.COM'}
My New Schema:
data = '{"schema":"{\\"type\\":\\"record\\",\\"name\\":\\"value\\",\\"namespace\\":\\"my.test\\",\\"fields\\":[{\\"name\\":\\"fname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"lname\\",\\"type\\":\\"string\\"},{\\"name\\":\\"email\\",\\"type\\":\\"string\\"},{\\"name\\":\\"principal\\",\\"type\\":\\"string\\"},{\\"name\\":\\"ipaddress\\",\\"type\\":\\"string\\"},{\\"name\\":\\"mobile\\",\\"type\\":\\"long\\"},{\\"name\\":\\"new_passport_make_date\\",\\"type\\":[\\"string\\",\\"null\\"],\\"logicalType\\":\\"timestamp\\",\\"default\\":\\"None\\"},{\\"name\\":\\"new_passport_expiry_date\\",\\"type\\":\\"string\\",\\"logicalType\\":\\"date\\"}]}"}'
Sample Data from Producer-2 :
{u'mobile': 9800647004, u'new_passport_make_date': u'2011-05-22', u'lname': u'Reed', u'fname': u'Paul', u'new_passport_expiry_date': u'2021-05-22', u'ipaddress': u'134.124.7.28', u'email': u'Paul_Reed#nbc.com', u'principal': u'Paul#EXAMPLE.COM'}
Case 1: When I have two producers running together with the two schemas above, I can successfully consume messages with the code below. All is well up to here.
while True:
    try:
        msg = c.poll(10)
    except SerializerError as e:
        xxxxx
        break
    print msg.value()
Case 2: When I go a little deeper into the JSON fields, things get mixed up and break.
At first, say I have one producer running with ‘My Old Schema’ above and one consumer consuming these messages successfully.
print msg.value()["fname"] , msg.value()["lname"] , msg.value()["passport_make_date"], msg.value()["passport_expiry_date"]
When I run a second producer with ‘My New Schema’ mentioned above, my old consumer breaks because the new messages have no passport_expiry_date and passport_make_date fields, which is true.
Question:
Sometimes I think this is expected, since it is me (the developer) who is using field names that are not in the message. But how can Avro help here? Shouldn't the missing field be handled by Avro? I saw examples in Java where this situation was handled properly, but did not find any example in Python. For example, the GitHub repository below has a perfect example of handling this scenario. When the field is not present, the consumer simply prints 'None'.
https://github.com/LearningJournal/ApacheKafkaTutorials
Case 3: When I run combinations such as the old producer with the old consumer and, in other terminals, the new producer with the new consumer, the producers and consumers get mixed up and things break with an error about a missing JSON field.
Old Consumer ==>
print msg.value()["fname"] , msg.value()["lname"] , msg.value()["passport_make_date"], msg.value()["passport_expiry_date"]
New Consumer ==>
print msg.value()["fname"] , msg.value()["lname"] , msg.value()["new_passport_make_date"], msg.value()["new_passport_expiry_date"]
Question:
Again, I think this is expected. But then Avro makes me think the right consumer should get the right message with the right schema. If I use msg.value() and always parse the fields on the consumer side in my own code, without Avro playing any role, then where is the benefit of using Avro? What is the benefit of sending the schema with the messages or storing it in Schema Registry?
Lastly, is there any way to check the schema attached to a message? I understand that in Avro a schema ID is attached to the message and is then used with Schema Registry when reading and writing messages. But I never see it in the messages.
Thanks very much in advance.

It's not clear what compatibility setting you're using on the registry, but I will assume backwards, which means you would have needed to add a field with a default.
Sounds like you're getting a Python KeyError because those keys don't exist.
Instead of msg.value()["non-existing-key"], you can try
option 1: treat it like a dict()
msg.value().get("non-existing-key", "Default value")
option 2: check individually for all the keys that might not be there
some_var = None  # What you want to parse
val = msg.value()
if "non-existing-key" not in val:
    some_var = "Default Value"
Otherwise, you must "project" the newer schema over the older data, which is what the Java code is doing by using a SpecificRecord subclass. That way, the older data would be parsed with the newer schema, which has the newer fields with their defaults.
If you used GenericRecord in Java instead, you would have similar problems. I'm not sure whether Python has an equivalent to Java's SpecificRecord.
By the way, I don't think the string "None" is a valid default for a field with logicalType=timestamp.
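To make the "projection" idea above concrete, here is a minimal plain-Python sketch (Python 3) of what a reader schema does to an already decoded record: fields the reader does not know are dropped, and fields missing from the message are filled from the reader schema's defaults. The helper name project and the trimmed READER_SCHEMA are assumptions made for illustration; this is not part of confluent-kafka or of any Avro library, and it only mimics the default-filling part of Avro schema resolution.

```python
# Reader schema as a plain dict. Field names follow the question; the
# ["null", "string"] union with a null default is an assumption made here.
READER_SCHEMA = {
    "type": "record",
    "name": "value",
    "namespace": "my.test",
    "fields": [
        {"name": "fname", "type": "string"},
        {"name": "lname", "type": "string"},
        {"name": "new_passport_make_date",
         "type": ["null", "string"], "default": None},
    ],
}


def project(record, reader_schema):
    """Hypothetical helper: view a decoded record through a reader schema."""
    projected = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            # The writer supplied this field, so keep its value.
            projected[name] = record[name]
        elif "default" in field:
            # The writer did not know the field, so use the reader's default.
            projected[name] = field["default"]
        else:
            raise KeyError("no value and no default for field %r" % name)
    return projected


# A message written with the old schema (shortened for the example).
old_message = {"fname": "tom", "lname": "Rogers",
               "passport_make_date": "2016-05-21"}
projected = project(old_message, READER_SCHEMA)
print(projected["new_passport_make_date"])  # None
```

With this shape, a consumer written against the new field names can read old messages without a KeyError, which is roughly the behavior the Java example relies on.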

Related

Python JIRA Non-Mandatory fields being forced on create issue

When using the jira Python library to create issues, non-mandatory fields are being enforced on the create_issue call.
Response on create issue attempt:
text: No issue link type with name 'Automated' found.
Response on create meta call to check mandatory fields:
'hasDefaultValue': False,
u'key': u'issuelinks',
u'name': u'Linked Issues',
u'operations': [u'add'],
u'required': False,

I had a similar issue, and after a bit of digging around this is what I did.
Open a Jira issue and, using the browser developer tools (F12), find the IDs of the mandatory custom fields. They should be named something like "customfield_10304".
Once you have these field IDs, use them the same way you set other fields when creating an issue. For example:
new_issue = jira.create_issue(project={'key': project},
                              summary='{}'.format(summary),
                              description='{}'.format(description),
                              issuetype={'name': 'Bug'},
                              labels=labels,
                              versions=[{"name": affect_version[0]}],
                              customfield_10304=[{"value": env}],
                              customfield_10306=[{"value": customer}],
                              priority={'name': priority})

Jira often behaves strangely.
The createmeta call returns all the possible issue types, all of their fields, and whether each field is mandatory.
Even so, there are certain fields that are mandatory but that createmeta won't report as such. You need to rely on the exception message you get after calling create_issue().
In the exception, exception_obj.response.text gives you JSON whose key/value pairs name the exact field required.
You can then search the createmeta response for that field's schema type and possibly its allowedValues set.
Then try again.
Basically, you need to retry using the mechanism above.
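As a rough sketch of the "read the exception text" step (Python 3), the helper below pulls the per-field messages out of an error body. It assumes the body is JSON with an "errors" object mapping field names to messages, which is how Jira error responses usually look but is not confirmed by this thread; the function name and the sample text are made up for illustration.

```python
import json


def fields_from_jira_error(response_text):
    """Hypothetical helper: map field name -> message from an error body."""
    try:
        body = json.loads(response_text)
    except ValueError:
        # Not JSON at all (for example an HTML error page).
        return {}
    if not isinstance(body, dict):
        return {}
    # Assumed layout: {"errorMessages": [...], "errors": {field: message}}
    errors = body.get("errors") or {}
    return dict(errors)


# Sample body modelled on the message quoted in the question.
sample = json.dumps({
    "errorMessages": [],
    "errors": {"issuelinks": "No issue link type with name 'Automated' found."},
})
problem_fields = fields_from_jira_error(sample)
print(sorted(problem_fields))  # ['issuelinks']
```

In the retry loop described above, you would pass exception_obj.response.text to this function, look up each returned field in the createmeta response, adjust the payload, and call create_issue() again.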

Differentiating between binary encoded Avro and JSON messages

I'm using Python to read messages coming from various topics. Some topics have their messages encoded in plain JSON, while others use Avro binary serialization with Confluent Schema Registry.
When I receive a message I need to know whether it has to be decoded. At the moment I'm only relying on the fact that the binary-encoded messages start with a MAGIC_BYTE whose value is zero:
from confluent_kafka import Consumer

consumer = Consumer(config)
consumer.subscribe(...)
msg = consumer.poll()
# check the msg is not null or error etc
# (on Python 3, indexing bytes yields an int)
if msg.value()[0] == 0:
    pass  # It is binary encoded
else:
    pass  # It is json
I was wondering if there's a better way to do that?

You could get the first five bytes (indexes 0-4) of your message, then
magic_byte = message_bytes[0]
schema_id = message_bytes[1:5]
Then, perform a lookup against your registry with GET /schemas/ids/{schema_id}, and cache the ID + schema (if needed) when you get a 200 response code.
Otherwise, the message is either JSON, or the producer sent its data to a different registry (if there is more than one in your environment). Note: this means the data could still be Avro.
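The header check described above can be sketched with the standard library alone (Python 3). The Confluent wire format puts a zero magic byte first, followed by the schema ID as a 4-byte big-endian integer, and then the Avro payload. The function name split_confluent_header is an assumption for illustration, and the registry lookup itself is left out.

```python
import struct

MAGIC_BYTE = 0


def split_confluent_header(raw):
    """Return (schema_id, payload) if raw looks Confluent-framed, else None."""
    if raw is None or len(raw) < 5:
        return None
    # ">bI": big-endian, one signed byte (magic) + one unsigned 4-byte int.
    magic, schema_id = struct.unpack(">bI", raw[:5])
    if magic != MAGIC_BYTE:
        return None
    return schema_id, raw[5:]


# A made-up framed message with schema ID 42, and a plain JSON message.
framed = b"\x00" + struct.pack(">I", 42) + b"avro-bytes"
print(split_confluent_header(framed))       # (42, b'avro-bytes')
print(split_confluent_header(b'{"a": 1}'))  # None
```

A leading zero byte is only a hint, so the schema ID returned here should still be confirmed against the registry before the payload is treated as Avro.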

You can simply query the schema registry through REST first and build a local cache of the topics that are registered there. Then, when you're trying to decode a message from a particular topic, compare the topic to the contents of that list. If it's there, you know it has to be decoded.
Of course, this only works if all the topics that are Avro-encoded use Schema Registry. If you ever receive an Avro-encoded message that is not registered with Schema Registry, it won't work.

Python Requests POST to form records incorrect payload (checkboxes)

I am having quite a bit of trouble with getting the correct form data saved to a server via POST with Requests (2.8.1) module.
I have previous code which does exactly what I want it to do: it encodes a bunch of key:value pairs into the correct header:value payload dict format, and successfully POSTS to the URI. I get a 200 response (what I'm looking for) and everything is great.
This is a section of the OLD payload encoding function, with a ton of key:value pairs omitted for brevity.
Note: the checkbox value set could be any sequence of numbers between 1 and 25, I just wrote it as
item in range(1,5)
to illustrate that the list is comprised of int numbers, i.e. [ "", 1, 2, 3, 4, 5,...] or [ "", 2, 7, 5, 1, 25,...] etc.
checkboxList = ["",]
for item in range(1,5):
    checkboxList.append(item)
payload['checkbox[ids][]'] = checkboxList
...
response = requests.post(data_url, data=payload)
>> 200 OK!
Here is a print of what the payload dict (checkboxes) looks like before it's sent to the server:
{... "checkbox[ids][]" : [ "", 2, 17, 20, 5], ...}
And when I look on the page with a browser, all the payload information has been correctly recorded (omitted above) AND the checkboxes (shown above) are correct!
Originally, the checkbox values came from an excel file, as did the rest of the information that was put into the payload before being POSTed to the server. However, now I'm retrieving the information from an SQLite db.
Below is the NEW code that records the checkboxes incorrectly. I should note: I do not have access to the server, so I cannot easily tell if it's a server issue, but let's assume it's not the server's fault. I've had this issue previously, but I got it to work with the above code. However, now that I've started to store the values I need in a db, I cannot get the correct checkboxes recorded by the server.
This is what the data from the db column looks like:
12-5-1-22-4
(... I know this isn't great practice for DB mgmt, but I assume this isn't why the POST is recording the wrong data, and I wanted this question to be as closely representational to my code as possible.)
checkList = checkboxesFromDB.split('-')
payload['checkbox[ids][]'] = checkList
...
response = requests.post(data_url, data=payload)
>> 200 OK!
When I look at the site with the browser, it records the checkboxes incorrectly. I should note that 3 checkboxes are selected no matter what I pass to payload['checkbox[ids][]'].
It's ALWAYS the same 3 incorrect checkboxes, even if I completely omit checkbox[ids][] from the payload dict. Knowing that, we could assume it's a server issue. However, nearly the EXACT code from above works (when I grab the info from an Excel file).
I've tried the following (with only one value as a test) without getting the correct checkboxes recorded by the server:
payload['checkbox[ids][]'] = '1'
payload['checkbox[ids][]'] = 1
payload['checkbox[ids][]'] = [1]
payload['checkbox[ids][]'] = ["",1]
payload['checkbox[ids][]'] = [1,""]
When uploading images to the same server, I had an encoding issue when retrieving the image BLOB from the db and trying to pass the buffer object directly to Requests as a file, but I fixed this with cStringIO encoding. (It took me forever as I'm really new to programming, and still unsure of syntax, let alone ways to handle this sort of stuff....) I thought I might be having a similar encoding issue, but with the testing and research I've done, I cannot determine either way as I feel like I'm a bit over my head.
I apologize if this is completely NOOB, but I've done extensive research, trying so many different things that I could think of. I tried passing strings, lists, dicts, forcing encoding of lists as utf-8.
The main reason I'm so perplexed is my original code WORKS, and my new code is nearly identical but doesn't. The only real difference I can think of is now my information is coming from a SQLite db (this particular checkbox column is TEXT type)
Can anyone help me, or point me in a new direction I haven't thought of/know of?

I went through all the payload pairs and found that it was an issue with HTML.
I was saving HTML in my SQLite db (via BeautifulSoup, without prettifying it) as TEXT. Then I was retrieving it and sending it as a string. This was throwing off the server response.
I have since swapped that SQL column type to VARCHAR (as is best for my use) and prettify the HTML like this, foo = bar.prettify(formatter="html"), before saving it to the db. Now, when I retrieve the value and pass it to the payload, everything works as it should.
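For what it's worth, the checkbox list itself was unlikely to be the culprit: a list value in a form payload is sent as one repeated key per item, and ints and strings serialize to the same text. The sketch below (Python 3) illustrates this with the standard library's urlencode and doseq=True, which, as far as I know, mirrors how Requests encodes list values in data=. The column value "2-17-20-5" is made up for the comparison.

```python
from urllib.parse import urlencode

# Values as they came from the Excel-based code: a blank entry plus ints.
from_excel = ["", 2, 17, 20, 5]

# Values as they come from the db column: strings produced by split('-').
# The leading "" is added here to match the old list (an assumption).
from_db = [""] + "2-17-20-5".split("-")

encoded_excel = urlencode({"checkbox[ids][]": from_excel}, doseq=True)
encoded_db = urlencode({"checkbox[ids][]": from_db}, doseq=True)

print(encoded_excel == encoded_db)  # True
print(encoded_db)
```

The one real difference in the question's new code is that split('-') does not produce the leading blank entry, so it is worth adding it back if the server expects it.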

Representation of python dictionaries with unicode in database queries

I have a problem that I would like to know how to efficiently tackle.
I have data that is JSON-formatted (used with dumps / loads) and contains unicode.
This is part of a protocol implemented with JSON to send messages. So messages will be sent as strings and then loaded into python dictionaries. This means that the representation, as a python dictionary, afterwards will look something like:
{u"mykey": u"myVal"}
It is no problem in itself for the system to handle such structures, but the problem arises when I make a database query to store this structure.
I'm using pyOrient towards OrientDB. The command ends up something like:
"CREATE VERTEX TestVertex SET data = {u'mykey': u'myVal'}"
Which will end up in the data field getting the following values in OrientDB:
{'_NOT_PARSED_': '_NOT_PARSED_'}
I'm assuming this problem relates to other cases as well when you wish to make a query or somehow represent a data object containing unicode.
How could I efficiently get a representation of this data, of arbitrary depth, to be able to use it in a query?
To clarify even more, this is the string the db expects:
"CREATE VERTEX TestVertex SET data = {'mykey': 'myVal'}"
If I'm simply stating the wrong problem/question and should handle it some other way, I'm very much open to suggestions. But what I want to achieve is to have an efficient way to use python2.7 to build a db-query towards orientdb (using pyorient) that specifies an arbitrary data structure. The data property being set is of the OrientDB type EMBEDDEDMAP.
Any help greatly appreciated.
EDIT1:
More explicitly stating that the first code block shows the object as a dict AFTER being dumped / loaded with json to avoid confusion.

Dargolith:
OK, based on your last response it seems you are simply looking for code that will dump a Python expression in a way that lets you control how unicode and other data types print. Here is a very simple function that provides this control. There are ways to make this function more efficient (for example, by using a string buffer rather than doing all of the recursive string concatenation happening here). Still, this is a very simple function, and as it stands its execution time is probably still dominated by your DB lookup.
As you can see in each of the 'if' statements, you have full control of how each data type prints.
def expr_to_str(thing):
    if hasattr(thing, 'keys'):
        pairs = ['%s:%s' % (expr_to_str(k),expr_to_str(v)) for k,v in thing.iteritems()]
        return '{%s}' % ', '.join(pairs)
    if hasattr(thing, '__setslice__'):
        parts = [expr_to_str(ele) for ele in thing]
        return '[%s]' % (', '.join(parts),)
    if isinstance(thing, basestring):
        return "'%s'" % (str(thing),)
    return str(thing)

print "dumped: %s" % expr_to_str({'one': 33, 'two': [u'unicode', 'just a str', 44.44, {'hash': 'here'}]})
outputs:
dumped: {'two':['unicode', 'just a str', 44.44, {'hash':'here'}], 'one':33}

I went on to use json.dumps() as sobolevn suggested in the comment. I didn't think of that one at first since I wasn't really using json in the driver. It turned out however that json.dumps() provided exactly the formats I needed on all the data types I use. Some examples:
>>> json.dumps('test')
'"test"'
>>> json.dumps(['test1', 'test2'])
'["test1", "test2"]'
>>> json.dumps([u'test1', u'test2'])
'["test1", "test2"]'
>>> json.dumps({u'key1': u'val1', u'key2': [u'val21', 'val22', 1]})
'{"key2": ["val21", "val22", 1], "key1": "val1"}'
If you need to take more control of the format, quotes or other things regarding this conversion, see the reply by Dan Oblinger.
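Putting the json.dumps() approach together with the query from the question might look like the sketch below (Python 3). The function name build_create_vertex is hypothetical, and the sketch assumes, as reported above, that OrientDB accepts the double-quoted JSON form for an EMBEDDEDMAP property. It only builds the command string and does not call pyorient.

```python
import json


def build_create_vertex(vertex_class, data):
    """Hypothetical helper: build a CREATE VERTEX command with a JSON map."""
    # json.dumps drops the u'' prefixes and emits double-quoted strings.
    return "CREATE VERTEX %s SET data = %s" % (vertex_class, json.dumps(data))


command = build_create_vertex("TestVertex", {u"mykey": u"myVal"})
print(command)  # CREATE VERTEX TestVertex SET data = {"mykey": "myVal"}
```

Because the data is interpolated directly into the command text, this is only suitable for values you trust.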

grabbing HTTP GET parameter from url using Box API in python

I am dealing with the Box.com API using python and am having some trouble automating a step in the authentication process.
I am able to supply my API key and client secret key to Box. Once Box.com accepts my login credentials, they supply me with an HTTP GET parameter like
'http://www.myapp.com/finish_box?code=my_code&'
I want to be able to read and store my_code using python. Any ideas? I am new to python and dealing with APIs.

This is actually a more robust question than it seems, as it exposes some useful functions with web dev in general. You're basically asking how to separate my_code in the string 'http://www.myapp.com/finish_box?code=my_code&'.
Well, let's take it in bits and pieces. First of all, you know that you only really need the stuff after the question mark, right? I mean, you don't need to know what website you got it from (though that would be good to save, let's keep that in case we need it later), you just need to know what arguments are being passed back. Let's start with str.split():
>>> return_string = 'http://www.myapp.com/finish_box?code=my_code&'
>>> step1 = return_string.split('?')
["http://www.myapp.com/finish_box","code=my_code&"]
This will return a list to step1 containing two elements, "http://www.myapp.com/finish_box" and "code=my_code&". Well hell, we're there! Let's split the second one again on the equals sign!
>>> step2 = step1[1].split("=")
["code","my_code&"]
Well lookie there, we're almost done! However, this doesn't really allow any more robust uses of it. What if instead we're given:
>>> return_string = r'http://www.myapp.com/finish_box?code=my_code&junk_data=ohyestheresverymuch&my_birthday=nottoday&stackoverflow=usefulplaceforinfo'
Suddenly our plan doesn't work. Let's instead break that second set on the & sign, since that's what's separating the key:value pairs.
step2 = step1[1].split("&")
["code=my_code",
"junk_data=ohyestheresverymuch",
"my_birthday=nottoday",
"stackoverflow=usefulplaceforinfo"]
Now we're getting somewhere. Let's save those as a dict, shall we?
>>> list_those_args = {}
>>> for each_item in step2:
...     list_those_args[each_item.split("=")[0]] = each_item.split("=")[1]
Now we've got a dictionary in list_those_args that contains key and value for every argument the GET passed back to you! Science!
So how do you access it now?
>>> list_those_args['code']
my_code
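The manual splitting above shows the idea; the standard library can also do the parsing for you. Here is a minimal sketch (Python 3, where the functions live in urllib.parse; in Python 2 they are in the urlparse module). The helper name extract_code is an assumption for illustration.

```python
from urllib.parse import urlparse, parse_qs


def extract_code(redirect_url):
    """Return the value of the 'code' query parameter, or None if absent."""
    query = urlparse(redirect_url).query
    # parse_qs returns a dict of lists, e.g. {'code': ['my_code']}.
    params = parse_qs(query)
    values = params.get("code")
    return values[0] if values else None


print(extract_code("http://www.myapp.com/finish_box?code=my_code&"))  # my_code
```

parse_qs also copes with the trailing & and with extra parameters, which the hand-rolled version has to handle explicitly.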

You need a web server and a CGI script to do this. I have set up a single Python script that handles this. You can see my code at:
https://github.com/jkitchin/box-course/blob/master/box_course/cgi-bin/box-course-authenticate
When you access the script, it redirects you to box for authentication. After authentication, if "code" is in the incoming request, the code is grabbed and redirected to the site where tokens are granted.
You have to set up a .htaccess file to store your secret key and ID.
