I have a simple Kafka reader class. I really don't remember where I got this code; I may have found it somewhere, or my previous self may have assembled it from various examples. Either way, it lets me quickly read a Kafka topic.
import io

import avro.io
import avro.schema
from kafka import KafkaConsumer

# get_schema is a helper defined elsewhere in my code.

class KafkaStreamReader:
    def __init__(self, schema_name, topic, server_list):
        self.schema = get_schema(schema_name)
        self.topic = topic
        self.server_list = server_list
        self.consumer = KafkaConsumer(topic,
                                      bootstrap_servers=server_list,
                                      auto_offset_reset='latest',
                                      security_protocol='PLAINTEXT')

    def decode(self, msg, schema):
        parsed_schema = avro.schema.parse(schema)
        bytes_reader = io.BytesIO(msg)
        decoder = avro.io.BinaryDecoder(bytes_reader)
        reader = avro.io.DatumReader(parsed_schema)
        record = reader.read(decoder)
        return record

    def fetch_msg(self):
        event = next(self.consumer).value
        record = self.decode(event, self.schema)
        return record
To use it, I instantiate an object and loop forever reading data, like this:
consumer = KafkaStreamReader(schema, topic, server_list)

while True:
    message = consumer.fetch_msg()
    print(message)
I'm sure there are better solutions, but this works for me.
What I want to get out of this is the metadata on the Kafka record. A coworker in another group used Java or Node and was able to see the following information on the record.
{
  topic: 'clickstream-v2.origin.test',
  value: {
    schema: 'payload_data/jsonschema/1-0-3',
    data: [ [Object] ]
  },
  offset: 16,
  partition: 0,
  highWaterOffset: 17,
  key: null,
  timestamp: 2018-07-25T17:01:36.959Z
}
I want to access the timestamp field using the Python KafkaConsumer.
I found a solution. If I change the fetch_msg method, I can access the metadata I'm after.
def fetch_msg(self):
    event = next(self.consumer)
    timestamp = event.timestamp
    record = self.decode(event.value, self.schema)
    return record, timestamp
Not the most elegant solution, as I personally don't like methods that return multiple values, but it illustrates how to access the event data I was after. I can work on more elegant solutions later.
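One way to avoid the multiple-value return (a sketch, not part of the original class; the attribute names are the ones kafka-python's ConsumerRecord exposes) is to bundle the decoded payload and the metadata into a small NamedTuple:

```python
from typing import Any, NamedTuple

class DecodedMessage(NamedTuple):
    """Decoded Avro payload plus the Kafka metadata of interest."""
    record: Any
    topic: str
    partition: int
    offset: int
    timestamp: int  # kafka-python reports epoch milliseconds

def bundle(event, record):
    """Combine a ConsumerRecord's metadata with an already-decoded payload."""
    return DecodedMessage(record=record,
                          topic=event.topic,
                          partition=event.partition,
                          offset=event.offset,
                          timestamp=event.timestamp)
```

fetch_msg could then return bundle(event, self.decode(event.value, self.schema)), and callers read message.timestamp by name instead of unpacking a tuple.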
I have a function that receives multiple different json string objects with different structure and/or field names, like so:
event = '{"userId": "TDQIQb2fQaORKvCyepDYoZgsoEE3", "profileIsCreated": true}'
or
event = '{"userId": "TDQIQb2fQaORKvCyepDYoZgsoEE3", "signUpFinished": true}'
And I have data classes like so:
from dataclasses import dataclass
from dataclasses_json import dataclass_json, LetterCase

@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass(frozen=True)
class UserId:
    userId: str

@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass(frozen=True)
class SignUpFinished(UserId):
    signUpFinished: bool

@dataclass_json(letter_case=LetterCase.CAMEL)
@dataclass(frozen=True)
class UserProfileCreated(UserId):
    profileIsCreated: bool
Currently, the way I write my function is like this:
def cast_event(event):
    user_details = None

    try:
        user_details = SignUpFinished.from_json(event)
    except KeyError:
        pass

    try:
        user_details = UserProfileCreated.from_json(event)
    except KeyError:
        pass

    if user_details:
        return "OK"
    else:
        return "UNHANDLED"
The problem is that as I handle more and more events, the function gets longer and longer while doing essentially the same thing each time.
Is there a better way to achieve what I want to achieve?
I have checked out some of the SO questions:
Multiple try codes in one block
Python: Multiple try except blocks in one?
but they don't seem to be the best way of trying to achieve what I want.
Since each case is syntactically the same, you can handle them in a single loop. Iterate through a sequence of cases and try to return; this automatically keeps on trying later cases until one succeeds.
def cast_event(event):
    # Try the most specific classes first: UserId would otherwise match
    # any event that merely contains a userId field.
    for case in (SignUpFinished, UserProfileCreated, UserId):
        try:
            return case.from_json(event)
        except KeyError:
            pass
    raise ValueError(f'not a valid event: {event}')
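The try-each-case pattern can be shown runnably without dataclass_json; here is a sketch where plain constructors raising KeyError on missing fields stand in for the from_json calls (the stub classes are illustrative, not the originals):

```python
import json

class SignUpFinished:
    def __init__(self, payload):
        self.user_id = payload['userId']            # KeyError if absent
        self.finished = payload['signUpFinished']   # KeyError if absent

class UserProfileCreated:
    def __init__(self, payload):
        self.user_id = payload['userId']
        self.created = payload['profileIsCreated']

def cast_event(event):
    """Try each case in turn; the first constructor that accepts the payload wins."""
    payload = json.loads(event)
    for case in (SignUpFinished, UserProfileCreated):
        try:
            return case(payload)
        except KeyError:
            pass
    raise ValueError('not a valid event: {}'.format(event))
```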
While a loop approach works to solve your question as asked, it would be a lot better if you didn't need a "brute force" approach to deserialising your data in the first place. To do that, you'd need a field which unambiguously helped you determine what kind of data structure you're dealing with. E.g.:
event = {'event': 'profile',
'data': {'userId': 'TDQIQb2fQaORKvCyepDYoZgsoEE3', 'profileIsCreated': True}}
Here the event 'profile' will always be followed by an object with the keys 'userId' and 'profileIsCreated'. That is the guarantee your event messages should make, then it's trivial to parse them:
event_map = {
'profile': UserProfileCreated,
...
}
return event_map[event['event']](**event['data'])
Note that I'm skipping the JSON-parsing step here. You'll need to parse the JSON first to evaluate its event key, so using dataclass_json is probably superfluous/not useful then.
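A minimal runnable sketch of this dispatch, with a plain dataclass standing in for the dataclass_json model (the 'profile' key and the class are illustrative):

```python
import json
from dataclasses import dataclass

# Stand-in for the dataclass_json model above; a plain dataclass keeps this runnable.
@dataclass(frozen=True)
class UserProfileCreated:
    userId: str
    profileIsCreated: bool

# One entry per event type; the 'event' field picks the class unambiguously.
EVENT_MAP = {
    'profile': UserProfileCreated,
}

def cast_event(raw):
    """Parse once, then dispatch on the explicit 'event' discriminator."""
    event = json.loads(raw)
    return EVENT_MAP[event['event']](**event['data'])
```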
For the specified source data, you can do this:
import json
data = '{"userId": "TDQIQb2fQaORKvCyepDYoZgsoEE3", "profileIsCreated": true}'
data = json.loads(data)
user_id = data.pop('userId')
user_details_key = list(data.keys())[0] if data else None
user_details = list(data.values())[0] if data else None
assert user_id == 'TDQIQb2fQaORKvCyepDYoZgsoEE3'
assert user_details_key == 'profileIsCreated'
assert user_details == True
I am using the Microsoft Graph API to pull my emails in Python and return them as a JSON object. There is a limitation that it only returns 12 emails at a time. The code is:
def get_calendar_events(token):
    graph_client = OAuth2Session(token=token)

    # Configure query parameters to modify the results
    query_params = {
        #'$select': 'subject,organizer,start,end,location',
        #'$orderby': 'createdDateTime DESC'
        '$select': 'sender, subject',
        '$skip': 0,
        '$count': 'true'
    }

    # Send GET to /me/messages
    events = graph_client.get('{0}/me/messages'.format(graph_url), params=query_params)
    events = events.json()

    # Return the JSON result
    return events
The response I get is twelve emails with subject and sender, plus the total count of my emails.
Now I want to iterate over the emails, changing the skip value in query_params to get the next 12 each time. How can I do this using loops or recursion?
I'm thinking something along the lines of this:
def get_calendar_events(token):
    graph_client = OAuth2Session(token=token)

    json_list = []
    ct = 0
    while True:
        # Configure query parameters to modify the results
        query_params = {
            #'$select': 'subject,organizer,start,end,location',
            #'$orderby': 'createdDateTime DESC'
            '$select': 'sender, subject',
            '$skip': ct,
            '$count': 'true'
        }

        # Send GET to /me/messages
        events = graph_client.get('{0}/me/messages'.format(graph_url), params=query_params)
        events = events.json()

        # Stop once a page comes back empty.
        if not events.get('value'):
            break

        json_list.append(events)
        ct += 12

    # Return the list of JSON pages
    return json_list
May require some tweaking, but essentially you add 12 to the $skip offset on each pass and stop as soon as a page comes back empty. Each page of JSON is appended to a list, which is then returned.
If you know how many emails you have, you could also batch it that way.
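As an alternative to computing $skip offsets yourself, Graph responses include an @odata.nextLink URL whenever more pages exist, and following it until it is absent is the usual pattern. A sketch, where fetch_page stands in for graph_client.get(url).json():

```python
def fetch_all_pages(fetch_page, first_url):
    """Collect items from every page by following @odata.nextLink.

    fetch_page(url) must return the parsed JSON dict for that page;
    Graph puts the items under 'value' and the next page's URL under
    '@odata.nextLink' (absent on the last page).
    """
    items = []
    url = first_url
    while url:
        page = fetch_page(url)
        items.extend(page.get('value', []))
        url = page.get('@odata.nextLink')
    return items
```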
I've run into an issue using Python and MongoDB that doesn't make sense to me. I am a Go/C# developer, so maybe I am missing something, but I have the following case:
from datetime import datetime
from bson import ObjectId

class DailyActivity:
    def __init__(self, user_ids: [ObjectId] = None, date: datetime = None):
        self.user_ids = user_ids or []
        self.date = date

class ActivitiesThroughDays:
    def __init__(self):
        self.daily_activities = []

    def add_daily_activity(self, daily_activity: DailyActivity = None):
        self.daily_activities.append(daily_activity)
I then have these two classes, plus another file containing some helpers to use MongoDB:
from pymongo import MongoClient

def get_client():
    return MongoClient('localhost', 27017)

def get_database(database_name: str = None):
    if database_name is None:
        raise AttributeError("database name is None.")
    return get_client().get_database(database_name)

def get_X_database():
    return get_database("X")
And here we get to the issue. I build a simple ActivitiesThroughDays object that has only one DailyActivity containing X user ids (as an ObjectId list).
However, when I try insert_one, I get the following:
TypeError: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
This is the piece of code that raises the exception:
def insert_activities_though_days(activities_through_days: ActivitiesThroughDays = None):
    if activities_through_days is None:
        raise AttributeError("activities_through_days is None.")

    col = get_EM_column("activities_through_days")
    col.insert_one(activities_through_days)
Based on the above issue, I then tried to convert my ActivitiesThroughDays into dic/json:
col.insert_one(activities_through_days.__dict__)
bson.errors.InvalidDocument: cannot encode object: <models.DailyActivity.DailyActivity object at 0x10eea0320>, of type: <class 'models.DailyActivity.DailyActivity'>
col.insert_one(json.dumps(activities_through_days))
TypeError: Object of type ActivitiesThroughDays is not JSON serializable
So based on this, I began to search for different solutions on Google and found solutions such as:
def to_dict(obj):
    if not hasattr(obj, "__dict__"):
        return obj

    result = {}
    for key, val in obj.__dict__.items():
        if key.startswith("_"):
            continue
        element = []
        if isinstance(val, list):
            for item in val:
                element.append(to_dict(item))
        else:
            element = to_dict(val)
        result[key] = element
    return result
But I got :
bson.errors.InvalidDocument: cannot encode object: <property object at 0x10229aa98>, of type: <class 'property'>
For each step I move forward, another issue comes up. To me, none of this makes sense, because there should be a generic serializer/deserializer somewhere that would, in one line, convert any nested objects/arrays for insertion into MongoDB.
Also, in one of the solutions I tried (I don't remember which one), I found that ObjectId values were ignored while mapping to json/dict.
I am not at all a Python developer so please, feel free to give any tips :)
Thanks
pymongo's interface expects a dict, and .__dict__ is a very low-level attribute.
I'm afraid you'll spend a lot of energy if you try to build an ORM/ODM for mongodb from scratch.
There are existing ORM/ODM libraries that exist for mongodb in python (mongoengine, pymodm which are quite similar) and that could help you to get something working quickly.
Here are a few lines that shows how the models would look with mongoengine and how to save them:
import datetime as dt
from mongoengine import *
connect(host='mongodb://localhost:27017/testdb')
class User(Document):
email = EmailField(required=True)
class DailyActivity(Document):
users = ListField(ReferenceField(User))
date = DateTimeField(default=dt.datetime.utcnow)
user = User(email='test#garbage.com').save()
user2 = User(email='test2#garbage.com').save()
activity = DailyActivity(users=[user, user2]).save()
I hope this helps
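If you do want to stay with raw pymongo rather than an ODM, a recursive converter that walks instance attributes (rather than class-level attributes, which is where the `property` error above came from) can work. A sketch, assuming the models keep their state in each instance's __dict__ and that leaf values are BSON-encodable types like str, datetime, or ObjectId:

```python
def to_document(obj):
    """Recursively convert plain Python objects into dicts pymongo can encode.

    Lists and dicts are converted element by element; objects with a
    __dict__ become dicts of their public instance attributes; everything
    else (str, int, datetime, ObjectId, ...) passes through unchanged.
    """
    if isinstance(obj, (list, tuple)):
        return [to_document(item) for item in obj]
    if isinstance(obj, dict):
        return {key: to_document(value) for key, value in obj.items()}
    if hasattr(obj, '__dict__'):
        return {key: to_document(value)
                for key, value in vars(obj).items()
                if not key.startswith('_')}
    return obj
```

With this, col.insert_one(to_document(activities_through_days)) hands pymongo a plain dict instead of a custom object.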
EDIT: This is very similar to SqlAlchemy - Filtering by Relationship Attribute, in that we are both trying to filter on relationship attributes. However, they are filtering by matching an exact value, whereas I am filtering using like/contains. Because of this, as pointed out in the comments, my solution requires an extra step that was not apparent in the other post.
Let me preface this with: I'm still quite new to SQLAlchemy, so it's entirely possible I'm going about this in the exact wrong way.
I have a Flask API that has defined "alerts" and "events". An alert can only belong to a single event, and an event can have multiple alerts. The schema for the alerts is as follows:
class Alert(PaginatedAPIMixin, db.Model):
    __tablename__ = 'alert'

    id = db.Column(db.Integer, primary_key=True, nullable=False)
    type = db.relationship('AlertType')
    type_id = db.Column(db.Integer, db.ForeignKey('alert_type.id'), nullable=False)
    url = db.Column(db.String(512), unique=True, nullable=False)
    event_id = db.Column(db.Integer, db.ForeignKey('event.id'), nullable=False)
    event = db.relationship('Event')

    def __str__(self):
        return str(self.url)

    def to_dict(self):
        return {'id': self.id,
                'event': self.event.name,
                'type': self.type.value,
                'url': self.url}
One of the API endpoints lets me get a list of alerts based on various filter criteria. For example, get all alerts of alert_type X, or get all alerts with X inside their alert_url. However, the one that is stumping me is that I want to be able to get all alerts with X inside their associated event name.
Here is the API endpoint function (the commented-out event bit is my initial "naive" approach, which does not work because event is a relationship), but you can get the idea of what I'm trying to do with the filtering.
def read_alerts():
    """ Gets a list of all the alerts. """
    filters = set()

    # Event filter
    #if 'event' in request.args:
    #    filters.add(Alert.event.name.like('%{}%'.format(request.args.get('event'))))

    # URL filter
    if 'url' in request.args:
        filters.add(Alert.url.like('%{}%'.format(request.args.get('url'))))

    # Type filter
    if 'type' in request.args:
        type_ = AlertType.query.filter_by(value=request.args.get('type')).first()
        if type_:
            type_id = type_.id
        else:
            type_id = -1
        filters.add(Alert.type_id == type_id)

    data = Alert.to_collection_dict(Alert.query.filter(*filters), 'api.read_alerts', **request.args)
    return jsonify(data)
The filters set that is built gets fed to the to_collection_dict() function, which essentially returns a paginated list of the query with all of the filters.
def to_collection_dict(query, endpoint, **kwargs):
    """ Returns a paginated dictionary of a query. """

    # Create a copy of the request arguments so that we can modify them.
    args = kwargs.copy()

    # Read the page and per_page values or use the defaults.
    page = int(args.get('page', 1))
    per_page = min(int(args.get('per_page', 10)), 100)

    # Now that we have the page and per_page values, remove them
    # from the arguments so that the url_for function does not
    # receive duplicates of them.
    args.pop('page', None)
    args.pop('per_page', None)

    # Paginate the query.
    resources = query.paginate(page, per_page, False)

    # Generate the response dictionary.
    data = {
        'items': [item.to_dict() for item in resources.items],
        '_meta': {
            'page': page,
            'per_page': per_page,
            'total_pages': resources.pages,
            'total_items': resources.total
        },
        '_links': {
            'self': url_for(endpoint, page=page, per_page=per_page, **args),
            'next': url_for(endpoint, page=page + 1, per_page=per_page, **args) if resources.has_next else None,
            'prev': url_for(endpoint, page=page - 1, per_page=per_page, **args) if resources.has_prev else None
        }
    }
    return data
I understand that I can get the filtered list of alerts by their associated event name doing something along these lines with options and contains_eager:
alerts = (db.session.query(Alert)
          .join(Alert.event)
          .options(contains_eager(Alert.event))
          .filter(Event.name.like('%{}%'.format(request.args.get('event'))))
          .all())
But I have not gotten something similar to that to work when added to the filters set.
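For reference, the join-plus-like filter can be exercised in isolation. Below is a self-contained sketch using plain SQLAlchemy and an in-memory SQLite database (the trimmed-down models and sample event names are invented for illustration; the Flask-SQLAlchemy models above would join through Alert.event the same way):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Event(Base):
    __tablename__ = 'event'
    id = Column(Integer, primary_key=True)
    name = Column(String(128))

class Alert(Base):
    __tablename__ = 'alert'
    id = Column(Integer, primary_key=True)
    url = Column(String(512))
    event_id = Column(Integer, ForeignKey('event.id'))
    event = relationship('Event')

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# Invented sample data.
phishing = Event(name='phishing-campaign')
malware = Event(name='malware-drop')
session.add_all([phishing, malware,
                 Alert(url='http://a.example', event=phishing),
                 Alert(url='http://b.example', event=malware)])
session.commit()

# Join through the relationship, then filter on the related column; the
# resulting criterion could equally be collected alongside the other filters.
alerts = (session.query(Alert)
          .join(Alert.event)
          .filter(Event.name.like('%phish%'))
          .all())
```

The key point is that the join has to be applied to the query itself; the Event.name.like(...) criterion alone is not enough, which is why simply adding it to the filters set fails without the corresponding join.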
I have a Python application running on Google App Engine which outputs data in JSON format, structured by the gviz_api for Google Charts visualisation. The code is as follows:
class StatsItem(ndb.Model):
    added = ndb.DateTimeProperty(auto_now_add=True, verbose_name="Upload date")
    originated = ndb.DateTimeProperty(verbose_name="Origination date")
    host = ndb.StringProperty(verbose_name="Originating host")
    uptime = ndb.IntegerProperty(indexed=False, verbose_name="Uptime")
    load1 = ndb.FloatProperty(indexed=False, verbose_name="1-min load")
    load5 = ndb.FloatProperty(indexed=False, verbose_name="5-min load")
    load15 = ndb.FloatProperty(indexed=False, verbose_name="15-min load")

class ChartDataPage(webapp2.RequestHandler):
    def get(self):
        span = int(self.request.get('span', 720))
        stats = StatsItem.query().order(-StatsItem.originated).fetch(span)

        header = {'originated': ("datetime", "date")}
        vars = []
        for v in self.request.get_all('v'):
            if v in StatsItem._properties.keys():
                vars.append(v)
                header[v] = ("number", StatsItem._properties[v]._verbose_name)

        data = []
        for s in stats:
            entry = {'originated': s.originated}
            for v in vars:
                entry[v] = getattr(s, v)
            data.append(entry)

        data_table = gviz_api.DataTable(header)
        data_table.LoadData(data)

        self.response.headers['Content-Type'] = 'application/json'
        self.response.out.write(data_table.ToJSonResponse(columns_order=(("originated",) + tuple(vars)),
                                                          order_by="originated"))
It is working all right, but I am hitting the famous float-representation issue; this is the output I am seeing (example):
google.visualization.Query.setResponse({"status":"ok","table":{"rows":[{"c":[{"v":"Date(2013,11,19,12,55,22,460)"},{"v":0.33000000000000002}]},{"c":[{"v":"Date(2013,11,19,12,56,22,641)"},{"v":0.33000000000000002}]},{"c":[{"v":"Date(2013,11,19,12,57,22,747)"},{"v":0.28999999999999998}]},{"c":[{"v":"Date(2013,11,19,12,58,22,914)"},{"v":0.25}]},{"c":[{"v":"Date(2013,11,19,12,59,23,19)"},{"v":0.28000000000000003}]},{"c":[{"v":"Date(2013,11,19,13,0,23,169)"},{"v":0.28000000000000003}]},{"c":[{"v":"Date(2013,11,19,13,1,23,268)"},{"v":0.41999999999999998}]},{"c":[{"v":"Date(2013,11,19,13,2,23,385)"},{"v":0.40999999999999998}]},{"c":[{"v":"Date(2013,11,19,13,3,23,518)"},{"v":0.40999999999999998}]},{"c":[{"v":"Date(2013,11,19,13,4,23,643)"},{"v":0.40999999999999998}]}],"cols":[{"type":"datetime","id":"originated","label":"date"},{"type":"number","id":"load5","label":"5-min load"}]},"reqId":"0","version":"0.6"});
So a float with a value of 0.33 (as seen in the DataStore viewer) is represented as 0.33000000000000002 in JSON. While it works, this is not only ugly but also takes up bandwidth, so I would like to round it to 2 digits, i.e. 0.33. Strangely enough, in some cases the value already comes out clean (see 0.25 above).
I am loading the gviz_api module from my applications directory.
I have tried the following solutions, none of these worked:
round()-ing the figure before inputting into the datatable (round(getattr(s, v)) in the above code). It gets invoked, as I see integers turning into floats, but has no impact on the above issue with floats.
Monkey-patching JSON both in the GAE application module and also in the gviz_api module. No effect, the code is just simply not invoked, as if it was not there at all.
Overriding the default() method in gviz_api.DataTableJSONEncoder. This is not working I guess because it gets invoked only for unknown data types.
I have not yet tried post-processing the produced JSON string with regexps, and I would like to avoid that if possible. Any ideas how to fix this?
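For what it's worth, the root cause is that 0.33 has no exact binary representation (0.25 does, which is why it comes out clean): the encoder prints the nearest double to 17 significant digits. Under Python 3's shortest-repr floats, rounding before encoding is enough, as the sketch below shows; on the Python 2 runtime GAE used at the time, repr itself produced the long form, which is why round()-ing alone had no visible effect there.

```python
import json
from decimal import Decimal

# The double nearest to 0.33 is slightly above it, which is what the
# 17-digit output exposes:
nearest = Decimal(0.33)  # Decimal('0.330000000000000015...')

def round_floats(obj, ndigits=2):
    """Walk a JSON-ready structure and round every float before encoding."""
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, list):
        return [round_floats(v, ndigits) for v in obj]
    if isinstance(obj, dict):
        return {k: round_floats(v, ndigits) for k, v in obj.items()}
    return obj

encoded = json.dumps(round_floats({'v': 0.33000000000000002}))
```

Applying such a pass to the data list before LoadData keeps the DataTable itself untouched, avoiding any need to patch gviz_api or regexp the output.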