How to use avro-python3 on Windows 10 to parse files? - python

I have downloaded an AVRO file (with JSON payload) from Microsoft Azure to my Windows 10 computer.
Then, with Python 3.8.5 and avro 1.10.0 installed via pip, I tried running the following script:
import os, avro
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
reader = DataFileReader(open("48.avro", "rb"), DatumReader())
for d in reader:
    print(d)
reader.close()
Unfortunately, nothing is printed by the script.
Then I have searched around and have tried to add a schema as in below:
schema_str = """
{
"type" : "record",
"name" : "EventData",
"namespace" : "Microsoft.ServiceBus.Messaging",
"fields" : [ {
"name" : "SequenceNumber",
"type" : "long"
}, {
"name" : "Offset",
"type" : "string"
}, {
"name" : "EnqueuedTimeUtc",
"type" : "string"
}, {
"name" : "SystemProperties",
"type" : {
"type" : "map",
"values" : [ "long", "double", "string", "bytes" ]
}
}, {
"name" : "Properties",
"type" : {
"type" : "map",
"values" : [ "long", "double", "string", "bytes", "null" ]
}
}, {
"name" : "Body",
"type" : [ "null", "bytes" ]
} ]
}
"""
schema = avro.schema.parse(schema_str)
reader = DataFileReader(open("48.avro", "rb"), DatumReader(schema, schema))
for d in reader:
    print(d)
reader.close()
But this hasn't helped, still nothing is printed.
I was expecting a list of dictionary objects to be printed...
UPDATE:
I got a reply on the mailing list that avro-python3 is deprecated.
Still, my issue with the original avro package persists: nothing is printed.
UPDATE 2:
I have to apologize - the avro file I was using did not contain any useful data. The reason for my confusion is that a colleague was using a different file with the same name while testing for me.
Now I have tried both the avro and fastavro modules on a different AVRO file, and both worked. I will look at PySpark as well.
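For reference, here is a minimal fastavro sketch of the kind of read that worked (a hedged example, assuming the capture file is named 48.avro as above):

from fastavro import reader

# Iterate over the records in the Event Hub capture file.
with open("48.avro", "rb") as fo:
    for record in reader(fo):
        print(record)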

As OneCricketeer suggested, use PySpark to read the Avro files generated by Event Hub. Here, PySpark: Deserializing an Avro serialized message contained in an eventhub capture avro file is one such example.
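A minimal PySpark sketch along those lines (hedged; it assumes the external spark-avro package is on the classpath, e.g. pyspark --packages org.apache.spark:spark-avro_2.12:3.0.1, and the column names come from the Event Hub capture schema shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-eventhub-capture").getOrCreate()

# Read the capture file with the spark-avro data source.
df = spark.read.format("avro").load("48.avro")

# SequenceNumber, EnqueuedTimeUtc and Body are fields of the capture schema.
df.select("SequenceNumber", "EnqueuedTimeUtc", "Body").show(truncate=False)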

Related

JSON Schema validation in Python

I wonder if it is possible to have a Swagger JSON with all schemas in one file to be validated across all tests.
Assume somewhere in a long .json file a schema is specified like so:
...
"Stop": {
"type": "object",
"properties": {
"id": {
"description": "id identifier",
"type": "string"
},
"lat": {
"$ref": "#/components/schemas/coordinate"
},
"lng": {
"$ref": "#/components/schemas/coordinate"
},
"name": {
"description": "Name of location",
"type": "string"
},
"required": [
"id",
"name",
"lat",
"lng"
]
}
...
So the lat and lng schemas are defined in another schema (at the top of the same file).
Now I want to get a response from the backend with an array of those stops and would like to validate it against the schema.
How should I approach it?
I am trying to load a partial schema and validate against it, but then the $ref won't resolve. Also, is there any way to tell the validator to accept arrays? The schema only describes what a single object looks like. If I manually add
"type": "array",
"items": {
Stop...
with hardcoded coordinates
}
Then it seems to work.
Here are my functions to validate arbitrary response JSON against chosen "chunk" of the full Swagger schema:
def pick_schema(schema):
    with open("json_schemas/full_schema.json", "r") as f:
        jschema = json.load(f)
    return jschema["components"]["schemas"][schema]

def validate_json_with_schema(json_data, json_schema_name):
    validate(
        json_data,
        schema=pick_schema(json_schema_name),
        format_checker=jsonschema.FormatChecker(),
    )
Maybe another approach is preferred? Separate files with schemas for API responses (it is quite tedious to write, full_schema.json is generated from Swagger itself).
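One possible sketch (hedged, not from an accepted answer): wrap the single-object schema in an array schema and embed the full document's components next to it, so that "#/components/schemas/..." references still resolve against the schema passed to the validator. The file name and helper shape mirror the ones above:

import json
import jsonschema
from jsonschema import validate

def pick_array_schema(schema_name):
    # Load the full Swagger/OpenAPI document once.
    with open("json_schemas/full_schema.json", "r") as f:
        full_schema = json.load(f)
    # Array-of-<schema_name>; keeping "components" here lets the
    # "#/components/schemas/..." $refs resolve against this document.
    return {
        "type": "array",
        "items": {"$ref": "#/components/schemas/" + schema_name},
        "components": full_schema["components"],
    }

def validate_response_array(json_data, json_schema_name):
    validate(
        json_data,
        schema=pick_array_schema(json_schema_name),
        format_checker=jsonschema.FormatChecker(),
    )

For example, validate_response_array(response_json, "Stop") would check a backend response that is an array of Stop objects.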

PyMongo: Is there a way to add data to an existing document in MongoDB using python?

I have a database 'Product', which contains a collection named 'ProductLog'. Inside this collection, there are 2 documents in the following format:
{
  "environment": "DevA",
  "data": [
    {
      "Name": "ABC",
      "Stream": "Yes"
    },
    {
      "Name": "ZYX",
      "Stream": "Yes"
    }
  ]
},
{
  "environment": "DevB",
  "data": [
    {
      "Name": "ABC",
      "Stream": "Yes"
    },
    {
      "Name": "ZYX",
      "Stream": "Yes"
    }
  ]
}
These get added as 2 documents in the collection. I want to append more data to the already existing document's "data" field in MongoDB using Python. Is there a way to do that? I guess an update would remove the existing entries in the "data" field, or replace the whole document.
For example: adding one more entry to the EmployeeDetails field, while the earlier data in EmployeeDetails also remains.
Here is how you can append more data to the already existing document's "data" field in MongoDB using Python:
First install pymongo:
pip install pymongo
Now let's get our hands dirty:
from pymongo import MongoClient

mongo_uri = "mongodb://user:pass@mongosrv:27017/"
client = MongoClient(mongo_uri)
database = client["Product"]
collection = "ProductLog"

database[collection].update_one({"environment": "DevB"}, {
    "$push": {
        "data": {"Name": "DEF", "Stream": "NO"}
    }
})
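If you need to append several entries at once, $push can be combined with MongoDB's $each modifier; a hedged follow-up sketch using the same connection as above (the extra names and values are made up for illustration):

# Append multiple entries to "data" in a single update.
database[collection].update_one(
    {"environment": "DevA"},
    {"$push": {"data": {"$each": [
        {"Name": "DEF", "Stream": "No"},
        {"Name": "GHI", "Stream": "Yes"},
    ]}}},
)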
There is also a SQL library for Python through which you can insert/add your data into your desired database. For more information, check out the tutorial.

How to select a particular value/attribute in json data via python?

I have some JSON data which I want to load and inspect in Python. I know Python has a few different ways to handle JSON. If I want to see what the author name is in the following JSON data, how can I directly select the value of name inside author, without having to iterate, if there are multiple topic/blog entries in the data?
{
  "topic": {
    "language": "JSON"
  },
  "blog": [
    {
      "author": {
        "name": "coder"
      }
    }
  ]
}
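A short sketch of the direct access (hedged; the file name is hypothetical and the data mirrors the snippet above):

import json

# Load the JSON shown above from a file (or use json.loads on a string).
with open("data.json") as f:
    data = json.load(f)

# Direct key/index access, no iteration needed for a single entry:
print(data["blog"][0]["author"]["name"])  # -> coder

# If "blog" holds several entries, a comprehension collects every author name:
names = [entry["author"]["name"] for entry in data["blog"]]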

Problems trying to decode an AVRO file generated by an Azure Key Vault Audit Event

I am trying to read the Audit Events generated by accesses to an Azure Key Vault. They are streamed to an Event Hub. The events appear in the Event Hub as AVRO files. An individual event appears as a file, 44.avro, in a folder whose path specifies the time stamp of the event. For example, an event generated today (noon, 6-Nov-20) could be found at 'kv-audit-eh/security-logs/0/2020/11/06/12/00/44.avro'. So far, so good.
The problem comes when trying to read the contents of this file to verify the type of Key Vault access that triggered the event. An on-line utility says the file is empty. (The file is 508 bytes in size, and you can see a JSON-formatted schema embedded in it, along with some binary information.) I have used a tool to extract the JSON schema, and here it is:
{"namespace": "44.avro",
"type" : "record",
"name" : "EventData",
"namespace" : "Microsoft.ServiceBus.Messaging",
"fields" : [
{
"name" : "SequenceNumber",
"type" : "long"
},
{
"name" : "Offset",
"type" : "string"
},
{
"name" : "EnqueuedTimeUtc",
"type" : "string"
},
{
"name" : "SystemProperties",
"type" : {
"type" : "map",
"values" : [
"long",
"double",
"string",
"bytes"
]
}
},
{
"name" : "Properties",
"type" : {
"type" : "map",
"values" : [
"long",
"double",
"string",
"bytes",
"null"
]
}
},
{
"name" : "Body",
"type" : [
"null",
"bytes"
]
}
]
}
I saved this schema into the file audit.avsc. When I use the following Python code to read the file, I don't get any errors, but I don't get any output either.
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
schema = avro.schema.parse(open("audit.avsc", "rb").read())
reader = DataFileReader(open("44.avro", "rb"), DatumReader())
for name in reader:
    print(name)
reader.close()
If I open the file in the Azure dashboard, it displays the message "may not render correctly as it contains an unknown extension."
So my question is: What is required to read the contents of one of these files? Any advice welcome, as I'm stumped by this.
Thanks in advance.
It turns out the AVRO file is empty, most of the time. The event hub was capturing all sorts of idle activity on the Key Vault. I switched on the option so that the capture does not record null events (that is, it only captures events that actually affect the Key Vault). Once I did that, the AVRO files were a few kilobytes in size and the Python code read out audit events. The folders were no longer cluttered with empty files.
The capture setting in question was this one:
Do not emit empty files when no events occur during the Capture time window.
Check it.
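Once the capture files actually contain events, a hedged sketch for pulling the audit payload out of the Body field (treating Body as UTF-8 encoded JSON is an assumption; verify it against your own events):

import json
from avro.datafile import DataFileReader
from avro.io import DatumReader

reader = DataFileReader(open("44.avro", "rb"), DatumReader())
for record in reader:
    body = record["Body"]  # union ["null", "bytes"] in the schema above
    if body:
        print(json.loads(body.decode("utf-8")))
reader.close()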

ElasticSearch: Index only the fields specified in the mapping

I have an ElasticSearch setup, receiving data to index via a CouchDB river. The problem is that most of the fields in the CouchDB documents are actually not relevant for search: they are fields used internally by the application (IDs and so on), and I do not want to get false positives because of these fields. Besides, indexing unneeded data seems to me a waste of resources.
To solve this problem, I have defined a mapping where I specify the fields which I want to be indexed. I am using pyes to access ElasticSearch. The process that I follow is:
Create the CouchDB river, associated with an index. This apparently also creates the index, and creates a "couchdb" mapping in that index which, as far as I can see, includes all fields, with dynamically assigned types.
Put a mapping, restricting it to the fields which I really want to index.
This is the index definition as obtained by:
curl -XGET http://localhost:9200/notes_index/_mapping?pretty=true
{
  "notes_index" : {
    "default_mapping" : {
      "properties" : {
        "note_text" : {
          "type" : "string"
        }
      }
    },
    "couchdb" : {
      "properties" : {
        "_rev" : {
          "type" : "string"
        },
        "created_at_date" : {
          "format" : "dateOptionalTime",
          "type" : "date"
        },
        "note_text" : {
          "type" : "string"
        },
        "organization_id" : {
          "type" : "long"
        },
        "user_id" : {
          "type" : "long"
        },
        "created_at_time" : {
          "type" : "long"
        }
      }
    }
  }
}
The problem that I have is twofold:
The default "couchdb" mapping is indexing all fields, which I do not want. Is it possible to avoid the creation of that mapping? I am confused, because that mapping seems to be the one which is somehow "connecting" to the CouchDB river.
The mapping that I create seems not to have any effect: there are no documents indexed by that mapping.
Do you have any advice on this?
EDIT
This is what I am actually doing, exactly as typed:
server="localhost"
# Create the index
curl -XPUT "$server:9200/index1"
# Create the mapping
curl -XPUT "$server:9200/index1/mapping1/_mapping" -d '
{
"type1" : {
"properties" : {
"note_text" : {"type" : "string", "store" : "no"}
}
}
}
'
# Configure the river
curl -XPUT "$server:9200/_river/river1/_meta" -d '{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"user" : "admin",
"password" : "admin",
"db" : "notes"
},
"index" : {
"index" : "index1",
"type" : "type1"
}
}'
The documents in index1 still contain fields other than "note_text", which is the only one that I have specifically mentioned in the mapping definition. Why is that?
The default behavior of CouchDB river is to use a 'dynamic' mapping, i.e. index all the fields that are found in the incoming CouchDB documents. You're right that it can unnecessarily increase the size of the index (your problems with search can be solved by excluding some fields from the query).
To use your own mapping instead of the 'dynamic' one, you need to configure the River plugin to use the mapping you've created (see this article):
curl -XPUT 'elasticsearch-host:9200/_river/notes_index/_meta' -d '{
  "type" : "couchdb",
  ... your CouchDB connection configuration ...
  "index" : {
    "index" : "notes_index",
    "type" : "mapping1"
  }
}'
The name of the type that you specify in the URL when doing the mapping PUT overrides the one that you include in the definition, so the type that you're creating is in fact mapping1. Try executing this command to see for yourself:
> curl 'localhost:9200/index1/_mapping?pretty=true'
{
  "index1" : {
    "mapping1" : {
      "properties" : {
        "note_text" : {
          "type" : "string"
        }
      }
    }
  }
}
I think that if you get the name of the type right, it will start working fine.
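The same fix can be applied from Python; a hedged sketch using the requests library (index, type, and field names are taken from the question, and the only point is that the type segment of the URL must be type1 so it matches the river configuration):

import json
import requests

mapping = {
    "type1": {
        "properties": {
            "note_text": {"type": "string", "store": "no"}
        }
    }
}

# PUT the mapping under the type name the river is configured to use.
response = requests.put(
    "http://localhost:9200/index1/type1/_mapping",
    data=json.dumps(mapping),
)
print(response.status_code, response.text)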
