Query nested JSON document in MongoDB collection using Python

I have a MongoDB collection containing multiple documents. A document looks like this:
{
    'name': 'sys',
    'type': 'system',
    'path': 'sys',
    'children': [{
        'name': 'folder1',
        'type': 'folder',
        'path': 'sys/folder1',
        'children': [{
            'name': 'folder2',
            'type': 'folder',
            'path': 'sys/folder1/folder2',
            'children': [{
                'name': 'textf1.txt',
                'type': 'file',
                'path': 'sys/folder1/folder2/textf1.txt',
                'children': ['abc', 'def']
            }, {
                'name': 'textf2.txt',
                'type': 'file',
                'path': 'sys/folder1/folder2/textf2.txt',
                'children': ['a', 'b', 'c']
            }]
        }, {
            'name': 'text1.txt',
            'type': 'file',
            'path': 'sys/folder1/text1.txt',
            'children': ['aaa', 'bbb', 'ccc']
        }]
    }],
    '_id': ObjectId('5d1211ead866fc19ccdf0c77')
}
There are other documents with a similar structure. How can I query this collection to find, within one document among many, the part whose path matches sys/folder1/text1.txt?
My desired output would be:
{
    'name': 'text1.txt',
    'type': 'file',
    'path': 'sys/folder1/text1.txt',
    'children': ['aaa', 'bbb', 'ccc']
}
EDIT:
What I have come up with so far is this. My Flask endpoint:
class ExecuteQuery(Resource):
    def get(self, collection_name):
        result_list = []  # List to store query results
        query_list = []   # List to store the incoming queries
        for k, v in request.json.items():
            query_list.append({k: v})  # Store query items in list
        cursor = mongo.db[collection_name].find(*query_list)  # Execute query
        for document in cursor:
            encoded_data = JSONEncoder().encode(document)  # Encode the query results to a string
            result_list.append(json.loads(encoded_data))   # Collect documents as plain dicts
        return result_list  # Return query result to client
My client side:
request = {"name": "sys"}
response = requests.get(url, json=request, headers=headers)
print(response.text)
This gives me the entire document but I cannot extract a specific part of the document by matching the path.

I don't think MongoDB supports recursive or deep queries within a document (nor a recursive $unwind). What it does provide are recursive queries across documents that reference one another, i.e. aggregating elements from a graph ($graphLookup).
This answer explains pretty well what you need to do to query a tree.
Although it does not directly address your problem, you may want to reevaluate your data structure. It certainly is intuitive, but updates can be painful, as can queries for nested elements, as you just noticed.
Since $graphLookup allows you to create a view equal to your current document, I cannot think of any advantage the explicitly nested structure has over one document per path. There will be a slight performance loss for reading and writing the entire tree, but with proper indexing it should be fine.
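In the meantime, a client-side walk is a workable stopgap. Here is a minimal sketch (assuming pymongo and the document shape from the question; the collection name is a placeholder): fetch the candidate document, then recursively search its children for a matching path. With the one-document-per-path design suggested above, this collapses into a single find_one({'path': ...}) call.
def find_by_path(node, target_path):
    # Dicts are folders/files; plain strings (file contents) fall through.
    if isinstance(node, dict):
        if node.get('path') == target_path:
            return node
        for child in node.get('children', []):
            found = find_by_path(child, target_path)
            if found is not None:
                return found
    return None

doc = mongo.db['my_collection'].find_one({'name': 'sys'})  # placeholder collection name
print(find_by_path(doc, 'sys/folder1/text1.txt'))
# -> {'name': 'text1.txt', 'type': 'file', 'path': 'sys/folder1/text1.txt', 'children': ['aaa', 'bbb', 'ccc']}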


Modeling a dictionary as a queryable data object in python

I have a simple book catalog dictionary like the following:
{
    'key': {
        'title': str,
        'authors': [{
            'firstname': str,
            'lastname': str
        }],
        'tags': [str],
        'blob': str
    }
}
Each book is keyed by a string in the dictionary. A book has a single title and possibly many authors (often just one). An author consists of two strings, firstname and lastname. We can also associate many tags with a book, such as novel, literature, art, 1900s, etc. Each book has a blob field that contains additional data (often the book itself). I want to be able to search for a given entry (or a group of them) based on data, e.g. by author or by tag.
My main workflow would be:
Given a query, return all blob fields associated to each entry.
My question is how to model this, and which libraries or formats to use, keeping the given constraints:
Minimize the number of data objects (preference for a single data object to simplify queries).
Keep the number of columns small (creating a new column for every possible tag is probably insane and would lead to a very sparse dataset).
Do not duplicate the blob field (since it can be large).
My first idea was to create multiple rows for each author, for example:
{'123': {'title': 'A sample book',
         'authors': [{'firstname': 'John', 'lastname': 'Smith'},
                     {'firstname': 'Foos', 'lastname': 'M. Bar'}],
         'tags': ['tag1', 'tag2', 'tag3'],
         'blob': '.....'}}
This would initially turn into two entries:
idx  key  Title        authors_firstname  authors_lastname  tags                      blob
0    123  Sample Book  John               Smith             ['tag1', 'tag2', 'tag3']  ...
1    123  Sample Book  Foos               M. Bar            ['tag1', 'tag2', 'tag3']  ...
But this still duplicates the blob, and I still need to figure out what to do with the unknown number of tags (as the database grows).
You can use TinyDB to accomplish what you want.
First, convert your dict to a database:
from tinydb import TinyDB, Query
from tinydb.table import Document

data = [{'123': {'title': 'A sample book',
                 'authors': [{'firstname': 'John', 'lastname': 'Smith'},
                             {'firstname': 'Foos', 'lastname': 'M. Bar'}],
                 'tags': ['tag1', 'tag2', 'tag3'],
                 'blob': 'blob1'}},
        {'456': {'title': 'Another book',
                 'authors': [{'firstname': 'Paul', 'lastname': 'Roben'}],
                 'tags': ['tag1', 'tag3', 'tag4'],
                 'blob': 'blob2'}}]

db = TinyDB('catalog.json')
for record in data:
    db.insert(Document(list(record.values())[0], doc_id=list(record.keys())[0]))
Now you can make queries:
Book = Query()
Author = Query()
rows = db.search(Book.authors.any(Author.lastname == 'Smith'))  # by author
rows = db.search(Book.tags.all(['tag1', 'tag4']))               # by tags (must contain all)
rows = db.all()                                                 # everything
Given a query, return all blob fields associated to each entry.
blobs = {row.doc_id: row['blob'] for row in db.all()}
>>> blobs
{123: 'blob1', 456: 'blob2'}
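Queries compose with &, so you can filter first and collect only the matching blobs. A small sketch (same db, Book, and Author as above; the tag and author values are just examples):
matches = db.search(Book.tags.any(['tag1']) & Book.authors.any(Author.lastname == 'Smith'))
blobs = {row.doc_id: row['blob'] for row in matches}
>>> blobs
{123: 'blob1'}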

create dynamic frame with schema from catalog table

I have created a table in the Glue Data Catalog with the create_table call of the AWS Glue API. The code sample below creates the table in the catalog, but when I create a dynamic frame from this table, it is empty, with no schema. I want to create an empty dynamic frame with these four columns:
response = client.create_table(
    DatabaseName='xxxxxxxxxx',
    TableInput={
        'Name': 'xxxxxxxxxx',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'column_1', 'Type': 'string', 'Comment': 'None'},
                {'Name': 'column_2', 'Type': 'string', 'Comment': 'None'},
                {'Name': 'column_3', 'Type': 'string', 'Comment': 'None'},
                {'Name': 'column_4', 'Type': 'string', 'Comment': 'None'}
            ],
            'Location': 's3://xxxxxxx/',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat',
            'SerdeInfo': {
                'Name': 'avro',
                'SerializationLibrary': 'org.apache.hadoop.hive.serde2.avro.AvroSerDe',
                'Parameters': {
                    'avro.schema.literal': '{"type":"record","name":"DynamicRecord","namespace":"root","fields":[{"name":"column_1","type":["string","null"]},{"name":"column_2","type":["string","null"]},{"name":"column_3","type":["string","null"]},{"name":"column_4","type":["string","null"]}]}'
                }
            }
        }
    }
)
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required, and explicitly encodes schema inconsistencies using a choice (or union) type. You can resolve these inconsistencies to make your datasets compatible with data stores that require a fixed schema.
Link: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-schema
You can use apply_mapping() to set the schema explicitly, or put some data in the S3 location so Glue can infer it.
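For illustration, here is a minimal sketch of the apply_mapping() route, assuming the usual Glue job boilerplate; the database and table names are placeholders. Each mapping tuple is (source column, source type, target column, target type):
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database='xxxxxxxxxx', table_name='xxxxxxxxxx')
mapped = dyf.apply_mapping([
    ('column_1', 'string', 'column_1', 'string'),
    ('column_2', 'string', 'column_2', 'string'),
    ('column_3', 'string', 'column_3', 'string'),
    ('column_4', 'string', 'column_4', 'string'),
])
mapped.printSchema()  # the schema now comes from the mapping, not from the (empty) data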

How to compare json file with expected result in Python 3?

I need to prepare a test which will compare the content of a .json file with an expected result (we want to check whether the values in the .json are correctly generated by our dev tool).
For the test I will use Robot Framework or unittest, but I don't yet know how to parse the JSON file correctly.
JSON example:
{
    "Customer": [{
        "Information": [{
            "Country": "",
            "Form": ""
        }],
        "Id": "110",
        "Res": "",
        "Role": "Test",
        "Limit": ["100"]
    }]
}
So after I execute this:
with open('test_json.json') as f:
    hd = json.load(f)
I get a dict hd whose keys are:
dict_keys(['Customer'])
and whose values are:
dict_values([[{'Information': [{'Form': '', 'Country': ''}], 'Role': 'Test', 'Id': '110', 'Res': '', 'Limit': ['100']}]])
My problem is that I don't know how to get at just one value from the dict (e.g. Role: Test); I can only extract the whole value. I could prepare a long string to compare with, but that is not the best solution for tests. Any ideas how I can get at only one field from the .json file?
Your JSON has a single key, 'Customer', and its value is a list. So when you access hd['Customer'] you get that list:
>>> hd['Customer']
[{'Id': '110', 'Role': 'Test', 'Res': '', 'Information': [{'Form': '', 'Country': ''}], 'Limit': ['100']}]
First element in list:
>>> hd['Customer'][0]
{'Id': '110', 'Role': 'Test', 'Res': '', 'Information': [{'Form': '', 'Country': ''}], 'Limit': ['100']}
Now access the inner dict structure using:
>>> hd['Customer'][0]['Role']
'Test'
You can compare the dict that you loaded (say hd) to the expected results dict (say expected_dict) by running:
hd.items() == expected_dict.items()
(or simply hd == expected_dict).
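Putting the pieces together, a minimal unittest sketch (the file name and expected values are taken from the question):
import json
import unittest

class TestGeneratedJson(unittest.TestCase):
    def setUp(self):
        # Load the generated file once per test
        with open('test_json.json') as f:
            self.hd = json.load(f)

    def test_single_field(self):
        self.assertEqual(self.hd['Customer'][0]['Role'], 'Test')

    def test_whole_document(self):
        expected = {'Customer': [{'Information': [{'Country': '', 'Form': ''}],
                                  'Id': '110', 'Res': '', 'Role': 'Test',
                                  'Limit': ['100']}]}
        self.assertEqual(self.hd, expected)

if __name__ == '__main__':
    unittest.main()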

Extract specific keys from a list of dicts in Python. Sentinelhub

I seem to be stuck on a very simple task. I'm still dipping my toes into Python.
I'm trying to download Sentinel-2 images with the SentinelHub API.
The data my code returns looks like this:
{'geometry': {'coordinates': [[[[35.895906644, 31.602691754],
                                [36.264307655, 31.593801516],
                                [36.230618703, 30.604681346],
                                [35.642363693, 30.617971909],
                                [35.678587829, 30.757888786],
                                [35.715700562, 30.905919341],
                                [35.754290061, 31.053632806],
                                [35.793289298, 31.206946419],
                                [35.895906644, 31.602691754]]]],
              'type': 'MultiPolygon'},
 'id': 'ee923fac-0097-58a8-b861-b07d89b99310',
 'properties': {'productType': 'S2MSI1C',
                'centroid': {'coordinates': [18.1321538275, 31.10368655], 'type': 'Point'},
                'cloudCover': 10.68,
                'collection': 'Sentinel2',
                'completionDate': '2017-06-07T08:15:54Z',
                'description': None,
                'instrument': 'MSI',
                'keywords': [],
                'license': {'description': {'shortName': 'No license'},
                            'grantedCountries': None,
                            'grantedFlags': None,
                            'grantedOrganizationCountries': None,
                            'hasToBeSigned': 'never',
                            'licenseId': 'unlicensed',
                            'signatureQuota': -1,
                            'viewService': 'public'},
                'links': [{'href': 'http://opensearch.sentinel-hub.com/resto/collections/Sentinel2/ee923fac-0097-58a8-b861-b07d89b99310.json?&lang=en',
                           'rel': 'self',
                           'title': 'GeoJSON link for ee923fac-0097-58a8-b861-b07d89b99310',
                           'type': 'application/json'}],
                'orbitNumber': 10228,
                'organisationName': None,
                'parentIdentifier': None,
                'platform': 'Sentinel-2',
                'processingLevel': '1C',
                'productIdentifier': 'S2A_OPER_MSI_L1C_TL_SGS__20170607T120016_A010228_T36RYV_N02.05',
                'published': '2017-07-26T13:09:17.405352Z',
                'quicklook': None,
                'resolution': 10,
                's3Path': 'tiles/36/R/YV/2017/6/7/0',
                's3URI': 's3://sentinel-s2-l1c/tiles/36/R/YV/2017/6/7/0/',
                'sensorMode': None,
                'services': {'download': {'mimeType': 'text/html',
                                          'url': 'http://sentinel-s2-l1c.s3-website.eu-central-1.amazonaws.com#tiles/36/R/YV/2017/6/7/0/'}},
                'sgsId': 2168915,
                'snowCover': 0,
                'spacecraft': 'S2A',
                'startDate': '2017-06-07T08:15:54Z',
                'thumbnail': None,
                'title': 'S2A_OPER_MSI_L1C_TL_SGS__20170607T120016_A010228_T36RYV_N02.05',
                'updated': '2017-07-26T13:09:17.405352Z'},
 'type': 'Feature'}
Can you explain how I can iterate through this data and extract only 'productType'? For example, if there are several similar data sets, it should return only the distinct product types.
My code is:
import matplotlib.pyplot as plt
import numpy as np
import sentinelhub
from sentinelhub import AwsProductRequest, AwsTileRequest, AwsTile, BBox, CRS

betsiboka_coords_wgs84 = [31.245117, 33.897777, 34.936523, 36.129002]
bbox = BBox(bbox=betsiboka_coords_wgs84, crs=CRS.WGS84)
date = ('2017-06-05', '2017-06-08')
data = sentinelhub.opensearch.get_area_info(bbox, date_interval=date, maxcc=None)
for i in data:
    print(i)
Based on what you have provided, replace your bottom for loop:
for i in data:
    print(i)
with the following:
for i in data:
    print(i['properties']['productType'])
If you want to access only the productType you can use i['properties']['productType'] in your for loop. If you want to access it any time without writing out those keys each time, you can define a generator like this:
def product_types(data_array):
    for data in data_array:
        yield data['properties']['productType']
You can then use it in a loop (your data_array is data, as returned by the SentinelHub API):
for product_type in product_types(data):
    print(product_type)  # do stuff with product_type
Another option is to scan the keys explicitly and collect the values (here d is one feature dict from data):
keys = []
for key in d.keys():
    if key == 'properties':
        for k in d[key].keys():
            if k == 'productType' and d[key][k] not in keys:
                keys.append(d[key][k])
print(keys)
Getting only specific (nested) values: since your requested key is nested inside the parent "properties" object, you need to access that first, preferably using the get method. This can be done as follows (note the {} default in the first get, which returns an empty dictionary if the key is not present):
data_dictionary = json.loads(data_string)
product_type = data_dictionary.get('properties', {}).get('productType')
You can then aggregate the different product_type values in a set, which automatically guarantees that no two elements are the same:
product_type_set = set()
product_type_set.add(product_type)
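Condensing the above: since get_area_info returns a list of feature dicts, a set comprehension (a sketch using the variable names from the question) collects the distinct product types in one pass:
product_types = {tile['properties'].get('productType') for tile in data}
print(product_types)  # e.g. {'S2MSI1C'}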

Iterating through and deleting certain elements in a list of dictionaries in python

I have a JSON file that looks like this:
[{'Events': [{'EventName': 'Log',
              'EventType': 'Native',
              'LogLevel': 'error',
              'Message': 'missing event: seqNum=1'},
             {'EventName': 'Log',
              'EventType': 'Native',
              'LogLevel': 'error',
              'Message': 'missing event: seqNum=2'}],
  'Id': 116005},
 {'Events': [{'EventName': 'Log',
              'EventType': 'Native',
              'LogLevel': 'error',
              'Message': 'missing event: seqNum=101'},
             {'EventName': 'Log',
              'EventType': 'Native',
              'LogLevel': 'error',
              'Message': 'missing event: seqNum=102'},
             {'BrowserInfo': {'name': 'IE ', 'version': '11'},
              'EventName': 'Log',
              'EventType': 'Native',
              'LogLevel': 'info',
              'SeqNum': 3,
              'SiteID': 1454445626890,
              'Time': 1454445626891,
              'URL': 'http://test.com'},
             {'BrowserInfo': {'name': 'IE ', 'version': '11'},
              'EventName': 'eventIndicator',
              'EventType': 'responseTime',
              'SeqNum': 8,
              'SiteID': 1454445626890,
              'Time': 1454445626923,
              'URL': 'http://test.com'}],
  'Id': 116005}]
And I am trying to remove each of the events where "EventName" is "Log".
I would assume there is a way to pop them out, but I can't even iterate far enough into the structure to do that. What is the cleanest way to do this?
I should end up with a list that looks like:
[{'Events': [{'BrowserInfo': {'name': 'IE ', 'version': '11'},
              'EventName': 'eventIndicator',
              'EventType': 'responseTime',
              'SeqNum': 8,
              'SiteID': 1454445626890,
              'Time': 1454445626923,
              'URL': 'http://test.com'}],
  'Id': 116005}]
It's difficult to modify a list or other data structure as you're iterating over it. It's often easier to create a new data structure, excluding the unwanted values.
You appear to want to do two things:
Remove dictionaries from the "Events" lists that have an "EventName" of "Log".
Remove any top-level dictionaries whose lists of events have become empty after the "Log" events were removed.
It's a bit tricky to do both at once, but not too bad:
filtered_json_list = []
for event_group in json_list:
    filtered_events = [event for event in event_group["Events"]
                       if event["EventName"] != "Log"]
    if filtered_events:  # skip empty event groups!
        filtered_json_list.append({"Id": event_group["Id"], "Events": filtered_events})
This was a lot easier than I expected because the top-level dictionaries (which I call event groups, for lack of a better name) only have two keys, "Id" and "Events". If instead those dictionaries had many keys and values (or unpredictable ones), you'd probably need to replace the last line of my code with something more complicated, e.g. creating a dictionary with just the filtered events and then using a loop to copy over all the non-"Events" keys and values, rather than building the dictionary by hand with a literal; see the sketch below.
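For completeness, a sketch of that more general variant: copy every key except "Events" with a dict comprehension, then attach the filtered list.
filtered_json_list = []
for event_group in json_list:
    filtered_events = [event for event in event_group["Events"]
                       if event["EventName"] != "Log"]
    if filtered_events:
        # Copy all keys other than "Events", whatever they happen to be
        new_group = {k: v for k, v in event_group.items() if k != "Events"}
        new_group["Events"] = filtered_events
        filtered_json_list.append(new_group)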
This program might help.
import json

# Parse the JSON
with open('x.json') as fp:
    events = json.load(fp)

# Kill all "Log" events
for event_set in events:
    event_list = event_set['Events']
    event_list[:] = [event for event in event_list if event['EventName'] != 'Log']

# Kill all empty event sets
events[:] = [event_set for event_set in events if event_set['Events']]

print(json.dumps(events, indent=2))
You can use a Python list comprehension for this, applied to each 'Events' list:
[x for x in event_list if x['EventName'] != 'Log']
