Modeling a dictionary as a queryable data object in Python

I have a simple book catalog dictionary like the following:
{
    'key': {
        'title': str,
        'authors': [
            {
                'firstname': str,
                'lastname': str
            }
        ],
        'tags': [ str ],
        'blob': str
    }
}
Each book is a string key in the dictionary. A book contains a single title and possibly many authors (often just one). An author consists of two strings, firstname and lastname. We can also associate many tags with a book, such as novel, literature, art, 1900s, etc. Each book has a blob field that contains additional data (often the book itself). I want to be able to search for a given entry (or a group of them) based on the data, e.g. by author or by tag.
My main workflow would be:
Given a query, return all blob fields associated with each entry.
My question is how to model this, and which libraries or formats to use, keeping the following constraints:
Minimize the number of data objects (preferably a single data object, to simplify queries).
Keep the number of columns small (creating a new column for every possible tag is probably insane and would lead to a very sparse dataset).
Do not duplicate the blob field (since it can be large).
My first idea was to create multiple rows, one per author. For example:
{ '123': { 'title': 'A sample book',
           'authors': [ {'firstname': 'John', 'lastname': 'Smith'},
                        {'firstname': 'Foos', 'lastname': 'M. Bar'} ],
           'tags': [ 'tag1', 'tag2', 'tag3' ],
           'blob': '.....'
         }
}
would initially turn into two entries:
idx  key  Title        authors_firstname  authors_lastname  tags                      blob
0    123  Sample Book  John               Smith             ['tag1', 'tag2', 'tag3']  ...
1    123  Sample Book  Foos               M. Bar            ['tag1', 'tag2', 'tag3']  ...
But this still duplicates the blob, and I still need to figure out what to do with the unknown number of tags (as the database grows).

You can use TinyDB to accomplish what you want.
First, convert your dict to a database:
from tinydb import TinyDB, Query
from tinydb.table import Document

data = [{'123': {'title': 'A sample book',
                 'authors': [{'firstname': 'John', 'lastname': 'Smith'},
                             {'firstname': 'Foos', 'lastname': 'M. Bar'}],
                 'tags': ['tag1', 'tag2', 'tag3'],
                 'blob': 'blob1'}},
        {'456': {'title': 'Another book',
                 'authors': [{'firstname': 'Paul', 'lastname': 'Roben'}],
                 'tags': ['tag1', 'tag3', 'tag4'],
                 'blob': 'blob2'}}]

db = TinyDB('catalog.json')
for record in data:
    key = list(record.keys())[0]
    # TinyDB document ids are integers, so convert the string key
    db.insert(Document(record[key], doc_id=int(key)))
Now you can make queries:
Book = Query()
Author = Query()

# books with an author whose last name is Smith
rows = db.search(Book.authors.any(Author.lastname == 'Smith'))
# books carrying both 'tag1' and 'tag4'
rows = db.search(Book.tags.all(['tag1', 'tag4']))
# every book
rows = db.all()
Given a query, return all blob fields associated with each entry.
blobs = {row.doc_id: row['blob'] for row in db.all()}
>>> blobs
{123: 'blob1', 456: 'blob2'}
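To cover the main workflow (blob fields only for the entries that match a query), the same dictionary comprehension works on a search result instead of db.all(); a minimal sketch reusing the queries above:
rows = db.search(Book.authors.any(Author.lastname == 'Smith'))
blobs = {row.doc_id: row['blob'] for row in rows}
>>> blobs
{123: 'blob1'}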

Related

Read attribute names and return attribute information from dictionary

I am trying to write a simple query that will return all the attributes requested. The idea is to read the attribute names and return the attribute information. The query should start with the string 'select', followed by a list of the attributes the user wants to see.
So, there is a small database consisting of dictionaries:
dsql_table = [
    {'name': 'Jan', 'type': 'man', 'profession': 'Analyst'},
    {'name': 'Max', 'type': 'man', 'profession': 'Doctor'}
]
And the idea is to only implement the functionality (disregarding error handling):
try:
    query = input('dsql> ')
    while query != 'exit':
        # I need to implement code over here
    print('Thank you!')
How can I do this without using classes? So if one inputs e.g. 'select name type', it should return the selected values for each record, e.g. 'Jan man' and 'Max man'.
First you need to get the attribute names from the query, then it's quite simple.
dsql_table = [
    {'name': 'Jan', 'type': 'man', 'profession': 'Analyst'},
    {'name': 'Max', 'type': 'man', 'profession': 'Doctor'},
]

query = 'select name type'

# extract the selected attributes from the query
selected_attributes = query.split()[1:]

result = []
for record in dsql_table:
    # iterate over the selected attributes, store the value if the attribute exists
    for attribute in selected_attributes:
        if attribute in record:
            result.append(record[attribute])

# now result is the list ['Jan', 'man', 'Max', 'man']
print(' '.join(result))
Alternatively, result can be populated using a list comprehension:
result = [
    record[attribute]
    for record in dsql_table
    for attribute in selected_attributes
    if attribute in record
]
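If you want one line of output per record, closer to the 'Jan man' / 'Max man' layout described in the question, a small variation of the same idea works (a sketch building on the code above):
for record in dsql_table:
    print(' '.join(record[attribute] for attribute in selected_attributes if attribute in record))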

Query nested JSON document in MongoDB collection using Python

I have a MongoDB collection containing multiple documents. A document looks like this:
{
    'name': 'sys',
    'type': 'system',
    'path': 'sys',
    'children': [{
        'name': 'folder1',
        'type': 'folder',
        'path': 'sys/folder1',
        'children': [{
            'name': 'folder2',
            'type': 'folder',
            'path': 'sys/folder1/folder2',
            'children': [{
                'name': 'textf1.txt',
                'type': 'file',
                'path': 'sys/folder1/folder2/textf1.txt',
                'children': ['abc', 'def']
            }, {
                'name': 'textf2.txt',
                'type': 'file',
                'path': 'sys/folder1/folder2/textf2.txt',
                'children': ['a', 'b', 'c']
            }]
        }, {
            'name': 'text1.txt',
            'type': 'file',
            'path': 'sys/folder1/text1.txt',
            'children': ['aaa', 'bbb', 'ccc']
        }]
    }],
    '_id': ObjectId('5d1211ead866fc19ccdf0c77')
}
There are other documents containing similar structure. How can I query this collection to find part of one document among multiple documents where path matches sys/folder1/text1.txt?
My desired output would be:
{
    'name': 'text1.txt',
    'type': 'file',
    'path': 'sys/folder1/text1.txt',
    'children': ['aaa', 'bbb', 'ccc']
}
EDIT:
What I have come up with so far is this. My Flask endpoint:
class ExecuteQuery(Resource):
    def get(self, collection_name):
        result_list = []  # list to store query results
        query_list = []   # list to store the incoming queries
        for k, v in request.json.items():
            query_list.append({k: v})  # store query items in a list
        cursor = mongo.db[collection_name].find(*query_list)  # execute the query
        for document in cursor:
            encoded_data = JSONEncoder().encode(document)  # encode the document to a JSON string
            result_list.append(json.loads(encoded_data))  # append each decoded document
        return result_list  # return the query result to the client
My client side:
request = {"name": "sys"}
response = requests.get(url, json=request, headers=headers)
print(response.text)
This gives me the entire document but I cannot extract a specific part of the document by matching the path.
I don't think MongoDB supports recursive or deep queries within a document (nor a recursive $unwind). What it does provide, however, are recursive queries across documents that reference one another, i.e. aggregating elements from a graph ($graphLookup).
This answer explains pretty well what you need to do to query a tree.
Although it does not directly address your problem, you may want to reevaluate your data structure. It certainly is intuitive, but updates can be painful -- as well as queries for nested elements, as you just noticed.
Since $graphLookup allows you to create a view equal to your current document, I cannot think of any advantages the explicitly nested structure has over one document per path. There will be a slight performance loss for reading and writing the entire tree, but with proper indexing it should be ok.
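To illustrate the one-document-per-path alternative: if every node were stored as its own document with the same name/type/path/children fields, the lookup becomes a plain find_one. A minimal sketch, with the collection name filesystem made up for the example:
# assumes one document per node; field names mirror the documents in the question
node = mongo.db.filesystem.find_one({'path': 'sys/folder1/text1.txt'}, {'_id': 0})
# node -> {'name': 'text1.txt', 'type': 'file',
#          'path': 'sys/folder1/text1.txt', 'children': ['aaa', 'bbb', 'ccc']}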

Remove duplicates in the query set

I have a queryset like the one below, which depicts albums and the songs associated with those albums. The model name is UserSongs.
<QuerySet [{'id': 1, 'album': 'Sri', 'song': 'in the end', 'release': 2017},
           {'id': 2, 'album': 'Adi', 'song': 'cafe mocha', 'release': 2016},
           {'id': 3, 'album': 'Shr', 'song': 'smooth criminal', 'release': 2016},
           {'id': 4, 'album': 'Mouse', 'song': 'trooper', 'release': 2017},
           {'id': 5, 'album': 'Mouse', 'song': 'my mobile', 'release': 2015},
           {'id': 6, 'album': 'Sri', 'song': 'my office', 'release': 2018},
           {'id': 7, 'album': 'Sri', 'song': None, 'release': None},
           {'id': 8, 'album': 'Mouse', 'song': None, 'release': None}]>
In the backend, I'm converting the query set into a list. See code below:
albums_songs = UserSongs.objects.filter(album__in=['Sri', 'Mouse']).values('album', 'song')
list_albums_songs = list(albums_songs)
I'm sending this list to the front end to display it in a table. Sri and Mouse have multiple entries since they have released multiple songs. In the front end, these songs are displayed in a table with album and song as columns; each item in the queryset is displayed as one row, like the one below.
Album   Songs
Sri     in the end
Adi     cafe mocha
Adi     null
Shr     smooth criminal
Mouse   trooper
Mouse   my mobile
Sri     my office
Sri     null
Mouse   null
But in the table, the null entries for the song are also displayed. I don't want to display those null entries for Sri and Mouse only; I do want to display song = null for Adi. I could remove them after converting to a list and iterating over it, but that is costly. I believe this can be done in the Django query itself: something like, if the album is Sri or Mouse, check for song = null and skip that entry.
Or, after getting the queryset and before converting it into a list, can we remove those items from the queryset?
You can use the isnull filter:
albums_songs = UserSongs.objects.filter(album__in=['Sri', 'Mouse'], song__isnull=False).values('album', 'song')
EDIT: With the new requirements in your updated question, you should use the exclude method instead:
albums_songs = UserSongs.objects.exclude(album__in=['Sri', 'Mouse'], song__isnull=True).values('album', 'song')
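The same exclusion can also be written with Q objects, which some people find more readable; an equivalent sketch, assuming the field names album and song shown in the queryset above:
from django.db.models import Q

albums_songs = (UserSongs.objects
                .exclude(Q(album__in=['Sri', 'Mouse']) & Q(song__isnull=True))
                .values('album', 'song'))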

Merge a specific value from one array of dicts into another if a value matches

How do I merge a specific value from one array of dicts into another array of dicts if a single specific value matches between them?
I have an array of dicts that represent books
books = [{'writer_id': '123-456-789', 'index': None, 'title': 'Yellow Snow'}, {'writer_id': '888-888-777', 'index': None, 'title': 'Python for Dummies'}, {'writer_id': '999-121-223', 'index': 'Foo', 'title': 'Something Else'}]
and I have an array of dicts that represents authors
authors = [{'roles': ['author'], 'profile_picture': None, 'author_id': '123-456-789', 'name': 'Pat'}, {'roles': ['author'], 'profile_picture': None, 'author_id': '999-121-223', 'name': 'May'}]
I want to take the name from authors and add it to the corresponding dict in books where the book's writer_id matches the author's author_id.
My end result would ideally change the books array of dicts to the following (notice that the first dict now has 'name': 'Pat' and the third has 'name': 'May'):
books = [{'writer_id': '123-456-789', 'index': None, 'title': 'Yellow Snow', 'name': 'Pat'}, {'writer_id': '888-888-777', 'index': None, 'title': 'Python for Dummies'}, {'writer_id': '999-121-223', 'index': 'Foo', 'title': 'Something Else', 'name': 'May'}]
My current solution is:
for book in books:
    for author in authors:
        if book['writer_id'] == author['author_id']:
            book['author_name'] = author['name']
And this works. However, the nested statements bother me and feel unwieldy. I also have a number of other such structures so I end up with a function that has a bunch of code resembling this in it:
for book in books:
    for author in authors:
        if book['writer_id'] == author['author_id']:
            book['author_name'] = author['name']

books_with_foo = []
for book in books:
    for thing in things:
        if something:
            # do something

for blah in books_with_foo:
    for book_foo in otherthing:
        if blah['bar'] == stuff['baz']:
            # etc, etc.
Alternatively, how would you aggregate data from multiple database tables into one thing... some of the data comes back as dicts, some as arrays of dicts?
Pandas is almost definitely going to help you here. Convert your dicts to DataFrames for easier manipulation, then merge them:
import pandas as pd
authors = [{'roles': ['author'], 'profile_picture': None, 'author_id': '123-456-789', 'name': 'Pat'}, {'roles': ['author'], 'profile_picture': None, 'author_id': '999-121-223', 'name': 'May'}]
books = [{'writer_id': '123-456-789', 'index': None, 'title': 'Yellow Snow'}, {'writer_id': '888-888-777', 'index': None, 'title': 'Python for Dummies'}, {'writer_id': '999-121-223', 'index': 'Foo', 'title': 'Something Else'}]
df1 = pd.DataFrame.from_dict(books)
df2 = pd.DataFrame.from_dict(authors)
df1['author_id'] = df1.writer_id
df1 = df1.set_index('author_id')
df2 = df2.set_index('author_id')
result = pd.concat([df1, df2], axis=1)
You may find this page helpful for different ways of combining (merging, concatenating, etc.) separate DataFrames.
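For instance, a left join via merge keeps the books that have no matching author; a minimal sketch using the same books and authors lists (column names mirror the dict keys above):
df_books = pd.DataFrame(books)
df_authors = pd.DataFrame(authors)

# left join: every book is kept, the author's name is filled in where the ids match
merged = df_books.merge(df_authors[['author_id', 'name']],
                        left_on='writer_id', right_on='author_id', how='left')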

Split dictionary field

I've managed to figure out how to run a SQL query to display information. I need to keep the data in the same form as the db tables, so I think I should be using a dictionary. So far, my fields are ID and Name, and my print looks like this:
[{'ID': '123', 'Name': 'ROBERTSON*ROBERT'}, {'ID': '456', 'Name': 'MICHAELS*MIKE'}, {'ID': '789', 'Name': 'KRISTENSEN*KRISTEN'}, ...]
First, am I appropriately using dictionary?
Next, I need to split the Name field based on the * delimiter. For example:
Before:
{'ID': '789', 'Name': 'KRISTENSEN*KRISTEN'}
After:
{'ID': '789', 'LastName': 'KRISTENSEN', 'FirstName': 'KRISTEN'}
I've tested out a few pieces of code I've found but keep hitting roadblocks. I've used the line below to create my dictionary; I'm wondering if I can include a split in this line to reduce a step.
query = [dict(zip(['ID', 'Name'],row)) for row in cursor.fetchall()]
Like so maybe:
query = [dict(zip(['ID', 'LastName', 'FirstName'], [row[0]] + row[1].split('*'))) for row in cursor.fetchall()]
db_dict = {'ID': '789', 'Name': 'KRISTENSEN*KRISTEN'}
name = db_dict['Name']

def split_name(name):
    # find the position of the '*' delimiter, then slice around it
    for index, char in enumerate(name):
        if char == '*':
            position = index
    last_name = name[:position]
    first_name = name[position + 1:]
    return {'LastName': last_name, 'FirstName': first_name}

new_db_dict = {'ID': db_dict['ID']}
new_db_dict.update(split_name(name))
print(new_db_dict)
# -> {'ID': '789', 'LastName': 'KRISTENSEN', 'FirstName': 'KRISTEN'}
First, while your use of dictionaries is valid, I recommend using namedtuples for representing fixed structures with named fields:
from collections import namedtuple

# structure class factory
Person = namedtuple("Person", ("id", "name"))

people = [Person('123', 'ROBERTSON*ROBERT'),
          Person('456', 'MICHAELS*MIKE'),
          Person('789', 'KRISTENSEN*KRISTEN')]

# different structure
PersonName = namedtuple("PersonName", ("id", "first", "last"))

# structure transformation
def person_to_personname(person):
    """Transform Person -> PersonName"""
    names = person.name.split('*')
    if len(names) < 2:  # depends on your defaults
        last = names[0]
        first = ''
    else:  # assumes the first field is the last name
        last, first = names[:2]  # even if other names are present, take the first two
    return PersonName(person.id, first, last)

people_names = [person_to_personname(person) for person in people]
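A quick check of the transformation on the sample data above:
>>> people_names[0].first, people_names[0].last
('ROBERT', 'ROBERTSON')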
If all entries have a name split by an asterisk
A solution in two steps. Once you've retrieved your current results:
a = [{'ID': '123', 'Name': 'ROBERTSON*ROBERT'},
     {'ID': '456', 'Name': 'MICHAELS*MIKE'},
     {'ID': '789', 'Name': 'KRISTENSEN*KRISTEN'}]

result = [{'ID': entry['ID'],
           'LastName': entry['Name'].split('*')[0],
           'FirstName': entry['Name'].split('*')[1]} for entry in a]
Now if you print result:
[{'FirstName': 'ROBERT', 'ID': '123', 'LastName': 'ROBERTSON'},
{'FirstName': 'MIKE', 'ID': '456', 'LastName': 'MICHAELS'},
{'FirstName': 'KRISTEN', 'ID': '789', 'LastName': 'KRISTENSEN'}]
Otherwise (assuming that the field 'Name' is at least populated):
results = []
for entry in a:
    name = entry['Name'].split('*')
    result = dict(ID=entry['ID'], LastName=name[0])
    if len(name) > 1:
        result['FirstName'] = name[1]
    results.append(result)
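For example, with a hypothetical record whose Name contains no asterisk (my own example, not from the question), the loop still produces a partial record instead of raising an IndexError:
# hypothetical entry without the '*' delimiter
a = [{'ID': '000', 'Name': 'MADONNA'}]
# running the loop above on this list yields:
# [{'ID': '000', 'LastName': 'MADONNA'}]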
