DynamoDB: query using more than two attributes - Python

In DynamoDB you have to specify in an index the attributes that can be used for querying.
How can I make a query using more than two attributes?
Example using boto:
Table.create('users',
    schema=[
        HashKey('id'),  # defaults to STRING data_type
    ],
    throughput={
        'read': 5,
        'write': 15,
    },
    global_indexes=[
        GlobalAllIndex('FirstnameTimeIndex', parts=[
            HashKey('first_name'),
            RangeKey('creation_date', data_type=NUMBER),
        ],
        throughput={
            'read': 1,
            'write': 1,
        }),
        GlobalAllIndex('LastnameTimeIndex', parts=[
            HashKey('last_name'),
            RangeKey('creation_date', data_type=NUMBER),
        ],
        throughput={
            'read': 1,
            'write': 1,
        })
    ],
    connection=conn)
How can I look for users with first name 'John', last name 'Doe', and created on '3-21-2015' using boto?

Your data modeling process has to take your data retrieval requirements into consideration: in DynamoDB you can only query by hash key or by hash + range key.
If querying by primary key is not enough for your requirements, you can certainly have alternate keys by creating secondary indexes (Local or Global).
However, in certain scenarios you can use a concatenation of multiple attributes as your primary key and avoid the cost of maintaining secondary indexes.
If you need to get users by first name, last name and creation date, I would suggest including those attributes in the Hash and Range Keys, so that no additional indexes are needed.
The Hash Key should contain a value that can be computed by your application and at the same time provides uniform data access. For example, say that you choose to define your keys as follows:
Hash Key (name): first_name#last_name
Range Key (created) : MM-DD-YYYY-HH-mm-SS-milliseconds
You can always append additional attributes in case the ones mentioned are not enough to make your key unique across the table.
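For illustration, a minimal helper that composes these key values in the application might look like this (the function name and the exact timestamp format are assumptions for the sketch, not part of boto):

from datetime import datetime

def make_user_keys(first_name, last_name, created=None):
    # Hypothetical helper: builds the 'first_name#last_name' hash key and a
    # sortable 'MM-DD-YYYY-HH-mm-SS-milliseconds' range key as described above.
    created = created or datetime.utcnow()
    hash_key = '%s#%s' % (first_name, last_name)
    range_key = created.strftime('%m-%d-%Y-%H-%M-%S-') + '%03d' % (created.microsecond // 1000)
    return hash_key, range_key

The table itself then only needs the composite Hash Key and the Range Key in its schema: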
users = Table.create('users', schema=[
    HashKey('name'),
    RangeKey('created'),
], throughput={
    'read': 5,
    'write': 15,
})
Adding the user to the table:
with users.batch_write() as batch:
    batch.put_item(data={
        'name': 'John#Doe',
        'first_name': 'John',
        'last_name': 'Doe',
        'created': '03-21-2015-03-03-02-3243',
    })
Your code to find the user John Doe, created on '03-21-2015' should be something like:
name_john_doe = users.query_2(
    name__eq='John#Doe',
    created__beginswith='03-21-2015'
)

for user in name_john_doe:
    print(user['first_name'])
Important Considerations:
i. If your queries start to get too complicated and the Hash or Range Key too long because too many fields are concatenated, then definitely use secondary indexes. That's a good sign that the primary index alone is not enough for your requirements.
ii. I mentioned that the Hash Key should provide uniform data access:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed." [DYN]
The Hash Key not only allows you to uniquely identify the record, it is also the mechanism that ensures load distribution. The Range Key (when used) indicates the records that will mostly be retrieved together, so storage can also be optimized for that access pattern.
The link below has a complete explanation about the topic:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.UniformWorkload


Pymongo updating documents using list of dictionaries [duplicate]

First, some background.
I have a function in Python which consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to some metadata associated with that ID).
For example, let's feed the function the IDs {0001, 0002, 0003} and say that it returns the following arrays:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures data as so:
{
    "_id": 45,
    "list": [0001, 0002, 0003]
},
{
    "_id": 70,
    "list": [0001]
},
{
    "_id": 20,
    "list": [0001, 0002]
},
{
    "_id": 10,
    "list": [0002, 0003]
},
{
    "_id": 30,
    "list": [0002]
}
As can be seen, I want my collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all IDs mapped to a particular metadata value.
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list containing all metadata (integers) associated with a given ID (e.g. [45, 70, 20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in it. It implements the collection I desire: it updates the document whose "_id" is a given metadata value and adds to its list the ID from which that metadata originated (if the document doesn't exist yet, it inserts it - that's what upsert=True is for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers which can range from 800 to 23000000), and I have around 40000 IDs to analyze, so the collection grows quickly. At the moment I have around 3.2M documents in the collection (one dedicated to each individual metadata integer). I would like a faster solution; if possible, I would like to insert all metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (as it seemed the natural approach), specifying a filter which, to my understanding, means "any document whose _id is in metadataVector". That way, every matching document would add the originating ID to its list (or would be created if it didn't exist, thanks to the upsert option), but instead the collection ends up filled with documents containing a single element in the list and an ObjectId() as _id.
Picture showing the final result.
Is there a way to implement what I want? Should I restructure the DB differently altogether?
Thanks a lot in advance!
Here is an example that uses bulk write operations. A bulk operation submits multiple inserts, updates and deletes (possibly a combination) as a single call to the database and returns a single result. This is more efficient than making many individual calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; after some processing, returns a dictionary keyed by metadata
    # value (the ids are kept in lists so the $each in the bulk update below
    # works for both scenarios)
    return {10: [3], 45: [3]}
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; and after some process... returns a dictionary
    return {10: [3], 45: [3, 1], 20: [1], 70: [1]}
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)

requests = []
for k, v in data.items():
    op = UpdateOne({'_id': k}, {'$push': {'list': {'$each': v}}}, upsert=True)
    requests.append(op)

result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - this option is, again, used for better performance, as the writes can happen in parallel.
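Applied to the add_entries method from the question, a minimal sketch (assuming the same self.SegmentDB collection and the $addToSet update used there) could look like:

from pymongo import UpdateOne

def add_entries(self, id, metadataVector):
    # Build one UpdateOne per metadata value, then submit them all in a
    # single bulk_write call instead of one update_one call per value.
    id = int(id)
    requests = [
        UpdateOne({"_id": data}, {"$addToSet": {"list": id}}, upsert=True)
        for data in metadataVector
    ]
    if requests:
        self.SegmentDB.bulk_write(requests, ordered=False)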
References:
collection.bulk_write

Annotating a queryset using aggregations of values with more than one field

Django annotations are really awesome. However, I can't figure out how to deal with annotations where several values() are required.
Question:
I would like to annotate an author_queryset with counts of items in a related m2m. I don't know if I need to use a Subquery or not, however:
annotated_queryset = author_queryset.annotate(genre_counts=Subquery(genre_counts))
Returns:
SyntaxError: subquery must return only one column
I've tried casting the values to a JSONField to get it back in one field, hoping that I could use JSONBAgg on it, since I'm using Postgres and need to filter the result:
subquery = Author.objects.filter(id=OuterRef('pk')).values('id','main_books__genre').annotate(genre_counts=Count('main_books__genre'))
qss = qs.annotate(genre_counts=Subquery(Cast(subquery,JSONField()), output_field=JSONField()))
Yields:
AttributeError: 'Cast' object has no attribute 'clone'
I'm not sure what I need to get the dict cast to a JSONField(). There's some great info here about filtering on these. There's also something for postgres coming soon in the development version called ArraySubQuery() which looks to address this. However, I can't use this feature until it's in a stable release.
Desired result
I would like to annotate so I could filter based on the annotations, like this:
annotated_queryset.filter(genre_counts__scifi__gte=5)
Detail
I can use dunders to get at a related field, then count like so:
# get all the authors with Virginia in their name
author_queryset = Author.objects.filter(name__icontains='Virginia')
author_queryset.count()
# returns: 21
# aggregate the book counts by genre in the Book m2m model
genre_counts = author_queryset.values('id','main_books__genre').annotate(genre_counts=Count('main_books__genre'))
genre_counts.count()
# returns: 25
This is because several genre counts can be returned for each Author object in the queryset. In this particular example, there is an Author who has books in 4 different genres. To illustrate:
...
{'id': 'authorid:0054f04', 'main_books__genre': 'scifi', 'genre_counts': 1}
{'id': 'authorid:c245457', 'main_books__genre': 'fantasy', 'genre_counts': 4}
{'id': 'authorid:a129a73', 'main_books__genre': None, 'genre_counts': 0}
{'id': 'authorid:f41f14b', 'main_books__genre': 'mystery', 'genre_counts': 16}
{'id': 'authorid:f41f14b', 'main_books__genre': 'romance', 'genre_counts': 1}
{'id': 'authorid:f41f14b', 'main_books__genre': 'scifi', 'genre_counts': 9}
{'id': 'authorid:f41f14b', 'main_books__genre': 'fantasy', 'genre_counts': 3}
...
and there is another Author with 2 genres, while the rest have one genre each, which accounts for the 25 values.
Hoping this makes sense to someone! I'm sure there is a way to handle this properly without waiting for the feature described above.
You want to use .annotate() without a Subquery because, as you found, a subquery needs to return a single value. You should be able to span all the relationships in the annotate's Count expression.
Unfortunately, Django doesn't currently support what you're looking for with genre_counts__scifi__gte=5. You can instead structure it so that the Count is computed with a filter passed to it.
from django.db.models import Count, Q

selected_genre = 'scifi'
annotated_queryset = author_queryset.annotate(
    genre_count=Count("main_books__genre", filter=Q(main_books__genre=selected_genre))
).filter(genre_count__gte=5)
To get the full breakdown, you'll be better off returning the per-genre rows and doing the final aggregation in the application, as you showed in your question.
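For instance, a minimal sketch of that application-side aggregation, assuming the genre_counts values queryset from the question, could be:

from collections import defaultdict

# Regroup the per-(author, genre) rows into one dict of counts per author.
breakdown = defaultdict(dict)
for row in genre_counts:
    breakdown[row['id']][row['main_books__genre']] = row['genre_counts']

# e.g. authors with at least 5 sci-fi books
scifi_authors = [author_id for author_id, counts in breakdown.items()
                 if counts.get('scifi', 0) >= 5]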

Python Object ID

I would like to create an object ID in Python. Let me explain:
I know that MySQL, SQLite, MongoDB, etc. exist, but I would like to at least create an object ID for storing data in JSON.
Before, I was putting the JSON info inside a list and the ID was the index of that item in the list, for example:
data = [{"name": userName}]
data[0]["id"] = len(data) - 1
Then I realized that was wrong and obviously doesn't look like an object ID. I then thought the ID could be the date and time together, but that seemed wrong too. So I would like to know the best way to make something like an object ID that represents this JSON inside the list. The list will get much longer; it's for users or clients (it's just a personal project). And what would an example of a method for creating the ID look like?
Thanks so much, I hope I explained it well.
If you just want to create a locally unique value, you could use a really simple autoincrement approach.
data = [
    {"name": "Bill"},
    {"name": "Javier"},
    {"name": "Jane"},
    {"name": "Xi"},
    {"name": "Nosferatu"},
]

current_id = 1
for record in data:
    record["id"] = current_id
    current_id += 1
    print(record)

# {'name': 'Bill', 'id': 1}
# {'name': 'Javier', 'id': 2}
# {'name': 'Jane', 'id': 3}
# {'name': 'Xi', 'id': 4}
# {'name': 'Nosferatu', 'id': 5}
To add a new value when you're not initializing like this, you can get the current highest id with max(d.get("id", 0) for d in data).
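For example, appending one more record under that scheme could look like this (the "Ada" record is just an illustration):

new_record = {"name": "Ada"}
new_record["id"] = max(d.get("id", 0) for d in data) + 1
data.append(new_record)
# {'name': 'Ada', 'id': 6}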
This may cause various problems depending on your use case (for example, concurrent writers or deletions can lead to duplicate or reused ids). If you don't want to worry about that, you could also throw UUIDs at it; they're heavier but easy to generate with reasonable confidence of uniqueness.
from uuid import uuid4
data = [{"name": "Conan the Librarian"}]
data[0]["id"] = str(uuid4())
print(data)
# 'id' will be different each time; example:
# [{'name': 'Conan the Librarian', 'id': '85bb4db9-c450-46e3-a027-cb573a09f3e3'}]
Without knowing your actual use case, though, it's impossible to say whether one, either, or both of these approaches would be useful or appropriate.

MongoDB + K means clustering

I'm using MongoDB as my datastore and wish to store a "clustered" configuration of my documents in a separate collection.
So in one collection I'd have my original set of objects, and the second would hold something like:
kMeansCollection: {
    1: [mongoObjectCopy1, mongoObjectCopy2, ...],
    2: [mongoObjectCopy3, mongoObjectCopy4, ...]
}
I'm following the K-means implementation for text clustering here, http://tech.swamps.io/recipe-text-clustering-using-nltk-and-scikit-learn/, but I'm having a hard time figuring out how I'd tie the outputs back into MongoDB.
An example (taken from the link):
if __name__ == "__main__":
    tags = collection.find({}, {'tag_data': 1, '_id': 0})
    clusters = cluster_texts(tags, 5)  # algo runs here with 5 clusters
    pprint(dict(clusters))
The var "tags" is the required input for the algo to run.
It must be in the form of an array, but currently the query returns an array of objects (I must therefore extract the text values from the query results).
However, after magically clustering my collection into 5 groups, how can I reunite the results with their respective object entries from Mongo?
I am only feeding specific text content from one property of the object.
Thanks a lot!
You would need to have some identifier for the documents. It is probably a good idea to include the _id field in your query so that you do have a unique document identifier. Then you can create parallel lists of ids and tag_data.
docs = collection.find({}, {'tag_data': 1, '_id': 1})
ids = [doc['_id'] for doc in docs]
tags = [doc['tag_data'] for doc in docs]
Then call the cluster function on the tag data.
clusters = cluster_text(tags)
And zip the results back with the ids.
doc_clusters = zip(ids, clusters)
From here you have tuples of (_id, cluster), so you can update the cluster labels on your Mongo documents.
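As a sketch, assuming you want to store each label in a (hypothetical) 'cluster' field on the original documents:

for doc_id, label in doc_clusters:
    # int() in case the clustering library returns numpy integer types
    collection.update_one({'_id': doc_id}, {'$set': {'cluster': int(label)}})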
The efficient way to do this is to use the aggregation framework to build the lists of "_id" and "tag_data" values with a server-side operation. This also reduces both the amount of data sent over the wire and the time and memory used to decode documents on the client side.
You need to $group your documents and use the $push accumulator operator to return the list of _id values and the list of tag_data values. The aggregate() method gives access to the aggregation pipeline.
cursor = collection.aggregate([{
    '$group': {
        '_id': None,
        'ids': {'$push': '$_id'},
        'tags': {'$push': '$tag_data'}
    }
}])
You then retrieve your data using the .next() method on the CommandCursor; because we group by None, the cursor holds a single element.
data = cursor.next()
After that, simply call your function and zip the result.
clusters = cluster_text(data['tags'])
doc_clusters = zip(data['ids'], clusters)

Implementing "select distinct ... from ..." over a list of Python dictionaries

Here is my problem: I have a list of Python dictionaries of identical form that are meant to represent the rows of a table in a database, something like this:
[
    {'ID': 1,
     'NAME': 'Joe',
     'CLASS': '8th',
     ...},
    {'ID': 1,
     'NAME': 'Joe',
     'CLASS': '11th',
     ...},
    ...
]
I have already written a function to get the unique values for a particular field in this list of dictionaries, which was trivial. That function implements something like:
select distinct NAME from ...
However, I want to be able to get the list of unique combinations of multiple fields, similar to:
select distinct NAME, CLASS from ...
which I am finding to be non-trivial. Is there an algorithm or built-in Python function to help me with this quandary?
Before you suggest loading the CSV files into a SQLite table or something similar, that is not an option for the environment I'm in, and trust me, that was my first thought.
If you want it as a generator:
def select_distinct(dictionaries, keys):
    seen = set()
    for d in dictionaries:
        v = tuple(d[k] for k in keys)
        if v in seen:
            continue
        yield v
        seen.add(v)
If you want the result in some other form (e.g., a list instead of a generator), it's not hard to alter this (e.g., .append to an initially-empty result list instead of yielding, and return the result list at the end).
To be called, of course, as
for values_tuple in select_distinct(thedicts, ('NAME', 'CLASS')):
    ...
or the like.
distinct_list = list(set([(d['NAME'], d['CLASS']) for d in row_list]))
where row_list is the list of dicts you have.
You can also implement the task using hashing: just hash the values of the columns that appear in the distinct query and skip any row whose hash you have already seen.
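A minimal sketch of that idea (essentially the same bookkeeping as the generator above, just with an explicit hash per row):

def select_distinct_hashed(rows, keys):
    # Keep the hash of every key-values tuple already emitted. Note that a
    # hash collision could, in theory, drop a distinct row, so storing the
    # tuples themselves (as in the generator above) is the safer variant.
    seen_hashes = set()
    result = []
    for row in rows:
        values = tuple(row[k] for k in keys)
        if hash(values) not in seen_hashes:
            seen_hashes.add(hash(values))
            result.append(values)
    return result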
