Inserting documents in MongoDB in a specific order using pymongo - python

I have to insert documents in MongoDB in a left-shift manner, i.e. if the collection contains 60 documents, I remove the 1st document and want to insert the new document at the rear of the collection. But when I insert the 61st document and onwards, the documents end up in arbitrary positions.
Is there any way I can insert the documents in the order that I specified above?
Or do I have to do this processing when I am retrieving the values from the database? If yes then how?
The data format is:
data = {
    "time": "10:14:23",               # timestamp
    "stats": [<list of dictionaries>]
}
The code I am using is:
from pymongo import MongoClient
db = MongoClient().test
db.timestamp.delete_one({"_id":db.timestamp.find()[0]["_id"]})
db.timestamp.insert_one(new_data)
Here, timestamp is the name of the collection.
Edit: Changed the code. Is there a better way?
from pymongo.operations import InsertOne, DeleteOne

def save(collection, data, cap=60):
    if collection.count() == cap:
        # Find the oldest document by timestamp and remove it
        top_doc_time = min(doc['time'] for doc in collection.find())
        collection.delete_one({'time': top_doc_time})
    collection.insert_one(data)

A bulk write operation is ordered by default, which means the operations are executed sequentially.
from pymongo.operations import DeleteOne, InsertOne

def left_shift_insert(collection, doc, cap=60):
    ops = []
    # Number of documents that must be removed so the insert keeps the collection at the cap
    variance = max((collection.count() + 1) - cap, 0)
    delete_ops = [DeleteOne({})] * variance
    ops.extend(delete_ops)
    ops.append(InsertOne(doc))
    return collection.bulk_write(ops)
left_shift_insert(db.timestamp, new_data)

Related

Fast way to convert SQLAlchemy objects to Python dicts

I have this query that returns a list of student objects:
query = db.session.query(Student).filter(Student.is_deleted == false())
query = query.options(joinedload('project'))
query = query.options(joinedload('image'))
query = query.options(joinedload('student_locator_map'))
query = query.options(subqueryload('attached_addresses'))
query = query.options(subqueryload('student_meta'))
query = query.order_by(Student.student_last_name, Student.student_first_name,
                       Student.student_middle_name, Student.student_grade, Student.student_id)
query = query.filter(filter_column == field_value)
students = query.all()
The query itself does not take much time. The problem is converting all these objects (there can be 5000+) to Python dicts; it takes over a minute with this many objects. Currently, the code loops through the objects and converts each one using to_dict(). I have also tried __dict__, which was much faster, but it does not seem to convert all the related objects.
How can I convert all these Student objects and related objects quickly?
Maybe this will help you...
from collections import defaultdict
from sqlalchemy import inspect

def query_to_dict(student_results):
    result = defaultdict(list)
    for obj in student_results:
        instance = inspect(obj)
        for key, x in instance.attrs.items():
            result[key].append(x.value)
    return result

output = query_to_dict(students)
query = query.options(joinedload('attached_addresses').joinedload('address'))
By chaining address joinedload to attached_addresses I was able to significantly speed up the query.
My understanding of why this is the case:
Address objects were not being loaded with the initial query. On every iteration through the loop, the database was hit to retrieve the Address object. With the chained joinedload, Address objects are now loaded with the initial query.
Thanks to Corley Brigman for the help.
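For context, here is a minimal sketch of how that chained loader option slots into the original query; the model, session, and relationship names ('Student', 'attached_addresses', 'address') are taken from the question, everything else is assumed:
from sqlalchemy import false
from sqlalchemy.orm import joinedload

# Sketch only: Student, db.session and the relationship names come from the question.
query = db.session.query(Student).filter(Student.is_deleted == false())
# Eager-load each student's attached_addresses and, for each of those, its
# related address in the same initial query instead of one query per row.
query = query.options(joinedload('attached_addresses').joinedload('address'))
students = query.all()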

Read optimisation cassandra using python

I have a table with the following model:
CREATE TABLE IF NOT EXISTS {} (
    user_id bigint,
    pseudo text,
    importance float,
    is_friend_following bigint,
    is_friend boolean,
    is_following boolean,
    PRIMARY KEY ((user_id), is_friend_following)
);
I also have a table containing my seeds. Those (20) users are the starting point of my graph. So I select their IDs and search in the table above to get their followers and friends, and from there I build my graph (with networkX).
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        friend_ids = []
        follower_ids = []
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    friend_ids.append(row.friend_follower_id)
                if row.is_follower:
                    follower_ids.append(row.friend_follower_id)
        if friend_ids:
            for friend_id in friend_ids:
                obj.graph.add_edge(seed, friend_id)
        if follower_ids:
            for follower_id in follower_ids:
                obj.graph.add_edge(follower_id, seed)
    return obj
The problem is that building the graph takes too long and I would like to optimize it.
I've got approximately 5 million rows in my table 'network_table'.
I'm wondering whether it would be faster to do a single query on the whole table instead of a query with a WHERE clause per seed. Would it fit in memory? Is that a good idea? Is there a better way?
I suspect the real issue may not be the queries but rather the processing time.
I'm wondering whether it would be faster to do a single query on the whole table instead of a query with a WHERE clause per seed. Would it fit in memory? Is that a good idea? Is there a better way?
There should not be any problem with doing a single query on the whole table if you enable paging (https://datastax.github.io/python-driver/query_paging.html - using fetch_size). Cassandra will return up to fetch_size rows per page and will fetch additional results as you read them from the result set.
Please note that if you have many rows in the table that are not seed related, a full scan may be slower, as you will receive rows that do not include a "seed".
Disclaimer - I am part of the team building ScyllaDB - a Cassandra-compatible database.
ScyllaDB recently published a blog post on how to efficiently do a full scan in parallel (http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/), which applies to Cassandra as well - if a full scan is relevant and you can build the graph in parallel, this may help you.
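For reference, a minimal sketch of such a paged full-table scan with the DataStax Python driver; the keyspace name is hypothetical, and the table and column names follow the question's code:
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster().connect('my_keyspace')  # keyspace name is hypothetical

# fetch_size caps each page; the driver transparently requests the next page
# as the result set is iterated.
statement = SimpleStatement(
    "SELECT user_id, friend_follower_id, is_friend, is_follower FROM network_table",
    fetch_size=1000,
)
for row in session.execute(statement):
    pass  # build graph edges here, e.g. based on row.is_friend / row.is_follower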
It seems like you can get rid of the last 2 if statements, since you're looping a second time over data you have already iterated through once:
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    obj.graph.add_edge(seed, row.friend_follower_id)
                elif row.is_follower:
                    obj.graph.add_edge(row.friend_follower_id, seed)
    return obj
This also gets rid of many append operations on lists you don't actually need, and should speed up this function.

Parsing key:value pairs in a list

I have inherited a Mongo structure with key:value pairs within an array. I need to extract the collected and spent values from the tags below; however, I don't see an easy way to do this using the $regex commands in the Mongo query documentation.
{
    "_id" : "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags" : ["collected:172", "donuts_used:1", "spent:150"]
}
The ideal output of extracting these values is to dump them into the format below when querying them using pymongo. I really don't know how best to return only the values I need. Please advise.
94204a81-9540-4ba8-bb93-fc5475c278dc, 172, 150
print d['_id'], ' '.join([x.replace('collected:', '').replace('spent:', '')
                          for x in d['tags'] if 'collected' in x or 'spent' in x])
>>>
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
In case you are having a hard time writing the Mongo query (the elements inside your list are strings rather than key-value pairs, so they require parsing), here is a solution in plain Python that might be helpful.
>>> import pymongo
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client['test']
>>> collection = db['stackoverflow']
>>> collection.find_one()
{u'_id': u'94204a81-9540-4ba8-bb93-fc5475c278dc', u'tags': [u'collected:172', u'donuts_used:1', u'spent:150']}
>>> record = collection.find_one()
>>> print record['_id'], record['tags'][0].split(':')[-1], record['tags'][2].split(':')[-1]
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
Instead of using find_one(), you can retrieve all the records with the appropriate function and loop through every record. I am not sure how consistent your data is, so I hard-coded the first and third elements of the list; you may want to tweak that part and add a try/except at the record level.
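If it helps, here is a rough sketch of that idea applied to every record, parsing the tags by prefix instead of by position (it reuses the collection from the session above; field names are taken from the question):
# Iterate over every document and pull values out of the "key:value" strings
# by prefix, so the order of the tags list no longer matters.
for record in collection.find():
    values = {}
    for tag in record.get('tags', []):
        key, _, value = tag.partition(':')
        values[key] = value
    try:
        print record['_id'], values['collected'], values['spent']
    except KeyError:
        pass  # skip documents missing one of the expected tags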
Here is one way to do it, if all you had was that sample JSON object.
Please pay attention to the note about the ordering of tags etc. It is probably best to revise your "schema" so that you can more easily query, collect and aggregate your "tags" as you call them.
import re

# Returns a csv string of _id, collected, spent
def parse(obj):
    _id = obj["_id"]
    # This is terribly brittle since the insertion of any other type of tag
    # between 'c' and 's' will cause these indices to be messed up.
    # It is probably much better to directly query these, or store them as individual
    # entities in your mongo "schema".
    collected = re.sub(r"collected:(\d+)", r"\1", obj["tags"][0])
    spent = re.sub(r"spent:(\d+)", r"\1", obj["tags"][2])
    return ", ".join([_id, collected, spent])

# Some sample object
parse_me = {
    "_id": "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags": ["collected:172", "donuts_used:1", "spent:150"]
}

print parse(parse_me)

Retrieve all items from DynamoDB using query?

I am trying to retrieve all items in a dynamodb table using a query. Below is my code:
import boto.dynamodb2
from boto.dynamodb2.table import Table
from time import sleep
c = boto.dynamodb2.connect_to_region(aws_access_key_id="XXX",aws_secret_access_key="XXX",region_name="us-west-2")
tab = Table("rip.irc",connection=c)
x = tab.query()
for i in x:
    print i
    sleep(1)
However, I receive the following error:
ValidationException: ValidationException: 400 Bad Request
{'message': 'Conditions can be of length 1 or 2 only', '__type': 'com.amazon.coral.validate#ValidationException'}
The code I have is pretty straightforward and out of the boto dynamodb2 docs, so I am not sure why I am getting the above error. Any insights would be appreciated (new to this and a bit lost). Thanks
EDIT: I have both a hash key and a range key. I am able to query by specific hash keys. For example,
x = tab.query(hash__eq="2014-01-20 05:06:29")
How can I retrieve all items though?
Ahh ok, figured it out. If anyone needs:
You can't use the query method on a table without specifying a specific hash key. The method to use instead is scan. So if I replace:
x = tab.query()
with
x = tab.scan()
I get all the items in my table.
I'm on Groovy, but this should give you a hint. The error:
{'message': 'Conditions can be of length 1 or 2 only'}
is telling you that your key condition can be of length 1 -> hash key only, or length 2 -> hash key + range key. Anything in the query beyond the keys will provoke this error.
The reason for this error is that you are trying to run a search query using only a key condition. You have to add a separate filter condition to perform your query.
My code
String keyQuery = " hashKey = :hashKey and rangeKey between :start and :end "
queryRequest.setKeyConditionExpression(keyQuery)// define key query
String filterExpression = " yourParam = :yourParam "
queryRequest.setFilterExpression(filterExpression)// define filter expression
queryRequest.setExpressionAttributeValues(expressionAttributeValues)
queryRequest.setSelect('ALL_ATTRIBUTES')
QueryResult queryResult = client.query(queryRequest)
.scan() does not automatically return all elements of a table due to pagination of the table. There is a 1 MB max response limit (Dynamodb Max response limit).
Here is an implementation of the boto3 scan that follows the LastEvaluatedKey pagination:
import boto3

dynamo = boto3.resource('dynamodb')

def scanRecursive(tableName, **kwargs):
    """
    NOTE: Anytime you are filtering by a specific equivalency attribute such as id, name
    or date equal to ... etc., you should consider using a query not scan

    kwargs are any parameters you want to pass to the scan operation
    """
    dbTable = dynamo.Table(tableName)
    response = dbTable.scan(**kwargs)
    if kwargs.get('Select') == "COUNT":
        return response.get('Count')
    data = response.get('Items')
    while 'LastEvaluatedKey' in response:
        # Continue the scan from where the previous page stopped
        response = dbTable.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        data.extend(response['Items'])
    return data
I ran into this error when I was misusing KeyConditionExpression instead of FilterExpression when querying a dynamodb table.
KeyConditionExpression should only be used with partition key or sort key values.
FilterExpression should be used when you want filter your results even more.
However, do note that using a FilterExpression consumes the same read capacity as it would without one, because the query is still performed based on the KeyConditionExpression; items are then removed from the results according to your FilterExpression.
Source
Working with Queries
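As a rough boto3 sketch of that split between key conditions and filters (the table and attribute names here are hypothetical, not from the question):
import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('my_table')  # hypothetical table name

response = table.query(
    # KeyConditionExpression may reference only the partition key (and optionally the sort key)
    KeyConditionExpression=Key('pk').eq('2014-01-20 05:06:29'),
    # FilterExpression is applied to the items after the key lookup,
    # so it does not reduce the read capacity consumed
    FilterExpression=Attr('category').eq('electronics'),
)
items = response['Items']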
This is how I do a query if someone still needs a solution:
def method_name(a, b)
  results = self.query(
    key_condition_expression: '#T = :t',
    filter_expression: 'contains(#S, :s)',
    expression_attribute_names: {
      '#T' => 'your_table_field_name',
      '#S' => 'your_table_field_name'
    },
    expression_attribute_values: {
      ':t' => a,
      ':s' => b
    }
  )
  results
end

How to iterate and update documents with PyMongo?

I have a simple, single-client setup for MongoDB and PyMongo 2.6.3. The goal is to iterate over each document in the collection collection and update (save) each document in the process. The approach I'm using looks roughly like:
cursor = collection.find({})
index = 0
count = cursor.count()
while index != count:
    doc = cursor[index]
    print 'updating doc ' + doc['name']
    # modify doc ..
    collection.save(doc)
    index += 1
cursor.close()
The problem is that save is apparently modifying the order of documents in the cursor. For example, if my collection is made of 3 documents (ids omitted for clarity):
{
"name": "one"
}
{
"name": "two"
}
{
"name": "three"
}
the above program outputs:
> updating doc one
> updating doc two
> updating doc two
If however, the line collection.save(doc) is removed, the output becomes:
> updating doc one
> updating doc two
> updating doc three
Why is this happening? What is the right way to safely iterate and update documents in a collection?
Found the answer in MongoDB documentation:
Because the cursor is not isolated during its lifetime, intervening write operations on a document may result in a cursor that returns a document more than once if that document has changed. To handle this situation, see the information on snapshot mode.
Snapshot mode is enabled on the cursor, and makes a nice guarantee:
snapshot() traverses the index on the _id field and guarantees that the query will return each document (with respect to the value of the _id field) no more than once.
To enable snapshot mode with PyMongo:
cursor = collection.find(spec={},snapshot=True)
as per the PyMongo find() documentation. I confirmed that this fixed my problem.
Snapshot does the job.
But on PyMongo 2.9 and onwards, the syntax is slightly different:
cursor = collection.find(modifiers={"$snapshot": True})
or for any version,
cursor = collection.find({"$snapshot": True})
as per the PyMongo documentation.
I couldn't recreate your situation, but off the top of my head: because you are fetching the results one by one from the db the way you are doing it, you may actually be creating more documents as you go (saving one and then fetching the next).
You can try holding the results in a list (that way you fetch all results at once - it might be heavy, depending on your query):
cursor = collection.find({})
# index = 0
results = [res for res in cursor]  # count = cursor.count()
cursor.close()
for res in results:  # while index != count // This iterates the list without you needing to keep a counter
    # doc = cursor[index]  // No need for this since 'res' holds the current record in the loop cycle
    print 'updating doc ' + res['name']  # print 'updating doc ' + doc['name']
    # modify doc ..
    collection.save(res)
    # index += 1  // Again, no need for a counter
Hope it helps
