I have a simple, single-client setup for MongoDB with PyMongo 2.6.3. The goal is to iterate over each document in a collection and update (save) each one in the process. The approach I'm using looks roughly like this:
cursor = collection.find({})
index = 0
count = cursor.count()
while index != count:
    doc = cursor[index]
    print 'updating doc ' + doc['name']
    # modify doc ...
    collection.save(doc)
    index += 1
cursor.close()
The problem is that save is apparently modifying the order of documents in the cursor. For example, if my collection consists of three documents (ids omitted for clarity):
{
    "name": "one"
}
{
    "name": "two"
}
{
    "name": "three"
}
the above program outputs:
> updating doc one
> updating doc two
> updating doc two
If, however, the line collection.save(doc) is removed, the output becomes:
> updating doc one
> updating doc two
> updating doc three
Why is this happening? What is the right way to safely iterate and update documents in a collection?
Found the answer in the MongoDB documentation:
Because the cursor is not isolated during its lifetime, intervening write operations on a document may result in a cursor that returns a document more than once if that document has changed. To handle this situation, see the information on snapshot mode.
Snapshot mode is enabled on the cursor and makes a nice guarantee:
snapshot() traverses the index on the _id field and guarantees that the query will return each document (with respect to the value of the _id field) no more than once.
To enable snapshot mode with PyMongo:
cursor = collection.find(spec={}, snapshot=True)
as per the PyMongo find() documentation. I confirmed that this fixed my problem.
Snapshot does the trick. Note that on PyMongo 2.9 and onwards the syntax is slightly different:
cursor = collection.find(modifiers={"$snapshot": True})
or, for any version,
cursor = collection.find({"$snapshot": True})
as per the PyMongo documentation.
I couldn't recreate your situation, but off the top of my head: since fetching the results the way you do gets them one by one from the db, you may actually be creating more as you go (saving and then fetching the next one).
You can try holding the results in a list (that way you're fetching all results at once; this might be heavy, depending on your query):
cursor = collection.find({})
results = [res for res in cursor]  # fetch all results at once
cursor.close()
for res in results:  # iterates the list, no counter needed
    print 'updating doc ' + res['name']
    # modify doc ...
    collection.save(res)
Hope it helps
I have produced a set of matching IDs from a database collection that looks like this:
{ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
Each ObjectId represents an ID on a collection in the DB.
I got that list by doing this (which, incidentally, I also think I am doing wrong, but I don't yet know another way):
# Find all question IDs
question_list = list(mongo.db.questions.find())
all_questions = []
for x in question_list:
    all_questions.append(x["_id"])

# Find all con IDs that match the question IDs
con_id = list(mongo.db.cons.find())
con_id_match = []
for y in con_id:
    con_id_match.append(y["question_id"])

matches = set(con_id_match).intersection(all_questions)

print("matches", matches)
print("all_questions", all_questions)
print("con_id_match", con_id_match)
And that brings up all the IDs that are associated with a match, such as the three at the top of this post. I will show what each print statement outputs at the bottom of this post.
Now I want to get each ObjectId separately as a variable so I can search for it in the collection:
mongo.db.cons.find_one({"con": matches})
where matches (which will probably need to be a new variable) will be each of the ObjectIds that match the DB reference.
So, how do I separate the ObjectIds in matches so that I get one at a time while iterating? I tried a for loop, but it threw an error; I guess I am writing it wrong for a set. Thanks for the help.
Print Statements:
**matches** {ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feb247f1bb7a1297060342e')}
**all_questions** [ObjectId('5feafb52ae1b389f59423a91'), ObjectId('5feafb64ae1b389f59423a92'), ObjectId('5feaffcfb4cf9e627842b1d8'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feb247f1bb7a1297060342e'), ObjectId('6009b6e42b74a187c02ba9d7'), ObjectId('6010822e08050e32c64f2975'), ObjectId('601d125b3c4d9705f3a9720d')]
**con_id_match** [ObjectId('5feb247f1bb7a1297060342e'), ObjectId('5feafffbb4cf9e627842b1d9'), ObjectId('5feaffcfb4cf9e627842b1d8')]
Usually you can just use the find method, which yields documents one by one, and filter the documents during iteration in Python, like this:
# fetch only the ids
question_ids = {question['_id'] for question in mongo.db.questions.find({}, {'_id': 1})}

matches = []
for con in mongo.db.cons.find():
    con_id = con['question_id']
    if con_id in question_ids:
        matches.append(con_id)
        # you can process the matched and loaded con here

print(matches)
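To address the original question directly: a set of ObjectIds can be iterated with a plain for loop, so each id can be used in its own lookup. A minimal sketch, reusing the collection and field names from the question:

# Iterate the matched ObjectIds one at a time and look each one up in cons.
for question_id in matches:
    con = mongo.db.cons.find_one({"question_id": question_id})
    if con is not None:
        print(con)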
If you have a huge amount of data, you can take a look at the aggregation framework.
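For example, the matching could be pushed to the server with a $lookup stage. This is only a hedged sketch (the collection and field names are taken from the question; the pipeline is not part of the original answer):

# Join cons to questions on question_id -> _id, keep only cons that matched.
pipeline = [
    {"$lookup": {
        "from": "questions",
        "localField": "question_id",
        "foreignField": "_id",
        "as": "question",
    }},
    {"$match": {"question": {"$ne": []}}},
]
matched_ids = [doc["question_id"] for doc in mongo.db.cons.aggregate(pipeline)]
print(matched_ids)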
I have to insert documents into MongoDB in a left-shift manner, i.e. if the collection contains 60 documents, I remove the first document and want to insert the new document at the rear. But from the 61st element onwards, the documents are inserted at seemingly random positions.
Is there any way I can insert the documents in the order I specified above?
Or do I have to do this processing when I retrieve the values from the database? If so, how?
The data format is:
data = {
    "time": "10:14:23",  # timestamp
    "stats": [<list of dictionaries>]
}
The code I am using is:
from pymongo import MongoClient

db = MongoClient().test
db.timestamp.delete_one({"_id": db.timestamp.find()[0]["_id"]})
db.timestamp.insert_one(new_data)
where timestamp is the name of the collection.
Edit: I changed the code as follows. Is there any better way?
from pymongo.operations import InsertOne, DeleteOne

def save(collection, data, cap=60):
    if collection.count() == cap:
        # remove the document with the oldest timestamp before inserting
        top_doc_time = min(doc['time'] for doc in collection.find())
        collection.delete_one({'time': top_doc_time})
    collection.insert_one(data)
A bulk write operation guarantees ordering by default, which means the operations are executed sequentially.
from pymongo.operations import DeleteOne, InsertOne

def left_shift_insert(collection, doc, cap=60):
    ops = []
    # how many documents must go so the insert keeps us at the cap
    variance = max((collection.count() + 1) - cap, 0)
    delete_ops = [DeleteOne({})] * variance
    ops.extend(delete_ops)
    ops.append(InsertOne(doc))
    return collection.bulk_write(ops)

left_shift_insert(db.timestamp, new_data)
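For reference, the ordering is controlled by the ordered flag of bulk_write, which defaults to True in PyMongo, so the call above relies on the default; spelling it out is purely illustrative:

# ordered=True (the default) executes the deletes before the insert;
# ordered=False would let the server batch and reorder the operations.
collection.bulk_write(ops, ordered=True)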
I queried two databases to get two relations. I iterate over those relations once to form maps, and then again to perform some calculations. However, when I attempt to iterate over the same relations a second time, I find that no iteration actually occurs. Here is the code:
dev_connect = dev_engine.connect()
prod_connect = prod_engine.connect()  # from a different database

Relation1 = dev_engine.execute(sqlquery1)
Relation2 = prod_engine.execute(sqlquery)

before_map = {}
after_map = {}
for row in Relation1:
    before_map[row['instrument_id']] = row
for row2 in Relation2:
    after_map[row2['instrument_id']] = row2

update_count = insert_count = delete_count = 0
change_list = []

count = 0
for prod_row in Relation2:
    count += 1
    result = list(prod_row)
    ...
    change_list.append(result)

count2 = 0
for before_row in Relation1:
    count2 += 1
    result = before_row
    ...

print count, count2  # prints 0 0
before_map and after_map are not empty, so Relation1 and Relation2 definitely have tuples in them. Yet count and count2 are 0, so the prod_row and before_row for loops never execute. Why can't I iterate over Relation1 and Relation2 a second time?
When you call execute on a SQLAlchemy engine, you get back a ResultProxy, which is a facade over a DBAPI cursor for the rows your query returns.
Once you iterate over all the results of the ResultProxy, it automatically closes the underlying cursor so you can't use the results again by just iterating over it, as documented on the SQLAlchemy page:
The returned result is an instance of ResultProxy, which references a DBAPI cursor and provides a largely compatible interface with that of the DBAPI cursor. The DBAPI cursor will be closed by the ResultProxy when all of its result rows (if any) are exhausted.
You can solve your problem in a couple of ways:
Store the results in a list. Just do a list-comprehension against the rows returned:
Relation1 = dev_engine.execute(sqlquery1)
relation1_items = [r for r in Relation1]
# ...
# now you can iterate over relation1_items as much as you want
Do everything you need to in one pass through each row set returned. I don't know if this option is feasible for you since I don't know if the full extent of your calculations require cross-referencing between your before_map and after_map objects.
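A third option, implied by the code in the question itself: since before_map and after_map already hold every row after the first pass, the later loops can iterate over those dicts instead of the exhausted ResultProxy objects. A sketch using the question's variable names:

# Reuse the maps built during the first pass instead of re-reading the cursors.
count = 0
change_list = []
for instrument_id, prod_row in after_map.items():
    count += 1
    change_list.append(list(prod_row))

count2 = 0
for instrument_id, before_row in before_map.items():
    count2 += 1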
I have inherited a Mongo structure with key:value pairs within an array. I need to extract the collected and spent values in the tags below; however, I don't see an easy way to do this using the $regex commands in the Mongo query documentation.
{
    "_id": "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags": ["collected:172", "donuts_used:1", "spent:150"]
}
The ideal output would dump these values into the format below when queried using pymongo. I really don't know how best to return only the values I need. Please advise.
94204a81-9540-4ba8-bb93-fc5475c278dc, 172, 150
print d['_id'], ' '.join([x.replace('collected:', '').replace('spent:', '')
                          for x in d['tags'] if 'collected' in x or 'spent' in x])
>>>
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
In case you are having a hard time writing the Mongo query (the elements inside your list are strings rather than key-value pairs, so they require parsing), here is a solution in plain Python that might be helpful.
>>> import pymongo
>>> from pymongo import MongoClient
>>> client = MongoClient('localhost', 27017)
>>> db = client['test']
>>> collection = db['stackoverflow']
>>> collection.find_one()
{u'_id': u'94204a81-9540-4ba8-bb93-fc5475c278dc', u'tags': [u'collected:172', u'donuts_used:1', u'spent:150']}
>>> record = collection.find_one()
>>> print record['_id'], record['tags'][0].split(':')[-1], record['tags'][2].split(':')[-1]
94204a81-9540-4ba8-bb93-fc5475c278dc 172 150
Instead of using find_one(), you can retrieve all the records with find() and loop through every record. I am not sure how consistent your data is, so I hard-coded the first and third elements of the list; you may want to tweak that part and add a try/except at the record level.
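As a hedged sketch of that more robust variant (this is not code from the original answer): parse each record's tags into a dict so the tag positions no longer matter:

# Split each "key:value" tag string once and build a dict per record.
for record in collection.find():
    tags = dict(tag.split(':', 1) for tag in record['tags'])
    try:
        print record['_id'], tags['collected'], tags['spent']
    except KeyError:
        pass  # skip records missing one of the expected tags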
Here is one way to do it, if all you have is that sample JSON object.
Please pay attention to the note about the ordering of tags: it is probably best to revise your "schema" so that you can more easily query, collect, and aggregate your "tags", as you call them.
import re

# Returns a CSV string of _id, collected, spent
def parse(obj):
    _id = obj["_id"]
    # This is terribly brittle, since the insertion of any other type of tag
    # between 'collected' and 'spent' will throw these indices off.
    # It is probably much better to query these directly, or to store them as
    # individual fields in your mongo "schema".
    collected = re.sub(r"collected:(\d+)", r"\1", obj["tags"][0])
    spent = re.sub(r"spent:(\d+)", r"\1", obj["tags"][2])
    return ", ".join([_id, collected, spent])

# Some sample object
parse_me = {
    "_id": "94204a81-9540-4ba8-bb93-fc5475c278dc",
    "tags": ["collected:172", "donuts_used:1", "spent:150"]
}

print parse(parse_me)
I am trying to retrieve all items in a dynamodb table using a query. Below is my code:
import boto.dynamodb2
from boto.dynamodb2.table import Table
from time import sleep

c = boto.dynamodb2.connect_to_region(
    region_name="us-west-2",
    aws_access_key_id="XXX",
    aws_secret_access_key="XXX",
)
tab = Table("rip.irc", connection=c)
x = tab.query()
for i in x:
    print i
    sleep(1)
However, I receive the following error:
ValidationException: ValidationException: 400 Bad Request
{'message': 'Conditions can be of length 1 or 2 only', '__type': 'com.amazon.coral.validate#ValidationException'}
The code I have is pretty straightforward and taken straight from the boto dynamodb2 docs, so I am not sure why I am getting the above error. Any insights would be appreciated (I'm new to this and a bit lost). Thanks!
EDIT: I have both a hash key and a range key. I am able to query by specific hash keys. For example,
x = tab.query(hash__eq="2014-01-20 05:06:29")
How can I retrieve all items though?
Ahh ok, figured it out. In case anyone needs it:
You can't use the query method on a table without specifying a specific hash key. The method to use instead is scan. So if I replace:
x = tab.query()
with
x = tab.scan()
I get all the items in my table.
I'm on Groovy, but this should drop you a hint. The error:
{'message': 'Conditions can be of length 1 or 2 only'}
is telling you that your key condition can be of length 1 (hash key only) or length 2 (hash key + range key). Anything in a query beyond the keys will provoke this error.
The reason for this error is that you are trying to run a search query using only a key condition. You have to add a separate filter condition to perform your query.
My code:
String keyQuery = " hashKey = :hashKey and rangeKey between :start and :end "
queryRequest.setKeyConditionExpression(keyQuery)  // define key query

String filterExpression = " yourParam = :yourParam "
queryRequest.setFilterExpression(filterExpression)  // define filter expression
queryRequest.setExpressionAttributeValues(expressionAttributeValues)
queryRequest.setSelect('ALL_ATTRIBUTES')

QueryResult queryResult = client.query(queryRequest)
.scan() does not automatically return all elements of a table, because responses are paginated: there is a 1 MB maximum response size (see the DynamoDB limits documentation).
Here is a boto3 scan that follows the pagination until all items are fetched:
import boto3

dynamo = boto3.resource('dynamodb')

def scanRecursive(tableName, **kwargs):
    """
    NOTE: Any time you are filtering by a specific equivalency attribute such as
    id, name or date equal to ..., you should consider using a query, not a scan.

    kwargs are any parameters you want to pass to the scan operation.
    """
    dbTable = dynamo.Table(tableName)
    response = dbTable.scan(**kwargs)
    if kwargs.get('Select') == "COUNT":
        return response.get('Count')
    data = response.get('Items')
    # keep scanning until DynamoDB stops returning a pagination token
    while 'LastEvaluatedKey' in response:
        response = dbTable.scan(ExclusiveStartKey=response['LastEvaluatedKey'], **kwargs)
        data.extend(response['Items'])
    return data
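A possible usage sketch (the table name is borrowed from the question above, not from the original answer):

# Fetch every item from the table, transparently following pagination.
items = scanRecursive('rip.irc')
print(len(items))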
I ran into this error when I was misusing KeyConditionExpression instead of FilterExpression when querying a DynamoDB table.
KeyConditionExpression should only be used with partition key or sort key values. FilterExpression should be used when you want to filter your results even further.
Do note, however, that using a FilterExpression consumes the same reads as it would without one, because the query is performed based on the KeyConditionExpression; items are then removed from the results based on your FilterExpression.
Source: Working with Queries
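For a Python/boto3 equivalent, here is a hedged sketch (the table name comes from the question above; the key and attribute names are assumptions for illustration only):

import boto3
from boto3.dynamodb.conditions import Key, Attr

table = boto3.resource('dynamodb').Table('rip.irc')  # table name from the question
response = table.query(
    # 'timestamp' as the partition key name is an assumption
    KeyConditionExpression=Key('timestamp').eq('2014-01-20 05:06:29'),
    # the filter trims already-read results; it does not reduce read cost
    FilterExpression=Attr('channel').contains('#python'),  # illustrative attribute
)
for item in response['Items']:
    print(item)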
This is how I do a query (in Ruby), if someone still needs a solution:
def method_name(a, b)
  results = self.query(
    key_condition_expression: '#T = :t',
    filter_expression: 'contains(#S, :s)',
    expression_attribute_names: {
      '#T' => 'your_table_field_name',
      '#S' => 'your_table_field_name'
    },
    expression_attribute_values: {
      ':t' => a,
      ':s' => b
    }
  )
  results
end