So I have a MongoDB instance that I start from Docker, and I put several entries in it:
(
id - int,
mod - boolean,
data - text,
)
So I want to search for N entries from my database with mod=false:
results = self.dbcol.find({'mod': False}).limit(3)
So my results return 3 elements, and for each entry in the result I want to change the mod field to true.
So this is what I have tried:
for entry in results:
    entry['mod'] = True
And this did not change my field.
You can use update_one:
update = { "$set": { 'mod': True } }
for entry in results:
    self.dbcol.update_one({ '_id': entry['_id'] }, update)
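If you already have the matched documents in hand, another option (a sketch, not part of the original answer) is to collect their _ids and issue a single update_many instead of one update per document:
ids = [entry['_id'] for entry in results]
self.dbcol.update_many({'_id': {'$in': ids}}, {'$set': {'mod': True}})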
Iterating over a cursor is not a performance-efficient approach if N is a big number.
An efficient way would be to take advantage of pymongo's find_one_and_update method and call it N times, thereby keeping resource usage low when N is large.
from pymongo.collection import ReturnDocument

for _ in range(N):
    returned_document = self.dbcol.find_one_and_update(
        filter={'mod': False},
        update={"$set": {'mod': True}},
        return_document=ReturnDocument.AFTER
    )
    # return_document=ReturnDocument.AFTER  --> returns the modified document after updating it
    # return_document=ReturnDocument.BEFORE --> returns the original document before updating it
First, some background.
I have a function in Python which consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to some metadata associated with that ID).
For example, suppose we pass the IDs {0001, 0002, 0003} to the function, and it returns the following arrays for each ID:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures the data as follows:
{
"_id":45,
"list":[0001,0002,0003]
},
{
"_id":70,
"list":[0001]
},
{
"_id":20,
"list":[0001,0002]
},
{
"_id":10,
"list":[0002,0003]
},
{
"_id":30,
"list":[0002]
}
As can be seen, I want my collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve all IDs mapped to a particular metadata value with a single request to the collection.
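For instance, a single lookup like the following (a sketch, assuming the collection is self.SegmentDB as in the code below) would return every ID associated with metadata value 45:
doc = self.SegmentDB.find_one({"_id": 45})
ids_with_45 = doc["list"] if doc else []  # e.g. [1, 2, 3]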
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list which contains all the metadata (integers) associated with a given ID (e.g. [45,70,20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in the list. It implements the collection I want: it updates the document whose "_id" is a given metadata value and adds to its list the ID from which that metadata originated (and if such a document doesn't exist yet, it inserts it; that is what upsert=True is for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers which can range from 800 to 23,000,000), and I have around 40000 IDs to analyze. As a result, the collection grows quickly. At the moment, I have around 3.2M documents in the collection (one dedicated to each individual metadata integer). I would like to implement a faster solution; if possible, I would like to insert all the metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (as it seemed the natural approach to the problem), specifying a filter which, to my understanding, states "any document whose _id is in metadataVector". In this way, every matching document would add the originating ID to its list (or the document would be created if it didn't exist yet, thanks to the upsert condition). Instead, the collection ends up being filled with documents containing a single element in the list and an ObjectId() _id.
Picture showing the final result.
Is there a way to implement what I want? Should I restructure the DB differently altogether?
Thanks a lot in advance!
Here is an example that uses Bulk Write operations. A bulk operation submits multiple inserts, updates, and deletes (possibly a combination) as a single call to the database and returns a single result. This is more efficient than many individual calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; after some processing... returns a dictionary mapping metadata -> [ids]
    return { 10: [ 3 ], 45: [ 3 ] }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; after some processing... returns a dictionary mapping metadata -> [ids]
    return { 10: [ 3 ], 45: [ 3, 1 ], 20: [ 1 ], 70: [ 1 ] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)

requests = []
for k, v in data.items():
    op = UpdateOne({ '_id': k }, { '$push': { 'list': { '$each': v } } }, upsert=True)
    requests.append(op)

result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False: with unordered writes the server can process the operations in parallel, which again gives better performance.
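Applied to the original add_entries method, the whole metadataVector can then go to the database in one call (a sketch under the same assumptions and field names as the question's code):
from pymongo import UpdateOne

def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    # one UpdateOne per metadata value, all sent in a single bulk_write call
    requests = [
        UpdateOne({"_id": data}, {"$addToSet": {"list": id}}, upsert=True)
        for data in metadataVector
    ]
    if requests:
        self.SegmentDB.bulk_write(requests, ordered=False)
    return time.time() - start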
References:
collection.bulk_write
I have documents in a collection, and each document is (say) like this:
doc: {
'dflag':0
'name':
'address':
}
While iterating over them, that is:
query = {"dflag": 0}
for doc in mydb["mycol"].find(query):
    # -do-something-
    ...
    # want to change 'dflag' of this particular doc, from 0 to 1
    newvalue = { "$set": { "dflag": 1 } }
    doc.update(newvalue)
I want to update 'dflag' of each document one by one, but doing it as above is not working.
How can I update a specific field of a document, one by one, while iterating over all the documents?
You can do it with the collection's update methods. The doc yielded by the cursor is just a Python dict, so doc.update(...) only modifies the local copy and never touches the database. Inside the loop, call update_one on the collection and match the current document by its _id:
mydb["mycol"].update_one({"_id": doc["_id"]}, newvalue)
update_one modifies only the first document that matches its filter, so filtering by _id updates exactly the document from the current iteration.
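Putting it together, a minimal sketch of the loop (assuming the same mydb["mycol"] collection as in the question):
query = {"dflag": 0}
newvalue = {"$set": {"dflag": 1}}
for doc in mydb["mycol"].find(query):
    # ...do something with doc...
    # then flip dflag from 0 to 1 for this particular document
    mydb["mycol"].update_one({"_id": doc["_id"]}, newvalue)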
If I have an Algolia index containing documents that look like these:
{"object_id":1, "color":"red", "shape":"circle"}
{"object_id":2, "color":"blue", "shape":"triangle"}
{"object_id":3, "color":"green", "shape":"square"}
{"object_id":4, "color":null, "shape":"hexagon"}
{"object_id":5, "shape":"hexagon"}
...
Using the Python API for Algolia, how can I search the index to get objects like 4 and 5 back, since they are both missing the "color" attribute? I've been digging through (https://www.algolia.com/doc/api-client/python/search#search-in-an-index) but I cannot find the answer.
I've tried this snippet but no luck:
from algoliasearch import algoliasearch
client = algoliasearch.Client("YourApplicationID", 'YourAPIKey')
index = client.init_index("colorful_shapes")
res = index.search("null")
res1 = index.search("color=null")
res2 = index.search("color:null")
res3 = index.search("!color")
print(res, res1, res2, res3)
Unfortunately, searching for all objects with a missing key is not possible in Algolia (and, by the way, it is pretty complex for schema-less NoSQL engines in general).
A simple work-around is to push, at indexing time, a tag into each record specifying whether the attribute is set or not:
{
"objectID": 1,
"myattr": "I'm set",
"_tags": ["myattr_set"]
}
and
{
"objectID": 2,
"_tags": ["myattr_unset"]
}
At query time, you would filter the searches with the tag:
index.search('your query', {'filters': '_tags:myattr_unset'})
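At indexing time that could look like the following (a sketch using the old algoliasearch client from the question; the tag names here are just examples):
records = [
    {"objectID": 1, "color": "red", "shape": "circle", "_tags": ["color_set"]},
    {"objectID": 5, "shape": "hexagon", "_tags": ["color_unset"]},
]
index.save_objects(records)
# later, fetch only the records where color was never set
res = index.search("", {"filters": "_tags:color_unset"})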
I'm sure there are more elegant solutions, but this seems to work for what you provided (also I assumed your null was an intended None):
a = [{"object_id":1, "color":"red", "shape":"circle"}, {"object_id":2, "color":"blue", "shape":"triangle"}, {"object_id":3, "color":"green", "shape":"square"}, {"object_id":4, "color":None, "shape":"hexagon"}, {"object_id":5, "shape":"hexagon"}]
list(a) #since dict has no set order
for i in a:
try:
if (i['color'] is None):
print(a.index(i)) #prints 3
except KeyError:
print(a.index(i)) #prints 4
I know you expect 4 and 5 printed, but indexes start counting at 0, this can easily be changed just by adding 1 to each print statement.
I've already created a collection in MongoDB with several documents, and I want to insert into those documents a list of integers. I've found the update function. My pymongo code is the following:
for item in content:
    id = int(item.replace('\n', ''))
    ids = follower_list(id)
    collection.update({'_id': id}, {'list_followers': ids})
follower_list is a function that returns a list of ids. update seems to replace the document with a new one containing only the two fields _id and list_followers (the initial document contains more fields). I don't want to replace docs, I just want to add a new field to the old one. How can I do such a thing?
The MongoDB example here:
db.books.update(
    { item: "Divine Comedy" },
    {
        $set: { price: 18 },
        $inc: { stock: 5 }
    }
)
OK, I found the solution:
collection.update({"_id": id}, {"$set": {"list_followers": ids}})
is what I want here.
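For completeness, the original loop with that fix applied (a sketch; update_one is the non-deprecated equivalent of update in newer pymongo versions):
for item in content:
    id = int(item.replace('\n', ''))
    ids = follower_list(id)
    # $set adds/overwrites only list_followers, leaving the rest of the document intact
    collection.update_one({'_id': id}, {'$set': {'list_followers': ids}})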
I am attempting to create a search in pymongo using REGEX. After the match, I want the data to be appended to a list in the module. I thought that I had everything set, but no matter what I set for the REGEX it returns 0 results. The code is below:
REGEX = '.*\.com'

def myModule(self, data):
    # after importing everything and setting up the collection in the DB, I call the following:
    cursor = collection.find({'multiple.layers.of.data': REGEX})
    matches = []
    for x in cursor:
        matches.append(x)
    return matches
This is just one of three modules I am using to filter through a huge number of JSON files that have been stored in MongoDB. However, no matter how I change the formatting, such as declaring /.*.com/ in the operation or using $regex in mongo, it never finds my data and appends it to the list.
EDIT: Adding in the full code along with what I am trying to identify:
RegEx = '.*\.com'  # or RegEx = re.compile('.*\.com')

def filterData(self, data):
    db = self.client[self.dbName]
    collection = db[self.collectionName]
    cursor = collection.find({'data.item11.sub.level3': {'$regex': RegEx}})
    data = []
    for x in cursor:
        data.append(x)
    return data
I am attempting to parse through JSON data in a mongodb. The data is structured like so:
"data": {
"0": {
"item1": "something",
"item2": 0,
"item3": 000,
"item4": 000000000,
"item5": 000000000,
"item6": "0000",
"item7": 00,
"item8": "0000",
"item9": 00,
"item10": "useful",
"item11": {
"0000": {
"sub": {
"level": "letter",
"level1": 0000,
"level2": 0000000000,
"level3": "domain.com"
},
"more_data": "words"
}
}
}
UPDATE: After further testing it appears as though I need to include all of the layers in the search. Thus, it should look like
collection.find({'data.0.item11.0000.sub.level3': {'$regex': RegEx}}).
However, the "0" can be 1 - 50 and the "0000" is randomly generated. Is there a way to set these to index's as variables so that it will step into it no matter what the value? It will always be a number value.
Well, you need to tell mongodb the string should be treated as a regular expression, using the $regex operator:
cursor = collection.find({'multiple.layers.of.data' : {'$regex': REGEX}})
I think simply replacing REGEX = '.*\.com' with import re; REGEX = re.compile('.*\.com') might also work, but I'm not sure (would rely on a specific handling in the pymongo driver).
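For reference, a minimal sketch of both variants (pymongo does accept a compiled pattern and encodes it as a BSON regex):
import re

REGEX = r'.*\.com'
# explicit $regex operator with a string pattern
cursor = collection.find({'multiple.layers.of.data': {'$regex': REGEX}})
# or pass a compiled pattern directly
cursor = collection.find({'multiple.layers.of.data': re.compile(REGEX)})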
EDIT:
Regarding the wildcard part of the question: The answer is no.
In a nutshell, values that are unknown should never be assigned as keys, because it makes querying very inefficient. There are no 'wild card' queries. It is better to restructure the database so that unknown values are not keys.
See:
MongoDB wildcard in the key of a query
http://groups.google.com/group/mongodb-user/browse_thread/thread/32b00d38d50bd858
https://groups.google.com/forum/#!topic/mongodb-user/TnAQMe-5ZGs
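As an illustration of that restructuring (the field names here are hypothetical), the unknown keys can become values inside arrays, so a single dot-notation query reaches them without wildcards:
# instead of data.<0..50>.item11.<random>.sub.level3, store the unknown keys as values
doc = {
    "data": [
        {
            "index": 0,             # was the unknown outer key "0"
            "item11": [
                {
                    "key": "0000",  # was the randomly generated inner key
                    "sub": {"level3": "domain.com"},
                    "more_data": "words"
                }
            ]
        }
    ]
}
# dot notation descends into arrays automatically, so one query matches any element:
cursor = collection.find({'data.item11.sub.level3': {'$regex': RegEx}})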