How to access MongoDB in-depth - python

I've got the following problem: I can't figure out how to access the "UID" value within the warnings field. I want to iterate over every value inside the document and access every existing UID, so I can check whether a randomly generated UID already exists. I'm honestly about to have a mental breakdown over this; I just can't seem to figure it out.
This is what my MongoDB structure looks like:
https://i.imgur.com/sfKGLnf.png

warnings will be a list once you've got the object in Python code - so you can just iterate over the docs in the warnings list and access their UID keys.
Code edited per the comments:
We can get all the documents in the collection by using find with an empty query dict (I've linked the docs on find below). As your document structure seems to have a random number as a key, and then an embedded doc, we collect the keys in the document that aren't _id. It may be that there's only one such key, but I didn't want to assume. If there's then a warnings key in the embedded doc, we iterate over each of the warnings and add its "UID" to our warning_uids list.
# this is your collection object
mod_logs_collection = database["modlogs"]
all_mod_docs = mod_logs_collection.find({})

warning_uids = []
for doc in all_mod_docs:
    # filter all the doc keys that aren't the default `_id`
    doc_keys = [key for key in doc.keys() if key != "_id"]
    # not sure if it's possible for you to have multiple keys in a doc
    # with this structure, but we'll iterate over the top-level doc keys
    # just in case
    for key in doc_keys:
        sub_doc = doc[key]
        if warnings := sub_doc.get("warnings"):
            for warning in warnings:
                warning_uids.append(warning["UID"])
                print(warning["UID"])
pymongo find
mongo docs on querying
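Since the goal is to check whether a randomly generated UID already exists, here is a minimal sketch of how the collected warning_uids list could be used. The uuid4-based generator is an assumption on my part; the post doesn't say how the UIDs are produced:

import uuid

def generate_unique_uid(existing_uids):
    # build a set once for O(1) membership checks
    taken = set(existing_uids)
    # hypothetical generator: draw random UIDs until one is unused
    new_uid = str(uuid.uuid4())
    while new_uid in taken:
        new_uid = str(uuid.uuid4())
    return new_uid

fresh_uid = generate_unique_uid(warning_uids)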

Related

The most efficient way to compare two dictionaries, verifying dict_2 is a complete subset of dict_1, and returning all values of dict_2 which are less?

I'm working on a data pipeline that will pull data from online and store it in MongoDB. To manage the process, I've developed two dictionaries: request_totals and mongo_totals. mongo_totals contains a key for each container in the Mongo database, along with a value for the max(id) that container contains. request_totals has a key for each category data can be pulled from, along with a value for the max(id) of that category. If MongoDB is fully updated, these two dictionaries would be identical.
I've developed code that runs, but I can't shake the feeling that I'm not really being efficient here. I hope that someone can share some tips on how to better write this:
def compare(request_totals, mongo_totals):
    outdated = dict()
    # Verifies MongoDB contains no unique collections
    if request_totals | mongo_totals != request_totals:
        raise AttributeError('mongo_totals does not appear to be a subset of request_totals')
    sharedKeys = set(request_totals.keys()).intersection(mongo_totals.keys())
    unsharedKeys = set(request_totals) - set(mongo_totals)
    # Updates outdated dict with outdated key-value pairs representing MongoDB collections
    for key in sharedKeys:
        if request_totals[key] > mongo_totals[key]:
            outdated.update({key: mongo_totals[key]})
        elif request_totals[key] < mongo_totals[key]:
            raise AttributeError(
                f'mongo_total for {key}: {mongo_totals[key]} exceeds request_totals for {key}: {request_totals[key]}')
    return outdated | dict.fromkeys(unsharedKeys, 0)

compare(request_totals, mongo_totals)
The returned dictionary has key:value pairs that may be used in the following way: query the API using the key, and offset the records by the key's value. This way, it allows me to keep the database updated. Is there a more efficient way to handle this comparison?
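No answer is included here, but one possible tightening, sketched under the assumption that both dicts simply map category names to max(id) values:

def compare(request_totals, mongo_totals):
    # any key in mongo_totals that is missing from request_totals is unexpected
    extra = mongo_totals.keys() - request_totals.keys()
    if extra:
        raise AttributeError(f'mongo_totals has keys not in request_totals: {extra}')
    outdated = {}
    for key, req_max in request_totals.items():
        mongo_max = mongo_totals.get(key, 0)  # categories not yet in Mongo start at 0
        if mongo_max > req_max:
            raise AttributeError(
                f'mongo_total for {key}: {mongo_max} exceeds request_total: {req_max}')
        if mongo_max < req_max:
            outdated[key] = mongo_max
    return outdated

This makes a single pass over request_totals instead of building a merged dict plus two key sets, and it folds the unshared keys in through the .get(key, 0) default.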

solve E11000 duplicate key error collection: _id_ dup key in pymongo

I am trying to insert a large number of documents (1M+) using a bulk_write instruction. In order to do that, I create a list of InsertOne operations.
python version = 3.7.4
pymongo version = 3.8.0
Document creation:
document = {
    'dictionary': ObjectId(dictionary_id),
    'price': price,
    'source': source,
    'promo': promo,
    'date': now_utc,
    'updatedAt': now_utc,
    'createdAt': now_utc
}
# add line to debug
if '_id' in document.keys():
    print(document)
return document
I create the full list of documents by adding a new field from a list of elements, and build the operations using InsertOne:
bulk = []
for element in list_elements:
    for document in documents:
        document['new_field'] = element
        # add line to debug
        if '_id' in document.keys():
            print(document)
        insert = InsertOne(document)
        bulk.append(insert)
return bulk
I do the insert using the bulk_write command:
collection.bulk_write(bulk, ordered=False)
I attach the documentation https://api.mongodb.com/python/current/api/pymongo/collection.html#pymongo.collection.Collection.bulk_write
According to the documentation, the _id field is added automatically:
Parameter - document: The document to insert. If the document is missing an _id field one will be added.
And somehow it seems to be doing this wrong, because some of them end up with the same value.
I am receiving this error (with different _id values, of course) for 700k of the 1M documents:
'E11000 duplicate key error collection: database.collection index: _id_ dup key: { _id: ObjectId(\'5f5fccb4b6f2a4ede9f6df62\') }'
This seems like a pymongo bug to me, because I have used this approach in many situations, but never with this number of documents.
The _id field has to be unique, for sure, but since pymongo adds it automatically, I don't know how to approach this problem; perhaps using an UpdateOne with upsert=True and an impossible filter, and hoping for the best.
I would appreciate any solution or workaround for this problem.
It turned out that as I was adding the new field to the document and appending it to the list, I was appending references to the same instance, so I had the same operation len(list_elements) times; that is why I got the duplicate key error.
To solve the problem, I append a copy of the document to the list:
bulk.append(document.copy())
and then create the operations with that list.
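For clarity, a sketch of the corrected loop based on that fix (variable names as in the question):

bulk = []
for element in list_elements:
    for document in documents:
        # copy first, so each InsertOne gets its own dict and the _id
        # generated on insert is not shared between operations
        doc = document.copy()
        doc['new_field'] = element
        bulk.append(InsertOne(doc))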
I would like to thank @Belly Buster for his help with the issue.
If any of the documents from your code snippet already contain an _id, a new one won't be added, and you run the risk of getting a duplicate key error, as you have observed.

Python dictionary key count not the same as the number of rows returned by the query in MySQL

So I am trying to fetch data from MySQL into a Python dictionary.
Here is my code:
def getAllLeadsForThisYear():
    charges = {}
    cur.execute("select lead_id, extract(month from transaction_date), pid, extract(Year from transaction_date) from transaction where lead_id is not NULL and transaction_type='CHARGE' and YEAR(transaction_date)='2015'")
    for i in cur.fetchall():
        lead_id = i[0]
        month = i[1]
        pid = i[2]
        year = str(i[3])
        new = {lead_id: [month, pid, year]}
        charges.update(new)
    return charges

x = getAllLeadsForThisYear()
When I print len(x.keys()) it gives me some number, say 450.
When I run the same query in MySQL it returns 500 rows. I do have some duplicate keys in the dictionary, but shouldn't they still be counted, since I never added a check like if i not in charges.keys()? Please correct me if I am wrong.
Thanks
As I said, the problem is that you are overwriting the value at a key every time a duplicate key pops up. This can be fixed in two ways:
You can do a check before adding a new value and if the key already exists, append to the already existing list.
For example:
# change these lines
new = {lead_id: [month, pid, year]}
charges.update(new)
# to
if lead_id in charges:
    charges[lead_id].extend([month, pid, year])
else:
    charges[lead_id] = [month, pid, year]
Which gives you a structure like this:
charges = {
    '123': [month1, pid1, year1, month2, pid2, year2, ...]
}
With this approach, you can reach each separate entry by chunking the value at each key into chunks of 3 (this may be useful).
However, I don't really like this approach because it requires you to do that chunking, which brings me to approach 2.
Use defaultdict from collections, which acts in exactly the same way as a normal dict except that it creates a default value when you access a key that hasn't been set yet.
For example:
# change
charges = {}
# to
charges = defaultdict(list)

# and change
new = {lead_id: [month, pid, year]}
charges.update(new)
# to
charges[lead_id].append((month, pid, year))
which gives you a structure like this:
charges = {
    '123': [(month1, pid1, year1), (month2, pid2, year2), ...]
}
With this approach, you can now iterate through each list at each key with:
for key in charges:
    for entities in charges[key]:
        print(entities)  # would print `(month, pid, year)` for each separate entry
If you are using this approach, don't forget to from collections import defaultdict. If you don't want the import, you can mimic the behavior with:
if lead_id in charges:
    charges[lead_id].append((month, pid, year))
else:
    charges[lead_id] = [(month, pid, year)]
This is incredibly similar to the first approach, but it does the explicit "create a list if the key isn't there" step that defaultdict would do implicitly.
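Putting the defaultdict approach back into the original function, it might look like this (a sketch; cur is assumed to be an open MySQL cursor, as in the question):

from collections import defaultdict

def getAllLeadsForThisYear():
    charges = defaultdict(list)
    cur.execute("select lead_id, extract(month from transaction_date), pid, "
                "extract(Year from transaction_date) from transaction "
                "where lead_id is not NULL and transaction_type='CHARGE' "
                "and YEAR(transaction_date)='2015'")
    for lead_id, month, pid, year in cur.fetchall():
        # every row is kept, even when lead_id repeats
        charges[lead_id].append((month, pid, str(year)))
    return charges

With this version, sum(len(v) for v in charges.values()) should match the 500 rows the query returns, while len(charges) counts distinct lead_ids.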

How to query if entity exists in app engine NDB

I'm having some trouble wrapping my head around NDB. For some reason it's just not clicking. The thing I'm struggling with the most is the whole key/kind/ancestor structure.
I'm just trying to store a simple set of JSON data. When I store data, I want to check beforehand whether a duplicate entity exists (based on the key, not the data) so I don't store a duplicate.
class EarthquakeDB(ndb.Model):
    data = ndb.JsonProperty()
    datetime = ndb.DateTimeProperty(auto_now_add=True)
Then, to store data:
quake_entry = EarthquakeDB(parent=ndb.Key('Earthquakes', quake['id']), data=quake).put()
So my questions are:
How do I check whether that particular key exists before I insert more data?
How would I go about pulling that data back out to read, based on the key?
After some trial and error, and with the assistance of voscausa, here is what I came up with to solve the problem. The data is being read in via a for loop.
for quake in data:
    quake_entity = EarthquakeDB.get_by_id(quake['id'])
    if quake_entity:
        continue
    else:
        quake_entity = EarthquakeDB(id=quake['id'], data=quake).put()
Because you do not provide a full NDB key (only a parent), you will always insert a unique key. But why do you use your own entity id for the parent?
I think you mean:
quake_entry = EarthquakeDB(id=quake['id'], data=quake)
quake_entry.put()
To get it, you can use:
quake_entry = ndb.Key('Earthquakes', quake['id']).get()
Here you can find two excellent videos about the datastore, strong consistency, and entity groups: Datastore Introduction and Datastore Query, Index and Transaction.
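As a side note, NDB also provides a get_or_insert classmethod that performs this check-and-create in a single transactional call; a minimal sketch using the model above:

# returns the existing entity for that id, or creates one with the
# given properties if none exists yet (runs in a transaction)
quake_entity = EarthquakeDB.get_or_insert(quake['id'], data=quake)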

Python Lists and MongoDB insert

Need help in understanding what is happening here and a suggestion to avoid this!
Here is my snippet:
results = [...]  # list of dictionary objects (each dict has 2 keys with 2 string values)
copyResults = list(results)
# Here I try to insert each dict into MongoDB (using PyMongo)
for item in copyResults:
    dbcollection.save(item)  # This is all saving fine in MongoDB.
But when I loop through the original result list again, it shows the dictionary objects with a new field added automatically, which is the ObjectId from MongoDB!
Later in the code I need to transform the original result list to JSON, but this ObjectId is causing issues. I have no clue why it is getting added to the original list.
I have already tried copying the list, creating a new list, etc. It still adds the ObjectId to the original list after saving.
Please suggest!
Every document saved in MongoDB requires an '_id' field, which has to be unique among the documents in the collection. If you don't provide one, MongoDB will automatically create one as an ObjectId (bson.objectid.ObjectId in pymongo). Note that list(results) only makes a shallow copy: the new list still holds references to the same dict objects, which is why your originals pick up the _id.
If you need to export the documents to JSON, you have to pop the '_id' field before jsonifying them.
Or you could use:
rows['_id'] = str(rows['_id'])
Remember to set it back if you then need to update the document.
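To tie both answers together, a minimal sketch: insert copies so the originals never pick up an _id, then jsonify freely. (Collection.save() is deprecated and removed in newer PyMongo releases; insert_one() is its modern equivalent.)

import json

# insert a shallow copy of each dict, so pymongo adds _id to the copy
for item in results:
    dbcollection.insert_one(dict(item))

# the dicts in results are untouched and jsonify cleanly; if they did
# carry an _id, drop or stringify it first
json_ready = [{k: v for k, v in doc.items() if k != '_id'} for doc in results]
print(json.dumps(json_ready))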
