Never used PyMongo so I'm new to this stuff. I want to be able to save one of my lists to MongoDB. For example, I have a list imageIds = ["zw8SeIUW", "f28BYZ"], which is appended to frequently. After each append, the list imageIds should be saved to the database.
import pymongo
from pymongo import MongoClient
db = client.databaseForImages
and then later
imageIds.append(data)
db.databaseForImages.save(imageIds)
Why doesn't this work? What is the solution?
First, if you don't know what a Python dict is, I recommend brushing up on Python fundamentals. Check out Google's Python Class or Learn Python the Hard Way. Otherwise, you will be back here every 10 minutes with a new question...
Now, you have to connect to the MongoDB server/instance:
client = MongoClient('hostname', port_number)
Connect to a database:
db = client.imagedb
Then save the record to the collection "image_data".
record = {'image_ids': imageIds}
db.image_data.save(record)
Using save(), the record dict is updated with an '_id' field, which now points to the record in this collection. To save it again after appending to imageIds:
record['image_ids'] = imageIds # Already contains the original _id
db.image_data.save(record)
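Note that save() is part of the legacy PyMongo API and was removed in PyMongo 4.x. A sketch of the same flow with the newer methods, reusing record, imageIds, and data from above:

# Rewrite the stored list with the current in-memory one:
db.image_data.update_one(
    {'_id': record['_id']},
    {'$set': {'image_ids': imageIds}},
)

# Or push just the newly appended id instead of rewriting the whole list:
db.image_data.update_one(
    {'_id': record['_id']},
    {'$push': {'image_ids': data}},
)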
I am using a MongoDB on Atlas. I am querying the database from a python script using the pyMongo library.
Version: Python: 3.8.2, MongoDB: 4.4.8
I have a curious problem: when I query the database and sort by a given field, I get back duplicate entries in the returned list, even though when I query the database through the web frontend (https://cloud.mongodb.com/) I do not find duplicate records in the database itself.
An extract of my python code is as follows:
client = MongoClient("mongodb+srv://user:passwd@cluster0.ubw9u.mongodb.net/annabot?retryWrites=true&w=majority")
annadb=client.annabot
products_collection=annadb.products
...
query={"$and":[{"prio_brand":True},{'sizes_available_scan_time':{"$exists": False }}]}
scan_item_list=products_collection.find(query).sort("value_score",pymongo.DESCENDING)
for a in range(10):
item=scan_item_list[a]
print(item['url'])
The output of this code snippet is as follows:
2050947061/articles/ZZO19EP23-Q00
1221383231/articles/ZZO17RQ08-K00
2116916381/articles/HU121I0D5-K11
2116916381/articles/HU121I0D5-K11
1487759261/articles/GU121B05H-K11
999470601/articles/ZZO17RQ35-A00
2050947061/articles/ZZO19EP23-M00
999470601/articles/HU121D11P-Q11
999470601/articles/HU121D11P-Q11
1915794391/articles/GU121E0DQ-J11
Note that the 3rd and 4th items are identical, as are the 8th and 9th items.
But a query of the MongoDB for either of the specific URLs returns a single record in each case.
Any ideas why this might be the case? Why would a query from pyMongo return a list that has duplicate entries when they are actually not duplicated in the database?
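One thing worth ruling out (an assumption on my part, not something established by the question): if documents are modified while the cursor is being iterated, a document whose value_score changes can be returned more than once. A common mitigation is sorting with a unique tiebreaker such as _id:

scan_item_list = products_collection.find(query).sort(
    [("value_score", pymongo.DESCENDING), ("_id", pymongo.ASCENDING)]
)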
I'm new to both MongoDB and pyMongo, and am having some performance issues regarding cursors.
TL;DR: Any operation I try to perform using a cursor takes about a second.
Long version
I have a small database, which I bulk-loaded. Each entry has 3 fields (a sample document is sketched below):
dom: domain name (unique)
date: date, YYYYMMDD
flag: string
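For illustration, a representative document might look like this (values invented):

{"dom": "example.com", "date": "20230115", "flag": "ok"}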
I've loaded about 1.9 million entries, without incident, and quite quickly.
I created a hash index on the dom field.
Now, I want to grab certain records by the domain field, and update them, using a Python program.
That's where the problem lies.
I'm using the latest MongoDB, and the latest pyMongo.
Stripped-down program:
import pymongo
from pymongo import MongoClient

client = MongoClient()  # connect to the MongoDB instance
db = client.myindexname
posts = db.posts

print(list(db.profiles.index_information()))  # shows hash index is present

for k in newdomainlist.keys():     # iterate list of domains to check
    ret = posts.find({"dom": k})   # this runs fine, and quickly
    # 'ret' is a cursor
    print(ret)                     # this runs quickly
    # Here's the problem
    print(ret.count())             # this takes about a second. why?
If I just print(ret), the speed is fine. However, if I try to reference anything in the cursor, the speed drops to the floor: I can do about 1 operation per second. In this case, I'm just trying to see if ret.count() returns '0' (we don't have this domain) or '1' (we have it already).
I've tried adding a batch_size(10000) to the find, without it helping.
I DO have the Python C extensions loaded.
What the heck am I doing wrong?
thanks
It turned out that I'd created my hashed index on the wrong collection, 'collection', rather than 'posts'. Chalk it up to MongoDB inexperience. We can close this one now, or delete it entirely.
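For anyone hitting the same slowdown, a minimal sketch (collection and field names assumed from the question) of creating the hashed index on the collection actually being queried, plus a cheaper existence check than cursor.count():

import pymongo
from pymongo import MongoClient

client = MongoClient()
posts = client.myindexname.posts

# Hashed index on the field used in the find() filter.
posts.create_index([("dom", pymongo.HASHED)])
print(posts.index_information())  # confirm the index is on *this* collection

# For a yes/no existence test, find_one() is cheaper than counting.
exists = posts.find_one({"dom": "example.com"}) is not None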
Please excuse me if my question is naive; I am new to Python and I am trying my hand at using collections with pymongo. I have tried to extract the names using
collects = db.collection_names()  # returns a list with the names of the collections
But when I tried to get the cursor using
cursor = db.collects[1].find()  # returns a cursor with no reference to a collection
I understand that the above code uses a string instead of an object. So, I was wondering how I could retain a cursor for each collection in the DB, which I can use later to perform operations such as search and update.
If you are using the pymongo driver, you must use the get_collection method or a dict-style lookup instead. Also, you may want to set include_system_collections to False in collection_names so you don't include system collections (e.g. system.indexes).
import pymongo
client = pymongo.MongoClient()
db = client.db
collects = db.collection_names(include_system_collections=False)
cursor = db.get_collection(collects[1]).find()
or
cursor = db[collects[1]].find()
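Putting that together, a small sketch that builds and keeps one cursor per collection, keyed by name (collection_names() is from older pymongo; newer drivers replace it with list_collection_names()):

import pymongo

client = pymongo.MongoClient()
db = client.db

# One find() cursor per collection, keyed by collection name.
names = db.collection_names(include_system_collections=False)
cursors = {name: db[name].find() for name in names}

# Later: iterate whichever collection you need, e.g. the second one.
for doc in cursors[names[1]]:
    print(doc)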
Sorry, I can't create a comment yet, but have you tried:
cursor = db.getCollection(collects[1]).find()
I have a large MongoDB with raw web-scraping data. I have a process that reads the Mongo docs and creates records in my MySQL reporting DB. I need to track the documents that I have processed in MongoDB. I am trying to use the ObjectId but can't seem to convert it to a string. I am using PyMongo as my client.
for i in Coll.find({"ISBN": {"$exists": True}})[20:50]:
    print('starting collection loop')
    # Check if doc has been processed
    if not ProcessingLog.objects.filter(mongoID=i['_id']).exists():
        mongoID = ProcessingLog(mongoID=i['_id'], source='amazon', createDate=datetime.datetime.now())
        ....
I get the following error
ValueError: too many values to unpack
pymongo includes methods for converting ObjectId() to and from other types. You can see what they are in the docs here. You can probably make do with just str(o).
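For the conversion itself, a minimal round-trip sketch (the assert just demonstrates that the two forms are equivalent):

from bson.objectid import ObjectId

oid = ObjectId()                # same type as doc['_id'] from pymongo
id_str = str(oid)               # 24-character hex string, easy to store in MySQL
assert ObjectId(id_str) == oid  # converts back for later Mongo lookups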
I need help understanding what is happening here, and a suggestion for how to avoid it!
Here is my snippet:
result = [...]  # list of dictionary objects; each dict has 2 keys with string values
copyResults = list(result)

# Here I try to insert each dict into MongoDB (using PyMongo)
for item in copyResults:
    dbcollection.save(item)  # This all saves fine in MongoDB.
But when I loop through that original result list again, the dictionary objects have a new field added automatically: the ObjectId from MongoDB! Later in the code I need to transform the original result list to JSON, but this ObjectId is causing issues. I have no clue why it is getting added to the original list. I have already tried copying the list, creating a new list, etc. It still picks up the ObjectId after saving.
Please suggest!
Every document saved in MongoDB requires an '_id' field, which has to be unique among the documents in the collection. If you don't provide one, MongoDB will automatically create one as an ObjectId (bson.objectid.ObjectId in pymongo).
If you need to export documents to JSON, you have to pop the '_id' field before jsonifying them.
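For example (assuming record is one of the saved dicts):

record.pop('_id', None)  # drop the ObjectId so json.dumps(record) succeeds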
Or you could use:
rows['_id'] = str(rows['_id'])
Remember to set it back if you then need to update the document.
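As for why copying didn't help: list(result) makes a shallow copy, so both lists still reference the same dict objects, and save() mutates those dicts in place when it assigns '_id'. A deep copy avoids this; a small sketch with made-up data:

import copy

result = [{"name": "a", "value": "1"}, {"name": "b", "value": "2"}]

shallow = list(result)        # new list, but the same inner dicts
deep = copy.deepcopy(result)  # new list AND new inner dicts

shallow[0]['_id'] = 'fake-id' # visible through result[0] too
assert '_id' in result[0]
assert '_id' not in deep[0]   # the deep copy stays clean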