How to do bulk insert with ordered false in mongoengine - python

I'm trying to insert documents in bulk. I have created a unique index on my collection and want to skip documents that are duplicates while doing the bulk insertion. This can be accomplished with the native MongoDB function:
db.collection.insert(
    <document or array of documents>,
    {
        ordered: <boolean>
    }
)
I want to accomplish this with MongoEngine. If anybody knows how to achieve this, please answer the question, thanks.

If you have a class like this:
class Foo(db.Document):
    bar = db.StringField()
    meta = {'indexes': [{'fields': ['bar'], 'unique': True}]}
and you have a list of Foo instances, foos = [Foo(bar='a'), Foo(bar='a'), Foo(bar='a')], then trying Foo.objects.insert(foos) will raise mongoengine.errors.NotUniqueError.
The 1st workaround would be to delete the index from MongoDB, insert the duplicates, and then re-create the index with {unique: true, dropDups: true} (note that dropDups was removed in MongoDB 3.0, so this only works on older servers).
The 2nd workaround would be to use the underlying PyMongo API for bulk operations (for skipping duplicates you want the unordered variant): https://docs.mongodb.com/manual/reference/method/db.collection.initializeOrderedBulkOp/#db.collection.initializeOrderedBulkOp

For now I am using raw PyMongo from MongoEngine as a workaround for this. This is the 2nd workaround that @Alexey Smirnov mentioned. So for a MongoEngine Document class DocClass, you access the underlying PyMongo collection and execute the query like below:
from pymongo.errors import BulkWriteError

try:
    # Convert MongoEngine documents into the raw dicts PyMongo understands
    doc_list = [doc.to_mongo() for doc in me_doc_list]
    # ordered=False keeps inserting after a duplicate-key error is hit
    DocClass._get_collection().insert_many(doc_list, ordered=False)
except BulkWriteError as bwe:
    print("Batch inserted with some errors. Maybe some duplicates were found and skipped.")
    print(f"Count is {DocClass.objects.count()}.")
except Exception as e:
    print({'error': str(e)})
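If you need to know how many documents actually went in and how many were skipped as duplicates, the exception carries that information in its details dict. A minimal sketch building on the snippet above (nInserted and writeErrors are the keys PyMongo puts in BulkWriteError.details; 11000 is MongoDB's duplicate-key error code):

try:
    DocClass._get_collection().insert_many(doc_list, ordered=False)
except BulkWriteError as bwe:
    inserted = bwe.details.get('nInserted', 0)
    # Collect only the duplicate-key failures
    dupes = [e for e in bwe.details.get('writeErrors', []) if e.get('code') == 11000]
    print(f"Inserted {inserted} docs, skipped {len(dupes)} duplicates.")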

Related

MongoDB and Pymongo, query FULLTEXT in all collections

I have a local MongoDB database with multiple collections.
I use PyMongo in a Jupyter notebook. What I would like to do is run a FULLTEXT query looking for the data in all the collections present.
Is it possible to do this? If so, how could I proceed?
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
client.list_database_names()
out: ['admin', 'config', 'local']
In local I have several collections.
This is what I do with just one collection:
db = client["local"]
firstdb = db["firstdb"]
result = db.firstdb.find({"email": {"$regex": "test", "$options": "i"}})
for item in result:
    print(item['email'], item['log'])
In essence, I would like to perform the email query on secondb, thirdb, fourthdb, etc. as well.
Can no one help me? Basically I have to do a FULLTEXT query on all collections.
The only solution I found is this:
result = db.firstdb.find({"email": {"$regex": "test", "$options": "i"}})
result1 = db.secondb.find({"email": {"$regex": "test", "$options": "i"}})
result2 = db.thirdb.find({"email": {"$regex": "test", "$options": "i"}})
for item in result:
    print(item['email'], item['log'])
for item in result1:
    print(item['email'], item['date'])
for item in result2:
    print(item['email'], item['account'])
But I'm not sure I'm on the right track!
I thank anyone who can help me!
PS: I would prefer not to change the structure of the collections; the same problem could occur in other databases.
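A sketch of how the repeated per-collection queries above could be collapsed into a single loop over list_collection_names(), assuming every collection has an email field; the other fields vary per collection, so this just prints each matching document whole:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['local']
query = {"email": {"$regex": "test", "$options": "i"}}
for name in db.list_collection_names():
    # Run the same regex filter against every collection in the database
    for item in db[name].find(query):
        print(name, item)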

PyMongo - Setting all values in an attribute to lowercase [duplicate]

This question already has answers here:
Update MongoDB field using value of another field
(12 answers)
Closed 5 years ago.
I am cleaning a dataset, and have a field gender. In this field, there are entries such as Male, male, and MALE. To resolve this, I am trying to update my MongoDB database using pymongo.
In the database, the gender attribute is Gender (with a capital G at the front).
My code currently looks like this:
import pymongo
from pymongo import MongoClient
db_info = {
    'db_name': 'MentalHealth',
    'collection_name': 'MentalHealth',
}

if __name__ == "__main__":
    mongo_client = MongoClient()
    mongo_db = mongo_client[db_info['db_name']]
    mongo_collection = mongo_db[db_info['collection_name']]
    # normalize to lowercase
    mongo_collection.aggregate([{'$project': {'Gender': {'$toLower': '$Gender'}}}])
The code runs without issue, but the database is not updating, and I am unsure what the error in the code is. Any help would be greatly appreciated. Thank you!!!
MongoDB aggregation operations process data records and return computed results; they can't update a collection by themselves. You can do the update like this in the mongo shell:
db.mongo_collection.find({}).forEach(function(doc) {
    db.mongo_collection.update(
        { "_id": doc._id },
        // toLowerCase(), since the goal is to normalize to lowercase
        { "$set": { "Gender": doc.Gender.toLowerCase() } }
    );
});
You are using an aggregate query, which will return the result with all Gender fields cast to lower case, but it does not modify the stored documents. If you wish to update the value of a field you have to use an update query.
Since you are using PyMongo to query your documents, your code should look like this:
import pymongo
from pymongo import MongoClient
from bson.objectid import ObjectId

db_info = {
    'db_name': 'MentalHealth',
    'collection_name': 'MentalHealth'
}

if __name__ == "__main__":
    mongo_client = MongoClient()
    mongo_db = mongo_client[db_info['db_name']]
    mongo_collection = mongo_db[db_info['collection_name']]
    for doc in mongo_collection.find(no_cursor_timeout=True):
        pk = ObjectId(str(doc.get("_id")))
        g = doc.get('Gender')
        if g:
            g = g.lower()
            # update() was removed in newer PyMongo; update_one does the same here
            mongo_collection.update_one({"_id": pk}, {"$set": {"Gender": g}})
The aggregation framework you're using only performs reads. To actually perform writes, you need to add a $out stage to dump the results into a collection.
If you select an existing collection, that collection is replaced atomically as described in https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out
Another option is to use an update operation to update just the documents with incorrect case.
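On MongoDB 4.2 and later there is a third option: update_many accepts an aggregation pipeline as the update document, so the lower-casing can happen server-side in one call. A minimal sketch, assuming the same mongo_collection as above:

# Pipeline-style update (requires MongoDB 4.2+): lower-case Gender in place
mongo_collection.update_many(
    {'Gender': {'$exists': True}},
    [{'$set': {'Gender': {'$toLower': '$Gender'}}}]
)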

AWS DynamoDB Python - boto3 Key() methods not recognized (Query)

I am using Lambda (Python) to query my DynamoDB database. I am using the boto3 library, and I was able to make an "equivalent" query:
This script works:
import boto3
from boto3.dynamodb.conditions import Key, Attr
import json

def create_list(event, context):
    resource = boto3.resource('dynamodb')
    table = resource.Table('Table_Name')
    response = table.query(
        TableName='Table_Name',
        IndexName='Custom-Index-Name',
        KeyConditionExpression=Key('Number_Attribute').eq(0)
    )
    return response
However, when I change the query expression to this:
KeyConditionExpression=Key('Number_Attribute').gt(0)
I get the error:
"errorType": "ClientError",
"errorMessage": "An error occurred (ValidationException) when calling the Query operation: Query key condition not supported"
According to this [1] resource, "gt" is a method of Key(). Does anyone know if this library has been updated, or what other methods are available other than "eq"?
[1] http://boto3.readthedocs.io/en/latest/reference/customizations/dynamodb.html#ref-dynamodb-conditions
---------EDIT----------
I also just tried the old method using:
response = client.query(
    TableName='Table_Name',
    IndexName='Custom_Index',
    KeyConditions={
        'Custom_Number_Attribute': {
            'ComparisonOperator': 'EQ',
            'AttributeValueList': [{'N': '0'}]
        }
    }
)
This worked, but when I try:
response = client.query(
    TableName='Table_Name',
    IndexName='Custom_Index',
    KeyConditions={
        'Custom_Number_Attribute': {
            'ComparisonOperator': 'GT',
            'AttributeValueList': [{'N': '0'}]
        }
    }
)
...it does not work.
Why would EQ be the only method working in these cases? I'm not sure what I'm missing in the documentation.
From what I can tell:
Your partition key is Number_Attribute, so you cannot do a gt when doing a query (you can do an eq and that is it).
You can do a gt or between on your sort key when doing a query. It is also called the range key, and because items with the same partition key are stored next to each other in sorted order, it can serve gt and between efficiently in a query.
Now, if you want to do a gt (or between) on your partition key, then you will have to use scan, like below:
fe = Key('Number_Attribute').gt(0)
response = table.scan(
    FilterExpression=fe
)
Keep in mind the following concerning scan:
The scan method reads every item in the entire table, and returns all of the data in the table. You can provide an optional filter_expression, so that only the items matching your criteria are returned. However, note that the filter is only applied after the entire table has been scanned.
So in other words, it's a bit of a costly operation compared to query. You can see an example in the documentation.
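Also keep in mind that a single scan (or query) call returns at most 1 MB of results; the rest comes back in pages. A minimal sketch of draining all pages, assuming the same table and filter as above:

from boto3.dynamodb.conditions import Key

fe = Key('Number_Attribute').gt(0)
items = []
response = table.scan(FilterExpression=fe)
items.extend(response['Items'])
# LastEvaluatedKey is present while there are more pages to fetch
while 'LastEvaluatedKey' in response:
    response = table.scan(FilterExpression=fe,
                          ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])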
Hope that helps!

How do I query AWS DynamoDB in python?

I'm fairly new to NoSQL and using AWS DynamoDB. I'm calling it from AWS Lambda using python 2.7
I'm trying to retrieve a value from an order_number field.
This is what my table looks like (I only have one record):
primary partition key: subscription_id
and my secondary global index: order_number
Is my setup correct?
If so given the order_number how do I retrieve the record using python?
I can't figure out the syntax to do it.
I've tried
response = table.get_item(Key={'order_number': myordernumber})
But i get:
An error occurred (ValidationException) when calling the GetItem operation: The provided key element does not match the schema: ClientError
DynamoDB does not automatically index all of the fields of your object. By default you can define a hash key (subscription_id in your case) and, optionally, a range key and those will be indexed. So, you could do this:
response = table.get_item(Key={'subscription_id': mysubid})
and it will work as expected. However, if you want to retrieve an item based on order_number you would have to use a scan operation which looks through all items in your table to find the one(s) with the correct value. This is a very expensive operation. Or you could create a Global Secondary Index in your table that uses order_number as the primary key. If you did that and called the new index order_number-index you could then query for objects that match a specific order number like this:
from boto3.dynamodb.conditions import Key, Attr

response = table.query(
    IndexName='order_number-index',
    KeyConditionExpression=Key('order_number').eq(myordernumber)
)
DynamoDB is a very fast, scalable, and efficient database, but it does require a lot of thought about which fields you might want to search on and how to do that efficiently.
The good news is that now you can add GSI's to an existing table. Previously you would have had to delete your table and start all over again.
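For reference, a GSI can be added to an existing table through the low-level client's update_table call. A hedged sketch (the table and attribute names follow the example above; for a provisioned-capacity table the 'Create' block would also need a ProvisionedThroughput entry):

import boto3

client = boto3.client('dynamodb')
client.update_table(
    TableName='recurring_charges',
    # Any attribute used in the new index's key schema must be declared here
    AttributeDefinitions=[{'AttributeName': 'order_number', 'AttributeType': 'S'}],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'order_number-index',
            'KeySchema': [{'AttributeName': 'order_number', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'},
        }
    }]
)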
Make sure you've imported this:
from boto3.dynamodb.conditions import Key, Attr
If you don't have it, you'll get the error for sure. It's in the documentation examples.
Thanks @altoids for the comment above, as this is the correct answer for me. I wanted to bring attention to it with a "formal" answer.
To query DynamoDB using an index with a filter:
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamodb = boto3.resource('dynamodb', region_name=region)
table = dynamodb.Table('<TableName>')
response = table.query(
    IndexName='<Index>',
    KeyConditionExpression=Key('<key1>').eq('<value>') & Key('<key2>').eq('<value>'),
    FilterExpression=Attr('<attr>').eq('<value>')
)
print(response['Items'])
If a filter is not required, then don't use FilterExpression in the query.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb', region_name=region_name)
table = dynamodb.Table(tableName)

def queryDynamo(pk, sk):
    response = table.query(
        ProjectionExpression="#pk, #sk, keyA, keyB",
        ExpressionAttributeNames={"#pk": "pk", "#sk": "sk"},
        KeyConditionExpression=Key('pk').eq(pk) & Key('sk').eq(sk)
    )
    return response['Items']
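Hypothetical usage of the helper above (the key values are placeholders; substitute whatever your pk and sk attributes actually hold):

# Example call with made-up key values
for item in queryDynamo('user#123', 'order#2021-06-01'):
    print(item)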
If you use the boto3 dynamodb client, you can do the following (again you would need to use subscription_id as that is the primary key):
dynamodb = boto3.client('dynamodb')

response = dynamodb.query(
    TableName='recurring_charges',
    KeyConditionExpression="subscription_id = :subscription_id",
    ExpressionAttributeValues={":subscription_id": {"S": "id"}}
)
So far, this is the cleanest way I've discovered; the query is in JSON format.
dynamodb_client = boto3.client('dynamodb')

def query_items():
    arguments = {
        "TableName": "your_dynamodb_table",
        "IndexName": "order_number-index",
        "KeyConditionExpression": "order_number = :V1",
        "ExpressionAttributeValues": {":V1": {"S": "value"}},
    }
    return dynamodb_client.query(**arguments)

How to delete a MongoDB collection in PyMongo

How do I check in PyMongo whether a collection exists and, if it exists, empty it (remove all documents from the collection)?
I have tried
collection.remove()
or
collection.remove({})
but it doesn't delete the collection. How do I do that?
Sample code in PyMongo, with comments as explanation:
from pymongo import MongoClient

connection = MongoClient('localhost', 27017)  # Connect to mongodb
print(connection.list_database_names())       # Return a list of dbs, equal to: > show dbs
db = connection['testdb1']                    # equal to: > use testdb1
print(db.list_collection_names())             # Return a list of collections in 'testdb1'
print("posts" in db.list_collection_names())  # Check if collection "posts" exists in db (testdb1)
collection = db['posts']
print(collection.count_documents({}) == 0)    # Check if collection named 'posts' is empty
collection.drop()                             # Delete (drop) collection named 'posts' from db, and all documents it contains
You should use .drop() instead of .remove(), see documentation for detail: http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.drop
=====
Sorry for misunderstanding your question.
To check if a collection exists, use the list_collection_names method on the database:
>>> collection_name in database.list_collection_names()
To check if a collection is empty, use:
>>> collection.count_documents({}) == 0
Both will return True or False as the result.
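Putting the two checks together, a small sketch of a helper that drops a collection only when it exists (using current PyMongo method names):

def drop_if_exists(db, name):
    # Dropping is idempotent, but the membership test answers the original question
    if name in db.list_collection_names():
        db[name].drop()

drop_if_exists(db, 'posts')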
Have you tried this:
db.collection.remove();
