PyMongo - Setting all values in an attribute to lowercase [duplicate]

This question already has answers here:
Update MongoDB field using value of another field
(12 answers)
Closed 5 years ago.
I am cleaning a dataset that has a gender field containing entries such as Male, male, and MALE. To normalize these values, I am trying to update my MongoDB database using pymongo.
In the database, the attribute is named Gender (with a capital G).
My code currently looks like this:
import pymongo
from pymongo import MongoClient
db_info = {
    'db_name': 'MentalHealth',
    'collection_name': 'MentalHealth',
}
if __name__ == "__main__":
    mongo_client = MongoClient()
    mongo_db = mongo_client[db_info['db_name']]
    mongo_collection = mongo_db[db_info['collection_name']]
    # normalize to lowercase
    mongo_collection.aggregate([{'$project': {'Gender': {'$toLower': "$Gender"}}}])
The code runs without errors, but the database is not updated, and I am unsure what is wrong with the code. Any help would be greatly appreciated. Thank you!

MongoDB aggregation operations process data records and return computed results; on their own they do not modify the collection. You can update the documents from the mongo shell like this:
db.mongo_collection.find({}).forEach(function(doc) {
    db.mongo_collection.update(
        { "_id": doc._id },
        { "$set": { "Gender": doc.Gender.toLowerCase() } }
    );
});

You are using an aggregate query, which only returns the results with all Gender values cast to lowercase. To change the stored value of a field, you have to use an update query.
Since you are using pymongo to query your documents, your code should look like this:
import pymongo
from pymongo import MongoClient
db_info = {
    'db_name': 'MentalHealth',
    'collection_name': 'MentalHealth'
}
if __name__ == "__main__":
    mongo_client = MongoClient()
    mongo_db = mongo_client[db_info['db_name']]
    mongo_collection = mongo_db[db_info['collection_name']]
    # Iterate over every document and rewrite Gender in lowercase.
    for doc in mongo_collection.find(no_cursor_timeout=True):
        g = doc.get('Gender')
        if g:
            # doc["_id"] already has the right type, so it can be used directly.
            # update_one replaces Collection.update, which was removed in PyMongo 4.
            mongo_collection.update_one({"_id": doc["_id"]}, {"$set": {"Gender": g.lower()}})

The aggregation framework you're using only reads data. To write the results, you need to add a $out stage that dumps them into a collection.
If you select an existing collection, that collection is replaced atomically as described in https://docs.mongodb.com/manual/reference/operator/aggregation/out/#pipe._S_out
Note that a $project stage listing only Gender keeps just _id and Gender, so the other fields would be missing from the replaced collection.
Another option is to use an update operation to update just the documents with incorrect case.
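The update-based route can be sketched roughly as follows. This is a minimal sketch, assuming MongoDB 4.2 or newer (where an update can take an aggregation pipeline) and a PyMongo version that accepts a list as the update argument (3.9+). The helper names lowercase_pipeline and lowercase_field are hypothetical and not part of PyMongo.

```python
def lowercase_pipeline(field):
    """Build an update pipeline that rewrites `field` in lowercase."""
    # Inside a pipeline, "$set" can read the document's current value
    # through the "$<field>" syntax, which a classic update document cannot.
    return [{"$set": {field: {"$toLower": "$" + field}}}]


def lowercase_field(collection, field):
    """Lowercase `field` server-side and return the number of modified documents.

    `collection` is expected to behave like a pymongo Collection.
    """
    # Only touch documents where the field actually holds a string.
    only_strings = {field: {"$type": "string"}}
    result = collection.update_many(only_strings, lowercase_pipeline(field))
    return result.modified_count


# Hypothetical usage against the collection from the question:
# from pymongo import MongoClient
# coll = MongoClient()["MentalHealth"]["MentalHealth"]
# print(lowercase_field(coll, "Gender"))
```

Compared with looping over find(), this sends a single command and lets the server do the rewriting, so no documents travel to the client.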

Related

Problem querying AWS Athena from Lambda introducing a variable

I need help with a small problem in my AWS Lambda function, which queries my AWS Athena database.
The code looks like this:
import json
import boto3
import time
def lambda_handler(event, context):
    client = boto3.client('athena')
    QueryResponse = client.start_query_execution(
        QueryString = "MY QUERY;",
        QueryExecutionContext = {
            'Database' : 'myDatabase'
        },
        ResultConfiguration = {
            'OutputLocation' : 's3://mys3Bucket'
        }
    )
    # Observe results:
    queryId = QueryResponse['QueryExecutionId']
The code works great, but I am having some trouble with the "WHERE" part of my SQL query (which is a long one).
Here is that part of my query:
WHERE x.id_date > cast(date_format(date_trunc('day', current_timestamp -
interval '3' day), '%Y%m%d') as integer)
and x.id_date <= cast(date_format(current_timestamp, '%Y%m%d') as integer)
and c.label = 'NAME'
My query is written on a single line in the Python code, in place of "MY QUERY".
The problem is:
I need to replace 'NAME' with a string variable that will be passed to my Lambda. I tried using %s for the substitution, but because the query also contains '%Y%m%d', Python expects values for those parts too, even though they are only there to format the date. With a hard-coded string in place of NAME the query works perfectly, so I know the query itself is not the problem. I also tried putting c.label = '%s' first, hoping the % operator would replace only the first %s and leave the others alone, but that didn't work.
So my question is: how can I replace 'NAME' with a str variable? Can I do this while keeping my query on a single line (and if so, how), or at least how can I split my query into several lines that I can work with?
Thanks for your help.

As said in the comments, the solution was to use an f-string:
MyString = 'my string to replace in query'
QueryString = f"SELECT * FROM {MyString};"
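Applied to the WHERE clause from the question, the f-string approach might look like the sketch below. The function name build_query and the quote-doubling step are assumptions added for illustration; they are not from the original answer.

```python
def build_query(label):
    """Return the WHERE clause from the question with `label` filled in."""
    # Assumption: doubling single quotes keeps the value inside the SQL
    # string literal; untrusted input should be validated further.
    safe_label = label.replace("'", "''")
    # Adjacent string literals are concatenated by Python, so the query can
    # be split over several lines. Only the last piece is an f-string, and
    # f-strings do not treat '%' specially, so '%Y%m%d' passes through as-is.
    return (
        "WHERE x.id_date > cast(date_format(date_trunc('day', current_timestamp - "
        "interval '3' day), '%Y%m%d') as integer) "
        "and x.id_date <= cast(date_format(current_timestamp, '%Y%m%d') as integer) "
        f"and c.label = '{safe_label}'"
    )


print(build_query("NAME"))
```

If %-formatting is preferred instead, each literal percent sign has to be written as %%, for example '%%Y%%m%%d', so that only %s is substituted.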

How to do bulk insert with ordered false in mongoengine

I'm trying to insert documents in bulk. I have created a unique index on my collection and want to skip duplicate documents during the bulk insertion. This can be accomplished with the native MongoDB function:
db.collection.insert(
    <document or array of documents>,
    {
        ordered: <boolean>
    }
)
I want to accomplish this with mongoengine. If anybody knows how to achieve this, please answer the question. Thanks.

If you have a class like this:
class Foo(db.Document):
    bar = db.StringField()
    meta = {'indexes': [{'fields': ['bar'], 'unique': True}]}
and a list of Foo instances, foos = [Foo(bar='a'), Foo(bar='a'), Foo(bar='a')], then calling Foo.objects.insert(foos) raises mongoengine.errors.NotUniqueError.
The 1st workaround would be to delete the index from MongoDB, insert the duplicates, and then ensure the index with {unique : true, dropDups : true}. Note that the dropDups option was removed in MongoDB 3.0, so this only works on older servers.
The 2nd workaround would be to use the underlying pymongo API for bulk ops: https://docs.mongodb.com/manual/reference/method/db.collection.initializeOrderedBulkOp/#db.collection.initializeOrderedBulkOp

For now I am using raw pymongo from mongoengine as a workaround for this. This is the 2nd workaround that @Alexey Smirnov mentioned. For a mongoengine Document class DocClass, you access the underlying pymongo collection and execute the query like below:
from pymongo.errors import BulkWriteError
try:
    # Convert ME objects to what pymongo can understand
    doc_list = [doc.to_mongo() for doc in me_doc_list]
    DocClass._get_collection().insert_many(doc_list, ordered=False)
except BulkWriteError as bwe:
    print("Batch inserted with some errors. Some duplicates may have been found and skipped.")
    print(f"Count is {DocClass.objects.count()}.")
except Exception as e:
    print({'error': str(e)})
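The except block above treats every bulk write error as a probable duplicate. A slightly stricter variant could inspect the error details, as in this sketch. The helper name split_write_errors is hypothetical; it relies on BulkWriteError.details being a dict with a "writeErrors" list, and on 11000 being MongoDB's duplicate-key error code.

```python
DUPLICATE_KEY = 11000  # MongoDB error code for a unique-index violation


def split_write_errors(details):
    """Separate duplicate-key errors from all other write errors.

    `details` is the dict exposed as BulkWriteError.details.
    """
    duplicates, others = [], []
    for err in details.get("writeErrors", []):
        if err.get("code") == DUPLICATE_KEY:
            duplicates.append(err)
        else:
            others.append(err)
    return duplicates, others


# Hypothetical usage inside the `except BulkWriteError as bwe:` block:
# duplicates, others = split_write_errors(bwe.details)
# if others:
#     raise  # something other than a duplicate went wrong
```

This way skipped duplicates stay silent while unrelated failures, such as validation errors, are not swallowed.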

Pymongo $in Query Not Working

I am seeing some strange behavior in a Pymongo $in query. I am looking for records that match the following query:
speciesCollection.find({"SPCOMNAME":{"$in":['paddlefish','lake sturgeon']}})
The query returns no records.
If I change it to find_one, it works and returns the last value, for lake sturgeon. The field is a text field with a single value, so I am looking for records that match paddlefish or lake sturgeon.
It works fine in Mongo Shell like this:
speciesCollection.find({SPCOMNAME:{$in: ['paddlefish','lake sturgeon']}},{_id:0})
Here is the result from shell
{ "SPECIES_ID" : 1, "SPECIES_AB" : "LKS", "SPCOMNAME" : "lake sturgeon", "SP_SCINAME" : "Acipenser fulvescens" }
{ "SPECIES_ID" : 101, "SPECIES_AB" : "PAH", "SPCOMNAME" : "paddlefish", "SP_SCINAME" : "Polyodon spathula" }
Am I missing something here?

I think you have a typo or some other error in your program: I just ran a test with your sample data and query, and it works.
Below is my test code, which connects to the database called so and the collection speciesCollection. Maybe it will help you find the error in yours:
import pymongo
client = pymongo.MongoClient('dockerhostlinux1', 30000)
db = client.so
coll = db.speciesCollection
result = coll.find({"SPCOMNAME":{"$in":['paddlefish','lake sturgeon']}})
for doc in result:
    print(doc)
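The thread does not establish what the actual error was. One thing worth ruling out is letter case, since $in compares strings exactly and 'Lake Sturgeon' does not match 'lake sturgeon'. The sketch below builds a case-insensitive variant; the helper name case_insensitive_in is hypothetical, and it relies on $in accepting compiled regular expressions.

```python
import re


def case_insensitive_in(values):
    """Build an $in condition that matches each value regardless of case."""
    # re.escape keeps characters such as '.' or '(' literal, and the ^...$
    # anchors require the whole field value to match, not just a part of it.
    patterns = [re.compile("^" + re.escape(v) + "$", re.IGNORECASE) for v in values]
    return {"$in": patterns}


query = {"SPCOMNAME": case_insensitive_in(["paddlefish", "Lake Sturgeon"])}

# Hypothetical usage with the collection from the question:
# for doc in speciesCollection.find(query):
#     print(doc)
```

Also keep in mind that find returns a cursor, so documents only appear once it is iterated, as the loop in the test code does.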

AWS DynamoDB Python - boto3 Key() methods not recognized (Query)

I am using Lambda (Python) to query my DynamoDB database. I am using the boto3 library, and I was able to make an "equivalent" query:
This script works:
import boto3
from boto3.dynamodb.conditions import Key, Attr
import json
def create_list(event, context):
    resource = boto3.resource('dynamodb')
    table = resource.Table('Table_Name')
    response = table.query(
        TableName='Table_Name',
        IndexName='Custom-Index-Name',
        KeyConditionExpression=Key('Number_Attribute').eq(0)
    )
    return response
However, when I change the query expression to this:
KeyConditionExpression=Key('Number_Attribute').gt(0)
I get the error:
"errorType": "ClientError",
"errorMessage": "An error occurred (ValidationException) when calling the Query operation: Query key condition not supported"
According to this [1] resource, "gt" is a method of Key(). Does anyone know if this library has been updated, or what other methods are available other than "eq"?
[1] http://boto3.readthedocs.io/en/latest/reference/customizations/dynamodb.html#ref-dynamodb-conditions
---------EDIT----------
I also just tried the old method using:
response = client.query(
    TableName = 'Table_Name',
    IndexName='Custom_Index',
    KeyConditions = {
        'Custom_Number_Attribute':{
            'ComparisonOperator':'EQ',
            'AttributeValueList': [{'N': '0'}]
        }
    }
)
This worked, but when I try:
response = client.query(
    TableName = 'Table_Name',
    IndexName='Custom_Index',
    KeyConditions = {
        'Custom_Number_Attribute':{
            'ComparisonOperator':'GT',
            'AttributeValueList': [{'N': '0'}]
        }
    }
)
...it does not work.
Why would EQ be the only method working in these cases? I'm not sure what I'm missing in the documentation.

From what I understand:
Your partition key is Number_Attribute, so you cannot use gt on it in a query (only eq is allowed on the partition key).
You can use gt or between on your sort key in a query. It is also called the range key, and because items that share a partition key are stored sorted by it, gt and between can be answered efficiently in a query.
Now, if you want to apply gt or between to your partition key, you will have to use scan, like below:
fe = Key('Number_Attribute').gt(0)
response = table.scan(
    FilterExpression=fe
)
Keep the following in mind concerning scan:
The scan method reads every item in the entire table, and returns all of the data in the table. You can provide an optional filter_expression, so that only the items matching your criteria are returned. However, note that the filter is only applied after the entire table has been scanned.
In other words, it is a costly operation compared to query. The documentation has an example.
Hope that helps!
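One practical detail when scanning: a single scan call returns at most one page of results, and the response carries a LastEvaluatedKey when more remain. The sketch below collects all pages; the helper name scan_all is hypothetical, and `table` is assumed to behave like a boto3 DynamoDB Table resource.

```python
def scan_all(table, filter_expression):
    """Scan the whole table and return every item that passes the filter."""
    items = []
    kwargs = {"FilterExpression": filter_expression}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        # LastEvaluatedKey is present only when there is another page.
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        # Resume the next scan call where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key
    return items


# Hypothetical usage with the table from the question:
# from boto3.dynamodb.conditions import Key
# rows = scan_all(table, Key('Number_Attribute').gt(0))
```

Every page is still read in full before the filter is applied, so this does not make the scan cheaper; it only makes the result complete.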

Can't update field in existing document in pymongo

I'm trying to add a field to an existing document with pymongo.
Here is my code:
from pymongo import MongoClient
client = MongoClient()
db = client['profiles']
collection = db['collection']
def createFields():
    collection.update({'_id' : '547f21f450c19fca35de53cd'}, {'$set': {'new_field':1}})
createFields()
When I enter the following into the MongoDB shell:
> use profiles
> db.collection.find()
I can see that no fields have been added to the specified document.

An _id field is most commonly of type ObjectId(), which is a twelve-byte value, rather than the 24-character string you are providing here.
You must use the correct type in order to match the document.
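To illustrate the difference: the 24 characters are the hexadecimal spelling of the twelve bytes. The sketch below checks whether a string has that shape; the helper name looks_like_object_id is hypothetical. With pymongo installed, the actual conversion is ObjectId(raw_id) using `from bson import ObjectId`, and the filter becomes {'_id': ObjectId(raw_id)}.

```python
def looks_like_object_id(value):
    """Return True if `value` is a 24-character hex string (12 bytes)."""
    if not isinstance(value, str) or len(value) != 24:
        return False
    try:
        # Two hex digits encode one byte, so 24 digits decode to 12 bytes.
        return len(bytes.fromhex(value)) == 12
    except ValueError:
        # Raised when the string contains non-hexadecimal characters.
        return False


raw_id = '547f21f450c19fca35de53cd'
print(looks_like_object_id(raw_id))

# Hypothetical usage with pymongo available:
# from bson import ObjectId
# collection.update_one({'_id': ObjectId(raw_id)}, {'$set': {'new_field': 1}})
```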

You should call the update through db, i.e. db.collection.update(...).
Just try this; it works fine for me.
In the Mongo CLI:
> use profiles
> db.collection.insertOne({'_id': '547f21f450c19fca35de53cd'})
> db.collection.find()
{ "_id" : "547f21f450c19fca35de53cd" }
Python Script -> test.py
from pymongo import MongoClient
client = MongoClient()
db = client['profiles']
collection = db['collection']
def createFields():
    # The _id inserted above is a plain string, so a string filter matches it.
    db.collection.update_one({'_id' : '547f21f450c19fca35de53cd'}, {'$set': {'new_field':-881}})
createFields()
Command Line
> python test.py
Mongo CLI
> use profiles
> db.collection.find()
{ "_id" : "547f21f450c19fca35de53cd", "new_field" : -881 }
For pymongo, use update_one or update_many instead of update, which was deprecated in PyMongo 3 and removed in PyMongo 4.
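An update whose filter matches nothing does not raise an error, which is why the original script appeared to succeed. A small wrapper can make that visible, as in this sketch. The helper name set_field is hypothetical; matched_count and modified_count are attributes of the result returned by update_one.

```python
def set_field(collection, doc_id, field, value):
    """Set `field` on the document with the given _id, failing loudly if absent."""
    result = collection.update_one({"_id": doc_id}, {"$set": {field: value}})
    # matched_count is 0 when no document has this _id, for example because
    # the _id was passed as a string while the stored value is an ObjectId.
    if result.matched_count == 0:
        raise LookupError(f"no document with _id {doc_id!r}")
    return result.modified_count


# Hypothetical usage with the collection from the question:
# set_field(collection, '547f21f450c19fca35de53cd', 'new_field', 1)
```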
