PyMongo alternative to findAndModify - python

Background:
Python 3.7, Mongo 4.4, Ubuntu 20.04
I want to add a sequence number (unique integer) to new documents in my application. I'm holding a counter in the DB and using it as the unique number, increasing it after each new document.
Since my application is multi-processed, I sync this counter between processes using PyMongo's find_and_modify function, which is atomic (see Atomicity and Transactions).
Here is a simplified example of the code:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['data']
collection = db.counters

def get_next_sequence():
    query = {'_id': 'sequence_data'}
    update = {'$inc': {'counter': 1}}
    updated_doc = collection.find_and_modify(query, update)
    return updated_doc['counter']
Problem:
The find_and_modify function is deprecated and was removed in PyMongo 4.0 (see the PyMongo 4.0 changelog), and following the guidelines in the PyMongo 4 Migration Guide, I tried to use the find_one_and_update function, but it is not atomic for the find part.
This can cause a race condition between two processes that execute find_one_and_update and both read the same counter value before incrementing it.
Question:
So my basic question is, how do I apply atomicity on the read section of my logic between multiple processes/threads?
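For reference, here is the find_one_and_update variant I tried; a minimal sketch assuming the same 'counters' collection and counter document as above:
from pymongo import MongoClient, ReturnDocument
client = MongoClient('mongodb://localhost:27017/')
collection = client['data'].counters
# Increment the counter and ask for the post-increment value in one command.
updated_doc = collection.find_one_and_update(
    {'_id': 'sequence_data'},
    {'$inc': {'counter': 1}},
    return_document=ReturnDocument.AFTER)
sequence_number = updated_doc['counter']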

Related

What is the return of an UPDATE query?

I'm using sqlalchemy in combination with sqlite and the databases library and I'm trying to wrap my head around what that combination returns when doing update queries. I'm running a testcase and I have sqlalchemy set up to roll back upon execution of each testcase via force_rollback=True.
db = databases.Database(DB_URL, force_rollback=True)
query = update(my_table).where(my_table.columns.id == some_id_to_update).values(**values)
res = await db.execute(query)
When working with psql, I'd expect res to be the number of rows that were affected by the UPDATE query, but from reading the documentation, SQLite seems to behave differently in that it doesn't return anything. I tested this manually by connecting to the database via sqlite3 and, as expected, there is no return when doing UPDATE queries. SQLAlchemy, however, does return something, which I assume is the total number of rows in the table, but I'm not sure. Can anybody shed some light on what is actually returned?
What's more, when I tried to get the number of rows affected by the UPDATE query via SELECT changes(), I'm also getting the number of total rows in the table and not the rows affected by the most recent query. Do I have a misunderstanding of what changes() does?
"The changes() function returns the number of database rows that were changed or inserted or deleted by the most recently completed INSERT, DELETE, or UPDATE statement, exclusive of statements in lower-level triggers."
When you use the Python sqlite3 module, you use the .executeXXX methods to evaluate/prepare your query. If the query is supposed to modify the database, it does so at this stage. You have to use the same interface to prepare a SELECT statement. In either case, the .executeXXX methods return the cursor itself, not the query result. To get the result of a SELECT query, you have to call a .fetchXXX method after running .executeXXX.
To get the number of changed rows after an INSERT, DELETE, or UPDATE statement via sqlite3, you can also take the difference of con.total_changes before and after running .executeXXX.
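A minimal sketch of that total_changes approach, using a throwaway in-memory database (the table and values are made up for illustration):
import sqlite3
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)')
con.executemany('INSERT INTO t (val) VALUES (?)', [('a',), ('b',), ('c',)])
before = con.total_changes
con.execute("UPDATE t SET val = 'x' WHERE id > 1")
print(con.total_changes - before)  # 2 -- only the rows touched by the UPDATE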

Why is a Pandas Oracle DB query faster with literals?

When I use the bind variable approach found here: https://cx-oracle.readthedocs.io/en/latest/user_guide/bind.html#bind
and here: Python cx_Oracle bind variables
my query takes around 8 minutes, but when I use hardcoded values (literals), it takes around 20 seconds.
I'm struggling to comprehend what's happening "behind-the-scenes" (variables/memory access/data transfer/query parsing) to see if there's any way for me to adhere to the recommended approach of using bind variables and get the same ~20s performance.
This python script will be automated and the values will be dynamic, so I definitely can't use hardcoded values.
Technical background: Python 3.6; Oracle 11g; cx_Oracle 8
---- python portion of code -----
first version
param_dict = {"startDate": "01-Jul-21", "endDate": "31-Jul-2021"}
conn = (typical database connection code….)
cur = conn.cursor()
###### this query has the bind variables and param_dict keys match bind variable aliases; runtime ~480s (8mins)
cur_df = pandas.DataFrame(cur.execute("inserted_query_here", param_dict))
second version
conn = (typical database connection code….)
cur = conn.cursor()
###### this query has the hardcoded values (literals); runtime ~20s
cur_df = pandas.DataFrame(cur.execute("inserted_query_here"))
@ChristopherJones and Alex, thanks for the referenced articles.
I was able to solve the issue by thoroughly examining the EXPLAIN PLAN. The query that performed faster wasn't using an index (a full table scan was faster); the bind-variable version of the query was using the index.
I applied a NO_INDEX hint accordingly, and the bind-variable version of the query now also runs in ~20s.
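A sketch of where such a hint goes; the table name, index name, and column here are placeholders rather than anything from the original query:
# the NO_INDEX hint steers the optimizer back to the full table scan
# that made the literal version fast
query = """
SELECT /*+ NO_INDEX(orders idx_orders_date) */ *
FROM orders
WHERE order_date BETWEEN :startDate AND :endDate
"""
cur_df = pandas.DataFrame(cur.execute(query, param_dict))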

Trigger to reject cosmos DB update from python

I have some code which updates a record in a Cosmos DB container (see the simplified snippet below). However, there are other independent processes that also update the same record from other systems. In the example below, if the state is a particular value, I would like the upsert_item() to be a no-op when the same record in the container has already been updated to a particular "final" state. One way to solve it is to read the value before each update, but that is a bit too expensive. Is there a simple way to make the upsert_item() turn into a no-op based on some server-side trigger? Any pointers would be appreciated.
client = CosmosClient(<end_pt>, <key>)
database_name = "cosmosdb"
container_name = "solar_system"
db_client = client.get_database_client(database_name)
db_container = db_client.get_container_client(container_name)
uid, planet, state = get_planetary_config()
# How can I make this following update a no-op depending on current state in the database?
json_data = {"id": str(uid), "planet": planet, "state": state}
db_container.upsert_item(body=json_data)
As far as I know, a Cosmos DB server-side trigger does not meet your need. It is invoked to execute a pre-function or post-function, not to judge whether the document meets some condition.
So, an update with specific conditions is not supported by Cosmos DB natively. You need to read the value and check the condition before your update operation.
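A minimal sketch of that read-then-check pattern with the azure-cosmos SDK; it assumes 'planet' is the container's partition key and 'final' is the terminal state, both placeholders:
from azure.cosmos import exceptions
skip_update = False
try:
    current = db_container.read_item(item=str(uid), partition_key=planet)
    # another process may already have moved the record to the final state
    skip_update = current.get('state') == 'final'
except exceptions.CosmosResourceNotFoundError:
    pass  # no existing record yet; the upsert below will create it
if not skip_update:
    db_container.upsert_item(body=json_data)
Note this still leaves a small window between the read and the write; closing it completely would need a server-side conditional write, which is exactly what isn't available here.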

Mongodb: Shard a collection within my python code

I have a distributed MongoDB 3.2 database. The system is mounted on two nodes with RedHat operating system.
Using Python and the PyMongo driver (or some other), I want to enable sharding of a collection, specifying a compound shard key.
In the mongo shell this works:
> use mongotest
> sh.enableSharding("mongotest")
> db.signals.createIndex({ valueX: 1, valueY: 1 }, { unique: true })
> sh.shardCollection("mongotest.signals", { valueX: 1, valueY: 1 })
('mongotest' is the DB, and 'signals' is the collection)
I want to run the last two lines from within my code. Does anyone know if this is possible in Python? If so, how is it done?
Thank you very much, and sorry for my bad English.
A direct translation of your shell commands to python is as shown below.
from pymongo import MongoClient
client = MongoClient()
db = client.admin # run commands against admin database.
db.command('enableSharding', 'mongotest')
db.command({'shardCollection': 'mongotest.signals', 'key': {'valueX': 1, 'valueY': 1}})
However, you may want to confirm that both enableSharding and shardCollection are exposed in your db by running
db.command('listCommands')
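The unique index from the second shell line isn't covered by the translation above; a sketch of that step, assuming the same database and collection names:
from pymongo import ASCENDING
client.mongotest.signals.create_index(
    [('valueX', ASCENDING), ('valueY', ASCENDING)], unique=True)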

PyMongo: Update, $multi:false, get _id of updated document?

When updating a document in MongoDB using a search-style update, is it possible to get back the _id of the document(s) updated?
For example:
import pymongo
client = pymongo.MongoClient('localhost', 27017)
db = client.test_database
col = db.test_col
col.insert({'name':'kevin', 'status':'new'})
col.insert({'name':'brian', 'status':'new'})
col.insert({'name':'matt', 'status':'new'})
col.insert({'name':'stephen', 'status':'new'})
info = col.update({'status':'new'}, {'$set':{'status':'in_progress'}}, multi=False)
print info
# {u'updatedExisting': True, u'connectionId': 1380, u'ok': 1.0, u'err': None, u'n': 1}
# I want to know the _id of the document that was updated.
I have multiple threads accessing the database collection and want to be able to mark a document as being acted upon. Getting the document first and then updating by Id is not a good answer, because two threads may "get" the same document before it is updated. The application is a simple asynchronous task queue (yes, I know we'd be better off with something like Rabbit or ZeroMQ for this, but adding to our stack isn't possible right now).
You can use pymongo.collection.find_and_modify. It is a wrapper around the MongoDB findAndModify command and can return the original (by default) or the modified document.
info = col.find_and_modify({'status':'new'}, {'$set':{'status':'in_progress'}})
if info:
    print info.get('_id')
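On PyMongo versions where find_and_modify is no longer available, the same pattern is exposed as find_one_and_update; a minimal sketch assuming the same collection as above:
from pymongo import ReturnDocument
# ReturnDocument.BEFORE (the default) mirrors find_and_modify's behaviour:
# you get the document as it was before the $set was applied
info = col.find_one_and_update(
    {'status': 'new'},
    {'$set': {'status': 'in_progress'}},
    return_document=ReturnDocument.BEFORE)
if info:
    print(info.get('_id'))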
