PyMongo: Update, $multi:false, get _id of updated document? - python

When updating a document in MongoDB using a search-style update, is it possible to get back the _id of the document(s) updated?
For example:
import pymongo
client = pymongo.MongoClient('localhost', 27017)
db = client.test_database
col = db.test_col
col.insert({'name':'kevin', 'status':'new'})
col.insert({'name':'brian', 'status':'new'})
col.insert({'name':'matt', 'status':'new'})
col.insert({'name':'stephen', 'status':'new'})
info = col.update({'status':'new'}, {'$set':{'status':'in_progress'}}, multi=False)
print info
# {u'updatedExisting': True, u'connectionId': 1380, u'ok': 1.0, u'err': None, u'n': 1}
# I want to know the _id of the document that was updated.
I have multiple threads accessing the database collection and want to be able to mark a document as being acted upon. Getting the document first and then updating by Id is not a good answer, because two threads may "get" the same document before it is updated. The application is a simple asynchronous task queue (yes, I know we'd be better off with something like Rabbit or ZeroMQ for this, but adding to our stack isn't possible right now).

You can use pymongo.collection.find_and_modify. It is a wrapper around the MongoDB findAndModify command and can return either the original document (the default) or the modified one.
info = col.find_and_modify({'status':'new'}, {'$set':{'status':'in_progress'}})
if info:
    print info.get('_id')
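On PyMongo 3.0 and later the same idea is usually written with find_one_and_update, which issues the same findAndModify command. A minimal sketch, assuming `col` is a pymongo Collection; the function name `claim_next_task` is made up for illustration:

```python
def claim_next_task(col):
    """Atomically mark one 'new' document as 'in_progress' and return its _id.

    `col` is assumed to be a pymongo Collection (PyMongo 3.0+).
    Returns None when no document has status 'new'.
    """
    # The match and the update happen as one server-side operation,
    # so two threads cannot claim the same document.
    doc = col.find_one_and_update(
        {'status': 'new'},
        {'$set': {'status': 'in_progress'}},
    )
    # By default the pre-update document is returned; its _id is the same either way.
    return doc['_id'] if doc is not None else None
```

Each worker thread would call claim_next_task(col) in a loop and stop (or sleep) when it gets None back.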

Related

PyMongo alternative to findAndModify

Background:
Python 3.7, Mongo 4.4, Ubuntu 20.04
I want to add a sequence number (unique integer) to new documents in my application. I'm holding a counter in the DB and using it as the unique number, increasing it after each new document.
Since my application is multi-processed, I sync this counter between processes using pyMongo findAndModify function, which is atomic (see Atomicity and Transactions).
Here is a simplified example of the code:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['data']
collection = db.counters
def next_sequence_number():
    query = {'_id': 'sequence_data'}
    update = {'$inc': {'counter': 1}}
    updated_doc = collection.find_and_modify(query, update)
    return updated_doc['counter']
Problem:
The findAndModify function is deprecated (see PyMongo 4.0 changelog), and following the guidelines in the PyMongo 4 Migration Guide, I tried to use the find_one_and_update function, but it is not atomic for the find part.
This can cause a race condition between 2 processes that execute the find_one_and_update and both read the same counter value, before incrementing it.
Question:
So my basic question is, how do I apply atomicity on the read section of my logic between multiple processes/threads?
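For reference, find_one_and_update sends the same findAndModify command to the server, and MongoDB applies the match and the $inc to that single document as one atomic operation. A minimal sketch of the counter under that assumption; `collection` is assumed to be a pymongo Collection, and the upsert (so the counter document is created on first use) is an addition not in the original code:

```python
def next_sequence_number(collection, counter_id='sequence_data'):
    """Return the counter value before the increment, like the original code.

    `collection` is assumed to be a pymongo Collection (PyMongo 3.0+).
    """
    doc = collection.find_one_and_update(
        {'_id': counter_id},
        {'$inc': {'counter': 1}},
        upsert=True,  # assumption: create the counter document if it is missing
    )
    # With the default return_document setting the pre-update document comes back;
    # it is None when the upsert has just created the counter.
    return doc['counter'] if doc is not None else 0
```

Two processes calling this at the same time should each get a different value, because the server never hands the same pre-increment document to both.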

SQLAlchemy/Postgres: Intermittent Error Serializing Object After Commit

I have a Flask application that uses SQLAlchemy (with some Marshmallow for serialization and deserialization).
I'm currently encountering some intermittent issues when trying to dump an object post-commit.
To give an example, let's say I have implemented a (multi-tenant) system for tracking system faults of some sort. This information is contained in a fault table:
class Fault(Base):
    __tablename__ = "fault"
    fault_id = Column(BIGINT, primary_key=True)
    workspace_id = Column(Integer, ForeignKey('workspace.workspace_id'))
    local_fault_id = Column(Integer)
    name = Column(String)
    description = Column(String)
I've removed a number of columns in the interest of simplicity, but this is the core of the model. The columns should be largely self explanatory, with workspace_id effectively representing tenant, and local_fault_id representing a tenant-specific fault sequence number, which is handled via a separate fault_sequence table.
This fault_sequence table holds a counter against workspace, and is updated by means of a simple on_fault_created() function that is executed by a trigger:
CREATE TRIGGER fault_created
AFTER INSERT
ON "fault"
FOR EACH ROW
EXECUTE PROCEDURE on_fault_created();
So - the problem:
I have a Flask endpoint for fault creation, where we create an instance of a Fault entity, add this via a scoped session (session.add(fault)), then call session.commit().
It seems that this is always successful in creating the desired entities in the database, executing the sequence update trigger etc. However, when I then try to interrogate the fault object for updated fields (after commit()), around 10% of the time I find that each key/field just points to an Exception:
psycopg2.errors.InFailedSqlTransaction: current transaction is aborted, commands ignored until end of transaction block
Which seems to boil down to the following:
(psycopg2.errors.InvalidTextRepresentation) invalid input syntax for integer: ""
[SQL: SELECT fault.fault_id AS fault_fault_id, fault.workspace_id AS fault_workspace_id, fault.local_fault_id AS fault_local_fault_id, fault.name as fault_name, fault.description as fault_description
FROM fault
WHERE fault.fault_id = %(param_1)s]
[parameters: {'param_1': 166}]
(Background on this error at: http://sqlalche.me/e/13/2j8)
My question, then, is what do we think could be causing this?
I think it smells like a race condition, with the update trigger not being complete before SQLAlchemy has tried to get the updated data; perhaps local_fault_id is null, and this is resulting in the invalid input syntax error.
That said, I have very low confidence on this. Any guidance here would be amazing, as I could really do with retrieving that sequence number that's incremented/handled by the update trigger.
Thanks
Edit 1:
Some more info:
I have tried removing the update trigger, in the hope of eliminating that as a suspect. This behaviour is still intermittently evident, so I don't think it's related to that.
I have tried adopting usage of flush and refresh before the commit, and this allows me to get the values that I need - though commit still appears to 'break' the fault object.
Edit 2:
So it really seems to be more postgres than anything else. When I interrogate my database logs, this is the weirdest thing. I can copy and paste the command it says is failing, and I struggle to see how this integer value in the WHERE clause is possibly evaluating to an empty string.
This same error is reproducible with SELECT ... FROM fault WHERE fault.fault_id = '', which in no way seems to be the query making to the DB.
I am stumped.
Your sentence "This same error is reproducible with SELECT ... FROM fault WHERE fault.fault_id = '', which in no way seems to be the query making to the DB." seems to indicate that you are trying to access an object that does not have the database primary key "fault_id".
I guess, given that you did not provide the code, that you are adding the object to your session (session.add), committing (session.commit) and then using the object. As fault_id is autogenerated by the database, the fault object in the session (in memory) does not have fault_id.
I believe you can correct this with:
session.add(fault)
session.commit()
session.refresh(fault)
The refresh needs to be AFTER commit to refresh the fault object and retrieve fault_id.
If you are using async, you need
session.add(fault)
await session.commit()
await session.refresh(fault)

How to use Azure DevOps / VSTS to fetch query results in python

Below is my current code. It connects successfully to the organization. How can I fetch the results of a query in Azure like they have here? I know this was solved but there isn't an explanation and there's quite a big gap on what they're doing.
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication
from azure.devops.v5_1.work_item_tracking.models import Wiql
personal_access_token = 'xxx'
organization_url = 'zzz'
# Create a connection to the org
credentials = BasicAuthentication('', personal_access_token)
connection = Connection(base_url=organization_url, creds=credentials)
wit_client = connection.clients.get_work_item_tracking_client()
results = wit_client.query_by_id("my query ID here")
P.S. Please don't link me to the github or documentation. I've looked at both extensively for days and it hasn't helped.
Edit: I've added the results line that successfully gets the query. However, it returns a WorkItemQueryResult class which is not exactly what is needed. I need a way to view the column and results of the query for that column.
So I've figured this out in probably the most inefficient way possible, but hope it helps someone else and they find a way to improve it.
The issue with the WorkItemQueryResult class stored in the variable "results" is that it doesn't allow the contents of the work item to be shown.
So the goal is to be able to use the get_work_item method that requires the id field, which you can get (in a rather roundabout way) through item.target.id from results' work_item_relations. The code below is added on.
for item in results.work_item_relations:
    id = item.target.id
    work_item = wit_client.get_work_item(id)
    fields = work_item.fields
This gets the id from every work item in your result class and then grants access to the fields of that work item, which you can access by fields.get("System.Title"), etc.
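The loop above can be wrapped into a helper that returns one row per work item, limited to the query's own columns. This is only a sketch: it assumes the result object exposes `columns` (each with a `reference_name`), `work_items` for flat queries and `work_item_relations` for tree or link queries; the function name `query_rows` is made up.

```python
def query_rows(wit_client, results):
    """Return one dict per work item, keyed by the query's column reference names."""
    # Assumption: flat queries fill `work_items`, tree/link queries fill `work_item_relations`.
    if getattr(results, 'work_items', None):
        ids = [ref.id for ref in results.work_items]
    else:
        relations = getattr(results, 'work_item_relations', None) or []
        ids = [rel.target.id for rel in relations if rel.target is not None]
    # A work item can appear in several relations; keep the first occurrence only.
    ids = list(dict.fromkeys(ids))
    columns = [col.reference_name for col in (results.columns or [])]
    rows = []
    for work_item_id in ids:
        fields = wit_client.get_work_item(work_item_id).fields
        # Missing fields come back as None rather than raising.
        rows.append({name: fields.get(name) for name in columns})
    return rows
```

With the variables from the question this would be called as rows = query_rows(wit_client, results). It still makes one request per work item, so it is no faster than the loop above.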

Trigger to reject cosmos DB update from python

I have some code which updates a record in a cosmos DB container (see simplified snippet below). However there are other independent processes that also update the same record from other systems. In the example below if the state is a particular value, I would like the upsert_item() to be a no-op if the same record in the container already got updated to a particular "final" state. One way to solve it is to read the value before each update but that is a bit too expensive. Is there a simple way to make the upsert_item() turn into a no-op based on some server side trigger? Any pointers would be appreciated
client = CosmosClient(<end_pt>, <key>)
database_name = "cosmosdb"
container_name = "solar_system"
db_client = client.get_database_client(database_name)
db_container = db_client.get_container_client(container_name)
uid, planet, state = get_planetary_config()
# How can I make this following update a no-op depending on current state in the database?
json_data = {"id": str(uid), "planet": planet, "state": state}
db_container.upsert_item(body=json_data)
As far as I know, a Cosmos DB server-side trigger does not meet your need. It is invoked to run a pre-operation or post-operation function, not to judge whether the document meets some condition.
So updating under a specific condition is not supported natively by Cosmos DB. You need to read the value and check the condition before your update operation.
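A minimal sketch of that read-then-check approach, assuming `container` is the container client from the question, that the record already exists (read_item raises otherwise), and that the caller knows the partition key value. The name `upsert_unless_final` and the "final" state value are made up for illustration:

```python
FINAL_STATES = frozenset({"final"})  # hypothetical: states that must not be overwritten


def upsert_unless_final(container, body, partition_key, final_states=FINAL_STATES):
    """Upsert `body` unless the stored item is already in a final state.

    Returns True if the upsert was issued, False if it was skipped.
    """
    # One extra point read per update; this is the cost the question wanted to avoid.
    current = container.read_item(item=body["id"], partition_key=partition_key)
    if current.get("state") in final_states:
        return False
    container.upsert_item(body=body)
    return True
```

Note that this does not close the window between the read and the write: another process can still move the record to a final state in between.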

Multi-tenancy with SQLAlchemy

I've got a web-application which is built with Pyramid/SQLAlchemy/Postgresql and allows users to manage some data, and that data is almost completely independent for different users. Say, Alice visits alice.domain.com and is able to upload pictures and documents, and Bob visits bob.domain.com and is also able to upload pictures and documents. Alice never sees anything created by Bob and vice versa (this is a simplified example, there may be a lot of data in multiple tables really, but the idea is the same).
Now, the most straightforward option to organize the data in the DB backend is to use a single database, where each table (pictures and documents) has user_id field, so, basically, to get all Alice's pictures, I can do something like
user_id = _figure_out_user_id_from_domain_name(request)
pictures = session.query(Picture).filter(Picture.user_id==user_id).all()
This is all easy and simple; however, there are some disadvantages:
- I need to remember to always use an additional filter condition when making queries, otherwise Alice may see Bob's pictures;
- If there are many users, the tables may grow huge;
- It may be tricky to split the web application between multiple machines.
So I'm thinking it would be really nice to somehow split the data per-user. I can think of two approaches:
1. Have separate tables for Alice's and Bob's pictures and documents within the same database (Postgres schemas seem to be the right approach in this case):
    documents_alice
    documents_bob
    pictures_alice
    pictures_bob
and then, using some dark magic, "route" all queries to one or to the other table according to the current request's domain:
    _use_dark_magic_to_configure_sqlalchemy('alice.domain.com')
    pictures = session.query(Picture).all() # selects all Alice's pictures from "pictures_alice" table
    ...
    _use_dark_magic_to_configure_sqlalchemy('bob.domain.com')
    pictures = session.query(Picture).all() # selects all Bob's pictures from "pictures_bob" table
2. Use a separate database for each user:
    - database_alice
        - pictures
        - documents
    - database_bob
        - pictures
        - documents
which seems like the cleanest solution, but I'm not sure if multiple database connections would require much more RAM and other resources, limiting the number of possible "tenants".
So, the question is, does it all make sense? If yes, how do I configure SQLAlchemy to either modify the table names dynamically on each HTTP request (for option 1) or to maintain a pool of connections to different databases and use the correct connection for each request (for option 2)?
After pondering on jd's answer I was able to achieve the same result for postgresql 9.2, sqlalchemy 0.8, and flask 0.9 framework:
from flask import session  # assumed: the Flask session mentioned above
from sqlalchemy import event
from sqlalchemy.pool import Pool
@event.listens_for(Pool, 'checkout')
def on_pool_checkout(dbapi_conn, connection_rec, connection_proxy):
    tenant_id = session.get('tenant_id')
    cursor = dbapi_conn.cursor()
    if tenant_id is None:
        cursor.execute("SET search_path TO public, shared;")
    else:
        cursor.execute("SET search_path TO t" + str(tenant_id) + ", shared;")
    dbapi_conn.commit()
    cursor.close()
Ok, I've ended up with modifying search_path in the beginning of every request, using Pyramid's NewRequest event:
from pyramid import events
def on_new_request(event):
    schema_name = _figure_out_schema_name_from_request(event.request)
    DBSession.execute("SET search_path TO %s" % schema_name)
def app(global_config, **settings):
    """ This function returns a WSGI application.
    It is usually called by the PasteDeploy framework during
    ``paster serve``.
    """
    ....
    config.add_subscriber(on_new_request, events.NewRequest)
    return config.make_wsgi_app()
Works really well, as long as you leave transaction management to Pyramid (i.e. do not commit/roll back transactions manually, letting Pyramid do that at the end of the request) - which is ok as committing transactions manually is not a good approach anyway.
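One caveat with the two snippets above: the schema name is pasted into the SQL string as-is, so a name derived from the request (such as a subdomain) should be quoted first. A minimal sketch, assuming standard PostgreSQL identifier quoting (wrap in double quotes, double any embedded double quote); the helper names `quote_ident` and `search_path_sql` are made up:

```python
def quote_ident(name):
    """Quote a PostgreSQL identifier: wrap in double quotes, double embedded ones."""
    if not name or "\x00" in name:
        raise ValueError("invalid identifier: %r" % (name,))
    return '"' + name.replace('"', '""') + '"'


def search_path_sql(*schemas):
    """Build a SET search_path statement from one or more schema names."""
    if not schemas:
        raise ValueError("at least one schema is required")
    return "SET search_path TO " + ", ".join(quote_ident(s) for s in schemas)
```

For example, DBSession.execute(search_path_sql(schema_name, 'shared')) would replace the string formatting in on_new_request. Keep in mind that quoted identifiers are case-sensitive, so the names must match the schemas exactly as they were created.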
What works very well for me is to set the search path at the connection pool level, rather than in the session. This example uses Flask and its thread-local proxies to pass the schema name, so you'll have to change schema = current_schema._get_current_object() and the try block around it.
from sqlalchemy.interfaces import PoolListener
class SearchPathSetter(PoolListener):
    '''
    Dynamically sets the search path on connections checked out from a pool.
    '''
    def __init__(self, search_path_tail='shared, public'):
        self.search_path_tail = search_path_tail
    @staticmethod
    def quote_schema(dialect, schema):
        return dialect.identifier_preparer.quote_schema(schema, False)
    def checkout(self, dbapi_con, con_record, con_proxy):
        try:
            schema = current_schema._get_current_object()
        except RuntimeError:
            search_path = self.search_path_tail
        else:
            if schema:
                search_path = self.quote_schema(con_proxy._pool._dialect, schema) + ', ' + self.search_path_tail
            else:
                search_path = self.search_path_tail
        cursor = dbapi_con.cursor()
        cursor.execute("SET search_path TO %s;" % search_path)
        dbapi_con.commit()
        cursor.close()
At engine creation time:
engine = create_engine(dsn, listeners=[SearchPathSetter()])
