Multi-tenancy with SQLAlchemy - python

I've got a web application built with Pyramid/SQLAlchemy/PostgreSQL that allows users to manage some data, and that data is almost completely independent for different users. Say, Alice visits alice.domain.com and is able to upload pictures and documents, and Bob visits bob.domain.com and is also able to upload pictures and documents. Alice never sees anything created by Bob and vice versa (this is a simplified example; there may be a lot of data in multiple tables really, but the idea is the same).
Now, the most straightforward option for organizing the data in the DB backend is to use a single database, where each table (pictures and documents) has a user_id field; so, basically, to get all of Alice's pictures, I can do something like:
user_id = _figure_out_user_id_from_domain_name(request)
pictures = session.query(Picture).filter(Picture.user_id==user_id).all()
This is all easy and simple, however there are some disadvantages:
I need to remember to always add the extra filter condition when making queries, otherwise Alice may see Bob's pictures;
if there are many users, the tables may grow huge;
it may be tricky to split the web application between multiple machines.
So I'm thinking it would be really nice to somehow split the data per-user. I can think of two approaches:
Have separate tables for Alice's and Bob's pictures and documents within the same database (PostgreSQL schemas seem to be the right approach to use in this case):
documents_alice
documents_bob
pictures_alice
pictures_bob
and then, using some dark magic, "route" all queries to one or to the other table according to the current request's domain:
_use_dark_magic_to_configure_sqlalchemy('alice.domain.com')
pictures = session.query(Picture).all() # selects all Alice's pictures from "pictures_alice" table
...
_use_dark_magic_to_configure_sqlalchemy('bob.domain.com')
pictures = session.query(Picture).all() # selects all Bob's pictures from "pictures_bob" table
Use a separate database for each user:
- database_alice
- pictures
- documents
- database_bob
- pictures
- documents
which seems like the cleanest solution, but I'm not sure if multiple database connections would require much more RAM and other resources, limiting the number of possible "tenants".
So, the question is, does it all make sense? If yes, how do I configure SQLAlchemy to either modify the table names dynamically on each HTTP request (for option 1) or to maintain a pool of connections to different databases and use the correct connection for each request (for option 2)?
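For reference, here is roughly what I imagine option 2 would look like: one engine (and session factory) per tenant database, picked from the request's domain. This is just an untested sketch with made-up helper names:
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# One engine/session factory per tenant, created lazily and cached.
_tenant_sessionmakers = {}

def get_session_for_domain(domain):
    tenant = domain.split('.')[0]  # e.g. 'alice' from 'alice.domain.com'
    if tenant not in _tenant_sessionmakers:
        engine = create_engine("postgresql:///database_%s" % tenant, pool_size=5)
        _tenant_sessionmakers[tenant] = sessionmaker(bind=engine)
    return _tenant_sessionmakers[tenant]()

session = get_session_for_domain('alice.domain.com')
pictures = session.query(Picture).all()  # only Alice's pictures, from database_alice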

After pondering on jd's answer, I was able to achieve the same result for PostgreSQL 9.2, SQLAlchemy 0.8, and the Flask 0.9 framework:
from flask import session  # Flask's session holds the current tenant id
from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, 'checkout')
def on_pool_checkout(dbapi_conn, connection_rec, connection_proxy):
    tenant_id = session.get('tenant_id')
    cursor = dbapi_conn.cursor()
    if tenant_id is None:
        cursor.execute("SET search_path TO public, shared;")
    else:
        cursor.execute("SET search_path TO t" + str(tenant_id) + ", shared;")
    dbapi_conn.commit()
    cursor.close()

OK, I've ended up modifying search_path at the beginning of every request, using Pyramid's NewRequest event:
from pyramid import events

def on_new_request(event):
    schema_name = _figure_out_schema_name_from_request(event.request)
    DBSession.execute("SET search_path TO %s" % schema_name)

def app(global_config, **settings):
    """ This function returns a WSGI application.

    It is usually called by the PasteDeploy framework during
    ``paster serve``.
    """
    ....
    config.add_subscriber(on_new_request, events.NewRequest)
    return config.make_wsgi_app()
Works really well, as long as you leave transaction management to Pyramid (i.e. do not commit/roll back transactions manually, letting Pyramid do that at the end of the request), which is fine, as committing transactions manually is not a good approach anyway.

What works very well for me is to set the search path at the connection pool level, rather than in the session. This example uses Flask and its thread-local proxies to pass the schema name, so you'll have to change schema = current_schema._get_current_object() and the try block around it.
from sqlalchemy.interfaces import PoolListener

class SearchPathSetter(PoolListener):
    '''
    Dynamically sets the search path on connections checked out from a pool.
    '''
    def __init__(self, search_path_tail='shared, public'):
        self.search_path_tail = search_path_tail

    @staticmethod
    def quote_schema(dialect, schema):
        return dialect.identifier_preparer.quote_schema(schema, False)

    def checkout(self, dbapi_con, con_record, con_proxy):
        try:
            schema = current_schema._get_current_object()
        except RuntimeError:
            search_path = self.search_path_tail
        else:
            if schema:
                search_path = self.quote_schema(con_proxy._pool._dialect, schema) + ', ' + self.search_path_tail
            else:
                search_path = self.search_path_tail
        cursor = dbapi_con.cursor()
        cursor.execute("SET search_path TO %s;" % search_path)
        dbapi_con.commit()
        cursor.close()
At engine creation time:
engine = create_engine(dsn, listeners=[SearchPathSetter()])
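Note: newer SQLAlchemy releases deprecate (and eventually remove) the PoolListener interface and the listeners= argument in favor of the event API. A rough equivalent of the checkout hook above, as a sketch that assumes the same current_schema proxy and the 'shared, public' tail (schema quoting omitted for brevity), would be:
from sqlalchemy import create_engine, event

engine = create_engine(dsn)

@event.listens_for(engine, "checkout")
def set_search_path(dbapi_con, con_record, con_proxy):
    # Same idea as SearchPathSetter.checkout, registered through the event API.
    try:
        schema = current_schema._get_current_object()
    except RuntimeError:
        schema = None
    search_path = (schema + ', ' if schema else '') + 'shared, public'
    cursor = dbapi_con.cursor()
    cursor.execute("SET search_path TO %s;" % search_path)
    dbapi_con.commit()
    cursor.close()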

What is the fastest way to get all documents of a collection?

I have a problem. I want to get all documents of a collection with ~1 million documents in it. I asked myself: what is the fastest way to get all documents of a collection? Is it with a cursor or with .all()? And are there any recommendations for the batch_size?
cursor
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
cursor = db.aql.execute('FOR doc IN <Collection> RETURN doc', stream=True, ttl=3600, batch_size=<batchSize>)
collection = [doc for doc in cursor]
.all - with custom HTTP Client
from arango import ArangoClient
from arango.http import HTTPClient
class MyCustomHTTPClient(HTTPClient):
    REQUEST_TIMEOUT = 1000

# Initialize the ArangoDB client.
client = ArangoClient(
    http_client=MyCustomHTTPClient())
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
collec = db.collection('<Collection>')
collection = collec.all()
If you want all documents in memory, then .all() will be the fastest, because it uses the library's optimized method for fetching all the results.
If you can process each document as it comes in, then the cursor is the best way to do it, as it avoids the memory overhead.
But the best way to decide is to run tests and measure the timing, because many factors can affect the speed, such as the type and speed of the connection to the DB, the amount of memory on your machine, etc. The examples you gave look simple enough to do such measurements pretty quickly.
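A minimal timing sketch along those lines, reusing the client/db setup and placeholders from the question (batch_size=1000 is just an arbitrary value to experiment with):
import time

start = time.perf_counter()
cursor = db.aql.execute('FOR doc IN <Collection> RETURN doc', stream=True, batch_size=1000)
docs_via_cursor = [doc for doc in cursor]
print("cursor: %.2f s" % (time.perf_counter() - start))

start = time.perf_counter()
docs_via_all = list(db.collection('<Collection>').all())
print(".all(): %.2f s" % (time.perf_counter() - start))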

How to use Azure DevOps / VSTS to fetch query results in python

Below is my current code. It connects successfully to the organization. How can I fetch the results of a query in Azure like they have here? I know this was solved, but there isn't an explanation and there's quite a big gap in what they're doing.
from azure.devops.connection import Connection
from msrest.authentication import BasicAuthentication
from azure.devops.v5_1.work_item_tracking.models import Wiql
personal_access_token = 'xxx'
organization_url = 'zzz'
# Create a connection to the org
credentials = BasicAuthentication('', personal_access_token)
connection = Connection(base_url=organization_url, creds=credentials)
wit_client = connection.clients.get_work_item_tracking_client()
results = wit_client.query_by_id("my query ID here")
P.S. Please don't link me to the github or documentation. I've looked at both extensively for days and it hasn't helped.
Edit: I've added the results line that successfully gets the query. However, it returns a WorkItemQueryResult object, which is not exactly what is needed. I need a way to view the columns and the results of the query for those columns.
So I've figured this out in probably the most inefficient way possible, but I hope it helps someone else and that they find a way to improve it.
The issue with the WorkItemQueryResult object stored in the variable results is that it doesn't directly expose the contents of the work items.
So the goal is to be able to use the get_work_item method, which requires the id field; you can get that (in a rather roundabout way) through item.target.id from the results' work_item_relations. The code below is added on.
for item in results.work_item_relations:
    id = item.target.id
    work_item = wit_client.get_work_item(id)
    fields = work_item.fields
This gets the id from every work item in your result class and then grants access to the fields of that work item, which you can access by fields.get("System.Title"), etc.
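For example, to print only the columns that the query itself selects, something like the following might work; this is only a sketch and assumes that results.columns exposes the selected fields' reference names:
# Reference names of the columns the query selects, e.g. 'System.Title'.
column_names = [col.reference_name for col in results.columns]

for item in results.work_item_relations:
    work_item = wit_client.get_work_item(item.target.id)
    row = {name: work_item.fields.get(name) for name in column_names}
    print(row)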

Trigger to reject cosmos DB update from python

I have some code which updates a record in a Cosmos DB container (see the simplified snippet below). However, there are other independent processes that also update the same record from other systems. In the example below, I would like upsert_item() to be a no-op if the same record in the container has already been updated to a particular "final" state. One way to solve it is to read the value before each update, but that is a bit too expensive. Is there a simple way to make upsert_item() turn into a no-op based on some server-side trigger? Any pointers would be appreciated.
client = CosmosClient(<end_pt>, <key>)
database_name = "cosmosdb"
container_name = "solar_system"
db_client = client.get_database_client(database_name)
db_container = db_client.get_container_client(container_name)
uid, planet, state = get_planetary_config()
# How can I make this following update a no-op depending on current state in the database?
json_data = {"id": str(uid), "planet": planet, "state": state}
db_container.upsert_item(body=json_data)
As far as I know, Cosmos DB server-side triggers do not meet your need. A trigger is invoked to execute a pre- or post-function, not to decide whether the document meets some condition.
So an update with specific conditions is not supported by Cosmos DB natively. You need to read the value and check the condition before your update operation.
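A minimal sketch of that read-then-check approach with the Python SDK, assuming the item's id also happens to be its partition key value (adjust partition_key to your container's actual partition key) and that "final" is the terminal state:
from azure.cosmos import exceptions

json_data = {"id": str(uid), "planet": planet, "state": state}
try:
    current = db_container.read_item(item=str(uid), partition_key=str(uid))
except exceptions.CosmosResourceNotFoundError:
    current = None

# Skip the write if the stored record has already reached the "final" state.
if current is None or current.get("state") != "final":
    db_container.upsert_item(body=json_data)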

SQLAlchemy get every row the matches query and loop through them

I'm new to Python and SQLAlchemy. I've been playing about with retrieving things from the database, and it's worked every time, but I'm a little unsure what to do when the select statement will return multiple rows. I tried using some older code that worked before I started using SQLAlchemy, but db is a SQLAlchemy object and doesn't have an execute() method.
application = Applications.query.filter_by(brochureID=brochure.id)
cur = db.execute(application)
entries = cur.fetchall()
and then in my HTML file
{% for entry in entries %}
var getEmail = {{entry.2|tojson|safe}}
emailArray.push(getEmail);
I looked in the SQLAlchemy documentation and couldn't find an equivalent of .first() that gets all the rows. Can anyone point me in the right direction? No doubt it's something very small.
Your query is correct, you just need to change the way you interact with the result. The method you are looking for is all().
application = Applications.query.filter_by(brochureID=brochure.id)
entries = application.all()
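entries is then a plain list of Applications objects, so you access columns by attribute name rather than by index, both in Python and in the template; for example (email is a hypothetical column name here):
emails = [entry.email for entry in entries]  # 'email' stands in for whatever column holds the address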
The usual way to work with ORM queries is through the Session class; somewhere you should have:
engine = sqlalchemy.create_engine("sqlite:///...")
Session = sqlalchemy.orm.sessionmaker(bind=engine)
I'm not familiar with Flask, but it likely does some of this work for you.
With a Session factory, your query instead becomes:
session = Session()
entries = session.query(Applications) \
              .filter_by(...) \
              .all()

web2py webserver - Best way to keep connection to external SQL server?

I have a simple web2py server that we use to visualize data from our PostgreSQL Server. The following functions are all part of the global models in web2py.
The current solution for fetching data is very simple: I connect every time, and after I get the data I close the connection:
# Old way:
# (imports excluded)
def get_data(query):
    postgres_connection = psycopg2.connect("credentials")
    # Pandas function to put data from the query into a DataFrame
    df = psql.frame_query(query, con=postgres_connection)
    postgres_connection.close()
    return df
For small queries, opening and closing the connection takes about 9/10 of the time it takes to run the function.
Is this a good way to do it instead? If not, what is a better way?
# Better way?
def connect():
    """
    Create a connection to the server.
    """
    return psycopg2.connect("credentials")

db_connection = connect()

def create_pandas_frame(query):
    """
    Get query results if the connection is open.
    """
    return psql.frame_query(query, con=db_connection)

def get_data(query):
    """
    Try to get data; open a new connection if the connection is closed.
    """
    try:
        data = create_pandas_frame(query)
    except:
        global db_connection
        db_connection = connect()
        data = create_pandas_frame(query)
    return data
If you run that code in a web2py model file, you'll end up creating a new connection on each HTTP request anyway. Instead, you might consider connection pooling.
An easier option might be to use the web2py DAL to fetch the data. Something like:
from pandas.core.api import DataFrame
db = DAL([db connection string], pool_size=10, migrate_enabled=False)
rows = db.executesql(query)
data = DataFrame.from_records(rows, columns=[list, of, column, names])
If you specify the pool_size argument to DAL(), it will automatically maintain a connection pool to be used across requests.
Note, I haven't tried this, so it may need some tweaking, but something along these lines should work.
If you'd like, you can even use the DAL to generate the SQL by defining database table models:
db.define_table('mytable',
                Field('field1', 'integer'),
                Field('field2', 'double'),
                Field('field3', 'boolean'))

rows = db.executesql(db(db.mytable.id > 0)._select())
data = DataFrame.from_records(rows, columns=db.mytable.fields)
The ._select() method just generates the SQL without actually doing the select. The SQL is then passed to .executesql() to fetch the data.
An alternative is to create a special Pandas processor and pass it as the processor argument to .select().
def pandas_processor(rows, fields, columns, cacheable):
    return DataFrame.from_records(rows, columns=columns)

data = db(db.mytable.id > 0).select(processor=pandas_processor)
I used Anthony's answer and now have functions that look like this:
# In one of the models files.
from pandas.core.api import DataFrame
external_db = DAL('postgres://connection_stuff', pool_size=10, migrate_enabled=False)

def create_simple_html_table(query):
    dict_from_db = external_db.executesql(query, as_dict=True)
    return DataFrame(dict_from_db).to_html()
Then later, in a controller or a view, an HTML table is created using:
# In Controller:
my_table = create_simple_html_table('select * from random_table limit 50')
# In View:
{{=XML(create_simple_html_table('select * from random_table limit 50'))}}
I still need to do more testing, but my understanding so far is that this solution lets me query the external db while web2py keeps the connection open and uses the same connection for all users.
Note that this solution is only good if all you want to do is read from and write to your Postgres server with raw SQL.
If you want to use the DAL to read and write, you need to either try to find the DAL alternative called MyDAL or play around with the search_path option in Postgres.
