What is the fastest way to get all documents of a collection? - python

I have a problem. I want to get all documents of a collection that contains ~1 million documents. What is the fastest way to get every document in the collection: a cursor, or .all? And are there any recommendations for the batch_size?
cursor
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
cursor = db.aql.execute('FOR doc IN <Collection> RETURN doc', stream=True, ttl=3600, batch_size=<batchSize>)
collection = [doc for doc in cursor]
.all - with custom HTTP Client
from arango import ArangoClient
from arango.http import HTTPClient

class MyCustomHTTPClient(HTTPClient):
    REQUEST_TIMEOUT = 1000

# Initialize the ArangoDB client.
client = ArangoClient(http_client=MyCustomHTTPClient())
# Connect to database as user.
db = client.db(<db>, username=<username>, password=<password>)
collec = db.collection('<Collection>')
collection = collec.all()

If you want all documents in memory, then .all will be the fastest, because it uses the library's optimized method for fetching all results.
If you can process each document as it arrives, then the cursor is the better choice, because it avoids the memory overhead.
But the best way to decide is to run tests and measure the timing, because many factors can affect the speed, such as the type and speed of the connection to the DB, the amount of memory on your machine, and so on. The examples you gave look simple enough to make such measurements quickly.
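For illustration, a minimal timing sketch (reusing the placeholders from your snippets; the batch_size of 10000 is just an arbitrary starting point) could look like this:
import time
from arango import ArangoClient

client = ArangoClient()
db = client.db(<db>, username=<username>, password=<password>)

# Time the streaming AQL cursor approach.
start = time.perf_counter()
cursor = db.aql.execute('FOR doc IN <Collection> RETURN doc',
                        stream=True, ttl=3600, batch_size=10000)
docs_from_cursor = list(cursor)
print('cursor:', time.perf_counter() - start, 'seconds')

# Time the collection.all() approach.
start = time.perf_counter()
docs_from_all = list(db.collection('<Collection>').all())
print('.all():', time.perf_counter() - start, 'seconds')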

Related

Trigger to reject cosmos DB update from python

I have some code which updates a record in a cosmos DB container (see simplified snippet below). However there are other independent processes that also update the same record from other systems. In the example below if the state is a particular value, I would like the upsert_item() to be a no-op if the same record in the container already got updated to a particular "final" state. One way to solve it is to read the value before each update but that is a bit too expensive. Is there a simple way to make the upsert_item() turn into a no-op based on some server side trigger? Any pointers would be appreciated
client = CosmosClient(<end_pt>, <key>)
database_name = "cosmosdb"
container_name = "solar_system"
db_client = client.get_database_client(database_name)
db_container = db_client.get_container_client(container_name)
uid, planet, state = get_planetary_config()
# How can I make this following update a no-op depending on current state in the database?
json_data = {"id": str(uid), "planet": planet, "state": state}
db_container.upsert_item(body=json_data)
As far as I know, a Cosmos DB server-side trigger does not meet your need. Triggers are invoked to execute a pre- or post-function, not to decide whether a document meets some condition.
So an update with specific conditions is not supported by Cosmos DB natively. You need to read the value and check the condition before your update operation.
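For illustration, a read-before-write sketch along those lines (assuming the azure-cosmos v4 SDK, that id doubles as the partition key, and a hypothetical FINAL_STATE value) might look like:
from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosResourceNotFoundError

FINAL_STATE = "final"  # hypothetical terminal state

client = CosmosClient(<end_pt>, <key>)
db_container = (client.get_database_client("cosmosdb")
                      .get_container_client("solar_system"))

uid, planet, state = get_planetary_config()
try:
    # Read the current document first (assumes id is also the partition key).
    current = db_container.read_item(item=str(uid), partition_key=str(uid))
except CosmosResourceNotFoundError:
    current = None

# Skip the upsert if the stored document already reached the final state.
if current is None or current.get("state") != FINAL_STATE:
    db_container.upsert_item(body={"id": str(uid), "planet": planet, "state": state})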

PyMongo cursor batch_size

With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object with batch_size as a parameter. But whatever I try, the cursor always returns all documents in my collection.
A basic snippet of my code looks like this (the collection has over 10K documents):
import pymongo as pm
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cur = coll.find({}, batch_size=500)
However, the cursor always returns the full collection immediately. I'm using it as described in the docs.
Does anyone have an idea how I would properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still get the full collection first and would only loop over the documents already pulled into memory. The batch_size parameter is supposed to fetch one batch at a time, making a round trip to the server for each batch, to save memory.
PyMongo has some quality-of-life helpers for the Cursor class, so it automatically does the batching for you and returns the results to you as documents.
The batch_size setting is respected, but the idea is that you only need to set it in the find() call; you don't have to make low-level getMore calls or iterate through the batches manually.
For example, if I have 100 documents in my collection:
> db.test.count()
100
I then set the profiling level to log all queries:
> db.setProfilingLevel(0, -1)
{
    "was": 0,
    "slowms": 100,
    "sampleRate": 1,
    "ok": 1,
    ...
}
I then use pymongo to specify batch_size of 10:
import pymongo
import bson
conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt':0}, batch_size=10)
print(list(cur))
Running that query, I see in the MongoDB log:
2019-02-22T15:03:54.522+1100 I COMMAND [conn702] command test.test command: find { find: "test", filter: {} ....
2019-02-22T15:03:54.523+1100 I COMMAND [conn702] command test.test command: getMore { getMore: 266777378048, collection: "test", batchSize: 10, ....
(getMore repeated 9 more times)
So the query was fetched from the server in the specified batches. It's just hidden from you via the Cursor class.
Edit
If you really need to get the documents in batches, there is a find_raw_batches() method on Collection (doc link). It works similarly to find() and accepts the same parameters. However, be advised that it returns raw BSON, which has to be decoded by the application in a separate step. Notably, this method does not support sessions.
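For example, a minimal sketch of that approach (decoding each raw batch with the bson module) might be:
import bson
import pymongo

conn = pymongo.MongoClient()
coll = conn.test.test

# Each item yielded by find_raw_batches() is one raw BSON batch (bytes).
for raw_batch in coll.find_raw_batches({}, batch_size=10):
    docs = bson.decode_all(raw_batch)  # decode the batch into a list of dicts
    print(len(docs), 'documents in this batch')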
Having said that, if the aim is to lower the application's memory usage, it's worth considering modifying the query so that it uses ranges instead. For example:
find({'<field>': {'$gte': <some criteria>, '$lte': <some other criteria>}})
Range queries are easier to optimize, can use indexes, and (in my opinion) are easier to debug and to restart should the query get interrupted. Batches are less flexible here: if the query is interrupted, you have to restart it from scratch and go over all the batches again.
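As a sketch of that idea, paging over an indexed field such as _id (the field and page size here are just examples) could look like:
import pymongo

coll = pymongo.MongoClient().get_database('db').get_collection('coll')
PAGE_SIZE = 500

last_id = None
while True:
    # Each query covers a fresh _id range, so it can be resumed from last_id.
    criteria = {'_id': {'$gt': last_id}} if last_id is not None else {}
    page = list(coll.find(criteria).sort('_id', pymongo.ASCENDING).limit(PAGE_SIZE))
    if not page:
        break
    for doc in page:
        pass  # process doc here
    last_id = page[-1]['_id']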
This is how I do it; it helps to get the data chunked up, but I thought there would be a more straightforward way to do this. I created a yield_rows function that generates and yields chunks of documents, and it ensures that the already-used chunks are deleted.
import pymongo as pm

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)

def yield_rows(cursor, chunk_size):
    """
    Generator to yield chunks from cursor
    :param cursor:
    :param chunk_size:
    :return:
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]
        chunk.append(row)
    yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass
If I find a cleaner, more efficient way to do this I'll update my answer.
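One cleaner sketch along the same lines is to slice the cursor with itertools.islice, which avoids managing the chunk list by hand:
from itertools import islice

import pymongo as pm

CHUNK_SIZE = 500
coll = pm.MongoClient().get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)

def iter_chunks(cursor, chunk_size):
    """Yield lists of up to chunk_size documents pulled from the cursor."""
    while True:
        chunk = list(islice(cursor, chunk_size))
        if not chunk:
            return
        yield chunk

for chunk in iter_chunks(cursor, CHUNK_SIZE):
    pass  # process chunk here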

PyMongo: Update, $multi:false, get _id of updated document?

When updating a document in MongoDB using a search-style update, is it possible to get back the _id of the document(s) updated?
For example:
import pymongo
client = pymongo.MongoClient('localhost', 27017)
db = client.test_database
col = db.test_col
col.insert({'name':'kevin', 'status':'new'})
col.insert({'name':'brian', 'status':'new'})
col.insert({'name':'matt', 'status':'new'})
col.insert({'name':'stephen', 'status':'new'})
info = col.update({'status':'new'}, {'$set':{'status':'in_progress'}}, multi=False)
print info
# {u'updatedExisting': True, u'connectionId': 1380, u'ok': 1.0, u'err': None, u'n': 1}
# I want to know the _id of the document that was updated.
I have multiple threads accessing the database collection and want to be able to mark a document as being acted upon. Getting the document first and then updating by Id is not a good answer, because two threads may "get" the same document before it is updated. The application is a simple asynchronous task queue (yes, I know we'd be better off with something like Rabbit or ZeroMQ for this, but adding to our stack isn't possible right now).
You can use pymongo.collection.find_and_modify. It is a wrapper around the MongoDB findAndModify command and can return either the original (by default) or the modified document.
info = col.find_and_modify({'status': 'new'}, {'$set': {'status': 'in_progress'}})
if info:
    print info.get('_id')
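Note that newer PyMongo releases deprecate find_and_modify in favor of find_one_and_update; a roughly equivalent sketch for those versions would be:
from pymongo import MongoClient, ReturnDocument

col = MongoClient('localhost', 27017).test_database.test_col

# Atomically claim one 'new' document and get the updated version back.
doc = col.find_one_and_update(
    {'status': 'new'},
    {'$set': {'status': 'in_progress'}},
    return_document=ReturnDocument.AFTER,
)
if doc:
    print(doc['_id'])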

web2py webserver - Best way to keep connection to external SQL server?

I have a simple web2py server that we use to visualize data from our PostgreSQL Server. The following functions are all part of the global models in web2py.
The current solution to fetch data is very simple. Every time I connect, and after I get the data I close the connection:
# Old way:
# (imports excluded)
def get_data(query):
    postgres_connection = psycopg2.connect("credentials")
    # Pandas function to put data from query into DataFrame
    df = psql.frame_query(query, con=postgres_connection)
    postgres_connection.close()
    return df
For small queries, opening and closing the connection takes about 9/10 of the time it takes to run the function.
Is this a good way to do it instead? If not, what is a better way?
# Better way?
def connect():
    """
    Create a connection to the server.
    """
    return psycopg2.connect("credentials")

db_connection = connect()

def create_pandas_frame(query):
    """
    Get query if connection is open.
    """
    return psql.frame_query(query, con=db_connection)

def get_data(query):
    """
    Try to get data, open a new connection if the connection is closed.
    """
    try:
        data = create_pandas_frame(query)
    except:
        global db_connection
        db_connection = connect()
        data = create_pandas_frame(query)
    return data
If you run that code in a web2py model file, you'll end up creating a new connection on each HTTP request anyway. Instead, you might consider connection pooling.
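For reference, a standalone pooling sketch outside of web2py's DAL (using psycopg2's built-in ThreadedConnectionPool and mirroring the frame_query call and "credentials" placeholder from the question) could look like:
import pandas.io.sql as psql
from psycopg2.pool import ThreadedConnectionPool

# Keep between 1 and 10 connections open and reuse them across requests.
pool = ThreadedConnectionPool(1, 10, "credentials")

def get_data(query):
    conn = pool.getconn()
    try:
        return psql.frame_query(query, con=conn)
    finally:
        pool.putconn(conn)  # return the connection to the pool instead of closing it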
An easier option might be to use the web2py DAL to fetch the data. Something like:
from pandas.core.api import DataFrame
db = DAL([db connection string], pool_size=10, migrate_enabled=False)
rows = db.executesql(query)
data = DataFrame.from_records(rows, columns=[list, of, column, names])
If you specify the pool_size argument to DAL(), it will automatically maintain a connection pool to be used across requests.
Note, I haven't tried this, so it may need some tweaking, but something along these lines should work.
If you'd like, you can even use the DAL to generate the SQL by defining database table models:
db.define_table('mytable',
Field('field1', 'integer'),
Field('field2', 'double'),
Field('field3', 'boolean'))
rows = db.executesql(db(db.mytable.id > 0)._select())
data = DataFrame.from_records(rows, columns=db.mytable.fields)
The ._select() method just generates the SQL without actually doing the select. The SQL is then passed to .executesql() to fetch the data.
An alternative is to create a special Pandas processor and pass it as the processor argument to .select().
def pandas_processor(rows, fields, columns, cacheable):
    return DataFrame.from_records(rows, columns=columns)

data = db(db.mytable.id > 0).select(processor=pandas_processor)
I used Anthony's answer and now have functions that look like this:
# In one of the models files.
from pandas.core.api import DataFrame

external_db = DAL('postgres://connection_stuff', pool_size=10, migrate_enabled=False)

def create_simple_html_table(query):
    dict_from_db = external_db.executesql(query, as_dict=True)
    return DataFrame(dict_from_db).to_html()
Then later in a view or controller a html table is created using:
# In Controller:
my_table = create_simple_html_table('select * from random_table limit 50')
# In View:
{{=XML(create_simple_html_table('select * from random_table limit 50'))}}
I still need to do more testing, but my understanding so far is that this solution lets me query the external db while web2py keeps the connection open and reuses it across requests and users.
Note that this solution is only good if all you want to do is read from and write to your Postgres server with raw SQL.
If you want to use the DAL to read and write, you need to either try the DAL alternative called MyDAL or play around with the search_path option in Postgres.

Multi-tenancy with SQLAlchemy

I've got a web-application which is built with Pyramid/SQLAlchemy/Postgresql and allows users to manage some data, and that data is almost completely independent for different users. Say, Alice visits alice.domain.com and is able to upload pictures and documents, and Bob visits bob.domain.com and is also able to upload pictures and documents. Alice never sees anything created by Bob and vice versa (this is a simplified example, there may be a lot of data in multiple tables really, but the idea is the same).
Now, the most straightforward option to organize the data in the DB backend is to use a single database, where each table (pictures and documents) has user_id field, so, basically, to get all Alice's pictures, I can do something like
user_id = _figure_out_user_id_from_domain_name(request)
pictures = session.query(Picture).filter(Picture.user_id==user_id).all()
This is all easy and simple, however there are some disadvantages
I need to remember to always use an additional filter condition when making queries, otherwise Alice may see Bob's pictures;
if there are many users, the tables may grow huge;
it may be tricky to split the web application between multiple machines.
So I'm thinking it would be really nice to somehow split the data per-user. I can think of two approaches:
Have separate tables for Alice's and Bob's pictures and documents within the same database (Postgres schemas seem to be the correct approach to use in this case):
documents_alice
documents_bob
pictures_alice
pictures_bob
and then, using some dark magic, "route" all queries to one or to the other table according to the current request's domain:
_use_dark_magic_to_configure_sqlalchemy('alice.domain.com')
pictures = session.query(Picture).all() # selects all Alice's pictures from "pictures_alice" table
...
_use_dark_magic_to_configure_sqlalchemy('bob.domain.com')
pictures = session.query(Picture).all() # selects all Bob's pictures from "pictures_bob" table
Use a separate database for each user:
- database_alice
- pictures
- documents
- database_bob
- pictures
- documents
which seems like the cleanest solution, but I'm not sure if multiple database connections would require much more RAM and other resources, limiting the number of possible "tenants".
So, the question is, does it all make sense? If yes, how do I configure SQLAlchemy to either modify the table names dynamically on each HTTP request (for option 1) or to maintain a pool of connections to different databases and use the correct connection for each request (for option 2)?
After pondering jd's answer, I was able to achieve the same result with PostgreSQL 9.2, SQLAlchemy 0.8, and the Flask 0.9 framework:
from flask import session  # Flask's session holds the current tenant id
from sqlalchemy import event
from sqlalchemy.pool import Pool

@event.listens_for(Pool, 'checkout')
def on_pool_checkout(dbapi_conn, connection_rec, connection_proxy):
    tenant_id = session.get('tenant_id')
    cursor = dbapi_conn.cursor()
    if tenant_id is None:
        cursor.execute("SET search_path TO public, shared;")
    else:
        cursor.execute("SET search_path TO t" + str(tenant_id) + ", shared;")
    dbapi_conn.commit()
    cursor.close()
OK, I've ended up modifying the search_path at the beginning of every request, using Pyramid's NewRequest event:
from pyramid import events

def on_new_request(event):
    schema_name = _figure_out_schema_name_from_request(event.request)
    DBSession.execute("SET search_path TO %s" % schema_name)

def app(global_config, **settings):
    """ This function returns a WSGI application.

    It is usually called by the PasteDeploy framework during
    ``paster serve``.
    """
    ....
    config.add_subscriber(on_new_request, events.NewRequest)
    return config.make_wsgi_app()
This works really well, as long as you leave transaction management to Pyramid (i.e. do not commit/roll back transactions manually, but let Pyramid do that at the end of the request), which is fine, as committing transactions manually is not a good approach anyway.
What works very well for me is to set the search path at the connection pool level, rather than in the session. This example uses Flask and its thread-local proxies to pass the schema name, so you'll have to change schema = current_schema._get_current_object() and the try block around it.
from sqlalchemy.interfaces import PoolListener

class SearchPathSetter(PoolListener):
    '''
    Dynamically sets the search path on connections checked out from a pool.
    '''
    def __init__(self, search_path_tail='shared, public'):
        self.search_path_tail = search_path_tail

    @staticmethod
    def quote_schema(dialect, schema):
        return dialect.identifier_preparer.quote_schema(schema, False)

    def checkout(self, dbapi_con, con_record, con_proxy):
        try:
            schema = current_schema._get_current_object()
        except RuntimeError:
            search_path = self.search_path_tail
        else:
            if schema:
                search_path = self.quote_schema(con_proxy._pool._dialect, schema) + ', ' + self.search_path_tail
            else:
                search_path = self.search_path_tail
        cursor = dbapi_con.cursor()
        cursor.execute("SET search_path TO %s;" % search_path)
        dbapi_con.commit()
        cursor.close()
At engine creation time:
engine = create_engine(dsn, listeners=[SearchPathSetter()])
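Note that the PoolListener interface is deprecated in newer SQLAlchemy releases in favor of the event API; a rough equivalent for those versions (the get_current_schema_or_none() helper is illustrative, standing in for however your framework exposes the current tenant) would be:
from sqlalchemy import create_engine, event
from sqlalchemy.pool import Pool

engine = create_engine(dsn)

@event.listens_for(Pool, 'checkout')
def set_search_path(dbapi_con, con_record, con_proxy):
    # Look up the tenant schema via an illustrative helper; adapt to your framework.
    schema = get_current_schema_or_none()
    search_path = (schema + ', shared, public') if schema else 'shared, public'
    cursor = dbapi_con.cursor()
    cursor.execute("SET search_path TO %s;" % search_path)
    dbapi_con.commit()
    cursor.close()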
