I am using pymongo to insert and retrieve data from MongoDB, and these two operations may run simultaneously. The problem is that when I do rows = db.<collection>.find(), rows.count() returns a different value each time, since documents are being inserted at the same time. Is there some way to limit MongoDB to returning only the rows that were present when I executed the find()? I tried adding snapshot=True to find(), but the problem persists.
db.<collection>.find().count() will make an additional count command (runCommand) call to MongoDB anyway.
How about simply taking a length of your cursor, like this:
rows = db.<collection>.find()
print len(list(rows))
Note that you can't just use len(rows), since the pymongo cursor object does not support len().
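If the goal is to pin the result set to whatever existed at query time, one hedged workaround (assuming the default ObjectId _id, which is roughly insertion-ordered, and a collection called my_collection) is to bound the query by the newest _id seen at that moment:
from pymongo import MongoClient

db = MongoClient().my_database
# Capture the newest _id at this point in time.
newest = db.my_collection.find_one(sort=[("_id", -1)])
# Only documents that existed when the bound was taken are returned;
# later inserts get larger ObjectIds and are excluded.
rows = db.my_collection.find({"_id": {"$lte": newest["_id"]}})
print(len(list(rows)))
Note that snapshot=True only prevents a document from being returned twice when it moves on disk; it does not hide documents inserted while the cursor is open, which is why it did not help here.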
Hope that helps.
Related
When I use the following query:
Foo.query.filter_by(bar="bar")[:20]
Do all the objects that match the filter get loaded and I only choose the first 20, or are only the first 20 objects loaded? And if it is the latter, is there a way to load only a specific number of objects?
Thanks in advance guys.
Slicing the Query object before it has been executed does not load everything; SQLAlchemy translates the slice into LIMIT/OFFSET. If you want to be explicit about loading only a specific number of records, use the slice() method:
Foo.query.filter_by(bar="bar").slice(0, 20)
This adds a LIMIT clause to the query, so the database never returns any records outside the range. It is only when you slice an already-materialized result (for example .all()[:20]) that every matching object gets loaded first.
The [] is pure Python syntax, but when it is applied to the Query object itself (before execution) SQLAlchemy still compiles it down to LIMIT/OFFSET, so only 20 records come back from the database. It is slicing a plain list of results (for example after .all()) that returns all matching records and only then takes 20. Use limit() if you want to be explicit about returning only 20 records from the db.
See SQLAlchemy query to return only n results?
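A minimal sketch of the explicit variants, assuming a Flask-SQLAlchemy model Foo with a column bar; both emit LIMIT 20 in the generated SQL:
first_twenty = Foo.query.filter_by(bar="bar").limit(20).all()
also_twenty = Foo.query.filter_by(bar="bar").slice(0, 20).all()
# Slicing the un-executed query compiles to the same LIMIT/OFFSET,
# but executes immediately and returns a plain list:
same_again = Foo.query.filter_by(bar="bar")[:20]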
What is the most efficient way to loop through the cursor object in Pymongo?
Currently, this is what I'm doing:
list(my_db.my_collection.find())
Which converts the cursor to a list object so that I can iterate over each element. This works fine if the find() query returns a small amount of data. However, when I scale the DB to return 10 million documents, converting the cursor to a list takes forever. Instead of converting the DB result (a cursor) to a list, I tried converting the cursor to a DataFrame as below:
pd.DataFrame(my_db.my_collection.find())
which didn't give me any performance improvement.
What is the most efficient way to loop through a cursor object in python?
I haven't used pymongo to date.
But one thing I can say for certain: if you're fetching a huge amount of data by doing
list(my_db.my_collection.find())
then you should use a generator instead.
Building the whole list here increases memory usage significantly and may raise a MemoryError if it exceeds the available memory.
def get_data():
    for doc in my_db.my_collection.find():
        yield doc
Try approaches like this, which do not hold everything in memory at once.
The cursor object pymongo gives you already loads documents lazily; there is no need to do anything else.
for doc in my_db.my_collection.find():
    # process doc
    pass
The method find() returns a Cursor, which you can iterate over:
for match in my_db.my_collection.find():
    # do something
    pass
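If the round trips to the server become the bottleneck, a hedged tweak is to raise the cursor's batch size while still iterating lazily (batch_size is a standard find() option; process_document is a hypothetical placeholder):
cursor = my_db.my_collection.find(batch_size=1000)
for doc in cursor:
    # Each network round trip now returns up to 1000 documents,
    # but the loop still sees them one at a time.
    process_document(doc)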
I am using Django 1.6.8 and MongoEngine 0.8.2.
I have 2 classes, ServiceDocument and OptionDocument. ServiceDocument keeps a list of OptionDocuments. There are millions of ServiceDocuments (2.5 million +).
I want to select every ServiceDocument which has more than two OptionDocuments.
I "want" this to work, but get 0 as result:
ServiceDocument.objects.filter(options__size__gt=2).count()
This is what I get to work:
>>> ServiceDocument.objects.filter(options__size=1).count()
6582
>>> ServiceDocument.objects.filter(options__size=2).count()
2734321
>>> ServiceDocument.objects.filter(options__size=3).count()
25165
>>> ServiceDocument.objects.all().count()
2769768
Lastly, if I had fewer ServiceDocuments, or if I could get an iterator working, I could just loop through them myself; as it is, memory fills up within a few seconds and I get segfaults (I'm guessing any operation on .all() tries to collect everything in memory).
For the iterator, I tried the following without success:
iter(ServiceDocument.objects.all())
Well, I think you need a workaround here, as MongoEngine doesn't support that query (MongoDB's $size operator cannot be combined with a range comparison). What you can do is add another field, say options_length, and store the length of the options list in it. Then you can query with options_length__gt=2. The additional cost is that you need to override your model's save() method to update the length on every save, and you also need to backfill the existing records; a sketch follows below.
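A hedged sketch of that workaround, assuming OptionDocument is an embedded document (the field types are guesses; adapt them to the real schema):
from mongoengine import Document, EmbeddedDocument, EmbeddedDocumentField, IntField, ListField

class OptionDocument(EmbeddedDocument):
    # ... the real option fields go here ...
    pass

class ServiceDocument(Document):
    options = ListField(EmbeddedDocumentField(OptionDocument))
    options_length = IntField(default=0)  # denormalised counter

    def save(self, *args, **kwargs):
        # Keep the counter in sync on every save.
        self.options_length = len(self.options)
        return super(ServiceDocument, self).save(*args, **kwargs)

# After backfilling existing records, the range query becomes:
# ServiceDocument.objects.filter(options_length__gt=2).count()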
You can also read this related question.
I have generated a giant SQLite database and need to get some data out of it. I wrote a script to do so, and profiling led to the unfortunate conclusion that the write process would take approx. 3 days with the current setup. I wrote the script as simply as possible to make it as fast as possible.
I am wondering if you have some trick to speed up the whole process. The database has a unique index, but the columns I am querying against don't (because of duplicate rows in those columns).
Would it make sense to use any multi-processing Python library here?
The script would be like this:
import sqlite3

def write_from_query(db_name, table_name, condition, content_column, out_file):
    '''
    Writes contents from a SQLite database column to an output file.

    Keyword arguments:
    db_name (str): Path of the .sqlite database file.
    table_name (str): Name of the target table in the SQLite file.
    condition (str): Condition for querying the SQLite database table.
    content_column (str): Name of the column that contains the content for the output file.
    out_file (str): Path of the output file that will be written.
    '''
    # Connecting to the database file
    conn = sqlite3.connect(db_name)
    c = conn.cursor()

    # Querying the database and writing the output file
    c.execute('SELECT {} FROM {} WHERE {}'.format(content_column, table_name, condition))
    with open(out_file, 'w') as outf:
        for row in c:
            outf.write(row[0])

    # Closing the connection to the database
    conn.close()

if __name__ == '__main__':
    write_from_query(
        db_name='my_db.sqlite',
        table_name='my_table',
        condition='variable1=1 AND variable2<=5 AND variable3="Zinc_Plus"',
        content_column='variable4',
        out_file='sqlite_out.txt'
    )
Thanks for your help, I am looking forward to your suggestions!
I assume that you are running the write_from_query function for a huge number of queries.
If so, the problem is the missing indices on your filter criteria.
This results in the following: for each query you execute, SQLite will loop through the whole 50 GB of data and check whether your conditions hold true. That is VERY inefficient.
The easiest fix would be to put indices on the columns you filter on (see the sketch below).
An alternative would be to formulate fewer queries that each cover several of your cases, and then loop over that data again to split it into different files. How well this can be done, however, depends on how your data is structured.
I'm not sure about multiprocessing/threading; SQLite is not really made for concurrency, but I guess it could work out since you only read data...
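A minimal sketch of the index suggestion, assuming the table and column names from the script above; one composite index covering the three filter columns lets SQLite avoid the full scan:
import sqlite3

conn = sqlite3.connect('my_db.sqlite')
# Composite index over the columns used in the WHERE clause.
conn.execute('CREATE INDEX IF NOT EXISTS idx_filter_cols '
             'ON my_table (variable1, variable2, variable3)')
conn.commit()
conn.close()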
Either you dump the content and filter in your own program, or you add indices to all columns you use in your conditions.
Adding indices to all the columns will take a long, long time.
But for many different queries there is no alternative.
No, multiprocessing will probably not help. An SSD might, or 64 GiB of RAM. But they are not needed with indices; queries will be fast on normal disks too.
In conclusion: you created a database without creating indices for the columns you want to query. With 8 million rows this won't work.
Whilst the process of actually writing this data to a file will take a while, I would expect it to be more like minutes than days; e.g. at a 50 MB/s sequential write speed, 15 GB works out at around 5 minutes.
I suspect that the issue is with the queries / lack of indexes. I would suggest trying to build composite indexes based on the combinations of columns that you need to filter on. As you will see from the SQLite documentation, you can add as many columns as you want to an index.
Just to make you aware: adding indexes will slow down inserts/updates to your database, since every write now has to find the appropriate place in the relevant indexes as well as appending data to the table, but this is probably your only option to speed up the queries.
I will look at the unique indices! But meanwhile, another thing I just stumbled upon... Sorry for writing an answer of my own to my question here, but I thought it was better for organization...
I was thinking that the .fetchmany() / .fetchall() commands could also speed up the whole process, but I find the sqlite3 documentation on them a little brief... Would something like
with open(out_file, 'w') as outf:
    c.execute('SELECT * ...')
    results = c.fetchmany(10000)
    while results:
        for row in results:
            outf.write(row[0])
        results = c.fetchmany(10000)
make sense?
I'm writing a small program that queries results from a database (a single table). I'm using Python 3.3, SQLAlchemy and a PostgreSQL database.
result = db_session.query(Data).all()
progress = 0
for row in result:
    update_progress_bar(progress, len(result))
    do_something_with_data(row)
    progress += 1
The variable 'result' will contain a few thousand rows, and processing the data takes some time. This is why I introduced a simple progress bar to give an idea of how much time it will take.
The problem is that 30% of the total time is spent querying the database (the first line). So when I start the program, I get a big delay before my progress bar starts moving. In addition, I don't need to keep all the results in memory; I can process them separately.
Is there any way to modify the above program to get rows one by one until all rows are received, without loading everything into memory? In addition, I want to monitor the progress of querying and processing the data.
You need to just loop over the query without calling .all(), and call .yield_per() to set a batch size:
for row in db_session.query(Data).yield_per(10):
    do_something_with_data(row)
.all() indeed turns the whole result set into a list first, causing a delay if the resultset is large. Iterating over the query directly after setting .yield_per() instead fetches results as needed, provided the database API supports it.
If you wanted to know up-front how many rows will be returned, call .count() first:
result = db_session.query(Data)
count = result.count()
progress = 0
for row in result.yield_per(10):
    update_progress_bar(progress, count)
    do_something_with_data(row)
    progress += 1
.count() asks the database for an item count first.
Your database could still be pre-caching the result rows, leading to a start-up delay, even when using .yield_per(). In that case you'll need to use a windowed query to break your query up into blocks based on the range of values in one of the columns; a sketch follows below. Whether or not this will work depends on your exact table layout.
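A hedged sketch of such a windowed query, assuming Data has an integer primary key column named id (the window size of 1000 is arbitrary):
window = 1000
last_id = 0
progress = 0
count = db_session.query(Data).count()
while True:
    rows = (db_session.query(Data)
            .filter(Data.id > last_id)
            .order_by(Data.id)
            .limit(window)
            .all())
    if not rows:
        break
    for row in rows:
        update_progress_bar(progress, count)
        do_something_with_data(row)
        progress += 1
    last_id = rows[-1].id
Each window is a separate LIMIT-ed query, so the database only has to materialize one block of rows at a time.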