Currently I fetch documents by iterating through a cursor in pymongo, for example:
for d in db.docs.find():
    mylist.append(d)
For reference, performing a fetchall on the same set of data (7m records) takes around 20 seconds while the method above takes a few minutes.
Is there a faster way to read bulk data in mongo? Sorry, I'm new to mongo; please let me know if more information is needed.
Using the $natural sort bypasses the index and returns the documents in the order in which they are stored on disk, meaning that Mongo doesn't have to thrash around with random reads on your disk.
https://docs.mongodb.com/manual/reference/method/cursor.sort/#return-natural-order
Performance becomes severely degraded if you also want to use a query. You should never rely on FIFO ordering; Mongo allows itself to move documents around within its storage layer. If you don't care about the order, so be it.
This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
for d in db.docs.find().sort([('$natural', 1)]):
    mylist.append(d)
In Python, you also want to use an EXHAUST cursor type, which tells the mongo server to stream back the results without waiting for the pymongo driver to acknowledge each batch:
https://api.mongodb.com/python/current/api/pymongo/cursor.html#pymongo.cursor.CursorType.EXHAUST
Mind you, it'll never be as fast as the shell. The slowest aspect of moving data between mongo/bson->pymongo->you is UTF8 string decoding within python.
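For reference, here is a minimal sketch of the exhaust-cursor approach (the database name is hypothetical; the collection is the docs collection from the question):

from pymongo import MongoClient
from pymongo.cursor import CursorType

client = MongoClient()
db = client.mydb          # hypothetical database name

mylist = []
# EXHAUST makes the server keep streaming batches without waiting for the
# driver to issue a getMore for each one. Not supported on sharded clusters.
for d in db.docs.find({}, cursor_type=CursorType.EXHAUST):
    mylist.append(d)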
You only need to cast the cursor with the list() function:
pymongo_cursor = db.collection.find()
all_data = list(pymongo_cursor)
I need to get data for a certain period of time via the Elasticsearch API, do some customized analysis of that data in Python, and display the results on a dashboard.
There are about two hundred thousand records every 15 minutes, indexed by date.
Now I use scroll-scan to get the data, but it takes nearly a minute to fetch 200,000 records, which seems too slow.
Is there any way to process this data more quickly? And can I use something like Redis to save the results and avoid repetitive work?
Is it possible to do the analysis on the Elasticsearch side using aggregations?
Assuming you're not doing it already, you should use _source to only download the absolute minimum data required. You could also try increasing the size parameter to scan() from the default of 1000. I would expect only modest speed improvements from that, however.
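For instance, a rough sketch with the elasticsearch-py scan helper (the index name, field names, and time range below are placeholders):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

body = {
    "_source": ["field_a", "field_b"],                   # only the fields you actually need
    "query": {"range": {"date": {"gte": "now-15m"}}},    # hypothetical time filter
}

# size controls the batch size of each scroll round trip.
docs = [hit["_source"] for hit in scan(es, index="my-index", query=body, size=5000)]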
If the historical data doesn't change, then a cache like Redis (or even just a local file) could be a good solution. If the historical data can change, then you'd have to manage cache invalidation.
I've been working on a project to evaluate MongoDB speed compared to another data store. To this end I'm trying to perform a full scan over a collection I've made. I found out about the profiler, so I have that enabled and set to log every query. I have a collection of a million objects, and I'm trying to time how long it takes to scan the collection. Unfortunately when I run
db.sampledata.find()
it returns immediately with a cursor to 1000 or so objects. So I wrote a python script to iterate through the cursor to handle all results. Here it is:
from pymongo import MongoClient
client = MongoClient()
db = client.argocompdb
data = db.sampledata
count = 0
my_info = data.find()
for row in my_info:
    count += 1
print count
This seems to be taking the requisite time. However, when I check the profiler, there's no overall figure for the full query time; it's just a whole whack of "getmore" ops that take 3-6 millis each. Is there any way to do what I'm trying to do using the profiler instead of timing it in python? I essentially just want to:
Be able to execute a query and have it return all results, instead of just the few in the cursor.
Get the time for the "full query" in the profiler: the time it took to get all results.
Is what I want to do feasible?
I'm very new to MongoDB so I'm very sorry if this has been asked before but I couldn't find anything on it.
The profiler is measuring the correct thing. The Mongo driver is not returning all the records in the collection at once; it is first giving you a cursor, and then feeding the documents one by one as you iterate through the cursor. So the profiler is measuring exactly what is being done.
And I argue that this is a more correct metric than the one you are seeking, which I believe is the time that it takes to actually read all the documents into your client. You actually don't want the Mongo driver to read all the documents into memory before returning. No application would perform well if written that way, except for the smallest of collections. It's much faster for a client to read documents on demand, so that the total memory footprint stays as small as possible.
Also, what are you comparing this against? If you are comparing to a relational database, then it matters a great deal what your schema is in the relational DB, and what your collections and documents look like in Mongo. And of course, how each is indexed. Different choices can produce very different performance results, at no fault of the database engine.
The simplest, and therefore fastest, operations in Mongo will probably be lookups of tiny documents retrieved by their _id, which is always indexed: db.collection.find({_id: ...}). If you really want to measure a linear scan, then the smaller the documents are, the faster the scan will be. But really, this isn't very useful, as it basically only measures how quickly the server can read data from disk.
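That said, if you want a single server-side number for the scan, one option is to sum the profiler's per-operation timings for that collection. This is only a rough sketch, assuming the namespace from the question and the standard millis/op fields in system.profile (field values can vary slightly between server versions):

from pymongo import MongoClient

client = MongoClient()
db = client.argocompdb

# Sum the durations the profiler recorded for this collection's query and
# getmore operations; this approximates the total server-side scan time.
pipeline = [
    {"$match": {"ns": "argocompdb.sampledata", "op": {"$in": ["query", "getmore"]}}},
    {"$group": {"_id": None, "total_millis": {"$sum": "$millis"}}},
]
print(list(db.system.profile.aggregate(pipeline)))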
I have a table that looks like this (snapshot from SequelPro):
It contains ~56M rows. They have been indexed with uniprot_id and gene_symbol as keys.
What I want to do is to perform the following simple SQL query:
SELECT uniprot_id, gene_symbol
FROM id_mapping_uniprotkb
And later store them into Python's dictionary.
The problem is that the above SQL query takes a very long time to finish.
What is the best way to speed it up?
Here is my Python code:
import MySQLdb as mdb
import MySQLdb.cursors
condb = mdb.connect(host="db01.foo.com.jp",
                    user="coolguy",
                    passwd="xxxx",
                    db="CISBP_DB",
                    cursorclass=mdb.cursors.SSCursor)
crsr = condb.cursor()
sql = """
SELECT uniprot_id, gene_symbol
FROM id_mapping_uniprotkb
"""
crsr.execute(sql)
rows = crsr.fetchall()
#gene_symbol2uniprot = dict(crsr.fetchall())
gene_symbol2uniprot = {}
for uniprotid, gene_symbol in rows:
    gene_symbol2uniprot[gene_symbol] = uniprotid
# do something with gene_symbol2uniprot
# with other process.
Transferring 56 million records across the wire is never going to be quick, especially if the records are more than a few bytes each. The database shouldn't add much overhead, but it's not giving you any value here either.
Normally I'd say you never need to do what you're trying, but I'm guessing from the table names that this is something to do with genomes and proteins, so it's possible you're doing something that really does require the data. A little clarification in the question would help us answer more usefully.
Anyway...
The database is designed to filter and sort massive data sets down to a manageable size. Since you're always fetching every record, you'd find it a lot faster to store the data in a compressed format on disk. The CPU overhead of decompression will be more than offset by the reduced time to read from disk in all but the smallest data sets.
If you're stuck using MySQL for the time being, you should enable protocol compression. This will reduce the size of the data going over the wire and should speed things up at the expense of CPU. The same may apply to compressing the table on disk, but that will come down to how beefy your SQL server is, how much data it can fit into cache, how recently the table was accessed, and a host of other details.
A better solution would be to read the records from the database in (say) 1 million record chunks, pickle and/or zip them and write them to disk locally (preferably on an SSD). After you've done this process once, you can deserialize the local copy which should be considerably faster than using a remote database.
Edit:
I thought I should add that if you don't need all of the records in memory at the same time, then there's no reason you can't page through the results (SELECT a, b, c FROM x LIMIT 200, 50 would get you records #200-249).
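As a rough sketch of the chunked-fetch-and-cache idea above (the chunk size and file names are placeholders; the connection details are the ones from the question):

import pickle
import MySQLdb as mdb

condb = mdb.connect(host="db01.foo.com.jp", user="coolguy",
                    passwd="xxxx", db="CISBP_DB")
crsr = condb.cursor()

chunk_size = 1000000
offset = 0
part = 0
while True:
    crsr.execute("SELECT uniprot_id, gene_symbol FROM id_mapping_uniprotkb "
                 "LIMIT %s OFFSET %s", (chunk_size, offset))
    rows = crsr.fetchall()
    if not rows:
        break
    # Serialize each chunk locally; later runs can just unpickle these files.
    with open("id_mapping.part%d.pkl" % part, "wb") as fh:
        pickle.dump(rows, fh, protocol=pickle.HIGHEST_PROTOCOL)
    offset += chunk_size
    part += 1

Note that OFFSET gets slower as it grows; if uniprot_id is an indexed, ordered key, paging with WHERE uniprot_id > last_seen ... LIMIT n is usually faster for deep pages.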
People often get used to loading entire databases in core; this makes for very easy coding until you reach the limit of what you can comfortably fit in real (not virtual) memory.
As soon as keeping-it-all-in-core exhausts some resource, performance plummets. Do you really need fetchall() or could you use for row in cursor: instead? Put another way, if you can't select, reduce, or somehow summarize the contents of the database, you're not doing useful computation with the data.
(This is just a blunt summary of various comments above.)
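For example, a minimal sketch of the streaming approach, reusing the SSCursor the question already configures and building the dict incrementally instead of calling fetchall():

import MySQLdb as mdb
import MySQLdb.cursors

condb = mdb.connect(host="db01.foo.com.jp", user="coolguy", passwd="xxxx",
                    db="CISBP_DB", cursorclass=mdb.cursors.SSCursor)
crsr = condb.cursor()
crsr.execute("SELECT uniprot_id, gene_symbol FROM id_mapping_uniprotkb")

gene_symbol2uniprot = {}
for uniprot_id, gene_symbol in crsr:   # rows stream from the server one batch at a time
    gene_symbol2uniprot[gene_symbol] = uniprot_id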
I have 400 million lines of unique key-value info that I would like to have available for quick lookups in a script. I am wondering what would be a slick way of doing this. I did consider the following options, but I'm not sure whether there is a way to disk-map the dictionary so that it doesn't use a lot of memory except during dictionary creation.
pickled dictionary object: not sure if this is an optimal solution for my problem
NoSQL-type databases: ideally I want something with minimal dependency on third-party stuff, plus the key-values are simply numbers. If you feel this is still the best option, I would like to hear that too; maybe it will convince me.
Please let me know if anything is not clear.
Thanks!
-Abhi
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built-in support for sqlite3, which gives you an easy database solution backed by a file on disk.
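For instance, a minimal sketch of a file-backed key/value table with the stdlib sqlite3 module (the file and table names are placeholders; since the question says the keys and values are simply numbers, integers are used here):

import sqlite3

conn = sqlite3.connect("kv.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

# Bulk-load in one transaction; executemany keeps this reasonably fast.
with conn:
    conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)",
                     [(1, 100), (2, 200), (3, 300)])

# Lookups hit the on-disk index rather than an in-memory dict.
row = conn.execute("SELECT v FROM kv WHERE k = ?", (2,)).fetchone()
print(row[0])   # 200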
No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.
From the docs https://docs.python.org/3/library/dbm.html
import dbm
# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.
I would try this before any of the more exotic options; note that shelve/pickle will pull everything into memory on loading.
Cheers
Tim
In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge so you must do some testing, but shelve/BDB is probably up to it.
Note: The bsddb module has been deprecated. Possibly shelve will not support BDB hashes in future.
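A tiny sketch of the shelve approach (the filename is a placeholder; keys must be strings, values can be any picklable object):

import shelve

with shelve.open('mapping.db') as db:
    db['12345'] = 67890              # value is pickled into the backing dbm file
    print(db.get('12345'))           # 67890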
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
Install redis-server
Start redis server
Install redis python package (pip install redis)
Profit.
import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify insertion script to your needs.
import redis

ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do lookups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis:
Configurable persistence options
Blazingly fast
Offers more than just key / value pairs (other data types)
#antirez
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
I personally use LMDB and its Python binding for a DB of a few million records.
It is extremely fast even for a database larger than the RAM.
It's embedded in the process so no server is needed.
Dependencies are managed using pip.
The only downside is that you have to specify the maximum size of the DB up front. LMDB is going to mmap a file of this size. If it is too small, inserting new data will raise an error; too large, and you create a sparse file.
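A minimal sketch with the lmdb package (pip install lmdb); the path and map_size below are placeholders:

import lmdb

env = lmdb.open('kv.lmdb', map_size=10 * 1024**3)   # reserve up to ~10 GB

# Writes go through a transaction; the context manager commits on success.
with env.begin(write=True) as txn:
    txn.put(b'12345', b'67890')                      # keys and values are bytes

# Reads are cheap and served from the memory-mapped file.
with env.begin() as txn:
    print(txn.get(b'12345'))                         # b'67890'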
Please bear with me as I explain the problem and how I tried to solve it; my question on how to improve it is at the end.
I have a 100,000 line csv file from an offline batch job and I needed to
insert it into the database as its proper models. Ordinarily, if this is a fairly straight-forward load, this can be trivially loaded by just munging the CSV file to fit a schema; but, I had to do some external processing that requires querying and it's just much more convenient to use SQLAlchemy to generate the data I want.
The data I want here is 3 models that represent 3 pre-existing tables in the database, and each subsequent model depends on the previous model.
For example:
Model C --> Foreign Key --> Model B --> Foreign Key --> Model A
So, the models must be inserted in the order A, B, and C. I came up with a producer/consumer approach:
- instantiate a multiprocessing.Process which contains a threadpool of 50 persister threads that have a threadlocal connection to a database
- read a line from the file using the csv DictReader
- enqueue the dictionary to the process, where each thread creates the appropriate models by querying the right values and each thread persists the models in the appropriate order
This was faster than a non-threaded read/persist, but it is way slower than bulk-loading a file into the database. The job finished persisting after about 45 minutes. For fun, I decided to write it in SQL statements; that took 5 minutes.

Writing the SQL statements took me a couple of hours, though. So my question is, could I have used a faster method to insert rows using SQLAlchemy? As I understand it, SQLAlchemy is not designed for bulk insert operations, so this is less than ideal.

This leads to my question: is there a way to generate the SQL statements using SQLAlchemy, throw them in a file, and then just use a bulk load into the database? I know about str(model_object), but it does not show the interpolated values.
I would appreciate any guidance for how to do this faster.
Thanks!
Ordinarily, no, there's no way to get the query with the values included.
What database are you using, though? Because a lot of databases do have some bulk-load feature for CSV available.
Postgres: http://www.postgresql.org/docs/8.4/static/sql-copy.html
MySQL: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Oracle: http://www.orafaq.com/wiki/SQL*Loader_FAQ
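If you happen to be on Postgres, for example, here is a rough sketch of driving COPY from Python with psycopg2 (the DSN, file, table, and column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")    # hypothetical connection string
cur = conn.cursor()

with open("model_a.csv") as fh:
    # Stream the CSV straight into the table via the server's bulk loader.
    cur.copy_expert("COPY model_a (name, parent_id) FROM STDIN WITH CSV HEADER", fh)

conn.commit()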
If you're willing to accept that certain values might not be escaped correctly, then you can use this hack I wrote for debugging purposes:
def interpolate_params(query, compiler):
    '''Replace the parameter placeholders with values (debugging only).'''
    # Longest parameter names first, so ':foobar' is replaced before ':foo'.
    params = sorted(compiler.params.items(),
                    key=lambda kv: len(str(kv[0])), reverse=True)
    for k, v in params:
        # Some types don't need escaping
        if isinstance(v, (int, float, bool)):
            v = str(v)
        else:
            v = "'%s'" % v
        # Replace the placeholders with values.
        # Works both with :1 and %(foo)s type placeholders.
        query = query.replace(':%s' % k, v)
        query = query.replace('%%(%s)s' % k, v)
    return query
First, unless you actually have a machine with 50 CPU cores, using 50 threads/processes won't help performance -- it will actually make things slower.
Second, I've a feeling that if you used SQLAlchemy's way of inserting multiple values at once, it would be much faster than creating ORM objects and persisting them one-by-one.
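For example, a hedged sketch of a multi-row insert with SQLAlchemy Core (the table, columns, and connection URL are placeholders, not the models from the question):

from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine("sqlite:///example.db")      # assumed connection URL
metadata = MetaData()
model_a = Table("model_a", metadata,
                Column("id", Integer, primary_key=True),
                Column("name", String(64)))
metadata.create_all(engine)

rows = [{"name": "alpha"}, {"name": "beta"}, {"name": "gamma"}]
with engine.begin() as conn:                        # one transaction for the whole batch
    conn.execute(model_a.insert(), rows)            # executemany under the hood

Passing a list of dicts to a single insert() lets the driver use executemany rather than one round trip per ORM object.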
I would venture to say the time spent in the python script is in the per-record upload portion. To determine this you could write to CSV or discard the results instead of uploading new records. This will determine where the bottleneck is; at least from a lookup-vs-insert standpoint. If, as I suspect, that is indeed where it is you can take advantage of the bulk import feature most DBS have. There is no reason, and indeed some arguments against, inserting record-by-record in this kind of circumstance.
Bulk imports tend to do some interesting optimization, such as doing it as one transaction without commits for each record (even just doing this could see an appreciable drop in run time); whenever feasible I recommend the bulk insert for large record counts. You could still use the producer/consumer approach, but have the consumer instead store the values in memory or in a file and then call the bulk import statement specific to the DB you are using. This might be the route to go if you need to do processing for each record in the CSV file. If so, I would also consider how much of that can be cached and shared between records.
It is also possible that the bottleneck is using SQLAlchemy. Not that I know of any inherent issues, but given what you are doing, it might be requiring a lot more processing than is necessary, as evidenced by the 8x difference in run times.
For fun, since you already know the SQL, try using a direct DBAPI module in Python to do it and compare run times.
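A rough sketch of that comparison, using the stdlib sqlite3 module so it runs anywhere (swap in your real DBAPI driver; the table and CSV file names are placeholders):

import csv
import sqlite3
import time

conn = sqlite3.connect("example.db")
conn.execute("CREATE TABLE IF NOT EXISTS model_a (name TEXT)")

with open("batch.csv", newline="") as fh:          # hypothetical input with a 'name' column
    rows = [(r["name"],) for r in csv.DictReader(fh)]

start = time.perf_counter()
with conn:                                         # a single transaction for all rows
    conn.executemany("INSERT INTO model_a (name) VALUES (?)", rows)
print("inserted %d rows in %.2fs" % (len(rows), time.perf_counter() - start))

Timing the same workload through the ORM versus this direct path should show how much of the 45 minutes was SQLAlchemy overhead and how much was per-row round trips.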