I have 400 million lines of unique key-value data that I would like to have available for quick lookups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I'm not sure whether there is a way to keep the dictionary on disk so that it doesn't use a lot of memory except during dictionary creation.
pickled dictionary object: not sure if this is an optimal solution for my problem
NoSQL-type databases: ideally I want something with minimal dependencies on third-party stuff, plus the keys and values are simply numbers. If you feel this is still the best option, I would like to hear that too. Maybe it will convince me.
Please let me know if anything is not clear.
Thanks!
-Abhi
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built in support for sqlite3, which gives you an easy database solution backed by a file on disk.
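A minimal sketch of that approach (the table and file names here are just placeholders, not anything the standard library prescribes):
import sqlite3

# A single file-backed table of integer key/value pairs; only SQLite's
# page cache lives in RAM, not the whole data set.
conn = sqlite3.connect("kv.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

with conn:  # commit the inserts in one transaction
    conn.executemany("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)",
                     [(1, 100), (2, 200)])  # stand-in rows

# A lookup afterwards is a single indexed query.
row = conn.execute("SELECT v FROM kv WHERE k = ?", (2,)).fetchone()
print(row[0] if row else None)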
No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.
From the docs https://docs.python.org/3/library/dbm.html
import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'
    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.
I would try this before any of the more exotic forms; a pickled dict, by contrast, gets pulled into memory in its entirety when you load it.
Cheers
Tim
In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge, so you must do some testing, but shelve/BDB is probably up to it.
Note: The bsddb module has been deprecated. Possibly shelve will not support BDB hashes in the future.
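For illustration, a minimal shelve sketch (the file name and key are placeholders); note that shelve pickles each value individually, so a lookup does not load the whole file:
import shelve

# Keys must be strings; values can be any picklable object.
with shelve.open("lookup.shelf") as db:
    db["12345"] = 99

with shelve.open("lookup.shelf", flag="r") as db:  # read-only reopen
    print(db.get("12345"))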
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
Install redis-server
Start redis server
Install the redis Python package (pip install redis)
Profit.
import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify insertion script to your needs.
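At hundreds of millions of keys, per-command round trips will dominate the load time, so it may be worth batching the SETs with redis-py's pipeline. A sketch, assuming the same key=value file format (the batch size is just a guess to tune):
import redis

ds = redis.Redis(host="localhost", port=6379)

BATCH = 10000  # hypothetical batch size; tune for your setup
pipe = ds.pipeline(transaction=False)
with open("your_text_file.txt") as fh:
    for i, line in enumerate(fh, 1):
        k, _, v = line.strip().partition("=")
        pipe.set(k, v)
        if i % BATCH == 0:
            pipe.execute()  # flush this batch of SETs to the server
    pipe.execute()          # flush any remainder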
import redis

ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do lookups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis:
Configurable persistence options
Blazingly fast
Offers more than just key/value pairs (other data types)
#antirez
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
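Since both keys and values are plain numbers, the load itself could be chunked through executemany; a sketch, assuming a whitespace-separated "key value" input file (the file and table names are placeholders):
import sqlite3
from itertools import islice

def pairs(path):
    # Assumed input format: "<key> <value>" per line, both integers.
    with open(path) as fh:
        for line in fh:
            k, v = line.split()
            yield int(k), int(v)

conn = sqlite3.connect("kv.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

it = pairs("data.txt")
while True:
    chunk = list(islice(it, 100000))   # load in 100k-row transactions
    if not chunk:
        break
    with conn:
        conn.executemany("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", chunk)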
I personally use LMDB and its Python binding for a DB with a few million records.
It is extremely fast even for a database larger than the RAM.
It's embedded in the process so no server is needed.
Dependencies are managed with pip.
The only downside is that you have to specify the maximum size of the DB. LMDB is going to mmap a file of this size. If it's too small, inserting new data will raise an error; too large, and you create a sparse file.
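A minimal sketch with the lmdb binding (the path and map_size are assumptions; keys and values are bytes):
import lmdb

# map_size is the maximum database size; the file stays sparse until used.
env = lmdb.open("kv.lmdb", map_size=50 * 1024**3)

with env.begin(write=True) as txn:   # write transaction
    txn.put(b"12345", b"99")

with env.begin() as txn:             # read-only transaction
    print(txn.get(b"12345"))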
I scraped a large amount of data from a database and saved it as "first_database.db" using Python's shelve module (I'm using Python 3.4). I've had problems with shelve before (see my old issues), which IIRC were probably due to something relating to my ancient OS (OSX 10.9.4) and gdbm/dbm.gnu.
Now I have a more intractable problem: I made a new file that's ~170 MB, and I can only access a single key/value, no matter what.
I know the superset of possible keys, and trying to access any of them gives me a KeyError (except for one). When I save the value of the single key that doesn't return a KeyError as a new shelve database, its size is only 16 KB, so I know the data is in the 170 MB file, but I can't access it.
Am I just screwed?
Furthermore, I have made a copy of the database and tried to add more keys to it (~95). That database will say that it has three keys, but when I try to access the value of the third one, I get the following error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/shelve.py", line 114, in __getitem__
value = Unpickler(f).load()
_pickle.UnpicklingError: invalid load key, ''.
I don't know the issue, but maybe this alternative might help you:
https://github.com/dagnelies/pysos
It's like shelve but does not rely on an underlying implementation and saves its data in plain text. That way, you could even open the DB file to inspect its content if something unexpected occurs.
Note also that shelve relies on an underlying dbm implementation. That means that if you saved your shelve on Linux, you might not be able to read it on a Mac, for instance, if its dbm implementation differs (there are several of them).
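Before assuming the data is lost, it may also be worth checking which dbm backend actually wrote the file; the standard library can tell you (the file name below is the one from the question):
import dbm

# Prints e.g. 'dbm.gnu', 'dbm.ndbm', 'dbm.dumb', '' (unrecognized) or None (unreadable).
print(dbm.whichdb("first_database.db"))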
I'm using Python 2.7.8 and pymongo 2.7,
and the MongoDB server is a replica set group: one primary, two secondaries.
The MongoDB server is built on an AWS server with EBS: 500 GB, 3000 IOPS.
I want to know whether there is any way to speed up the insert when W=2, j=True.
Using pymongo to insert a million files takes a lot of time,
and I know that if I use W=0 it will speed things up, but it isn't safe.
So, any suggestions? Please help me, thanks.
Setting W=0 is deprecated. This is the older model of MongoDB (pre-3.0), which they don't recommend using any more.
Using MongoDB as a file storage system also isn't a great idea; but, you can consider using GridFS if that's the case.
I assume you're trying some sort of mass-import, and you don't have many (or any) readers right now; in which case, you will be okay if any reader sees some, but not all, of the documents.
You have a couple of options:
set j=False. MongoDB will return more quickly (before the documents are committed to the journal), at the potential risk of documents being lost if the DB crashes.
set W=1. If replication is slow, this will only wait until one of the nodes (the primary) has the data before returning.
If you do need strong consistency requirements (readers seeing everything inserted so far), neither of these options will help.
You can use unordered or ordered bulk inserts
This speeds things up a lot. Maybe also take a look at my muBulkOps, a wrapper around pymongo bulk operations.
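For illustration, here is what an unordered batch insert might look like with a current pymongo (3.x insert_many; under pymongo 2.7 you would use the bulk-operation API instead). The URI, database and collection names, and the documents are placeholders:
from itertools import islice
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # assumed URI
coll = client.mydb.mycoll

def docs():
    for i in range(1000000):              # stand-in for your million documents
        yield {"k": i, "v": i * 2}

it = docs()
while True:
    batch = list(islice(it, 10000))       # insert in 10k-document batches
    if not batch:
        break
    coll.insert_many(batch, ordered=False)  # unordered: one failure doesn't stop the batch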
I'm trying to write data collected from a data acquisition system to locations in memory, and then asynchronously perform further processing on the data, or write it out to file for offline processing. I'm structuring it this way to isolate data acquisition from data analysis and transmittal, buying us some flexibility for future expansion and improvement, but it is definitely more complex than simply writing the data directly to a file.
Here is some exploratory code I wrote.
#io.BufferedRWPair test
from io import BufferedRWPair
# Samples of instrumentation data to be stored in RAM
test0 = {'Wed Aug 1 16:48:51 2012': ['20.0000', '0.0000', '13.5', '75.62', '8190',
'1640', '240', '-13', '79.40']}
test1 = {'Wed Aug 1 17:06:48 2012': ['20.0000', '0.0000', '13.5', '75.62', '8190',
'1640', '240', '-13', '79.40']}
# Attempt to create a RAM-resident object into which to read the data.
data = BufferedRWPair(' ', ' ', buffer_size=1024)
data.write(test0)
data.write(test1)
print data.getvalue()
data.close()
There are a couple of issues here (maybe more!):
-> 'data' is a variable name that picks up a construct (outside of Python) that I'm trying to assemble -- which is an array-like structure that should hold sequential records with each record containing several process data measurements, prefaced by a timestamp that can serve as a key for retrieval. I offered this as background to my design intent, in case the code was too vague to reflect my true questions.
-> This code does not work, because the 'data' object is not being created. I'm just trying to open an empty buffer, to be filled later, but Python is looking for two objects, one readable, one writeable, which are not present in my code. Because of this, I'm not sure I'm even using the right construct, which leads to these questions:
Is io.BufferedRWPair the best way to deal with this data? I've tried StringIO, since I'm on Python 2.7.2, but no luck. I like the idea of a record with a timestamp key, hence my choice of the dict structure, but I'd sure look at alternatives. Are there other io classes I should look at instead?
One alternative I've looked at is the DataFrame construct from the NumPy/SciPy/pandas world. It looks interesting, but it seems to require a lot of additional modules, so I've shied away from it. I have no experience with any of those modules -- should I be looking at these more complex modules to get what I need?
I'd welcome any suggestions or feedback, folks... Thanks for checking out this question!
If I understand what you are asking, using an in-memory SQLite database might be the way to go. SQLite allows you to create a fully functioning SQL database entirely in memory. Instead of reads and writes you would do selects and inserts.
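A minimal sketch of that idea, reusing the timestamp-keyed records from the question (the table and column names are placeholders):
import sqlite3

conn = sqlite3.connect(":memory:")  # the database lives entirely in RAM
conn.execute("CREATE TABLE samples (ts TEXT PRIMARY KEY, readings TEXT)")

# Store each acquisition record under its timestamp.
conn.execute("INSERT INTO samples VALUES (?, ?)",
             ("Wed Aug 1 16:48:51 2012", "20.0000,0.0000,13.5,75.62"))

for ts, readings in conn.execute("SELECT ts, readings FROM samples"):
    print(ts, readings)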
Writing a mechanism to hold data in memory while it fits and only write it to a file if necessary is redundant – the operating system does this for you anyway. If you use a normal file and access it from the different parts of your application, the operating system will keep the file contents in the disk cache as long as enough memory is available.
If you want to have access to the file by memory addresses, you can memory-map it using the mmap module. However, my impression is that all you need is a standard database, or one of the simpler alternatives offered by the Python standard library, such as the shelve and anydbm modules.
Based on your comments, also check out key-value stores like Redis and memcached.
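For the memory-mapping route mentioned above, a minimal sketch with the standard mmap module (the file name is a placeholder and the file is assumed to be non-empty):
import mmap

with open("acquisition.log", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # map the whole file into memory
    print(mm[:16])                  # address it by byte offset, like a bytearray
    mm.seek(0)
    print(mm.readline())            # or use file-like reads
    mm.close()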
Please bear with me as I explain the problem, how I tried to solve it,
and my question on how to improve it is at the end.
I have a 100,000 line csv file from an offline batch job and I needed to
insert it into the database as its proper models. Ordinarily, a fairly straightforward load like this can be handled trivially by munging the CSV file to fit a schema; but I had to do some external processing that requires querying, and it's just much more convenient to use SQLAlchemy to generate the data I want.
The data I want here is 3 models that represent 3 pre-existing tables
in the database, and each subsequent model depends on the previous model.
For example:
Model C --> Foreign Key --> Model B --> Foreign Key --> Model A
So, the models must be inserted in the order A, B, and C. I came up
with a producer/consumer approach:
- instantiate a multiprocessing.Process which contains a
threadpool of 50 persister threads that have a threadlocal
connection to a database
- read a line from the file using the csv DictReader
- enqueue the dictionary to the process, where each thread creates
the appropriate models by querying the right values and each
thread persists the models in the appropriate order
This was faster than a non-threaded read/persist but it is way slower than
bulk-loading a file into the database. The job finished persisting
after about 45 minutes. For fun, I decided to write it in SQL
statements; it took 5 minutes.
Writing the SQL statements took me a couple of hours, though. So my
question is, could I have used a faster method to insert rows using
SQLAlchemy? As I understand it, SQLAlchemy is not designed for bulk
insert operations, so this is less than ideal.
This leads to my next question: is there a way to generate the SQL statements using SQLAlchemy, throw
them in a file, and then just use a bulk-load into the database? I
know about str(model_object) but it does not show the interpolated
values.
I would appreciate any guidance for how to do this faster.
Thanks!
Ordinarily, no, there's no way to get the query with the values included.
What database are you using, though? Because a lot of databases do have some bulk-load feature for CSV available.
Postgres: http://www.postgresql.org/docs/8.4/static/sql-copy.html
MySQL: http://dev.mysql.com/doc/refman/5.1/en/load-data.html
Oracle: http://www.orafaq.com/wiki/SQL*Loader_FAQ
If you're willing to accept that certain values might not be escaped correctly, then you can use this hack I wrote for debugging purposes:
'''Replace the parameter placeholders with values'''
params = compiler.params.items()
params.sort(key=lambda (k, v): len(str(k)), reverse=True)
for k, v in params:
    '''Some types don't need escaping'''
    if isinstance(v, (int, long, float, bool)):
        v = unicode(v)
    else:
        v = "'%s'" % v
    '''Replace the placeholders with values
    Works both with :1 and %(foo)s type placeholders'''
    query = query.replace(':%s' % k, v)
    query = query.replace('%%(%s)s' % k, v)
First, unless you actually have a machine with 50 CPU cores, using 50 threads/processes won't help performance -- it will actually make things slower.
Second, I've a feeling that if you used SQLAlchemy's way of inserting multiple values at once, it would be much faster than creating ORM objects and persisting them one-by-one.
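For reference, a sketch of that multi-row style using SQLAlchemy Core (the table, columns, and connection URL are placeholders); passing a list of dicts to a single execute() lets the DBAPI use executemany under the hood:
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine("sqlite:///example.db")   # assumed connection URL
metadata = MetaData()
model_a = Table("model_a", metadata,
                Column("id", Integer, primary_key=True),
                Column("name", String))
metadata.create_all(engine)

rows = [{"name": "row-%d" % i} for i in range(100000)]  # stand-in for the CSV rows

with engine.begin() as conn:                     # one transaction for the whole batch
    conn.execute(model_a.insert(), rows)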
I would venture to say the time spent in the python script is in the per-record upload portion. To determine this you could write to CSV or discard the results instead of uploading new records. This will determine where the bottleneck is; at least from a lookup-vs-insert standpoint. If, as I suspect, that is indeed where it is you can take advantage of the bulk import feature most DBS have. There is no reason, and indeed some arguments against, inserting record-by-record in this kind of circumstance.
Bulk imports tend to do some interesting optimizations, such as running as one transaction without commits for each record (even just doing this could produce an appreciable drop in run time); whenever feasible I recommend a bulk insert for large record counts. You could still use the producer/consumer approach, but have the consumer instead store the values in memory or in a file and then call the bulk import statement specific to the DB you are using. This might be the route to go if you need to do processing for each record in the CSV file. If so, I would also consider how much of that can be cached and shared between records.
It is also possible that the bottleneck is SQLAlchemy itself. Not that I know of any inherent issues, but given what you are doing, it might be requiring a lot more processing than is necessary, as evidenced by the 8x difference in run times.
For fun, since you already know the SQL, try using a direct DBAPI module in Python to do it and compare run times.
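A direct-DBAPI version of the same insert, for timing against the ORM path (sqlite3 is used here as a stand-in for whichever driver you're actually on; the file, table, and column names are placeholders):
import csv
import sqlite3

conn = sqlite3.connect("example.db")
with open("batch.csv") as fh, conn:  # the connection context manager commits once at the end
    reader = csv.DictReader(fh)
    conn.executemany("INSERT INTO model_a (name) VALUES (?)",
                     ((row["name"],) for row in reader))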
I have a database into which I regularly need to import large amounts of data via some Python scripts. Compacted, the data for a single month's imports takes about 280 MB, but during the import the file size swells to over a GB.
Given the 2 GB size limit on MDB files, this is a bit of a concern. Apart from breaking the inserts into chunks and compacting in between each, are there any techniques for avoiding the increase in file size?
Note that no temporary tables are being created/deleted during the process: just inserts into existing tables.
And to forestall the inevitable comments: yes, I am required to store this data in Access 2003. No, I can't upgrade to Access 2007.
If it could help, I could preprocess in sqlite.
Edit:
Just to add some further information (some already listed in my comments):
The data is being generated in Python on a table by table basis, and then all of the records for that table batch inserted via odbc
All processing is happening in Python: all the mdb file is doing is storing the data
All of the fields being inserted are valid fields (none are being excluded due to unique key violations, etc.)
Given the above, I'll be looking into how to disable row level locking via odbc and considering presorting the data and/or removing then reinstating indexes. Thanks for the suggestions.
Any further suggestions still welcome.
Are you sure row locking is turned off? In my case, turning off row locking reduced bloat by over 100 megs when working on a 5 meg file (in other words, the file barely grew after turning off row locking, to about 6 megs). With row locking on, the same operation results in a file well over 100 megs in size.
Row locking is a HUGE source of bloat during recordset operations since it pads each record to a page size.
Do you have MS Access installed here, or are you just using JET? (JET is the data engine that MS Access uses; you can use JET without Access.)
Open the database in ms-access and go:
Tools->options
On the advanced tab, un-check the box:
[ ] Open databases using record level locking.
This will not only make a HUGE difference in the file growth (bloat), it will also speed things up by a factor of 10 times.
There is also a registry setting that you can use here.
And, are you using ODBC or an OLEDB connection?
You can try:
Set rs = New ADODB.Recordset
With rs
    .ActiveConnection = RsCnn
    .Properties("Jet OLEDB:Locking Granularity") = 1
End With
Try changing the setting from Access, exit, re-enter, and then compact and repair. Then run your test import… the bloat issue should go away.
There is likely no need to open the database using row locking. If you turn off that feature, then you should be able to reduce the bloat in file size down to a minimum.
For further reading and an example, see here:
Does ACEDAO support row level locking?
One thing to watch out for is records which are present in the append queries but aren't inserted into the data due to duplicate key values, null required fields, etc. Access will allocate the space taken by the records which aren't inserted.
About the only significant thing I'm aware of is to ensure you have exclusive access to the database file. Which might be impossible if doing this during the day. I noticed a change in behavior from Jet 3.51 (used in Access 97) to Jet 4.0 (used in Access 2000) when the Access MDBs started getting a lot larger when doing record appends. I think that if the MDB is being used by multiple folks then records are inserted once per 4k page rather than as many as can be stuffed into a page. Likely because this made index insert/update operations faster.
Now compacting does indeed put as many records in the same 4k page as possible but that isn't of help to you.
A common trick, if feasible with regard to the schema and semantics of the application, is to have several MDB files with Linked tables.
Also, the way the insertions take place matters with regards to the way the file size balloons... For example: batched, vs. one/few records at a time, sorted (relative to particular index(es)), number of indexes (as you mentioned readily dropping some during the insert phase)...
Tentatively, a pre-processing approach would be to store the new rows in a separate linked table, heap fashion (no indexes), then sort/index this data in a minimal fashion, and "bulk load" it to its real destination. Similar pre-processing in SQLite (as hinted in the question) would serve the same purpose. Keeping it "all MDB" is maybe easier (fewer languages/processes to learn, fewer inter-op issues [hopefully ;-)])...
EDIT: on why inserting records in a sorted/bulk fashion may slow down the MDB file's growth (question from Tony Toews)
One of the reasons for MDB files' propensity to grow more quickly than the rate at which text/data added to them (and their counterpart ability to be easily compacted back down) is that as information is added, some of the nodes that constitute the indexes have to be re-arranged (for overflowing / rebalancing etc.). Such management of the nodes seems to be implemented in a fashion which favors speed over disk space and harmony, and this approach typically serves simple applications / small data rather well. I do not know the specific logic in use for such management but I suspect that in several cases, node operations cause a particular node (or much of it) to be copied anew, and the old location simply being marked as free/unused but not deleted/compacted/reused. I do have "clinical" (if only a bit outdated) evidence that by performing inserts in bulk we essentially limit the number of opportunities for such duplication to occur and hence we slow the growth.
EDIT again: After reading and discussing things with Tony Toews and Albert Kallal, it appears that a possibly more significant source of bloat, in particular in Jet Engine 4.0, is the way locking is implemented. It is therefore important to set the database to single-user mode to avoid this. (Read Tony's and Albert's responses for more details.)
Is your script executing a single INSERT statement per row of data? If so, pre-processing the data into a text file of many rows that could then be inserted with a single INSERT statement might improve the efficiency and cut down on the accumulating temporary crud that's causing it to bloat.
You might also make sure the INSERT is being executed without transactions. Whether or not that happens implicitly depends on the Jet version and the data interface library you're using to accomplish the task. By explicitly making sure it's off, you could improve the situation.
Another possibility is to drop the indexes before the insert, compact, run the insert, compact, re-instate the indexes, and run a final compact.
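If the Python side is going through pyodbc, batching each table's rows into one executemany call inside a single explicit transaction is one way to apply the advice above; a sketch (the connection string, table, and columns are placeholders, and whether it actually curbs Jet's bloat still needs testing):
import pyodbc

conn_str = (r"DRIVER={Microsoft Access Driver (*.mdb)};"
            r"DBQ=C:\data\imports.mdb;")           # assumed DSN-less connection
conn = pyodbc.connect(conn_str, autocommit=False)  # no implicit commit per row
cur = conn.cursor()

rows = [(1, "2012-08"), (2, "2012-08")]            # stand-in batch for one table
cur.executemany("INSERT INTO measurements (id, period) VALUES (?, ?)", rows)

conn.commit()                                      # one commit for the whole batch
conn.close()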
I find I am able to link from Access to SQLite and to run a make-table query to import the data. I used this ODBC driver: http://www.ch-werner.de/sqliteodbc/ and created a User DSN.
File --> Options --> Current Database --> check the options below:
* Use the Cache format that is compatible with Microsoft Access 2010 and later
* Clear Cache on Close
Then your file will be compacted back to its original size when it is saved.