Avoid loading database each time the script is run - python

I wrote a Python script that loads a user/artist/playcount dataset and predicts which artists I might like. However, the database (a .tsv file I downloaded) is big, so it takes time to read it and store the information I want in a dictionary. How can I optimize this? Is there a way to preserve the loaded database so that each time I want to make predictions I don't have to load it again?
Thank you very much.

You could store and load your dictionary using the shelve module. This is likely to yield a benefit if the processing time to create the dictionary is large relative to the amount of time it takes to load it into memory - that is, if your algorithm is complicated or your dictionary is small.
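A minimal sketch of the shelve approach, assuming a hypothetical plays.tsv with tab-separated user, artist, and playcount columns, and a build_dict() helper of my own invention:

import shelve

def build_dict(tsv_path):
    # Hypothetical parser: {user: {artist: playcount}}
    data = {}
    with open(tsv_path) as f:
        for line in f:
            user, artist, playcount = line.rstrip("\n").split("\t")
            data.setdefault(user, {})[artist] = int(playcount)
    return data

with shelve.open("playcounts.shelf") as db:
    if not db:  # first run: populate the shelf; later runs just reuse it
        db.update(build_dict("plays.tsv"))
    print(len(db), "users cached")

The shelf behaves like a persistent dictionary keyed by strings, so lookups on later runs don't require re-parsing the TSV.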
If your dictionary is still going to be large, one trick you could use is to store file pointer offsets as the dictionary values. That is, when you want a dictionary value to be some information about a song (for example), instead of storing the information itself in the dictionary, store the byte offset in the TSV file where the corresponding line starts. Then, when you want to access that information, open the TSV file, seek to the offset, read a line, and parse it to construct the object representing that song. Seeks are fast, or at least much faster than reading through the whole file. Alternatively, you could use the mmap module to memory-map the file and effectively treat it as an array of bytes, which is especially useful if you know how many bytes you'll need (or at least have a reasonably low upper bound).
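A sketch of the offset-index idea (the file name and the choice of key column are assumptions):

# Build a small index: key -> byte offset of the first line for that key.
offsets = {}
with open("plays.tsv", "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0].decode("utf-8")
        offsets.setdefault(key, pos)

# Later: jump straight to a record without re-reading the whole file.
def lookup(key):
    with open("plays.tsv", "rb") as f:
        f.seek(offsets[key])
        return f.readline().decode("utf-8").rstrip("\n").split("\t")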
If you want to maintain compatibility with other systems written in other programming languages, or if you just want something human-readable, you could store your dictionary as JSON instead, using the json module. I would recommend this only if your dictionary is not too large.
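If the dictionary is moderately sized, a json sketch could look like this (reusing the hypothetical build_dict() from the shelve sketch above):

import json
import os

if os.path.exists("playcounts.json"):
    with open("playcounts.json") as f:
        data = json.load(f)              # fast path on later runs
else:
    data = build_dict("plays.tsv")       # slow path, done only once
    with open("playcounts.json", "w") as f:
        json.dump(data, f)

Note that JSON object keys are always strings, so non-string keys would need converting back after loading.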
Another solution you could try is storing the information from your dictionary in a database in the first place. Databases are organized in a way that makes accessing them fast. Python's standard library includes the sqlite3 module, which you can use to access an SQLite database; for a single local script like yours this should be fine. But if you already have a database server running, or you have special needs that make a separate database server advantageous (like multiple processes accessing the database simultaneously), you can use SQLAlchemy to store and load data in any SQL database.
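A minimal sqlite3 sketch, again assuming the hypothetical plays.tsv with three tab-separated columns:

import sqlite3

conn = sqlite3.connect("plays.db")
conn.execute("CREATE TABLE IF NOT EXISTS plays "
             "(user TEXT, artist TEXT, playcount INTEGER)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_user ON plays (user)")

# First run only: load the TSV into the table.
if conn.execute("SELECT COUNT(*) FROM plays").fetchone()[0] == 0:
    with open("plays.tsv") as f:
        conn.executemany("INSERT INTO plays VALUES (?, ?, ?)",
                         (line.rstrip("\n").split("\t") for line in f))
    conn.commit()

# Every run after that: query instead of re-parsing the whole file.
rows = conn.execute("SELECT artist, playcount FROM plays WHERE user = ?",
                    ("some_user",)).fetchall()
conn.close()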
For completeness I would also mention the pickle module, which can be used to store pretty much any Python object, but I don't think you need to use it directly. There are more streamlined ways to store and load dictionary-type data.

Related

Python text file instead of dictionary

I'm working on a project where I crawl and re-organize a huge amount of data into a result text file. Previously I used a dictionary to store temporary data, but as the data volume increased the process slowed down because of memory usage, and the dictionary became unusable.
Since processing speed is not so important in my case, I'm trying to replace the dictionary with a file, but I'm not sure how I can easily move the file pointer to the appropriate position and read the required data. With a dictionary I can easily refer to any piece of data; I would like to achieve the same with a file.
I'm thinking of using mmap and writing my own functions to move the file pointer where I want. Does Python have a built-in or third-party module for such operations?
Any other theoretical approach is welcome.
I think you are now trying to reinvent a key-value database.
Maybe the easiest thing would be to check whether the sqlite3 module offers what you need. Using a ready-made database is easier than rolling your own!
Of course, sqlite3 is not a key-value DB (on the surface), so if you need something even simpler, have a look at LMDB and its Python bindings: http://lmdb.readthedocs.org/en/release/
It is as lightweight and fast as it gets. It is probably close to the fastest way to achieve what you want.
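A tiny sketch with the lmdb Python bindings (the path and key names are made up; keys and values are raw bytes, so you encode and decode yourself):

import lmdb

env = lmdb.open("my.lmdb", map_size=2**30)   # reserve up to ~1 GiB

with env.begin(write=True) as txn:           # write transaction
    txn.put(b"some_key", b"some_value")

with env.begin() as txn:                     # read-only transaction
    print(txn.get(b"some_key"))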
It should be noted that there is no such thing as an optimal key-value database. There are several aspects to consider. At least:
Do you read a lot or write a lot?
What are the key and value sizes?
Do you need transactions/crash-proof?
Do you have duplicate keys (one key, several values)?
Do you want to have sorted keys?
Do you want to read the data out in the same order it is inserted?
What is your database size (MB, GB, TB, PB)?
Are you constrained on IO or CPU?
For example, LMDB, which I suggested above, is very good for read-intensive tasks, not so much for write-intensive ones. It offers transactions, keeps keys in sorted order, and is crash-proof (to the extent the underlying file system allows). However, if you need to write to your database often, LMDB may not be the best choice.
On the other hand, SQLite is not the perfect choice for this task, theoretically speaking. In practice, it is built into the standard Python distribution and is thus easy to use. It may provide adequate performance, and so it may still be the best choice.
There are numerous high-quality databases out there. By not mentioning them I do not mean to imply that the DBs mentioned in this answer are the only good alternatives. Most database managers have a very good reason for their existence; a few are a bit dated, but most have their own sweet spot in the application space.
The field is constantly changing: completely new databases appear, and old database systems keep being updated. This should be kept in mind when reading old benchmarks. The hardware used also has an impact; a machine with an SSD, a cloud computing instance, and a traditional machine with an HDD behave quite differently performance-wise.

Using a dictionary of dictionaries to store constants

I'm currently in the middle of making a web app in Python. It will contain and use a lot of unchanging data about several things. This data will be used in most of the app to varying degrees.
My original plan was to put this data in a database with all the other changing data, but to me this seems excessive and a potential and unnecessary choke point (as the same data will be queried multiple times / in various combinations on most page loads / interactions).
I've rewritten it so that the data is now stored in several dictionaries of dictionaries (i.e. in memory), essentially as constants, and the data is accessed through functions. The data looks a bit like this:
{
    0: {
        'some_key': 'some_value',
        'another_key': 'another_value',
        ...
    },
    ...
}
Is this memory efficient? Is there a more tried and true / pythonic / just plain better (in terms of speed, memory use, etc.) way of doing this kind of thing? Is using a database actually the best way of doing it?
There's nothing especially wrong with this approach, but I'll note some issues:
Why nested dictionaries? Why not a flat dict, or even a module filled with variables?
If these are "objecty" data, why not store them in actual objects? Again, these could live in variables in a module, or in a dict (see the sketch after this list).
Your web framework may already have a solution for this specific problem.
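For instance, a minimal sketch of the "module filled with variables" / "actual objects" idea (module and field names are invented for illustration):

# constants.py -- imported once at startup, then shared read-only
from collections import namedtuple

Thing = namedtuple("Thing", ["some_key", "another_key"])

THINGS = {
    0: Thing(some_key="some_value", another_key="another_value"),
    # ...
}

def get_thing(thing_id):
    return THINGS[thing_id]

Elsewhere in the app you would simply do from constants import get_thing.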
This seems like a perfectly sensible way of storing your reference data, so long as you have plenty of memory for the data you need to hold. I agree that this should be quicker than reading from a database, as the data will already be in memory and organized for efficient access.
If you don't want to store this data directly in your source code, you could store it in a JSON file (json.dump() to write it out to a file and json.load() to read it back in). But you would want to read it into memory when the application starts up and then keep it there, rather than going back to the file every time.
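A sketch of that pattern, assuming a hypothetical reference_data.json shaped like the dictionary above:

import json

# Loaded once at startup, then treated as a constant for the app's lifetime.
with open("reference_data.json") as f:
    REFERENCE_DATA = {int(k): v for k, v in json.load(f).items()}

def lookup(item_id, key):
    return REFERENCE_DATA[item_id][key]

(The int(k) conversion is needed because JSON object keys are always strings, while the dictionary above uses integer keys.)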

How to make a python instanced object reusable?

A couple of my Python programs aim to
format some information from a "source" text file into a hash table (hence, I'm a dict() addict ;-) ), and
use that table to modify a "target" file. My concern is that the "source" files I usually process can be very large (several GB), so they take more than 10 seconds to parse, and I need to run the program many times. In short, it feels like a waste to reload the same large file each time I need to modify a new "target".
My thought is: it would be great if I could write the dict() built from the "source" file to disk once, in a format that Python can read back much faster (I'm thinking of something close to the in-memory representation Python uses).
Is there a possibility to achieve that?
Thank you.
Yes, you can marshal the dict, or you can use pickle. For the difference between the two, especially with regard to speed, see this question.
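A sketch of the pickle route (parse_source() stands in for your existing, slow parser):

import os
import pickle

CACHE = "source.pkl"

if os.path.exists(CACHE):
    with open(CACHE, "rb") as f:
        table = pickle.load(f)              # fast reload on later runs
else:
    table = parse_source("source.txt")      # hypothetical slow parse, done once
    with open(CACHE, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)

marshal has an almost identical dump()/load() interface, but it is really intended for Python's own .pyc machinery, and its format is not guaranteed to be stable across Python versions.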
pickle is the usual solution to such things, but if you see any value in being able to edit the saved data, and if the dictionary uses only simple types such as strings and numbers (nested dictionaries or lists are also OK), you can simply write the repr() of the dictionary to a text file, then parse it back into a Python dictionary using eval() (or, better yet, ast.literal_eval()).
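A sketch of the repr()/ast.literal_eval() variant, which leaves a file you can inspect and edit by hand:

import ast

# Write: a plain-text, human-editable dump of the dictionary.
with open("table.txt", "w") as f:
    f.write(repr(table))

# Read: literal_eval() only accepts Python literals, so it's safer than eval().
with open("table.txt") as f:
    table = ast.literal_eval(f.read())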

What is the best way to store big vector to database with python?

I want to classify some text, so I have to compare it with other texts. After representing the texts as vectors, how can I store them (very big lists of float values) in an SQL database for later use?
My idea is to use the pickle module:
import pickle

vector = text_to_vector(text)
present = pickle.dumps(vector)   # serialize the list of floats to bytes
some_db.save(text_id, present)

# later
present = some_db.get(text_id)
vector = pickle.loads(present)
Is this fast and effective if I have thousands of texts?
You may find that pickle and databases don't work too well together.
Python's pickle is for serializing Python objects to a format that can then be read back into Python objects by Python. Although it's very easy to serialize with pickle, you can't* query the serialized format, and you can't* read it into a program written in another language. Check out cPickle, another Python module, for faster pickling.
Databases, on the other hand, are great for persisting data in a way that is queryable and not language-specific. But the cost is that it's generally harder to get data into and out of the database. That's why there are special tools like SQLAlchemy, and endless blog debates about the benefits/horrors of object-relational mapping software.
Pickling objects and then sending them to a database such as MySQL or SQL Server is probably not a good idea. However, check out shelve, another Python module, for database-like persistence of Python objects.
So, to sum up:
use pickle or shelve if you just need to save the data for later use by a Python program
map objects to a database if you want to persist the data for general use, with the understanding that this requires more effort
performance-wise, cPickle will probably win over a database + object/relation mapping
*: at least, not without a lot of effort and/or special libraries.
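If you do go the database route, one possible sketch is to store the pickled vector as a BLOB in SQLite (the table and helper names here are made up):

import pickle
import sqlite3

conn = sqlite3.connect("vectors.db")
conn.execute("CREATE TABLE IF NOT EXISTS vectors "
             "(text_id TEXT PRIMARY KEY, data BLOB)")

def save_vector(text_id, vector):
    blob = pickle.dumps(vector, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute("INSERT OR REPLACE INTO vectors VALUES (?, ?)", (text_id, blob))
    conn.commit()

def load_vector(text_id):
    row = conn.execute("SELECT data FROM vectors WHERE text_id = ?",
                       (text_id,)).fetchone()
    return pickle.loads(row[0]) if row else None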

How to transfer a python object between two requests?

I want to process a Python dict object in batches between two requests. I was wondering what's the best way to do it.
I want to do that because my dict is big and I couldn't do the whole processing in 30 seconds.
Thanks
You can serialize your object (perhaps with pickle, though there may be more efficient and specific ways if your object's nature is well-constrained) and save the serialized byte string to the datastore and to memcache (I don't recommend using just memcache, because it just might occasionally happen that the cache is "flushed" between the two requests -- in that case, you definitely want to be able to fetch your serialized byte string from the datastore!).
memcache will do the pickling for you if you pass the original object -- but, since you need the serialized string anyway to put it in the datastore, I think it's better to do your own explicit serialization. Once you memcache.add a string, the fact that the latter gets pickled (and later unpickled on retrieval) is not a big deal -- the overhead in time and space is really quite modest.
There are limits to this approach -- you can't memcache more than 1MB per key, for example, so if your object's truly huge you need to split up the serialized bytestring onto multiple keys (and for more than a few such megabyte-slices, things get very unwieldy).
Also, of course, the first and the second request must "agree" on a key to use for the serialized data's storage and retrieval -- i.e. there must be a simple way to get that key without confusion (for example, it might be based on the name of the current user).
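A rough sketch of that pattern on App Engine's old Python 2 runtime (the model class and key names here are made up, and the exact APIs depend on your runtime):

import pickle
from google.appengine.api import memcache
from google.appengine.ext import db

class WorkInProgress(db.Model):
    data = db.BlobProperty()

def save_state(key_name, obj):
    blob = pickle.dumps(obj, 2)
    WorkInProgress(key_name=key_name, data=db.Blob(blob)).put()  # durable copy
    memcache.set(key_name, blob)                                 # fast copy

def load_state(key_name):
    blob = memcache.get(key_name)
    if blob is None:                         # cache may have been flushed
        entity = WorkInProgress.get_by_key_name(key_name)
        blob = entity.data if entity else None
    return pickle.loads(blob) if blob is not None else None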
