a couple of my python programs aim to
format into a hash table (hence, I'm a dict() addict ;-) ) some informations in a "source" text file, and
to use that table to modify a "target" file. My concern is that the "source" files I usually process can be very large (several GB) so it makes more than 10sec to parse, and I need to run that program a bunch of times. To conclude, I feel like it's a waste to reload the same large file each time I need to modify a new "target".
My thought is, if it would be possible to write once the dict() made from the "source" file in a way that python would be able to read/process much faster (I think about a format close to the one used in RAM by python), it would be great.
Is there a possibility to achieve that?
Thank you.
Yea, you can marshal the dict, or you can use pickle. For the difference between the two, especially as regards to speed, see this question.
pickle is the usual solution to such things, but if you see any value in being able to edit the saved data, and if the dictionary uses only simple types such as strings and numbers (nested dictionaries or lists are also OK), you can simply write the repr() of the dictionary to a text file, then parse it back into a Python dictionary using eval() (or, better yet, ast.literal_eval()).
Related
I wrote a Python script that loads an user/artist/playcount dataset and predicts which artists I might like. However, the database (a .tsv file I downloaded) is big so it takes time to read it and store the information I want in a dictionary. How can I optimize this? Is there a way to preserve the loaded database so each time I want to make predictions I don't have to load it again?
Thank you very much.
You could store and load your dictionary using the shelve module. This is likely to yield a benefit if the processing time to create the dictionary is large relative to the amount of time it takes to load it into memory - that is, if your algorithm is complicated or your dictionary is small.
If your dictionary is still going to be large, one trick you could use is to store file pointer offsets as the dictionary values. That is, when you want a dictionary value to be some information about a song (for example), instead of storing the information itself in the dictionary, store the byte offset in the TSV file where the corresponding line starts. Then, when you want to access that information, open the TSV file, seek to the offset, read a line, and parse it to construct the object representing that song. Seeks are fast, or at least much faster than reading through the whole file. Alternatively, you could use the mmap module to memory-map the file and effectively treat it as an array of bytes, which is especially useful if you know how many bytes you'll need (or at least have a reasonably low upper bound).
If you want to maintain compatibility with other systems written in other programming languages, or if you just want something human-readable, you could store your dictionary as JSON instead, using the json module. I would recommend this only if your dictionary is not too large.
Another solution you could try is just storing the information from your dictionary in a database in the first place. Databases are organized in a way that makes accessing them fast. Python's standard library includes the sqlite3 module that you can use to access an SQLite database. This should be fine. But if you already have a database server running, or you have special needs that make using a separate database server advantageous (like multiple processes accessing the database simultaneously), you can use SQLAlchemy to store and load data in any SQL database.
For completeness I would also mention the pickle module, which can be used to store pretty much any Python object, but I don't think you need to use it directly. There are more streamlined ways to store and load dictionary-type data.
In perl there was this idea of the tie operator, where writing to or modifying a variable can run arbitrary code (such as updating some underlying Berkeley database file). I'm quite sure there is this concept of overloading in python too.
I'm interested to know what the most idiomatic way is to basically consider a local JSON file as the canonical source of needed hierarchical information throughout the running of a python script, so that changes in a local dictionary are automatically reflected in the JSON file. I'll leave it to the OS to optimise writes and cache (I don't mind if the file is basically updated dozens of times throughout the running of the script), but ultimately this is just about a kilobyte of metadata that I'd like to keep around. It's not necessary to address concurrent access to this. I'd just like to be able to access a hierarchical structure (like nested dictionary) within the python process and have reads (and writes to) that structure automatically result in reads from (and changes to) a local JSON file.
well, since python itself has no signals-slots, I guess you can instead make your own dictionary class by inherit it from python dictionary. Class exactly like python dict, only in every method of it that can change dict values you will dump your json.
also you can use smth like PyQt4 QAbstractItemModel which has signals. And when it data changed signal will emitted, do your dumping - it will be only in one place, which is nice.
I know these two are sort of stupid ways, probably yea. :) If anyone knows better, go ahead and tell!
This is a developpement from aspect_mkn8rd' answer taking into account Gerrat's comments, but it is too long for a true comment.
You will need 2 special container classes emulating a list and a dictionnary. In both, you add a pointer to a top-level object and override the following methods :
__setitem__(self, key, value)
__delitem__(self, key)
__reversed__(self)
All those methods are called in modification and should have the top-level object to be written to disk.
In addition, __setitem__(self, key, value) should look if value is a list and wrap it into a special list object or if it is a dictionary, wrap it into a special dictionnary object. In both case, the method should set the top-level object into the new container. If neither of them and the object defines __setitem__, it should raise an Exception saying the object is not supported. Of course you should then modify the method to take in account this new class.
Of course, there is a good deal of code to write and test, but it should work - left to the reader as an exercise :-)
If concurrency is not required, maybe consider writing 2 functions to read and write the data to a shelf file? Our is the idea to have the dictionary" aware" of changes to update the file without this kind of thing?
I'm currently in the middle of making a web app in Python. It will contain and use a lot of unchanging data about several things. This data will be used in most of the app to varying degrees.
My original plan was to put this data in a database with all the other changing data, but to me this seems excessive and a potential and unnecessary choke point (as the same data will be queried multiple times / in various combinations on most page loads / interactions).
I've rewritten it so that the data is now stored in several dictionaries of dictionaries (i.e. in memory), essentially being constants, from which the data is accessed through functions. The data looks a bit like this:
{
0: {
'some_key': 'some_value',
'another_key': 'another_value',
...
},
...
}
Is this memory efficient? Is there a more tried and true / pythonic / just plain better (in terms of speed, memory use, etc.) way of doing this kind of thing? Is using a database actually the best way of doing it?
There's nothing especially wrong with this approach, but I'll note some issues:
Why nested dictionaries? Why not a flat dict, or even a module filled with variables?
If these are "objecty" data, why not store them in actual objects? Again, these could live in variables in a module, or in a dict.
Your web framework may already have a solution for this specific problem.
This seems like a perfectly sensible way of storing your reference data so long as you have plenty of memory for the data that you need to hold. I agree that this should be quicker than reading from a database as the data will already be in memory and sorted for efficient access.
If you don't want to store this data actually in your source code you could store it in a json file (json.dump() to write out to a file and json.load() to read back in). But you would want to read this into memory at the point of the application starting up and then just keep it in memory rather than going back to the file for it every time.
I've got an XML file I want to parse with python. What is best way to do this? Taking into memory the entire document would be disastrous, I need to somehow read it a single node at a time.
Existing XML solutions I know of:
element tree
minixml
but I'm afraid they aren't quite going to work because of the problem I mentioned. Also I can't open it in a text editor - any good tips in generao for working with giant text files?
First, have you tried ElementTree (either the built-in pure-Python or C versions, or, better, the lxml version)? I'm pretty sure none of them actually read the whole file into memory.
The problem, of course, is that, whether or not it reads the whole file into memory, the resulting parsed tree ends up in memory.
ElementTree has a nifty solution that's pretty simple, and often sufficient: iterparse.
for event, elem in ET.iterparse(xmlfile, events=('end')):
...
The key here is that you can modify the tree as it's built up (by replacing the contents with a summary containing only what the parent node will need). By throwing out all the stuff you don't need to keep in memory as it comes in, you can stick to parsing things in the usual order without running out of memory.
The linked page gives more details, including some examples for modifying XML-RPC and plist as they're processed. (In those cases, it's to make the resulting object simpler to use, not to save memory, but they should be enough to get the idea across.)
This only helps if you can think of a way to summarize as you go. (In the most trivial case, where the parent doesn't need any info from its children, this is just elem.clear().) Otherwise, this won't work for you.
The standard solution is SAX, which is a callback-based API that lets you operate on the tree a node at a time. You don't need to worry about truncating nodes as you do with iterparse, because the nodes don't exist after you've parsed them.
Most of the best SAX examples out there are for Java or Javascript, but they're not too hard to figure out. For example, if you look at http://cs.au.dk/~amoeller/XML/programming/saxexample.html you should be able to figure out how to write it in Python (as long as you know where to find the documentation for xml.sax).
There are also some DOM-based libraries that work without reading everything into memory, but there aren't any that I know of that I'd trust to handle a 40GB file with reasonable efficiency.
The best solution will depend in part on what you are trying to do, and how free your system resources are. Converting it to a postgresql or similar database might not be a bad first goal; on the other hand, if you just need to pull data out once, it's probably not needed. When I have to parse large XML files, especially when the goal is to process the data for graphs or the like, I usually convert the xml to S-expressions, and then use an S-expression interpreter (implemented in python) to analyse the tags in order and build the tabulated data. Since it can read the file in a line at a time, the length of the file doesn't matter, so long as the resulting tabulated data all fits in memory.
I have a Python (2.7) script that acts as a server and it will therefore run for very long periods of time. This script has a bunch of values to keep track of which can change at any time based on client input.
What I'm ideally after is something that can keep a Python data structure (with values of types dict, list, unicode, int and float – JSON, basically) in memory, letting me update it however I want (except referencing any of the reference type instances more than once) while also keeping this data up-to-date in a human-readable file, so that even if the power plug was pulled, the server could just start up and continue with the same data.
I know I'm basically talking about a database, but the data I'm keeping will be very simple and probably less than 1 kB most of the time, so I'm looking for the simplest solution possible that can provide me with the described data integrity. Are there any good Python (2.7) libraries that let me do something like this?
Well, since you know we're basically talking about a database, albeit a very simple one, you probably won't be surprised that I suggest you have a look at the sqlite3 module.
I agree that you don't need a fully blown database, as it seems that all you want is atomic file writes. You need to solve this problem in two parts, serialisation/deserialisation, and the atomic writing.
For the first section, json, or pickle are probably suitable formats for you. JSON has the advantage of being human readable. It doesn't seem as though this the primary problem you are facing though.
Once you have serialised your object to a string, use the following procedure to write a file to disk atomically, assuming a single concurrent writer (at least on POSIX, see below):
import os, platform
backup_filename = "output.back.json"
filename = "output.json"
serialised_str = json.dumps(...)
with open(backup_filename, 'wb') as f:
f.write(serialised_str)
if platform.system() == 'Windows':
os.unlink(filename)
os.rename(backup_filename, filename)
While os.rename is will overwrite an existing file and is atomic on POSIX, this is sadly not the case on Windows. On Windows, there is the possibility that os.unlink will succeed but os.rename will fail, meaning that you have only backup_filename and no filename. If you are targeting Windows, you will need to consider this possibility when you are checking for the existence of filename.
If there is a possibility of more than one concurrent writer, you will have to consider a synchronisation construct.
Any reason for the human readable requirement?
I would suggest looking at sqlite for a simple database solution, or at pickle for a simple way to serialise objects and write them to disk. Neither is particularly human readable though.
Other options are JSON, or XML as you hinted at - use the built in json module to serialize the objects then write that to disk. When you start up, check for the presence of that file and load the data if required.
From the docs:
>>> import json
>>> print json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4)
{
"4": 5,
"6": 7
}
Since you mentioned your data is small, I'd go with a simple solution and use the pickle module, which lets you dump a python object into a line very easily.
Then you just set up a Thread that saves your object to a file in defined time intervals.
Not a "libraried" solution, but - if I understand your requirements - simple enough for you not to really need one.
EDIT: you mentioned you wanted to cover the case that a problem occurs during the write itself, effectively making it an atomic transaction. In this case, the traditional way to go is using "Log-based recovery". It is essentially writing a record to a log file saying that "write transaction started" and then writing "write transaction comitted" when you're done. If a "started" has no corresponding "commit", then you rollback.
In this case, I agree that you might be better off with a simple database like SQLite. It might be a slight overkill, but on the other hand, implementing atomicity yourself might be reinventing the wheel a little (and I didn't find any obvious libraries that do it for you).
If you do decide to go the crafty way, this topic is covered on the Process Synchronization chapter of Silberschatz's Operating Systems book, under the section "atomic transactions".
A very simple (though maybe not "transactionally perfect") alternative would be just to record to a new file every time, so that if one corrupts you have a history. You can even add a checksum to each file to automatically determine if it's broken.
You are asking how to implement a database which provides ACID guarantees, but you haven't provided a good reason why you can't use one off-the-shelf. SQLite is perfect for this sort of thing and gives you those guarantees.
However, there is KirbyBase. I've never used it and I don't think it makes ACID guarantees, but it does have some of the characteristics you're looking for.