I am in the very early stages of learning Python. This question has more to do with basic understanding than coding- hopefully I tag it correctly. I am reading my coursework and it says
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify so that the order is the same each time. I can do that. I want to know when you write a program and run it why do the results get returned in a different order when not specified? Is it because of the way it gets handled in the processor?
Thanks.
The answer has nothing to do with Python, and everything to do with data structures - this behavior is universal and expected across all languages that implement a similar data structure. In Python it's called a dictionary, in other languages it's called a Map or a Hash Map or a Hash Table. There are a few other similar names for the same underlying data structure.
The Python dictionary is an Associative collection, as opposed to a Python List (which is just an Array), where its elements are contiguous in memory.
The big advantage that dictionaries (associative collections) offer is fast and constant look up times (O(1)) - arrays also offer fast look up since calculating an index is trivial - however a dictionary consists of key-value pairs where the key can be anything as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, devise some way of mapping the hash to a number and treat that number like an index. As unlikely as it is for two different objects to yield the same hash, it could theoretically happen - what's more likely to happen is that your hash-to-number procedure maps two unique hashes to the same number - in any case, collisions like this can happen, and there are strategies for handling these collisions.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order.
Data: I have two very long lists (up to 500M elements each) which are extremely intensive used (read-only, no modification) by my parallel part of program. First is list of string, second is list of list of integer. Length of element (string or list) in both lists may vary from few (most of the cases) up to few hundreds or thousands (rare situation) of letter/integer.
In parallel code (sub-processes) the lists are used as a read-only data. When parallel execution is finished new elements may be add (append to end) to lists by main-process.
Question: The question is what is the best way to share such data between processes? Below are solutions which are consider (which I aware of). If I miss something please comment.
Possible solutions:
Array from multiprocessing module. It was suggested that it can be combined with numpy array. This solutions seems to be the most reasonable. But in requires fixed-length elements - this a real problem for me. I dont have enough RAM to set array element size to length of the longest items which may ever appear. Within this approach I think the most effective way is to set element length to some reasonable value (which will be capable to store most of entries) and remaining longest entries store is second array with big enough element length.
os.fork of main-process (with lists described in Data section). It is unix-only but this is not a problem for me. It should benefit from copy-on-write. But both strings (elements of 1st list) and list (elements of 2nd list) are object and probably reference counting will cause that each time data will be copy
redis (o other db which can store data in ram). I really worry about performans
store as mmap object. This gives me huge string-like (and file-like) object. Searching (iterating) over it probably will be the most time-consuming part of my program.
store in original form in one process and use queue/pipe to communicate with other processes. Previously, when I used multiprocess.pool.imap with small chunksize communication between processes kills performance. Therefore I believe such solution cannot be apply for really high traffic.
So, based on above it seems to me the most reasonable solution is to use Array and numpy (option 1) and try to split each list into few arrays with different element length. Any comments, suggestions? Did I miss something?
I am writing some Python code to implement some of the concepts I have recently been learning, related to inverted indices / postings lists. I'm quite new to Python and am having some trouble understanding its efficiencies in some cases.
Theoretically, creating an inverted index of a set of documents D, each with a unique ID doc_id should involve:
Parsing / performing lexical analysis of each document in D
Removing stopwords, performing stemming etc.
Creating a list of all (word,doc_id) pairs
Sorting the list
Condensing duplicates into {word:[set_of_all_doc_ids]} (inverted index)
Step 5 is often carried out by having a dictionary containing the word with meta-data (term frequency, byte offsets) and a pointer to the postings list (list of documents it occurs in). The postings list is often implemented as a data structure which allows efficient random insert, i.e. a linked list.
My problem is that Python is a higher-level language, and direct use of things like memory pointers (and therefore linked lists) seems to be out of scope. I am optimising before profiling because for very large data sets it is already known that efficiency must be maximised to retain any kind of ability to calculate the index in a reasonable time.
Several other posts exist here on SO about Python inverted indices and, like MY current implementation, they use dictionaries mapping keys to lists (or sets). Is one to expect that this method have similar performance to a language which allows direct coding of pointers to linked lists?
There are a number of things to say:
If random access is required for a particular list implementation, a linked list is not optimal (regardless of the programming language used). To access the ith element of the list, a linked list requires you to iterate all the way from the 0th to the ith element. Instead, the list should be stored as one continuous block (or several large blocks if it is very long). Python lists [...] are stored in this way, so for a start, a Python list should be good enough.
In Python, any assignment a = b of an object b that is not a basic data type (such as int or float), is performed internally by passing a pointer and incrementing the reference count to b. So if b is a list or a dictionary (or a user-defined class, for that matter), this is in principle not much different from passing a pointer in C or C++.
However, there is obviously some overhead caused by a) reference counting and b) garbage collection. If the implementation is for study purposes, i.e. to understand the concepts of inverted indexing better, I would not worry about that. But for a serious, highly-optimized implementation, using pure Python (rather than, e.g. C/C++ embedded into Python) is not advisable.
As you optimise the implementation of your postings list further, you will probably see the need to a) make random inserts, b) keep it sorted and c) keep it compressed - all at the same time. At that point, the standard Python list won't be good enough any more, and you might want to look into implementing a more optimised list representation in C/C++ and embed it into Python. However, even then, sticking to pure Python would probably be possible. E.g. you could use a large string to implement the list and use itertools and buffer to access specific parts in a way that is, to some extent, similar to pointer arithmetic.
One thing that you should always keep in mind when dealing with strings in Python is that, despite what I said above about assignment operations, the substring operation text[i:j] involves creating an actual (deep) copy of the substring, rather than merely incrementing a reference count. This can be avoided by using the buffer data type mentioned above.
You can see the code and documentation for inverted index in Python at : http://www.ssiddique.info/creation-of-inverted-index-and-use-of-ranking-algorithm-python-code.html
Soon I will be coding it in C++..
Forgive me for asking in in such a general way as I'm sure their performance is depending on how one uses them, but in my case collections.deque was way slower than collections.defaultdict when I wanted to verify the existence of a value.
I used the spelling correction from Peter Norvig in order to verify a user's input against a small set of words. As I had no use for a dictionary with word frequencies I used a simple list instead of defaultdict at first, but replaced it with deque as soon as I noticed that a single word lookup took about 25 seconds.
Surprisingly, that wasn't faster than using a list so I returned to using defaultdict which returned results almost instantaneously.
Can someone explain this difference in performance to me?
Thanks in advance
PS: If one of you wants to reproduce what I was talking about, change the following lines in Norvig's script.
-NWORDS = train(words(file('big.txt').read()))
+NWORDS = collections.deque(words(file('big.txt').read()))
-return max(candidates, key=NWORDS.get)
+return candidates
These three data structures aren't interchangeable, they serve very different purposes and have very different characteristics:
Lists are dynamic arrays, you use them to store items sequentially for fast random access, use as stack (adding and removing at the end) or just storing something and later iterating over it in the same order.
Deques are sequences too, only for adding and removing elements at both ends instead of random access or stack-like growth.
Dictionaries (providing a default value just a relatively simple and convenient but - for this question - irrelevant extension) are hash tables, they associate fully-featured keys (instead of an index) with values and provide very fast access to a value by a key and (necessarily) very fast checks for key existence. They don't maintain order and require the keys to be hashable, but well, you can't make an omelette without breaking eggs.
All of these properties are important, keep them in mind whenever you choose one over the other. What breaks your neck in this particular case is a combination of the last property of dictionaries and the number of possible corrections that have to be checked. Some simple combinatorics should arrive at a concrete formula for the number of edits this code generates for a given word, but everyone who mispredicted such things often enough will know it's going to be surprisingly large number even for average words.
For each of these edits, there is a check edit in NWORDS to weeds out edits that result in unknown words. Not a bit problem in Norvig's program, since in checks (key existence checks) are, as metioned before, very fast. But you swaped the dictionary with a sequence (a deque)! For sequences, in has to iterate over the whole sequence and compare each item with the value searched for (it can stop when it finds a match, but since the least edits are know words sitting at the beginning of the deque, it usually still searches all or most of the deque). Since there are quite a few words and the test is done for each edit generated, you end up spending 99% of your time doing a linear search in a sequence where you could just hash a string and compare it once (or at most - in case of collisions - a few times).
If you don't need weights, you can conceptually use bogus values you never look at and still get the performance boost of an O(1) in check. Practically, you should just use a set which uses pretty much the same algorithms as the dictionaries and just cuts away the part where it stores the value (it was actually first implemented like that, I don't know how far the two diverged since sets were re-implemented in a dedicated, seperate C module).
I need a memory-efficient data structure for for storing about a million key--value pairs, where keys are strings of about 80 bytes, and values are strings of about 200 bytes, the total key and value size being about 280MB. I also need efficient lookup of value by key, preferably a hash-map. The memory overhead should be as little as possible, e.g. for 280MB of useful data, the data structure shouldn't use more than 300MB of virtual memory (including malloc() overhead and everything else). The usage pattern is the following: we start with an empty data structure, and we populate it gradually, never changing keys, and never changing the length of values. As a plus, the data structure may support changing the length of values, at the expense of a 100% value overhead (meaning that for x value bytes, x bytes might be wasted in temporarily in unused buffer space).
I need a pure Python module, or a built-in Python module, or a C implementation preferably with (C)Python bindings. I'd prefer if it was possible to serialize the whole data structure to disk, and to read it back very quickly.
Just to prove that such a small overhead is possible, I've created a simple design with open addressing, the hash table of 1.25 million elements containing 4-byte pointers to 1MB data blocks, the data blocks containing the key and value lengths as base-128 varints. This design has an important limitation: it doesn't allow removing or changing pairs without wasting their memory area. According to my calculations with 1 million key--value pairs of 280 bytes each, the overhead is less than 3.6% (10 080 000 bytes). The limits above are more generous, they allow 20 000 000 bytes of overhead.
I've just found http://www.pytables.org/ , which provides fast access and memory-efficient packing of data. I have to examine it more closely to check if it suits my needs.
Ok, the dirt-simple approach.
Use a python dictionary for the data structure. I filled a python dictionary with 1 million random key-value pairs where the key was 80 characters and the value 200 characters. It took 360,844 Kb on my computer, which is outside of your specification of no more than 300 MB, but I offer it up as a solution anyway because it's still pretty memory efficient.
This also fails your requirement of having a C API. I'm not sure why you need C, but as the question is tagged Python and lacks a C tag, I'll offer the pure Python to see if it just might fit the bill.
Regarding persistence. Use the cPickle module. It's very fast and, again, dirt-simple. To save your dictionary:
cPickle.dump(mydict, "myfile.pkl")
To reload your dictionary:
mydict = cPickle.load("myfile.pkl")
A second dirt-simple idea is to use the shelve module, which is basically disk-based python dictionary. Memory overhead is very low (it's all on disk). But it's also much slower.
Martijn mentioned this in a comment (not sure why people comment with answers), but I agree: use SQLite. You should give it a try and see if it will meet your needs.
If you don't plan to have a large amounts of deletes, then this isn't that hard. Deletes lead to fragmentation.
You also need to commit to a fixed length key. You mentioned 80 bytes. Are your keys allowed to duplicate? If not, it's even easier.
So, here is what you do.
You create an array of:
struct {
char value[80];
char *data;
} key;
And you keep this array sorted.
If you keys can duplicate, then you need:
struct link {
char *data;
link *next;
}
struct {
char value[80];
link *data;
} key;
(My C is rusty, but this is the gist of it) The latter has each key pointing to a linked list of values.
Then a lookup is a simple binary search. The "pain" is in maintaining this array and inserting/deleting keys. It's not as painful as it sounds, but it saves a LOT of memory, especially on 64bit systems.
What you want to reduce is the number of pointers. Pointers are expensive when you have lots of structures filled with pointers. On a 64bit system, a pointer is 8 bytes. So for a single pointer, there goes 8MB of your memory budget.
So, the expense is in building the array, copying and compacting memory (if you "know" you will have a million rows and can commit to that, then malloc(1000000 * sizeof(key)) right away, it'll save you some copying during expansion).
But don't be afraid of, once it's up and running, performance is quite good. Modern cpus are actually pretty good at copying 100M blocks of memory around.
Just as an aside, I just did something much like this in Java. On a 64bit JVM, a Map with 25M entries is 2G of RAM. My solution (using similar techniques to this) has at around 600M). Java uses more pointers than C, but the premise is the same.
Have you tried using a straightforward dict? Most of your data is in strings, so the overhead might fit within your requirements.
You can use the sha1 of the key instead of the key itself. If the keys are unique, then the sha1 hash of the keys is likely, too. It provides a memory savings to try to squeak in under your limit.
from random import choice
from string import letters
from hashlib import sha1
def keygen(length):
return "".join(choice(letters) for _ in xrange(length))
def gentestdata(n=1000*1000):
# return dict((sha1(keygen(80)).digest(), keygen(200)) for _ in xrange(n))
d = {}
for _ in xrange(n):
key = sha1(keygen(80)).digest()
assert key not in d
value = keygen(200)
d[key] = value
return d
if __name__ == '__main__':
d = gentestdata()
On my ubuntu box, this tops out at 304 MB of memory:
2010-10-26 14:26:02 hbrown#hbrown-ubuntu-wks:~$ ps aux | grep python
[...]
hbrown 12082 78.2 7.5 307420 303128 pts/1 S+ 14:20 4:47 python
Close enough? It's python, not C.
Later: also, if your data is somewhat redundant, you can gzip the values. It's a time versus space trade-off.
Using SQLite is a good idea. A quick implementation can tell if you are fast enough with little effort.
If you determine you have to roll your own, I'd recommend the following:
How well can you predict the number of pairs, or an upper limit for that?
How well can you predict the total data size, or an upper limit for that?
Arena allocator for strings and nodes. (Usually, you'd work on a list of arenas, so you don't have to predict the total size).
Alignment depends on your algorithms, in principle you could pack it byte-tight, and the only overhead is your overallocation, which only minimally affects your working set.
However, if you have to run any cmp/copy etc. operations on these strings, remember that with the following guarantees, you can squeeze a little or a lot from these string operations:
all elements are CPU word aligned
all pad bytes are (e.g.) 0
you can safely read "beyond" a string end as long as you don't cross a CPU border
Hash table for the index. A dictionary would work, too, but that makes sense only if potential degradation / rehashing would be a serious issue. I don't know any "stock" hashtable implementation for C, but there should be one, right? right? Just replace allocations with calls to the arena allocator.
Memory Locality
If you can guarantee that lookup will never request a string that is not in the map, you should store the keys in a separate arena, as they are needed only on hash collisions. That can improve memory locality significantly. (In that case, if you ever have a "final" table, you could even copy the colliding keys to a new arena, and throw away all the others. The benefits of that are probably marginal, though.)
Separation can help or hurt, depending on your access patterns. If you typically use the value once after each lookup, having them pair-wise in the same arena is great. If you e.g. look up a few keys, then use their values repeatedly, separate arenas make sense.
If you have to support "funny characters" / Unicode, normalize your strings before storing them.
You could use struct module to pack binary data and unpack it when needed.
You can implement a memory efficient storage using this approach. I guess access would be a pain.
http://docs.python.org/library/struct.html
Apache Portable Runtime (aka APR) has a c-based hash table. You can see documentation at http://apr.apache.org/docs/apr/0.9/group_apr_hash.html
With apr_hash_t all you store is void*. So it gives you full control over values. SO if you want you can store pointer to a 100 byte block instead of actual length of the string.
Judy should be memory-efficient: http://judy.sourceforge.net/
(Benchmarks: http://www.nothings.org/computer/judy/, see "Data Structure Size").
See also: http://www.dalkescientific.com/Python/PyJudy.html
Also,
For keys of a fixed size there is http://panthema.net/2007/stx-btree/ in C++ (I'm sure that with a custom C wrappers it can be used from CPython).
If the dataset allows it, you can store the variable-length keys in the value and use a hash or a prefix of the variable-length key as the fixed-length key.
The same logic applies to http://google-opensource.blogspot.ru/2013/01/c-containers-that-save-memory-and-time.html and http://code.google.com/p/sparsehash/ - istead of using a heavy std::string as a key, use an 32-bit or 64-bit integer key, making it somehow from the real variable-length key.
Since I couldn't find any existing solutions which will pack the memory tightly, I've decided to implement it in C for myself. See my design with open addressing in the question.