Numba-compatible replacement for dict - python

I am trying to solve a series of Bellman value functions without using dictionaries, since I believe that dictionaries are not supported in Numba (e.g. see Optimizing dict of set of tuple of ints with Numba?). In particular, the dictionary structure in my naive representation (below) is useful because I need to perform operations on both the keys and the values. What data structures should I use, or how should I set up my problem, so that I can still use Numba?
Explicitly: I am filling in a set of recursively defined value functions, where they are meant to model 'job experience' over various 'sectors' at a given time. For example, if there are 2 sectors and it is t=3, the experience allocations, i.e. the possible states at t=3, are (3,0), (2,1), (1,2), (0,3). For each of these experience allocations there is some value associated with them. The recursion comes from the fact that at t=2, the value can be recursively defined from the values of 'reachable' states at t=3. For example, at t=2, the value of (2,0) is a function of two values at t=3, (3,0) and (2,1), since those are the 'reachable' experience levels by adding 1 to experience in either sector.
This is very straightforward to code if I define a dictionary with keys that are tuples of form (time, experience), since I can just 'read' the experience (tuple), find out what are the 'reachable' key values in the next period (by adding 1 to each value of the experience tuple), call the corresponding values, and evaluate the value at the current time using the recursive relationship.
However, I do not believe dictionaries work in Numba, so what sort of data structure should I use instead that still lets me work with a 'key' and a 'value'? In general the problem will be too large for me to build, e.g., a 2-D array of size N^T for N sectors over time T. Also, for my desired 'output' I want the full 'dictionary' of values, since I will presumably calculate this filled-in dictionary once and then re-use its values numerous times.
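For concreteness, here is a minimal sketch of the dict-based recursion described above (not Numba-compatible, just pinning down the structure in question). terminal_value and flow_value are hypothetical placeholders for the model's actual payoffs, and max() stands in for whatever Bellman operator is used:

from itertools import product

N_SECTORS = 2
T = 3

def terminal_value(exp):
    # hypothetical terminal payoff
    return float(sum(e * e for e in exp))

def flow_value(t, exp):
    # hypothetical per-period payoff
    return 0.0

def solve_values():
    V = {}
    # work backwards from T, enumerating all experience allocations summing to t
    for t in range(T, -1, -1):
        for exp in product(range(t + 1), repeat=N_SECTORS):
            if sum(exp) != t:
                continue
            if t == T:
                V[(t, exp)] = terminal_value(exp)
            else:
                # 'reachable' states: add one period of experience in any one sector
                reachable = []
                for i in range(N_SECTORS):
                    nxt = list(exp)
                    nxt[i] += 1
                    reachable.append(V[(t + 1, tuple(nxt))])
                V[(t, exp)] = flow_value(t, exp) + max(reachable)
    return V

V = solve_values()
print(V[(2, (2, 0))])   # depends on V[(3, (3, 0))] and V[(3, (2, 1))]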

Related

When you don't specify an order, why does the result order vary?

I am in the very early stages of learning Python. This question has more to do with basic understanding than coding - hopefully I tag it correctly. I am reading my coursework and it says:
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify things so that the order is the same each time, and I can do that. What I want to know is: when you write a program and run it, why do the results come back in a different order when no order is specified? Is it because of the way it gets handled in the processor?
Thanks.
The answer has nothing to do with Python, and everything to do with data structures - this behavior is universal and expected across all languages that implement a similar data structure. In Python it's called a dictionary, in other languages it's called a Map or a Hash Map or a Hash Table. There are a few other similar names for the same underlying data structure.
The Python dictionary is an associative collection, as opposed to a Python list (which is essentially an array, whose elements are contiguous in memory).
The big advantage that dictionaries (associative collections) offer is fast, constant-time lookup (O(1)). Arrays also offer fast lookup, since calculating an index is trivial; however, a dictionary consists of key-value pairs where the key can be anything, as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, map the hash to a slot number, and treat that number like an index. As unlikely as it is for two different objects to yield the same hash, it can theoretically happen; what's more likely is that the hash-to-slot mapping sends two distinct hashes to the same slot. In any case, collisions like this can happen, and there are strategies for handling them.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order.
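To make that concrete, here is a toy illustration (a deliberate simplification of what a real hash table does): the slot a key lands in, and hence the order you tend to see when iterating, is driven by something like hash(key) % table_size rather than by insertion order. Note that hash() for strings is randomized per interpreter run, so the slots printed below can differ between runs:

TABLE_SIZE = 8

for key in ["apple", "banana", "cherry"]:
    print(key, "->", hash(key) % TABLE_SIZE)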

Checking for duplicate arrays when I have a huge amount of arrays

I am counting various patterns in graphs, and I store the relevant information in a defaultdict of lists of numpy arrays of size N, where the index values are integers.
I want to know efficiently whether I am creating a duplicate array. Not removing duplicates can grow the number of duplicates exponentially, to the point where what I am doing becomes infeasible. But there are potentially hundreds of thousands of arrays, stored in different lists, under different keys. As far as I know, I can't hash an array.
If I simply needed to check for duplicate nonzero indices, I would store the nonzero indices as a bit sequence of ones and then hash that value. But I don't only need to check the indices - I also need to check their integer values. Is there any way to do this short of coming up with a completely new design that uses different structures?
Thanks.
The basic idea is "How can I use my own hash (and perhaps ==) to store things differently in a set/dict?" (where "differently" includes "without raising TypeError for being non-hashable").
The first part of the answer is defining your hash function, for example following myrtlecat’s comment. However, beware the standard non-answer based on it: store the custom hash of each object in a set (or map it to, say, the original object with a dict). That you don’t have to provide an equality implementation is a hint that this is wrong: hash values aren’t always unique! (Exception: if you want to “hash by identity”, and know all your keys will outlive the map, id does provide unique “hashes”.)
The rest of the answer is to wrap your desired keys in objects that expose your hash/equality functions as __hash__ and __eq__. Note that overriding the non-hashability of mutable types comes with an obligation to not alter the (underlying) keys! (C programmers would often call doing so undefined behavior.)
For code, see an old answer by xperroni (which includes the option to increase safety by basing the comparisons on private copies that are less likely to be altered by some other code), though I’d add __slots__ to combat the memory overhead.
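As a rough sketch of that wrapper idea, assuming the keys are numpy arrays (HashableArray is a hypothetical name; the hash covers both the indices and the stored values via the raw bytes):

import numpy as np

class HashableArray:
    __slots__ = ("_a", "_h")    # keeps per-wrapper memory overhead down

    def __init__(self, a):
        self._a = a                   # caller promises not to mutate this afterwards
        self._h = hash(a.tobytes())   # covers the nonzero indices and their values

    def __hash__(self):
        return self._h

    def __eq__(self, other):
        return isinstance(other, HashableArray) and np.array_equal(self._a, other._a)

seen = set()
seen.add(HashableArray(np.array([0, 3, 0, 1])))
print(HashableArray(np.array([0, 3, 0, 1])) in seen)   # True: duplicate detected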

Injective two-way mappings [duplicate]

I often deal with mappings which are injective. In programming terminology, this can be expressed as a dictionary where all values are unique as well as, of course, all keys.
Is there a memory-efficient data structure for injective mappings with all the time-complexity properties you expect from dictionaries?
For example:
d = {1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e'}
d.get(2)                   # returns 'b' - this works with a normal dictionary
d.get('b', reverse=True)   # returns 2 - but this is not possible
All the solutions in Two way/reverse map seem to require using or combining two sets of mappings, focusing on making it easier to perform operations on a two-way map. This is fine for small dictionaries that fit neatly in memory, but not good for large dictionaries.
The requirement is there should be no additional memory overhead storing the injective two-way map versus a regular dictionary storing only one-way mappings.
I understand dictionaries use a hash table, which implements the associative array abstract data type. By definition, associative arrays implement key -> value mappings with unique keys. Is it possible, theoretically or in practice, to produce a smart injective mapping which allows reverse lookup?
If it is not possible, I would appreciate an explanation why such a construct is difficult, or impossible, to implement with the same efficiency as dictionaries.
Update
Following the discussion with @rpy (see comments below), any information on how to set up a Python dictionary-like object using a perfect reversible hash function would be useful. But, of course, a working implementation would be ideal (I have already tried perfection).
The net answer to your question is: NO (for any efficient implementation)
You put up two requirements that cannot be fulfilled at the same time:
Do not use extra memory for the reverse mapping
Do not add extra time for doing (reverse) look-ups
Why do those two restrictions prohibit a solution?
Mappings are pairs of values (tuples).
The most trivial implementation would be:
Sequentially searching all tuples for a match.
This would have identical complexity for forward and backward mapping.
However, this clearly violates the time-complexity properties you expect from dictionaries:
If you would allow for O(n) complexity, then searching a tuple set sequentially would give you a proper solution.
Usually dictionary implementations try to get down to O(1), or at least O(log(n)), complexity. This is achieved by introducing additional data for speeding up look-ups, such as hashes or trees. Unfortunately, such aids only help in one direction, as they deal with either the keys (forward mapping case) or the values (reverse mapping case).
So, as soon as you need to keep look-up complexity down (this also applies for modification complexity, but usually dictionaries are tailored towards fast look-up), you will need to add data for achieving speed.
The whole issue comes down to the classic memory vs. speed trade-off.
EDIT:
An approach for addressing the problem in a general implementation (for cases where keys and values allow for deriving a numeric representation, if they are not integral numbers in the first place) might be:
Calculate a hash value for the key and one for the value, and register the tuple under both hash values. This way you can take either the key or the value, identify the matching tuple, and return the proper result. This would even work for non-injective cases if you allow returning sets of matching tuples.
This will require more space (double the hash entries) while keeping look-up complexity within values typical for hash-based dicts. You might need to keep an eye on hash bucket size (the length of collision chains), especially when the sets of keys and values are not disjoint.
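For illustration, a rough sketch of that double-registration idea (it costs roughly twice the entries, so it does not satisfy the no-extra-memory requirement, and it assumes the key and value sets never collide with each other):

class TwoWayDict:
    def __init__(self):
        self._table = {}

    def add(self, key, value):
        # register the pair under both the key and the value
        self._table[key] = value
        self._table[value] = key

    def get(self, k, default=None):
        return self._table.get(k, default)

d = TwoWayDict()
d.add(1, 'a')
d.add(2, 'b')
print(d.get(2))     # 'b' (forward lookup)
print(d.get('b'))   # 2   (reverse lookup)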

Nested list and dictionary efficiency

I'm working on a project that requires a 2D map with a list for every possible x and y coordinate on that map. Given that the map dimensions are constant, which is faster for creating, searching, and changing values?
Let's say I have a 2x2 grid with a total of 4 positions, each storing 2 bits (0, 1, 2 or 3). Would having [0b00, 0b00, 0b00, 0b01] represent the grid be better than [[0b00, 0b00], [0b00, 0b01]] in terms of efficiency and readability?
I assumed that the first method would be quicker for creation and for iterating over all of the values, but that the second would be faster for finding the value of a certain position (so listName[1][0] is easier to work out than listName[2]).
To clarify, I want to know which is more memory efficient and CPU efficient for the 3 listed uses and (if it isn't too much trouble) why that is. Further, the actual lists I'm using are 4096x4096 (a total of about 17 MB in raw data).
Note: I DO already plan on splitting the 4096x4096 grid into sectors that will be part of a nested list, I'm just asking if x and y should be on the same nesting level.
Thanks.
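To make the comparison concrete, here is a small sketch of the two layouts being asked about (row-major flat indexing versus nested lists; WIDTH and HEIGHT are stand-ins for the real grid dimensions):

WIDTH, HEIGHT = 2, 2

flat   = [0b00, 0b00, 0b00, 0b01]          # one list, row-major order
nested = [[0b00, 0b00], [0b00, 0b01]]      # one inner list per row

def get_flat(x, y):
    return flat[y * WIDTH + x]              # index arithmetic instead of a double lookup

def get_nested(x, y):
    return nested[y][x]

assert get_flat(1, 1) == get_nested(1, 1) == 0b01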

Python inverted index efficiency

I am writing some Python code to implement some of the concepts I have recently been learning, related to inverted indices / postings lists. I'm quite new to Python and am having some trouble understanding its efficiencies in some cases.
Theoretically, creating an inverted index of a set of documents D, each with a unique ID doc_id, should involve:
Parsing / performing lexical analysis of each document in D
Removing stopwords, performing stemming etc.
Creating a list of all (word,doc_id) pairs
Sorting the list
Condensing duplicates into {word:[set_of_all_doc_ids]} (inverted index)
Step 5 is often carried out by having a dictionary containing the word with meta-data (term frequency, byte offsets) and a pointer to the postings list (list of documents it occurs in). The postings list is often implemented as a data structure which allows efficient random insert, i.e. a linked list.
My problem is that Python is a higher-level language, and direct use of things like memory pointers (and therefore linked lists) seems to be out of scope. I am optimising before profiling because for very large data sets it is already known that efficiency must be maximised to retain any kind of ability to calculate the index in a reasonable time.
Several other posts exist here on SO about Python inverted indices and, like my current implementation, they use dictionaries mapping keys to lists (or sets). Should one expect this method to have performance similar to a language which allows direct coding of pointers to linked lists?
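For reference, a minimal sketch of the dict-of-sets approach described above (i.e. step 5 done directly with built-in types); tokenize() is a stand-in for the real parsing/stemming pipeline:

from collections import defaultdict

def tokenize(text):
    return text.lower().split()          # placeholder for lexical analysis, stopwords, stemming

def build_index(docs):                   # docs: {doc_id: text}
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)      # the postings 'list' is a set of doc_ids
    return index

index = build_index({1: "the cat sat", 2: "the dog sat"})
print(sorted(index["sat"]))              # [1, 2]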
There are a number of things to say:
If random access is required for a particular list implementation, a linked list is not optimal (regardless of the programming language used). To access the ith element of the list, a linked list requires you to iterate all the way from the 0th to the ith element. Instead, the list should be stored as one continuous block (or several large blocks if it is very long). Python lists [...] are stored in this way, so for a start, a Python list should be good enough.
In Python, any assignment a = b of an object b that is not a basic data type (such as int or float) is performed internally by passing a pointer and incrementing the reference count of b. So if b is a list or a dictionary (or a user-defined class, for that matter), this is in principle not much different from passing a pointer in C or C++.
However, there is obviously some overhead caused by a) reference counting and b) garbage collection. If the implementation is for study purposes, i.e. to understand the concepts of inverted indexing better, I would not worry about that. But for a serious, highly-optimized implementation, using pure Python (rather than, e.g. C/C++ embedded into Python) is not advisable.
As you optimise the implementation of your postings list further, you will probably see the need to a) make random inserts, b) keep it sorted and c) keep it compressed - all at the same time. At that point, the standard Python list won't be good enough any more, and you might want to look into implementing a more optimised list representation in C/C++ and embed it into Python. However, even then, sticking to pure Python would probably be possible. E.g. you could use a large string to implement the list and use itertools and buffer to access specific parts in a way that is, to some extent, similar to pointer arithmetic.
One thing that you should always keep in mind when dealing with strings in Python is that, despite what I said above about assignment operations, the substring operation text[i:j] involves creating an actual (deep) copy of the substring, rather than merely incrementing a reference count. This can be avoided by using the buffer data type mentioned above.
You can see code and documentation for an inverted index in Python at: http://www.ssiddique.info/creation-of-inverted-index-and-use-of-ranking-algorithm-python-code.html
Soon I will be coding it in C++.
