I've been exploring the C API that Python offers.
https://docs.python.org/3.11/c-api/index.html
It's easy enough to move primitive data (double, long, int) from C into Python using things like:
PyObject *PyLong_FromLong(long v)
Building a Python list object and passing it primitives is also reasonable to do.
I was hoping to find a way to pass data in bulk. I know I can create 1,000,000 PyObjects out of doubles and add them each to a list, but it seems like there should be a better way. I think the Buffer Protocol is likely the way to go, but there isn't much in the way of examples.
https://docs.python.org/3.11/c-api/buffer.html
Whether the data is pushed into a Python list, array or numpy array doesn't really matter, it would just be nice to avoid millions of calls from C into Python.
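For concreteness, here is the kind of zero-copy handoff I'm hoping for, sketched from the Python side only (array.array stands in here for whatever C-side object would actually expose the buffer; the numbers are just for illustration):

    import array

    # array.array('d') stands in for a C-extension object exposing
    # the buffer protocol over a block of one million C doubles.
    data = array.array('d', (float(i) for i in range(1_000_000)))

    # memoryview wraps the underlying buffer without copying elements.
    view = memoryview(data)
    print(view[0], view[-1])         # 0.0 999999.0
    print(view.format, view.nbytes)  # d 8000000

    # numpy can wrap the same memory zero-copy, too:
    # import numpy as np
    # arr = np.frombuffer(view, dtype=np.float64)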
Thanks!
I am in the very early stages of learning Python. This question has more to do with basic understanding than coding - hopefully I tagged it correctly. I am reading my coursework and it says:
"Run the program below that displays the ... The indentation and
spacing of the... key-value pairs simply provides more readability.
Note that order is not maintained in the dict when printed."
I know I can specify the order so that it is the same each time; I can do that. What I want to know is: when you write a program and run it, why do the results come back in a different order when no order is specified? Is it because of the way the data gets handled in the processor?
Thanks.
The answer has nothing to do with Python, and everything to do with data structures - this behavior is universal and expected across all languages that implement a similar data structure. In Python it's called a dictionary, in other languages it's called a Map or a Hash Map or a Hash Table. There are a few other similar names for the same underlying data structure.
The Python dictionary is an associative collection, as opposed to a Python list (which is essentially an array, whose elements are contiguous in memory).
The big advantage that dictionaries (associative collections) offer is fast, constant-time lookup (O(1)). Arrays also offer fast lookup, since calculating an index is trivial; a dictionary, however, consists of key-value pairs where the key can be anything, as long as it is hashable.
Essentially, to determine the "index" where an associated value should go in an associative container, you take the key, hash it, map the hash onto a bucket index (typically hash modulo the table size), and treat that number like an index. As unlikely as it is for two different objects to yield the same hash, it can theoretically happen - and what's more likely is that the hash-to-index mapping sends two distinct hashes to the same bucket. Either way, collisions like this can happen, and there are strategies for handling them.
The point is, the hash of a key determines the order in which the associated value appears in the collection - therefore, there is no inherent order. (Note that as of Python 3.7 the built-in dict does guarantee insertion order as a language feature, but that is an extra guarantee layered on top of the hash table; the classic behaviour described here is what your coursework is referring to.)
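To make that concrete, here is a minimal sketch of the bucket-index calculation (the table size of 8 is arbitrary):

    # hash() produces an integer; modulo maps it onto a bucket index.
    key = "name"
    n_buckets = 8
    index = hash(key) % n_buckets
    print(index)  # some value in 0..7

Because string hashing in Python is randomized per interpreter run (see PYTHONHASHSEED), even this index can change from one run of the program to the next - which is exactly why printed dict output could differ between runs on older Pythons.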
I'm trying to write a simple language interpreter for a custom language in C. I want to use C over C++ due to C's simplicity.
The things I'm not sure how to do in C are storing variables and variable lookups.
I was planning to store variables in an array, but I think I'd need a variable sized array.
I also don't know an efficient way to lookup variables from an array besides just looping through it.
So I'd like to know, what is an efficient way of creating a variable sized array? How does Python or Ruby or Go store and retrieve variables efficiently?
How does Python or Ruby or Go store and retrieve variables efficiently?
Python and Ruby use hash tables: the name of the variable is translated into an integer, and that integer is used as an index into an array. It can always happen that several names collide (translate to the same integer), so that needs to be taken into account by allowing several bindings from name to value at the same slot - but there will only be a few to check for each name.
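As an illustration of the mechanism (not of CPython's actual implementation), a chained hash table for variable bindings might look like this in Python; all names and sizes here are made up:

    class Scope:
        """Toy chained hash table mapping variable names to values."""
        def __init__(self, n_buckets=64):
            self.buckets = [[] for _ in range(n_buckets)]

        def _bucket(self, name):
            # The name is hashed to an integer, which indexes the array.
            return self.buckets[hash(name) % len(self.buckets)]

        def set(self, name, value):
            bucket = self._bucket(name)
            for pair in bucket:
                if pair[0] == name:      # name already bound: rebind it
                    pair[1] = value
                    return
            bucket.append([name, value]) # new binding in this slot

        def get(self, name):
            for key, value in self._bucket(name):  # few entries per slot
                if key == name:
                    return value
            raise NameError(name)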
Go is compiled, so the variable is translated to an address (either static or an offset with respect to the stack—or frame—pointer) at compile-time.
what is an efficient way of creating a variable sized array?
If you decided to do that, you would use malloc and realloc.
In the case of resizing the array of buckets of a hash-table, realloc is unfortunately not useful because all the keys in the old array of buckets need to be re-hashed one by one to find where they go in the new array. If you know the maximum size of programs that will be interpreted by your interpreter, you can allocate the hash-table directly at the size that works for the largest programs, and avoid writing the hash-table resizing function.
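Continuing the Scope sketch above, the resize step described in the last paragraph would look roughly like this - note that every key is hashed again, which is why realloc buys you nothing here:

    def resize(scope, new_n_buckets):
        old_buckets = scope.buckets
        scope.buckets = [[] for _ in range(new_n_buckets)]
        for bucket in old_buckets:
            for name, value in bucket:
                # Re-hash each key against the new bucket count.
                scope.buckets[hash(name) % new_n_buckets].append([name, value])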
I think you can get really carried away when trying to implement variable storage yourself. I would recommend you use an existing hashmap like uthash, just to see how it works out for you conceptually, and encapsulate it as well as possible. If it turns out to be a bottleneck, you can come back and optimize later.
I am fairly confident that, at that point, you will not pick a plain dynamically expanding array. Consider that you need a string-based search to find a variable by name, so a dynamically expanding array will have a hard time competing with a hashmap: search on the array would be O(n) if unsorted and O(log n) if kept sorted, whereas the hashmap has O(1) expected search complexity.
I am writing some Python code to implement some of the concepts I have recently been learning, related to inverted indices / postings lists. I'm quite new to Python and am having some trouble understanding its efficiencies in some cases.
Theoretically, creating an inverted index for a set of documents D, each with a unique ID doc_id, should involve:
1. Parsing / performing lexical analysis of each document in D
2. Removing stopwords, performing stemming, etc.
3. Creating a list of all (word, doc_id) pairs
4. Sorting the list
5. Condensing duplicates into {word: [set_of_all_doc_ids]} (the inverted index)
Step 5 is often carried out by having a dictionary containing the word with meta-data (term frequency, byte offsets) and a pointer to the postings list (list of documents it occurs in). The postings list is often implemented as a data structure which allows efficient random insert, i.e. a linked list.
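For reference, my current approach to steps 3-5 boils down to something like this (tokenization is just whitespace splitting here, standing in for the real lexical analysis and stemming):

    from collections import defaultdict

    def build_index(documents):
        """documents: mapping of doc_id -> text; returns word -> set of doc_ids."""
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for word in text.split():   # stand-in for parsing/stemming
                index[word].add(doc_id)
        return index

    index = build_index({1: "to be or not to be", 2: "let it be"})
    print(sorted(index["be"]))  # [1, 2]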
My problem is that Python is a higher-level language, and direct use of things like memory pointers (and therefore linked lists) seems to be out of scope. I am optimising before profiling because for very large data sets it is already known that efficiency must be maximised to retain any kind of ability to calculate the index in a reasonable time.
Several other posts exist here on SO about Python inverted indices and, like MY current implementation, they use dictionaries mapping keys to lists (or sets). Is it reasonable to expect this method to have performance similar to a language that allows direct coding of pointers to linked lists?
There are a number of things to say:
If random access is required for a particular list implementation, a linked list is not optimal (regardless of the programming language used). To access the ith element of the list, a linked list requires you to iterate all the way from the 0th to the ith element. Instead, the list should be stored as one continuous block (or several large blocks if it is very long). Python lists [...] are stored in this way, so for a start, a Python list should be good enough.
In Python, an assignment a = b never copies the object: internally it copies a pointer and increments the reference count of the object b refers to. So if b is a list or a dictionary (or a user-defined class instance, for that matter), this is in principle not much different from passing a pointer in C or C++. (The same holds for ints and floats, but since those are immutable the sharing is unobservable.)
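You can observe this sharing directly:

    import sys

    b = [1, 2, 3]
    a = b                      # no copy: both names refer to one list
    print(a is b)              # True
    a.append(4)
    print(b)                   # [1, 2, 3, 4] - the shared list changed
    print(sys.getrefcount(b))  # the count includes the extra reference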
However, there is obviously some overhead caused by a) reference counting and b) garbage collection. If the implementation is for study purposes, i.e. to understand the concepts of inverted indexing better, I would not worry about that. But for a serious, highly-optimized implementation, using pure Python (rather than, e.g. C/C++ embedded into Python) is not advisable.
As you optimise the implementation of your postings list further, you will probably see the need to a) make random inserts, b) keep it sorted and c) keep it compressed - all at the same time. At that point, the standard Python list won't be good enough any more, and you might want to look into implementing a more optimised list representation in C/C++ and embedding it into Python. However, even then, sticking to pure Python would probably be possible: e.g. you could use one large string to implement the list and use itertools and buffer (Python 2's buffer; memoryview in Python 3) to access specific parts in a way that is, to some extent, similar to pointer arithmetic.
One thing that you should always keep in mind when dealing with strings in Python is that, despite what I said above about assignment operations, the substring operation text[i:j] involves creating an actual (deep) copy of the substring, rather than merely incrementing a reference count. This can be avoided by using the buffer / memoryview data type mentioned above.
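In Python 3 terms, the copy-free alternative looks like this (bytes rather than str, since memoryview needs a bytes-like object):

    text = b"the quick brown fox" * 1000

    sub = text[4:9]               # bytes slicing copies the selected range
    view = memoryview(text)[4:9]  # slicing a memoryview copies nothing

    print(bytes(view))            # b'quick'
    print(view.obj is text)       # True: the view points into `text` itself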
You can see the code and documentation for inverted index in Python at : http://www.ssiddique.info/creation-of-inverted-index-and-use-of-ranking-algorithm-python-code.html
Soon I will be coding it in C++.
One of the reasons I love Python is the expressive power / reduced programming effort provided by tuples, lists, sets and dictionaries. Once you understand list comprehensions and a few of the basic patterns using in and for, life gets so much better! Python rocks.
However I do wonder why these constructs are treated as differently as they are, and how this is changing (getting stranger) over time. Back in Python 2.x, I could've made an argument that they were all just variations of a basic collection type, and that it was kind of irritating that some non-exotic use cases required you to convert a dictionary to a list and back again. (Isn't a dictionary just a list of tuples with a particular uniqueness constraint? Isn't a list just a set with a different kind of uniqueness constraint?)
Now in the 3.x world, it's gotten more complicated. There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
What do the hardcore Pythonistas think?
These data types all serve different purposes, and in an ideal world you might be able to unify them more. However, in the real world we need to have efficient implementations of the basic collections, and e.g. ordering adds a runtime penalty.
The named tuples mainly serve to make the interface of stat() and the like more usable, and also can be nice when dealing with SQL row sets.
The big unification you're looking for is actually there, in the form of the different access protocols (__getitem__, __getattr__, __iter__, ...), which these types mix and match for their intended purposes.
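For example, any class that implements __len__ and __getitem__ immediately supports len(), indexing, and even iteration, just like the built-in sequences - a rough sketch:

    class Window:
        def __init__(self, items):
            self._items = list(items)

        def __len__(self):        # enables len(w)
            return len(self._items)

        def __getitem__(self, i): # enables w[i] and, via the old
            return self._items[i] # sequence protocol, iteration too

    w = Window("abc")
    print(len(w), w[0], list(w))  # 3 a ['a', 'b', 'c']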
tl;dr (duck-typing)
You're correct to see some similarities in all these data structures. Remember that Python uses duck typing (if it looks like a duck and quacks like a duck, then it is a duck). If you can use two objects in the same situation then, for your current intents and purposes, they might as well be the same data type. But you always have to keep in mind that if you try to use them in other situations, they may no longer behave the same way.
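A quick illustration - this function only needs something it can iterate over, so all four types quack well enough:

    def smallest(collection):
        return min(collection)

    print(smallest([3, 1, 2]))         # 1
    print(smallest((3, 1, 2)))         # 1
    print(smallest({3, 1, 2}))         # 1
    print(smallest({3: 'c', 1: 'a'}))  # 1 (a dict iterates over its keys)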
With this in mind we should take a look at what's actually different and the same about the four data types you mentioned, to get a general idea of the situations where they are interchangeable.
Mutability (can you change it?)
You can make changes to dictionaries, lists, and sets. Tuples cannot be "changed" without making a copy.
Mutable: dict, list, set
Immutable: tuple
Python strings are also an immutable type. Why do we want some immutable objects? I would paraphrase from this answer:
Immutable objects can be optimized a lot
In Python, only immutables are hashable (and only hashable objects can be members of sets, or keys in dictionaries).
Comparing across this property, lists and tuples seem like the "closest" two data types. At a high-level a tuple is an immutable "freeze-frame" version of a list. This makes lists useful for data sets that will be changing over time (since you don't have to copy a list to modify it) but tuples useful for things like dictionary keys (which must be immutable types).
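Both halves of that trade-off in a few lines:

    coords = [51.5, -0.1]
    frozen = tuple(coords)      # immutable "freeze-frame" of the list

    d = {frozen: "London"}      # fine: tuples are hashable
    try:
        d[coords] = "London"    # lists are mutable, hence unhashable
    except TypeError as e:
        print(e)                # unhashable type: 'list'

    coords[0] = 0.0             # a list can be changed in place...
    # frozen[0] = 0.0           # ...a tuple cannot: TypeError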
Ordering (and a note on abstract data types)
A dictionary, like a set, has no inherent conceptual order to it. This is in contrast to lists and tuples, which do have an order. The order for the items in a dict or a set is abstracted away from the programmer, meaning that if element A comes before B in a for k in mydata loop, you shouldn't (and can't generally) rely on A being before B once you start making changes to mydata.
Order-preserving: list, tuple
Non-order-preserving: dict, set
Technically, if you iterate over mydata twice in a row it'll be in the same order, but this is a convenient feature of the mechanics of Python, and not really a part of the set abstract data type (the mathematical definition of the data type). Lists and tuples do guarantee order, though. (Since Python 3.7, the built-in dict also guarantees insertion order as a language feature, but the abstract mapping concept - and set - still have no inherent order.)
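You can see the abstraction leak for yourself - with strings, hash randomization means the printed order of a set can even differ between two runs of the same program:

    s = set()
    for word in ["pear", "apple", "orange", "banana"]:
        s.add(word)
    # Insertion order is not preserved, and (for strings) the order
    # may change on the next run because of PYTHONHASHSEED.
    print(s)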
What you see when you iterate (if it walks like a duck...)
One "item" per "element": set, list, tuple
Two "items" per "element": dict
I suppose here you could see a named tuple, which has both a name and a value for each element, as an immutable analogue of a dictionary. But this is a tenuous comparison - keep in mind that duck typing will cause problems if you try to use a dictionary-only method on a named tuple, or vice versa.
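The one-item vs. two-item distinction in a couple of lines:

    d = {"a": 1, "b": 2}

    for key in d:                 # one "item" per element: the key
        print(key)

    for key, value in d.items():  # two "items" per element
        print(key, value)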
Direct responses to your questions
Isn't a dictionary just a list of tuples with a particular uniqueness constraint?
No, there are several differences. A dictionary has no inherent order, unlike a list, which does.
Also, a dictionary has a key and a value for each "element". A tuple, on the other hand, can have an arbitrary number of elements, each of which is just a value.
Because of the mechanics of a dictionary, where keys act like a set, you can look up values in constant time if you have the key. In a list of tuples (pairs here), you would need to iterate through the list until you found the key, meaning search would be linear in the number of elements in your list.
Most importantly, though, dictionary items can be changed, while tuples cannot.
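The lookup difference from the previous paragraph, side by side:

    pairs = [("a", 1), ("b", 2), ("c", 3)]
    d = dict(pairs)

    # Constant time: hashing "c" locates the bucket directly.
    print(d["c"])                                 # 3

    # Linear time: worst case scans every pair in the list.
    print(next(v for k, v in pairs if k == "c"))  # 3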
Isn't a list just a set with a different kind of uniqueness constraint?
Again, I'd stress that sets have no inherent ordering, while lists do. This makes lists much more useful for representing things like stacks and queues, where you want to be able to remember the order in which you appended items. Sets offer no such guarantee. However they do offer the advantage of being able to do membership lookups in constant time, while again lists take linear time.
There are now named tuples -- starting to feel more like a special-case dictionary. There are now ordered dictionaries -- starting to feel more like a list. And I just saw a recipe for ordered sets. I can picture this going on and on ... what about unique lists, etc.
To some degree I agree with you. However, data-structure libraries can be useful to support common use cases for already well-established data structures. This keeps the programmer from wasting time trying to come up with custom extensions to the standard structures. As long as it doesn't get out of hand, and we can still see the unique usefulness of each solution, it's good to have a wheel on the shelf so we don't need to reinvent it.
A great example is the Counter() class. This specialized dictionary has been of use to me more times than I can count (badoom-tshhhhh!) and it has saved me the effort of coding up a custom solution. I'd much rather have a solution that the community is helping me to develop and keep with proper python best-practices than something that sits around in my custom data structures folder and only gets used once or twice a year.
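For instance:

    from collections import Counter

    # A dict specialized for tallying - exactly the kind of
    # well-tested convenience described above.
    tally = Counter("mississippi")
    print(tally.most_common(2))  # [('i', 4), ('s', 4)]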
First of all, Ordered Dictionaries and Named Tuples were introduced in Python 2, but that's beside the point.
I won't point you at the docs since if you were really interested you would have read them already.
The first difference between collection types is mutability. tuple and frozenset are immutable types. This means they can be more efficient than list or set.
If you want something you can access randomly or in order, but will mainly change at the end, you want a list. If you want something you can also change at the beginning, you want a deque.
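For example:

    from collections import deque

    d = deque([2, 3, 4])
    d.appendleft(1)  # O(1); list.insert(0, x) would be O(n)
    d.append(5)      # O(1), same as list.append
    print(d)         # deque([1, 2, 3, 4, 5])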
You simply can't have your cake and eat it too -- every feature you add causes you to lose some speed.
dict and set are fundamentally different from lists and tuples. They store the hash of their keys, allowing you to see very quickly whether an item is in them, but this requires that the keys be hashable. You don't get the same membership-testing speed with linked lists or arrays.
When you get to OrderedDict and NamedTuple, you're talking about subclasses of the builtin types implemented in Python, rather than in C. They are for special cases, just like any other code in the standard library you have to import. They don't clutter up the namespace but are nice to have when you need them.
One of these days, you'll be coding, and you'll say, "Man, now I know exactly what they meant by 'There should be one-- and preferably only one --obvious way to do it', a set is just what I needed for this, I'm so glad it's part of the Python language! If I had to use a list, it would take forever." That's when you'll understand why these different types exist.
A dictionary is indexed by key (in fact, it's a hash map); a generic list of tuples won't be. You might argue that both should be implemented as relations, with the ability to add indices at will, but in practice having optimized types for the common use cases is both more convenient and more efficient.
New specialized collections get added because they are common enough that lots of people would end up implementing them using more basic data types, and then you'd have the usual problems with wheel reinvention (wasted effort, lack of interoperability...). And if Python just offered an entirely generic construct, then we'd get lots of people asking "how do I implement a set using a relation", etc.
(btw, I'm using relation in the mathematical or DB sense)
All of these specialized collection types provide specific functionalities that are not adequately or efficiently provided by the "standard" data types of list, tuple, dict, and set.
For example, sometimes you need a collection of unique items, and you also need to retain the order in which you encountered them. You can do this using a set to keep track of membership and a list to keep track of order, but your solution will probably be slower and more memory-hungry than a specialized data structure designed for exactly this purpose, such as an ordered set.
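A minimal sketch of that set-plus-list approach (illustrative only, not a production-grade ordered set):

    class OrderedSet:
        def __init__(self):
            self._seen = set()   # fast membership testing
            self._items = []     # remembered insertion order

        def add(self, item):
            if item not in self._seen:
                self._seen.add(item)
                self._items.append(item)

        def __iter__(self):
            return iter(self._items)

        def __contains__(self, item):
            return item in self._seen

    s = OrderedSet()
    for x in [3, 1, 3, 2, 1]:
        s.add(x)
    print(list(s))  # [3, 1, 2]

(On Python 3.7+, list(dict.fromkeys(items)) gets you the same deduplication-with-order in one line, precisely because dict now keeps insertion order.)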
These additional data types, which you see as combinations or variations on the basic ones, actually fill gaps in functionality left by the basic data types. From a practical perspective, if Python's core or standard library did not provide these data types, then anyone who needed them would invent their own inefficient versions. They are used less often than the basic types, but often enough to make it worth while to provide standard implementations.
One of the things I like most about Python is its agility, and its many functional, effective, and usable collection types are a big part of that.
And there is still one obvious way to do each thing - every type does its own job.
The world of data structures (language agnostic) can generally be boiled down to a few small basic structures - lists, trees, hash-tables and graphs, etc. and variants and combinations thereof. Each has its own specific purpose in terms of use and implementation.
I don't think you can do things like reduce a dictionary to a list of tuples with a particular uniqueness constraint without effectively re-specifying a dictionary. A dictionary has a specific purpose - key/value lookups - and the implementation of the data structure is generally tailored to those needs. Sets are like dictionaries in many ways, but certain operations on sets don't make sense for a dictionary (union, disjunction, etc.).
I don't see this as violating the 'Zen of Python' principle of doing things one way. While you can use a sorted dictionary to do what a dictionary does without using the sorted part, you're more violating Occam's razor and likely incurring a performance penalty. I see this as different from being able to syntactically do things in different ways à la Perl.
The Zen of Python says "There should be one-- and preferably only one --obvious way to do it". It seems to me this profusion of specialized collections types is in conflict with this Python precept.
Not remotely. There are several different things being done here. We choose the right tool for the job. All of these containers are modeled on decades-old tried, tested and true CS concepts.
Dictionaries are not like tuples: they are optimized for key-value lookup. The tuple is also immutable, which distinguishes it from a list (you could think of it as sort of like a frozenlist). If you find yourself converting dictionaries to lists and back, you are almost certainly doing something wrong; an example would help.
Named tuples exist for convenience and are intended to replace simple classes rather than dictionaries, really. Ordered dictionaries are just a bit of wrapping to remember the order in which things were added to the dictionary. And neither is new in 3.x (although there may be better language support for them; I haven't looked).
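For example, a named tuple as a stand-in for a trivial class:

    from collections import namedtuple

    Point = namedtuple("Point", ["x", "y"])
    p = Point(3, 4)
    print(p.x, p.y)    # attribute access, like a simple class
    print(p[0], p[1])  # but still an ordinary (immutable) tuple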