Efficient array lookup in C - python

I'm trying to write a simple language interpreter for a custom language in C. I want to use C over C++ due to C's simplicity.
The things I'm not sure how to do in C are storing variables and looking them up.
I was planning to store variables in an array, but I think I'd need a variable-sized array.
I also don't know an efficient way to lookup variables from an array besides just looping through it.
So I'd like to know: what is an efficient way of creating a variable-sized array? How does Python or Ruby or Go store and retrieve variables efficiently?

How does Python or Ruby or Go store and retrieve variables efficiently?
Python and Ruby use hash tables: the name of the variable is translated into an integer, and that integer is used as an index into an array. It can always happen that several names collide (translate to the same integer), so that needs to be taken into account by allowing several bindings from name to value at the same slot, but there will only be a few to check for each name.
Go is compiled, so the variable is translated to an address (either static or an offset with respect to the stack—or frame—pointer) at compile-time.
what is an efficient way of creating a variable-sized array?
If you decided to do that, you would use malloc and realloc.
In the case of resizing the array of buckets of a hash-table, realloc is unfortunately not useful because all the keys in the old array of buckets need to be re-hashed one by one to find where they go in the new array. If you know the maximum size of programs that will be interpreted by your interpreter, you can allocate the hash-table directly at the size that works for the largest programs, and avoid writing the hash-table resizing function.
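Even though the question is about C, here is a rough Python sketch of the fixed-size, chained hash table described above, just to illustrate the scheme (the bucket count and the put/get helpers are invented for the example). A C version would have the same shape, with each bucket being a small malloc'd list of name/value pairs:

    NUM_BUCKETS = 256  # fixed up front, as suggested, so no resizing is needed

    # each bucket holds a small list of (name, value) bindings,
    # so names that hash to the same slot can coexist
    buckets = [[] for _ in range(NUM_BUCKETS)]

    def put(name, value):
        slot = hash(name) % NUM_BUCKETS
        for i, (key, _) in enumerate(buckets[slot]):
            if key == name:                      # rebind an existing variable
                buckets[slot][i] = (name, value)
                return
        buckets[slot].append((name, value))      # first binding for this name

    def get(name):
        slot = hash(name) % NUM_BUCKETS
        for key, value in buckets[slot]:         # only a few collisions to scan
            if key == name:
                return value
        raise KeyError(name)

    put("x", 42)
    print(get("x"))  # 42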

I think you can get really carried away when trying to implement variable storage yourself. I would recommend you use an existing hashmap such as uthash, just to see how it works out for you conceptually, and encapsulate it as well as possible. If it turns out to be a bottleneck, you can come back and optimize later.
I am fairly confident that, at that point, you will not pick a dynamically expanding array. Consider that you need a string-based search to find a variable by name, so a plain array will have a hard time beating a hashmap: searching the array is O(n) if unsorted and O(log n) if sorted, whereas the hashmap has O(1) average search complexity.

Related

Efficient circular buffer with constant-time access

In a machine learning project written in Python, I need an efficient circular buffer like collections.deque, but with constant-time access to any element like numpy.array. The problem is that deque is apparently a linked list. Is there something efficient already implemented in a Python library that I'm not aware of for this use case?
I could simply use a fixed-size numpy.array with a moving index 0 in my use case, I guess, but I'm asking for my own Python culture, as it is not the first time I've needed something like this.
collections.deque is not exactly a linked-list. It's a doubly-linked list of arrays of size 64. I'd say it's a pretty decent choice when you want both the random-access and appending on both ends without constant reallocation.
However, if you've done proper performance profiling and that circular buffer is really your bottle-neck then you can implement the buffer in plain C for performance and add bindings to python.
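If you do end up rolling your own, a minimal sketch of the "fixed-size numpy.array with a moving index 0" idea from the question might look like this (the class name and API are invented for the example):

    import numpy as np

    class RingBuffer:
        # fixed-size circular buffer over a numpy array, O(1) random access
        def __init__(self, capacity, dtype=float):
            self._data = np.zeros(capacity, dtype=dtype)
            self._start = 0          # index of the logical element 0
            self._size = 0

        def append(self, value):
            end = (self._start + self._size) % len(self._data)
            self._data[end] = value
            if self._size < len(self._data):
                self._size += 1
            else:                    # buffer full: overwrite the oldest element
                self._start = (self._start + 1) % len(self._data)

        def __getitem__(self, i):    # constant-time access to any element
            if not 0 <= i < self._size:
                raise IndexError(i)
            return self._data[(self._start + i) % len(self._data)]

        def __len__(self):
            return self._size

    buf = RingBuffer(3)
    for x in (1, 2, 3, 4):
        buf.append(x)
    print(buf[0], buf[1], buf[2])    # 2.0 3.0 4.0 -- the oldest element was overwritten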

Python inverted index efficiency

I am writing some Python code to implement some of the concepts I have recently been learning, related to inverted indices / postings lists. I'm quite new to Python and am having some trouble understanding its efficiencies in some cases.
Theoretically, creating an inverted index of a set of documents D, each with a unique ID doc_id should involve:
1. Parsing / performing lexical analysis of each document in D
2. Removing stopwords, performing stemming etc.
3. Creating a list of all (word, doc_id) pairs
4. Sorting the list
5. Condensing duplicates into {word: [set_of_all_doc_ids]} (inverted index)
Step 5 is often carried out by having a dictionary containing the word with meta-data (term frequency, byte offsets) and a pointer to the postings list (list of documents it occurs in). The postings list is often implemented as a data structure which allows efficient random insert, i.e. a linked list.
My problem is that Python is a higher-level language, and direct use of things like memory pointers (and therefore linked lists) seems to be out of scope. I am optimising before profiling because for very large data sets it is already known that efficiency must be maximised to retain any kind of ability to calculate the index in a reasonable time.
Several other posts exist here on SO about Python inverted indices and, like my current implementation, they use dictionaries mapping keys to lists (or sets). Should one expect this method to have performance similar to that of a language which allows direct coding of pointers to linked lists?
There are a number of things to say:
If random access is required for a particular list implementation, a linked list is not optimal (regardless of the programming language used). To access the ith element of the list, a linked list requires you to iterate all the way from the 0th to the ith element. Instead, the list should be stored as one continuous block (or several large blocks if it is very long). Python lists [...] are stored in this way, so for a start, a Python list should be good enough.
In Python, any assignment a = b of an object b that is not a basic data type (such as int or float), is performed internally by passing a pointer and incrementing the reference count to b. So if b is a list or a dictionary (or a user-defined class, for that matter), this is in principle not much different from passing a pointer in C or C++.
However, there is obviously some overhead caused by a) reference counting and b) garbage collection. If the implementation is for study purposes, i.e. to understand the concepts of inverted indexing better, I would not worry about that. But for a serious, highly-optimized implementation, using pure Python (rather than, e.g. C/C++ embedded into Python) is not advisable.
As you optimise the implementation of your postings list further, you will probably see the need to a) make random inserts, b) keep it sorted and c) keep it compressed - all at the same time. At that point, the standard Python list won't be good enough any more, and you might want to look into implementing a more optimised list representation in C/C++ and embed it into Python. However, even then, sticking to pure Python would probably be possible. E.g. you could use a large string to implement the list and use itertools and buffer to access specific parts in a way that is, to some extent, similar to pointer arithmetic.
One thing that you should always keep in mind when dealing with strings in Python is that, despite what I said above about assignment operations, the substring operation text[i:j] involves creating an actual (deep) copy of the substring, rather than merely incrementing a reference count. This can be avoided by using the buffer data type mentioned above.
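To make the dictionary-of-lists approach mentioned above concrete, here is a small sketch (the sample documents are made up); bisect gives you random inserts into a postings list that stays sorted:

    import bisect
    from collections import defaultdict

    documents = {
        1: "the quick brown fox",
        2: "the lazy brown dog",
    }

    index = defaultdict(list)   # word -> sorted list of doc_ids (postings list)

    for doc_id, text in documents.items():
        for word in set(text.lower().split()):   # one posting per (word, doc_id)
            postings = index[word]
            pos = bisect.bisect_left(postings, doc_id)
            if pos == len(postings) or postings[pos] != doc_id:
                postings.insert(pos, doc_id)     # random insert, list stays sorted

    print(index["brown"])   # [1, 2]
    print(index["lazy"])    # [2]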
You can see the code and documentation for an inverted index in Python at: http://www.ssiddique.info/creation-of-inverted-index-and-use-of-ranking-algorithm-python-code.html
Soon I will be coding it in C++.

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance related in python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this only describes asymptotic behavior: for sufficiently large n, O(1) will always be faster, theoretically. In practice, however, n often needs to be much bigger than your usual data set before that advantage shows up.
So sets are not faster than lists per se, but only once you have to handle a lot of elements.
Python uses hashtables, which have O(1) lookup.
Basically, it depends on the operation you are doing:
* For adding an element, a set doesn't need to move any data; all it needs to do is calculate a hash value and add an entry to a table. For a list insertion, there is potentially data to be moved.
* For deleting an element, all a set needs to do is remove the entry from the hash table; a list potentially needs to move data around (on average half of the data).
* For a search (i.e. the in operator), a set just needs to calculate the hash value of the item, look that hash value up in the hash table, and if it is there, bingo. For a list, the search has to check each item in turn, on average half of all of the items in the list. Even for many thousands of items, a set will be far quicker to search.
Actually, sets are not faster than lists in every scenario. Generally, lists are faster than sets. But when searching for an element in a collection, sets are faster because they are implemented using hash tables. So basically Python does not have to search the full set, which means the average time complexity is O(1). Lists use dynamic arrays, and Python needs to check the full array to search, so it takes O(n).
So finally we can see that sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure according to our task.
A list must be searched one by one, whereas a set or dictionary has an index for faster searching.
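A quick, unscientific way to see the difference for membership testing is timeit (absolute numbers will vary by machine):

    import timeit

    setup = "data = list(range(100000)); s = set(data)"

    list_time = timeit.timeit("99999 in data", setup=setup, number=1000)
    set_time = timeit.timeit("99999 in s", setup=setup, number=1000)

    print("list membership:", list_time)   # O(n): scans the whole list
    print("set membership: ", set_time)    # O(1) on average: one hash lookup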

python dictionary with constant value-type

I bumped into a case where I need a big (=huge) Python dictionary, which turned out to be quite memory-consuming.
However, since all of the values are of a single type (long), as well as the keys, I figured I can use a Python (or numpy, doesn't really matter) array for the values, and wrap the needed interface (in: x; out: d[x]) with an object which actually uses these arrays for the key and value storage.
I can use an index-conversion object (input --> index in 1..n, where n is the count of distinct values), and return array[index]. I can elaborate on some techniques for implementing such an indexing method with a reasonable memory requirement; it works, and even pretty well.
However, I wonder whether such a data-structure object already exists (in Python, or wrapped for Python from C/C++), in any package (I checked collections, and did some Google searches).
Any comment will be welcome, thanks.
This kind of task is a typical database-type access (large volume of data in columns of a given type). You would create a simple table with indexed keys, for fast access. I don't have experience with it, but you might want to check out the standard sqlite3 module.
If your keys do not change over time, you could alternatively put all your data in two Python memory-optimized arrays (standard array module); one array contains the sorted keys, and the other one the corresponding values. You could then find key indexes through the optimized bisect.bisect function.
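A rough sketch of that two-array idea, assuming the keys never change (the typecodes, sample data and the lookup helper are arbitrary):

    import array
    import bisect

    pairs = sorted([(10, 100), (3, 30), (7, 70)])       # (key, value), made up
    keys = array.array('q', (k for k, _ in pairs))      # sorted 64-bit int keys
    values = array.array('q', (v for _, v in pairs))    # values at matching slots

    def lookup(key):
        i = bisect.bisect_left(keys, key)               # binary search, O(log n)
        if i != len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)

    print(lookup(7))   # 70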
You might try using std::map. Boost.Python provides a Python wrapping for std::map out-of-the-box.

how fast is python's slice

In order to save space and the complexity of having to maintain the consistency of data between different sources, I'm considering storing start/end indices for some substrings instead of storing the substrings themselves. The trick is that if I do so, it's possible I'll be creating slices ALL the time. Is this something to be avoided? Is the slice operator fast enough I don't need to worry? How about the new object creation/destruction overhead?
Okay, I learned my lesson. Don't optimize unless there's a real problem you're trying to fix. (Of course this doesn't mean to write needlessly bad code, but that's beside the point...) Also, test and profile before coming to Stack Overflow. =D Thanks everyone!
Fast enough as opposed to what? How do you do it right now? What exactly are you storing, what exactly are you retrieving? The answer probably highly depends on this. Which brings us to ...
Measure! Don't discuss and analyze theoretically; try and measure what is the more performant way. Then decide whether the possible performance gain justifies refactoring your database.
Edit: I just ran a test measuring string slicing versus lookup in a dict keyed on (start, end) tuples. It suggests that there's not much of a difference. It's a pretty naive test, though, so take it with a pinch of salt.
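The test itself isn't shown, but a reconstruction of that kind of comparison might look roughly like this (string size, offsets and repeat counts are arbitrary):

    import timeit

    setup = "text = 'x' * 1000000; cache = {(500, 550): text[500:550]}"

    slice_time = timeit.timeit("text[500:550]", setup=setup, number=1000000)
    dict_time = timeit.timeit("cache[(500, 550)]", setup=setup, number=1000000)

    print("slice lookup:", slice_time)
    print("dict lookup: ", dict_time)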
In a comment the OP mentions bloat "in the database" -- but no information regarding what database he's talking about; from the scant information in that comment it would seem that Python string slices aren't necessarily what's involved; rather, the "slicing" would be done by the DB engine upon retrieval.
If that's the actual situation then I would recommend on general principles against storing redundant information in the DB -- a "normal form" (maybe in a lax sense of the expression;-) whereby information is stored just once and derived information is recomputed (or cached, in the charge of the DB engine, etc;-) should be the norm, and "denormalization" by deliberately storing derived information very much the exception, and only when justified by specific, well-measured retrieval-performance needs.
If the reference to "database" was a misdirection;-), or rather used in a lax sense as I did for "normal form" above;-), then another consideration may apply: since Python strings are immutable, it would seem natural to not have to do slices by copying, but rather have each slice reuse part of the memory space of the parent it's being sliced from (much as is done for numpy arrays' slices). However, that's not currently part of the Python core. I did once try a patch to that purpose, but the problem of adding a reference to the big string, and thus making it stay in memory just because a tiny substring thereof is still referenced, loomed large for general-purpose adoption. Still, it would be possible to make a special-purpose subclass of string (and one of unicode) for the case in which the big "parent" string needs to stay in memory anyway. Currently buffer does a tiny bit of that, but you can't call string methods on a buffer object (without explicitly copying it to a string object first), so it's only really useful for output and a few special cases... but there's no real conceptual block against adding string methods (I doubt that would be adopted in the core, but it should be decently easy to maintain as a third-party module anyway;-).
The worth of such an approach can hardly be solidly proven by measurement, one way or another -- speed would be very similar to the current implicitly-copying approach; the advantage would come entirely in terms of reducing memory footprint, which wouldn't so much make any given Python code faster, but rather allow a certain program to execute on a machine with a bit less RAM, or multi-task better when several instances are being used at the same time in separate processes. See rope for a similar but richer approach once experimented with in the context of C++ (but note it didn't make it into the standard;-).
I haven't done any measurements either, but since it sounds like you're already taking a C approach to a problem in Python, you might want to take a look at Python's built-in mmap library:
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file. Since they’re mutable, you can change a single character by doing obj[index] = 'a', or change a substring by assigning to a slice: obj[i1:i2] = '...'. You can also read and write data starting at the current file position, and seek() through the file to different positions.
I'm not sure from your question if that's exactly what you're looking for. And it bears repeating that you need to take some measurements. Python's timeit library is the easy one to use, but there's also cProfile or hotshot, although hotshot is at risk of being removed from the standard library as I understand it.
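A tiny usage sketch of that mmap approach (the file name and the search pattern are invented for the example):

    import mmap
    import re

    with open("data.txt", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            chunk = mm[100:150]                  # slicing reads straight from the file
            match = re.search(rb"needle", mm)    # re works on the mapped object too
            if match:
                print(match.start(), match.group())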
Would slices be ineffective because they create copies of the source string? This may or may not be an issue. If it turns out to be an issue, would it not be possible to simply implement a "string view": an object that has a reference to the source string and a start and end point? Upon access/iteration, it just reads from the source string.
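A minimal sketch of such a "string view" (the class name and methods are invented; a real version would also want slicing, comparison, hashing, etc.):

    class StringView:
        __slots__ = ("source", "start", "end")

        def __init__(self, source, start, end):
            self.source, self.start, self.end = source, start, end

        def __len__(self):
            return self.end - self.start

        def __getitem__(self, i):               # read through to the source string
            if not 0 <= i < len(self):
                raise IndexError(i)
            return self.source[self.start + i]

        def __iter__(self):
            return (self.source[i] for i in range(self.start, self.end))

        def __str__(self):                      # copies only when explicitly asked
            return self.source[self.start:self.end]

    text = "some very long document ..."
    view = StringView(text, 5, 9)
    print(str(view), len(view), view[0])        # very 4 v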
Premature optimization is the root of all evil.
Prove to yourself that you really have a need to optimize code, then act.
