Which data structure is appropriate for this? - python

I have a line in my code that currently does this at each step x:
myList = [(lo,hi) for lo,hi in myList if lo <= x <= hi]
This is pretty slow. Is there a more efficient way to eliminate things from a list that don't contain a given x?

Perhaps you're looking for an interval tree. From Wikipedia:
In computer science, an interval tree is an ordered tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point.
So, instead of storing the (lo, hi) pairs sequentially in a list, you would have them define the intervals in an interval tree. Then you could perform queries on the tree with x, and retain only the intervals that overlap x.
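For example, a rough sketch using the third-party intervaltree package (an assumption on my part, pip install intervaltree; its intervals are half-open, so an inclusive integer bound hi becomes hi + 1):
from intervaltree import IntervalTree

pairs = [(1, 5), (3, 8), (10, 12)]                       # example (lo, hi) pairs
tree = IntervalTree.from_tuples((lo, hi + 1) for lo, hi in pairs)

x = 4
surviving = [(iv.begin, iv.end - 1) for iv in tree[x]]   # only intervals containing x
print(surviving)                                         # [(1, 5), (3, 8)], in some order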

While you don't give much context, I'll assume the rest of the loop looks like:
for x in xlist:
myList = [(lo,hi) for lo,hi in myList if lo <= x <= hi]
In this case, it may be more efficient to construct an interval tree (http://en.wikipedia.org/wiki/Interval_tree) first. Then, for each x you walk the tree and find all intervals which intersect with x; add these intervals to a set as you find them.
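A sketch of that loop, again assuming the third-party intervaltree package (the tree is built once, then each x is a point query and the hits are collected in a set):
from intervaltree import IntervalTree

myList = [(1, 5), (3, 8), (10, 12)]      # example (lo, hi) pairs
xlist = [4, 11]

tree = IntervalTree.from_tuples((lo, hi + 1) for lo, hi in myList)   # half-open ends
hits = set()
for x in xlist:
    for iv in tree[x]:                   # every interval that contains this x
        hits.add((iv.begin, iv.end - 1))
print(hits)                              # {(1, 5), (3, 8), (10, 12)}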

Here I'm going to suggest what may seem like a really dumb solution favoring micro-optimizations over algorithmic ones. It'll depend on your specific needs.
The ultimate question is this: is a single linear pass over your array (list in Python), on average, expensive? In other words, is searching for lo/hi pairs that contain x generally going to yield results that are very small (e.g. 1% of the overall size of the list), or relatively large (e.g. 25% or more of the original list)?
If the answer is the latter, you might actually get a more efficient solution keeping a basic, contiguous, cache-friendly representation that you're accessing sequentially. The hardware cache excels at plowing through contiguous data where multiple adjacent elements fit into a cache line sequentially.
What you want to avoid in such a case is the expensive linear-time removal from the middle of the array as well as possibly the construction of a new one. If you trigger a linear-time operation for every single individual element you remove from the array, then naturally that's going to get very expensive very quickly.
To exchange that linear-time operation for a much faster constant-time one, all we have to do when we want to remove an element at a certain index in the array is to overwrite the element at that index with the element at the back of the array (last element). Now simply remove the redundant duplicate from the back of the array (a removal from the back of an array is a constant-time operation, and often involves just basic arithmetical instructions).
If your needs fit that description, then this can actually give you better results than a smarter algorithm. It's one of the peculiar cases where practice can trump theory because the hardware cache is so much faster than scattered DRAM access, but if you're performing these types of hi/lo queries repeatedly and want very narrow results, then something smarter like an interval tree, or at least sorting the data to allow binary searches, can be considerably better.
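A minimal sketch of the swap-with-last removal described above (the function name is just for illustration; note that it does not preserve the order of the surviving pairs):
def filter_in_place(pairs, x):
    """Keep only the (lo, hi) pairs with lo <= x <= hi, without shifting the middle."""
    i = 0
    while i < len(pairs):
        lo, hi = pairs[i]
        if lo <= x <= hi:
            i += 1
        else:
            pairs[i] = pairs[-1]   # overwrite with the last element...
            pairs.pop()            # ...and drop the duplicate at the back, O(1)
    return pairs

print(filter_in_place([(1, 5), (6, 9), (2, 7)], 4))   # [(1, 5), (2, 7)]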

Related

searching an unsorted list of elements in python

Apart from binary search, is there any other algorithm that needs fewer comparisons?
Further, binary search only works on a sorted list. What if the elements are unsorted?
If the number of elements (n) is large, the run time would be high if I opt to sort the list and then run a binary search on it.
Is there any other alternative?
Sorting has a cost of O(n log n) in the average case if you use Timsort, Python's default sorting algorithm, so it's only worth sorting if you are going to perform many searches and the list is not going to receive new elements, since keeping it ordered after each new element costs O(n).
On the other hand, since you have to look at every value individually in an unsorted list, I don't think there is a better way, unless you use parallel programming so that several threads can examine different values at the same time.
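As a small illustration of that trade-off, here is a sketch that pays the sort once and then answers each search with the stdlib bisect module:
import bisect

data = [42, 7, 19, 3, 88]
data.sort()                                  # O(n log n), paid once

def contains(sorted_seq, value):
    i = bisect.bisect_left(sorted_seq, value)
    return i < len(sorted_seq) and sorted_seq[i] == value

print(contains(data, 19))   # True
print(contains(data, 5))    # False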

Iterative Divide and Conquer algorithms

I am trying to create an algorithm using the divide-and-conquer approach but using an iterative algorithm (that is, no recursion).
I am confused as to how to approach the loops.
I need to break up my problems into smaller sub problems, until I hit a base case. I assume this is still true, but then I am not sure how I can (without recursion) use the smaller subproblems to solve the much bigger problem.
For example, I am trying to come up with an algorithm that will find the closest pair of points (in one-dimensional space - though I intend to generalize this on my own to higher dimensions). If I had a function closest_pair(L) where L is a list of integer co-ordinates in ℝ, how could I come up with a divide and conquer ITERATIVE algorithm that can solve this problem?
(Without loss of generality I am using Python)
The cheap way to turn any recursive algorithm into an iterative algorithm is to take the recursive function, put it in a loop, and use your own stack. This eliminates the function-call overhead and avoids saving unneeded data on the call stack. However, this is not usually the "best" approach ("best" depends on the problem and context).
The way you've worded your problem, it sounds like the idea is to break the list into sublists, find the closest pair in each, and then take the closest pair out of those two results. To do this iteratively, I think a better approach than the generic one mentioned above is to start the other way around: look at lists of size 3 (there are three pairs to look at) and work your way up from there. Note that lists of size 2 are trivial.
Lastly, if your coordinates are integers, they are in ℤ (a much smaller subset of ℝ).
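To make the explicit-stack idea concrete, here is a hedged sketch for the 1-D case (it assumes the points are sorted first; subranges go on a stack instead of into recursive calls, and a running minimum plays the role of the combine step):
def closest_pair_1d(points):
    pts = sorted(points)
    best = float("inf")
    stack = [(0, len(pts))]                  # half-open index ranges
    while stack:
        lo, hi = stack.pop()
        if hi - lo <= 3:                     # base case: brute-force the few pairs
            for i in range(lo, hi):
                for j in range(i + 1, hi):
                    best = min(best, pts[j] - pts[i])
        else:
            mid = (lo + hi) // 2
            best = min(best, pts[mid] - pts[mid - 1])   # pair straddling the split
            stack.append((lo, mid))
            stack.append((mid, hi))
    return best

print(closest_pair_1d([7, 1, 19, 4, 30]))    # 3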

Nested list and dictionary efficiency

I'm working on a project that requires a 2D map with a list for every possible x and y coordinate on that map. Seeing as though the map dimensions are constant, which is faster for creation, searching and changing values of?
Let's say that I have a 2x2 grid with a total of 4 positions, each storing 2 bits (0, 1, 2 or 3). Would having "[0b00, 0b00, 0b00, 0b01]" represent the grid be better than "[[0b00, 0b00], [0b00, 0b01]]" in terms of efficiency and readability?
I assumed that the first method would be quicker at creation and at iterating over all of the values, but that the second would be faster for finding the value of a certain position (so listName[1][0] is easier to work out than listName[2]).
To clarify, I want to know which is both more memory efficient and CPU efficient for the 3 listed uses and (if it isn't too much trouble) why. Further, the actual lists I'm using are 4096x4096 (using a total of about 17 MB in raw data).
Note: I DO already plan on splitting the 4096x4096 grid into sectors that will be part of a nested list, I'm just asking if x and y should be on the same nesting level.
Thanks.
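For what it's worth, a tiny illustration of the two layouts in the question (the width constant and variable names are mine):
WIDTH = 2

flat = [0b00, 0b00, 0b00, 0b01]              # one list, row-major order
nested = [[0b00, 0b00], [0b00, 0b01]]        # one inner list per row

x, y = 0, 1
print(flat[y * WIDTH + x])                   # 0
print(nested[y][x])                          # same cell, written as row then column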

What makes sets faster than lists?

The python wiki says: "Membership testing with sets and dictionaries is much faster, O(1), than searching sequences, O(n). When testing "a in b", b should be a set or dictionary instead of a list or tuple."
I've been using sets in place of lists whenever speed is important in my code, but lately I've been wondering why sets are so much faster than lists. Could anyone explain, or point me to a source that would explain, what exactly is going on behind the scenes in python to make sets faster?
list: Imagine you are looking for your socks in your closet, but you don't know in which drawer your socks are, so you have to search drawer by drawer until you find them (or maybe you never do). That's what we call O(n), because in the worst scenario, you will look in all your drawers (where n is the number of drawers).
set: Now, imagine you're still looking for your socks in your closet, but now you know in which drawer your socks are, say in the 3rd drawer. So, you will just search in the 3rd drawer, instead of searching in all drawers. That's what we call O(1), because in the worst scenario you will look in just one drawer.
Sets are implemented using hash tables. Whenever you add an object to a set, the position within the memory of the set object is determined using the hash of the object to be added. When testing for membership, all that needs to be done is basically to look if the object is at the position determined by its hash, so the speed of this operation does not depend on the size of the set. For lists, in contrast, the whole list needs to be searched, which will become slower as the list grows.
This is also the reason that sets do not preserve the order of the objects you add.
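A deliberately simplified illustration of that idea (CPython's real sets use open addressing and resize automatically, so this is only a toy model of why the lookup cost stays flat):
buckets = [[] for _ in range(8)]             # tiny fixed-size table

def toy_add(value):
    buckets[hash(value) % len(buckets)].append(value)

def toy_contains(value):
    return value in buckets[hash(value) % len(buckets)]   # only one bucket is scanned

for word in ["sock", "shirt", "hat"]:
    toy_add(word)
print(toy_contains("sock"), toy_contains("scarf"))        # True False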
Note that sets aren't faster than lists in general -- membership test is faster for sets, and so is removing an element. As long as you don't need these operations, lists are often faster.
I think you need to take a good look at a book on data structures. Basically, Python lists are implemented as dynamic arrays and sets are implemented as hash tables.
The implementation of these data structures gives them radically different characteristics. For instance, a hash table has a very fast lookup time but cannot preserve the order of insertion.
While I have not measured anything performance related in python so far, I'd still like to point out that lists are often faster.
Yes, you have O(1) vs. O(n). But always remember that this only tells you about the asymptotic behavior. That means if your n is very high, O(1) will always be faster - theoretically. In practice, however, the n at which the constant-time structure starts to win is often much bigger than your usual data set will be.
So sets are not faster than lists per se, but only if you have to handle a lot of elements.
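If you want to check that claim on your own machine, here is a rough measurement sketch with the stdlib timeit module (the sizes and repetition counts are arbitrary):
import timeit

for n in (5, 100000):
    setup = "data = list(range(%d)); s = set(data); target = %d" % (n, n - 1)
    t_list = timeit.timeit("target in data", setup=setup, number=100000)
    t_set = timeit.timeit("target in s", setup=setup, number=100000)
    print(n, round(t_list, 4), round(t_set, 4))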
Python uses hashtables, which have O(1) lookup.
Basically, it depends on the operation you are doing…
* For adding an element: a set doesn't need to move any data; all it needs to do is calculate a hash value and add the entry to a table. For a list insertion there is potentially data to be moved.
* For deleting an element: all a set needs to do is remove the entry from the hash table; a list potentially needs to move data around (on average half of the data).
* For a search (i.e. the in operator): a set just needs to calculate the hash value of the item and look that hash value up in the hash table; if it is there, bingo. For a list, the search has to check each item in turn - on average half of all of the items in the list. Even for many thousands of items a set will be far quicker to search.
Actually, sets are not faster than lists in every scenario. Generally lists are faster than sets. But when searching for an element in a collection, sets are faster because they are implemented using hash tables, so Python does not have to search the full set; the average time complexity is O(1). Lists use dynamic arrays and Python needs to check the full array to search, which takes O(n).
So finally we can see that sets are better in some cases and lists are better in others. It's up to us to select the appropriate data structure for the task.
A list must be searched element by element, whereas a set or dictionary has a hash-based index for faster searching.

Best data-structure to use for two ended sorted list

I need a collection data-structure that can do the following:
Be sorted
Allow me to quickly pop values off the front and back of the list, in O(log n)
Remain sorted after I insert a new value
Allow a user-specified comparison function, as I will be storing tuples and want to sort on a particular value
Thread-safety is not required
Optionally allow efficient haskey() lookups (I'm happy to maintain a separate hash-table for this though)
My thoughts at this stage are that I need a priority queue and a hash table, although I don't know if I can quickly pop values off both ends of a priority queue. Another possibility is simply maintaining an OrderedDictionary and doing an insertion sort every time I add more data to it.
Because I'm interested in performance for a moderate number of items (I would estimate less than 200,000), I am unsure about what asymptotic performance I require for these operations. n will not grow without bound, so a low constant factor k in k * O(n) may be as important as the O(n) itself. That said, I would prefer that both the insert and pop operations take O(log n) time.
Furthermore, are there any particular implementations in Python? I would really like to avoid writing this code myself.
You might get good performance for these kinds of operations using blist or a database (such as the sqlite which is in the stdlib).
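A rough sketch of the sqlite route (stdlib only; the table and column names here are made up). An index keeps the data ordered, and popping either end becomes an ORDER BY ... LIMIT 1 plus a DELETE:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (sort_key INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_sort ON items (sort_key)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(5, "e"), (1, "a"), (9, "i")])

def pop_front(c):                    # use "ORDER BY sort_key DESC" for pop_back
    row = c.execute("SELECT rowid, sort_key, payload FROM items "
                    "ORDER BY sort_key LIMIT 1").fetchone()
    if row is not None:
        c.execute("DELETE FROM items WHERE rowid = ?", (row[0],))
    return row

print(pop_front(conn))               # (2, 1, 'a') -- rowid, key, payload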
I suggest some sort of balanced binary tree such as a red-black tree.
A search on PyPi throws up a couple of implementations. Searching on google will give you more.
bintrees on PyPi looks very complete and has both Python and C/Cython implementations. I have not used it though, so caveat emptor.
A red-black tree is kept sorted and most operations (insert, delete, find) are O(log2(N)), so finding an element in a tree of 200,000 entries will take on average 17-18 comparisons.
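For example, a short sketch with bintrees (hedged: I'm going from its documented API, so check it against the version you install). The field you sort on becomes the key and the whole tuple the value, which also gives a haskey()-style test via "in"; note that duplicate sort keys would overwrite each other here:
from bintrees import RBTree

tree = RBTree()
for record in [("job-a", 5), ("job-b", 1), ("job-c", 9)]:
    tree.insert(record[1], record)   # sort on the second field of each tuple

print(tree.min_item())               # (1, ('job-b', 1))
print(tree.pop_max())                # (9, ('job-c', 9)), removed from the tree
print(5 in tree)                     # True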
Sounds like a skip list will fulfill all your requirements. It's basically a dynamically-sized sorted linked list, with O(log n) insertions and removals.
I don't really know Python, but this link seems to be relevant:
http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/
I presume you need it sorted because you access elements by rank in the sorted order?
You can use any implementation of any balanced binary tree, with the additional information at each node which tells you the numbers of descendants of that node (usually called the Order Statistic Binary Tree).
With this structure, given the rank of an element (even min/max), you can access/delete it in O(log n) time.
This makes all operations (access/insert/delete by rank, pop front/back, insert/delete/search by value) O(log n) time, while allowing custom sort methods.
Also, apparently python has an AVL tree (one of the first balanced tree structures) implementation which supports order statistics: http://www.python.org/ftp/python/contrib-09-Dec-1999/DataStructures/avl.README
So you won't need a custom implementation.
Except for the hashing, what you're looking for is a double-ended priority queue, aka a priority deque.
If your need for sorting doesn't extend beyond managing the min and max of your data, another structure for you to look at might be an interval heap, which has the advantage of O(1) lookup of both min and max if you need to peek at values (though deleteMin and deleteMax are still O(log(N)) ). Unfortunately, I'm not aware of any implementations in Python, so I think you'd have to roll your own.
Here's an addendum to an algorithms textbook that describes interval heaps if you're interested:
http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf
If you can really allow O(log n) for pop, dequeue, and insert, then a simple balanced search tree like red-black tree is definitely sufficient.
You can of course optimize this by maintaining a direct pointer to the smallest and largest element in the tree, and then updating it when you (1) insert elements into the tree or (2) pop or dequeue, which of course invalidates the respective pointer. But because the tree is balanced, there's some rebalancing going on anyway, and you can update the corresponding pointer at the same time.
There is also something called a min-max heap (see the Wikipedia entry for Binary Heap), which implements exactly a "double-ended priority queue", i.e. a queue where you can pop from both the front end and the rear end. However, there you can't access the whole list of objects in order, whereas a search tree can be iterated through efficiently in O(n) time.
The benefit of a min-max heap however is that the current min and max objects can be read in O(1) time, a search tree requires O(log(n)) just to read the min or max object unless you have the cached pointers as I mentioned above.
If this were Java I'd use a TreeSet with the NavigableSet interface.
This is implemented as a Red-Black-Tree.
