Python equivalent to java.util.SortedSet? - python

Does anybody know if Python has an equivalent to Java's SortedSet interface?
Here's what I'm looking for: let's say I have an object of type foo, and I know how to compare two objects of type foo to see whether foo1 is "greater than" or "less than" foo2. I want a way of storing many objects of type foo in a list L, so that whenever I traverse the list L, I get the objects in order, according to the comparison method I define.
Edit:
I guess I can use a dictionary or a list and sort() it every time I modify it, but is this the best way?

Take a look at BTrees. It looks like you need one of them. As far as I understand, you need a structure that supports relatively cheap insertion of elements and cheap sorting (or no sorting step at all). BTrees offer that.
I've experience with ZODB.BTrees, and they scale to thousands and millions of elements.

You can use insort from the bisect module to insert new elements efficiently in an already sorted list:
from bisect import insort
items = [1,5,7,9]
insort(items, 3)
insort(items, 10)
print(items)  # -> [1, 3, 5, 7, 9, 10]
Note that this does not directly correspond to SortedSet, because it uses a list. If you insert the same item more than once you will have duplicates in the list.
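To get closer to set semantics with only the standard library, one option is to check for an existing equal item before inserting. The following is a minimal sketch, not a full SortedSet replacement; the class name SortedUniqueList is hypothetical, and it assumes the items are mutually comparable with < and ==:

```python
from bisect import bisect_left


class SortedUniqueList:
    """Hypothetical helper: a sorted list that ignores duplicate inserts."""

    def __init__(self, iterable=()):
        self._items = []
        for item in iterable:
            self.add(item)

    def add(self, item):
        # Binary search for the leftmost position where item could go.
        i = bisect_left(self._items, item)
        # Insert only if an equal item is not already at that position.
        if i == len(self._items) or self._items[i] != item:
            self._items.insert(i, item)

    def __contains__(self, item):
        i = bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


items = SortedUniqueList([1, 5, 7, 9])
items.add(3)
items.add(5)   # already present, so this is ignored
items.add(10)
print(list(items))  # [1, 3, 5, 7, 9, 10]
```

The search is O(log n), but list.insert still shifts elements, so each add is O(n) in the worst case.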

If you're looking for an implementation of an efficient container type for Python implemented using something like a balanced search tree (A Red-Black tree for example) then it's not part of the standard library.
I was able to find this, though:
http://www.brpreiss.com/books/opus7/
The source code is available here:
http://www.brpreiss.com/books/opus7/public/Opus7-1.0.tar.gz
I don't know how the source code is licensed, and I haven't used it myself, but it would be a good place to start looking if you're not interested in rolling your own container classes.
There's PyAVL which is a C module implementing an AVL tree.
Also, this thread might be useful to you. It contains a lot of suggestions on how to use the bisect module to enhance the existing Python dictionary to do what you're asking.
Of course, using insort() that way would be pretty expensive for insertion and deletion, so consider it carefully for your application. Implementing an appropriate data structure would probably be a better approach.
In any case, to understand whether you should keep the data structure sorted or sort it when you iterate over it you'll have to know whether you intend to insert a lot or iterate a lot. Keeping the data structure sorted makes sense if you modify its content relatively infrequently but iterate over it a lot. Conversely, if you insert and delete members all the time but iterate over the collection relatively infrequently, sorting the collection of keys before iterating will be faster. There is no one correct approach.
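As a rough illustration of the "sort only when you iterate" side of that trade-off, here is a small sketch. The class name LazySorted is made up for this example, and it assumes the items are mutually comparable:

```python
class LazySorted:
    """Hypothetical container: cheap appends, sorts only when iterated."""

    def __init__(self):
        self._items = []
        self._dirty = False

    def add(self, item):
        self._items.append(item)  # amortized O(1), no ordering work yet
        self._dirty = True

    def __iter__(self):
        # Pay the O(n log n) sorting cost only if something changed
        # since the last traversal.
        if self._dirty:
            self._items.sort()
            self._dirty = False
        return iter(self._items)


c = LazySorted()
for x in [5, 1, 9, 3]:
    c.add(x)
print(list(c))  # [1, 3, 5, 9]
```

This favors workloads with many inserts between traversals; with frequent traversals and few inserts, keeping the list sorted on every insert is likely the better fit.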

Similar to blist.sortedlist, the sortedcontainers module provides a sorted list, sorted set, and sorted dict data type. It uses a modified B-tree in the underlying implementation and is faster than blist in most cases.
The sortedcontainers module is pure-Python so installation is easy:
pip install sortedcontainers
Then for example:
from sortedcontainers import SortedList, SortedDict, SortedSet
help(SortedList)
The sortedcontainers module has 100% test coverage and has been through hours of stress testing. There's a pretty comprehensive performance comparison that lists most of the options you'd consider for this.

If you only need the keys, and no associated value, Python offers sets:
s = set(a_list)
for k in sorted(s):
    print(k)
However, you'll be sorting the set each time you do this.
If that is too much overhead, you may want to look at heap queues (the heapq module). They may not be as elegant and "Pythonic", but they may suit your needs.
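A short sketch of the heap queue idea, using only the standard heapq module. Note that a heap is not a fully sorted list: only the smallest item is guaranteed to be at index 0, and you get sorted order by popping repeatedly.

```python
import heapq

heap = []
for value in [7, 2, 11, 3, 5]:
    heapq.heappush(heap, value)  # O(log n) per push

# heap[0] is always the smallest item currently stored.
smallest = heap[0]

# Popping repeatedly yields the items in ascending order and empties the heap.
in_order = [heapq.heappop(heap) for _ in range(len(heap))]
print(smallest, in_order)  # 2 [2, 3, 5, 7, 11]
```

This suits cases where you consume items in order once; it is less convenient if you need to traverse the collection in order many times without emptying it.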

Use blist.sortedlist from the blist package.
from blist import sortedlist
z = sortedlist([2, 3, 5, 7, 11])
z.add(6)
z.add(3)
z.add(10)
print(z)
This will output:
sortedlist([2, 3, 3, 5, 6, 7, 10, 11])
The resulting object can be used just like a python list.
>>> len(z)
8
>>> [2 * x for x in z]
[4, 6, 6, 10, 12, 14, 20, 22]

Do you have the possibility of using Jython? I just mention it because using TreeMap, TreeSet, etc. is trivial. Also if you're coming from a Java background and you want to head in a Pythonic direction Jython is wonderful for making the transition easier. Though I recognise that use of TreeSet in this case would not be part of such a "transition".
For Jython superusers I have a question myself: the blist package can't be imported because it uses a C file which must be imported. But would there be any advantage of using blist instead of TreeSet? Can we generally assume the JVM uses algorithms which are essentially as good as those of CPython stuff?

Related

Python custom ordered data structure

I am wondering whether Python has a data structure that allows a custom internal ordering policy. I am aware of OrderedDict and the like, but they do not explicitly provide what I am asking for. For example, OrderedDict only guarantees insertion order.
I would really like something like what C++ provides through a comparison object: for example, in std::set<Type,Compare,Allocator>, Compare is a parameter that defines the internal ordering of the data structure. Usually, or probably always, it is a binary predicate that is evaluated for a pair of elements belonging to the data structure.
Is there something similar in Python? Do you know any workaround?
SortedSet & Co support a key:
>>> SortedSet([-3, 1, 4, 1], key=abs)
SortedSet([1, -3, 4], key=<built-in function abs>)
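If you already have a C++-style comparison function rather than a key function, the standard library's functools.cmp_to_key can adapt it to anything that accepts key=. A minimal sketch with sorted(); the comparator compare_by_abs is a made-up example:

```python
from functools import cmp_to_key


def compare_by_abs(a, b):
    """Hypothetical comparator: negative if a orders first, positive if b does, 0 if tied."""
    return abs(a) - abs(b)


values = [-3, 1, 4, -1, 2]
# cmp_to_key wraps the two-argument comparator into a key function.
ordered = sorted(values, key=cmp_to_key(compare_by_abs))
print(ordered)  # [1, -1, 2, -3, 4]
```

Because sorted() is stable, 1 stays ahead of -1 here: they compare as equal and keep their original relative order.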

Python: sorted insert into list

This is a common task when building a list incrementally: once the container is sorted, subsequent inserts should place values efficiently at the correct location so that the container stays sorted, and reading it out through an iterator into a standard list is O(n). To be perfectly clear: I am looking for a call to compiled O(log n) inserts into what amounts to a list, as I would expect from the ordered set I'd get from std::set (where I'd have to explicitly specify std::unordered_set to get the default Python behavior).
OrderedSet (the missing Python type) would accomplish this task. Is there a way to get this effect in Python such that it is as efficient within the container as would be expected in a general-purpose compiled language?
import bisect
mylist = [1,2,5]
bisect.insort(mylist,4)
print(mylist)
# [1, 2, 4, 5]

Difference operator between a List and a Set

Is there an operator to remove elements from a List based on the content of a Set?
What I want to do is already possible by doing this:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = [w for w in words if w not in my_set]
# ["hello", "how", "today", "hello"]
What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets. Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
So is there some way of computing a difference between a List and a Set in a shorter/cleaner/more efficient way than using a list comprehension, like for example:
# I know this is not possible, but does something approaching exist?
new_list = words - my_set
TL;DR
I'm looking for a way to remove all elements present in a Set from a List, that is either:
cleaner (with a built-in perhaps)
and/or more efficient
than what I know can be done with list comprehensions.
Unfortunately, the only answer for this is: No, there is no built-in way, implemented in native code, for this kind of operation.
What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets.
I think what’s important here is the “looks” part. Yes, list comprehensions run more within Python than a set difference, but I assume that most of your application actually runs within Python (otherwise you should probably be programming in C instead). So you should consider whether it really matters much. Iterating over a list is fast in Python, and a membership test on a set is also super fast (constant time, and implemented in native code). And if you look at list comprehensions, they are also very fast. So it probably won’t matter much.
Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
It is true that native operations are faster, but they are also more specialized, limited and allow for less flexibility. For sets, a difference is pretty easy. The set difference is a mathematical concept and is very clearly defined.
But when talking about a “list difference” or a “list and set difference” (or more generalized “list and iterable difference”?) it becomes a lot more unclear. There are a lot of open questions:
How are duplicates handled? If there are two X in the original list and only one X in the subtrahend, should both X disappear from the list? Should only one disappear? If so, which one, and why?
How is order handled? Should the order be kept as in the original list? Does the order of the elements in the subtrahend have any impact?
What if we want to subtract members based on some other condition than equality? For sets, it’s clear that they always work on the equality (and hash value) of the members. Lists don’t, so lists are by design a lot more flexible. With list comprehensions, we can easily have any kind of condition to remove elements from a list; with a “list difference” we would be restricted to equality, and that might actually be a rare situation if you think about it.
It’s maybe more likely to use a set if you need to calculate differences (or even some ordered set). And for filtering lists, it might also be a rare case that you want to end up with a filtered list, so it might be more common to use a generator expression (or the Python 3 filter() function) and work with that later without having to create that filtered list in memory.
What I’m trying to say is that the use case for a list difference is not as clear as a set difference. And if there was a use case, it might be a very rare use case. And in general, I don’t think it’s worth adding complexity to the Python implementation for this. Especially when the in-Python alternative, e.g. a list comprehension, is as fast as it already is.
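To make the duplicate-handling question concrete, here is a sketch of two plausible meanings of "list minus something". The helper drop_once_each is hypothetical, written only for this illustration; neither variant is a built-in operation:

```python
from collections import Counter

words = ["hello", "you", "how", "are", "you", "today", "hello"]
to_remove = ["you", "are"]

# Semantics 1: drop every occurrence of anything listed in to_remove.
remove_set = set(to_remove)
drop_all = [w for w in words if w not in remove_set]


# Semantics 2: drop only one occurrence per listed element (multiset style).
def drop_once_each(items, subtrahend):
    remaining = Counter(subtrahend)  # how many of each we may still drop
    result = []
    for item in items:
        if remaining[item] > 0:
            remaining[item] -= 1     # consume one allowed removal
        else:
            result.append(item)
    return result


drop_once = drop_once_each(words, to_remove)

print(drop_all)   # ['hello', 'how', 'today', 'hello']
print(drop_once)  # ['hello', 'how', 'you', 'today', 'hello']
```

Both keep the original order, but they disagree about the second "you", which is exactly the kind of ambiguity a single operator would have to settle.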
First things first, are you prematurely worrying about an optimisation problem that isn't really an issue? I have to have lists with at least 10,000,000 elements before I even get into the range of this operation taking tenths of a second.
If you're working with large data sets then you may find it advantageous to move to using numpy.
import random
import timeit
r = range(10000000)
setup = """
import numpy as np
l = list({!r})
s = set(l)
to_remove = {!r}
n = np.array(l)
n_remove = np.array(list(to_remove))
""".format(r, set(random.sample(r, 3)))
list_filter = "[x for x in l if x not in to_remove]"
set_filter = "s - to_remove"
np_filter = "n[np.in1d(n, n_remove, invert=True)]"
n = 1
l_time = timeit.timeit(list_filter, setup, number=n)
print("lists:", l_time)
s_time = timeit.timeit(set_filter, setup, number=n)
print("sets:", s_time)
n_time = timeit.timeit(np_filter, setup, number=n)
print("numpy:", n_time)
This returns the following results, with numpy roughly three times faster than the set difference and more than ten times faster than the list comprehension.
lists: 0.8743789765043315
sets: 0.20703006886620656
numpy: 0.06197169088128707
I agree with poke. Here is my reasoning:
The easiest way to do it would be using a filter:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = filter(lambda w: w not in my_set, words)
And using Dunes' solution, I get these times:
lists: 0.87401028
sets: 0.55103887
numpy: 0.16134396
filter: 0.00000886 WOW beats numpy by various orders of magnitude !!!
But wait, this is a flawed comparison: the comprehension and the set difference build their results eagerly, whereas filter only returns a lazy iterator that has done no work yet, and numpy returns an array rather than a list.
If I run Dunes' solution but produce the actual lists, I get:
lists: 0.86804159
sets: 0.56945663
numpy: 1.19315723
filter: 1.68792561
Now numpy is slightly more efficient than using a simple filter, but both are not better than the list comprehension, which was the first and more intuitive solution.
I would definitely use a filter over the comprehension, except if I need to use the filtered list more than once (although I could tee it).
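A small sketch of why the lazy filter result needs care when reused, assuming Python 3 (where filter returns an iterator). It shows that the iterator is exhausted after one pass and how itertools.tee gives two independent passes:

```python
from itertools import tee

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

# In Python 3, filter() does no work until it is iterated.
lazy = filter(lambda w: w not in my_set, words)
first = list(lazy)    # consumes the iterator
second = list(lazy)   # nothing left on a second pass

# tee() splits one iterator into two that can be consumed independently.
pass_a, pass_b = tee(filter(lambda w: w not in my_set, words))
as_list = list(pass_a)
count = sum(1 for _ in pass_b)

print(first, second, count)
```

tee buffers items internally, so if one copy is fully consumed before the other starts, it stores about as much as a list would.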

What is `set` definition? AKA what does `set` do?

I just finished LearnPythonTheHardWay as my intro to programming and set my mind on a sudoku related project. I've been reading through the code of a Sudoku Generator that was uploaded here to learn some things, and I ran into the line available = set(range(1,10)). I read that as available = set([1, 2, 3, 4, 5, 6, 7, 8, 9]) but I'm not sure what set is.
I tried googling python set, looked through the code to see if set had been defined anywhere, and now I'm coming to you.
Thanks.
set is a built-in type. From the documentation:
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference.
A set in Python is the collection used to mimic the mathematical notion of set. To put it very succinctly, a set is a list of unique objects, that is, it cannot contain duplicates, which a list can do.
A set is kind of like an unordered list, with unique elements. Documentation exists though, so I'm not sure why you couldn't find it:
https://docs.python.org/2/library/stdtypes.html#set
To make it easy to understand, let's take a list:
a = [1,2,3,4,5,5,5,6,7,7,9]
print(list(set(a)))
The output will be:
[1, 2, 3, 4, 5, 6, 7, 9]
You can eliminate repeated numbers using set.
For more uses of set, refer to the docs.
Thanks to my friend here who reminded me about the lack of order. In case the list a was like this:
a = [7,7,5,5,5,1,2,3,4,6,9]
print(list(set(a)))
it will typically still print
[1, 2, 3, 4, 5, 6, 7, 9]
in CPython, but that is an implementation detail for small integers, not a guarantee. A set does not preserve insertion order.
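If you do need to remove duplicates while keeping the original order, one common approach (assuming Python 3.7 or later, where dicts preserve insertion order) is dict.fromkeys. A short sketch:

```python
a = [7, 7, 5, 5, 5, 1, 2, 3, 4, 6, 9]

# dict keys are unique and keep first-seen order (guaranteed since Python 3.7).
unique_in_order = list(dict.fromkeys(a))

# If sorted output is what you actually want, ask for it explicitly.
unique_sorted = sorted(set(a))

print(unique_in_order)  # [7, 5, 1, 2, 3, 4, 6, 9]
print(unique_sorted)    # [1, 2, 3, 4, 5, 6, 7, 9]
```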

Fastest way to search a list in python

When you do something like "test" in a where a is a list does python do a sequential search on the list or does it create a hash table representation to optimize the lookup? In the application I need this for I'll be doing a lot of lookups on the list so would it be best to do something like b = set(a) and then "test" in b? Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Don't use a list, use a set() instead. It has exactly the properties you want, including a blazing fast in test.
I've seen speedups of 20x and higher in places (mostly heavy number crunching) where one list was changed for a set.
"test" in a with a list a will do a linear search. Setting up a hash table on the fly would be much more expensive than a linear search. "test" in b, on the other hand, will do an amortised O(1) hash look-up.
In the case you describe, there doesn't seem to be a reason to use a list over a set.
I think it would be better to go with the set implementation. I know for a fact that sets have O(1) lookup time. I think lists take O(n) lookup time. But even if lists are also O(1) lookup, you lose nothing with switching to sets.
Further, sets don't allow duplicate values, so if the data did contain duplicates, each value would be stored only once (though a set has more per-element overhead than a list).
Lists and tuples seem to take the same time, and using "in" is slow for large data:
>>> import time
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.66235494614
>>> t = tuple(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.6594209671
Here is a much better solution, provided the list is sorted: Most efficient way for a lookup/search in a huge list (python)
It's super fast:
>>> from bisect import bisect_left
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [t[bisect_left(t,b)]==b for b in range(100234,101234)];print(time.time()-a)
0.0054759979248
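One caveat with the one-liner above: t[bisect_left(t, b)] raises IndexError when b is larger than every element. A slightly safer sketch, wrapped in a hypothetical helper named contains_sorted, and still assuming the list is already sorted:

```python
from bisect import bisect_left


def contains_sorted(sorted_items, value):
    """Hypothetical helper: O(log n) membership test on an already sorted list."""
    i = bisect_left(sorted_items, value)
    # Guard against i == len(sorted_items), which happens when value
    # is greater than every element.
    return i < len(sorted_items) and sorted_items[i] == value


t = list(range(0, 1000, 2))  # sorted even numbers 0..998
print(contains_sorted(t, 234))   # True
print(contains_sorted(t, 235))   # False
print(contains_sorted(t, 5000))  # False, and no IndexError
```

If the data is not already sorted and order does not matter, a set remains the simpler choice.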
