Efficient list manipulation in Python

I have a large list and regularly need to find an item satisfying a rather complex condition (not equality), i.e. I am forced to check every item in the list until I find one. The conditions change, but some items match more often than others. So I would like to bring the matching item to the front of the list each time I find one, so frequently matching items are found more quickly.
Is there an efficient, pythonic way to do this?
Sequences ([]) are backed by an array, so removing an item somewhere in the middle and prepending it to the front means moving every preceding item. That's O(n) time, not good.
In C you could build a linked list and move the item on your own when found. In Python there is a deque, but afaik you cannot reference the node objects nor have access to .next pointers.
And a self-made linked list is very slow in Python. (In fact it's slower than ordinary linear search without moving any item.)
Sadly, a dict or set finds items based on value equality and thus doesn't fit my problem.
As an illustration, here's the condition:
u, v, w = n.value  # list item
if v in g[u] and w in g[v] and u not in g[w]:
    ...

Consider instead a Pythonic approach. As Ed Post once put it, "The determined Real Programmer can write FORTRAN programs in any language" -- and this generalizes... you're trying to write C in Python and it isn't working well for you:-)
Rather, think of putting an auxiliary dict cache next to the list -- caching the indices where items are found (needs to be invalidated only on "deep" changes to the list's structure). Much simpler and faster...
Probably best done by having list and dict in a small class:
class Seeker(object):
    def __init__(self, *a, **k):
        self.l = list(*a, **k)
        self.d = {}

    def find(self, value):
        where = self.d.get(value)
        if where is None:
            self.d[value] = where = self.l.index(value)
        return where

    def __setitem__(self, index, value):
        if value in self.d:
            del self.d[value]
        self.l[index] = value

    # and so on for other mutators that invalidate self.d; then,

    def __getattr__(self, name):
        # delegate everything else to the list
        return getattr(self.l, name)
You need only define the mutators you actually need to use -- e.g., if you won't do insert, sort, __delitem__, &c, no need to define those; you can just delegate them to the list.
Added: in Python 3.2 or better, functools.lru_cache can actually do most of the work for you -- use it to decorate find and you'll get a better implementation of the caching, with the ability to limit cache size if you so desire. To clear the cache, you'll need to call self.find.cache_clear() at the appropriate spots (where I above use self.d = {}) -- unfortunately, that crucial functionality was not documented at the time of writing (the volunteers updating the docs are not the same ones updating the code...!-)... but, trust me, it's not going to disappear on you:-).
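As an illustrative sketch of that variant (the class name is made up; assumes Python 3.2+):
import functools

class CachedSeeker(object):
    def __init__(self, *a, **k):
        self.l = list(*a, **k)
        # wrap the bound method so each instance gets its own cache;
        # pass a number instead of None to cap the cache size
        self.find = functools.lru_cache(maxsize=None)(self._find)

    def _find(self, value):
        return self.l.index(value)

    def __setitem__(self, index, value):
        self.find.cache_clear()  # invalidate the cache on any mutation
        self.l[index] = value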
Added: the OP edited the Q to clarify that he's not after "value equality", but rather some more complex set of conditions, exemplified by a predicate such as:
def good_for_g(g, n):
    # for some container `g` and item value `n`:
    u, v, w = n.value
    return v in g[u] and w in g[v] and u not in g[w]
Presumably, then, the desire to bring "good" items towards the front is in turn predicated on their "goodness" being "sticky", i.e., g staying pretty much the same for a while. In this case, one can use the predicate itself as the key into the dictionary -- so for example:
class FancySeeker(object):
    def __init__(self, *a, **k):
        self.l = list(*a, **k)
        self.d = {}

    def _find_in_list(self, predicate):
        for i, n in enumerate(self.l):
            if predicate(n):
                return i
        return -1

    def find(self, predicate):
        where = self.d.get(predicate)
        if where is None:
            where = self._find_in_list(predicate)
            self.d[predicate] = where
        return where
and so forth.
So the remaining difficulty is to put predicate in a form suitable for effective indexing into a dict. If predicate is just a function, no problem. But if predicate is a function with parameters, as formed e.g. by functools.partial or as a bound method of some instance, that requires a bit of further processing/wrapping to make the indexing work.
Two calls to functools.partial with the same bound argument(s) and function, for example, do not return equal objects -- one has, rather, to inspect the .args and .func of the returned objects to ensure that, so to speak, a "singleton" is returned for any given (func, args) pair.
Moreover, if some of the bound arguments are mutable, one needs to use their id in lieu of their hash (or else the raw functools.partial object would not be hashable). It gets even hairier for bound methods, though they can similarly be wrapped into e.g. a hashable, "equality adjusted" Predicate class.
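As a rough illustration of that wrapping, here is one possible shape for such an "equality adjusted" Predicate class (a sketch; the class name and details are illustrative, not from any library):
class Predicate(object):
    """Hashable wrapper: same func + same args compare (and hash) equal."""
    def __init__(self, func, *args):
        self.func = func
        self.args = args
    def __call__(self, item):
        return self.func(*(self.args + (item,)))
    def _key(self):
        parts = []
        for a in self.args:
            try:
                hash(a)
                parts.append(a)
            except TypeError:
                parts.append(id(a))  # mutable argument: fall back to its id
        return (self.func, tuple(parts))
    def __eq__(self, other):
        return isinstance(other, Predicate) and self._key() == other._key()
    def __hash__(self):
        return hash(self._key())
With this, FancySeeker().find(Predicate(good_for_g, g)) keys the cache on the (func, args) pair rather than on object identity.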
Lastly, if these gyrations prove too cumbersome and you really want a fast implementation of a linked list instead, look at https://pypi.python.org/pypi/llist/0.4 -- it's a C-coded implementation of singly and doubly linked lists for Python (for each kind, it implements three types: the list itself, the list node, and the list's iterator).
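For instance, a move-to-front scan with its doubly linked list might look like this (a sketch based on the package's documented dllist API; check the docs before relying on it):
from llist import dllist

items = dllist([10, 20, 30])

def find_and_promote(pred):
    node = items.first
    while node is not None:
        if pred(node.value):
            value = items.remove(node)  # O(1) unlink, given the node
            items.appendleft(value)     # re-insert the hit at the front
            return items.first
        node = node.next
    return None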

You can do exactly what you want using deque.rotate.
from collections import deque

class Collection:
    "Linked list collection that moves searched-for items to the front of the collection"
    def __init__(self, seq):
        self._deque = deque(seq)

    def __contains__(self, target):
        for i, item in enumerate(self._deque):
            if item == target:
                self._deque.rotate(-i)        # bring the item to the front
                self._deque.popleft()         # remove it
                self._deque.rotate(i)         # restore the original order
                self._deque.appendleft(item)  # re-insert at the front
                return True
        return False

    def __str__(self):
        return "Collection({})".format(str(self._deque))
c = Collection(range(10))
print(c)
print("5 in d:", 5 in c)
print(c)
Gives the following output:
Collection(deque([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
5 in c: True
Collection(deque([5, 0, 1, 2, 3, 4, 6, 7, 8, 9]))

Related

How to remove "duplicates" in a list of instances

I have a list of instances of a certain class. This list contains "duplicates", in the sense that duplicates share the exact same attributes. I want to remove the duplicates from this list.
I can check whether two instances share the same attributes by using
class MyClass:
    def __eq__(self, other):
        return self.__dict__ == other.__dict__
I could of course iterate through the whole list of instances and compare them element by element to remove duplicates, but I was wondering if there is a more pythonic way to do this, preferably using the in operator + list comprehension.
sets (no order)
A set cannot contain duplicate elements. list(set(content)) will deduplicate a list. This is not too inefficient and is probably one of the better ways to do it :P You will need to define a __hash__ function for your class though, which must be the same for equal elements (and should differ for unequal elements as much as practical) for this to work. Note that the hash value must obey the aforementioned rule but otherwise it may change between runs without causing issues.
index function (stable order)
You could do lambda l: [l[index] for index in range(len(l)) if index == l.index(l[index])]. This only keeps elements that are the first in the list.
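Unrolled into a named function, the same idea reads:
def dedup_stable(l):
    # keep an element only if this index is its first occurrence; O(n**2)
    return [x for i, x in enumerate(l) if l.index(x) == i]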
in operator (stable order)
def uniquify(content):
    result = []
    for element in content:
        if element not in result:
            result.append(element)
    return result
This will keep appending elements to the output list unless they are already in the output list.
A little more on the set approach. You can safely implement a hash by delegating to a tuple's hash - just hash a tuple of all the attributes you want to look at. You will also need to define an __eq__ that behaves properly.
class MyClass:
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

    def __eq__(self, other):
        return (self.a, self.b, self.c) == (other.a, other.b, other.c)

    def __hash__(self):
        return hash((self.a, self.b, self.c))

    def __repr__(self):
        return "MyClass({!r}, {!r}, {!r})".format(self.a, self.b, self.c)
As you're doing so much tuple construction, you could just make your class iterable:
def __iter__(self):
    return iter((self.a, self.b, self.c))
This enables you to call tuple on self instead of laboriously doing .a, .b, .c etc.
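For example, with __iter__ in place, the two methods above collapse to:
def __eq__(self, other):
    return tuple(self) == tuple(other)

def __hash__(self):
    return hash(tuple(self))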
You can then do something like this:
def unordered_elim(l):
    return list(set(l))
If you want to preserve ordering, you can use an OrderedDict instead:
from collections import OrderedDict

def ordered_elim(l):
    return list(OrderedDict.fromkeys(l).keys())
This should be faster than using in or index, while still preserving ordering. You can test it something like this:
data = [MyClass("this", "is a", "duplicate"),
        MyClass("first", "unique", "datum"),
        MyClass("this", "is a", "duplicate"),
        MyClass("second", "unique", "datum")]
print(unordered_elim(data))
print(ordered_elim(data))
With this output:
[MyClass('first', 'unique', 'datum'), MyClass('second', 'unique', 'datum'), MyClass('this', 'is a', 'duplicate')]
[MyClass('this', 'is a', 'duplicate'), MyClass('first', 'unique', 'datum'), MyClass('second', 'unique', 'datum')]
NB if any of your attributes aren't hashable, this won't work, and you'll either need to work around it (change a list to a tuple) or use a slow, O(n^2) approach like in.

How to remove duplicates in set for objects?

I have a set of objects:
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)

res = set()
for i in range(0, 1000):
    res.add(Test())

print len(res)  # prints 1000
How do I remove duplicates from a set of objects?
Thanks for the answers, this works:
class Test(object):
    def __init__(self, i):
        self.i = i
        # self.i = random.randint(1,10)
        # self.j = random.randint(1,20)

    def __keys(self):
        t = ()
        for key in self.__dict__:
            t = t + (self.__dict__[key],)
        return t

    def __eq__(self, other):
        return isinstance(other, Test) and self.__keys() == other.__keys()

    def __hash__(self):
        return hash(self.__keys())

res = set()
res.add(Test(2))
...
res.add(Test(8))
result: [2,8,3,4,5,6,7]
But how do I save the order? Sets don't support ordering. Can I use a list instead of a set, for example?
Your objects must be hashable (i.e. must have __eq__() and __hash__() defined) for sets to work properly with them:
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)

    def __eq__(self, other):
        return self.i == other.i

    def __hash__(self):
        return self.i
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.
Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.
If you have several attributes, hash and compare a tuple of them (thanks, delnan):
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)
        self.k = random.randint(1, 10)
        self.j = random.randint(1, 10)

    def __eq__(self, other):
        return (self.i, self.k, self.j) == (other.i, other.k, other.j)

    def __hash__(self):
        return hash((self.i, self.k, self.j))
Your first question is already answered by Pavel Anossov.
But you have another question:
But how do I save the order? Sets don't support ordering. Can I use a list instead of a set, for example?
You can use a list, but there are a few downsides:
You get the wrong interface.
You don't get automatic handling of duplicates. You have to explicitly write if foo not in res: res.append(foo). Obviously, you can wrap this up in a function instead of writing it repeatedly, but it's still extra work.
It's going to be a lot less efficient if the collection can get large. Basically, adding a new element, checking whether an element already exists, etc. are all going to be O(N) instead of O(1).
What you want is something that works like an ordered set. Or, equivalently, like a list that doesn't allow duplicates.
If you do all your adds first, and then all your lookups, and you don't need lookups to be fast, you can get around this by first building a list, then using unique_everseen from the itertools recipes to remove duplicates.
Or you could just keep a set and a list of elements in order (or a list plus a set of elements seen so far). But that can get a bit complicated, so you might want to wrap it up.
Ideally, you want to wrap it up in a type that has exactly the same API as set. Something like an OrderedSet akin to collections.OrderedDict.
Fortunately, if you scroll to the bottom of that docs page, you'll see that exactly what you want already exists; there's a link to an OrderedSet recipe at ActiveState.
So, copy it, paste it into your code, then just change res = set() to res = OrderedSet(), and you're done.
I think you can easily do what you want with a list, as you asked in your first post, since you defined the __eq__ operator:
l = []
if Test(0) not in l:
    l.append(Test(0))
My 2 cts ...
Pavel Anossov's answer is great for allowing your class to be used in a set with the semantics you want. However, if you want to preserve the order of your items, you'll need a bit more. Here's a function that de-duplicates a list, as long as the list items are hashable:
def dedupe(lst):
    seen = set()
    results = []
    for item in lst:
        if item not in seen:
            seen.add(item)
            results.append(item)
    return results
A slightly more idiomatic version would be a generator, rather than a function that returns a list. This gets rid of the results variable, using yield rather than appending the unique values to it. I've also renamed the lst parameter to iterable, since it will work just as well on any iterable object (such as another generator).
def dedupe(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Python set with the ability to pop a random element

I am in need of a Python (2.7) object that functions like a set (fast insertion, deletion, and membership checking) but has the ability to return a random value. Previous questions asked on stackoverflow have answers that are things like:
import random
random.sample(mySet, 1)
But this is quite slow for large sets (it runs in O(n) time).
Other solutions aren't random enough (they depend on the internal representation of python sets, which produces some results which are very non-random):
for e in mySet:
    break
# e is now an element from mySet
I coded my own rudimentary class which has constant time lookup, deletion, and random values.
class randomSet:
    def __init__(self):
        self.dict = {}
        self.list = []

    def add(self, item):
        if item not in self.dict:
            self.dict[item] = len(self.list)
            self.list.append(item)

    def addIterable(self, item):
        for a in item:
            self.add(a)

    def delete(self, item):
        if item in self.dict:
            index = self.dict[item]
            if index == len(self.list)-1:
                del self.dict[self.list[index]]
                del self.list[index]
            else:
                self.list[index] = self.list.pop()
                self.dict[self.list[index]] = index
                del self.dict[item]

    def getRandom(self):
        if self.list:
            return self.list[random.randint(0, len(self.list)-1)]

    def popRandom(self):
        if self.list:
            index = random.randint(0, len(self.list)-1)
            if index == len(self.list)-1:
                del self.dict[self.list[index]]
                return self.list.pop()
            returnValue = self.list[index]
            self.list[index] = self.list.pop()
            self.dict[self.list[index]] = index
            del self.dict[returnValue]
            return returnValue
Are there any better implementations for this, or any big improvements to be made to this code?
I think the best way to do this would be to use the MutableSet abstract base class in collections. Inherit from MutableSet, and then define add, discard, __len__, __iter__, and __contains__; also rewrite __init__ to optionally accept a sequence, just like the set constructor does. MutableSet provides built-in definitions of all other set methods based on those methods. That way you get the full set interface cheaply. (And if you do this, addIterable comes for free via the inherited |= operator.)
discard in the standard set interface appears to be what you have called delete here. So rename delete to discard. Also, instead of having a separate popRandom method, you could just define popRandom like so:
def popRandom(self):
    item = self.getRandom()
    self.discard(item)
    return item
That way you don't have to maintain two separate item removal methods.
Finally, in your item removal method (delete now, discard according to the standard set interface), you don't need an if statement. Instead of testing whether index == len(self.list) - 1, simply swap the final item in the list with the item at the index of the list to be popped, and make the necessary change to the reverse-indexing dictionary. Then pop the last item from the list and remove it from the dictionary. This works whether index == len(self.list) - 1 or not:
def discard(self, item):
    if item in self.dict:
        index = self.dict[item]
        self.list[index], self.list[-1] = self.list[-1], self.list[index]
        self.dict[self.list[index]] = index
        del self.list[-1]    # or, in one line:
        del self.dict[item]  #   del self.dict[self.list.pop()]
One approach you could take is to derive a new class from set which salts itself with random objects of a type derived from int.
You can then use pop to select a random element, and if it is not of the salt type, reinsert and return it, but if it is of the salt type, insert a new, randomly-generated salt object (and pop to select a new object).
This will tend to alter the order in which objects are selected. On average, the number of attempts will depend on the proportion of salting elements, i.e. amortised O(k) performance.
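A rough sketch of that scheme (all names here are invented; it relies on set.pop() returning an arbitrary element, which the churning salt perturbs):
import random

class Salt(int):
    """Marker type for salt values; random 64-bit ints make clashes with real members unlikely."""

class SaltedSet(set):
    def __init__(self, iterable=(), nsalt=16):
        super(SaltedSet, self).__init__(iterable)
        for _ in range(nsalt):
            self.add(Salt(random.getrandbits(64)))

    def get_random(self):
        while True:
            x = set.pop(self)
            if isinstance(x, Salt):
                self.add(Salt(random.getrandbits(64)))  # consumed a salt: replace it, draw again
            else:
                self.add(x)  # this is a selection, not a removal: put it back
                return x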
Can't we implement a new class inheriting from set with some (hackish) modifications that enable us to retrieve a random element from the list with O(1) lookup time? Btw, on Python 2.x you should inherit from object, i.e. use class randomSet(object). Also PEP8 is something to consider for you :-)
Edit:
For getting some ideas of what hackish solutions might be capable of, this thread is worth reading:
http://python.6.n6.nabble.com/Get-item-from-set-td1530758.html
Here's a solution from scratch, which adds and pops in constant time. I also included some extra set functions for demonstrative purposes.
from random import randint

class RandomSet(object):
    """
    Implements a set in which elements can be
    added and drawn uniformly and randomly in
    constant time.
    """
    def __init__(self, seq=None):
        self.dict = {}
        self.list = []
        if seq is not None:
            for x in seq:
                self.add(x)

    def add(self, x):
        if x not in self.dict:
            self.dict[x] = len(self.list)
            self.list.append(x)

    def pop(self, x=None):
        if x is None:
            i = randint(0, len(self.list)-1)
            x = self.list[i]
        else:
            i = self.dict[x]
        self.list[i] = self.list[-1]
        self.dict[self.list[-1]] = i
        self.list.pop()
        self.dict.pop(x)
        return x

    def __contains__(self, x):
        return x in self.dict

    def __iter__(self):
        return iter(self.list)

    def __repr__(self):
        return "{" + ", ".join(str(x) for x in self.list) + "}"

    def __len__(self):
        return len(self.list)
Yes, I'd implement an "ordered set" in much the same way you did, using a list as the internal data structure.
However, I'd inherit straight from set and just keep track of the added items in an internal list (as you did), leaving the methods I don't use alone.
Maybe add a "sync" method to update the internal list whenever the set is updated by set-specific operations, like the *_update methods.
That is, if using an "ordered dict" does not cover your use cases. (I just found that trying to cast ordered_dict keys to a regular set is not optimized, so if you need set operations on your data, that is not an option.)
If you don't mind only supporting comparable elements, then you could use blist.sortedset.
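For example (a sketch; blist is a third-party package, and its sortedset supports indexing, which is what makes a random draw easy):
import random
from blist import sortedset

s = sortedset([5, 1, 3])
s.add(2)                            # kept sorted on insertion
print(s[0])                         # indexable: smallest element, 1
print(s[random.randrange(len(s))])  # random element via a random index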

Key-ordered dict in Python

I am looking for a solid implementation of an ordered associative array, that is, an ordered dictionary. I want the ordering in terms of keys, not of insertion order.
More precisely, I am looking for a space-efficient implementation of an int-to-float (or string-to-float for another use case) mapping structure for which:
Ordered iteration is O(n)
Random access is O(1)
The best I came up with was gluing a dict and a list of keys, keeping the last one ordered with bisect and insert.
Any better ideas?
"Random access O(1)" is an extremely exacting requirement which basically imposes an underlying hash table -- and I hope you do mean random READS only, because I think it can be mathematically proven than it's impossible in the general case to have O(1) writes as well as O(N) ordered iteration.
I don't think you will find a pre-packaged container suited to your needs because they are so extreme -- O(log N) access would of course make all the difference in the world. To get the big-O behavior you want for reads and iterations you'll need to glue two data structures, essentially a dict and a heap (or sorted list or tree), and keep them in sync. Although you don't specify, I think you'll only get amortized behavior of the kind you want - unless you're truly willing to pay any performance hits for inserts and deletes, which is the literal implication of the specs you express but does seem a pretty unlikely real-life requirement.
For O(1) read and amortized O(N) ordered iteration, just keep a list of all keys on the side of a dict. E.g.:
class Crazy(object):
    def __init__(self):
        self.d = {}
        self.L = []
        self.sorted = True

    def __getitem__(self, k):
        return self.d[k]

    def __setitem__(self, k, v):
        if k not in self.d:
            self.L.append(k)
            self.sorted = False
        self.d[k] = v

    def __delitem__(self, k):
        del self.d[k]
        self.L.remove(k)

    def __iter__(self):
        if not self.sorted:
            self.L.sort()
            self.sorted = True
        return iter(self.L)
If you don't like the "amortized O(N) order" you can remove self.sorted and just repeat self.L.sort() in __setitem__ itself. That makes writes O(N log N), of course (while the approach above keeps writes at O(1)). Either approach is viable and it's hard to think of one as intrinsically superior to the other. If you tend to do a bunch of writes then a bunch of iterations then the approach in the code above is best; if it's typically one write, one iteration, another write, another iteration, then it's just about a wash.
BTW, this takes shameless advantage of the unusual (and wonderful;-) performance characteristics of Python's sort (aka "timsort"): among them, sorting a list that's mostly sorted but with a few extra items tacked on at the end is basically O(N) (if the tacked-on items are few enough compared to the sorted prefix part). I hear Java's gaining this sort soon, as Josh Bloch was so impressed by a tech talk on Python's sort that he started coding it for the JVM on his laptop then and there. Most systems (including, I believe, Jython as of today and IronPython too) basically have sorting as an O(N log N) operation, not taking advantage of "mostly ordered" inputs; "natural mergesort", which Tim Peters fashioned into Python's timsort of today, is a wonder in this respect.
The sortedcontainers module provides a SortedDict type that meets your requirements. It basically glues a SortedList and dict type together. The dict provides O(1) lookup and the SortedList provides O(N) iteration (it's extremely fast). The whole module is pure Python and has benchmark graphs to back up the performance claims (fast-as-C implementations). SortedDict is also fully tested with 100% coverage and hours of stress. It's compatible with Python 2.6 through 3.4.
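Typical usage looks something like this:
from sortedcontainers import SortedDict

d = SortedDict()
d[3] = 3.3
d[1] = 1.0
d[2] = 2.2
print(d[2])             # hash-backed random access: 2.2
print(list(d.items()))  # iteration ordered by key: [(1, 1.0), (2, 2.2), (3, 3.3)]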
Here is my own implementation:
import bisect

class KeyOrderedDict(object):
    __slots__ = ['d', 'l']

    def __init__(self, *args, **kwargs):
        self.l = sorted(kwargs)
        self.d = kwargs

    def __setitem__(self, k, v):
        if k not in self.d:
            idx = bisect.bisect(self.l, k)
            self.l.insert(idx, k)
        self.d[k] = v

    def __getitem__(self, k):
        return self.d[k]

    def __delitem__(self, k):
        idx = bisect.bisect_left(self.l, k)
        del self.l[idx]
        del self.d[k]

    def __iter__(self):
        return iter(self.l)

    def __contains__(self, k):
        return k in self.d
The use of bisect keeps self.l ordered, and insertion is O(n) (because of the insert -- not a killer in my case, because I append far more often than truly insert, so the usual case is amortized O(1)). Access is O(1), and iteration O(n). But maybe someone has invented (in C) something with a more clever structure?
An ordered tree is usually better for these cases, but random access is going to be log(n). You should also take insertion and removal costs into account...
You could build a dict that allows traversal by storing a pair (value, next_key) in each position.
Random access:
my_dict[k][0] # for a key k
Traversal:
k = start_key  # stored somewhere
while k is not None:  # next_key is None at the end of the list
    v, k = my_dict[k]
    yield v
Keep a pointer to start and end and you'll have efficient update for those cases where you just need to add onto the end of the list.
Inserting in the middle is obviously O(n). Possibly you could build a skip list on top of it if you need more speed.
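A sketch of the append-at-the-end case (the state bookkeeping names are invented for illustration):
def append(my_dict, state, key, value):
    # state is a dict holding the 'start' and 'end' keys of the chain
    my_dict[key] = (value, None)  # the new tail has no next_key yet
    if state['end'] is None:
        state['start'] = key      # first element of the chain
    else:
        old_value, _ = my_dict[state['end']]
        my_dict[state['end']] = (old_value, key)  # re-link the old tail
    state['end'] = key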
I'm not sure which Python version you are working in, but in case you like to experiment, Python 3.1 includes an official implementation of ordered dictionaries:
http://www.python.org/dev/peps/pep-0372/
http://docs.python.org/3.1/whatsnew/3.1.html#pep-372-ordered-dictionaries
The ordereddict package (http://anthon.home.xs4all.nl/Python/ordereddict/) that I implemented back in 2007 includes sorteddict. sorteddict is a KSO (Key Sorted Order) dictionary. It is implemented in C, very space-efficient, and several times faster than a pure Python implementation. Downside is that it only works with CPython.
>>> from _ordereddict import sorteddict
>>> x = sorteddict()
>>> x[1] = 1.0
>>> x[3] = 3.3
>>> x[2] = 2.2
>>> print x
sorteddict([(1, 1.0), (2, 2.2), (3, 3.3)])
>>> for i in x:
...     print i, x[i]
...
1 1.0
2 2.2
3 3.3
>>>
Sorry for the late reply, maybe this answer can help others find that library.
Here's a paste: I had a need for something similar. Note however that this specific implementation is immutable; there are no inserts once the instance is created. The exact performance doesn't quite match what you're asking for, however: lookup is O(log n) and full scan is O(n). This works using the bisect module upon a tuple of key/value (tuple) pairs. Even if you can't use this precisely, you might have some success adapting it to your needs.
import bisect

class dictuple(object):
    """
    >>> h0 = dictuple()
    >>> h1 = dictuple({"apples": 1, "bananas":2})
    >>> h2 = dictuple({"bananas": 3, "mangoes": 5})
    >>> h1+h2
    ('apples':1, 'bananas':3, 'mangoes':5)
    >>> h1 > h2
    False
    >>> h1 > 6
    False
    >>> 'apples' in h1
    True
    >>> 'apples' in h2
    False
    >>> d1 = {}
    >>> d1[h1] = "salad"
    >>> d1[h1]
    'salad'
    >>> d1[h2]
    Traceback (most recent call last):
    ...
    KeyError: ('bananas':3, 'mangoes':5)
    """
    def __new__(cls, *args, **kwargs):
        initial = {}
        args = [] if args is None else args
        for arg in args:
            initial.update(arg)
        initial.update(kwargs)
        instance = object.__new__(cls)
        instance.__items = tuple(sorted(initial.items(), key=lambda i: i[0]))
        return instance

    def __init__(self, *args, **kwargs):
        pass

    def __find(self, key):
        return bisect.bisect(self.__items, (key,))

    def __getitem__(self, key):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        raise KeyError(key)

    def __repr__(self):
        return "({0})".format(", ".join(
            "{0}:{1}".format(repr(item[0]), repr(item[1]))
            for item in self.__items))

    def __contains__(self, key):
        ind = self.__find(key)
        return ind < len(self.__items) and self.__items[ind][0] == key

    def __cmp__(self, other):
        return cmp(self.__class__.__name__, other.__class__.__name__
                   ) or cmp(self.__items, other.__items)

    def __eq__(self, other):
        return self.__items == other.__items

    def __format__(self, key):
        pass
    #def __ge__(self, key):
    #    pass
    #def __getattribute__(self, key):
    #    pass
    #def __gt__(self, key):
    #    pass

    __seed = hash("dictuple")
    def __hash__(self):
        return dictuple.__seed ^ hash(self.__items)

    def __iter__(self):
        return self.iterkeys()

    def __len__(self):
        return len(self.__items)

    #def __reduce__(self, key):
    #    pass
    #def __reduce_ex__(self, key):
    #    pass
    #def __sizeof__(self, key):
    #    pass

    @classmethod
    def fromkeys(cls, key, v=None):
        return cls(dict.fromkeys(key, v))

    def get(self, key, default):
        ind = self.__find(key)
        if ind < len(self.__items) and self.__items[ind][0] == key:
            return self.__items[ind][1]
        return default

    def has_key(self, key):
        ind = self.__find(key)
        return ind < len(self.__items) and self.__items[ind][0] == key

    def items(self):
        return list(self.iteritems())

    def iteritems(self):
        return iter(self.__items)

    def iterkeys(self):
        return (i[0] for i in self.__items)

    def itervalues(self):
        return (i[1] for i in self.__items)

    def keys(self):
        return list(self.iterkeys())

    def values(self):
        return list(self.itervalues())

    def __add__(self, other):
        _sum = dict(self.__items)
        _sum.update(other.__items)
        return self.__class__(_sum)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
For "string to float" problem you can use a Trie - it provides O(1) access time and O(n) sorted iteration. By "sorted" I mean "sorted alphabetically by key" - it seems that the question implies the same.
Some implementations (each with its own strong and weak points):
https://github.com/biopython/biopython has a Bio.trie module with a full-featured Trie; other Trie packages are more memory-efficient;
https://github.com/kmike/datrie - random insertions could be slow, keys alphabet must be known in advance (a usage sketch follows this list);
https://github.com/kmike/hat-trie - all operations are fast, but many dict methods are not implemented; underlying C library supports sorted iteration, but it is not implemented in a wrapper;
https://github.com/kmike/marisa-trie - very memory efficient, but doesn't support insertions; iteration is not sorted by default but can be made sorted (there is an example in docs);
https://github.com/kmike/DAWG - can be seen as a minimized Trie; very fast and memory efficient, but doesn't support insertions; has size limits (several GB of data)
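As a taste of the API, here is a hedged sketch using datrie (keys must be unicode, and the alphabet is declared up front, as noted above):
import string
import datrie

trie = datrie.Trie(string.ascii_lowercase)  # a string-to-float mapping
trie[u'banana'] = 2.2
trie[u'apple'] = 1.0
print(trie[u'apple'])  # access cost grows with key length, not item count
print(trie.keys())     # alphabetically ordered: [u'apple', u'banana']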
Here's one option that has not been mentioned in other answers, I think:
Use a binary search tree (Treap/AVL/RB) to keep the mapping.
Also use a hashmap (aka dictionary) to keep the same mapping (again).
This will provide O(n) ordered traversal (via the tree), O(1) random access (via the hashmap) and O(log n) insertion/deletion (because you need to update both the tree and the hash).
The drawback is the need to keep all the data twice, however the alternatives which suggest keeping a list of keys alongside a hashmap are not much better in this sense.

Is there code out there to subclass set in Python for big xranges?

I'm trying to write some Python code that includes union/intersection of sets that potentially can be very large. Much of the time, these sets will be essentially set(xrange(1<<32)) or something of the kind, but often there will be ranges of values that do not belong in the set (say, 'bit 5 cannot be clear'), or extra values thrown in. For the most part, the set contents can be expressed algorithmically.
I can go in and do the dirty work to subclass set and create something, but I feel like this must be something that's been done before, and I don't want to spend days on wheel reinvention.
Oh, and just to make it harder, once I've created the set, I need to be able to iterate over it in random order. Quickly. Even if the set has a billion entries. (And that billion-entry set had better not actually take up gigabytes, because I'm going to have a lot of them.)
Is there code out there? Anyone have neat tricks? Am I asking for the moon?
You say:
For the most part, the set contents can be expressed algorithmically.
How about writing a class that presents the entire set API but determines set inclusion algorithmically, together with a number of classes that wrap other sets to perform union and intersection algorithmically?
For example, if you had a set a and set b which are instances of these pseudo sets:
>>> u = Union(a, b)
And then you use u with the full set API, which will turn around and query a and b using the correct logic. All the set methods could be designed to return these pseudo unions/intersections automatically so the whole process is transparent.
Edit: Quick example with a very limited API:
class Base(object):
    def union(self, other):
        return Union(self, other)
    def intersection(self, other):
        return Intersection(self, other)

class RangeSet(Base):
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __contains__(self, value):
        return value >= self.low and value < self.high

class Union(Base):
    def __init__(self, *sets):
        self.sets = sets
    def __contains__(self, value):
        return any(value in x for x in self.sets)

class Intersection(Base):
    def __init__(self, *sets):
        self.sets = sets
    def __contains__(self, value):
        return all(value in x for x in self.sets)
a = RangeSet(0, 10)
b = RangeSet(5, 15)
u = a.union(b)
i = a.intersection(b)
print 3 in u
print 7 in u
print 12 in u
print 3 in i
print 7 in i
print 12 in i
Running gives you:
True
True
True
False
True
False
You are trying to make a set containing all the integer values in from 0 to 4,294,967,295. A byte is 8 bits, which gets you to 255. 99.9999940628% of your values are over one byte in size. A crude minimum size for your set, even if you are able to overcome the syntactic issues, is 4 billion bytes, or 4 GB.
You are never going to be able to hold an instance of that set in less than a GB of memory. Even with compression, it's likely to be a tough squeeze. You are going to have to get much more clever with your math. You may be able to take advantage of some properties of the set. After all, it's a very special set. What are you trying to do?
If you are using python 3.0, you can subclass collections.Set
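A minimal sketch of that route (the class and predicate are invented for illustration; on Python 3.3+ the ABC lives in collections.abc):
from collections import Set  # collections.abc.Set on Python 3.3+

class Bit5Set(Set):
    """All ints in [0, 1<<32) whose bit 5 is set."""
    def __contains__(self, x):
        return 0 <= x < (1 << 32) and bool(x & 32)
    def __iter__(self):
        # range is lazy on Python 3 (use xrange on Python 2)
        return (x for x in range(1 << 32) if x & 32)
    def __len__(self):
        return 1 << 31  # exactly half the range has bit 5 set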
This sounds like it might overlap with linear programming. In linear programming you are trying to find some optimal case where you add constraints to a set of values (typically integers) which initially can be very large. There are various libraries listed at http://wiki.python.org/moin/NumericAndScientific/Libraries that mention integer and linear programming, but nothing jumps out as being obviously what you want.
I would avoid subclassing set, since clearly you can usefully reuse no part of set's implementation. I would even avoid subclassing collections.Set, since the latter requires you to supply a __len__ -- a functionality which you appear not to need otherwise, and which just can't be done effectively in the general case (it's going to be O(N), which, with the kind of sizes you're talking about, is far too slow). You're unlikely to find some existing implementation that matches your use case well enough to be worth reusing, because your requirements are very specific and even peculiar -- the concept of "random iterating and an occasional duplicate is OK", for example, is a really unusual one.
If your specs are complete (you only need union, intersection, and random iteration, plus occasional additions and removals of single items), implementing a special purpose class that fills those specs is not a crazy undertaking. If you have more specs that you have not explicitly mentioned, it will be trickier, but it's hard to guess without hearing all the specs. So for example, something like:
import random

class AbSet(object):
    def __init__(self, predicate, maxitem=1<<32):
        # set of all ints >= 0 and < maxitem satisfying the predicate
        self.maxitem = maxitem
        self.predicate = predicate
        self.added = set()
        self.removed = set()

    def copy(self):
        x = type(self)(self.predicate, self.maxitem)
        x.added = set(self.added)
        x.removed = set(self.removed)
        return x

    def __contains__(self, item):
        if item in self.removed: return False
        if item in self.added: return True
        return (0 <= item < self.maxitem) and self.predicate(item)

    def __iter__(self):
        # random endless iteration
        while True:
            x = random.randrange(self.maxitem)
            if x in self: yield x

    def add(self, item):
        if item < 0 or item >= self.maxitem: raise ValueError
        if item not in self:
            self.removed.discard(item)
            self.added.add(item)

    def discard(self, item):
        if item < 0 or item >= self.maxitem: raise ValueError
        if item in self:
            self.removed.add(item)
            self.added.discard(item)

    def union(self, o):
        pred = lambda v: self.predicate(v) or o.predicate(v)
        x = type(self)(pred, max(self.maxitem, o.maxitem))
        toadd = [v for v in (self.added | o.added) if not pred(v)]
        torem = [v for v in (self.removed | o.removed) if pred(v)]
        x.added = set(toadd)
        x.removed = set(torem)
        return x

    def intersection(self, o):
        pred = lambda v: self.predicate(v) and o.predicate(v)
        x = type(self)(pred, min(self.maxitem, o.maxitem))
        toadd = [v for v in (self.added & o.added) if not pred(v)]
        torem = [v for v in (self.removed & o.removed) if pred(v)]
        x.added = set(toadd)
        x.removed = set(torem)
        return x
I'm not entirely certain about the logic determining added and removed upon union and intersection, but I hope this is a good base for you to work from.
