Python set with the ability to pop a random element - python

I am in need of a Python (2.7) object that functions like a set (fast insertion, deletion, and membership checking) but has the ability to return a random value. Previous questions asked on stackoverflow have answers that are things like:
import random
random.sample(mySet, 1)
But this is quite slow for large sets (it runs in O(n) time).
Other solutions aren't random enough (they depend on the internal representation of python sets, which produces some results which are very non-random):
for e in mySet:
    break
# e is now an element from mySet
I coded my own rudimentary class which has constant time lookup, deletion, and random values.
class randomSet:
    def __init__(self):
        self.dict = {}
        self.list = []
    def add(self, item):
        if item not in self.dict:
            self.dict[item] = len(self.list)
            self.list.append(item)
    def addIterable(self, item):
        for a in item:
            self.add(a)
    def delete(self, item):
        if item in self.dict:
            index = self.dict[item]
            if index == len(self.list)-1:
                del self.dict[self.list[index]]
                del self.list[index]
            else:
                self.list[index] = self.list.pop()
                self.dict[self.list[index]] = index
                del self.dict[item]
    def getRandom(self):
        if self.list:
            return self.list[random.randint(0,len(self.list)-1)]
    def popRandom(self):
        if self.list:
            index = random.randint(0,len(self.list)-1)
            if index == len(self.list)-1:
                del self.dict[self.list[index]]
                return self.list.pop()
            returnValue = self.list[index]
            self.list[index] = self.list.pop()
            self.dict[self.list[index]] = index
            del self.dict[returnValue]
            return returnValue
Are there any better implementations for this, or any big improvements to be made to this code?

I think the best way to do this would be to use the MutableSet abstract base class in collections (collections.abc in Python 3). Inherit from MutableSet, and then define add, discard, __len__, __iter__, and __contains__; also rewrite __init__ to optionally accept a sequence, just like the set constructor does. MutableSet provides mixin definitions of the other set methods it covers (remove, pop, clear, the comparison operators, and the |, &, -, ^ operators with their in-place forms) based on those five methods. That way you get most of the set interface cheaply. (And if you do this, addIterable is defined for you, as the in-place union operator |=.)
discard in the standard set interface appears to be what you have called delete here. So rename delete to discard. Also, instead of duplicating the removal logic inside popRandom, you could define popRandom in terms of getRandom and discard:
def popRandom(self):
    item = self.getRandom()
    self.discard(item)
    return item
That way you don't have to maintain two separate item removal methods.
Finally, in your item removal method (delete now, discard according to the standard set interface), you don't need an if statement. Instead of testing whether index == len(self.list) - 1, simply swap the final item in the list with the item at the index of the list to be popped, and make the necessary change to the reverse-indexing dictionary. Then pop the last item from the list and remove it from the dictionary. This works whether index == len(self.list) - 1 or not:
def discard(self, item):
    if item in self.dict:
        index = self.dict[item]
        self.list[index], self.list[-1] = self.list[-1], self.list[index]
        self.dict[self.list[index]] = index
        del self.list[-1]      # or in one line:
        del self.dict[item]    # del self.dict[self.list.pop()]
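Putting these suggestions together, a minimal sketch might look like the following. The names RandomSet, get_random, pop_random, _items and _index are illustrative rather than taken from the question, and the import fallback is only there so the same code runs on both Python 2.7 and Python 3. Raising KeyError on an empty set is an assumption modelled on set.pop().

```python
import random

try:
    from collections.abc import MutableSet  # Python 3.3+
except ImportError:
    from collections import MutableSet      # Python 2.7


class RandomSet(MutableSet):
    """Set with O(1) add, discard, membership test and random choice."""

    def __init__(self, iterable=()):
        self._index = {}   # item -> position of that item in self._items
        self._items = []   # the items themselves, in arbitrary order
        for item in iterable:
            self.add(item)

    def __contains__(self, item):
        return item in self._index

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        if item not in self._index:
            self._index[item] = len(self._items)
            self._items.append(item)

    def discard(self, item):
        if item in self._index:
            index = self._index.pop(item)
            last = self._items.pop()
            if index < len(self._items):
                # The discarded item was not the last one: move the
                # former last item into the vacated slot.
                self._items[index] = last
                self._index[last] = index

    def get_random(self):
        if not self._items:
            raise KeyError("get_random from an empty RandomSet")
        return random.choice(self._items)

    def pop_random(self):
        item = self.get_random()
        self.discard(item)
        return item
```

Because MutableSet builds its operators on top of these methods, something like RandomSet([1, 2]) | RandomSet([3]) should come back as another RandomSet, which is why __init__ accepts an iterable.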

One approach you could take is to derive a new class from set which salts itself with random objects of a type derived from int.
You can then use pop to select a random element, and if it is not of the salt type, reinsert and return it, but if it is of the salt type, insert a new, randomly-generated salt object (and pop to select a new object).
This will tend to alter the order in which objects are selected. On average, the number of attempts will depend on the proportion of salting elements, i.e. amortised O(k) performance.

Can't we implement a new class inheriting from set with some (hackish) modifications that enable us to retrieve a random element from the list with O(1) lookup time? Btw, on Python 2.x you should inherit from object, i.e. use class randomSet(object). Also PEP8 is something to consider for you :-)
Edit:
For getting some ideas of what hackish solutions might be capable of, this thread is worth reading:
http://python.6.n6.nabble.com/Get-item-from-set-td1530758.html

Here's a solution from scratch, which adds and pops in constant time. I also included some extra set functions for demonstrative purposes.
from random import randint
class RandomSet(object):
    """
    Implements a set in which elements can be
    added and drawn uniformly and randomly in
    constant time.
    """
    def __init__(self, seq=None):
        self.dict = {}
        self.list = []
        if seq is not None:
            for x in seq:
                self.add(x)
    def add(self, x):
        if x not in self.dict:
            self.dict[x] = len(self.list)
            self.list.append(x)
    def pop(self, x=None):
        if x is None:
            i = randint(0,len(self.list)-1)
            x = self.list[i]
        else:
            i = self.dict[x]
        self.list[i] = self.list[-1]
        self.dict[self.list[-1]] = i
        self.list.pop()
        self.dict.pop(x)
        return x
    def __contains__(self, x):
        return x in self.dict
    def __iter__(self):
        return iter(self.list)
    def __repr__(self):
        return "{" + ", ".join(str(x) for x in self.list) + "}"
    def __len__(self):
        return len(self.list)

Yes, I'd implement an "ordered set" in much the same way you did - and use a list as an internal data structure.
However, I'd inherit straight from "set" and just keep track of the added items in an internal list (as you did) - and leave the methods I don't use alone.
Maybe add a "sync" method to update the internal list whenever the set is updated by set-specific operations, like the *_update methods.
That is, if using an "ordered dict" does not cover your use cases. (I just found that casting ordered_dict keys to a regular set is not optimized, so if you need set operations on your data, that is not an option.)

If you don't mind only supporting comparable elements, then you could use blist.sortedset.

Related

The zen of python applied to methods in classes

The Zen of Python tells us:
There should be one-- and preferably only one --obvious way to do it.
This is difficult to put into practice in the following situation.
A class receives a list of documents.
The output is one dictionary per document with a variety of key/value pairs.
Every pair depends on a previously calculated one, or even on key/value pairs from other dictionaries in the list.
This is a very simplified example of such a class.
What is the "obvious" way to go? Every method adds a key/value pair to each of the dictionaries.
class T():
    def __init__(self,mylist):
        #build list of dicts
        self.J = [{str(i):mylist[i]} for i in range(len(mylist))]
        # enhancement 1: upper
        self.method1()
        # enhancement 2: lower
        self.J = self.static2(self.J)
    def method1(self):
        newdict = []
        for i,mydict in enumerate(self.J):
            mydict['up'] = mydict[str(i)].upper()
            newdict.append(mydict)
        self.J = newdict
    @staticmethod
    def static2(alist):
        J = []
        for i,mydict in enumerate(alist):
            mydict['down'] = mydict[str(i)].lower()
            J.append(mydict)
        return J
    @property
    def propmethod(self):
        J = []
        for i,mydict in enumerate(self.J):
            mydict['prop'] = mydict[str(i)].title()
            J.append(mydict)
        return J
    # more methods extracting info out of every doc in the list
    # ...
self.method1() is simply run, and a new key/value pair is added to every dict.
The static method static2 can also be used.
And so can the property.
Of the three ways I discard @property, because I am not adding another attribute.
Of the other two, which one would you choose?
Remember that the class will be composed of tens of these methods that do not add attributes. They only update (add key/value pairs to) dictionaries in a list.
I cannot see the difference between method1 and static2.
thx.

I am trying make a function that takes a number as a parameter and removes all occurrences of it from a Queue

class Queue:
    def __init__(self):
        self.items = []
    def is_empty(self):
        return self.items == []
    def add(self, item):
        self.items.append(item)
    def remove(self):
        self.items.reverse()
        return self.items.pop()
I need to create a function that takes a number and a queue as parameters, then removes every occurrence of that number from the queue while leaving the other elements in place. I've put a model of my Queue class above, and below is a draft of what the function might look like (it's very messy and in its early stages).
def remove_item(q, val):
q_temp = Queue
while not q.is_empty():
q_temp.add(q.remove)
remove_item()
I cannot directly modify it in any way and I can't put the elements of the Queue in a normal list. Anyone got any solutions?
Edit: Also it needs to be executable in IDLE like this
remove_item(queue,number)
What I would do is something like this:
number_to_remove = 123
for i in range(0, queue.length()):
    number = queue.remove()
    if number != number_to_remove:
        queue.add(number)
That way you "loop" through the queue: you look at every number, and if it's not the number you should remove, you just add it again. You need to create the .length() method though.
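For reference, here is a self-contained sketch of that rotation idea. It assumes two changes to the Queue class from the question that are not in the original: remove() takes the oldest item from the front of the list, and a length() method is added. The function only uses the queue's public methods and never touches q.items.

```python
class Queue(object):
    """FIFO queue; remove() and length() are assumed changes to the question's class."""

    def __init__(self):
        self.items = []

    def is_empty(self):
        return not self.items

    def add(self, item):
        self.items.append(item)

    def remove(self):
        # Take the oldest item from the front of the list.
        return self.items.pop(0)

    def length(self):
        return len(self.items)


def remove_item(q, val):
    """Remove every occurrence of val from q, keeping the other items in order."""
    # Rotate through the queue exactly once: each item is taken from the
    # front and re-added at the back unless it equals val.
    for _ in range(q.length()):
        item = q.remove()
        if item != val:
            q.add(item)


queue = Queue()
for n in [1, 123, 2, 123, 3]:
    queue.add(n)
remove_item(queue, 123)
```

After the call, removing items one by one from queue should yield 1, 2, 3 in that order.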
That code will fail because of bad indentation:
def remove_item(q, val):
#> q_temp = Queue
And you're setting q_temp to the class, not to an instance, so it can't be used as a queue. Instantiate it instead:
q_temp = Queue()
Also, the .remove method doesn't work as written, because it reverses the list on every call. There's an easier way to make it work, taking the oldest item from the front of the list:
def remove(self):
    return self.items.pop(0)
But I don't have a (fully valid) answer. If we could access the queue's items, I'd use a conditional list comprehension to drop the given value and assign the result back to the queue:
def removeFromQueue(queue, value = ""):
    items = queue.items
    queue.items = [i for i in items if i != value]
    return queue
If you only let us use the Queue class through its own methods (we can create variables and other queues and call .remove and .add, but not touch .items), you can do this:
def removeFromQueue(queue, value = ""):
    items = []
    for _ in range(len(queue)):
        items.append(queue.remove())
    items = [i for i in items if i != value]
    for i in items:
        queue.add(i)
I don't think that's valid, though. And it would need Queue.__len__() to be defined*:
def __len__(self):
    return len(self.items)
* or modifying the code and defining __iter__ or something similar

How would I be able to return which instance belongs to the random number in my list. Without using a million if statements?

What I am trying to do is return the instance whose range contains a value produced by random.randint(), for each value in a list. Example:
import random
class Testing:
    def __init__(self, name, value):
        self.name = name
        self.value = value
randomtest = Testing('First', range(1, 50))
randomtest_2 = Testing('Second', range(50, 100))
selections = []
counter = 0
while counter < 2:
    counter += 1
    selector = random.randint(1, 100)
    selections.append(selector)
But I don't want to use a million if statements to determine which instance each entry in the selections list belongs to. Like this:
if selections[0] in list(randomtest.value):
    return True
elif selections[0] in list(randomtest_2.value):
    return True
Your help is much appreciated, I am fairly new to programming and my head has just come to a standstill at the moment.
You can use a set for your selections object, then check the intersection with the set.intersection() method:
ex:
In [84]: a = {1, 2}
In [85]: a.intersection(range(4))
Out[85]: {1, 2}
and in your code:
if selections.intersection(randomtest.value):
    return True
You can also define a hase_intersect method for your Testing class, in order to check whether an iterable object intersects with your object:
class Testing:
    def __init__(self, name, value):
        self.name = name
        self.value = value
    def hase_intersect(self, iterable):
        iterable = set(iterable)
        return any(i in iterable for i in self.value)
And check like this:
if randomtest.hase_intersect(selections):
    return True
Based on your comment, if you want to check the intersection of a specific list against a set of objects, you have to iterate over the set of objects and check the intersection using the aforementioned methods. If you want to avoid iterating over the list of objects, you could use a base class with a special method that returns your desired output, but you would still need an iteration to find the names of all the intended instances. Thus, if you definitely want to create different objects, you need at least one iteration for this task.
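As a rough illustration of that single iteration, the sketch below looks up which instance owns each random number. The helper name find_owner and the instances list are made up for this example. Note that 100 falls outside both ranges from the question, so the helper returns None for it.

```python
import random


class Testing(object):
    def __init__(self, name, value):
        self.name = name
        self.value = value


def find_owner(instances, number):
    """Return the first instance whose value contains number, else None."""
    for instance in instances:
        if number in instance.value:
            return instance
    return None


instances = [
    Testing('First', range(1, 50)),
    Testing('Second', range(50, 100)),
]

selections = [random.randint(1, 100) for _ in range(2)]
# One loop over the instances per selection replaces the chain of if/elif tests.
owners = [find_owner(instances, number) for number in selections]
```

On Python 3 the membership test against a range object is constant time; on Python 2.7 range returns a list, so each test is a linear scan.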

How to remove duplicates in set for objects?

I have a set of objects:
class Test(object):
    def __init__(self):
        self.i = random.randint(1,10)
res = set()
for i in range(0,1000):
    res.add(Test())
print len(res)  # prints 1000
How do I remove duplicates from a set of objects?
Thanks for the answers, it works:
class Test(object):
    def __init__(self, i):
        self.i = i
        # self.i = random.randint(1,10)
        # self.j = random.randint(1,20)
    def __keys(self):
        t = ()
        for key in self.__dict__:
            t = t + (self.__dict__[key],)
        return t
    def __eq__(self, other):
        return isinstance(other, Test) and self.__keys() == other.__keys()
    def __hash__(self):
        return hash(self.__keys())
res = set()
res.add(Test(2))
...
res.add(Test(8))
result: [2,8,3,4,5,6,7]
but how to save order ? Sets not support order. Can i use list instead set for example ?
Your objects must be hashable (i.e. must have __eq__() and __hash__() defined) for sets to work properly with them:
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)
    def __eq__(self, other):
        return self.i == other.i
    def __hash__(self):
        return self.i
An object is hashable if it has a hash value which never changes during its lifetime (it needs a __hash__() method), and can be compared to other objects (it needs an __eq__() or __cmp__() method). Hashable objects which compare equal must have the same hash value.
Hashability makes an object usable as a dictionary key and a set member, because these data structures use the hash value internally.
If you have several attributes, hash and compare a tuple of them (thanks, delnan):
class Test(object):
    def __init__(self):
        self.i = random.randint(1, 10)
        self.k = random.randint(1, 10)
        self.j = random.randint(1, 10)
    def __eq__(self, other):
        return (self.i, self.k, self.j) == (other.i, other.k, other.j)
    def __hash__(self):
        return hash((self.i, self.k, self.j))
Your first question is already answered by Pavel Anossov.
But you have another question:
but how to save order ? Sets not support order. Can i use list instead set for example ?
You can use a list, but there are a few downsides:
You get the wrong interface.
You don't get automatic handling of duplicates. You have to explicitly write if foo not in res: res.append(foo). Obviously, you can wrap this up in a function instead of writing it repeatedly, but it's still extra work.
It's going to be a lot less efficient if the collection can get large. Basically, adding a new element, checking whether an element already exists, etc. are all going to be O(N) instead of O(1).
What you want is something that works like an ordered set. Or, equivalently, like a list that doesn't allow duplicates.
If you do all your adds first, and then all your lookups, and you don't need lookups to be fast, you can get around this by first building a list, then using unique_everseen from the itertools recipes to remove duplicates.
Or you could just keep a set and a list of the elements in order (or a list plus a set of elements seen so far). But that can get a bit complicated, so you might want to wrap it up.
Ideally, you want to wrap it up in a type that has exactly the same API as set. Something like an OrderedSet akin to collections.OrderedDict.
Fortunately, if you scroll to the bottom of that docs page, you'll see that exactly what you want already exists; there's a link to an OrderedSet recipe at ActiveState.
So, copy it, paste it into your code, then just change res = set() to res = OrderedSet(), and you're done.
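If you would rather not copy the recipe, here is a much smaller sketch of the same idea. It is not the ActiveState recipe: it is an illustrative class that stores the items as keys of a collections.OrderedDict and gets the rest of the set interface from MutableSet. The import fallback is only there so it runs on both Python 2.7 and Python 3.

```python
from collections import OrderedDict

try:
    from collections.abc import MutableSet  # Python 3.3+
except ImportError:
    from collections import MutableSet      # Python 2.7


class OrderedSet(MutableSet):
    """Set that remembers the order in which items were first added."""

    def __init__(self, iterable=()):
        self._items = OrderedDict()  # keys are the items, values are unused
        for item in iterable:
            self.add(item)

    def __contains__(self, item):
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)

    def add(self, item):
        # Re-adding an existing key leaves its position unchanged.
        self._items[item] = None

    def discard(self, item):
        self._items.pop(item, None)

    def __repr__(self):
        return "OrderedSet(%r)" % (list(self._items),)
```

With hashable items such as the Test class above, OrderedSet([...]) drops duplicates while keeping first-seen order, and membership tests stay O(1).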
I think you can easily do what you want with a list, as you asked in your first post, since you defined the eq operator:
l = []
if Test(0) not in l:
    l.append(Test(0))
My 2 cts ...
Pavel Anossov's answer is great for allowing your class to be used in a set with the semantics you want. However, if you want to preserve the order of your items, you'll need a bit more. Here's a function that de-duplicates a list, as long as the list items are hashable:
def dedupe(lst):
    seen = set()
    results = []
    for item in lst:
        if item not in seen:
            seen.add(item)
            results.append(item)
    return results
A slightly more idiomatic version would be a generator, rather than a function that returns a list. This gets rid of the results variable, using yield rather than appending the unique values to it. I've also renamed the lst parameter to iterable, since it will work just as well on any iterable object (such as another generator).
def dedupe(iterable):
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item

Is there code out there to subclass set in Python for big xranges?

I'm trying to write some Python code that includes union/intersection of sets that potentially can be very large. Much of the time, these sets will be essentially set(xrange(1<<32)) or something of the kind, but often there will be ranges of values that do not belong in the set (say, 'bit 5 cannot be clear'), or extra values thrown in. For the most part, the set contents can be expressed algorithmically.
I can go in and do the dirty work to subclass set and create something, but I feel like this must be something that's been done before, and I don't want to spend days on wheel reinvention.
Oh, and just to make it harder, once I've created the set, I need to be able to iterate over it in random order. Quickly. Even if the set has a billion entries. (And that billion-entry set had better not actually take up gigabytes, because I'm going to have a lot of them.)
Is there code out there? Anyone have neat tricks? Am I asking for the moon?
You say:
For the most part, the set contents can be expressed algorithmically.
How about writing a class which presents the entire set API, but determines set inclusion algorithmically? Then add a number of classes which wrap around other sets to perform the union and intersection algorithmically.
For example, if you had a set a and set b which are instances of these pseudo sets:
>>> u = Union(a, b)
And then you use u with the full set API, which will turn around and query a and b using the correct logic. All the set methods could be designed to return these pseudo unions/intersections automatically so the whole process is transparent.
Edit: Quick example with a very limited API:
class Base(object):
    def union(self, other):
        return Union(self, other)
    def intersection(self, other):
        return Intersection(self, other)
class RangeSet(Base):
    def __init__(self, low, high):
        self.low = low
        self.high = high
    def __contains__(self, value):
        return value >= self.low and value < self.high
class Union(Base):
    def __init__(self, *sets):
        self.sets = sets
    def __contains__(self, value):
        return any(value in x for x in self.sets)
class Intersection(Base):
    def __init__(self, *sets):
        self.sets = sets
    def __contains__(self, value):
        return all(value in x for x in self.sets)
a = RangeSet(0, 10)
b = RangeSet(5, 15)
u = a.union(b)
i = a.intersection(b)
print 3 in u
print 7 in u
print 12 in u
print 3 in i
print 7 in i
print 12 in i
Running gives you:
True
True
True
False
True
False
You are trying to make a set containing all the integer values from 0 to 4,294,967,295. A byte is 8 bits, which gets you to 255. 99.9999940628% of your values are over one byte in size. A crude minimum size for your set, even if you are able to overcome the syntactic issues, is 4 billion bytes, or 4 GB.
You are never going to be able to hold an instance of that set in less than a GB of memory. Even with compression, it's likely to be a tough squeeze. You are going to have to get much more clever with your math. You may be able to take advantage of some properties of the set. After all, it's a very special set. What are you trying to do?
If you are using Python 3.0, you can subclass collections.Set (available as collections.abc.Set from Python 3.3 on).
This sounds like it might overlap with linear programming. In linear programming you are trying to find some optimal case where you add constraints to a set of values (typically integers) which initially can be very large. There are various libraries listed at http://wiki.python.org/moin/NumericAndScientific/Libraries that mention integer and linear programming, but nothing jumps out as being obviously what you want.
I would avoid subclassing set, since clearly you can usefully reuse no part of set's implementation. I would even avoid subclassing collections.Set, since the latter requires you to supply a __len__ -- a functionality which you appear not to need otherwise, and which just can't be done efficiently in the general case (it's going to be O(N), which, with the kind of size you're talking about, is far too slow). You're unlikely to find some existing implementation that matches your use case well enough to be worth reusing, because your requirements are very specific and even peculiar -- the concept of "random iterating and an occasional duplicate is OK", for example, is a really unusual one.
If your specs are complete (you only need union, intersection, and random iteration, plus occasional additions and removals of single items), implementing a special purpose class that fills those specs is not a crazy undertaking. If you have more specs that you have not explicitly mentioned, it will be trickier, but it's hard to guess without hearing all the specs. So for example, something like:
import random
class AbSet(object):
    def __init__(self, predicate, maxitem=1<<32):
        # set of all ints, >=0 and <maxitem, satisfying the predicate
        self.maxitem = maxitem
        self.predicate = predicate
        self.added = set()
        self.removed = set()
    def copy(self):
        x = type(self)(self.predicate, self.maxitem)
        x.added = set(self.added)
        x.removed = set(self.removed)
        return x
    def __contains__(self, item):
        if item in self.removed: return False
        if item in self.added: return True
        return (0 <= item < self.maxitem) and self.predicate(item)
    def __iter__(self):
        # random endless iteration
        while True:
            x = random.randrange(self.maxitem)
            if x in self: yield x
    def add(self, item):
        if item<0 or item>=self.maxitem: raise ValueError
        if item not in self:
            self.removed.discard(item)
            self.added.add(item)
    def discard(self, item):
        if item<0 or item>=self.maxitem: raise ValueError
        if item in self:
            self.removed.add(item)
            self.added.discard(item)
    def union(self, o):
        # no trailing comma here, or pred would be a tuple instead of a function
        pred = lambda v: self.predicate(v) or o.predicate(v)
        x = type(self)(pred, max(self.maxitem, o.maxitem))
        toadd = [v for v in (self.added|o.added) if not pred(v)]
        torem = [v for v in (self.removed|o.removed) if pred(v)]
        x.added = set(toadd)
        x.removed = set(torem)
        return x
    def intersection(self, o):
        pred = lambda v: self.predicate(v) and o.predicate(v)
        x = type(self)(pred, min(self.maxitem, o.maxitem))
        toadd = [v for v in (self.added&o.added) if not pred(v)]
        torem = [v for v in (self.removed&o.removed) if pred(v)]
        x.added = set(toadd)
        x.removed = set(torem)
        return x
I'm not entirely certain about the logic determining added and removed upon union and intersection, but I hope this is a good base for you to work from.
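To show the random-iteration part on its own, here is a small standalone sketch of the same rejection-sampling idea, using the "bit 5 cannot be clear" rule from the question as the predicate. The names bit5_is_set and random_members are made up for this example. Nothing is materialised, so memory use does not depend on the size of the range, but the generator is endless and may repeat values.

```python
import random
from itertools import islice


def bit5_is_set(value):
    """Membership rule from the question: bit 5 cannot be clear."""
    return bool(value & (1 << 5))


def random_members(predicate, maxitem=1 << 32):
    """Endlessly yield random ints in [0, maxitem) that satisfy predicate."""
    while True:
        candidate = random.randrange(maxitem)
        if predicate(candidate):
            yield candidate


# Take a handful of members without ever building the set in memory.
sample = list(islice(random_members(bit5_is_set), 5))
```

The expected number of draws per yielded value is the inverse of the fraction of the range that satisfies the predicate, so this only works well when the set is reasonably dense.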
