I am performing multiple iterations of the type:
masterSet=masterSet.union(setA)
As the set grows the length of time taken to perform these operations is growing (as one would expect, I guess).
I expect the time is taken up checking whether each element of setA is already in masterSet?
My question is: if I KNOW that masterSet does not already contain any of the elements in setA, can I do this quicker?
[UPDATE]
Given that this question is still attracting views I thought I would clear up a few of the things from the comments and answers below:
While iterating, there were many iterations where I knew setA would be distinct from masterSet because of how it was constructed (without having to run any checks), but for a few iterations I did need the uniqueness check.
I wondered if there was a way to 'tell' the masterSet.union() procedure not to bother with the uniqueness check this time around, since I know this setA is distinct from masterSet: just add these elements quickly, trusting the programmer's assertion that they are definitely distinct. Perhaps by calling some different ".unionWithDistinctSet()" procedure or something.
I think the responses have suggested that this isn't possible (and that set operations really should be quick enough anyway), but that using masterSet.update(setA) instead of union is slightly quicker still.
I have accepted the clearest response along those lines, resolved the issue I was having at the time and got on with my life, but would still love to hear whether my hypothesised .unionWithDistinctSet() could ever exist.
You can use set.update to update your master set in place. This saves allocating a new set all the time so it should be a little faster than set.union...
>>> s = set(range(3))
>>> s.update(range(4))
>>> s
set([0, 1, 2, 3])
Of course, if you're doing this in a loop:
masterSet = set()
for setA in iterable:
    masterSet = masterSet.union(setA)
You might get a performance boost by doing something like:
masterSet = set().union(*iterable)
Ultimately, membership testing of a set is O(1) (in the average case), so testing if the element is already contained in the set isn't really a big performance hit.
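To illustrate the O(1) claim, here is a small sketch (the set sizes and lookup keys are arbitrary) showing that a membership test takes roughly the same time whether the set holds a thousand elements or a million:

import timeit

# Membership tests on a small and a large set; the timings should be of the
# same order of magnitude despite the 1000x difference in set size.
setup = "small = set(range(1000)); big = set(range(1000000))"
print(timeit.timeit("999 in small", setup=setup, number=1000000))
print(timeit.timeit("999999 in big", setup=setup, number=1000000))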
As mgilson points out, you can use update to update a set in-place from another set. That actually works out slightly quicker:
def union():
    i = set(range(10000))
    j = set(range(5000, 15000))
    return i.union(j)

def update():
    i = set(range(10000))
    j = set(range(5000, 15000))
    i.update(j)
    return i
timeit.Timer(union).timeit(10000) # 10.351907968521118
timeit.Timer(update).timeit(10000) # 8.83384895324707
If you know your elements are unique, a set is not necessarily the best structure.
A simple list is way faster to extend.
masterList = list(masterSet)
masterList.extend(setA)
For sure, forgoing this check could be a big saving when the __eq__(..) method is very expensive. In the CPython implementation, __eq__(..) is called for every element already in the set that hashes to the same value. (Reference: source code for set.)
However, this functionality will never exist, because it opens up another way to violate the integrity of a set, and the trouble associated with that far outweighs the (typically negligible) performance gain. If this really is determined to be a performance bottleneck, it's not hard to write a C++ extension and use its STL <set>, which should be faster by one or more orders of magnitude.
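To see when those __eq__ calls actually happen, here is a minimal sketch (the Noisy class is purely illustrative) that forces a hash collision, so adding a new element has to compare against the element already stored in that slot:

class Noisy:
    def __init__(self, label):
        self.label = label
    def __hash__(self):
        return 1          # force every instance into the same hash bucket
    def __eq__(self, other):
        print("__eq__ called:", self.label, "vs", other.label)
        return self.label == other.label

s = {Noisy("a")}
s.add(Noisy("b"))   # triggers an __eq__ call against the colliding element

If __eq__ were expensive here, every colliding insert would pay that cost, which is the saving a hypothetical .unionWithDistinctSet() would try to avoid.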
I am making a POST to a Python script; the POST has 2 parameters, Name and Location, and it returns one string. My question is: I am going to have hundreds of these options, so is it faster to do it in a dictionary like this:
myDictionary = {"Name":{"Location":"result", "LocationB":"resultB"},
"Name2":{"Location2":"result2A", "Location2B":"result2B"}}
And then I would use .get("Name").get("Location") to get the results.
OR do something like this:
if Name = "Name":
if Location = "Location":
result = "result"
elif Location = "LocationB":
result = "resultB"
elif Name = "Name2":
if Location = "Location2B":
result = "result2A"
elif Location = "LocationB":
result = "result2B"
Now if there are hundreds or thousands of these what is faster? Or is there a better way all together?
First of all:
Generally, it's much more pythonic to match keys to values using dictionaries. You should do that as a matter of style.
Secondly:
If you really care about performance, Python might not always be the optimal tool. However, the dict approach should be much, much faster, unless your lookups happen about as often as the creation of these dicts. Creating thousands and thousands of PyObjects just to check your case is a really bad idea.
Thirdly:
If you care about your application so much, you might really want to benchmark both solutions. As usual with performance questions, there are a million factors, including your computing platform, that only experiments will help sort out.
Fourth(ly?):
It looks like you're building something like a protocol parser. That's really not Python's forte, performance-wise. Maybe you'd want to look into one of the dozens of tools that can generate C parser code for you and wrap that in a native module; done right, it's pretty sure to be faster than either of your implementations.
Here's the python documentation on Extending Python with C or C++
I decided to test the two scenarios with 1000 names and 2 locations.
The Test Samples
Team Dictionary:
di = {}
for i in range(1000):
    di["Name{}".format(i)] = {'Location': 'result{}'.format(i),
                              'LocationB': 'result{}B'.format(i)}

def get_dictionary_value():
    di.get("Name999").get("LocationB")
Team If Statement:
I used a Python script to generate a 5000-line function if_statements(name, location) following this pattern:
elif name == 'Name994':
    if location == 'Location':
        return 'result994'
    elif location == 'LocationB':
        return 'result994B'
# Some time later ...
def get_if_value():
    if_statements("Name999", "LocationB")
Timing Results
You can use the timeit module to measure the time it takes a function to complete.
import timeit
print(timeit.timeit(get_dictionary_value))
# 0.06353...
print(timeit.timeit(get_if_value))
# 6.3684...
So there you have it: the dictionary was 100 times faster on my machine than the hefty 165 KB if-statement function.
I will root for dict().
In most cases [key] lookup is much faster than conditional checks. As a rule of thumb, conditionals are generally used for boolean logic, not for dispatching on many values.
The reason for this is that when you create a dictionary, you essentially create a registry of that data, stored as hashes in buckets. When you write, for instance, dictionary_name['key'], if that key exists then Python knows the exact location of the value and returns it almost instantly.
Conditionals are different: they are sequential checks, meaning that in the worst case every condition provided has to be checked before the value's existence is established and the respective data returned.
As you can see, with hundreds of statements this can become problematic, so in this case dictionaries are faster. You also need to be cognizant of how often and how early these lookups happen: if a lookup can run before the dictionary has finished being built, you might get a key-not-found error.
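As a concrete illustration of the dictionary approach (the names below are just the ones from the question, and the lookup helper is hypothetical), nested dict.get calls with defaults give O(1) average-case dispatch and avoid raising when a name or location is missing:

routing = {
    "Name":  {"Location": "result",    "LocationB": "resultB"},
    "Name2": {"Location2": "result2A", "Location2B": "result2B"},
}

def lookup(name, location, default=None):
    # Two hash lookups instead of a long chain of comparisons.
    return routing.get(name, {}).get(location, default)

print(lookup("Name2", "Location2B"))   # result2B
print(lookup("NameX", "Location"))     # None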
I have used Python for a while, and from time to time I run into memory explosion problems. I have searched some sources to resolve my question, such as
Memory profiling embedded python
and
https://mflerackers.wordpress.com/2012/04/12/fixing-and-avoiding-memory-leaks-in-python/
and
https://docs.python.org/2/reference/datamodel.html#object.__del__
However, none of them works for me.
My current problem is the memory explosion when using nested function calls. The following code works fine:
class A:
    def fa(self):
        # some operations
        # get dictionary1
        # combine dictionary1 to get string1
        dictionary1 = None
        return string1

    def fb(self):
        for i in range(0, j):
            # call self.fa
            # get dictionary2 by processing string1
            # dictionary1 and dictionary2 are basically the same.
            # update dictionary3 by processing dictionary2
            dictionary2 = None
        return dictionary3

class B:
    def ga(self):
        for n in range(0, m):
            # call A.fb  (as one argument is updated dynamically, I have to call it within the loop)
            # process dictionary3
        return something
The problem arose when I noticed that I don't need to combine dictionary1 into string1; I can pass dictionary1 directly to A.fb. I implemented it that way, and the program became extremely slow and the memory usage exploded to more than 10 times as much. I have verified that both methods return the correct result.
Can anybody suggest why such a small modification results in such a large difference?
Previously, I also noticed this when I was levelizing nodes in a multi-source tree (with 100,000+ nodes). If I start levelizing from the source node (which may have the largest height), the memory usage is 100 times worse than when starting from the source node which may have the smallest height, while the levelization time is about the same.
This has baffled me for a long time. Thank you so much in advance!
If anybody is interested, I can email you the source code for a clearer explanation.
The fact that you're solving the same problem doesn't imply anything about the efficiency of the solution. The same issue can be seen with sorting arrays: you can use bubble sort O(n^2), or merge sort O(n log n), or, if you can apply some restrictions, a non-comparison sorting algorithm like radix or bucket sort, which have linear runtime.
Starting the traversal from different nodes will generate different ways of traversing the graph, some of which might be inefficient (repeating nodes more times).
As for "combine dictionary1 to string1": it might be a very expensive operation, and since this function is called many times, the performance could be significantly poorer. But that's just an educated guess; it cannot be answered without more details about the complexity of the operations performed in these functions.
I have a more-or-less complex data structure (a list of dictionaries of sets) on which I perform a bunch of operations in a loop until the data structure reaches a steady state, i.e. it doesn't change anymore. The number of iterations it takes to perform the calculation varies wildly depending on the input.
I'd like to know if there's an established way for forming a halting condition in this case. The best I could come up with is pickling the data structure, storing its md5 and checking if it has changed from the previous iteration. Since this is more expensive than my operations I only do this every 20 iterations but still, it feels wrong.
Is there a nicer or cheaper way to check for deep equality so that I know when to halt?
Thanks!
Take a look at python-deep. It should do what you want, and if it's not fast enough you can modify it yourself.
It also very much depends on how expensive the compare operation is and how expensive one calculation iteration is. Say one calculation iteration takes time c, one test takes time t, and the chance of termination per iteration is p; then the optimal testing frequency is:
(t * p) / c
That assumes c < t; if that's not true, then you should obviously check every iteration.
So, since you can dynamically track c and t and estimate p (with possible adaptations in the code if it suspects the calculation is about to end), you can set your test frequency to an optimal value.
I think your only choices are:
1. Have every update mark a "dirty flag" when it alters a value from its starting state.
2. Do a whole-structure analysis (like the pickle/md5 combination you suggested).
3. Just run a fixed number of iterations known to reach a steady state (possibly running too many times, but not having the overhead of checking the termination condition).
Option 1 is analogous to what Python itself does with ref-counting. Option 2 is analogous to what Python does with its garbage collector. Option 3 is common in numerical analysis (i.e. run divide-and-average 20 times to compute a square root).
Checking for equality doesn't seem the right way to go to me. Provided that you have full control over the operations you perform, I would introduce a "modified" flag (a boolean variable) that is set to False at the beginning of each iteration. Whenever one of your operations modifies (part of) your data structure, it is set to True, and you repeat until modified remains False throughout a complete iteration.
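A minimal sketch of that approach (the function names are placeholders, not from the question): each operation reports whether it changed anything, and the loop stops after a full pass with no modifications.

def run_until_steady(data, operations):
    while True:
        modified = False
        for op in operations:
            # Each op is assumed to return True if it changed `data` in place.
            changed = op(data)
            modified = modified or changed
        if not modified:
            return data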
I would trust the python equality operator to be reasonably efficient for comparing compositions of built-in objects.
I expect it would be faster than pickling and hashing, provided Python tests list equality something like this:
def __eq__(a, b):
    if type(a) == list and type(b) == list:
        if len(a) != len(b):
            return False
        for i in range(len(a)):
            if a[i] != b[i]:
                return False
        return True
    # testing for other types goes here
Since the function returns as soon as it finds two elements that don't match, in the average case it won't need to iterate through the whole thing. Compare to hashing, which does need to iterate through the whole data structure, even in the best case.
Here's how I would do it:
import copy

def perform_a_bunch_of_operations(data):
    # take care not to modify the original data, as we will be using it later
    my_shiny_new_data = copy.deepcopy(data)
    # do lots of math here...
    return my_shiny_new_data

data = get_initial_data()
while True:
    nextData = perform_a_bunch_of_operations(data)
    if data == nextData:  # steady state reached
        break
    data = nextData
This has the disadvantage of having to make a deep copy of your data each iteration, but it may still be faster than hashing - you can only know for sure by profiling your particular case.
There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.
For example:
# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)
# Get all items that are larger than 500, takes 2.58ms
y = x > 500
# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with far more elements than you have, 75,000 should not be a problem either. :)
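Applied to the revision-set case, a rough sketch (the array size and revision numbers below are made up) is to represent each set as a boolean array indexed by revision number, so the set algebra becomes vectorized elementwise operations:

import numpy as np

MAX_REV = 75000
a = np.zeros(MAX_REV + 1, dtype=bool)
b = np.zeros(MAX_REV + 1, dtype=bool)
a[1:74001] = True            # a long contiguous span of revisions
b[50000:60001] = True

union        = a | b
intersection = a & b
difference   = a & ~b
print(int(intersection.sum()))   # number of revisions in both sets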
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
import re   # needed for the revision-range parsing below

class RevisionSet(set):
    """
    A set of revisions, held in dictionary form for easy manipulation. If we
    were to rewrite this script for Python 2.3+, we would subclass this from
    set (or UserSet). As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""
        revision_range_split_re = re.compile('[-:]')
        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1]) + 1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start, end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e + 1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
For example, all keys from 1 to 74,000 contain true
Why not work on a subset? Just 74001 to the end.
Pruning 74/75ths of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.
There is no compelling reason to use code that supports python 2.3 and earlier.
Just a thought. I used to do this kind of thing using run-length coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, and so on.
Then you can do all sorts of boolean operations on them as variations on a simple merge algorithm.
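A rough sketch of that idea (not the original image code), storing a set of integers as sorted, inclusive (start, end) runs and computing a union by merging the two run lists:

def union_runs(a, b):
    # a and b are sorted lists of inclusive (start, end) runs
    merged = sorted(a + b)
    out = []
    for start, end in merged:
        if out and start <= out[-1][1] + 1:   # overlapping or adjacent runs
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out

print(union_runs([(1, 74000)], [(73990, 75000), (80000, 80005)]))
# [(1, 75000), (80000, 80005)]

Intersection and difference can be written in the same merge style, and long contiguous spans like 1-74000 collapse to a single run.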
This question arose while answering another SO question (there).
When I iterate several times over a Python set (without changing it between calls), can I assume it will always return elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, random, or implementation-defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is whether Python set iteration order depends only on the algorithm used to implement sets, or also on the execution context.
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hash tables (with a perturbation-based probe sequence), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory). You can also have two identical sets that nonetheless produce the items in different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers less than the machine word size (32 bit or 64 bit) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
typical output
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly, the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long stretches of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper, as experiments show that this doesn't happen with sets of ints. I think the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
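One way to avoid it, as a hedged sketch (my addition, not from the original post): give Foo value-based __hash__ and __eq__ instead of the default identity-based hash, which depends on where the objects happen to be allocated in memory. With value-based hashing of small ints (which CPython does not salt), the iteration order no longer varies from run to run:

class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)
    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val
    def __hash__(self):
        return hash(self.val)   # value-based, not address-based

x = set(Foo(y) for y in range(500))
print(list(x)[-10:])   # same output on every run in CPython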
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
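A minimal sketch of that suggestion (not a full set API, just the idea): keep the members as keys of an OrderedDict, so iteration follows insertion order while membership tests stay O(1) on average.

from collections import OrderedDict

class OrderedSet(object):
    def __init__(self, iterable=()):
        self._d = OrderedDict((item, None) for item in iterable)
    def add(self, item):
        self._d[item] = None
    def __contains__(self, item):
        return item in self._d
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)

s = OrderedSet(['b', 'a', 'c', 'a'])
print(list(s))   # ['b', 'a', 'c'] - duplicates dropped, order preserved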
The answer is simply NO.
Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random

random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get this:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
set PYTHONHASHSEED to 0, see details here, here and here.
Use OrderedDict instead.
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.