The question arose when answering another SO question (there).
When I iterate several times over a Python set (without changing it between calls), can I assume it will always return the elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, or random? Or implementation defined?
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
The underlying question is whether Python set iteration order depends only on the algorithm used to implement sets, or also on the execution context.
There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hashtables (with perturbation-based probing), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory). You can also have two identical sets that nonetheless produce the items in a different order, for example:
>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])
Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on it staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
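For instance, here is a small, hedged sketch (a CPython detail, not a language guarantee) of how growing a set past a resize threshold and shrinking it back can change the order in which the same elements are later produced:

# s1 and s2 end up equal, but s1 went through a resize, so its items may be
# laid out differently in the hash table and iterate in a different order.
s1 = set()
for x in (32, 1, 2, 3, 64, 128):
    s1.add(x)
s1.discard(128)

s2 = {32, 1, 2, 3, 64}

print(s1 == s2)   # True: equal sets
print(list(s1))   # may differ from...
print(list(s2))   # ...this ordering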
A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers that fit within the machine word size (32-bit or 64-bit) hash to themselves, but text strings, bytes strings, and datetime objects hash to integers that vary randomly; you can control that by setting the PYTHONHASHSEED environment variable.
From the __hash__ docs:
Note
By default, the __hash__() values of str, bytes and datetime
objects are “salted” with an unpredictable random value. Although they
remain constant within an individual Python process, they are not
predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service
caused by carefully-chosen inputs that exploit the worst case
performance of a dict insertion, O(n^2) complexity. See
http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of dicts, sets and
other mappings. Python has never made guarantees about this ordering
(and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.
The results of hashing objects of other classes depend on the details of the class's __hash__ method.
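A quick, hedged way to see the salting for yourself: run the snippet below twice in separate Python processes. The int hash repeats; the str hash normally does not, unless PYTHONHASHSEED is fixed.

# Illustrative only; the exact str hash value will vary from run to run.
print(hash(12345))       # small ints hash to themselves, so this is stable
print(hash('abcdefg'))   # salted: usually different on each interpreter start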
The upshot of all this is that you can have two sets containing identical strings but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs, it will just loop, not printing anything, but on other runs it will quickly find a set that uses a different order to the original.
from random import seed, shuffle
seed(42)
data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)
while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break
Typical output:
dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?
I can answer this part of the question now after a quick experiment. Using the following code:
class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))

print list(x)[-10:]
I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation dependent so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.
The class Foo is just a simple int wrapper as experiments show that this doesn't happen with sets of ints. I think that the problem is caused by the lack of __eq__ and __hash__ members for the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
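For what it's worth, a hedged sketch of one way to avoid the effect (my addition, not part of the original experiment): give Foo value-based __hash__ and __eq__ methods, so the hash no longer depends on the object's memory address, which is what the default identity-based hash uses and what varies between runs.

class Foo(object):
    def __init__(self, val):
        self.val = val
    def __repr__(self):
        return str(self.val)
    def __hash__(self):
        # hash on the wrapped int (ints hash to themselves in CPython)
        return hash(self.val)
    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val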
It’s definitely implementation defined. The specification of a set says only that
Being an unordered collection, sets do not record element position or order of insertion.
Why not use OrderedDict to create your own OrderedSet class?
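A minimal sketch of that idea (not a full MutableSet implementation; on Python 3.7+ a plain dict would preserve insertion order too):

from collections import OrderedDict

class OrderedSet(object):
    def __init__(self, iterable=()):
        self._data = OrderedDict((item, None) for item in iterable)
    def add(self, item):
        self._data[item] = None
    def discard(self, item):
        self._data.pop(item, None)
    def __contains__(self, item):
        return item in self._data
    def __iter__(self):
        # iteration follows insertion order, unlike a built-in set
        return iter(self._data)
    def __len__(self):
        return len(self._data)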
The answer is simply no.
Python set iteration order is NOT stable.
I did a simple experiment to show this.
The code:
import random
random.seed(1)
x = []

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')

for j in set(x):
    print(j.a, j.b)
Run this twice and you will get these results:
First time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935
Process finished with exit code 0
Second time result:
a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030
Process finished with exit code 0
The reason is explained in comments in this answer.
However, there are some ways to make it stable:
Set PYTHONHASHSEED to 0; see details here, here and here (a small sketch of this follows after the list).
Use an OrderedDict instead.
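As a hedged sketch of the first option: you cannot change the hash seed of an already-running interpreter, but you can re-run code in a child process with PYTHONHASHSEED pinned, which makes str hashing (and thus, typically, set ordering for str elements on CPython) reproducible.

import os, subprocess, sys

env = dict(os.environ, PYTHONHASHSEED='0')
# The child process will produce the same ordering on every invocation.
subprocess.run([sys.executable, '-c', "print(list({'a', 'b', 'c'}))"], env=env)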
As pointed out, this is strictly an implementation detail.
But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.
So, being rational, you can safely rely on this behaviour.
(I’m aware that certain GCs may reorder memory in a background thread but even this reordering will not be noticeable on the level of data structures, unless a bug occurs.)
The definition of a set is unordered, unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put it into a list and sort it.
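For example, a trivial sketch:

items = {3, 1, 2}
for x in sorted(items):   # deterministic, sorted order regardless of hashing
    print(x)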
There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.
As noted by many, Python's hash is not consistent anymore (as of version 3.3), as a random PYTHONHASHSEED is now used by default (to address security concerns, as explained in this excellent answer).
However, I have noticed that the hashes of some objects are still consistent (as of Python 3.7, anyway): these include int, float, tuple(x), and frozenset(x) (as long as x yields consistent hashes). For example:
assert hash(10) == 10
assert hash((10, 11)) == 3713074054246420356
assert hash(frozenset([(0, 1, 2), 3, (4, 5, 6)])) == -8046488914261726427
Is that always true and guaranteed? If so, is that expected to stay that way? Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Why am I asking?
I have a system that relies on hashing to remember whether or not we have seen a given dict (in any order): {key: tuple(ints)}. In that system, the keys are a collection of filenames and the tuples a subset of os.stat_result, e.g. (size, mtime) associated with them. This system is used to make update/sync decisions based on detecting differences.
In my application, I have on the order of 100K such dicts, and each can represent several thousands of files and their state, so the compactness of the cache is important.
I can tolerate the small false positive rate (< 10^-19 for 64-bit hashes) coming from possible hash collisions (see also birthday paradox).
One compact representation is the following for each such dict "fsd":
def fsd_hash(fsd: dict):
    return hash(frozenset(fsd.items()))
It is very fast and yields a single int to represent an entire dict (with order-invariance). If anything in the fsd dict changes, with high probability the hash will be different.
Unfortunately, hash is only consistent within a single Python instance, rendering it useless for hosts to compare their respective hashes. Persisting the full cache ({location_name: fsd_hash}) to disk to be reloaded on restart is also useless.
I cannot expect the larger system that uses that module to have been invoked with PYTHONHASHSEED=0, and, to my knowledge, there is no way to change this once the Python instance has started.
Things I have tried
I may use hashlib.sha1 or similar to calculate consistent hashes. This is slower and I can't directly use the frozenset trick: I have to iterate through the dict in a consistent order (e.g. by sorting on keys, slow) while updating the hasher. In my tests on real data, I see over 50x slow-down.
I could try applying an order-invariant hashing algorithm on consistent hashes obtained for each item (also slow, as starting a fresh hasher for each item is time-consuming).
I can try transforming everything into ints or tuples of ints and then frozensets of such tuples. At the moment, it seems that all int, tuple(int) and frozenset(tuple(int)) yield consistent hashes, but: is that guaranteed, and if so, how long can I expect this to be the case?
Additional question: more generally, what would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes? I can implement a custom __hash__ (a consistent one) for the classes I own, but I cannot override str's hash for example. One thing I came up with is:
def const_hash(x):
    if isinstance(x, (int, float, bool)):
        pass
    elif isinstance(x, frozenset):
        x = frozenset([const_hash(v) for v in x])
    elif isinstance(x, str):
        x = tuple([ord(e) for e in x])
    elif isinstance(x, bytes):
        x = tuple(x)
    elif isinstance(x, dict):
        x = tuple([(const_hash(k), const_hash(v)) for k, v in x.items()])
    elif isinstance(x, (list, tuple)):
        x = tuple([const_hash(e) for e in x])
    else:
        try:
            return x.const_hash()
        except AttributeError:
            raise TypeError(f'no known const_hash implementation for {type(x)}')
    return hash(x)
Short answer to broad question: There are no explicit guarantees made about hashing stability aside from the overall guarantee that x == y requires that hash(x) == hash(y). There is an implication that x and y are both defined in the same run of the program (you can't perform x == y where one of them doesn't exist in that program obviously, so no guarantees are needed about the hash across runs).
Longer answers to specific questions:
Is [your belief that int, float, tuple(x), frozenset(x) (for x with consistent hash) have consistent hashes across separate runs] always true and guaranteed?
It's true of numeric types, with the mechanism being officially documented, but the mechanism is only guaranteed for a particular interpreter for a particular build. sys.hash_info provides the various constants, and they'll be consistent on that interpreter, but on a different interpreter (CPython vs. PyPy, 64 bit build vs. 32 bit build, even 3.n vs. 3.n+1) they can differ (documented to differ in the case of 64 vs. 32 bit CPython), so the hashes won't be portable across machines with different interpreters.
No guarantees on algorithm are made for tuple and frozenset; I can't think of any reason they'd change it between runs (if the underlying types are seeded, the tuple and frozenset benefit from it without needing any changes), but they can and do change the implementation between releases of CPython (e.g. in late 2018 they made a change to reduce the number of hash collisions in short tuples of ints and floats), so if you store off the hashes of tuples from say, 3.7, and then compute hashes of the same tuples in 3.8+, they won't match (even though they'd match between runs on 3.7 or between runs on 3.8).
If so, is that expected to stay that way?
Expected to, yes. Guaranteed, no. I could easily see seeded hashes for ints (and by extension, for all numeric types to preserve the numeric hash/equality guarantees) for the same reason they seeded hashes for str/bytes, etc. The main hurdles would be:
It would almost certainly be slower than the current, very simple algorithm.
By documenting the numeric hashing algorithm explicitly, they'd need a long period of deprecation before they could change it.
It's not strictly necessary (if web apps need seeded hashes for DoS protection, they can always just convert ints to str before using them as keys).
Is the PYTHONHASHSEED only applied to salt the hash of strings and byte arrays?
Beyond str and bytes, it applies to a number of other things that implement their own hashing in terms of the hash of str or bytes, often because they're already naturally convertible to raw bytes and are commonly used as keys in dicts populated by web-facing frontends. The ones I know of off-hand include the various classes of the datetime module (datetime, date, time, though this isn't actually documented in the module itself), and read-only memoryviews with byte-sized formats (which hash equivalently to hashing the result of the view's .tobytes() method).
What would be a good way to write a consistent hash replacement for hash(frozenset(some_dict.items())) when the dict contains various types and classes?
The simplest/most composable solution would probably be to define your const_hash as a single-dispatch function (functools.singledispatch), using it the same way you use hash itself. This avoids having one single function defined in a single place that must handle all types; you can keep the const_hash default implementation (which just relies on hash for those things with known consistent hashes) in a central location, and provide additional registrations for the built-in types you know aren't consistent (or which might contain inconsistent stuff) there, while still allowing people to extend the set of things it covers seamlessly: they import your const_hash and decorate the implementation for their type with @const_hash.register. It's not significantly different in effect from your proposed const_hash, but it's a lot more manageable.
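A hedged sketch of that approach, using functools.singledispatch (the registrations shown here are illustrative, mirroring the const_hash proposed above):

from functools import singledispatch

@singledispatch
def const_hash(x):
    # default: rely on the built-in hash (ints, floats, bools, ...)
    return hash(x)

@const_hash.register(str)
def _(x):
    # str hashes are salted, so hash a stable transformation instead
    return hash(tuple(ord(c) for c in x))

@const_hash.register(bytes)
def _(x):
    return hash(tuple(x))

@const_hash.register(frozenset)
def _(x):
    return hash(frozenset(const_hash(v) for v in x))

@const_hash.register(dict)
def _(x):
    return hash(frozenset((const_hash(k), const_hash(v)) for k, v in x.items()))

@const_hash.register(tuple)
@const_hash.register(list)
def _(x):
    return hash(tuple(const_hash(e) for e in x))

Third-party code can then add @const_hash.register(TheirType) next to their own class definitions instead of editing a central if/elif chain.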
The following is my understanding of types and parameter passing in Java and Python:
In Java, there are primitive types and non-primitive types. The former are not objects; the latter are objects.
In Python, they are all objects.
In Java, arguments are passed by value because:
primitive types are copied and then passed, so they are passed by value for sure; non-primitive types are passed as references, but a reference (pointer) is itself a value, so they are also passed by value.
In Python, the only difference is that 'primitive types' (for example, numbers) are not copied, but are simply treated as objects.
Based on the official docs, arguments are passed by assignment. What does 'passed by assignment' mean? Do objects in Java work the same way as in Python? What causes the difference (passed by value in Java versus passed by assignment in Python)?
And is anything wrong with my understanding above?
tl;dr: You're right that Python's semantics are essentially Java's semantics, without any primitive types.
"Passed by assignment" is actually making a different distinction than the one you're asking about.1 The idea is that argument passing to functions (and other callables) works exactly the same way assignment works.
Consider:
def f(x):
    pass

a = 3
b = a
f(a)
b = a means that the target b, in this case a name in the global namespace, becomes a reference to whatever value a references.
f(a) means that the target x, in this case a name in the local namespace of the frame built to execute f, becomes a reference to whatever value a references.
The semantics are identical. Whenever a value gets assigned to a target (which isn't always a simple name—e.g., think lst[0] = a or spam.eggs = a), it follows the same set of assignment rules—whether it's an assignment statement, a function call, an as clause, or a loop iteration variable, there's just one set of rules.
But overall, your intuitive idea that Python is like Java but with only reference types is accurate: You always "pass a reference by value".
Arguing over whether that counts as "pass by reference" or "pass by value" is pointless. Trying to come up with a new unambiguous name for it that nobody will argue about is even more pointless. Liskov invented the term "call by object" three decades ago, and if that never caught on, anything someone comes up with today isn't likely to do any better.
You understand the actual semantics, and that's what matters.
And yes, this means there is no copying. In Java, only primitive values are copied, and Python doesn't have primitive values, so nothing is copied.
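A tiny illustration of both points (assignment semantics, and the absence of copying):

def f(x):
    print(x is a)   # True: the parameter is bound to the very same object
    x = [99]        # rebinding the local name x does not touch a
    print(x is a)   # False

a = [1, 2, 3]
f(a)
print(a)            # still [1, 2, 3]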
the only difference is that 'primitive types'(for example, numbers) are not copied, but simply taken as objects
It's much better to see this as "the only difference is that there are no 'primitive types' (not even simple numbers)", just as you said at the start.
It's also worth asking why Python has no primitive types—or why Java does.2
Making everything "boxed" can be very slow. Adding 2 + 3 in Python means dereferencing the 2 and 3 objects, getting the native values out of them, adding them together, and wrapping the result up in a new 5 object (or looking it up in a table because you already have an existing 5 object). That's a lot more work than just adding two ints.3
While a good JIT like Hotspot—or like PyPy for Python—can often automatically do those optimizations, sometimes "often" isn't good enough. That's why Java has native types: to let you manually optimize things in those cases.
Python, instead, relies on third-party libraries like Numpy, which let you pay the boxing costs just once for a whole array, instead of once per element. Which keeps the language simpler, but at the cost of needing Numpy.4
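A rough, hedged illustration of that trade-off (assumes NumPy is installed; no timings claimed here):

import numpy as np

xs = list(range(1000000))
ys = [x + 1 for x in xs]      # one boxed int object created per element

arr = np.arange(1000000)
brr = arr + 1                 # one vectorized pass over unboxed machine ints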
1. As far as I know, "passed by assignment" appears a couple times in the FAQs, but is not actually used in the reference docs or glossary. The reference docs already lean toward intuitive over rigorous, but the FAQ, like the tutorial, goes much further in that direction. So, asking what a term in the FAQ means, beyond the intuitive idea it's trying to get across, may not be a meaningful question in the first place.
2. I'm going to ignore the issue of Java's lack of operator overloading here. There's no reason they couldn't include special language rules for a handful of core classes, even if they didn't let you do the same thing with your own classes—e.g., Go does exactly that for things like range, and people rarely complain.
3. … or even than looping over two arrays of 30-bit digits, which is what Python actually does. The cost of working on unlimited-size "bigints" is tiny compared to the cost of boxing, so Python just always pays that extra, barely-noticeable cost. Python 2 did, like Java, have separate fixed and bigint types, but a couple decades of experience showed that it wasn't getting any performance benefits out of the extra complexity.
4. The implementation of Numpy is of course far from simple. But using it is pretty simple, and a lot more people need to use Numpy than need to write Numpy, so that turns out to be a pretty decent tradeoff.
Similar to passing reference types by value in C#.
Docs: https://learn.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/passing-reference-type-parameters#passing-reference-types-by-value
Code demo:
# mutable object
l = [9, 8, 7]

def createNewList(l1: list):
    # l1 + [0] creates a new list object; the local variable l1 is rebound
    # to it without affecting the variable l
    l1 = l1 + [0]

def changeList(l1: list):
    # append adds an element to the end of the list; because l1 and l refer
    # to the same object, l also changes
    l1.append(0)

print(l)            # [9, 8, 7]
createNewList(l)
print(l)            # [9, 8, 7] - unchanged
changeList(l)
print(l)            # [9, 8, 7, 0]

# immutable object
num = 9

def changeValue(val: int):
    # int is immutable; rebinding val to the new object 8 does not change num
    val = 8

print(num)          # 9
changeValue(num)
print(num)          # 9 - unchanged
I am performing multiple iterations of the type:
masterSet=masterSet.union(setA)
As the set grows the length of time taken to perform these operations is growing (as one would expect, I guess).
I expect that the time is taken up checking whether each element of setA is already in masterSet?
My question is: if I KNOW that masterSet does not already contain any of the elements in setA, can I do this quicker?
[UPDATE]
Given that this question is still attracting views I thought I would clear up a few of the things from the comments and answers below:
When iterating, there were many iterations where I knew setA would be distinct from masterSet because of how it was constructed (without having to run any checks), but in a few iterations I did need the uniqueness check.
I wondered if there was a way to 'tell' the masterSet.union() procedure not to bother with the uniqueness check this time around, as I know this set is distinct from masterSet: just add these elements quickly, trusting the programmer's assertion that they are definitely distinct. Perhaps through calling some different ".unionWithDistinctSet()" procedure or something.
I think the responses have suggested that this isn't possible (and that set operations should really be quick enough anyway), but to use masterSet.update(setA) instead of union as it's slightly quicker.
I have accepted the clearest response along those lines, resolved the issue I was having at the time and got on with my life, but would still love to hear if my hypothesised .unionWithDistinctSet() could ever exist?
You can use set.update to update your master set in place. This saves allocating a new set all the time so it should be a little faster than set.union...
>>> s = set(range(3))
>>> s.update(range(4))
>>> s
set([0, 1, 2, 3])
Of course, if you're doing this in a loop:
masterSet = set()
for setA in iterable:
    masterSet = masterSet.union(setA)
You might get a performance boost by doing something like:
masterSet = set().union(*iterable)
Ultimately, membership testing of a set is O(1) (in the average case), so testing if the element is already contained in the set isn't really a big performance hit.
As mgilson points out, you can use update to update a set in-place from another set. That actually works out slightly quicker:
import timeit

def union():
    i = set(range(10000))
    j = set(range(5000, 15000))
    return i.union(j)

def update():
    i = set(range(10000))
    j = set(range(5000, 15000))
    i.update(j)
    return i
timeit.Timer(union).timeit(10000) # 10.351907968521118
timeit.Timer(update).timeit(10000) # 8.83384895324707
If you know your elements are unique, a set is not necessarily the best structure.
A simple list is way faster to extend.
masterList = list(masterSet)
masterList.extend(setA)
For sure, forgoing this check could be a big saving when the __eq__(..) method is very expensive. In the CPython implementation, __eq__(..) is called for every element already in the set that hashes to the same value. (Reference: source code for set.)
However, there will never be this functionality in a million years, because it opens up another way to violate the integrity of a set. The trouble associated with that far outweighs the (typically negligible) performance gain. If this is determined to be a performance bottleneck, it's not hard to write a C++ extension and use its STL <set>, which should be faster by one or more orders of magnitude.
Here's what I have so far:
def is_ordered(collection):
    if isinstance(collection, set):
        return False
    if isinstance(collection, list):
        return True
    if isinstance(collection, dict):
        return False
    raise Exception("unknown collection")
Is there a much better way to do this?
NB: I do mean ordered and not sorted.
Motivation:
I want to iterate over an ordered collection. e.g.
def most_important(priorities):
    for p in priorities:
        print p
In this case the fact that priorities is ordered is important. What kind of collection it is is not. I'm trying to live by duck-typing here. I have frequently been dissuaded from type checking by Pythonistas.
If the collection is truly arbitrary (meaning it can be of any class whatsoever), then the answer has to be no.
Basically, there are two possible approaches:
know about every possible class that can be presented to your method, and whether it's ordered;
test the collection yourself by inserting into it every possible combination of keys, and seeing whether the ordering is preserved.
The latter is clearly infeasible. The former is along the lines of what you already have, except that you have to know about every derived class such as collections.OrderedDict; checking for dict is not enough.
Frankly, I think the whole is_ordered check is a can of worms. Why do you want to do this anyway?
Update: In essence, you are trying to unittest the argument passed to you. Stop doing that, and unittest your own code. Test your consumer (make sure it works with ordered collections), and unittest the code that calls it, to ensure it is getting the right results.
In a statically-typed language you would simply restrict yourself to specific types. If you really want to replicate that, simply specify the only types you accept, and test for those. Raise an exception if anything else is passed. It's not pythonic, but it reliably achieves what you want to do
Well, you have two possible approaches:
Anything with an append method is almost certainly ordered; and
If it only has an add method, you can try adding a nonce value, then iterating over the collection to see if the nonce appears at the end (or, perhaps, at one end); you could try adding a second nonce and doing it again just to be more confident (a rough sketch of this follows below).
Of course, this won't work where e.g. the collection is empty, or there is an ordering function that doesn't result in addition at the ends.
Probably a better solution is simply to specify that your code requires ordered collections, and only pass it ordered collections.
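For completeness, a rough, hedged sketch of the nonce test described in the second approach above (heuristic only: it mutates a shallow copy, can be fooled by hash ordering, and treats anything with append as ordered):

import copy

def looks_ordered(collection):
    nonce = object()              # guaranteed not to already be in the collection
    probe = copy.copy(collection)
    try:
        probe.append(nonce)       # list-like: append implies ordered
        return True
    except AttributeError:
        pass
    try:
        probe.add(nonce)          # set-like: see where the nonce ends up
    except AttributeError:
        return False              # neither append nor add: give up
    items = list(probe)
    return items[-1] is nonce or items[0] is nonce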
I think that enumerating the 90% case is about as good as you're going to get (if using Python 3, replace basestring with str). Probably also want to consider how you would handle generator expressions and similar ilk, too (again, if using Py3, skip the xrangor):
import collections

generator = type((i for i in xrange(0)))
enumerator = type(enumerate(range(0)))
xrangor = type(xrange(0))

is_ordered = lambda seq: isinstance(seq, (tuple, list, collections.OrderedDict,
                                          basestring, generator, enumerator, xrangor))
If your callers start using itertools, then you'll also need to add itertools types as returned by islice, imap, groupby. But the sheer number of these special cases really starts to point to a code smell.
What if the list is not ordered, e.g. [1,3,2]?
There's this script called svnmerge.py that I'm trying to tweak and optimize a bit. I'm completely new to Python though, so it's not easy.
The current problem seems to be related to a class called RevisionSet in the script. In essence what it does is create a large hashtable(?) of integer-keyed boolean values. In the worst case - one for each revision in our SVN repository, which is near 75,000 now.
After that it performs set operations on such huge arrays - addition, subtraction, intersection, and so forth. The implementation is the simplest O(n) implementation, which, naturally, gets pretty slow on such large sets. The whole data structure could be optimized because there are long spans of continuous values. For example, all keys from 1 to 74,000 might contain true. Also the script is written for Python 2.2, which is a pretty old version and we're using 2.6 anyway, so there could be something to gain there too.
I could try to cobble this together myself, but it would be difficult and take a lot of time - not to mention that it might be already implemented somewhere. Although I'd like the learning experience, the result is more important right now. What would you suggest I do?
You could try doing it with numpy instead of plain python. I found it to be very fast for operations like these.
For example:
# Create 1000000 numbers between 0 and 1000, takes 21ms
x = numpy.random.randint(0, 1000, 1000000)
# Get all items that are larger than 500, takes 2.58ms
y = x > 500
# Add 10 to those items, takes 26.1ms
x[y] += 10
Since that's with a lot more rows, I think that 75000 should not be a problem either :)
Here's a quick replacement for RevisionSet that makes it into a set. It should be much faster. I didn't fully test it, but it worked with all of the tests that I did. There are undoubtedly other ways to speed things up, but I think that this will really help because it actually harnesses the fast implementation of sets rather than doing loops in Python which the original code was doing in functions like __sub__ and __and__. The only problem with it is that the iterator isn't sorted. You might have to change a little bit of the code to account for this. I'm sure there are other ways to improve this, but hopefully it will give you a good start.
import re

class RevisionSet(set):
    """
    A set of revisions, held in dictionary form for easy manipulation. If we
    were to rewrite this script for Python 2.3+, we would subclass this from
    set (or UserSet).  As this class does not include branch
    information, it's assumed that one instance will be used per
    branch.
    """
    def __init__(self, parm):
        """Constructs a RevisionSet from a string in property form, or from
        a dictionary whose keys are the revisions. Raises ValueError if the
        input string is invalid."""

        revision_range_split_re = re.compile('[-:]')

        if isinstance(parm, set):
            print "1"
            self.update(parm.copy())
        elif isinstance(parm, list):
            self.update(set(parm))
        else:
            parm = parm.strip()
            if parm:
                for R in parm.split(","):
                    rev_or_revs = re.split(revision_range_split_re, R)
                    if len(rev_or_revs) == 1:
                        self.add(int(rev_or_revs[0]))
                    elif len(rev_or_revs) == 2:
                        self.update(set(range(int(rev_or_revs[0]),
                                              int(rev_or_revs[1])+1)))
                    else:
                        raise ValueError, 'Ill formatted revision range: ' + R

    def sorted(self):
        return sorted(self)

    def normalized(self):
        """Returns a normalized version of the revision set, which is an
        ordered list of couples (start,end), with the minimum number of
        intervals."""
        revnums = sorted(self)
        revnums.reverse()
        ret = []
        while revnums:
            s = e = revnums.pop()
            while revnums and revnums[-1] in (e, e+1):
                e = revnums.pop()
            ret.append((s, e))
        return ret

    def __str__(self):
        """Convert the revision set to a string, using its normalized form."""
        L = []
        for s, e in self.normalized():
            if s == e:
                L.append(str(s))
            else:
                L.append(str(s) + "-" + str(e))
        return ",".join(L)
Addition:
By the way, I compared doing unions, intersections and subtractions of the original RevisionSet and my RevisionSet above, and the above code is from 3x to 7x faster for those operations when operating on two RevisionSets that have 75000 elements. I know that other people are saying that numpy is the way to go, but if you aren't very experienced with Python, as your comment indicates, then you might not want to go that route because it will involve a lot more changes. I'd recommend trying my code, seeing if it works and if it does, then see if it is fast enough for you. If it isn't, then I would try profiling to see what needs to be improved. Only then would I consider using numpy (which is a great package that I use quite frequently).
For example, all keys from 1 to 74,000 contain true
Why not work on a subset? Just 74001 to the end.
Pruning 74/75th of your data is far easier than trying to write an algorithm more clever than O(n).
You should rewrite RevisionSet to have a set of revisions. I think the internal representation for a revision should be an integer and revision ranges should be created as needed.
There is no compelling reason to use code that supports python 2.3 and earlier.
Just a thought. I used to do this kind of thing using run-coding in binary image manipulation. That is, store each set as a series of numbers: number of bits off, number of bits on, number of bits off, etc.
Then you can do all sorts of boolean operations on them as decorations on a simple merge algorithm.
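A hedged sketch of a variant of that idea (my illustration, not part of svnmerge.py): represent a revision set as sorted, non-overlapping (start, end) runs, so a mostly contiguous set of 74,000 revisions collapses to a handful of tuples, and union becomes a simple merge.

def runs_from_revisions(revisions):
    """Collapse an iterable of ints into sorted (start, end) runs."""
    runs = []
    for r in sorted(revisions):
        if runs and r <= runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], max(runs[-1][1], r))
        else:
            runs.append((r, r))
    return runs

def union_runs(a, b):
    """Merge two run lists, coalescing adjacent or overlapping runs."""
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example: two large, mostly-contiguous revision sets stay tiny as runs.
a = runs_from_revisions(range(1, 74001))          # [(1, 74000)]
b = runs_from_revisions([74000, 74001, 74500])    # [(74000, 74001), (74500, 74500)]
print(union_runs(a, b))                           # [(1, 74001), (74500, 74500)]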