Optimise comparison between two lists, giving indices that differ - python

I have three lists: old, new and ignore. old and new are lists of strings. ignore is a list of indices at which a mismatch should be ignored. The objective is to create a list of the indices that differ and are not ignored.
old and new may contain different numbers of elements. If the lists differ in length, the extra positions should be treated as not matching (unless their indices are ignored).
My current function is as follows:
import itertools

def CompareFields(old, new, ignore):
    if old is None:
        if new is None:
            return []
        else:
            return xrange(len(new))
    elif new is None:
        return xrange(len(old))
    # Pad both lists with None so any extra elements register as differences.
    oldPadded = itertools.chain(old, itertools.repeat(None))
    newPadded = itertools.chain(new, itertools.repeat(None))
    comparisonIterator = itertools.izip(xrange(max(len(old), len(new))), oldPadded, newPadded)
    changedItems = [i for i, lhs, rhs in comparisonIterator if lhs != rhs and i not in ignore]
    return changedItems
The various options I have tried give the following timings for 100,000 runs:
[4, 9]
CompareFields: 6.083546
set([9, 4])
Set based: 12.594869
[4, 9]
Function using yield: 13.063725
[4, 9]
Use a (precomputed) ignore bitmap: 7.009405
[4, 9]
Use a precomputed ignore bitmap and give a limit to itertools.repeat(): 8.297951
[4, 9]
Use precomputed ignore bitmap, limit padding and itertools.starmap()/operator.ne(): 11.868687
[4, 9]
Naive implementation: 7.438201
The latest version of Python I have is 2.6 (this is RHEL 5.5). I am currently compiling PyPy to give that a try.
So does anyone have any ideas how to get this function to run faster? Is it worth using Cython?
If I can't get it to run faster I will look at rewriting the whole tool in C++ or Java.
Edit:
Ok I timed the various answers:
[4, 9]
CompareFields: 5.808944
[4, 9]
agf's itertools answer: 4.550836
set([9, 4])
agf's set based answer, but with the list comprehension replaced by a set to avoid duplicates: 9.149389
agf's set based answer, as described in the answer: about 8 seconds
lucho's set based answer: 10.682579
So itertools seems to be the way to go for now. It is surprising that the set based solution performed so poorly, though I am not surprised that using a lambda was slower.
Edit: Java benchmark
Naive implementation, with way too many if statements: 128ms

For both of these solutions, you should do:
ignore = set(ignore)
which will give you constant (average) time membership tests.
I think this is the itertools / zip based method you were looking for:
[i for i, (o, n) in enumerate(izip_longest(old, new))
if o != n and i not in ignore]
No need for chain / repeat to pad -- that's what izip_longest is for. enumerate is also more appropriate than xrange.
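Assembled into a complete function, this is a sketch only: the name compare_fields and the None handling are mine, everything else follows the snippet above. izip_longest is available from Python 2.6, so it fits the OP's environment:
from itertools import izip_longest

def compare_fields(old, new, ignore):
    # Precompute the ignore set and let izip_longest pad the shorter
    # list with None so extra elements register as differences.
    ignore = set(ignore)
    old = old if old is not None else []
    new = new if new is not None else []
    return [i for i, (o, n) in enumerate(izip_longest(old, new))
            if o != n and i not in ignore]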
And a more Pythonic (and possibly faster) version of the filter / set difference method in Lucho's answer:
[i for i, v in set(enumerate(new)).symmetric_difference(enumerate(old))
if i not in ignore]
List comprehensions are preferred over filter or map on a lambda, and there is no need to convert both lists to sets if you use the symmetric_difference method instead of the ^ / xor operator.
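A quick illustration of what the symmetric difference produces (the sample values are mine). Note that a changed index can appear once for each side, which is why the OP wrapped the result in a set in the timings above:
>>> old = ['a', 'b', 'c']
>>> new = ['a', 'x', 'c', 'd']
>>> sorted(set(enumerate(new)).symmetric_difference(enumerate(old)))
[(1, 'b'), (1, 'x'), (3, 'd')]
>>> ignore = set([3])
>>> sorted(i for i, v in set(enumerate(new)).symmetric_difference(enumerate(old))
...        if i not in ignore)
[1, 1]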

Make ignore a set as well.
filter(lambda x: x[0] not in ignore, set(enumerate(new)) ^ set(enumerate(old)))
I bet it will be faster than your overcomplicated, non-Pythonic attempts (it would be cool if you could measure it; I am curious).

List comprehensions are definitely a Pythonic thing to do; I would do something similar to this:
def findDiff(old, new, ignore):
    ignore = set(ignore)
    diff = []
    (small, big) = (old, new) if len(old) < len(new) else (new, old)
    diff.extend([i for i in xrange(0, len(small)) if i not in ignore and old[i] != new[i]])
    diff.extend([i for i in xrange(len(small), len(big)) if i not in ignore])
    return diff
For speed, this assumes that every index beyond the length of the smaller list counts as different, and those indices are still checked against ignore.
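A quick usage sketch (the sample values are mine):
>>> old = ['a', 'b', 'c']
>>> new = ['a', 'x', 'c', 'd']
>>> findDiff(old, new, [3])
[1]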

Related

Why does Python allow out-of-range slice indexes for sequences?

So I just came across what seems to me like a strange Python feature and wanted some clarification about it.
The following array manipulation somewhat makes sense:
p = [1,2,3]
p[3:] = [4]
p = [1,2,3,4]
I imagine it is actually just appending this value to the end, correct?
Why can I do this, however?
p[20:22] = [5,6]
p = [1,2,3,4,5,6]
And even more so this:
p[20:100] = [7,8]
p = [1,2,3,4,5,6,7,8]
This just seems like wrong logic. It seems like this should throw an error!
Any explanation?
-Is it just a weird thing Python does?
-Is there a purpose to it?
-Or am I thinking about this the wrong way?
Part of question regarding out-of-range indices
Slice logic automatically clips the indices to the length of the sequence.
Allowing slice indices to extend past end points was done for convenience. It would be a pain to have to range check every expression and then adjust the limits manually, so Python does it for you.
Consider the use case of wanting to display no more than the first 50 characters of a text message.
The easy way (what Python does now):
preview = msg[:50]
Or the hard way (do the limit checks yourself):
n = len(msg)
preview = msg[:50] if n > 50 else msg
Manually implementing that logic for adjusting the end points would be easy to forget, easy to get wrong (updating the 50 in two places), wordy, and slow. Python moves that logic to its internals, where it is succinct, automatic, fast, and correct. This is one of the reasons I love Python :-)
Part of the question regarding assignments where the slice length doesn't match the input length
The OP also wanted to know the rationale for allowing assignments such as p[20:100] = [7,8] where the assignment target has a different length (80) than the replacement data length (2).
It's easiest to see the motivation by an analogy with strings. Consider, "five little monkeys".replace("little", "humongous"). Note that the target "little" has only six letters and "humongous" has nine. We can do the same with lists:
>>> s = list("five little monkeys")
>>> i = s.index('l')
>>> n = len('little')
>>> s[i : i+n ] = list("humongous")
>>> ''.join(s)
'five humongous monkeys'
This all comes down to convenience.
Prior to the introduction of the copy() and clear() methods, these used to be popular idioms:
s[:] = [] # clear a list
t = u[:] # copy a list
Even now, we use this to update lists when filtering:
s[:] = [x for x in s if not math.isnan(x)] # filter-out NaN values
Hope these practical examples give a good perspective on why slicing works as it does.
The documentation has your answer:
s[i:j]: slice of s from i to j (note (4))
(4) The slice of s from i to j is defined as the sequence of items
with index k such that i <= k < j. If i or j is greater than
len(s), use len(s). If i is omitted or None, use 0. If j
is omitted or None, use len(s). If i is greater than or equal to
j, the slice is empty.
The documentation of IndexError confirms this behavior:
exception IndexError
Raised when a sequence subscript is out of range. (Slice indices are silently truncated to fall in the allowed range; if an index is
not an integer, TypeError is raised.)
Essentially, something like p[20:100] is being reduced to p[len(p):len(p)]. p[len(p):len(p)] is an empty slice at the end of the list, and assigning a list to it modifies the end of the list to contain that list. Thus, it works like appending/extending the original list.
This behavior is the same as what happens when you assign a list to an empty slice anywhere in the original list. For example:
In [1]: p = [1, 2, 3, 4]
In [2]: p[2:2] = [42, 42, 42]
In [3]: p
Out[3]: [1, 2, 42, 42, 42, 3, 4]

Python's list comprehension: Modify list elements if a certain value occurs

How can I do the following in Python's list comprehension?
nums = [1,1,0,1,1]
oFlag = 1
res = []
for x in nums:
    if x == 0:
        oFlag = 0
    res.append(oFlag)
print(res)
# Output: [1,1,0,0,0]
Essentially in this example, zero out the rest of the list once a 0 occurs.
Some context: a list comprehension is a sort of "imperative" syntax for the map and filter functions that exist in many functional programming languages. What you're trying to do is usually referred to as an accumulate, which is a slightly different operation. You can't implement an accumulate in terms of map and filter except by using side effects. Python allows you to have side effects in a list comprehension, so it's definitely possible, but list comprehensions with side effects are a little wonky. Here's how you could implement this using accumulate:
from itertools import accumulate

nums = [1,1,0,1,1]

def accumulator(last, cur):
    return 1 if (last == 1 and cur == 1) else 0

list(accumulate(nums, accumulator))
or in one line:
list(accumulate(nums, lambda last, cur: 1 if (last == 1 and cur == 1) else 0))
Of course there are several ways to do this using an external state and a list comprehension with side effects. Here's an example, it's a bit verbose but very explicit about how state is being manipulated:
class MyState:
    def __init__(self, initial_state):
        self.state = initial_state

    def getNext(self, cur):
        self.state = accumulator(self.state, cur)
        return self.state

mystate = MyState(1)
[mystate.getNext(x) for x in nums]
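Running that on the sample input gives the expected result:
>>> mystate = MyState(1)
>>> [mystate.getNext(x) for x in [1, 1, 0, 1, 1]]
[1, 1, 0, 0, 0]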
nums = [1,1,0,1,1]
[int(all(nums[:i+1])) for i in range(len(nums))]
This steps through the list, applying the all operator to the entire sub-list up to that point.
Output:
[1, 1, 0, 0, 0]
Granted, this is O(n^2), but it gets the job done.
More efficient still is simply to find the index of the first 0, then make a new list of that many 1s padded with the appropriate number of zeros.
if 0 in nums:
    idx = nums.index(0)
    new_list = [1] * idx + [0] * (len(nums) - idx)
... or if the original list can contain elements other than 0 and 1, copy the list that far rather than repeating 1s:
new_list = nums[:idx] + [0] * (len(nums) - idx)
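Putting those pieces together as a function (a sketch; the name zero_out_after and the handling of lists with no 0 are my additions):
def zero_out_after(nums):
    # Index of the first 0, or len(nums) if there is none,
    # in which case a plain copy is returned.
    idx = nums.index(0) if 0 in nums else len(nums)
    return nums[:idx] + [0] * (len(nums) - idx)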
I had an answer using a list comprehension, but @Prune beat me to it. It was really just a cautionary tale, showing how it would be done while making an argument against that approach.
Here's an alternative approach that might fit your needs:
import itertools
import operator
nums = [1,1,0,1,1]
res = itertools.accumulate(nums, operator.and_)
In this case res is an iterable. If you need a list, then
res = list(itertools.accumulate(nums, operator.and_))
Let's break this down. The accumulate() function can be used to generate a running total, or 'accumulated sums'. If only one argument is passed, the default function is addition. Here we pass in operator.and_. The operator module exports a set of efficient functions corresponding to the intrinsic operators of Python. When an accumulated and is run over a list of 0s and 1s, the result is a list that has 1s up until the first 0 is found, and all 0s after that.
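A quick check of that behaviour on the sample input:
>>> import itertools, operator
>>> list(itertools.accumulate([1, 1, 0, 1, 1], operator.and_))
[1, 1, 0, 0, 0]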
Of course we're not limited to using functions defined in the operator module. You can use any function that accepts 2 parameters of the type of the elements in the first parameter (and probably returns the same type). You can get creative, but here I'll keep it simple and just implement and:
import itertools
nums = [1,1,0,1,1]
res = itertools.accumulate(nums, lambda a, b: a and b)
Note: using operator.and_ probably runs faster. Here we're just providing an example using the lambda syntax.
While a list comprehension is not used, to me it has a similar feel. It fits in one line and isn't too hard to read.
For a list comprehension approach, you could use index with enumerate:
firstIndex = nums.index(0) if 0 in nums else len(nums)
[1 if i < firstIndex else 0 for i, x in enumerate(nums)]
Another approach using numpy:
import numpy as np
print(np.cumprod(np.array(nums) != 0).tolist())
#[1, 1, 0, 0, 0]
Here we convert nums to a numpy array and check whether the values are not equal to 0. We then take the cumulative product of the array, knowing that once a 0 is found we will multiply by 0 from that point forward.
Here is a linear-time solution that doesn't mutate global state, doesn't require any other iterators except the nums, and that does what you want, albeit requiring some auxiliary data-structures, and using a seriously hacky list-comprehension:
>>> nums = [1,1,0,1,1]
>>> [f for f, ns in [(1, nums)] for n in ns for f in [f & (n==1)]]
[1, 1, 0, 0, 0]
Don't use this. Use your original for-loop. It is more readable, and almost certainly faster. Don't strive to put everything in a list-comprehension. Strive to make your code simple, readable, and maintainable, which your code already was, and the above code is not.

Remove all elements that satisfy a predicate from a set

Given a mutable set of objects,
A = {1, 2, 3, 4, 5, 6}
I can construct a new set containing only those objects that don't satisfy a predicate ...
B = set(x for x in A if not (x % 2 == 0))
... but how do I modify A in place to contain only those objects? If possible, do this in linear time, without constructing O(n)-sized scratch objects, and without removing anything from A, even temporarily, that doesn't satisfy the predicate.
(Integers are used here only to simplify the example. In the actual code they are Future objects and I'm trying to pull out those that have already completed, which is expected to be a small fraction of the total.)
Note that it is not, in general, safe in Python to mutate an object that you are iterating over. I'm not sure of the precise rules for sets (the documentation doesn't make any guarantee either way).
I only need an answer for 3.4+, but will take more general answers.
(Not actually O(1) due to implementation details, but I'm loath to delete it as it's quite clean.)
Use symmetric_difference_update.
>>> A = {1,2,3,4,5,6}
>>> A.symmetric_difference_update(x for x in A if not (x % 2))
>>> A
{1, 3, 5}
With a horrible time complexity (quadratic), but in O(1) space:
>>> A = {1,2,3,4,5,6}
>>> modified = True
>>> while modified:
...     modified = False
...     for x in A:
...         if not x % 2:
...             A.remove(x)
...             modified = True
...             break
...
>>> A
{1, 3, 5}
On the very specific use case you showed, there is a way to do this in O(1) space, but it doesn't generalize very well to sets containing anything other than int objects:
A = {1, 2, 3, 4, 5, 6}
for i in range(min(A), max(A) + 1):
    # Discard the elements that satisfy the predicate (the even numbers),
    # leaving A == {1, 3, 5} as in the question.
    if i % 2 == 0:
        A.discard(i)
It also wastes time since it will check numbers that aren't even in the set. For anything other than int objects, I can't yet think of a way to do this without creating an intermediate set or container of some sort.
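A common compromise, not drawn from the answers above: it isn't O(1) space, but the scratch list is proportional only to the number of removed items, which the question expects to be a small fraction of the total. Collect the matching elements first, then discard them, so the set is never mutated while being iterated:
def discard_if(s, predicate):
    # Two passes: never mutate the set while iterating over it.
    to_remove = [x for x in s if predicate(x)]
    for x in to_remove:
        s.discard(x)

A = {1, 2, 3, 4, 5, 6}
discard_if(A, lambda x: x % 2 == 0)
# A is now {1, 3, 5}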
For a more general solution, it would be better to simply initially construct your set using the predicate (if you don't need to use the set for anything else first). Something like this:
def items():
    # Maybe this is a file or a stream or something,
    # wherever your initial values are coming from.
    for thing in source:
        yield thing

def predicate(item):
    return bool(item)

A = set(item for item in items() if predicate(item))
To keep memory use constant, this is the only thing that comes to mind:
def filter_Set(predicate, origen: set) -> set:
    resul = set()
    while origen:
        elem = origen.pop()
        if predicate(elem):
            resul.add(elem)
    return resul

def filter_Set_inplace(predicate, origen: set):
    resul = set()
    while origen:
        elem = origen.pop()
        if predicate(elem):
            resul.add(elem)
    while resul:
        origen.add(resul.pop())
With these functions I move elements from one set to the other, keeping only those that satisfy the predicate.
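A quick usage sketch, with the predicate chosen to keep the odd numbers as in the question's example:
>>> A = {1, 2, 3, 4, 5, 6}
>>> filter_Set_inplace(lambda x: x % 2 != 0, A)
>>> A
{1, 3, 5}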

Interleaving multiple iterables randomly while preserving their order in python

Inspired by this earlier stack overflow question I have been considering how to randomly interleave iterables in python while preserving the order of elements within each iterable. For example:
>>> def interleave(*iterables):
... "Return the source iterables randomly interleaved"
... <insert magic here>
>>> interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15))
[1, 5, 10, 11, 2, 6, 3, 12, 4, 13, 7, 14, 8, 9]
The original question asked to randomly interleave two lists, a and b, and the accepted solution was:
>>> c = [x.pop(0) for x in random.sample([a]*len(a) + [b]*len(b), len(a)+len(b))]
However, this solution works for only two lists (though it can easily be extended) and relies on the fact that a and b are lists so that pop() and len() can be called on them, meaning it cannot be used with iterables. It also has the unfortunate side effect of emptying the source lists a and b.
Alternate answers given for the original question take copies of the source lists to avoid modifying them, but this strikes me as inefficient, especially if the source lists are sizeable. The alternate answers also make use of len() and therefore cannot be used on mere iterables.
I wrote my own solution that works for any number of input lists and doesn't modify them:
def interleave(*args):
    iters = [i for i, b in ((iter(a), a) for a in args) for _ in xrange(len(b))]
    random.shuffle(iters)
    return map(next, iters)
but this solution also relies on the source arguments being lists so that len() can be used on them.
So, is there an efficient way to randomly interleave iterables in python, preserving the original order of elements, which doesn't require knowledge of the length of the iterables ahead of time and doesn't take copies of the iterables?
Edit: Please note that, as with the original question, I don't need the randomisation to be fair.
Here is one way to do it using a generator:
import random
def interleave(*args):
    iters = map(iter, args)
    while iters:
        it = random.choice(iters)
        try:
            yield next(it)
        except StopIteration:
            iters.remove(it)
print list(interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15)))
Not if you want it to be "fair".
Imagine you have a list containing one million items and another containing just two items. A "fair" randomization would have the first element from the short list occurring at about index 300000 or so.
a,a,a,a,a,a,a,...,a,a,a,b,a,a,a,....
^
But there's no way to know in advance until you know the length of the lists.
If you just take from each list with 50% (1/n) probability then it can be done without knowing the lengths of the lists but you'll get something more like this:
a,a,b,a,b,a,a,a,a,a,a,a,a,a,a,a,...
^ ^
I am satisfied that the solution provided by aix meets the requirements of the question. However, after reading the comments by Mark Byers I wanted to see just how "unfair" the solution was.
Furthermore, sometime after I wrote this question, stack overflow user EOL posted another solution to the original question which yields a "fair" result. EOL's solution is:
>>> a.reverse()
>>> b.reverse()
>>> [(a if random.randrange(0, len(a)+len(b)) < len(a) else b).pop()
... for _ in xrange(len(a)+len(b))]
I also further enhanced my own solution so that it does not rely on its arguments supporting len() but does make copies of the source iterables:
def interleave(*args):
    iters = sum(([iter(list_arg)] * len(list_arg) for list_arg in map(list, args)), [])
    random.shuffle(iters)
    return map(next, iters)
or, written differently:
def interleave(*args):
    iters = [i for i, j in ((iter(k), k) for k in map(list, args)) for _ in j]
    random.shuffle(iters)
    return map(next, iters)
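A quick usage sketch; the output varies from run to run, and the line below is just one possible ordering, but the elements of each source always appear in their original order:
>>> import random
>>> interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15))
[1, 5, 2, 10, 6, 11, 3, 7, 12, 4, 8, 13, 9, 14]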
I then tested the accepted solution to the original question, written by F.J and reproduced in my question above, to the solutions of aix, EOL and my own. The test involved interleaving a list of 30000 elements with a single element list (the sentinel). I repeated the test 1000 times and the following table shows, for each algorithm, the minimum, maximum and mean index of the sentinel after interleaving, along with the total time taken. We would expect a "fair" algorithm to produce a mean of approx. 15,000:
algo      min   max    mean     total_seconds
----      ---   ---    ----     -------------
F.J:        5   29952  14626.3  152.1
aix:        0       8      0.9   27.5
EOL:       45   29972  15091.0   61.2
srgerg:    23   29978  14961.6   18.6
As can be seen from the results, each of the algorithms of F.J, EOL and srgerg produce ostensibly "fair" results (at least under the given test conditions). However aix's algorithm has always placed the sentinel within the first 10 elements of the result. I repeated the experiment several times with similar results.
So Mark Byers is proved correct. If a truly random interleaving is desired, the length of the source iterables will need to be known ahead of time, or copies will need to be made so the length can be determined.

Question on a solution from Google python class day

Hey,
I'm trying to learn a bit about Python, so I decided to follow Google's tutorial. Anyway, I had a question regarding one of their solutions to an exercise. Here is how I did it:
# E. Given two lists sorted in increasing order, create and return a merged
# list of all the elements in sorted order. You may modify the passed in lists.
# Ideally, the solution should work in "linear" time, making a single
# pass of both lists.
def linear_merge(list1, list2):
    # +++your code here+++
    return sorted(list1 + list2)
However, they did it in a more complicated way. So is Google's solution quicker? I noticed in the comments that the solution should work in "linear" time, which mine probably doesn't.
This is their solution
def linear_merge(list1, list2):
    # +++your code here+++
    # LAB(begin solution)
    result = []
    # Look at the two lists so long as both are non-empty.
    # Take whichever element [0] is smaller.
    while len(list1) and len(list2):
        if list1[0] < list2[0]:
            result.append(list1.pop(0))
        else:
            result.append(list2.pop(0))
    # Now tack on what's left
    result.extend(list1)
    result.extend(list2)
    return result
Could this be another solution?
def linear_merge(list1, list2):
    tmp = []
    while len(list1) and len(list2):
        #print list1[-1], list2[-1]
        if list1[-1] > list2[-1]:
            tmp.append(list1.pop())
        else:
            tmp.append(list2.pop())
        #print "tmp = ", tmp
        #print list1, list2
    # tmp is in decreasing order and the leftover list is increasing,
    # so reverse the leftover before appending, then reverse the result.
    tmp.extend(reversed(list1))
    tmp.extend(reversed(list2))
    tmp.reverse()
    return tmp
Yours is not linear, but that doesn't mean it's slower. Algorithmic complexity ("big-oh notation") is often only a rough guide and always only tells one part of the story.
However, theirs isn't linear either, though it may appear to be at first blush. Popping from the front of a list requires shifting all of the remaining elements, so each pop(0) is itself O(n).
It is a good exercise to think about how to make this O(n). The below is in the same spirit as the given solution, but avoids its pitfalls while generalizing to more than 2 lists for the sake of exercise. For exactly 2 lists, you could remove the heap handling and simply test which next item is smaller.
import heapq

def iter_linear_merge(*args):
    """Yield non-decreasing items from the sorted input iterables."""
    # Technically, [1, 1, 2, 2] isn't an "increasing" sequence,
    # but it is non-decreasing.
    nexts = []
    for x in args:
        x = iter(x)
        for n in x:
            heapq.heappush(nexts, (n, x))
            break
    while len(nexts) >= 2:
        n, x = heapq.heappop(nexts)
        yield n
        for n in x:
            heapq.heappush(nexts, (n, x))
            break
    if nexts:  # Degenerate case of the heap, not strictly required.
        n, x = nexts[0]
        yield n
        for n in x:
            yield n
Instead of the last if-for, the while loop condition could be changed to just "nexts", but it is probably worthwhile to specially handle the last remaining iterator.
If you want to strictly return a list instead of an iterator:
def linear_merge(*args):
    return list(iter_linear_merge(*args))
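As noted above, for exactly two lists the heap can be dropped entirely. Here is a minimal index-based sketch (the name linear_merge_two is mine) that stays O(n) by walking indices instead of popping from the front:
def linear_merge_two(list1, list2):
    result = []
    i = j = 0
    # Take the smaller of the two current heads until one list is exhausted.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # Tack on whatever remains of the other list.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result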
With mostly-sorted data, timsort approaches linear. Also, your code doesn't have to screw around with the lists themselves. Therefore, your code is possibly just a bit faster.
But that's what timing is for, innit?
I think the issue here is that the tutorial is illustrating how to implement a well-known algorithm called 'merge' in Python. The tutorial is not expecting you to actually use a library sorting function in the solution.
sorted() is O(n log n), so your solution cannot be linear in the worst case.
It is important to understand how merge() works because it is useful in many other algorithms. It exploits the fact that the input lists are individually sorted, moving through each list sequentially and selecting the smallest option; the remaining items are appended at the end.
The question isn't which is 'quicker' for a given input case but about which algorithm is more complex.
There are hybrid variations of merge-sort which fall back on another sorting algorithm once the input list size drops below a certain threshold.
