Inspired by this earlier Stack Overflow question, I have been considering how to randomly interleave iterables in Python while preserving the order of elements within each iterable. For example:
>>> def interleave(*iterables):
... "Return the source iterables randomly interleaved"
... <insert magic here>
>>> interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15))
[1, 5, 10, 11, 2, 6, 3, 12, 4, 13, 7, 14, 8, 9]
The original question asked to randomly interleave two lists, a and b, and the accepted solution was:
>>> c = [x.pop(0) for x in random.sample([a]*len(a) + [b]*len(b), len(a)+len(b))]
However, this solution works for only two lists (though it can easily be extended) and relies on the fact that a and b are lists so that pop() and len() can be called on them, meaning it cannot be used with arbitrary iterables. It also has the unfortunate side effect of emptying the source lists a and b.
Alternate answers given for the original question take copies of the source lists to avoid modifying them, but this strikes me as inefficient, especially if the source lists are sizeable. The alternate answers also make use of len() and therefore cannot be used on mere iterables.
I wrote my own solution that works for any number of input lists and doesn't modify them:
def interleave(*args):
    iters = [i for i, b in ((iter(a), a) for a in args) for _ in xrange(len(b))]
    random.shuffle(iters)
    return map(next, iters)
but this solution also relies on the source arguments being lists so that len() can be used on them.
So, is there an efficient way to randomly interleave iterables in python, preserving the original order of elements, which doesn't require knowledge of the length of the iterables ahead of time and doesn't take copies of the iterables?
Edit: Please note that, as with the original question, I don't need the randomisation to be fair.
Here is one way to do it using a generator:
import random
def interleave(*args):
    iters = map(iter, args)
    while iters:
        it = random.choice(iters)
        try:
            yield next(it)
        except StopIteration:
            iters.remove(it)
print list(interleave(xrange(1, 5), xrange(5, 10), xrange(10, 15)))
Not if you want it to be "fair".
Imagine you have a list containing one million items and another containing just two items. A "fair" randomization would have the first element from the short list occurring at about index 300000 or so.
a,a,a,a,a,a,a,...,a,a,a,b,a,a,a,....
^
But there's no way to know in advance until you know the length of the lists.
If you just take from each list with 50% (1/n) probability then it can be done without knowing the lengths of the lists but you'll get something more like this:
a,a,b,a,b,a,a,a,a,a,a,a,a,a,a,a,...
^ ^
I am satisfied that the solution provided by aix meets the requirements of the question. However, after reading the comments by Mark Byers I wanted to see just how "unfair" the solution was.
Furthermore, some time after I wrote this question, Stack Overflow user EOL posted another solution to the original question which yields a "fair" result. EOL's solution is:
>>> a.reverse()
>>> b.reverse()
>>> [(a if random.randrange(0, len(a)+len(b)) < len(a) else b).pop()
... for _ in xrange(len(a)+len(b))]
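For illustration, here is a hedged sketch of how EOL's weighted choice might be generalised to any number of source lists (the function name and the proportional-choice loop are my own, not EOL's):
import random

def fair_interleave(*lists):
    # Work on reversed copies so pop() takes items from the original front
    # in O(1) without modifying the caller's lists.
    stacks = [list(reversed(lst)) for lst in lists]
    result = []
    total = sum(len(s) for s in stacks)
    for _ in xrange(total):
        # Pick a source with probability proportional to its remaining length,
        # which is the property that makes EOL's two-list version fair.
        r = random.randrange(sum(len(s) for s in stacks))
        for s in stacks:
            if r < len(s):
                result.append(s.pop())
                break
            r -= len(s)
    return result
Note that this still makes copies, so it still needs the lengths of the inputs; that is consistent with the conclusion below.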
I also further enhanced my own solution so that it does not rely on its arguments supporting len() but does make copies of the source iterables:
def interleave(*args):
    iters = sum(([iter(list_arg)]*len(list_arg) for list_arg in map(list, args)), [])
    random.shuffle(iters)
    return map(next, iters)
or, written differently:
def interleave(*args):
    iters = [i for i, j in ((iter(k), k) for k in map(list, args)) for _ in j]
    random.shuffle(iters)
    return map(next, iters)
I then compared the accepted solution to the original question (written by F.J and reproduced in my question above) against the solutions of aix, EOL and my own. The test involved interleaving a list of 30,000 elements with a single-element list (the sentinel). I repeated the test 1000 times, and the following table shows, for each algorithm, the minimum, maximum and mean index of the sentinel after interleaving, along with the total time taken. We would expect a "fair" algorithm to produce a mean of approximately 15,000:
algo       min    max     mean  total_seconds
----       ---    ---     ----  -------------
F.J:         5  29952  14626.3          152.1
aix:         0      8      0.9           27.5
EOL:        45  29972  15091.0           61.2
srgerg:     23  29978  14961.6           18.6
As can be seen from the results, the algorithms of F.J, EOL and srgerg each produce ostensibly "fair" results (at least under the given test conditions). However, aix's algorithm always placed the sentinel within the first 10 elements of the result. I repeated the experiment several times with similar results.
So Mark Byers is proved correct. If a truly random interleaving is desired, the length of the source iterables will need to be known ahead of time, or copies will need to be made so the length can be determined.
EDIT: I know there are other solutions to this. My question is what I am doing wrong and where my logic is flawed. Nothing else.
I was solving the minions work assignment problem in Python.
The question is the following:
Write a function called solution(data, n) that takes in a list of less than 100 integers and a number n, and returns that same list but with all of the numbers that occur more than n times removed entirely. The returned list should retain the same ordering as the original list - you don't want to mix up those carefully planned shift rotations! For instance, if data was [5, 10, 15, 10, 7] and n was 1, solution(data, n) would return the list [5, 15, 7] because 10 occurs twice, and thus was removed from the list entirely.
My code is the following
from collections import OrderedDict
def solution(data, n):
    # Your code here
    if len(data) >= 100:
        return []
    seen = OrderedDict()
    s = []
    for i in data:
        if i in seen:
            seen[i] += 1
        else:
            seen[i] = 1
    for k in seen:
        if seen[k] <= n:
            s.append(k)
    return s
My logic was to use an OrderedDict to keep track of the numbers and how many times each shows up. This way the code runs in linear time instead of n^2 (which checking the count of every value in data would require). This worked for most cases but fails for some. What am I missing? Is there some space constraint? Some overlooked case?
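For comparison, here is a minimal sketch of the behaviour the problem statement seems to ask for: count first, then keep every original element (repeats included) whose total count does not exceed n. The code above appends each distinct key only once, so allowed duplicates are dropped. This sketch uses collections.Counter (available since Python 2.7) and is only an illustration, not the grader's reference solution:
from collections import Counter

def solution(data, n):
    # Count every value first, then keep each original element (repeats
    # included) whose total count does not exceed n.
    if len(data) >= 100:
        return []
    counts = Counter(data)
    return [x for x in data if counts[x] <= n]
For example, solution([1, 2, 2, 3], 2) keeps both 2s, whereas the OrderedDict version above returns each value at most once.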
So I just came across what seems to me like a strange Python feature and wanted some clarification about it.
The following array manipulation somewhat makes sense:
p = [1,2,3]
p[3:] = [4]
p = [1,2,3,4]
I imagine it is actually just appending this value to the end, correct?
Why can I do this, however?
p[20:22] = [5,6]
p = [1,2,3,4,5,6]
And even more so this:
p[20:100] = [7,8]
p = [1,2,3,4,5,6,7,8]
This just seems like wrong logic. It seems like this should throw an error!
Any explanation?
-Is it just a weird thing Python does?
-Is there a purpose to it?
-Or am I thinking about this the wrong way?
Part of question regarding out-of-range indices
Slice logic automatically clips the indices to the length of the sequence.
Allowing slice indices to extend past end points was done for convenience. It would be a pain to have to range check every expression and then adjust the limits manually, so Python does it for you.
Consider the use case of wanting to display no more than the first 50 characters of a text message.
The easy way (what Python does now):
preview = msg[:50]
Or the hard way (do the limit checks yourself):
n = len(msg)
preview = msg[:50] if n > 50 else msg
Manually implementing that logic for adjustment of end points would be easy to forget, would be easy to get wrong (updating the 50 in two places), would be wordy, and would be slow. Python moves that logic to its internals where it is succinct, automatic, fast, and correct. This is one of the reasons I love Python :-)
Part of question regarding assignments where the target length differs from the input length
The OP also wanted to know the rationale for allowing assignments such as p[20:100] = [7,8] where the assignment target has a different length (80) than the replacement data length (2).
It's easiest to see the motivation by an analogy with strings. Consider, "five little monkeys".replace("little", "humongous"). Note that the target "little" has only six letters and "humongous" has nine. We can do the same with lists:
>>> s = list("five little monkeys")
>>> i = s.index('l')
>>> n = len('little')
>>> s[i : i+n ] = list("humongous")
>>> ''.join(s)
'five humongous monkeys'
This all comes down to convenience.
Prior to the introduction of the copy() and clear() methods, these used to be popular idioms:
s[:] = [] # clear a list
t = u[:] # copy a list
Even now, we use this to update lists when filtering:
s[:] = [x for x in s if not math.isnan(x)] # filter-out NaN values
Hope these practical examples give a good perspective on why slicing works as it does.
The documentation has your answer:
s[i:j]: slice of s from i to j (note (4))
(4) The slice of s from i to j is defined as the sequence of items
with index k such that i <= k < j. If i or j is greater than
len(s), use len(s). If i is omitted or None, use 0. If j
is omitted or None, use len(s). If i is greater than or equal to
j, the slice is empty.
The documentation of IndexError confirms this behavior:
exception IndexError
Raised when a sequence subscript is out of range. (Slice indices are silently truncated to fall in the allowed range; if an index is
not an integer, TypeError is raised.)
Essentially, stuff like p[20:100] is being reduced to p[len(p):len(p)]. p[len(p):len(p)] is an empty slice at the end of the list, and assigning a list to it will modify the end of the list to contain said list. Thus, it works like appending/extending the original list.
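For instance:
>>> p = [1, 2, 3, 4]
>>> p[len(p):len(p)] = [5, 6]   # equivalent to p[20:100] = [5, 6] here
>>> p
[1, 2, 3, 4, 5, 6]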
This behavior is the same as what happens when you assign a list to an empty slice anywhere in the original list. For example:
In [1]: p = [1, 2, 3, 4]
In [2]: p[2:2] = [42, 42, 42]
In [3]: p
Out[3]: [1, 2, 42, 42, 42, 3, 4]
The following is a simplified example of my code.
>>> def action(num):
...     print "Number is", num
>>> items = [1, 3, 6]
>>> for i in [j for j in items if j > 4]:
...     action(i)
Number is 6
My question is the following: is it bad practice (for reasons such as code clarity) to simply replace the for loop with a comprehension which will still call the action function? That is:
>>> (action(j) for j in items if j > 2)
Number is 6
This shouldn't use a generator or comprehension at all.
def action(num):
    print "Number is", num

items = [1, 3, 6]

for j in items:
    if j > 4:
        action(j)
Generators evaluate lazily. The expression (action(j) for j in items if j > 2) merely returns a generator object to the caller; nothing happens inside it unless you explicitly exhaust it. List comprehensions evaluate eagerly, but in this particular case you are left with a list that serves no purpose. Just use a regular loop.
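A quick way to see the laziness, using the items and action from the question:
>>> gen = (action(j) for j in items if j > 2)   # nothing is printed yet
>>> list(gen)                                   # consuming it runs the side effects
Number is 3
Number is 6
[None, None]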
This is bad practice. Firstly, your code fragment does not produce the desired output. You would instead get something like: <generator object <genexpr> at 0x03D826F0>.
Secondly, a list comprehension is for creating sequences, and generators are for creating streams of objects. Typically, they do not have side effects. Your action function is a prime example of a side effect -- it prints its input and returns nothing. Rather, for each item it generates, a generator should take an input and compute some output, e.g.
doubled_odds = [x*2 for x in range(10) if x % 2 != 0]
By using a generator you are obfuscating the purpose of your code, which is to perform a side effect (printing something), not to create a stream of objects.
By contrast, just using a for loop makes the code slightly longer (basically just more whitespace), but you can immediately see that the purpose is to apply a function to a selection of items (as opposed to creating a new stream/list of items).
for i in items:
    if i > 4:
        action(i)
Remember that generators are still looping constructs and that the underlying bytecode is more or less the same (if anything, generators are marginally less efficient), and you lose clarity. Generators and list comprehensions are great, but this is not the right situation for them.
While I personally favour Tigerhawk's solution, there might be a middle ground between his and willywonkadailyblah's solution (now deleted).
One of willywonkadailyblah's points was:
Why create a new list instead of just using the old one? You already have the condition to filter out the correct elements, so why put them away in memory and come back for them?
One way to avoid this problem is to make the filtering lazy, i.e. have it happen only as the for loop iterates, by putting the filtering in a generator expression rather than a list comprehension:
for i in (j for j in items if j > 4):
action(i)
Output
Number is 6
In all honesty, I think Tigerhawk's solution is the best for this, though. This is just one possible alternative.
The reason that I proposed this is that it reminds me a lot of LINQ queries in C#, where you define a lazy way to extract, filter and project elements from a sequence in one statement (the LINQ expression) and can then use a separate foreach loop with that query to perform some action on each element.
I have three lists: old, new and ignore. old and new are lists of strings. ignore is a list of indices that should be ignored if they do not match. The objective is to create a list of indices which are different and not ignored.
old and new may contain different numbers of elements. If there is a difference in size between old and new, the extra indices should be marked as not matching (unless ignored).
My current function is as follows:
import itertools

def CompareFields(old, new, ignore):
    if old == None:
        if new == None:
            return []
        else:
            return xrange(len(new))
    elif new == None:
        return xrange(len(old))
    oldPadded = itertools.chain(old, itertools.repeat(None))
    newPadded = itertools.chain(new, itertools.repeat(None))
    comparisonIterator = itertools.izip(xrange(max(len(old), len(new))),
                                        oldPadded, newPadded)
    changedItems = [i for i, lhs, rhs in comparisonIterator
                    if lhs != rhs and i not in ignore]
    return changedItems
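For example, a hypothetical call (the data here is made up purely to show the shape of the result):
>>> old = ['a', 'b', 'c', 'd']
>>> new = ['a', 'x', 'c', 'e', 'f']
>>> CompareFields(old, new, [3])
[1, 4]
Index 1 differs, index 3 differs but is ignored, and index 4 exists only in new, so it is reported.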
The timings of the various options I have tried give the following timings for 100,000 runs:
[4, 9]
CompareFields: 6.083546
set([9, 4])
Set based: 12.594869
[4, 9]
Function using yield: 13.063725
[4, 9]
Use a (precomputed) ignore bitmap: 7.009405
[4, 9]
Use a precomputed ignore bitmap and give a limit to itertools.repeat(): 8.297951
[4, 9]
Use precomputed ignore bitmap, limit padding and itertools.starmap()/operator.ne(): 11.868687
[4, 9]
Naive implementation: 7.438201
The latest version of Python I have is 2.6 (it is RHEL 5.5). I am currently compiling PyPy to give that a try.
So does anyone have any ideas how to get this function to run faster? Is it worth using Cython?
If I can't get it to run faster I will look at rewriting the whole tool in C++ or Java.
Edit:
Ok I timed the various answers:
[4, 9]
CompareFields: 5.808944
[4, 9]
agf's itertools answer: 4.550836
set([9, 4])
agf's set based answer, but replaced list expression with a set to avoid duplicates: 9.149389
agf's set based answer, as described in answer: about 8 seconds
lucho's set based answer: 10.682579
So itertools seems to be the way to go for now. It is surprising that the set based solution performed so poorly. Although I am not surprised that using a lambda was slower.
Edit: Java benchmark
Naive implementation, with way too many if statements: 128ms
For both of these solutions, you should do:
ignore = set(ignore)
which will give you constant (average) time for the in membership tests.
I think this is the itertools / zip based method you were looking for:
[i for i, (o, n) in enumerate(izip_longest(old, new))
 if o != n and i not in ignore]
No need for chain / repeat to pad -- that's what izip_longest is for. enumerate is also more appropriate than xrange.
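For instance, with some made-up data and a small ignore set:
>>> from itertools import izip_longest
>>> old, new, ignore = ['a', 'b'], ['a', 'x', 'z'], {2}
>>> [i for i, (o, n) in enumerate(izip_longest(old, new))
...  if o != n and i not in ignore]
[1]
The unmatched tail of the longer list is padded with None, so extra indices show up as differences automatically (unless ignored).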
And a more Pythonic (and possibly faster) version of the filter / set difference method in Lucho's answer:
[i for i, v in set(enumerate(new)).symmetric_difference(enumerate(old))
 if i not in ignore]
List comprehensions are preferred over filter or map on a lambda, and there is no need to convert both lists to sets if you use the symmetric_difference method instead of the ^ / xor operator.
Make ignore a set as well:
filter(lambda x: x[0] not in ignore, set(enumerate(new)) ^ set(enumerate(old)))
I bet it will be faster than your overcomplicated, non-Pythonic attempts (it would be cool if you could measure it - I am curious).
List comprehensions are definitely a Pythonic thing to do; I would do something similar to this:
def findDiff(old, new, ignore):
    ignore = set(ignore)
    diff = []
    (small, big) = (old, new) if len(old) < len(new) else (new, old)
    diff.extend([i for i in xrange(0, len(small)) if i not in ignore and old[i] != new[i]])
    diff.extend([i for i in xrange(len(small), len(big)) if i not in ignore])
    return diff
This should be a fast function. It assumes that all indices above the length of the smaller list count as different, and they are still checked against ignore.
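A hypothetical call, just to illustrate:
>>> findDiff(['a', 'b', 'c'], ['a', 'x', 'c', 'd', 'e'], [4])
[1, 3]
Index 1 differs, index 3 lies beyond the shorter list, and index 4 is suppressed by ignore.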
Hey,
I'm trying to learn a bit of Python, so I decided to follow Google's tutorial. Anyway, I had a question regarding one of their solutions to an exercise.
I did it this way:
# E. Given two lists sorted in increasing order, create and return a merged
# list of all the elements in sorted order. You may modify the passed in lists.
# Ideally, the solution should work in "linear" time, making a single
# pass of both lists.
def linear_merge(list1, list2):
    # +++your code here+++
    return sorted(list1 + list2)
However, they did it in a more complicated way. So is Google's solution quicker? I ask because I noticed in the comments that the solution should work in "linear" time, which mine probably doesn't.
This is their solution
def linear_merge(list1, list2):
    # +++your code here+++
    # LAB(begin solution)
    result = []
    # Look at the two lists so long as both are non-empty.
    # Take whichever element [0] is smaller.
    while len(list1) and len(list2):
        if list1[0] < list2[0]:
            result.append(list1.pop(0))
        else:
            result.append(list2.pop(0))
    # Now tack on what's left
    result.extend(list1)
    result.extend(list2)
    return result
Could this be another solution?
def linear_merge(list1, list2):
    tmp = []
    # Repeatedly pop the larger of the two tail elements, so tmp is built
    # in descending order.
    while len(list1) and len(list2):
        #print list1[-1], list2[-1]
        if list1[-1] > list2[-1]:
            tmp.append(list1.pop())
        else:
            tmp.append(list2.pop())
        #print "tmp = ", tmp
        #print list1, list2
    # The leftovers of the non-empty list are in ascending order, so reverse
    # them before tacking them onto the descending tmp, then reverse everything.
    tmp = tmp + list1[::-1]
    tmp = tmp + list2[::-1]
    tmp.reverse()
    return tmp
Yours is not linear, but that doesn't mean it's slower. Algorithmic complexity ("big-oh notation") is often only a rough guide and always only tells one part of the story.
However, theirs isn't linear either, though it may appear to be at first blush. Popping from a list requires moving all later items, so popping from the front requires moving all remaining elements.
It is a good exercise to think about how to make this O(n). The below is in the same spirit as the given solution, but avoids its pitfalls while generalizing to more than 2 lists for the sake of exercise. For exactly 2 lists, you could remove the heap handling and simply test which next item is smaller.
import heapq
def iter_linear_merge(*args):
    """Yield non-decreasing items from the given sorted iterables."""
    # Technically, [1, 1, 2, 2] isn't an "increasing" sequence,
    # but it is non-decreasing.
    nexts = []
    for x in args:
        x = iter(x)
        for n in x:
            heapq.heappush(nexts, (n, x))
            break
    while len(nexts) >= 2:
        n, x = heapq.heappop(nexts)
        yield n
        for n in x:
            heapq.heappush(nexts, (n, x))
            break
    if nexts:  # Degenerate case of the heap, not strictly required.
        n, x = nexts[0]
        yield n
        for n in x:
            yield n
Instead of the last if-for, the while loop condition could be changed to just "nexts", but it is probably worthwhile to specially handle the last remaining iterator.
If you want to strictly return a list instead of an iterator:
def linear_merge(*args):
    return list(iter_linear_merge(*args))
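For example:
>>> list(iter_linear_merge([1, 3, 5], [2, 4], [0, 6]))
[0, 1, 2, 3, 4, 5, 6]
>>> linear_merge([1, 2, 3], [4, 5, 6])
[1, 2, 3, 4, 5, 6]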
With mostly-sorted data, timsort approaches linear. Also, your code doesn't have to screw around with the lists themselves. Therefore, your code is possibly just a bit faster.
But that's what timing is for, innit?
I think the issue here is that the tutorial is illustrating how to implement a well-known algorithm called 'merge' in Python. The tutorial is not expecting you to actually use a library sorting function in the solution.
sorted() is O(n log n) in the worst case, so your solution cannot be linear in the worst case.
It is important to understand how merge() works because it is useful in many other algorithms. It exploits the fact that the input lists are each already sorted, moving through both lists sequentially and selecting the smaller of the current elements; the remaining items are appended at the end.
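Here is a minimal index-based sketch of that classic merge (my own illustration, not the tutorial's code); because nothing is popped from the front of either list, it runs in linear time:

def merge(list1, list2):
    result = []
    i = j = 0
    # Walk both sorted lists, always taking the smaller current element.
    while i < len(list1) and j < len(list2):
        if list1[i] < list2[j]:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    result.extend(list1[i:])
    result.extend(list2[j:])
    return result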
The question isn't which is "quicker" for a given input case but which algorithm has the lower complexity.
There are hybrid variations of merge-sort which fall back on another sorting algorithm once the input list size drops below a certain threshold.