Algorithm to find the least difference between lists - python

I have been trying to understand the algorithm used here to compare two lists, implemented in this commit. The intention, as I understand it, is to find the fewest changes needed to create dst from src. These changes are later listed as a sequence of patch commands. I am not a Python developer, and I learned generators to understand the flow and how the recursion is done, but now I can't make much sense of the output generated by the _split_by_common_seq method. I fed it a few different lists, and the output is shown below. Can you please help me understand why the output is what it is in these cases?
In the reference case:
src [0, 1, 2, 3]
dst [1, 2, 4, 5]
[[(0, 1), None], [(3, 4), (2, 4)]]
I cannot see how it is related to the picture in the doc. Why (3, 4) and (2, 4) on the right? Is it a standard algorithm?
Test cases:
src [1, 2, 3]
dst [1, 2, 3, 4, 5, 6, 7, 8]
[[None, None], [None, (3, 8)]]
src [1, 2, 3, 4, 5]
dst [1, 2, 3, 4, 5, 6, 7, 8]
[[None, None], [None, (5, 8)]]
src [4, 5]
dst [1, 2, 3, 4, 5, 6, 7, 8]
[[None, (0, 3)], [None, (5, 8)]]
src [0, 1, 2, 3]
dst [1, 2, 4, 5]
[[(0, 1), None], [(3, 4), (2, 4)]]
src [0, 1, 2, 3]
dst [1, 2, 3, 4, 5]
[[(0, 1), None], [None, (3, 5)]]
src [0, 1, 3]
dst [1, 2, 4, 5]
[[(0, 1), None], [(2, 3), (1, 4)]]
For future reference, here's the code (taken from the aforementioned repository):
import itertools


def _longest_common_subseq(src, dst):
    """Returns pair of ranges of longest common subsequence for the `src`
    and `dst` lists.

    >>> src = [1, 2, 3, 4]
    >>> dst = [0, 1, 2, 3, 5]
    >>> # The longest common subsequence for these lists is [1, 2, 3]
    ... # which is located at (0, 3) index range for src list and (1, 4) for
    ... # dst one. Tuple of these ranges we should get back.
    ... assert ((0, 3), (1, 4)) == _longest_common_subseq(src, dst)
    """
    lsrc, ldst = len(src), len(dst)
    drange = list(range(ldst))
    matrix = [[0] * ldst for _ in range(lsrc)]
    z = 0  # length of the longest subsequence
    range_src, range_dst = None, None
    for i, j in itertools.product(range(lsrc), drange):
        if src[i] == dst[j]:
            if i == 0 or j == 0:
                matrix[i][j] = 1
            else:
                matrix[i][j] = matrix[i-1][j-1] + 1
            if matrix[i][j] > z:
                z = matrix[i][j]
            if matrix[i][j] == z:
                range_src = (i-z+1, i+1)
                range_dst = (j-z+1, j+1)
        else:
            matrix[i][j] = 0
    return range_src, range_dst


def split_by_common_seq(src, dst, bx=(0, -1), by=(0, -1)):
    """Recursively splits the `dst` list onto two parts: left and right.
    The left part contains differences on left from common subsequence,
    same as the right part by for other side.

    To easily understand the process let's take two lists: [0, 1, 2, 3] as
    `src` and [1, 2, 4, 5] for `dst`. If we've tried to generate the binary tree
    where nodes are common subsequence for both lists, leaves on the left
    side are subsequence for `src` list and leaves on the right one for `dst`,
    our tree would looks like::

          [1, 2]
         /      \
      [0]        []
                /  \
             [3]    [4, 5]

    This function generate the similar structure as flat tree, but without
    nodes with common subsequences - since we're don't need them - only with
    left and right leaves::

          []
         /  \
      [0]    []
            /  \
         [3]    [4, 5]

    The `bx` is the absolute range for currently processed subsequence of
    `src` list. The `by` means the same, but for the `dst` list.
    """
    # Prevent useless comparisons in future
    bx = bx if bx[0] != bx[1] else None
    by = by if by[0] != by[1] else None

    if not src:
        return [None, by]
    elif not dst:
        return [bx, None]

    # note that these ranges are relative for processed sublists
    x, y = _longest_common_subseq(src, dst)

    if x is None or y is None:  # no more any common subsequence
        return [bx, by]

    return [split_by_common_seq(src[:x[0]], dst[:y[0]],
                                (bx[0], bx[0] + x[0]),
                                (by[0], by[0] + y[0])),
            split_by_common_seq(src[x[1]:], dst[y[1]:],
                                (bx[0] + x[1], bx[0] + len(src)),
                                (bx[0] + y[1], bx[0] + len(dst)))]

It is a cute algorithm, but I don't think it's a "known" one. It's a clever way of comparing lists, and probably not the first time that someone thought of it, but I had never seen it before.
Basically, the output is telling you the ranges that look different in src and dst.
The function always returns a list with 2 lists. The first list refers to the elements in src and dst that are on the left side of the longest common subsequence between src and dst; the second refers to the elements that are on the right side of the longest common subsequence. Each of these lists holds a pair of tuples. Tuples represent a range in the list - (x, y) denotes the elements you would get if you performed lst[x:y]. From this pair of tuples, the first tuple is the range from src, the second tuple is the range from dst.
At each step, the algorithm computes the ranges of src and dst that are to the left of the longest common subsequence and to the right of the longest common subsequence between src and dst.
Let's look at your first example to clear things up:
src [0, 1, 2, 3]
dst [1, 2, 4, 5]
The longest common subsequence between src and dst is [1, 2]. In src, the range (0, 1) defines the elements that are immediately to the left of [1, 2]; in dst, that range is empty, because there is nothing before [1, 2]. So, the first list will be [(0, 1), None].
To the right of [1, 2], in src, we have the elements in the range (3, 4), and in dst we have 4 and 5, which are represented by the range (2, 4). So the second list will be [(3, 4), (2, 4)].
And there you go:
[[(0, 1), None], [(3, 4), (2, 4)]]
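To make the range notation concrete, here is a quick check in the REPL (assuming the code above has been loaded); slicing src and dst with the returned ranges recovers exactly the differing elements:
>>> src = [0, 1, 2, 3]
>>> dst = [1, 2, 4, 5]
>>> split_by_common_seq(src, dst)
[[(0, 1), None], [(3, 4), (2, 4)]]
>>> src[0:1], src[3:4], dst[2:4]
([0], [3], [4, 5])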
How does this relate to the tree in the comments?
The leaves in the tree use a different notation: instead of a tuple describing a range, the actual elements in that range are shown. In fact, [0] is the only element in the range (0, 1) of src. The same applies to the rest.
Once you get this, the other examples you posted should be pretty easy to follow. But note that the output can become more complex if there is more than one common subsequence: the algorithm finds the common subsequences in nonincreasing order of length; since each invocation returns a list with 2 elements, this means that you will get nested lists in cases like these. Consider:
src = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
dst = [46, 1, 2, 3, 4, 5, 99, 98, 97, 5, 6, 7, 30, 31, 32, 11, 12, 956]
This outputs:
[[(0, 1), (0, 1)], [[[None, (6, 10)], [(8, 11), (12, 15)]], [(13, 14), (17, 18)]]]
The second list is nested because there was more than one recursion level (your previous examples immediately fell on a base case).
The explanation shown before applies recursively to each list: the second list in [[(0, 1), (0, 1)], [[[None, (6, 10)], [(8, 11), (12, 15)]], [(13, 14), (17, 18)]]] shows the differences in the lists to the right of the longest common subsequence.
The longest common subsequence is [1, 2, 3, 4, 5]. To the left of [1, 2, 3, 4, 5], both lists are different in the first element (the ranges are equal and easy to check).
Now, the procedure applies recursively. For the right side, there is a new recursive call, and src and dst become:
src = [6, 7, 8, 9, 10, 11, 12, 13]
dst = [99, 98, 97, 5, 6, 7, 30, 31, 32, 11, 12, 956]
# LCS = [6, 7]; Call on the left
src = []
dst = [99, 98, 97, 5]
# LCS = [6, 7]; Call on the right
src = [8, 9, 10, 11, 12, 13]
dst = [30, 31, 32, 11, 12, 956]
# LCS = [11, 12]; Call on the left
src = [8, 9, 10]
dst = [30, 31, 32]
# LCS = [11, 12]; Call on the right
src = [13]
dst = [956]
The longest common subsequence is [6, 7]. Then you will have another recursive call on the left, for src = [] and dst = [99, 98, 97, 5]; now there is no common subsequence left, so the recursion on this side stops (just follow the trace above).
Each nested list recursively represents the differences for the sub-lists with which the procedure was invoked, but note that the indices always refer to positions in the original lists (due to the way the arguments bx and by are passed - they accumulate the offsets from the beginning).
The key point here is that the nesting depth of the output grows linearly with the depth of the recursion, and in fact, you can tell how many common subsequences exist in the original lists just by looking at the nesting level.
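If you want to map the nested ranges back to concrete elements, a small recursive walker does the trick. This is a sketch of my own (print_diffs is not part of the original code); it assumes the structure described above, where a leaf is a [src_range, dst_range] pair whose entries are tuples or None, and internal nodes contain lists:
def print_diffs(node, src, dst):
    # Leaves hold only tuples or None; internal nodes hold lists.
    a, b = node
    if isinstance(a, list) or isinstance(b, list):
        print_diffs(a, src, dst)
        print_diffs(b, src, dst)
        return
    if a is not None:
        print('src%s -> %s' % (a, src[a[0]:a[1]]))
    if b is not None:
        print('dst%s -> %s' % (b, dst[b[0]:b[1]]))
For the reference case it prints:
>>> print_diffs(split_by_common_seq([0, 1, 2, 3], [1, 2, 4, 5]), [0, 1, 2, 3], [1, 2, 4, 5])
src(0, 1) -> [0]
src(3, 4) -> [3]
dst(2, 4) -> [4, 5]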

Related

Detect ranges in Python

I'm trying to solve this exercise in my coursework:
Create a function named detect_ranges that gets a list of integers as a parameter.
The function should then sort this list, and transform the list into another list where pairs are used for all the detected intervals.
So 3,4,5,6 is replaced by the pair (3,7).
Numbers that are not part of any interval result just single numbers.
The resulting list consists of these numbers and pairs, separated by commas. An example of how this function works:
print(detect_ranges([2,5,4,8,12,6,7,10,13]))
[2,(4,9),10,(12,14)]
I couldn't fully comprehend the exercise description and can't think of how to detect the ranges. Do you guys have any hints or tips?
Another way of doing this. Although this method will not be as efficient as the other one, since it's an exercise, it is easier to follow.
I have used Python's zip function for some of the steps explained below; you can check its documentation to learn more about it.
1. First sort the list data, so you get: [2, 4, 5, 6, 7, 8, 10, 12, 13]
2. Then find the differences between consecutive values in the list, like (4-2), (5-4), ... If the difference is <= 1, the pair is part of a range:
(Also, insert a 0 at the front, just to account for the 1st element and make the obtained list's length equal to the original list's)
>>> diff = [j-i for i, j in zip(lst[:-1], lst[1:])]
>>> diff.insert(0, 0)
>>> diff
[0, 2, 1, 1, 1, 1, 2, 2, 1]
3. Now get positions in above list where difference is >= 2. This is to detect the ranges:
(Again, insert a 0 in the front, just to account for the 1st element, and make sure it gets picked in range detection)
>>> ind = [i for i,v in enumerate(diff) if v >= 2]
>>> ind.insert(0, 0)
>>> ind
[0, 1, 6, 7]
So the groups span indices 0 to 1, 1 to 6, 6 to 7, and 7 to the end of your sorted list.
4. Group the elements together that will form ranges, using the ind list obtained:
>>> groups = [lst[i:j] for i,j in zip(ind, ind[1:]+[None])]
>>> groups
[[2], [4, 5, 6, 7, 8], [10], [12, 13]]
5. Finally obtain your desired ranges:
>>> ranges = [(i[0],i[-1]+1) if len(i)>1 else i[0] for i in groups]
>>> ranges
[2, (4, 9), 10, (12, 14)]
Putting it all in a function detect_ranges:
def detect_ranges(lst):
    lst = sorted(lst)
    diff = [j-i for i, j in zip(lst[:-1], lst[1:])]
    diff.insert(0, 0)
    ind = [i for i, v in enumerate(diff) if v >= 2]
    ind.insert(0, 0)
    groups = [lst[i:j] for i, j in zip(ind, ind[1:]+[None])]
    ranges = [(i[0], i[-1]+1) if len(i) > 1 else i[0] for i in groups]
    return ranges
Examples:
>>> lst = [2,6,1,9,3,7,12,45,46,13,90,14,92]
>>> detect_ranges(lst)
[(1, 4), (6, 8), 9, (12, 15), (45, 47), 90, 92]
>>> lst = [12,43,43,11,4,3,6,6,9,9,10,78,32,23,22,98]
>>> detect_ranges(lst)
[(3, 5), (6, 7), (9, 13), (22, 24), 32, (43, 44), 78, 98]
Iterate through the elements and save the start of each interval.
def detect_ranges(xs):
    it = iter(xs)
    try:
        start = next(it)
    except StopIteration:
        return
    prev = start
    for x in it:
        if prev + 1 != x:
            yield start, prev + 1
            start = x
        prev = x
    yield start, prev + 1
Usage:
>>> xs = [2, 4, 5, 6, 7, 8, 10, 12, 13]
>>> ranges = list(detect_ranges(xs))
>>> ranges
[(2, 3), (4, 9), (10, 11), (12, 14)]
If you want to reduce single item intervals like (2, 3) to 2, you can do:
>>> ranges = [a if a + 1 == b else (a, b) for a, b in ranges]
>>> ranges
[2, (4, 9), 10, (12, 14)]

Creating a nested list referencing specific ranges

I was challenged by a friend to make a simple program that asks a user to input a maximum value, and then a sample size (n). It then just uses randint to create a histogram in shell using ascii characters.
I can establish the class width and boundaries very easily. Where I'm having trouble is in understanding and implementing some sort of algorithm that will append all numbers that fall within a specific class to the histogram list to be printed. For example, if I have:
sample = [5, 1, 3, 9, 7, 13, 12, 5]
class_boundaries = [(1, 4), (4, 7), (7, 10), (10, 14)]
histogram = []
I just need to make a function that appends the sample values to the position they belong to, with reference to the class boundaries. So for example, histogram[0] should return [1, 3]. I've been doing my best to try different solutions and to understand how for-loop algorithms and list comprehensions work, but a practical explanation of my problem would be really helpful in my quest to better understand how to program. Thank you in advance!
sample = [5, 1, 3, 9, 7, 13, 12, 5]
class_boundaries = [(1, 4), (4, 7), (7, 10), (10, 14)]
classified = [[X for X in sample if LO <= X <= HI] for LO,HI in class_boundaries]
counts = [sum(LO <= X <= HI for X in sample) for LO,HI in class_boundaries]
Result: classified = [[1, 3], [5, 7, 5], [9, 7], [13, 12]], counts = [2, 3, 2, 2]
The computation of the counts doesn't need classified, so if that's all you need, skip the classified step.
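Since the original goal was an ASCII histogram in the shell, here is a hedged sketch of how the counts could be printed; the bar character and label format are my own choices, not part of the answer above:
sample = [5, 1, 3, 9, 7, 13, 12, 5]
class_boundaries = [(1, 4), (4, 7), (7, 10), (10, 14)]
counts = [sum(LO <= X <= HI for X in sample) for LO, HI in class_boundaries]
for (LO, HI), count in zip(class_boundaries, counts):
    print('%2d-%2d | %s' % (LO, HI, '*' * count))
This prints:
 1- 4 | **
 4- 7 | ***
 7-10 | **
10-14 | **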

Creating two 1D lists from python based on function returning two entries

I have a function f(x,y) which returns two values a,b.
I want to construct a 2D List from returned values of a,b from f being called on x in a List of values and y in an equally long list of values.
Is there an easy way to do this?
I tried this and it did not work.
aList, bList = [f(x[i],y[i],1) for i in range(T)]
You can use the *-operator to perform argument unpacking of the list while calling zip() to compose the two desired lists.
input_data = [(0, 0), (1, 1), (2, 2)]

def f(x, y):
    return x+1, y-1

results = [f(i[0], i[1]) for i in input_data]
print(f"results: {results}")

a_list, b_list = zip(*results)
print(f"a_list: {a_list}\nb_list: {b_list}")

# This is equivalent to what the *-operator does above, in that
# it is unpacking the list of tuples into a series of arguments
# to zip().
a_list2, b_list2 = zip(results[0], results[1], results[2])
print(f"a_list2: {a_list2}\nb_list2: {b_list2}")
Output:
$ python3 solution.py
results: [(1, -1), (2, 0), (3, 1)]
a_list: (1, 2, 3)
b_list: (-1, 0, 1)
a_list2: (1, 2, 3)
b_list2: (-1, 0, 1)
Using the *-operator for unpacking is convenient because it can seamlessly scale with the number of arguments to zip().
import random

def f(num_results):
    return tuple(random.randint(1, 10) for n in range(num_results))

for num_results in range(4, 7):
    print(f"\nHandling {num_results} values per tuple...")
    results = [f(num_results) for _ in range(3)]
    print(f"results: {results}")
    # Note the use of unpacking in the assignment as well to
    # capture a variable number of lists.
    *x_lists, = zip(*results)
    for *x_list, in x_lists:
        print(x_list)
Output:
$ python3 solution2.py
Handling 4 values per tuple...
results: [(4, 5, 7, 1), (8, 9, 9, 3), (6, 3, 9, 4)]
[4, 8, 6]
[5, 9, 3]
[7, 9, 9]
[1, 3, 4]
Handling 5 values per tuple...
results: [(6, 3, 3, 9, 9), (10, 1, 1, 5, 4), (3, 10, 8, 3, 2)]
[6, 10, 3]
[3, 1, 10]
[3, 1, 8]
[9, 5, 3]
[9, 4, 2]
Handling 6 values per tuple...
results: [(5, 5, 10, 8, 1, 6), (7, 5, 8, 7, 9, 1), (5, 5, 1, 1, 10, 5)]
[5, 7, 5]
[5, 5, 5]
[10, 8, 1]
[8, 7, 1]
[1, 9, 10]
[6, 1, 5]
You always need to think about what you HAVE vs what you WANT. The list comprehension returns a single list containing 2-tuples. You need to split that list.
result = [f(x[i],y[i],1) for i in range(T)]
aList = [k[0] for k in result]
bList = [k[1] for k in result]
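Equivalently, as the zip answer above shows, you can transpose result in a single step:
aList, bList = zip(*result)  # these are tuples; wrap each in list() if you need lists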

Efficient combinations with replacement for multiple iterables, or order-independent product

I'm trying to find a performant solution in Python that works like so:
>>> func([1,2,3], [1,2])
[(1,1), (1,2), (1,3), (2,2), (2,3)]
This is similar to itertools.combinations_with_replacement, except that it can take multiple iterables. It's also similar to itertools.product, except that it omits order-independent duplicate results.
All of the inputs will be prefixes of the same series (i.e. they all start with the same element and follow the same pattern, but might have different lengths).
The function must be able to take any number of iterables as input.
Given a set of lists A, B, C, ..., here is a sketch of an algorithm that generates those results.
assert len(A) <= len(B) <= len(C) <= ...

for i in 0..len(A)
    for j in i..len(B)
        for k in j..len(C)
            .
            .
            .
            yield A[i], B[j], C[k], ...
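(For concreteness, here is a hedged, literal Python rendering of this sketch for exactly two lists, reproducing the func example above; it is meant only to pin down the target output, not as a performant general solution:)
def func(a, b):
    # order the inputs so that len(a) <= len(b), per the assert in the sketch
    a, b = sorted((a, b), key=len)
    # starting j at i skips the order-independent duplicates
    # that itertools.product would generate
    return [(a[i], b[j]) for i in range(len(a)) for j in range(i, len(b))]

>>> func([1, 2, 3], [1, 2])
[(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]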
Things I can't do
Use itertools.product and filter the results. This has to be performant.
Use recursion. The function overhead would make it slower than using itertools.product and filtering for a reasonable number of iterables.
I suspect there's a way to do this with itertools, but I have no idea what it is.
EDIT: I'm looking for the solution that takes the least time.
EDIT 2: There seems to be some confusion about what I'm trying to optimize. I'll illustrate with an example.
>>> len(list(itertools.product( *[range(8)] * 5 )))
32768
>>> len(list(itertools.combinations_with_replacement(range(8), 5)))
792
The first line gives the number of order-dependent possibilities for rolling 5 8-sided dice. The second gives the number of order-independent possibilities. Regardless of how performant itertools.product is, it'll take 2 orders of magnitude more iterations to get a result than itertools.combinations_with_replacement. I'm trying to find a way to do something similar to itertools.combinations_with_replacement, but with multiple iterables, that minimizes the number of iterations or the running time. (product generates O(M^N) tuples whereas combinations_with_replacement generates O(C(M+N-1, N)), where M is the number of sides on the die and N is the number of dice; for M = 8 and N = 5 that is 32768 versus 792.)
This solution uses neither recursion nor filtering. It produces only ascending sequences of indices, so it is usable only for prefixes of the same collection. It also uses only indices for element identification, so it does not require the elements of the series to be comparable or even hashable.
def prefixCombinations(coll, prefixes):
    "produces combinations of elements of the same collection prefixes"
    prefixes = sorted(prefixes)  # does not affect the result, since combinations are unordered
    n = len(prefixes)
    indices = [0] * n
    while True:
        yield tuple(coll[indices[i]] for i in range(n))
        # search backwards for a non-maximum index
        for i in range(n-1, -1, -1):
            if indices[i] < prefixes[i] - 1:
                break
        else:  # all indices hit their maximum - leave
            break
        level = indices[i] + 1
        for i in range(i, n):
            indices[i] = level
Examples:
>>> list(prefixCombinations([1,2,3,4,5], (3,2)))
[(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]
>>> list(prefixCombinations([1,2,3,4,5], (3,2,5)))
[(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 1, 4), (1, 1, 5), (1, 2, 2), (1, 2, 3), (1, 2, 4), (1, 2, 5), (1, 3, 3), (1, 3, 4), (1, 3, 5), (2, 2, 2), (2, 2, 3), (2, 2, 4), (2, 2, 5), (2, 3, 3), (2, 3, 4), (2, 3, 5)]
>>> from itertools import combinations_with_replacement
>>> tuple(prefixCombinations(range(10),[10]*4)) == tuple(combinations_with_replacement(range(10),4))
True
Since this is a generator, it doesn't effectively change the performance (it just wraps an O(n) filter around itertools.product):
import itertools

def product(*args):
    for a, b in itertools.product(*args):
        if a >= b:
            yield b, a

print list(product([1,2,3], [1,2]))
Output:
[(1, 1), (1, 2), (2, 2), (1, 3), (2, 3)]
Or even:
product = lambda a, b: ((y, x) for x in a for y in b if x >= y)
Here is an implementation.
The idea is to use sorted containers to impose a canonical order and avoid duplicates that way. Duplicates are never kept at any step, so there is no need for filtering afterwards.
It relies on the "sortedcontainers" library, which provides fast (as fast as a C implementation) sorted containers. [I'm not affiliated with this library in any manner]
from sortedcontainers import SortedList as SList
# see http://www.grantjenks.com/docs/sortedcontainers/

def order_independant_combination(*args):
    filtered = 0
    previous = set()
    current = set()
    for iterable in args:
        if not previous:
            for elem in iterable:
                current.add(tuple([elem]))
        else:
            for elem in iterable:
                for combination in previous:
                    newCombination = SList(combination)
                    newCombination.add(elem)
                    newCombination = tuple(newCombination)
                    if newCombination not in current:
                        current.add(newCombination)
                    else:
                        filtered += 1
        previous = current
        current = set()
    if filtered != 0:
        print("{0} duplicates have been filtered during generation process".format(filtered))
    return list(SList(previous))

if __name__ == "__main__":
    result = order_independant_combination(*[range(8)] * 5)
    print("Generated a result of length {0} that is {1}".format(len(result), result))
Execution on the question's input, order_independant_combination([1, 2, 3], [1, 2]), gives:
[(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]
You can test it with more iterables as parameters; it works.
Hope it at least helps you, if not solves your problem.
Vaisse Arthur.
EDIT: to answer the comment - this is not a good analysis. Filtering duplicates during generation is far more effective than using itertools.product and then filtering the duplicate results. In fact, eliminating a duplicate result at one step avoids generating duplicate solutions in all the following steps.
Executing this:
if __name__ == "__main__":
result = order_independant_combination([1,2,3],[1,2],[1,2],[1,2])
print("Generated a result of length {0} that is {1}".format(len(result), result))
I got the following result:
9 duplicates have been filtered during generation process
Generated a result of length 9 that is [(1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 1, 3), (1, 1, 2, 2), (1, 1, 2, 3), (1, 2, 2, 2), (1, 2, 2, 3), (2, 2, 2, 2), (2, 2, 2, 3)]
While using itertools I got this :
>>> import itertools
>>> c = list(itertools.product([1,2,3],[1,2],[1,2],[1,2]))
>>> c
[(1, 1, 1, 1), (1, 1, 1, 2), (1, 1, 2, 1), (1, 1, 2, 2), (1, 2, 1, 1), (1, 2, 1, 2), (1, 2, 2, 1), (1, 2, 2, 2), (2, 1, 1, 1), (2, 1, 1, 2), (2, 1, 2, 1), (2, 1, 2, 2), (2, 2, 1, 1), (2, 2, 1, 2), (2, 2, 2, 1), (2, 2, 2, 2), (3, 1, 1, 1), (3, 1, 1, 2), (3, 1, 2, 1), (3, 1, 2, 2), (3, 2, 1, 1), (3, 2, 1, 2), (3, 2, 2, 1), (3, 2, 2, 2)]
>>> len(c)
24
A simple calculation gives this:
pruned generation: 9 results + 9 elements filtered -> 18 elements generated.
itertools: 24 elements generated.
And the more iterables you pass, and the longer they are, the more important the difference becomes.
Example :
result = order_independant_combination([1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5])
print("Generated a result of length {0} that is {1}".format(len(result), result))
Result:
155 duplicates have been filtered during generation process
Generated a result of length 70 ...
Itertools :
>>> len(list(itertools.product([1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5],[1,2,3,4,5])))
625
Difference of 400 elements.
EDIT 2: with *[range(8)] * 5 it gives "2674 duplicates have been filtered during generation process. Generated a result of length 792...".

Python Random List Comprehension

I have a list similar to:
[1 2 1 4 5 2 3 2 4 5 3 1 4 2]
I want to create a list of x random elements from this list where none of the chosen elements are the same. The difficult part is that I would like to do this by using list comprehension...
So possible results if x = 3 would be:
[1 2 3]
[2 4 5]
[3 1 4]
[4 5 1]
etc...
Thanks!
I should have specified that I cannot convert the list to a set. Sorry!
I need the randomly selected numbers to be weighted. So if 1 appears 4 times in the list and 3 appears 2 times in the list, then 1 is twice as likely to be selected...
Disclaimer: the "use a list comprehension" requirement is absurd.
Moreover, if you want to use the weights, there are many excellent approaches listed at Eli Bendersky's page on weighted random sampling.
The following is inefficient, doesn't scale, etc., etc.
That said, it has not one but two (TWO!) list comprehensions, returns a list, never duplicates elements, and respects the weights in a sense:
>>> s = [1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]
>>> [x for x in random.choice([p for c in itertools.combinations(s, 3) for p in itertools.permutations(c) if len(set(c)) == 3])]
[3, 1, 2]
>>> [x for x in random.choice([p for c in itertools.combinations(s, 3) for p in itertools.permutations(c) if len(set(c)) == 3])]
[5, 3, 4]
>>> [x for x in random.choice([p for c in itertools.combinations(s, 3) for p in itertools.permutations(c) if len(set(c)) == 3])]
[1, 5, 2]
.. or, as simplified by FMc:
>>> [x for x in random.choice([p for p in itertools.permutations(s, 3) if len(set(p)) == 3])]
[3, 5, 2]
(I'll leave the x for x in there, even though it hurts not to simply write list(random.choice(..)) or just leave it as a tuple..)
Generally, you don't want to do this sort of thing in a list comprehension -- it'll lead to much harder-to-read code. However, if you really must, we can write a completely horrible one-liner:
>>> values = [random.randint(0,10) for _ in xrange(12)]
>>> values
[1, 10, 6, 6, 3, 9, 0, 1, 8, 9, 1, 2]
>>> # This is the 1 liner -- The other line was just getting us a list to work with.
>>> [(lambda x=random.sample(values,3):any(values.remove(z) for z in x) or x)() for _ in xrange(4)]
[[6, 1, 8], [1, 6, 10], [1, 0, 2], [9, 3, 9]]
Please never use this code -- I only post it for fun/academic reasons.
Here's how it works:
I create a function inside the list comprehension with a default argument of 3 randomly selected elements from the input list. Inside the function, I remove the elements from values so that they aren't available to be picked again. Since list.remove returns None, I can use any(values.remove(z) for z in x) to remove the values and return False. Since any returns False, we hit the or clause, which just returns x (the default value with 3 randomly selected items) when the function is called. All that is left then is to call the function and let the magic happen.
The one catch here is that you need to make sure that the number of groups you request (here I chose 4) multiplied by the number of items per group (here I chose 3) is less than or equal to the number of values in your input list. It may seem obvious, but it's probably worth mentioning anyway...
Here's another version where I pull shuffle into the list comprehension:
>>> lst = [random.randint(0,10) for _ in xrange(12)]
>>> lst
[3, 5, 10, 9, 10, 1, 6, 10, 4, 3, 6, 5]
>>> [lst[i*3:i*3+3] for i in xrange(shuffle(lst) or 4)]
[[6, 10, 6], [3, 4, 10], [1, 3, 5], [9, 10, 5]]
This is significantly better than my first attempt, however, most people would still need to stop, scratch their head a bit before they figured out what this code was doing. I still assert that it would be much better to do this in multiple lines.
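For reference, here is a hedged multi-line rendering of that last one-liner - the same logic, just readable:
import random

lst = [3, 5, 10, 9, 10, 1, 6, 10, 4, 3, 6, 5]
random.shuffle(lst)                           # shuffle in place first
groups = [lst[i*3:i*3+3] for i in range(4)]   # then slice into 4 groups of 3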
If I'm understanding your question properly, this should work:
import random

def weighted_sample(L, x):
    # might consider raising some kind of exception if len(set(L)) < x
    while True:
        ans = random.sample(L, x)
        if len(set(ans)) == x:
            return ans
Then if you want many such samples you can just do something like:
[weighted_sample(L, x) for _ in range(num_samples)]
I have a hard time conceiving of a comprehension for the sampling logic that isn't just obfuscated. The logic is a bit too complicated. It sounds like something randomly tacked on to a homework assignment to me.
If you don't like infinite looping, I haven't tried it but I think this will work:
import collections
import random

def weighted_sample(L, x):
    ans = []
    c = collections.Counter(L)
    while len(ans) < x:
        # randint is inclusive on both ends, hence the -1
        r = random.randint(0, sum(c.values()) - 1)
        for k in c:
            if r < c[k]:
                ans.append(k)
                del c[k]
                break
            else:
                r -= c[k]
        else:
            # this should never happen on valid input
            raise RuntimeError("ran out of weight to consume")
    return ans
First of all, I assume your list looks like:
[1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]
If you want to print the permutations of size 3 drawn from the distinct values of the given list, you can do the following.
import itertools

l = [1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]
for permutation in itertools.permutations(list(set(l)), 3):
    print permutation,
Output:
(1, 2, 3) (1, 2, 4) (1, 2, 5) (1, 3, 2) (1, 3, 4) (1, 3, 5) (1, 4, 2) (1, 4, 3) (1, 4, 5) (1, 5, 2) (1, 5, 3) (1, 5, 4) (2, 1, 3) (2, 1, 4) (2, 1, 5) (2, 3, 1) (2, 3, 4) (2, 3, 5) (2, 4, 1) (2, 4, 3) (2, 4, 5) (2, 5, 1) (2, 5, 3) (2, 5, 4) (3, 1, 2) (3, 1, 4) (3, 1, 5) (3, 2, 1) (3, 2, 4) (3, 2, 5) (3, 4, 1) (3, 4, 2) (3, 4, 5) (3, 5, 1) (3, 5, 2) (3, 5, 4) (4, 1, 2) (4, 1, 3) (4, 1, 5) (4, 2, 1) (4, 2, 3) (4, 2, 5) (4, 3, 1) (4, 3, 2) (4, 3, 5) (4, 5, 1) (4, 5, 2) (4, 5, 3) (5, 1, 2) (5, 1, 3) (5, 1, 4) (5, 2, 1) (5, 2, 3) (5, 2, 4) (5, 3, 1) (5, 3, 2) (5, 3, 4) (5, 4, 1) (5, 4, 2) (5, 4, 3)
Hope this helps. :)
>>> from random import shuffle
>>> L = [1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]
>>> x=3
>>> shuffle(L)
>>> zip(*[L[i::x] for i in range(x)])
[(1, 3, 2), (2, 2, 1), (4, 5, 3), (1, 4, 4)]
You could also use a generator expression instead of the list comprehension
>>> zip(*(L[i::x] for i in range(x)))
[(1, 3, 2), (2, 2, 1), (4, 5, 3), (1, 4, 4)]
Starting with a way to do it without list comprehensions:
import random
import itertools

alphabet = [1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]

def alphas():
    while True:
        yield random.choice(alphabet)

def filter_unique(iter):
    found = set()
    for a in iter:
        if a not in found:
            found.add(a)
            yield a

def dice(x):
    while True:
        yield itertools.islice(
            filter_unique(alphas()),
            x
        )

for i, output in enumerate(dice(3)):
    print list(output)
    if i > 10:
        break
The part where list comprehensions run into trouble is filter_unique(), since a list comprehension has no 'memory' of what it has already output. A possible workaround is to keep generating outputs until one of good quality is found, as @DSM suggested.
The slow, naive approach is:
import random

def pick_n_unique(l, n):
    res = set()
    while len(res) < n:
        res.add(random.choice(l))
    return list(res)
This will pick elements and only quit when it has n unique ones:
>>> pick_n_unique([1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2], 3)
[2, 3, 4]
>>> pick_n_unique([1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2], 3)
[3, 4, 5]
However it can get slow if, for example, you have a list with thirty 1s and one 2, since once it has a 1 it'll keep spinning until it finally hits a 2. A better approach is to count the number of occurrences of each unique element, choose a random one weighted by its occurrence count, remove that element from the counts, and repeat until you have the desired number of elements:
import collections
import random

def weighted_choice(item__counts):
    total_counts = sum(count for item, count in item__counts.items())
    which_count = random.random() * total_counts
    for item, count in item__counts.items():
        which_count -= count
        if which_count < 0:
            return item
    raise ValueError("Should never get here")

def pick_n_unique(items, n):
    item__counts = collections.Counter(items)
    if len(item__counts) < n:
        raise ValueError(
            "Can't pick %d values with only %d unique values" % (
                n, len(item__counts)))
    res = []
    for i in xrange(n):
        choice = weighted_choice(item__counts)
        res.append(choice)
        del item__counts[choice]
    return tuple(res)
Either way, this is a problem not well-suited to list comprehensions.
def sample(self, population, k):
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("sample larger than population")
    result = [None] * k
    try:
        selected = set()
        selected_add = selected.add
        for i in xrange(k):
            j = int(random.random() * n)
            while j in selected:
                j = int(random.random() * n)
            selected_add(j)
            result[i] = population[j]
    except (TypeError, KeyError):  # handle (at least) sets
        if isinstance(population, list):
            raise
        return self.sample(tuple(population), k)
    return result
Above is a simplified version of the sample function from Lib/random.py; I only removed some optimization code for small data sets. It tells us directly how to implement a customized sample function:
1. Get a random number.
2. If the number has appeared before, just discard it and get a new one.
3. Repeat the above steps until you have all the sample numbers you want.
Then the real problem turns out to be how to get a random value from a list by weight. This could be done with the original random.sample(population, 1) from the Python standard library (a little overkill here, but simple).
Below is an implementation. Because duplicates represent weight in your given list, we can use int(random.random() * array_length) to get a random index into your array.
import random

arr = [1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]

def sample_by_weight(population, k):
    n = len(population)
    if not 0 <= k <= len(set(population)):
        raise ValueError("sample larger than population")
    result = [None] * k
    try:
        selected = set()
        selected_add = selected.add
        for i in xrange(k):
            j = population[int(random.random() * n)]
            while j in selected:
                j = population[int(random.random() * n)]
            selected_add(j)
            result[i] = j
    except (TypeError, KeyError):  # handle (at least) sets
        if isinstance(population, list):
            raise
        return sample_by_weight(tuple(population), k)
    return result

[sample_by_weight(arr, 3) for i in range(10)]
With the setup:
from random import shuffle
from collections import deque
l = [1, 2, 1, 4, 5, 2, 3, 2, 4, 5, 3, 1, 4, 2]
This code:
def getSubLists(l, n):
    shuffle(l)                # shuffle l so the elements are in 'random' order
    l = deque(l, len(l))      # create a structure with O(1) insert/pop at both ends
    while l:                  # while there are still elements to choose
        sample = set()        # use a set (O(1) membership test) to check for duplicates
        while len(sample) < n and l:  # until the sample is n long or l is exhausted
            top = l.pop()     # get the top value in l
            if top in sample:
                l.appendleft(top)  # add it to the back of l for a later sample
            else:
                sample.add(top)    # it isn't in sample already so use it
        yield sample          # yield the sample
You end up with:
for s in getSubLists(l,3):
print s
>>>
set([1, 2, 5])
set([1, 2, 3])
set([2, 4, 5])
set([2, 3, 4])
set([1, 4])
