Say I have a range(1, n + 1). I want to get m unique pairs.
What I found is that if the number of pairs is close to n(n-1)/2 (the maximum number of pairs), one can't simply generate random pairs every time, because the new pairs will start colliding with existing ones. I'm looking for a somewhat lazy solution that will be very efficient (in Python's world).
My attempt so far:
import random

def get_input(n, m):
    res = str(n) + "\n" + str(m) + "\n"
    buffet = range(1, n + 1)
    points = set()
    while len(points) < m:
        x, y = random.sample(buffet, 2)
        points.add((x, y) if x > y else (y, x))  # meeh
    for (x, y) in points:
        res += "%d %d\n" % (x, y)
    return res
You can use combinations to generate all pairs and use sample to choose randomly. Admittedly this is only lazy in the "not much to type" sense, and not in the "use a generator instead of a list" sense :-)
from itertools import combinations
from random import sample

n = 100
sample(list(combinations(range(1, n + 1), 2)), 5)
If you want to improve performance, you can make it lazy by studying this:
Python random sample with a generator / iterable / iterator
the generator you want to sample from is this: combinations(range(1, n + 1), 2)
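For instance, here is a minimal reservoir-sampling sketch (my addition, in the spirit of the linked question; sample_from_iterable is an illustrative name) that draws m pairs in a single pass without materializing the full list:

import random
from itertools import combinations

def sample_from_iterable(iterable, m):
    # Algorithm R: keep the first m items, then replace a random
    # slot with decreasing probability as more items stream by
    reservoir = []
    for i, item in enumerate(iterable):
        if i < m:
            reservoir.append(item)
        else:
            j = random.randrange(i + 1)
            if j < m:
                reservoir[j] = item
    return reservoir

n = 100
print(sample_from_iterable(combinations(range(1, n + 1), 2), 5))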
Here is an approach which works by taking a number in the range 0 to n*(n-1)/2 - 1 and decoding it to a unique pair of items in the range 0 to n-1. I used 0-based math for convenience, but you could of course add 1 to both members of every returned pair if you want:
import math
import random

def decode(i):
    k = math.floor((1 + math.sqrt(1 + 8 * i)) / 2)
    return k, i - k * (k - 1) // 2

def rand_pair(n):
    return decode(random.randrange(n * (n - 1) // 2))

def rand_pairs(n, m):
    return [decode(i) for i in random.sample(range(n * (n - 1) // 2), m)]
For example:
>>> rand_pairs(5,8)
[(2, 1), (3, 1), (4, 2), (2, 0), (3, 2), (4, 1), (1, 0), (4, 0)]
The math is hard to explain briefly, but the k in the definition of decode is obtained by solving a quadratic equation: it is the number of triangular numbers that are <= i, and where i falls within the sequence of triangular numbers tells you how to decode a unique pair from it. The interesting thing about this decode is that it doesn't use n at all; it implements a one-to-one correspondence between the set of natural numbers (starting at 0) and the set of all pairs of natural numbers.
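As a quick sanity check (my addition), you can verify that decode hits every pair exactly once over the full index range for a small n:

# every index in [0, n*(n-1)/2) should decode to a distinct pair (x, y) with x > y >= 0
n = 10
pairs = {decode(i) for i in range(n * (n - 1) // 2)}
assert len(pairs) == n * (n - 1) // 2
assert all(x > y >= 0 for x, y in pairs)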
I don't think anything along your current line of attack can improve. After all, as m gets closer and closer to the limit n(n-1)/2, you have a thinner and thinner chance of finding an unseen pair.
I would suggest splitting this into two cases: if m is small, use your random approach. But if m is large enough, try

pairs = list(itertools.combinations(buffet, 2))
points = random.sample(pairs, m)
Now you have to determine the threshold on m that decides which code path to take. You need some math here to find the right trade-off.
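A rough sketch of that hybrid dispatch (my addition; the 0.5 threshold is a placeholder that would have to be tuned by measurement):

import itertools
import random

def get_pairs(n, m, threshold=0.5):
    total = n * (n - 1) // 2
    if m > threshold * total:
        # dense case: materialize all pairs once and sample from them
        return random.sample(list(itertools.combinations(range(1, n + 1), 2)), m)
    # sparse case: rejection sampling, collisions are rare here
    points = set()
    while len(points) < m:
        x, y = random.sample(range(1, n + 1), 2)
        points.add((x, y) if x > y else (y, x))
    return list(points)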
Suppose we have data a₁, ..., aₙ, where n is an even integer and each aᵢ ∈ ℝ. Also define the distance between two elements as dis(aᵢ, aⱼ) = |aᵢ − aⱼ|. Now the program should output a list of pairs of elements sorted by distance in ascending order. The program should also pack the input data into pairs, so each element aᵢ appears only once in the output.
For example, given the input [1, 0.4, 3, 1.1] the output should be [(1, 1.1), (0.4, 3)].
A naive brute-force method is to calculate all C(n,2) pairs and sort them by distance.
def not_in_list_of_pair(i, ls):
    return i not in [p[0] for p in ls] + [p[1] for p in ls]

def calc(ls):
    ls = sorted(ls)
    d = {}
    for idx1, i in enumerate(ls[:-1]):
        for idx2, j in enumerate(ls[idx1+1:], idx1 + 1):
            d[(i, j)] = j - i
    # 2nd part: greedily take the closest pair whose elements are unused
    res = []
    for pair in sorted(d, key=lambda k: d[k]):
        i, j = pair
        if not_in_list_of_pair(i, res) and not_in_list_of_pair(j, res):
            res.append(pair)
    return res
# another example
ls = [1, 0.1, 2, 2.4, 3, 4, 1.5]
assert calc(ls) == [(2, 2.4), (1, 1.5), (3, 4)]
But this naive method runs in O(n²), and the 2nd part (extracting the minimum distance) is also slow. Therefore I am looking for a more efficient method to solve this problem. Thanks!
I have to say that your description of the problem is not clear, and the complexity given in the description is not correct: you have to calculate the distance of every pair of numbers (which is O(n²)), and after that you sort all the distances (which is O(n² log(n²))).
For this problem, you are basically finding the two numbers with the smallest distance, picking them out, and repeating the same process on the remaining numbers.
One naive solution: suppose the numbers are sorted and we only want to find the one pair with the smallest distance. Then we just need to calculate the distance between each two adjacent numbers (e.g., between ls[0] and ls[1], between ls[1] and ls[2], ..., between ls[n - 2] and ls[n - 1]) and find out which pair is the closest. After we find one, we remove the two selected numbers, and the remaining numbers are still sorted. To find the next pair with the smallest distance, the problem remains the same.
The naive solution is still expensive in two aspects: (1) we need to recalculate the distance between each two adjacent numbers every time; (2) we need to remove two numbers from a sorted array and keep the array sorted.
To solve (1), we don't in fact have to recalculate all the distances each time. E.g., suppose we have 6 numbers and have calculated dist(0, 1), dist(1, 2), dist(2, 3), dist(3, 4), dist(4, 5). If we find that the numbers at indices 2 and 3 are the closest ones, we output and remove them. For the next round we need dist(0, 1), dist(1, 4), dist(4, 5): we only have to discard dist(1, 2) and dist(3, 4), which are now useless, and add one new distance, dist(1, 4), while dist(0, 1) and dist(4, 5) are unchanged. We can maintain a btree to achieve this.
To solve (2), the best data structure for removing items from the middle is a doubly linked list, which does it in O(1). But we are using an array now, and we may not want to convert the array to a linked list. One way around this is to use index arrays to mimic a doubly linked list.
Here is an example.
Update 1: I found that OrderedDict does not pop the minimal item each time, and I couldn't find any data structure in Python that works as a btree. I have to use a heap, in which I cannot delete the useless distances, but I can identify and ignore them. Sorry for the mistake.
Update 2: Added an else branch in the while loop, i.e., we should not change the doubly linked list when we see a useless item.
Update 3: Just realized that the heap will have no more than n items in each iteration of the while loop, so the complexity is roughly O(n log n), with n being the number of input values.
from heapq import heappush, heappop

def calc(ls):
    ls = sorted(ls)  # O(n log n)
    n = len(ls)
    # mimic a doubly linked list with index arrays
    left = [i - 1 for i in range(n)]
    right = [i + 1 for i in range(n)]
    appeared = [False for i in range(n)]
    btree = []  # actually a heap, see Update 1
    for i in range(0, n - 1):
        # distance of adjacent values, and their indices
        heappush(btree, (ls[i + 1] - ls[i], i, i + 1))
    # roughly O(n log n), because the heap has at most n items in each iteration
    result = []
    while len(btree) != 0:
        minimal = heappop(btree)
        a, b = minimal[1:3]
        # skip if either a or b already appeared in an output pair
        if not appeared[a] and not appeared[b]:
            result.append((ls[a], ls[b]))
            appeared[a] = True
            appeared[b] = True
        else:
            continue  # this is important
        # unlink a and b, then add the distance bridging the new gap
        if left[a] != -1:
            right[left[a]] = right[b]
        if right[b] != n:
            left[right[b]] = left[a]
        if left[a] != -1 and right[b] != n:
            heappush(btree, (ls[right[b]] - ls[left[a]], left[a], right[b]))
    return result
ls = [1, 0.1, 2, 2.4, 3, 4, 1.5]
print(calc(ls))
With the following output:
[(2, 2.4), (1, 1.5), (3, 4)]
Note: The number of input integers is 7, which is NOT even.
I am not very familiar with Python, so I may not be using the best data structure in the above code snippet.
I got this problem on CoderByte. The requirement was to find the number of ways. I found solutions for that on Stack Overflow and other sites. But going further, I also need all the possible ways to reach the Nth step.
Problem description: There is a staircase of N steps and you can climb either 1 or 2 steps at a time. You need to count and return the total number of unique ways to climb the staircase. The order of steps taken matters.
For example:
Input: N = 3
Output: 3
Explanation: There are 3 unique ways of climbing a staircase of 3 steps: {1,1,1}, {2,1} and {1,2}.
Note: There might be another case where a person can take 2, 3, or 4 steps at a time (I know that's not realistic, but I'm trying to make the allowed step sizes a scalable input in the code).
I'm unable to find the right logic to get all the possible ways. A solution in Python would be useful, but that's not a strict requirement.
Here's a minimal solution using the itertools library:
from itertools import permutations, chain
solve = lambda n: [(1,)*n] + list(set(chain(*[permutations((2,)*i + (1,)*(n-2*i)) for i in range(1, n//2+1)])))
For your example input:
> solve(3)
[(1, 1, 1), (1, 2), (2, 1)]
How does it work?
It's easier to see what's happening if we take a step backwards:
def solve(n):
    combinations = [(1,)*n]
    for i in range(1, n//2 + 1):
        combinations.extend(permutations((2,)*i + (1,)*(n - 2*i)))
    return list(set(combinations))
The most trivial case is the one where you take one step at a time, i.e., n single steps: (1,)*n. Then we can look at how many double steps we could take at most, which is the floor of n divided by 2: n//2. We then iterate over the possible numbers of double steps: on each iteration we place i double steps, (2,)*i, and fill the remaining space with single steps, (1,)*(n-2*i).
The function permutations from itertools generates all possible orderings of the single and double steps for that iteration; with an input of (1,1,2), it yields (1,1,2), (1,2,1) and (2,1,1), including duplicates. At the end we use the trick of converting the result to a set in order to remove the duplicates, then convert it back into a list.
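To see the duplicates and the set-based deduplication concretely (my addition):

>>> from itertools import permutations
>>> list(permutations((1, 1, 2)))
[(1, 1, 2), (1, 2, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 1, 1)]
>>> sorted(set(permutations((1, 1, 2))))
[(1, 1, 2), (1, 2, 1), (2, 1, 1)]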
Generalization for any amount and length of steps (not optimal!)
One liner:
from itertools import permutations, chain, combinations_with_replacement
solve = lambda n, steps: list(set(chain(*[permutations(sequence) for sequence in chain(*[combinations_with_replacement(steps, r) for r in range(n//min(steps)+1)]) if sum(sequence) == n])))
Example output:
> solve(8, [2,3])
[(3, 2, 3), (2, 3, 3), (2, 2, 2, 2), (3, 3, 2)]
Easier to read version:
def solve(n, steps):
    result = []
    for sequence_length in range(n//min(steps) + 1):
        sequences = combinations_with_replacement(steps, sequence_length)
        for sequence in sequences:
            if sum(sequence) == n:
                result.extend(permutations(sequence))
    return list(set(result))
def solve(n):
    if n == 0:
        return [[]]
    left_results = []
    right_results = []
    if n > 0:
        left_results = solve(n - 1)
        for res in left_results:  # add the current step to every result
            res.append(1)
    if n > 1:
        right_results = solve(n - 2)
        for res in right_results:  # same as above
            res.append(2)
    return left_results + right_results
I think there is a better way to do this using dynamic programming but I don't know how to do that. Hope it helps anyway.
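For what it's worth, here is one possible bottom-up dynamic-programming sketch (my addition, not the answerer's code; all names are illustrative):

def all_ways(n, steps=(1, 2)):
    # ways[i] holds every sequence of steps summing to i, built bottom-up
    ways = [[] for _ in range(n + 1)]
    ways[0] = [[]]  # one way to climb zero steps: the empty sequence
    for i in range(1, n + 1):
        for s in steps:
            if s <= i:
                ways[i].extend(seq + [s] for seq in ways[i - s])
    return ways[n]

print(all_ways(3))  # [[1, 1, 1], [2, 1], [1, 2]]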
I would like to write a function my_func(n, l) that, for some positive integer n, efficiently enumerates the ordered non-negative integer compositions* of n of length l (where l is greater than n). For example, I want my_func(2, 3) to return [[0,0,2],[0,2,0],[2,0,0],[1,1,0],[1,0,1],[0,1,1]].
My initial idea was to use existing code for positive integer partitions (e.g. accel_asc() from this post), extend each positive integer partition with zeros up to length l, and return all permutations.
import itertools
import numpy

# accel_asc() as defined in the linked post
def my_func(n, l):
    for ip in accel_asc(n):
        nic = numpy.zeros(l, dtype=int)
        nic[:len(ip)] = ip
        for p in itertools.permutations(nic):
            yield p
The output of this function is wrong, because every non-negative integer composition in which a number appears twice (or more) shows up several times in the output of my_func. For example, list(my_func(2, 3)) returns [(1, 1, 0), (1, 0, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 1, 1), (2, 0, 0), (2, 0, 0), (0, 2, 0), (0, 0, 2), (0, 2, 0), (0, 0, 2)].
I could correct this by generating a list of all non-negative integer compositions, removing repeated entries, and then returning a remaining list (instead of a generator). But this seems incredibly inefficient and will likely run into memory issues. What is a better way to fix this?
EDIT
I did a quick comparison of the solutions offered in answers to this post and to another post that cglacet has pointed out in the comments.
On the left we have l = 2*n, and on the right we have l = n+1. In these two cases, user2357112's second solution is faster than the others when n <= 5. For n > 5, the solutions proposed by user2357112, Nathan Verzemnieks, and AndyP are more or less tied. But the conclusions could be different for other relationships between l and n.
[benchmark plots omitted]
*I originally asked for non-negative integer partitions. Joseph Wood correctly pointed out that I am in fact looking for integer compositions, because the order of numbers in a sequence matters to me.
Use the stars and bars concept: pick positions to place l-1 bars between n stars, and count how many stars end up in each section:
import itertools

def diff(seq):
    return [seq[i+1] - seq[i] for i in range(len(seq)-1)]

def generator(n, l):
    for combination in itertools.combinations_with_replacement(range(n+1), l-1):
        yield [combination[0]] + diff(combination) + [n - combination[-1]]
I've used combinations_with_replacement instead of combinations here, so the index handling is a bit different from what you'd need with combinations. The code with combinations would more closely match a standard treatment of stars and bars.
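For reference, here is a sketch of that standard stars-and-bars form with plain combinations (my addition, continuing the snippet above; generator_bars is an illustrative name): choose l-1 bar positions among n+l-1 slots, and the gaps between consecutive bars are the counts.

def generator_bars(n, l):
    # each chosen slot holds a bar; the stars fill the remaining slots
    for bars in itertools.combinations(range(n + l - 1), l - 1):
        ext = (-1,) + bars + (n + l - 1,)
        yield [ext[i + 1] - ext[i] - 1 for i in range(l)]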
Alternatively, a different way to use combinations_with_replacement: start with a list of l zeros, pick n positions with replacement from l possible positions, and add 1 to each of the chosen positions to produce an output:
def generator2(n, l):
    for combination in itertools.combinations_with_replacement(range(l), n):
        output = [0]*l
        for i in combination:
            output[i] += 1
        yield output
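For instance (my quick check):

>>> list(generator2(2, 3))
[[2, 0, 0], [1, 1, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2]]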
Starting from a simple recursive solution, which has the same problem as yours:
def nn_partitions(n, l):
    if n == 0:
        yield [0] * l
    else:
        for part in nn_partitions(n - 1, l):
            for i in range(l):
                new = list(part)
                new[i] += 1
                yield new
That is, for each partition of the next lower number, and for each place in that partition, add 1 to the element in that place. It yields the same duplicates yours does. I remembered a trick for a similar problem, though: when you alter a partition p for n into one for n+1, fix all the elements of p to the left of the element you increase. That is, keep track of where p was modified, and never modify any of p's "descendants" to the left of that point. Here's the code for that:
def _nn_partitions(n, l):
    if n == 0:
        yield [0] * l, 0
    else:
        for part, start in _nn_partitions(n - 1, l):
            for i in range(start, l):
                new = list(part)
                new[i] += 1
                yield new, i

def nn_partitions(n, l):
    for part, _ in _nn_partitions(n, l):
        yield part
It's very similar; there's just the extra parameter passed along at each step, so I added a wrapper to strip it off for the caller.
I haven't tested it extensively, but this appears to be reasonably fast - about 35 microseconds for nn_partitions(3, 5) and about 18s for nn_partitions(10, 20) (which yields just over 20 million partitions). (The very elegant solution from user2357112 takes about twice as long for the smaller case and about four times as long for the larger one. Edit: this refers to the first solution from that answer; the second one is faster than mine under some circumstances and slower under others.)
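A quick usage check (my addition):

>>> list(nn_partitions(2, 3))
[[2, 0, 0], [1, 1, 0], [1, 0, 1], [0, 2, 0], [0, 1, 1], [0, 0, 2]]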
Given a set of N elements, I want to choose m random, non-repeating subsets of k elements.
If I were looking to generate all the N choose k combinations, I could have used itertools.combinations, so one way to do what I'm asking would be:
import numpy as np
import itertools

n = 10
A = np.arange(n)
k = 4
m = 5

result = np.random.permutation([x for x in itertools.combinations(A, k)])[:m]
print(result)
The problem is of course that this code first generates all the possible combinations, and that this can be quite expensive.
Another suboptimal solution would be to choose a single combination at random each time (e.g. choose-at-random-from-combinations, then sort to get a canonical tuple) and discard it if it has already been selected.
Is there a better way to do this?
Your second solution seems to be the only practical way to do it. It will work well unless k is close to n and m is "large", in which case there will be more repetitions.
I added a count of the tries needed to get the samples we need. For m=50, with n=10 and k=4, it usually takes fewer than 60 tries. You can see how it goes with the size of your population and your samples.
You can use random.sample to get a list of k values without replacement, then sort it and turn it into a tuple. That way, we can use a set to keep only unique results.
import random
n = 10
A = list(range(n))
k = 4
m = 5
samples = set()
tries = 0
while len(samples) < m:
    samples.add(tuple(sorted(random.sample(A, k))))
    tries += 1
print(samples)
print(tries)
# {(1, 4, 5, 9), (0, 3, 6, 8), (0, 4, 7, 8), (3, 5, 7, 9), (1, 2, 3, 4)}
# 6
# 6 tries this time !
The simplest way to do it is to random.shuffle a list of the range, then take the first k elements (repeating until m valid samples are collected).
Of course this procedure cannot guarantee unique samples; you have to check each new sample against the history if you really need uniqueness.
Since Python 2.3, random.sample(population, k) can be used to produce a sample in a more efficient way.
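A small sketch of that procedure (my addition; the helper name is illustrative):

import random

def random_subsets(n, k, m):
    # draw m distinct k-subsets of range(n) by shuffling and
    # keeping each candidate's sorted tuple in a set of seen samples
    seen = set()
    items = list(range(n))
    while len(seen) < m:
        random.shuffle(items)
        seen.add(tuple(sorted(items[:k])))
    return list(seen)

print(random_subsets(10, 4, 5))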
What is an efficient way to do the following in Python?
Given N symbols, iterate through all length-L sequences of the N symbols that include all N symbols.
The order of iteration does not matter, as long as all such sequences are covered, each exactly once.
Let's call this iterator seq(symbols,L). Then, for example,
list(seq([1,2,3],2))=[]
list(seq([1,2,3],3))=[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
list(seq([1,2,3],4))=[(1, 1, 2, 3), (1, 1, 3, 2), (1, 2, 1, 3), ...
Here's an intuitive, yet slow implementation:
import itertools

def seq(symbols, L):
    for x in itertools.product(symbols, repeat=L):
        if all(s in x for s in symbols):
            yield x
When N is large and L is close to N, there is a lot of wasted effort. For example, when L == N, it would be much better to use itertools.permutations(). Since every sequence needs to contain all N symbols, it seems like a better solution would somehow start with the permutations and then add in the extra repeated symbols, but I can't figure out how to do this without double counting (and without resorting to saving all previous output to check for repeats).
An idea:
import itertools

def solve(size, symbols, todo=None):
    if todo is None:
        todo = frozenset(symbols)
    if size < len(todo):
        return
    if size == len(todo):
        # use sorted(todo) here for lexicographical order
        yield from itertools.permutations(todo)
        return
    for s in symbols:
        for xs in solve(size - 1, symbols, todo - frozenset((s,))):
            yield (s,) + xs

for x in solve(5, (1, 2, 3)):
    print(x)
This will print all sequences of size 5 that contain each of 1, 2, 3 plus 2 more arbitrary elements. You can use bitmasks instead of a set if you aim for efficiency (a sketch follows the "proof" below), but I guess you're not, since you are using Python :) The complexity is optimal in the sense that it is linear in the output size.
Some "proof":
$ python3 test.py | wc -l # number of output lines
150
$ python3 test.py | sort | uniq | wc -l # unique output lines
150
$ python3 test.py | grep "1"|grep "2"|grep "3"| wc -l # lines with 1,2,3
150
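Regarding the bitmask remark, here is one possible shape it could take (my addition, purely illustrative):

import itertools

def solve_bits(size, symbols):
    n = len(symbols)

    def rec(size, todo):  # todo: bitmask of symbols still missing
        missing = bin(todo).count("1")
        if size < missing:
            return
        if size == missing:
            yield from itertools.permutations(
                [symbols[i] for i in range(n) if todo >> i & 1])
            return
        for i, s in enumerate(symbols):
            for xs in rec(size - 1, todo & ~(1 << i)):
                yield (s,) + xs

    yield from rec(size, (1 << n) - 1)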
You can do this by breaking the problem into two parts:
Find every possible multiset of size L of N symbols which includes every symbol at least once.
For each multiset, find all unique permutations.
For simplicity, let's suppose the N symbols are the integers in range(N). Then we can represent a multiset as a vector of length N whose values are non-negative integers summing to L. To restrict the multiset to include every symbol at least once, we require that the values in the vector all be strictly positive.
def msets(L, N):
    if L == N:
        yield (1,) * L
    elif N == 1:
        yield (L,)
    elif N > 0:
        for i in range(L - N + 1):
            for m in msets(L - i - 1, N - 1):
                yield (i + 1,) + m
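For example (my quick check), the multisets of size 4 over 3 symbols, as count vectors:

>>> list(msets(4, 3))
[(1, 1, 2), (1, 2, 1), (2, 1, 1)]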
Unfortunately, itertools.permutations does not produce only the unique orderings of a list with repeated elements. If we were writing this in C++, we could use std::next_permutation, which does produce each distinct permutation exactly once. There is a sample implementation (in C++, but it's straightforward to convert it to Python) on the linked page.
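A possible Python stand-in for that missing piece (my sketch, not the linked C++ code), plus the glue that combines it with msets above to complete the two-part plan:

from collections import Counter

def unique_permutations(seq):
    # yield each distinct permutation of seq exactly once by always
    # choosing the next element from the multiset of remaining values
    counts = Counter(seq)
    n = len(seq)

    def rec(prefix):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        for value in sorted(counts):
            if counts[value] > 0:
                counts[value] -= 1
                prefix.append(value)
                yield from rec(prefix)
                prefix.pop()
                counts[value] += 1

    yield from rec([])

def seq(symbols, L):
    # expand each count vector from msets into a concrete list,
    # then emit that list's distinct permutations
    for counts in msets(L, len(symbols)):
        base = [s for s, c in zip(symbols, counts) for _ in range(c)]
        yield from unique_permutations(base)

print(len(list(seq([1, 2, 3], 4))))  # 36 sequences of length 4 covering all 3 symbols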