I need to split the total number of elements in the iterator
tot = itertools.combinations(dict1.keys(), 2) into 3 parts.
The size of dict1 is 285056.
Total combinations possible ≈ 40 billion.
My goal is to somehow divide these 40 billion into 3 parts of roughly 13.5 billion elements each, to process on different processors in parallel. At the moment I am naively iterating over the 40 billion and dumping a pickle file each time I reach 13.5 billion, which isn't efficient: each 13.5 billion pickle is 160 GB on disk (and much larger when loaded into memory).
So, is there a way to have the first script handle the combinations up to the 13.5 billionth element, the second script start right after that, and so on, without iterating through everything the way I do now?
Below is the code I use to take a certain number of elements from the combinations iterable:
import itertools

def grouper(n, iterable):
    it = iter(iterable)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

for first_chunk in grouper(1350000000, tot):
    ...  # process this chunk
It is easy to create this kind of split with itertools. Given the set of keys, we can test whether the first element of a generated combination belongs to the computation for machine i.
In the code below, I show a crude solution for this, with
the code in the for loop intended to be split over 3 machines.
Machine i will run the code for the i-th segment of
the keys for the first element of the combination,
combined with the full set for the second element.
The combinations are supposed to be processed in the line
where cnt2 is calculated. Replace that with the kind of for
loop you want to process your combinations with.
Compared with generating and storing all combinations, this
solution does not store any, but it will (internally) generate
all. But what's a couple of billion combinations between friends?
import itertools

my_keys = range(12)

my_set_prefix = []
for i in range(3):
    my_set_prefix.append((len(my_keys) * i // 3, len(my_keys) * (i + 1) // 3))
print(f"== partition: {my_set_prefix}")

def is_not_for_machine(i, t):
    """t belongs to machine i if its first element falls in the
    key segment my_set_prefix[i]; return True otherwise."""
    if my_set_prefix[i][0] <= t[0] < my_set_prefix[i][1]:
        return False
    return True

all = itertools.combinations(my_keys, 2)
cnt = len([_ for _ in all])
print(f"== total set size {cnt}")

for i in range(3):
    all = itertools.combinations(my_keys, 2)
    cnt2 = len([_ for _ in itertools.filterfalse(lambda t: is_not_for_machine(i, t), all)])
    print(f"== set size for prefix {my_set_prefix[i]}: {cnt2}")
The output shows that some load balancing might be necessary, since this partition is "triangular descending": the first segment gets the highest count.
== partition: [(0, 4), (4, 8), (8, 12)]
== total set size 66
== set size for prefix (0, 4): 38
== set size for prefix (4, 8): 22
== set size for prefix (8, 12): 6
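If the three parts need to be more balanced, one option (a sketch of mine, not part of the answer above) is to place the first-element boundaries using the cumulative pair counts instead of splitting the keys evenly: for 2-combinations of n keys, first element j contributes n - 1 - j pairs, and the cumulative count b*(n-1) - b*(b-1)/2 can be inverted in closed form.

import math

def balanced_boundaries(n, parts):
    # Choose first-element boundaries so each part covers roughly
    # total/parts of the C(n, 2) combinations.
    total = n * (n - 1) // 2
    bounds = [0]
    for i in range(1, parts):
        target = total * i / parts
        # invert the cumulative count b*(n-1) - b*(b-1)/2 == target
        b = (n - 0.5) - math.sqrt((n - 0.5) ** 2 - 2 * target)
        bounds.append(round(b))
    bounds.append(n)
    return bounds

print(balanced_boundaries(12, 3))  # [0, 2, 5, 12] -> segment sizes 21, 24, 21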
Why not directly use the math.comb function to get the number of combinations?
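For example, a quick sketch of that suggestion (math.comb needs Python 3.8+; 285056 is the key count from the question above):

import math

print(math.comb(285056, 2))  # number of 2-element combinations, about 40.6 billion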
Using Python, here are some requirements:
I want to find a list of numbers that:
adds up to a given number (say 30)
is within a range (start, end), let's say (8, 20)
has Y (say 3) elements
Ex: [8, 10, 12]
I've tried the code below, which works for what I want, but it gives me ALL the combinations, which is very heavy on memory. To pick one I've just selected one at random, but I would like to use this for bigger lists with greater ranges, so this is not efficient.
list(combinations(list(range(8,20)),3))
The code you posted does not check for the sum.
The below snippets optimize memory use, not run time
If you are using Python 3, then combinations already returns a generator. All you have to do is iterate over the combinations; if the sum is correct, print the combination and break out of the loop:
from itertools import combinations

for comb in combinations(range(8, 20), 3):
    if sum(comb) == 30:
        print(comb)
        break
Outputs
(8, 9, 13)
Alternatively, you could use filter and then call next on the result. This way you can get as many combinations as you want:
from itertools import combinations
valid_combs = filter(lambda c: sum(c) == 30, combinations(range(8, 20), 3))
print(next(valid_combs))
print(next(valid_combs))
print(next(valid_combs))
Outputs
(8, 9, 13)
(8, 10, 12)
(9, 10, 11)
A slightly more advanced and dynamic solution uses a function and yield from (if you are using Python >= 3.3):
from itertools import combinations

def get_combs(r, n, s):
    yield from filter(lambda c: sum(c) == s, combinations(r, n))
valid_combs = get_combs(range(8, 20), 3, 30)
print(next(valid_combs))
print(next(valid_combs))
print(next(valid_combs))
Outputs
(8, 9, 13)
(8, 10, 12)
(9, 10, 11)
Here's an example of a recursive function that will do this efficiently.
def rangeToSum(start, stop, target, count):
    if count == 0:
        return []
    minSum = sum(range(start, start + count - 1))
    stop = min(stop, target + 1 - minSum)
    if count == 1:
        return [target] if target in range(start, stop) else []
    for n in reversed(range(start, stop)):
        subSum = rangeToSum(start, n, target - n, count - 1)
        if subSum:
            return subSum + [n]
    return []

print(rangeToSum(8, 20, 30, 3))  # [8, 9, 13]
The way it works is by trying the largest numbers first and calling itself to find numbers in the remaining range that add up to the remaining value. This will skip whole swaths of combinations that cannot produce the target sum, e.g. trying 20 skips over combinations containing 19, 18, 16, 15, 14, 13, 12 or 11.
It also takes into account the minimum sum that the first count-1 items will produce to further reduce the stop value. e.g. to reach 30 from 8 with 3 numbers will use at least 17 (8+9) for the first two numbers so the stop value of the range can be reduced to 14 because 17 + 13 will reach 30 and any higher value will go beyond 30.
For large numbers, the function will find a solution quickly in most cases but it could also take a very long time depending on the combination of parameters.
rangeToSum(80000,20000000,3000000,10) # 0.6 second
# [80000, 80002, 80004, 80006, 80008, 80010, 80012, 80014, 80016, 2279928]
If you need it to be faster, you could try memoization (e.g. using lru_cache from functools).
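A minimal sketch of that suggestion, with two assumptions of mine: the function is renamed rangeToSumCached, and it returns tuples instead of lists so the cached values stay immutable.

from functools import lru_cache

@lru_cache(maxsize=None)
def rangeToSumCached(start, stop, target, count):
    # same logic as rangeToSum above, with results cached per argument tuple
    if count == 0:
        return ()
    minSum = sum(range(start, start + count - 1))
    stop = min(stop, target + 1 - minSum)
    if count == 1:
        return (target,) if start <= target < stop else ()
    for n in reversed(range(start, stop)):
        sub = rangeToSumCached(start, n, target - n, count - 1)
        if sub:
            return sub + (n,)
    return ()

print(rangeToSumCached(8, 20, 30, 3))  # same triple as above, as a tuple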
Given a set of N elements, I want to choose m random, non-repeating subsets of k elements.
If I were looking to generate all the N choose k combinations, I could
have used itertools.combinations, so one way to do what I'm asking would be:
import numpy as np
import itertools

n = 10
A = np.arange(n)
k = 4
m = 5

result = np.random.permutation([x for x in itertools.permutations(A, k)])[:m]
print(result)
The problem is of course that this code first generates all the possible permutations, and that this can be quite expensive.
Another suboptimal solution would be to repeatedly choose a single subset at random (e.g. draw k elements at random, then sort them to get a canonical ordering) and discard it if it has already been selected.
Is there a better way to do this?
Your second solution seems to be the only practical way to do it. It will work well unless k is close to n and m is "large", in which case there will be more repetitions.
I added a count of the tries needed to get the samples we need. For m=50, with n=10 and k=4, it usually takes fewer than 60 tries. You can see how it goes with the size of your population and your samples.
You can use random.sample to get a list of k values without replacement, then sort it and turn it into a tuple. So, we can use a set for keeping only unique results.
import random

n = 10
A = list(range(n))
k = 4
m = 5

samples = set()
tries = 0
while len(samples) < m:
    samples.add(tuple(sorted(random.sample(A, k))))
    tries += 1

print(samples)
print(tries)
# {(1, 4, 5, 9), (0, 3, 6, 8), (0, 4, 7, 8), (3, 5, 7, 9), (1, 2, 3, 4)}
# 6
# 6 tries this time!
The simplest way to do it is to random.shuffle(range) and then take the first k elements (repeated until m valid samples are collected).
Of course this procedure cannot guarantee unique samples on its own; you have to check each new sample against the set of samples already seen if you really need uniqueness.
Since Python 2.3, random.sample(range, k) can be used to produce a sample in a more efficient way.
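A minimal sketch of that idea (the function name and the set of already-seen samples are mine):

import random

def m_unique_samples(n, k, m):
    # shuffle, take the first k, repeat until m distinct samples have been seen
    population = list(range(n))
    seen = set()
    while len(seen) < m:
        random.shuffle(population)
        seen.add(tuple(sorted(population[:k])))
    return list(seen)

print(m_unique_samples(10, 4, 5))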
There are k treatments and N total tests to distribute among the treatments, which is called a plan. For a fixed plan, I want to output in Python all the possible success sets.
Question:
For example, if doctors are testing headache medicine, if k=2 types of treatments (i.e. Aspirin and Ibuprofen) and N=3 total tests, one plan could be (1 test for Aspirin, 2 tests for Ibuprofen). For that plan, how do I output all possible combinations of 0-1 successful tests of Aspirin and 0-2 successful tests for Ibuprofen? One successful test means that when a patient with a headache is given Aspirin, the Aspirin cures their headache.
Please post an answer with python code, NOT a math answer.
Desired output is a list within a list that has [# successes for treatment 1, # successes for treatment 2]:
[ [0,0], [0,1], [0,2], [1,0], [1,1], [1,2] ]
It would be great if yield could be used because the list above could be really long and I don't want to store the whole list in memory, which would increase computation time.
Below I have the code for enumerating all possible combinations of N balls in A boxes, which should be similar to creating all possible success sets I think, but I'm not sure how.
Code
# Return tuples of all possible plans (n1, ..., nk), where
# N = total # of tests = balls, K = # of treatments = boxes.
# Code: Glyph, http://stackoverflow.com/questions/996004/enumeration-of-combinations-of-n-balls-in-a-boxes
def ballsAndBoxes(balls, boxes, boxIndex=0, sumThusFar=0):
    if boxIndex < (boxes - 1):
        for counter in range(balls + 1 - sumThusFar):
            for rest in ballsAndBoxes(balls, boxes,
                                      boxIndex + 1,
                                      sumThusFar + counter):
                yield (counter,) + rest
    else:
        yield (balls - sumThusFar,)
Generating the plans is a partition problem, but generating the success sets for a given plan only requires generating the Cartesian product of a set of ranges.
from itertools import product

def success_sets(plan):
    return product(*map(lambda n: range(n + 1), plan))

plan = [1, 2]
for s in success_sets(plan):
    print(s)
# (0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)
Since itertools.product returns a generator, the entire list will not be stored in memory as requested.
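For instance, a small usage sketch reusing success_sets from the snippet above: because it is a generator, you can take just a few results without materializing the rest.

from itertools import islice

print(list(islice(success_sets([1, 2]), 3)))  # [(0, 0), (0, 1), (0, 2)]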
I am not sure exactly what you're trying to achieve. But combinations can be generated using itertools.
from itertools import combinations

# You can add an extra loop over all treatments
for j in range(1, N):  # N is the number of tests
    for i in combinations(tests, r=j):
        indexes = set(i)
        df_cur = tests[indexes]  # for tests I am using a pandas df
        if ...:  # condition for success
            ...  # actions
        else:
            ...  # other actions
I am trying to write an algorithm that picks N distinct items from a sequence at random, without knowing the size of the sequence in advance, and where it is expensive to iterate over the sequence more than once. For example, the elements of the sequence might be the lines of a huge file.
I have found a solution for N=1 (that is, "pick exactly one element at random from a huge sequence"):
import random

items = range(1, 10)  # Imagine this is a huge sequence of unknown length
count = 1
selected = None
for item in items:
    if random.random() * count < 1:
        selected = item
    count += 1
But how can I achieve the same thing for other values of N (say, N=3)?
If your sequence is short enough that reading it into memory and randomly sorting it is acceptable, then a straightforward approach would be to just use random.shuffle:
import random

arr = [1, 2, 3, 4]

# In-place shuffle
random.shuffle(arr)

# Take the first 2 elements of the now randomized array
print(arr[0:2])
# [1, 3]
Depending upon the type of your sequence, you may need to convert it to a list by calling list(your_sequence) on it, but this will work regardless of the types of the objects in your sequence.
Naturally, if you can't fit your sequence into memory or the memory or CPU requirements of this approach are too high for you, you will need to use a different solution.
Use reservoir sampling. It's a very simple algorithm that works for any N.
Here is one Python implementation, and here is another.
The simplest I've found is this answer on SO, improved a bit below:
import random
my_list = [1, 2, 3, 4, 5]
how_big = 2
new_list = random.sample(my_list, how_big)
# To preserve the order of the list, you could do:
randIndex = random.sample(range(len(my_list)), how_big)
randIndex.sort()
new_list = [my_list[i] for i in randIndex]
If you have Python 3.6+ you can use choices (note that it samples with replacement, so repeated items are possible):
from random import choices
items = range(1, 10)
new_items = choices(items, k = 3)
print(new_items)
[6, 3, 1]
@NPE is correct, but the implementations being linked to are sub-optimal and not very "pythonic". Here's a better implementation:
import random

def sample(iterator, k):
    """
    Samples k elements from an iterable object.

    :param iterator: an object that is iterable
    :param k: the number of items to sample
    """
    # fill the reservoir to start
    result = [next(iterator) for _ in range(k)]

    n = k - 1
    for item in iterator:
        n += 1
        s = random.randint(0, n)
        if s < k:
            result[s] = item
    return result
Edit: As @panda-34 pointed out, the original version was flawed, but not because I was using randint vs randrange. The issue was that my initial value for n didn't account for the fact that randint is inclusive on both ends of the range. Taking this into account fixes the issue. (Note: you could also use randrange, since it's inclusive on the minimum value and exclusive on the maximum value.)
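For illustration, a tiny sketch of the equivalence mentioned in the note (n = 9 is just an arbitrary example value):

import random

n = 9
# randint's upper bound is inclusive, randrange's is exclusive, hence the n + 1;
# both draw uniformly from 0..n
print(random.randint(0, n), random.randrange(n + 1))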
The following will give you N random items from an array X (note that random.choice draws with replacement, so duplicates are possible):
import random
list(map(lambda _: random.choice(X), range(N)))
It should be enough to accept or reject each new item just once and, if you accept it, throw out a randomly chosen old item.
Suppose you have selected N items out of K at random and you see the (K+1)-th item. Accept it with probability N/(K+1), and its probability is correct. Each current item got in with probability N/K and gets thrown out with probability (N/(K+1))(1/N) = 1/(K+1), so it survives with probability (N/K)(K/(K+1)) = N/(K+1), and its probability is correct too.
And yes I see somebody has pointed you to reservoir sampling - this is one explanation of how that works.
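A minimal sketch of that accept-and-evict rule (the function and variable names are mine; it is the usual reservoir-sampling loop written to match the description above):

import random

def reservoir_sample(iterable, n):
    reservoir = []
    for k, item in enumerate(iterable):
        if k < n:
            reservoir.append(item)  # the first n items are always kept
        elif random.random() < n / (k + 1):
            # accept the (k+1)-th item with probability n/(k+1),
            # evicting a randomly chosen old item
            reservoir[random.randrange(n)] = item
    return reservoir

print(reservoir_sample(range(1, 100), 3))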
As aix mentioned, reservoir sampling works. Another option is to generate a random number for every item you see and select the top k.
To do it iteratively, maintain a heap of k (random number, item) pairs and, whenever you see a new item, insert it into the heap if its random number is greater than the smallest one in the heap, as in the sketch below.
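A minimal sketch of that heap-based variant (the function name is mine):

import heapq
import random

def sample_top_k(iterable, k):
    heap = []  # min-heap of (random_key, item) pairs; keeps the k largest keys
    for item in iterable:
        key = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))  # drop the smallest key, add the new pair
    return [item for _, item in heap]

print(sample_top_k(range(1, 100), 3))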
This was my answer to a duplicate question (closed before I could post) that was somewhat related ("generating random numbers without any duplicates"). Since it is a different approach from the other answers, I'll leave it here in case it provides additional insight.
from random import randint

random_nums = []
N = ...  # whatever number of random numbers you want
r = ...  # lower bound of number range
R = ...  # upper bound of number range

x = 0
while x < N:
    random_num = randint(r, R)  # inclusive range
    if random_num in random_nums:
        continue
    else:
        random_nums.append(random_num)
        x += 1
The reason for using a while loop rather than a for loop is that it makes it easier not to skip results when duplicates come up (i.e. with a for loop, hitting 3 duplicates would leave you with only N-3 numbers).
Here's one implementation using the numpy library.
Assuming that N is smaller than the length of the array, you'd have to do the following:
import numpy as np

# my_array is the (numpy) array to be sampled from
assert N <= len(my_array)
indices = np.random.permutation(N)  # generates shuffled indices from 0 to N-1
sampled_array = my_array[indices]
If you need to sample the whole array and not just the first N positions, then you can use:
import random

sampled_array = my_array[random.sample(range(len(my_array)), N)]
The range for x and y is from 0 to 99.
I am currently doing it like this:
excludeFromTrainingSet = []
while len(excludeFromTrainingSet) < 4000:
    tempX = random.randint(0, 99)
    tempY = random.randint(0, 99)
    if [tempX, tempY] not in excludeFromTrainingSet:
        excludeFromTrainingSet.append([tempX, tempY])
But it takes ages and I really need to speed this up.
Any ideas?
Vincent Savard has an answer that's almost twice as fast as the first solution offered here.
Here's my take on it. It requires tuples instead of lists for hashability:
def method2(size):
    ret = set()
    while len(ret) < size:
        ret.add((random.randint(0, 99), random.randint(0, 99)))
    return ret
Just make sure that the limit is sane, as other answerers have pointed out. For sane input, this is algorithmically better, O(n) as opposed to O(n^2), because of the set instead of the list. Also, Python is much more efficient at loading locals than globals, so always put this stuff in a function.
EDIT: Actually, I'm not sure that they're O(n) and O(n^2) respectively because of the probabilistic component, but the estimates are correct if n is taken as the number of unique elements that they see. Both will slow down as they approach the total number of available spaces. If you want a number of points that approaches the total number available, then you might be better off using:
import random
import itertools

def method2(size, min_, max_):
    range_ = range(min_, max_)
    points = itertools.product(range_, range_)
    return random.sample(list(points), size)
This will be a memory hog, but it is sure to be faster as the density of points increases, because it avoids looking at the same point more than once. Another option worth profiling (probably better than the last one) would be:
def method3(size, min_, max_):
    range_ = range(min_, max_)
    points = list(itertools.product(range_, range_))
    N = (max_ - min_) ** 2
    L = N - size
    i = 1
    while i <= L:
        del points[random.randint(0, N - i)]
        i += 1
    return points
My suggestion:
def method2(size):
    randints = range(0, 100)
    excludeFromTrainingSet = set()
    while len(excludeFromTrainingSet) < size:
        excludeFromTrainingSet.add((random.choice(randints), random.choice(randints)))
    return excludeFromTrainingSet
Instead of generating 2 random numbers every time, you first generate the list of numbers from 0 to 99, then you choose 2 and add the pair to the set. As others pointed out, there are only 10 000 possibilities, so you can't loop until you get 40 000, but you get the point.
I'm sure someone is going to come in here with a usage of numpy, but how about using a set and tuple?
E.g.:
excludeFromTrainingSet = set()
while len(excludeFromTrainingSet) < 40000:
    temp = (random.randint(0, 99), random.randint(0, 99))
    if temp not in excludeFromTrainingSet:
        excludeFromTrainingSet.add(temp)
EDIT: Isn't this an infinite loop since there are only 100^2 = 10000 POSSIBLE results, and you're waiting until you get 40000?
Make a list of all possible (x,y) values:
allpairs = list((x, y) for x in xrange(100) for y in xrange(100))

# or with Py2.6 or later:
from itertools import product
allpairs = list(product(xrange(100), xrange(100)))

# or even taking DRY to the extreme
allpairs = list(product(*[xrange(100)] * 2))
Shuffle the list:
from random import shuffle
shuffle(allpairs)
Read off the first 'n' values:
n = 4000
trainingset = allpairs[:n]
This runs pretty snappily on my laptop.
You could make a lookup table of random values... make a random index into that lookup table, and then step through it with a static increment counter...
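One possible reading of that suggestion, as a sketch (the table layout and the step choice are my assumptions):

import random

# pre-shuffle every possible point once (the "lookup table"),
# then walk it from a random start with a fixed increment
table = [(x, y) for x in range(100) for y in range(100)]
random.shuffle(table)

start = random.randrange(len(table))
step = 1  # any step coprime with len(table) also keeps the points distinct
points = [table[(start + i * step) % len(table)] for i in range(4000)]
print(len(set(points)))  # 4000 distinct points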
Generating 40 thousand numbers inevitably takes a while. But you are performing an O(n) linear search on excludeFromTrainingSet, which takes quite a while, especially later in the process. Use a set instead. You could also consider generating a number of coordinate sets, e.g. overnight, and pickling them, so you don't have to generate new data for each test run (dunno what you're doing, so this might or might not help). Using tuples, as someone noted, is not only the semantically correct choice, it might also help with performance (tuple creation is faster than list creation). Edit: Silly me, using tuples is required when using sets, since set members must be hashable and lists are unhashable.
But in your case, your loop isn't terminating because 0..99 is 100 numbers and two-tuples of them have only 100^2 = 10000 unique combinations. Fix that, then apply the above.
Taking Vince Savard's code:
>>> from random import choice
>>> def method2(size):
...     randints = range(0, 100)
...     excludeFromTrainingSet = set()
...     while True:
...         x = size - len(excludeFromTrainingSet)
...         if not x:
...             break
...         else:
...             excludeFromTrainingSet.update(
...                 (choice(randints), choice(randints)) for _ in range(x))
...     return excludeFromTrainingSet
...
>>> s = method2(4000)
>>> len(s)
4000
This is not a great algorithm because it has to deal with collisions, but the tuple-generation makes it tolerable. This runs in about a second on my laptop.
## for py 3.0+
## generate 4000 points in 2D
##
import random

maxn = 10000
goodguys = 0
excluded = [0 for excl in range(0, maxn)]

for ntimes in range(0, maxn):
    alea = random.randint(0, maxn - 1)
    excluded[alea] += 1
    if excluded[alea] > 1:
        continue
    goodguys += 1
    if goodguys > 4000:
        break
    two_num = divmod(alea, 100)  # unfold the 2 numbers
    print(two_num)