Itertools combinations: How to make it faster? - python

I am coding a program that takes 54 (num1) numbers and puts them in a list. It then takes 16 (num2) of those numbers and builds a list containing lists of 16 numbers, chosen from all the possible combinations of "num1" choose "num2". It then takes those lists and generates 4x4 arrays.
The code I have works, but running it on 54 numbers to get all the arrays I want would take a long time. I know this because I have tested the code on 20 up to 40 numbers and timed it.
20 numbers = 0.000055 minutes
30 numbers = 0.045088 minutes
40 numbers = 17.46944 minutes
Using all 20 test data points I collected, I built a mathematical model to predict how long the 54-number run would take, and I get about 1740 minutes = 29 hours. This is already an improvement over a v1 of this code that was predicting 38 hours, and over a v0 that actually crashed my machine.
I am reaching out to try and make this run even faster. The program is not even RAM intensive: I have 8 GB of RAM and a Core i7 processor, and it does not slow down my machine at all. It actually runs very smoothly compared to previous versions, which even crashed my computer a few times.
Do you think there is a way? I am currently sampling to reduce processing time, but I would prefer not to sample at all if possible. I am also not printing the arrays, again to reduce processing time; I just print a counter to see how many combinations I generated.
This is the code:
import numpy as np
import itertools
from itertools import combinations
from itertools import islice
from random import sample

num1 = 30  # ideally this is 54, just using 30 now so you can test it.
num2 = 16
steps = 1454226  # represents 1% of "num1" choose "num2", just to reduce processing time for testing.

nums = list()
for i in range(1, num1 + 1):
    nums.append(i)
# print("nums: ", nums)  # Just to ensure that I am indeed using numbers from 1 to num1

vun = list()
tabl = list()
counter = 0
combin = islice(itertools.combinations(nums, num2), 0, None, steps)
for i in set(combin):
    vun.append(sample(i, num2))
    counter = counter + 1
    p1 = i[0]; p2 = i[1]; p3 = i[2]; p4 = i[3]; p5 = i[4]; p6 = i[5]; p7 = i[6]; p8 = i[7]; p9 = i[8]
    p10 = i[9]; p11 = i[10]; p12 = i[11]; p13 = i[12]; p14 = i[13]; p15 = i[14]; p16 = i[15]
    tun = np.array([(p1, p2, p3, p4), (p5, p6, p7, p8), (p9, p10, p11, p12), (p13, p14, p15, p16)])
    tabl.append(tun)
    # print("TABL:", tabl)
    # print("vun: ", vun)
print("combinations:", counter)
The output I get with this code is:
combinations: 101
Ideally, this number would be 2.109492366 × 10¹³, or at least 1% of that, as long as it runs the 54 choose 16 case and does not take 29 hours.

The main inefficiency comes from generating all the combinations (itertools.combinations(nums, num2)), only to throw most of them away.
Another approach would be to generate combinations at random, ensuring there are no duplicates.
import itertools
import random

def random_combination(iterable, r):
    "Random selection from itertools.combinations(iterable, r)"
    pool = tuple(iterable)
    n = len(pool)
    indices = sorted(random.sample(range(n), r))
    return tuple(pool[i] for i in indices)

items = list(range(1, 55))

samples = set()
while len(samples) < 100_000:
    sample = random_combination(items, 16)
    samples.add(sample)

for sample in samples:
    board = list(sample)
    random.shuffle(board)
    board = [board[0:4], board[4:8], board[8:12], board[12:16]]
print("done")
This uses the random_combination function from answers to this question, which in turn comes from the itertools documentation.
The code generates 100,000 unique 4x4 samples in about ten seconds, at least on my machine.
A few notes:
Each sample is a tuple and the entries are sorted; this means we can store them in a set and avoid duplicates.
Because of the first point, we shuffle each sample before creating a 4x4 board from it; the code above doesn't do anything further with these boards, but I wanted to include them to get a sense of the timing.
It's possible that there would be lots of hash collisions if you were to sample a large proportion of the space, but that's not feasible anyway because of the amount of data that would involve (see below).
I think there's been some confusion about what you are trying to achieve here.
54C16 = 2.1 x 10^13 ... to store 16 8-bit integers for all of these points would take 2.7 x 10^15 bits, which is 337.5 terabytes. That's beyond what could be stored on a local disk.
So, to cover even 1% of the space would take over 3TB ... maybe possible to store on disk at a push. You hinted in the question that you'd like to cover this proportion of the space. Clearly that's not going to happen in 8GB of RAM.
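A quick back-of-the-envelope check of those figures, using only the standard library (16 bytes per combination is just the sixteen 8-bit integers mentioned above):

import math

combos = math.comb(54, 16)           # 21,094,923,659,355 combinations
bytes_per_combo = 16                 # sixteen 8-bit integers per combination
total_tb = combos * bytes_per_combo / 1e12
print(total_tb)        # ~337.5 TB for the full space
print(total_tb / 100)  # ~3.4 TB for 1% of it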

Just calculating the number of combinations is trivial, since it's just a formula:
import math
math.comb(30, 16)
# 145422675
math.comb(54, 16)
# 21094923659355
The trouble is that storing the results of the 16-of-30 case requires about 64 GB of RAM on my machine. You might have that much RAM sitting around, but most people don't. The 16-of-54 case requires about 9.3 PB of RAM, which no modern architecture supports.
You're going to need to take one of two approaches:
1. Limit yourself to the 16-of-30 case, and don't store any results in vun or tabl.
   Pros: can be made to work in under 5 minutes in my testing.
   Cons: doesn't work for the 16-of-54 case at all, and no additional processing is practical.
2. Do a Monte Carlo simulation instead: generate randomized combinations up to some large but reachable sample count and do your math on those.
   Pros: fast, and supports both 16 of 30 and 16 of 54, potentially with the same time performance.
   Cons: results will have some random variation depending on the random seed, so they should be treated statistically to get confidence intervals.
Note: the formulas used for confidence intervals depend on which actual math you intend to do with these numbers; the average is a good place to start if you're only looking for an estimate.
I strongly suggest option (2), the Monte Carlo simulation; a sketch follows.
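A minimal sketch of what option (2) could look like, reusing the random_combination idea from the earlier answer; the per-board statistic here (the mean of each board's total) is only a placeholder for whatever math you actually need:

import random

def random_combination(pool, r):
    # r sorted indices into the pool, i.e. one random combination
    indices = sorted(random.sample(range(len(pool)), r))
    return tuple(pool[i] for i in indices)

nums = list(range(1, 55))       # the 54 numbers
samples = 100_000               # large but reachable sample count
running_sum = 0.0
for _ in range(samples):
    combo = random_combination(nums, 16)
    running_sum += sum(combo)   # replace with your real per-board computation
print("estimated mean board total:", running_sum / samples)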

Related

How do I make appending large vectors to list faster

I am trying to implement the skip-gram algorithm in plain numpy (not pytorch), which requires doing this calculation (skippable details at the end):

import numpy as np
from tqdm import tqdm

xy = []
for i in tqdm(range(100000)):
    for j in range(15):
        xy.append([np.zeros(50000), np.zeros(50000)])
So this is a huge structure of 15*100000*2*50000 elements of type numpy.float64. There are two issues here:
1. It takes a huge amount of memory (at least 3 GB), as explained above. In fact, I was never able to complete this calculation because of the second issue below, but it easily filled all of my laptop's RAM (16 GB total).
2. It takes a huge amount of time (at least a couple of hours), maybe because of the first issue.
I also tried to pre-generate xy filled with zeros, as follows:
count = 15*100000
xy = [[np.zeros(vocabSize), np.zeros(vocabSize)] for _ in range(count)]
But the moment I step over this second line in my debugger, my RAM fills up.
How can I deal with this?
Avoidable details
I am trying to implement the skip-gram algorithm, in which we have to prepare a list of skip-grams [target-word, context-words]. Each target word and context word is represented as a one-hot vector of size equal to the input vocabulary size (50000 above). 100000 above is the number of sentences in the data, and 15 is the average number of words per sentence.
PS
I have to implement this in plain Python + numpy, that is, without using any ML library like pytorch or tensorflow.
100000 sentences * 15 words per sentence * 50000 vocabulary size * 2 * 8 bytes per float64 = 1,200,000,000,000 bytes = 1200 GB. You cannot initialize this object in memory on your 16 GB laptop.
My guess is that you do not want to store the vectors for every sentence in memory, but instead run the algorithm sentence by sentence. In that case it would take about 12 MB of memory per sentence.
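A minimal sketch of what "sentence by sentence" could look like, assuming the corpus is already tokenized into lists of word indices (the window size and helper names here are made up for illustration):

import numpy as np

VOCAB_SIZE = 50_000  # vocabulary size from the question

def one_hot(index, size=VOCAB_SIZE):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def sentence_pairs(sentence_indices, window=2):
    # Yield (target, context) one-hot pairs for ONE sentence at a time,
    # so at most one sentence's worth of vectors (~12 MB) is needed at once
    # instead of ~1.2 TB for the whole corpus.
    for i, target in enumerate(sentence_indices):
        lo = max(0, i - window)
        hi = min(len(sentence_indices), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield one_hot(target), one_hot(sentence_indices[j])

# Example with made-up word indices for one sentence:
demo_sentence = [11, 42, 7, 1999, 3]
for x, y in sentence_pairs(demo_sentence):
    pass  # feed (x, y) into the skip-gram weight update here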

How to generate unique(!) arrays/lists/sequences of uniformly distributed random

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array equal to another?
Within one array the elements can be duplicates. The array just has to differ from the other arrays in at least one of its elements.
Is there any numpy method for this? Is there some special algorithm that works differently, by exploring some space for the random generation? I don't know.
One easy answer would be to write the arrays to a file and check whether they were generated already, but the I/O operations on an ever-growing file take far too much time.
This is a difficult request, since one of the properties of an RNG is that it will repeat values at random.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to build a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
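A minimal sketch of the disk-backed idea, using the standard library's sqlite3 in place of a hand-rolled on-disk tree (the table name and pack parameters are made up for illustration); the PRIMARY KEY makes the database do the seek-and-search work instead of holding every pack in RAM:

import random
import sqlite3

conn = sqlite3.connect("packs.db")
with conn:
    conn.execute("CREATE TABLE IF NOT EXISTS packs (key TEXT PRIMARY KEY)")

def try_insert(pack):
    key = ",".join(map(str, pack))
    try:
        with conn:
            conn.execute("INSERT INTO packs (key) VALUES (?)", (key,))
        return True           # a new, previously unseen pack
    except sqlite3.IntegrityError:
        return False          # this pack was generated before

random.seed(0)
stored = 0
while stored < 1000:
    pack = [random.randint(0, 999) for _ in range(10)]
    if try_insert(pack):
        stored += 1
print("stored", stored, "unique packs")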
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions", the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct output size. Unless you want 64-bit (DES) or 128-bit (AES) output, you will need some sort of format-preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits as highlighted by @Prune.
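A toy sketch of this idea: a keyed Feistel network over the pack space plus cycle-walking gives a bijection from counters 0, 1, 2, ... to packs, so distinct counters (below the total pack count) map to distinct packs. This is not real format-preserving encryption (the round function is just a hash, not a vetted cipher); it is purely an illustration:

import hashlib

def feistel_permute(x, bits, key, rounds=4):
    # Toy Feistel network over the domain [0, 2**bits); bits must be even.
    half = bits // 2
    mask = (1 << half) - 1
    left, right = x >> half, x & mask
    for r in range(rounds):
        digest = hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()
        f = int.from_bytes(digest[:8], "big") & mask
        left, right = right, left ^ f
    return (left << half) | right

def counter_to_pack(counter, max_value=9, pack_length=10, key="secret"):
    # Cycle-walk until the permuted value lands inside base**pack_length,
    # then write it out as pack_length base-(max_value+1) digits.
    base = max_value + 1
    limit = base ** pack_length
    bits = limit.bit_length()
    bits += bits % 2                  # Feistel needs an even bit count
    y = counter
    while True:
        y = feistel_permute(y, bits, key)
        if y < limit:
            break
    digits = []
    for _ in range(pack_length):
        y, d = divmod(y, base)
        digits.append(d)
    return digits[::-1]

for i in range(5):                    # encrypt the counters 0, 1, 2, ...
    print(counter_to_pack(i))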
Note that as the number of requested packs approaches the number of possible unique packs, it takes longer and longer to find a new pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)

## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
1. for a given amount of state and number of iterations, what's the probability of seeing at least one collision;
2. for a given amount of state, how many iterations can you run while staying below a given probability of seeing at least one collision;
3. for a given number of iterations and probability of seeing a collision, what state size is required.
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2*-log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
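For completeness, option 1 (the probability of a collision for a given state size and number of iterations) uses the same approximation rearranged; a small sketch with the numbers quoted above:

from math import expm1, log2

def collision_probability(iterations, state_bits):
    # Birthday approximation: p ~= 1 - exp(-n**2 / (2*N)) with N = 2**state_bits
    n = iterations
    N = 2 ** state_bits
    return -expm1(-n * n / (2 * N))

# A pack of 10 Python floats carries roughly 10 * 53 = 530 bits of state:
print(collision_probability(1e12, 530))             # effectively zero
# Versus 10 values with only ~250 possibilities each (~80 bits of state):
print(collision_probability(1e12, 10 * log2(250)))  # roughly 0.4-0.5, the ~50% ballpark above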

How do I make a list of random unique tuples?

I've looked over several answers similar to this question, and they all seem to have good one-liner answers that, however, only deal with making the list unique by removing duplicates. I need the list to contain exactly 5 tuples.
The only code I could come up with is as such:
from random import *
tuples = []
while len(tuples) < 5:
    rand = (randint(0, 6), randint(0, 6))
    if rand not in tuples:
        tuples.append(rand)
I feel like there is a simpler way but I can't figure it out. I tried playing with sample() from random:
sample((randint(0,6), randint(0,6)), 5)
But this gives me a "Sample larger than population or is negative" error.
One quick way is to use itertools.product to generate all tuple possibilities before using sample to choose 5 from them:
from itertools import product
from random import sample
sample(list(product(range(7), repeat=2)), k=5)
For such a small set of inputs, just generate all possible outputs, and sample them:
import itertools
import random
size = 6
random.sample(list(itertools.product(range(size+1), repeat=2)), 5)
You indicate that the bounds (size) may be a parameter though, and if the bounds might be even a little larger, this could be a problem (you'd be generating size ** 2 tuples to select 5 of them, and the memory usage could get out of control). If that's a problem, given you only need a pair of integers, there is a cheap trick: Choose one random integer that encodes both resulting integers, then decode it. For example:
size = 6
raw_sample = random.sample(range((size + 1) ** 2), 5)
decoded_sample = [divmod(x, size + 1) for x in raw_sample]
Since range is zero overhead (the memory usage doesn't depend on the length), you can select precisely five values from it with overhead proportionate to the five selected, not the 49 possible results. You then compute the quotient and remainder based on the range of a single value (0 to size inclusive in this case, so size + 1 possible values), and that gets the high and low results cheaply.
The performance differences are stark; comparing:
def unique_random_pairs_by_product(size):
    return random.sample(list(itertools.product(range(size+1), repeat=2)), 5)

to:

def unique_random_pairs_optimized(size):
    val_range = size + 1
    return [divmod(x, val_range) for x in random.sample(range(val_range * val_range), 5)]
the optimized version takes about 15% less time even for an argument of 6 (~4.65 μs for product, ~3.95 μs for optimized). But at size of 6, you're not seeing the scaling factor at all. For size=100, optimized only increases to ~4.35 μs (the time increasing slightly because the larger range is more likely to have to allocate new ints, instead of using the small int cache), while product jumps to 387 μs, a nearly 100x difference. And for size=1000, the time for product jumps to 63.8 ms, while optimized remains ~4.35 μs; a factor of 10,000x difference in runtime (and an even higher multiplier on memory usage). If size gets any larger than that, the product-based solution will quickly reach the point where the delay from even a single sampling is noticeable to humans; the optimized solution will continue to run with identical performance (modulo incredibly tiny differences in the cost of the divmod).
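The timings above can be reproduced with a small timeit harness along these lines (the repetition count and sizes are arbitrary; the two functions are re-declared so the snippet runs standalone):

import itertools
import random
import timeit

def unique_random_pairs_by_product(size):
    return random.sample(list(itertools.product(range(size + 1), repeat=2)), 5)

def unique_random_pairs_optimized(size):
    val_range = size + 1
    return [divmod(x, val_range) for x in random.sample(range(val_range * val_range), 5)]

for size in (6, 100, 1000):
    for fn in (unique_random_pairs_by_product, unique_random_pairs_optimized):
        reps = 200
        seconds = timeit.timeit(lambda: fn(size), number=reps)
        print(f"{fn.__name__}(size={size}): {seconds / reps * 1e6:.2f} us per call")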

Avoid generating duplicate values from random

I want to generate random numbers and store them in a list as the following:
alist = [random.randint(0, 2 ** mypower - 1) for _ in range(total)]
My concern is the following: I want to generate total = 40 million values in the range (0, 2 ** mypower - 1). If mypower = 64, then alist will be about 20 GB in size (40M*64*8), which is far too large for my laptop's memory. My idea is to iteratively generate chunks of values, say 5 million at a time, and save them to a file so that I don't have to generate all 40M values at once. My concern is: if I do that in a loop, is it guaranteed that random.randint(0, 2 ** mypower - 1) will not generate values that were already produced in a previous iteration? Something like this:
for i in range(num_of_chunks):
    alist = [random.randint(0, 2 ** mypower - 1) for _ in range(chunk)]
    # save to file
Well, since efficiency/speed doesn't matter, I think this will work:
s = set()
while len(s) < total:
    s.add(random.randint(0, 2 ** mypower - 1))
alist = list(s)

Since sets can only contain unique elements, I think this will work well enough.
To guarantee unique values you should avoid using random. Instead you should use an encryption. Because encryption is reversible, unique inputs guarantee unique outputs, given the same key. Encrypt the numbers 0, 1, 2, 3, ... and you will get guaranteed unique, random-seeming outputs back, provided you use a secure cipher. Good encryption is designed to give random-seeming output.
Keep track of the key (essential) and how far you have got. For your first batch encrypt integers 0..5,000,000. For the second batch encrypt 5,000,001..10,000,000 and so on.
You want 64 bit numbers, so use DES in ECB mode. DES is a 64-bit cipher, so the output from each encryption will be 64 bits. ECB mode does have a weakness, but that only applies with identical inputs. You are supplying unique inputs so the weakness is not relevant for your particular application.
If you need to regenerate the same numbers, just re-encrypt them with the same key. If you need a different set of random numbers (which will duplicate some from the first set) then use a different key. The guarantee of uniqueness only applies with a fixed key.
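A minimal sketch of this approach, assuming the third-party pycryptodome package is installed for the DES primitive (the key and batch size below are placeholders):

import struct
from Crypto.Cipher import DES  # pycryptodome

key = b"8bytekey"                      # keep this key to reproduce the same stream
cipher = DES.new(key, DES.MODE_ECB)    # DES has a 64-bit block, matching the 64-bit numbers

def unique_random_64bit(start, count):
    # Encrypting distinct counters with a fixed key yields distinct 64-bit outputs.
    for i in range(start, start + count):
        block = cipher.encrypt(struct.pack(">Q", i))   # counter as an 8-byte block
        yield int.from_bytes(block, "big")

# First batch: counters 0..4,999,999; the next batch would start at 5,000,000.
for value in unique_random_64bit(0, 5):
    print(value)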
One way to generate random values that don't repeat is first to create a list of contiguous values
l = list(range(1000))
then shuffle it:
import random
random.shuffle(l)
You could do that several times and save the results to a file, but the range you cover will be limited, since you'll never see the whole picture with your limited memory (it's like trying to sort a big list without having the memory for it).
As someone noted, getting a wide span of random numbers this way needs a lot of memory, so it's simple but not very efficient.
Another hack I just thought of: do the same as above, but generate the range using a step. Then, in a second pass, add a random offset to each value. Even if the offset values repeat, this is guaranteed never to generate the same number twice:
import random
step = 10
l = list(range(0,1000-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
With the required max value and number of iterations, that gives:
import random
number_of_iterations = 40*10**6
max_number = 2**64
step = max_number//number_of_iterations
l = list(range(0,max_number-step,step))
random.shuffle(l)
newlist = [x+random.randrange(0,step) for x in l]
print(len(newlist),len(set(newlist)))
runs in 1-2 minutes on my laptop, and gives 40000000 distinct values (evenly scattered across the range)
Usually random number generators are not really random at all; in fact, this is quite helpful in some situations. If you want the values to be different on each iteration, give the generator a different seed value.
random.seed()
The same seed will generate the same sequence, so if you want the next iteration to be the same, use the same seed; if you want it to be different, use a different seed.
This may need a lot of CPU and physical memory!
I suggest you classify your data.
For example, you can save:
all numbers starting with 10 that are 5 characters long (example: 10365) to 10-5.txt
all numbers starting with 11 that are 6 characters long (example: 114567) to 11-6.txt
Then, to check a new number:
For example, my number is 9256547.
It starts with 92 and is 7 characters long,
so I search 92-7.txt for this number, and if it isn't a duplicate, I add it to 92-7.txt.
Finally, you can join all the files together.
Sorry if I make mistakes; my main language isn't English.
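A small sketch of this bucketing scheme (the file layout and function names are made up for illustration); each number is routed to a file named after its first two digits and its length, and only that one file is scanned for duplicates:

import os
import random

def bucket_path(n, root="buckets"):
    # e.g. 9256547 -> buckets/92-7.txt (first two digits, then digit count)
    s = str(n)
    return os.path.join(root, f"{s[:2]}-{len(s)}.txt")

def add_if_new(n, root="buckets"):
    os.makedirs(root, exist_ok=True)
    path = bucket_path(n, root)
    seen = set()
    if os.path.exists(path):
        with open(path) as f:
            seen = {line.strip() for line in f}
    if str(n) in seen:
        return False              # duplicate, already stored
    with open(path, "a") as f:
        f.write(f"{n}\n")
    return True                   # new number, appended to its bucket

random.seed(0)
for _ in range(10):
    n = random.randint(10_000, 9_999_999)   # at least two leading digits
    print(n, add_if_new(n))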

Generate big random sequence of unique numbers [duplicate]

I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, and the ids should be unique and the order of records should be random (or pseudo-random).
I tried this:
# coding: utf-8
import random

COUNT = 100000000

random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()
But it's eating all of my memory.
Is there a way to generate a big shuffled sequence of consecutive (not necessarily, but it would be nice; otherwise just unique) integer numbers, using a generator and not keeping the whole sequence in RAM?
If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).
As DSM pointed out, this can be done with the standard modules in an efficient way:
>>> import array
>>> a = array.array('I', xrange(10**8)) # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random
>>> random.shuffle(a)
It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:
>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32') # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)
(this is only useful if your program already uses NumPy, as the standard module approach is about as efficient).
Both methods take about the same amount of time on my machine (maybe 1 minute for the shuffling), but the 0.5 GB they use is not too big for current computers.
PS: There are too many elements for the shuffling to be truly random, because there are far more possible permutations than the period of the random generator used. In other words, there are fewer Python shuffles than possible orderings!
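To put numbers on that PS: the Mersenne Twister behind random.shuffle has a period of 2**19937 - 1, i.e. roughly 19937 bits of state, while the number of orderings of 10**8 elements is astronomically larger. A quick check:

import math

# log2(100_000_000!) via the log-gamma function
log2_permutations = math.lgamma(10**8 + 1) / math.log(2)
print(log2_permutations)   # about 2.5e9 bits, versus 19937 bits of generator state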
Maybe something like this (the numbers won't be consecutive, but they will be unique):
from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative: yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num)  # etc...
You can fetch random ints easily by reading (on Linux) /dev/urandom, or by using os.urandom() and struct.unpack():
Return a string of n random bytes suitable for cryptographic use.
This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.
>>> import os, struct
>>> for i in range(4): print(hex(struct.unpack('<L', os.urandom(4))[0]))
...
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6
The random module documentation, on the other hand, says:
However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.
If you really need unique records, you should go with this or the answer provided by EOL.
But assuming a truly random source, with repeats allowed, you have a 1/N chance of hitting any particular item on the first guess (where N = 2 ** (sizeof(int) * 8) = 2 ** 32), so there are (2**32) ** length possible outputs.
On the other hand, when restricting yourself to unique values you have at most

(2**32) * (2**32 - 1) * ... * (2**32 - length + 1) = n! / (n - length)! = (2**32)! / (2**32 - length)!   with n = 2**32

possible outputs, where ! is factorial, not logical negation. So you only decrease the randomness of the result.
This one will keep your memory OK but will probably kill your disk :)
It generates a file containing the numbers from 0 up to COUNT, then randomly picks positions in it and writes those values to another file. The numbers have to be re-organized in the first file to "delete" the numbers that have already been chosen.
import random

COUNT = 100000000

# Feed the file with COUNT numbers as fixed-width 8-character records
with open('file1', 'w') as f:
    i = 0
    while i < COUNT:
        f.write("{0:08d}".format(i))
        i += 1

with open('file1', 'r+') as f1:
    i = COUNT - 1
    with open('file2', 'w') as f2:
        while i >= 0:
            f1.seek(i * 8)
            # Read the last value
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read a random position
            f1.seek(random_pos * 8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos * 8)
            f1.write(last_val)
            i -= 1

print("Done")
