Generate all possible sequences of alternating digits and letters - python

I want to generate all possible sequences of alternating digits and lowercase letters. For example:
5j1c6l2d4p9a9h9q
6d5m7w4c8h7z4s0i
3z0v5w1f3r6b2b1z
i.e. the pattern is NumberSmallletter repeated eight times (digit, lowercase letter, digit, lowercase letter, ... for 16 characters).
I can do it using 16 nested loops, but that would take 30+ hours (rough estimate). Is there a more efficient way? I hope there is one in Python.

You can use itertools.product to generate all of the 16-character cases:
import string, itertools
i = itertools.product(string.digits, string.ascii_lowercase, repeat=8)
j = (''.join(p) for p in i)
As i is an iterator of tuples, we need to convert each tuple to a string (so they are in the format that you want). This is straightforward: we pass the iterator through a generator expression that joins each tuple's elements into one string.
We can see that the iterator (j) is working by calling next() on it a couple of times:
>>> next(j)
'0a0a0a0a0a0a0a0a'
>>> next(j)
'0a0a0a0a0a0a0a0b'
>>> next(j)
'0a0a0a0a0a0a0a0c'
>>> next(j)
'0a0a0a0a0a0a0a0d'
>>> next(j)
'0a0a0a0a0a0a0a0e'
>>> next(j)
'0a0a0a0a0a0a0a0f'
>>> next(j)
'0a0a0a0a0a0a0a0g'

There is no "efficient" way to do this. There are 2.8242954e+19 different possible combonations, or 28,242,954,000,000,000,000. If each combination is 16 characters long, storing this all in a raw text file would take up 451,887,264,000 gigabytes, 441,296,156.25 terabytes, 430,953.2775878906 petabytes, or 420.8528101444 exabytes. The largest hard drive available to the average consumer is 16TB (Samsung PM1633a). They cost 12 thousand US dollars. This puts the total cost of storing all of this data to 330,972,117,600 US dollars (3677.46797 times Bill Gates' net worth). Even ignoring the amount of space all of these drives would take up, and ignoring the cost of the hardware you would need to connect them to, and assuming that they could all be running at highest performance all together in a lossless RAID array, this would make the write speed 330,972,118 gigabytes a second. Sounds like a lot, doesn't it? Even with that write speed, the file would take 22 minutes to write, assuming that there were no bottlenecks from CPU power, RAM speed, or the RAID controller itself.
Sources - a calculator.
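As a quick sanity check of those figures, the count and the raw storage size can be reproduced in a couple of lines (a back-of-envelope sketch only, not part of any solution):

# Back-of-envelope: how many 16-character strings, and how much raw text?
combos = (10 * 26) ** 8                    # 8 digit/letter pairs
raw_bytes = combos * 16                    # 16 characters, 1 byte each, no separators
print(f"{combos:,} combinations")          # 20,882,706,457,600,000,000
print(f"{raw_bytes / 1e18:.0f} exabytes")  # ~334 exabytes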

import sys

n = 5
ans = [0] * n
chars = ['a', 'b', 'A', 'Z', '0', '1']

def rec(pos, prev):
    if pos == n:
        # A complete sequence has been built; print it
        for i in range(n):
            sys.stdout.write(str(ans[i]))
        print('')
        return
    for c in chars:
        if c != prev:          # never place the same character twice in a row
            ans[pos] = c
            rec(pos + 1, c)

for c in chars:
    ans[0] = c
    rec(1, c)
The basic idea is backtracking. It is still slow, but the code is short. You can change the character set in chars and the sequence length n.
If the code is unclear, try tracing it by hand on a few small cases.
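If you specifically need the digit/letter alternation from the question, the candidate set can be chosen from the position's parity instead (a sketch along the same lines; note it still enumerates an astronomically large number of sequences):

import string

n = 16
digits = string.digits
letters = string.ascii_lowercase
ans = [''] * n

def rec(pos):
    if pos == n:
        print(''.join(ans))
        return
    # Even positions get a digit, odd positions a lowercase letter
    for c in (digits if pos % 2 == 0 else letters):
        ans[pos] = c
        rec(pos + 1)

rec(0)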

Related

Find all permutations of a string that is variable only at specific positions in Python

I have a DNA sequence which is variable only at specific locations and need to find all possible scenarios:
DNA_seq='ANGK' #N can be T or C and K can be A or G
N=['T','C']
K=['A','G']
Results:
['ATGA','ATGG','ACGA','ACGG']
The solution offered by @vladimir works perfectly for simple cases like the example above, but for complicated scenarios like the one below it quickly runs out of memory. For the example below, even running with 120 GB of memory ended with an out-of-memory error. This is surprising because I assumed the total number of combinations would be around 500K 33-bp strings, which should not consume more than 100 GB of RAM. Are my assumptions wrong? Any suggestions?
N=['A','T','C','G']
K=['G','T']
dev_seq=[f'{N1}{N2}{K1}{N3}{N4}{K2}{N5}{N6}{K3}TCC{N7}{N8}{K4}CTG{N9}{N10}{K5}CTG{N11}{N12}{K6}{N13}{N14}{K7}{N15}{N16}{K8}' for \
N1,N2,K1,N3,N4,K2,N5,N6,K3,N7,N8,K4,N9,N10,K5,N11,N12,K6,N13,N14,K7,N15,N16,K8 in \
product(N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K,N,N,K)]
Use itertools.product:
from itertools import product
result = [f'A{n}G{k}' for n, k in product(N, K)]
Result:
['ATGA', 'ATGG', 'ACGA', 'ACGG']
EDIT
If you don't want to store the whole list in memory at one time, and would rather process the strings sequentially as they come, you can use a generator:
g = (f'A{n}G{k}' for n, k in product(N, K))
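For the larger template in the question, the same approach works; note that it actually has 4**16 * 2**8, roughly 1.1e+12, combinations (not ~500K), which is why even 120 GB of RAM is not enough to hold the full list. A generator expression lets you process the strings one at a time instead (a sketch based on the question's N and K sets):

from itertools import product

N = ['A', 'T', 'C', 'G']
K = ['G', 'T']

# The fixed TCC / CTG / CTG segments from the question are interleaved with the variable positions.
dev_seq_gen = (
    f'{n1}{n2}{k1}{n3}{n4}{k2}{n5}{n6}{k3}TCC{n7}{n8}{k4}CTG{n9}{n10}{k5}CTG'
    f'{n11}{n12}{k6}{n13}{n14}{k7}{n15}{n16}{k8}'
    for n1, n2, k1, n3, n4, k2, n5, n6, k3, n7, n8, k4, n9, n10, k5,
        n11, n12, k6, n13, n14, k7, n15, n16, k8
    in product(N, N, K, N, N, K, N, N, K, N, N, K, N, N, K, N, N, K, N, N, K, N, N, K)
)

for seq in dev_seq_gen:
    pass  # process each 33-character string here without ever building the full list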

why is my iterator implementation very inefficient?

I wrote the following python script to count the number of occurrences of a character (a) in the first n characters of an infinite string.
from itertools import cycle
def count_a(str_, n):
    count = 0
    str_ = cycle(str_)
    for i in range(n):
        if next(str_) == 'a':
            count += 1
    return count
My understanding of iterators is that they are supposed to be efficient, but this approach is super slow for very large n. Why is this so?
The cycle iterator might not be as efficient as you think; the documentation says:
Make an iterator returning elements from the iterable and saving a
copy of each.
When the iterable is exhausted, return elements from the saved copy.
Repeats indefinitely
...Note, this member of the toolkit may require significant auxiliary
storage (depending on the length of the iterable).
Why not simplify and skip the iterator entirely? It adds unnecessary overhead and gives you no benefit here: as long as n does not exceed the length of str_, you can count the occurrences with a simple str_[:n].count('a').
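As a function, that suggestion is just the following (a minimal sketch; it assumes n does not exceed len(str_), and the general cyclic case is handled in the next answer):

def count_a(str_, n):
    # Direct count over the first n characters; valid while n <= len(str_)
    return str_[:n].count('a')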
The first problem here is that, despite using itertools, you are still running an explicit Python-level for loop. To get the C-level speed boost from itertools you want to keep all the iteration inside the itertools machinery.
So let's do this step by step. First we want to take the first n characters of the (conceptually infinite) cycled string. For this you can use itertools.islice:
str_first_n_chars = islice(cycle(str_), n)
Next you want to count the occurrences of the letter a. To do this you can use some variation of either of the following (you may want to experiment to see which variant is faster):
count_a = sum(1 for c in str_first_n_chars if c == 'a')
count_a = len(tuple(filter('a'.__eq__, str_first_n_chars)))
This is all well and good, but it is still slow for really large n, because you have to iterate through str_ many, many times, for example when n = 10**10000. In other words, this algorithm is O(n).
There is one last improvement we can make. Notice that the number of a's in str_ never changes from one cycle to the next. Rather than iterating through str_ many times for large n, we can be a bit smarter with a little math so that we only need to scan str_ twice. First we count the number of a's in a single copy of str_:
count_a_single = str_.count('a')
Then we find out how many full copies of str_ fit into length n, and what remains, using the divmod function:
iter_count, remainder = divmod(n, len(str_))
Then we can just multiply iter_count by count_a_single and add the number of a's in the remaining prefix. We don't need cycle or islice here because remainder < len(str_):
count_a = iter_count * count_a_single + str_[:remainder].count('a')
With this method, the runtime grows only with the length of a single cycle of str_, not with n. In other words, this algorithm is O(len(str_)).
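Putting those pieces together, a complete O(len(str_)) version might look like this (a sketch following the steps above):

def count_a(str_, n):
    count_a_single = str_.count('a')              # 'a's in one full copy of the string
    iter_count, remainder = divmod(n, len(str_))  # full copies in n, plus the leftover prefix
    return iter_count * count_a_single + str_[:remainder].count('a')

print(count_a('abcab', 12))  # the first 12 characters are 'abcababcabab' -> 5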

List taking up 13 gigs of ram with 127 mil entries: how?

I'm working through a programming challenge involving quick processing and large data. I'm trying to generate a list of possible permutations of a number range and then search through them.
Code:
import itertools

def generate_list(numA, numB):
    combo = list(range(0, numB))
    permutation_list = list(itertools.permutations(combo, numA))
    print("initial dictionary length: " + str(len(permutation_list)))
The problem is that when numA is 6 and numB is 25, my program slows down immensely and uses a huge amount of RAM; it peaked at around 13 GB. The list has around 127 million entries and each entry has length 6. That should put the usage at around 750 MB of memory, not 13 GB. What's going on?
Edit: The data is just numbers. So [[0,1,2,3,4,5],[0,1,2,3,4,6],...]
Each element of a list or a tuple is a pointer, and a pointer is either 4 or 8 bytes; the following assumes 8. Just counting the pointers in the list and in the tuples accounts for half of the space used. The rest is mostly the per-tuple object header, which is about 48 bytes. This yields the formula:
(48+8+(8*6)) * 127000000 == 13208000000
which is about your 13 gigabytes.
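You can check the per-tuple overhead directly with sys.getsizeof (a quick sketch; the exact numbers vary between Python versions and builds):

import sys

t = (0, 1, 2, 3, 4, 5)
print(sys.getsizeof(t))  # tuple object overhead plus six 8-byte pointers (exact value varies by version)
# Add ~8 bytes for the outer list's pointer to each tuple, times 127 million tuples:
print((sys.getsizeof(t) + 8) * 127_000_000 / 1e9, "GB")  # on the order of 12-14 GB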
I would suggest doing anything possible to avoid generating that complete permutation.
An example of how you might output the entire list of permutations might be as follows:
import itertools

def combo(b):
    for combination in range(0, b):
        yield combination

def generate_list(numA, numB):
    for l in itertools.permutations(combo(numB), numA):
        yield list(l), len(l)

if __name__ == '__main__':
    total_length = int()
    with open('permutations', 'w+') as f:
        f.write('[')
        for permutation in generate_list(6, 25):
            data, length = permutation
            total_length += length
            f.write(str(data) + ', ')
        f.write(']\n')
    print("initial dictionary length: " + str(total_length))
I've turned your code into two separate generators: one that yields the combinations, another that yields the permutations.
You can compute the entire thing without a MemoryError and write the results to a file (a very large file), or just print them to stdout; up to you.
It will also tell you the total length at the end, without requiring massive amounts of memory to do so.
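If the count is all you actually need at the end, you can compute it directly instead of enumerating anything: the number of permutations of numA items drawn from numB is numB! / (numB - numA)! (a sketch using only the standard library):

from math import factorial

def permutation_count(numA, numB):
    # Ordered selections of numA distinct items out of numB
    return factorial(numB) // factorial(numB - numA)

print(permutation_count(6, 25))  # 127512000, the ~127 million entries from the question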

Sequenced string PK's in python

I'm looking for an optimal string sequence generator that could be used as a user id generator. As far as I can tell, it should have the following features:
Length is restricted to 8 characters.
Consists of Latin letters and digits
Public and thus should be obfuscated
For now I came up with the following algorithm:
from itertools import izip

def idgen(begin, end):
    assert type(begin) is int
    assert type(end) is int
    allowed = reduce(
        lambda L, ri: L + map(chr, range(ord(ri[0]), ord(ri[1]) + 1)),
        (('a', 'z'), ('0', '9')), list()
    )
    shift = lambda c, i: allowed[(allowed.index(c) + i) % len(allowed)]
    for cur in xrange(begin, end):
        cur = str(cur).zfill(8)
        cur = izip(xrange(0, len(cur)), iter(cur))
        cur = ''.join([shift(c, i) for i, c in cur])
        yield cur
But it gives pretty similar ids. For 0-100, for example:
'01234567', '01234568', '01234569', '0123456a' etc.
So what are the best practices? I suppose URL shorteners must use some similar kind of algorithm?
Since you need "obfuscated" ids, you want something that looks random, but isn't. Fortunately, almost all computer generated randomness fulfills this criteria, including the one generated by the random python module.
You can generate a fixed sequence of numbers by setting the PRNG seed, like so:
import random
import string
valid_characters = string.lowercase + string.digits

def get_random_id():
    return ''.join(random.choice(valid_characters) for x in range(8))

random.seed(0)
for x in range(10):
    print get_random_id()
This will always print the same sequence of 10 "random" ids.
You probably want to generate ids on demand, instead of all at once. To do so, you need to persist the PRNG state:
random.setstate(get_persisted())   # get_persisted()/persist() stand in for your storage layer
get_random_id()
persist(random.getstate())
# repeat ad infinitum
This is somewhat expensive, so you'll want to generate a couple of random ids at once, and keep them on a queue. You could also use random.jumpahead, but I suggest you try both techniques and see which one is faster.
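For example, the get_persisted/persist placeholders above could be as simple as pickling random.getstate() to a file (a sketch only; the file name is made up and there is no locking):

import os
import pickle
import random

STATE_FILE = 'prng_state.pickle'  # hypothetical location for the persisted state

def load_state():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'rb') as f:
            random.setstate(pickle.load(f))
    else:
        random.seed(0)  # first run: start from a fixed seed

def save_state():
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(random.getstate(), f)

load_state()
new_id = get_random_id()  # from the snippet above
save_state()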
Avoiding collisions for this task is not trivial. Even using hash functions will not save you from collisions.
The random module's PRNG has a period of 2**19937-1. You need a period of at least (26+10)**8, which is about 2**41. However, the fact that the period you need is lower than the period of the PRNG does not guarantee that there will be no collisions, because the size of your output (8 characters, i.e. 64 bits as a byte string) may not match the output size of the PRNG. Also, depending on the particular implementation of random.choice, even if the sizes matched, it may simply advance the PRNG by several steps on each call.
You could try to iterate over the keyspace, checking if something repeats:
random.seed(0)
space = len(valid_characters)**8
found = set()
x = 0
while x < space:
    id = get_random_id()
    if id in found:
        print "cycle found after", x, "iterations"
    found.add(id)
    if not x % 1000000:
        print "progress:", (float(x)/space)*100, "%"
    x += 1
But this will take a long, long while.
You could also handcraft a PRNG with the desired period, but that's a bit out of scope for Stack Overflow; you could try the Math Stack Exchange.
Ultimately, and depending on your purpose, I think your best bet may be simply keeping track of already generated ids, and skipping repeated ones.
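A minimal version of that last suggestion (a sketch; the seen set grows with every issued id, so for large volumes you would replace it with a database table with a unique constraint):

seen = set()

def next_unique_id():
    while True:
        candidate = get_random_id()  # from the snippet above
        if candidate not in seen:
            seen.add(candidate)
            return candidate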

Generate big random sequence of unique numbers [duplicate]

This question already has answers here:
Unique (non-repeating) random numbers in O(1)?
(22 answers)
Closed 9 years ago.
I need to fill a file with a lot of records identified by a number (test data). The number of records is very big, and the ids should be unique and the order of records should be random (or pseudo-random).
I tried this:
# coding: utf-8
import random
COUNT = 100000000
random.seed(0)
file_1 = open('file1', 'w')
for i in random.sample(xrange(COUNT), COUNT):
    file_1.write('ID{0},A{0}\n'.format(i))
file_1.close()
But it's eating all of my memory.
Is there a way to generate a big shuffled sequence of consecutive (not necessarily but it would be nice, otherwise unique) integer numbers? Using a generator and not keeping all the sequence in RAM?
If you have 100 million numbers like in the question, then this is actually manageable in-memory (it takes about 0.5 GB).
As DSM pointed out, this can be done with the standard modules in an efficient way:
>>> import array
>>> a = array.array('I', xrange(10**8)) # a.itemsize indicates 4 bytes per element => about 0.5 GB
>>> import random
>>> random.shuffle(a)
It is also possible to use the third-party NumPy package, which is the standard Python tool for managing arrays in an efficient way:
>>> import numpy
>>> ids = numpy.arange(100000000, dtype='uint32') # 32 bits is enough for numbers up to about 4 billion
>>> numpy.random.shuffle(ids)
(this is only useful if your program already uses NumPy, as the standard module approach is about as efficient).
Both methods take about the same amount of time on my machine (maybe one minute for the shuffling), and the 0.5 GB they use is not too big for current computers.
PS: There are too many elements for the shuffling to be really random because there are way too many permutations possible, compared to the period of the random generators used. In other words, there are fewer Python shuffles than the number of possible shuffles!
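To put numbers on that, compare the entropy needed to select one ordering of 10**8 elements with the Mersenne Twister's 19937-bit state (a quick side calculation, not part of the shuffling code):

import math

n = 10**8
# log2(n!) computed via the log-gamma function, since n! itself is astronomically large
bits_needed = math.lgamma(n + 1) / math.log(2)
print(f"{bits_needed:.3e} bits needed vs 19937 bits of PRNG state")  # ~2.5e9 bits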
Maybe something like this (the values won't be consecutive, but they will be unique):
from uuid import uuid4

def unique_nums():  # Not strictly unique, but *practically* unique
    while True:
        yield int(uuid4().hex, 16)
        # alternative: yield uuid4().int

unique_num = unique_nums()
next(unique_num)
next(unique_num)  # etc...
You can fetch random ints easily by reading /dev/urandom (on Linux), or by using os.urandom() and struct.unpack():
Return a string of n random bytes suitable for cryptographic use.
This function returns random bytes from an OS-specific randomness source. The returned data should be unpredictable enough for cryptographic applications, though its exact quality depends on the OS implementation. On a UNIX-like system this will query /dev/urandom, and on Windows it will use CryptGenRandom. If a randomness source is not found, NotImplementedError will be raised.
>>> import os, struct
>>> for i in range(4): print( hex( struct.unpack('<L', os.urandom(4))[0]))
...
0xbd7b6def
0xd3ecf2e6
0xf570b955
0xe30babb6
The random module, on the other hand, warns:
However, being completely deterministic, it is not suitable for all purposes, and is completely unsuitable for cryptographic purposes.
If you really need unique records you should go with this, or with the answer provided by EOL.
But assuming a truly random source that allows repeated values, each draw can produce any of N = 2 ** (sizeof(int) * 8) = 2 ** 32 items, so a sequence of the given length has (2**32) ** length possible outputs.
On the other hand, when you restrict the output to unique values you have at most:
(2**32) * (2**32 - 1) * ... * (2**32 - length + 1) = n! / (n - length)! = (2**32)! / (2**32 - length)!
where ! is the factorial, not logical negation. So requiring uniqueness only decreases the randomness of the result.
This one will keep your memory usage OK but will probably kill your disk :)
It generates a file with the sequence of numbers from 0 to 100000000, then repeatedly picks a random position in it and writes that value to a second file. The numbers in the first file have to be reorganized as it goes to "delete" the ones that have already been chosen.
import random

COUNT = 100000000

# Feed the file
with open('file1', 'w') as f:
    i = 0
    while i <= COUNT:
        f.write("{0:08d}".format(i))
        i += 1

with open('file1', 'r+') as f1:
    i = COUNT
    with open('file2', 'w') as f2:
        while i >= 0:
            f1.seek(i*8)
            # Read the last val
            last_val = f1.read(8)
            random_pos = random.randint(0, i)
            # Read random pos
            f1.seek(random_pos*8)
            random_val = f1.read(8)
            f2.write('ID{0},A{0}\n'.format(random_val))
            # Write the last value to this position
            f1.seek(random_pos*8)
            f1.write(last_val)
            i -= 1

print "Done"
