Generate all possibilities of a set of alphanumeric with custom conditions - python

I want to generate all possible strings of length 64 over the alphanumeric set "abcdef0123456789".
I know that the number of possibilities is petabytes in size, so I want to limit it with 3 conditions:
no letter or number should be used more than 8 times in a single string,
ex: aaababaabaab not acceptable
aaabaaabaa is acceptable
there are no more than 3 adjacent identical letters or numbers
ex: aaaab not acceptable
aaaba is acceptable
the option of excluding 1 or 2 characters from the set
ex: "bcdef0123456789" with no a
I have searched over the internet for 3 days and never got an answer, could you please help?
Thank you in advance

The internet doesn't have an answer to every concrete question :) You need to break your problem into small pieces and search for those.
If you want to solve this kind of problem, you might want to look up backtracking. It's a generic way to solve constraint problems. You recurse: start with an empty string and, at each step, try to add a character to it. If one of the rules is broken, you return out of the recursion.
Anyway, I don't think it's feasible; my educated guess is that the number of possibilities is much, much more than a "petabyte".
>>> 16 ** 64
115792089237316195423570985008687907853269984665640564039457584007913129639936
Of course your constraints will reduce this, but I don't think by more than a few digits.
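The backtracking idea described in this answer can be sketched roughly as follows. This is only an illustration, not a way to actually enumerate the full length-64 space: the names generate, max_count and max_run are made up for the example, and the third condition (excluding characters) is handled simply by passing a smaller alphabet.

```python
from collections import Counter
from itertools import islice


def generate(alphabet, length, max_count=8, max_run=3):
    """Lazily yield every string of `length` over `alphabet` that obeys both limits."""
    counts = Counter()  # how often each character is used in the current prefix
    chars = []          # the current prefix, built one character at a time

    def extend(run_len):
        # run_len is the length of the run of identical characters ending the prefix
        if len(chars) == length:
            yield "".join(chars)
            return
        for ch in alphabet:
            if counts[ch] >= max_count:
                continue  # rule 1 would be broken: character used too often
            new_run = run_len + 1 if chars and chars[-1] == ch else 1
            if new_run > max_run:
                continue  # rule 2 would be broken: run of identical characters too long
            chars.append(ch)
            counts[ch] += 1
            yield from extend(new_run)
            chars.pop()   # backtrack and try the next character
            counts[ch] -= 1

    yield from extend(0)


# Rule 3: exclude characters by shrinking the alphabet before generating.
alphabet = "abcdef0123456789".replace("a", "")
first_three = list(islice(generate(alphabet, 64), 3))
```

Because generate is a generator, islice can pull a handful of valid strings without ever materialising the whole set, which is the only practical way to use it at length 64.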

Related

Shorten long small-alphabet string using larger alphabet

I have a set of ~100 long (between 120 and 150 characters) strings encoded using a 20 letter alphabet (the natural amino acid alphabet). I'm using them in database entries, but they're cumbersome. I'd like to shorten (not compressing, because I don't care about the memory size) them to make them easier to:
Visually compare
Copy/Paste
Manually enter
I was hoping a feasible way to shorten them would be to convert the string to a larger alphabet. Specifically, the set of single digits, as well as the lower and upper case alphabet.
For example:
# given some long string as input
shorten("ACTRYP...TW")
# returns something shorter like "a3A4n"
Possible approaches
From my elementary understanding of compression, this could be accomplished naively by making a lookup dictionary which maps certain repeating sequences to elements of the larger alphabet.
Related Question
This question seemed to be pointing in a similar direction, but was working with the DNA alphabet and seemed to be actually seeking compression.
As suggested by #thethiny, a combination of hashing and base32 encoding can accomplish the shortening desired:
import base64
import hashlib
kinda_long = "ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYYQVLSLNRGVICFLKVSTSVWSYESAAGFTMSGSAQYDYNVSGKANRSDMPTAFDVSGA"
shorter = base64.b32encode(hashlib.sha256(kinda_long.encode()).digest()).decode().strip("=")
My original question mentioned using ASCII alphabet and digits. This would be a base 62 encoding. Various libraries exist for this.
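As a rough illustration of the base 62 idea without a third-party library, the sketch below hashes the sequence and writes part of the digest in base 62. The helper names base62 and shorten, the digit ordering of the alphabet, and the choice to keep only the first 8 bytes of the digest are all assumptions made for this example. Note that, like the hashing approach above, this is one-way: the original sequence cannot be recovered from the short form.

```python
import hashlib
import string

# 0-9, a-z, A-Z: the 62-symbol alphabet (this ordering is an arbitrary choice)
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62(n):
    """Write a non-negative integer using the 62-symbol alphabet."""
    if n == 0:
        return BASE62[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(BASE62[r])
    return "".join(reversed(out))


def shorten(seq, nbytes=8):
    """Hypothetical helper: hash `seq`, keep `nbytes` of the digest, encode in base 62."""
    digest = hashlib.sha256(seq.encode()).digest()[:nbytes]
    return base62(int.from_bytes(digest, "big"))


short_id = shorten("ELYWPSRVESGTLVGYQYGRAITGQGKTSGGGSGWLGGGLRLSALELSGKTFSCDQAYY")
```

Keeping fewer digest bytes gives shorter identifiers but raises the chance that two different sequences collide; with about 100 strings, 8 bytes leaves that chance negligible.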

How to change numbers in a number

I'm currently trying to learn python.
Suppose there was a number n = 12345.
How would one go about changing every digit, iterating the first spot over (1-9) and every spot after it over (0-9)?
I'm only just learning python, so I apologize for any syntax errors that might follow.
Here are my last few attempts/ideas for a skeleton of the code.
define the function
turn n into string
start with a for loop that for i in n range(0,9) for i[1]
else range(10)
Basically how does one fix a number while changing the others?
Please don't give solution just hints I enjoy the thinking process.
For example if n =29 the program could check
19,39,49,59,69,79,89,99
and
21,22,23,24,25,26,27,28
Although you are new, the process is far easier than you think.
You want to make that change to every digit of the number (let's say n=7382). But you cannot iterate over a number, nor change specific digits of it directly: you can only iterate over iterables (like lists). A string is an iterable. If you find a way to do an int-to-str conversion, you can iterate over every digit and then print the new number.
But how do you change only the digit you are currently on? Again, the way is to reuse the conversion (saving it into a variable before the loop keeps things DRY) and to take substrings that contain every digit except the current one. There are two ways of locating the current digit:
1. You search for that specific value and get its index (bad).
2. You enumerate the loop (good).
Why is 2 good? Because you get the real position of the digit being changed (searching for the index of 7 in 75487 would not work when you reach the last digit, because it would return the position of the first 7). Search for a way to iterate over the items in a loop together with their index.
The easiest way to get a substring in Python is slicing. You slice twice: once to get all the digits before the current one, and once to get all the digits after it. Then you just join those two strings around the replacement digit and you are done.
I hope I didn't make it too easy for you, but it is hard to give hints for a task as simple as that.
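For readers who want to compare against their own attempt, the hints above (string conversion, enumerate, two slices) could be put together along these lines. The function name digit_variants is made up, and the sketch assumes the rule stated in the question: the first position cycles through 1-9 and every later position through 0-9, skipping the digit that is already there.

```python
def digit_variants(n):
    """Yield every number obtained by changing exactly one digit of n."""
    s = str(n)  # convert once, before the loop
    for i, current in enumerate(s):
        # the leading digit may not become 0; every other position may
        candidates = "123456789" if i == 0 else "0123456789"
        for c in candidates:
            if c != current:
                # digits before position i + replacement + digits after position i
                yield int(s[:i] + c + s[i + 1:])


variants = list(digit_variants(29))
```

For n = 29 this produces the eight numbers with a changed first digit followed by the nine with a changed second digit, 20 included.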

Formatting string to certain character limit in Python

I have some code that outputs a whole bunch of numbers after doing some maths on them. At one point in the code they are rounded off with numpy.rint, and in certain cases (I believe when a 9 is rounded to a 10) I end up with a trailing zero that I do not want. I have some code that looks sort-of like this
ra3n = ra3/60 * 10
ra3n = np.rint(ra3n)
ra3n = ra3n.astype(str) ##there is a good reason that this needs to be a string
I need all of the resulting ra3n to be 5 characters long, but occasionally one pops out as 6 characters long. How would I format this properly? Keep in mind I'm a total python noob, so I might need it spelled out for me =)
EDIT:
Here's my output:
00244-2451
00244-2702
00278-0629
00286-1614
00295-1101
002910-0546
00303+0711
00305+2246
00348+2604
003410+0423
00355-0204
00359+1236
00360-0931
00386-1210
The instances where there are six digits instead of 5 in the first half of the string are the erroneous ones; those trailing zeroes should not be there.
ra3n = ra3n[:-1] if ra3n[-1] == '0' else ra3n
There's probably a better solution, but I'm not sure I really understand your issue without seeing some output.
You change the type of ra3n, which is poor programming practice. Try this.
ra3n = format(ra3/60.*10., '5f')[:5]
This gives exactly five characters. Note that if the string would usually be six characters long, this cuts off the last character, for good or for bad. Note also that I included decimal points in the 60 and 10 numbers: this guarantees that floating-point division will be used, rather than integer division if this is done in Python 2.
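To make the truncation caveat concrete, here is a small sketch of what the '5f' format followed by [:5] does to a few sample values. The wrapper name five_chars and the sample numbers are invented for illustration; they are not taken from the asker's data.

```python
def five_chars(value):
    # '5f' means: minimum width 5, fixed-point notation with the default 6 decimals,
    # so the formatted text is always at least 8 characters before slicing
    return format(value, '5f')[:5]


print(five_chars(2.44))      # 2.440
print(five_chars(29.1))      # 29.10
print(five_chars(123456.7))  # 12345  (the magnitude is silently lost)
```

The last line shows why the cut can be "for bad": once the integer part alone exceeds five characters, the slice drops significant digits rather than decimals.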

How can you parallelize a regex search of one long string? [duplicate]

This question already has answers here:
How can I tell if a string repeats itself in Python?
(13 answers)
Closed 7 years ago.
I'm testing the output of a simulation to see if it enters a loop at some point, so I need to know if the output repeats itself. For example, there may be 400 digits, followed by a 400000 digit cycle. The output consists only of digits from 0-9. I have the following regex function that I'm using to match repetitions in a single long string:
import re

def repetitions(s):
    # Lazily match the shortest unit that is immediately repeated at least once.
    r = re.compile(r"(.+?)\1+")
    for match in r.finditer(s):
        unit, count = match.group(1), len(match.group(0)) // len(match.group(1))
        if len(unit) > 1 and count > 4:
            yield (unit, count)
This function works fantastically, but it takes far too long. My most recent test was 4 million digits, and it took 4.5 hours to search. It found no repetitions, so I now need to increase the search space. The code only concerns itself with subsequences that repeat themselves more than 4 times because I'm considering 5 repetitions to give a set that can be checked manually: the simulation will generate subsequences that will repeat hundreds of times. I'm running on a four core machine, and the digits to be checked are generated in real time. How can I increase the speed of the search?
Based on information given by nhahtdh in one of the other answers, some things have come to light.
First, the problem you are posing is called finding "tandem repeats" or "squares".
Second, the algorithm given in http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf finds z tandem repeats in O(n log n + z) time and is "optimal" in the sense that there can be that many answers. You may be able to parallelize the tandem searches, but I'd first do timings with the simple-minded approach and divide by 4 to see if that is in the speed range you expect.
Also, in order to use this approach you are going to need O(n) space to store the suffix tree. So if you have on the order of 400,000 digits, you are going to need on the order of 400,000 units of time to build and 400,000 bytes to store this suffix tree.
I am not totally sure what is meant by searching in "real time"; I usually think of it as a hard limit on how long an operation can take. If that's the case, then that's not going to happen here. This algorithm needs to read in the entire input string and process it before you start to get results. In that sense, it is what's called an "off-line" algorithm.
http://web.cs.ucdavis.edu/~gusfield/strmat.html has C code that you can download. (In tar file strmat.tar.gz look for repeats_tandem.c and repeats_tandem.h).
In light of the above, if that algorithm isn't sufficiently fast or space efficient, I'd look for ways to change or narrow the problem. Maybe you only need a fixed number of answers (e.g. up to 5)? If the cycles are a result of executing statements in a program, given that programming languages (other than assembler) don't have arbitrary "goto" statements, it's possible that this narrows the kinds of cycles that can occur, and making use of that structure might offer a way to speed things up.
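The advice to time the simple-minded approach first could be followed with something like the sketch below. It re-states the regex search from the question so that it runs on its own, and it assumes random digit strings are a fair stand-in for the simulation output, which may not hold. The names time_search and min_repeats are made up for this example.

```python
import random
import re
import time


def repetitions(s, min_repeats=5):
    """Regex search from the question: units longer than 1 repeated at least min_repeats times."""
    pattern = re.compile(r"(.+?)\1+")
    for match in pattern.finditer(s):
        unit = match.group(1)
        count = len(match.group(0)) // len(unit)
        if len(unit) > 1 and count >= min_repeats:
            yield unit, count


def time_search(n, seed=0):
    """Time the search on n pseudo-random digits; returns (seconds, matches found)."""
    rng = random.Random(seed)
    digits = "".join(rng.choice("0123456789") for _ in range(n))
    start = time.perf_counter()
    found = list(repetitions(digits))
    return time.perf_counter() - start, found


# Grow the input and watch how the elapsed time scales before committing to 4 million digits.
for n in (500, 1000, 2000):
    elapsed, _ = time_search(n)
    print(n, round(elapsed, 4))
```

If the time roughly quadruples each time n doubles, extrapolating to the real input size shows quickly whether dividing by four cores could ever be enough.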
When one algorithm is too slow, switch algorithms.
If you are looking for repeating strings, you might consider using a suffix tree scheme: https://en.wikipedia.org/wiki/Suffix_tree
This will find common substrings for you in linear time.
EDIT: #nhahtdh in a comment below has referenced a paper that tells you how to pick out all z tandem repeats very quickly. If somebody upvotes my answer, #nhahtdh should logically get some of the credit.
I haven't tried it, but I'd guess that you might be able to parallelize the construction of the suffix tree itself.
I'm sure there's room for optimization, but test this algorithm on shorter strings to see how it compares to your current solution:
def partial_repeat(string):
    l = len(string)
    for i in range(2, l//2+1):
        s = string[0:i]
        multi = l//i-1
        factor = l//(i-1)
        ls = len(s)
        if s*(multi) == string[:ls*(multi)] and len(string)-len(string[:ls*factor]) <= ls and s*2 in string:
            return s
>>> test_string
'abc1231231231231'
>>> results = {x for x in (partial_repeat(test_string[i:]) for i in range(len(test_string))) if x}
>>> sorted(sorted(results, key=test_string.index), key=test_string.count, reverse=True)[0]
'123'
In this test string, it's unclear whether the non-repeating initial characters are 'abc' or 'abc1', so the repeating string could be either '123' or '231'. The above sorts each found substring by its earliest appearance in the test string, sorts again (sorted() is a stable sort) by the highest frequency, and takes the top result.
With standard loops and min() instead of comprehensions and sorted():
>>> g = {partial_repeat(test_string[i:]) for i in range(len(test_string))}
>>> results = set()
>>> for x in g:
...     if x and (not results or test_string.count(x) >= min(map(test_string.count, results))):
...         results.add(x)
...
>>> min(results, key=test_string.index)
'123'
I tested these solutions with the test string 'abc123123a' multiplied by n for n in range(100, 10101, 500) to get some timing data. I entered these data into Excel and used its FORECAST() function to estimate the processing time of a 4-million character string at 430 seconds, or about seven minutes.

Variable Length Needle in Haystack (Python)

I have a function designed to find errors in an application's search capabilities, which generates a variable-length search string from the non-control UTF-8 possibilities. Running pytest iterations on this function, the random UTF-8 strings, submitted for search, generate debug errors roughly once per 500 searches.
As I can grab each of the strings that caused an error, I want to determine what is the minimal sub-series of the characters in those strings which truly provoke the error. In other words, (inside of a pytest loop):
def fumble_towards_ecstasy(string_that_breaks):
    # iterate over both length and content of the string
    nugget = ...  # placeholder: minimum series of characters that break the search
    return nugget
Should I slice the string in half and whittle down each side and re-submit until it fails, choose random characters from its (len() - 1) and then back up if an error doesn't happen? Brute force combinatorial? What's the best way to step through this?
Thanks.
Splitting the string in half will fail if there is a two character sequence that causes the failure, and that sequence lies exactly in the middle. Each half succeeds, but the combined string fails.
Here's one algorithm that will find a local minimum:
Try removing each character in turn.
If removing the character still causes failure, keep the new shorter string and repeat the algorithm on this new string.
If removing the character no longer causes failure, put it back and try removing the next character. Keep going until there are no more characters left to try. When you reach the end of the string you know that removing any one character causes the search to succeed.
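The steps above can be sketched as a short loop. Here fails stands for a hypothetical predicate that submits a string to the search and returns True when the error still occurs; it and the name minimize_by_deletion are assumptions for the example, not part of the asker's code.

```python
def minimize_by_deletion(s, fails):
    """Greedy local minimum: keep deleting single characters while `fails` stays true."""
    i = 0
    while i < len(s):
        candidate = s[:i] + s[i + 1:]  # the string with character i removed
        if fails(candidate):
            s = candidate  # still breaks: keep the shorter string ...
            i = 0          # ... and repeat the algorithm on it from the start
        else:
            i += 1         # this character is needed: put it back, try the next one
    return s


# Stand-in predicate for the demo: the "search" breaks whenever the text contains "ab".
nugget = minimize_by_deletion("xxabyy", lambda text: "ab" in text)
```

Restarting from the first character after every successful deletion costs extra calls to fails, but it is what guarantees that, at the end, removing any single character makes the error disappear.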
I'd use a "whittle from both sides" approach. Splitting the string will always run the risk of breaking up the substring that was causing the error. My approach would be:
Pop as many characters off the left of the string as you can while still ensuring that the string causes an error.
Do the same to the right side.
You're left with - in theory - the minimal substring that causes the error.
Hope that helps!
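A minimal sketch of this "whittle from both sides" idea, again assuming a hypothetical fails predicate that returns True while the string still triggers the error (the name whittle is also invented here):

```python
def whittle(s, fails):
    """Trim from the left, then from the right, while the string still fails."""
    # Pop characters off the left for as long as the remainder still causes the error.
    while len(s) > 1 and fails(s[1:]):
        s = s[1:]
    # Then do the same on the right side.
    while len(s) > 1 and fails(s[:-1]):
        s = s[:-1]
    return s


# Stand-in predicate for the demo: the "search" breaks whenever the text contains "ab".
core = whittle("xxabyy", lambda text: "ab" in text)
```

This only ever returns a contiguous substring, so it cannot drop harmless characters that sit between two characters which are both needed to provoke the error.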
First of all it's worth noting that the solution is possibly not unique, i.e. it may be the case that there are two or more broken substrings.
An alternate suggestion (to the good answers by both Xavier and Mark) is to use a recursive approach. Repeat the sampling with the limited subset of strings that caused the error. Once another error is found, repeat until a minimal substring is reached. This approach is robust enough to handle a more complex use case, where the error can exist in two non-adjacent entries. I don't think that is the case here, but it's nice to have a general-purpose method.
