Adding exceptions to Levenshtein-Distance-like algorithm - python

I'm trying to compute how similar two sequences of up to 6 variables are. Currently I'm using a collections.Counter to get the frequency of the different variables and using that as my edit distance.
By default, the distance in editing a variable (add/sub/change) is either 1 or 0. I'd like to change the distance depending on the variable and what value I set for that variable.
So I can say certain variables are similar to other variables, and provide a value for how similar they are.
I also want to say certain variables are worth less or more distance than usual.
Here is my previous post as context: Modify Levenshtein-Distance to ignore order
Example:
# 'c' and 'k' are quite similar, so their distance from each other is 0.5 instead of 1
>>> groups = {('c', 'k'): 0.5}
# the letter 'e' is less significant, and 'x' is very significant
>>> exceptions = {'e': 0.3, 'x': 1.5}
>>> distance('woke', 'woc')
0.8
Explanation:
woke
k -> c = 1
woce
-e = 1
woc
Distance = 2
# With exceptions:
woke
k -> c = 0.5
woce
-e = 0.3
woc
Distance = 0.8
How could I achieve this? Would this be possible to implement with this Counter algorithm?
Current code (thank you David Eisenstat)
import collections

def distance(s1, s2):
    cnt = collections.Counter()
    for c in s1:
        cnt[c] += 1
    for c in s2:
        cnt[c] -= 1
    return sum(abs(diff) for diff in cnt.values()) // 2 + \
        (abs(sum(cnt.values())) + 1) // 2

Generally, this is the assignment problem. You have a set of characters from one string, a set of characters from another string, and a set of costs for assigning a character from one string to a character in the other string. You can add some dummy characters to both strings to handle the add/delete operations.
There are many known algorithms for the assignment problem. One of them is the Hungarian algorithm; the linked Wikipedia article contains links to some of the implementations. Another method is to reduce the assignment problem to a min-cost max-flow problem, which you may find simpler to adjust for add/delete operations.
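As an illustration of this framing (my own sketch, not part of the original answer): SciPy's scipy.optimize.linear_sum_assignment is a ready-made assignment-problem solver, and the cost tables below simply reuse the example weights from the question. Dummy rows and columns carry the insert/delete costs.

import numpy as np
from scipy.optimize import linear_sum_assignment

SIMILAR = {frozenset(('c', 'k')): 0.5}   # reduced substitution costs
EXCEPTIONS = {'e': 0.3, 'x': 1.5}        # per-letter insert/delete weight, default 1

def sub_cost(a, b):
    if a == b:
        return 0.0
    return SIMILAR.get(frozenset((a, b)), 1.0)

def indel_cost(ch):
    return EXCEPTIONS.get(ch, 1.0)

def distance(s1, s2):
    n, m = len(s1), len(s2)
    size = n + m                      # pad both sides with dummies for insert/delete
    cost = np.zeros((size, size))
    for i in range(size):
        for j in range(size):
            if i < n and j < m:       # substitute s1[i] with s2[j]
                cost[i, j] = sub_cost(s1[i], s2[j])
            elif i < n:               # s1[i] matched to a dummy: deletion
                cost[i, j] = indel_cost(s1[i])
            elif j < m:               # dummy matched to s2[j]: insertion
                cost[i, j] = indel_cost(s2[j])
            # dummy-to-dummy stays 0
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()

print(distance('woke', 'woc'))        # 0.8 with the example weights above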

I ended up dividing the process into a few stages and then iterating through the strings for each stage. I'm not sure if it's as efficient as it could be, but it works.
Summing up what I was trying to achieve (in relation to Edit-distance algorithms)
Distance from one letter to another is 1, e.g. change j -> k = 1
0 means no difference at all, e.g. change j -> j = 0
Similar letters can be worth less than 1 (specified by me) e.g. c and k sound the same, therefore c, k = 0.5, change c -> k = 0.5
Certain letters could be worth more or less (specified by me) e.g. x is uncommon so I want it to have more weight, x = 1.4, change x -> k = 1.4
Created 2 dictionaries, 1 for similar letters, 1 for exceptions
Populate Counter -
Iterate through both strings
Match similar items - Iterate string1, if in similar dict, iterate string2, if in similar dict
Update Counter - remove similar items,
Find Distance - add up absolute frequencies, account for difference in string length
Include exceptions distance - Account for exception values based on frequency of letters
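A rough sketch of those stages (the similarity and exception tables are the example values from the question; exactly how unmatched letters are paired off in the final stage is my own assumption, since it isn't specified above):

import collections

SIMILAR = {frozenset(('c', 'k')): 0.5}   # pairs of letters treated as near-matches
EXCEPTIONS = {'e': 0.3, 'x': 1.5}        # per-letter weight, default 1

def weighted_distance(s1, s2):
    # Stage 1: frequency difference between the two strings (order ignored).
    cnt = collections.Counter(s1)
    cnt.subtract(collections.Counter(s2))

    total = 0.0
    # Stage 2: pair a surplus letter from s1 with a deficit letter from s2
    # whenever the pair is declared similar, charging the reduced cost.
    for a in list(cnt):
        for b in list(cnt):
            pair = frozenset((a, b))
            while cnt[a] > 0 and cnt[b] < 0 and pair in SIMILAR:
                total += SIMILAR[pair]
                cnt[a] -= 1
                cnt[b] += 1

    # Stage 3: remaining surpluses and deficits become plain substitutions
    # (cost: the heavier letter's weight); leftovers are insertions/deletions
    # charged at their own weight.
    surplus = [ltr for ltr, v in cnt.items() if v > 0 for _ in range(v)]
    deficit = [ltr for ltr, v in cnt.items() if v < 0 for _ in range(-v)]
    for a, b in zip(surplus, deficit):
        total += max(EXCEPTIONS.get(a, 1.0), EXCEPTIONS.get(b, 1.0))
    for ltr in surplus[len(deficit):] + deficit[len(surplus):]:
        total += EXCEPTIONS.get(ltr, 1.0)
    return total

print(weighted_distance('woke', 'woc'))   # 0.8 with the example weights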


Optimally selecting n datapoints from k

Problem statement:
I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:
select a minimal number of strings so that no selected string differs from any other string within the 32k by more than 2 characters
This means that the count of strings that need to be selected is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3 * 13 - meaning that the eventual count of strings to be selected is not astronomical.
What I tried so far:
Clustering with k-means++ initialization and then k-means using Hamming distance - but this did not yield the desired outcome, although the problem resembles a clustering problem in the sense that we are practically looking for cluster centers with cluster members within a radius of 2.
What I am thinking of is simply selecting the string which has the most other strings within a distance of 1 and then of 2, taking all of these out of the 32k, and then repeating the calculation until no strings are left. But this is likely to be a suboptimal solution; I believe I would select more strings this way than the required minimum (selecting additional strings is a cost).
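(For illustration only, not part of the original question: a brute-force sketch of that greedy idea. The quadratic neighbor counting is slow for 32k strings, so treat it as a baseline rather than a tuned solution.)

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

def greedy_cover(strings: list[str], radius: int = 2) -> list[str]:
    remaining = set(range(len(strings)))
    selected = []
    while remaining:
        # Pick the string that covers the most not-yet-covered strings.
        best, best_cov = None, set()
        for i in remaining:
            cov = {j for j in remaining if hamming(strings[i], strings[j]) <= radius}
            if len(cov) > len(best_cov):
                best, best_cov = i, cov
        selected.append(strings[best])
        remaining -= best_cov
    return selected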
Question:
What other algorithms should I consider or think of? Thanks!
Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.
string1 = 'abcbacaacbaab'
string2 = 'abcbacaacbbbb'

from difflib import SequenceMatcher
x = SequenceMatcher(a=string1, b=string2).ratio()
print(x)
# output: 0.8462

# OR (I used pip3 install jellyfish first)
import jellyfish
x = jellyfish.damerau_levenshtein_distance(string1, string2)
print(x)
# output: 2
You might be able to use one of the types of 'fuzzy string matching' explained at:
https://miguendes.me/python-compare-strings#how-to-compare-two-strings-for-similarity-fuzzy-string-matching
There's "difflib" which computes a ratio of the differences. (You're in luck, your strings are all the same length.)
There's also something called "jellyfish" that returns a character count of the differences. It sounds like an interesting assignment, good luck!
If I understood correctly, you want the minimum subset such that all elements in the subset differ by no more than two characters from the elements outside of the subset (please let me know if I misunderstood the problem).
If that is the problem, there is a simple ad hoc algorithm that solves it in O(m * max(n, k)), where n is the total number of elements in the set (32000 in this case), m is the number of characters of an element of the set (13 in this case) and k is the size of the alphabet (3 in this case).
You can precalculate the quantity of each unique character of the alphabet in each column in O(m * max(n, k)). It's O(m * k) for initialization of the precalculation matrix and O(m * n) to actually calculate it.
Each column can vote for the removal of a string from the set if the count of that string's character in that column equals the number of strings in the initial set. Notice that a column can vote in O(1) using the precalculation. For each string, iterate through its columns and let each column vote. If you get three votes, you are sure the string needs to be kicked out of the set, so there is no need to continue iterating through the columns; just go to the next string. Otherwise, the string needs to remain; just append it to the answer.
A python code is attached:
def solve(s: list[str], n: int = 32000, m: int = 13, k: int = 3) -> list[str]:
    pre_calc = [[0 for j in range(k)] for i in range(m)]
    ans = []
    for i in range(n):
        for j in range(m):
            pre_calc[j][ord(s[i][j]) - ord('a')] += 1
    for i in range(n):
        votes_cnt = 0
        remove = False
        for j in range(m):
            if pre_calc[j][ord(s[i][j]) - ord('a')] == n:
                votes_cnt += 1
                if votes_cnt == 3:
                    remove = True
                    break
        if not remove:
            ans.append(s[i])
    if len(ans) == 0:
        ans.append(s[0])
    return ans

combining elements of matrix in groups of 4 without repetition

I am a new member of this community and a new user of Python.
I am facing a little problem, both at conceptual and at coding level. It is the following:
I have 11 groups of 8 identical elements (in real life they are 2 cotton pads cut into 4 pieces each, for a total of 8 pieces, multiplied by 11 donors - it is a body odor collection study), namely:
A A A A A A A A
B B B B B B B B
C C C C C C C C
...
M M M M M M M M
I now have to form supra-donor pads by combining 4 pieces from different donors, e.g. ABCD, ABCE, CDEF, etc. A group of 4 elements shouldn't contain pieces from the same donor (e.g. AABC or ABDD are not allowed), and of course once a piece is used, it can't be used to form another supra-donor pad.
I wanted to code something that forms the groups automatically, without banging my head doing it manually and risking losing count.
Is this a case of combinations without repetitions?
I was thinking of doing something like this: create a matrix like the one above, create 20 (the number of supra-donor pads I need) empty 4-element groups (lists?), and then write a loop that randomly picks an element Cij of the matrix and moves it into an empty list, then goes on to the next element to pick, making sure it is of a different type and is a piece that was not picked for a previous group (e.g. if one group has element C43, then that same element shouldn't be used in another group). Do so until the 4-element group is full, then move on to the next 4-element group.
I am asking for some help because I have little time to do this; otherwise I would try to learn by making loads of mistakes.
Any suggestion?
EDIT: example of a table with the 4-element groups already created and the number of eighth-part pieces used for the different elements (some of them will be left over, of course):
[Image: supra-donor pads]
Thank you in advance to everyone willing to provide insights!
Here is a solution that generates all possible combinations of the donors, then shuffles them and picks a new one in every iteration. If the picked combination is not valid, because all samples of one of its donors are already exhausted, that combination is dropped and a new one is chosen.
import itertools
import random

def comb_generator(bases, per_base, sought):
    combs = list(itertools.combinations(bases, 4))
    random.shuffle(combs)
    avail = {}
    for base in bases:
        avail[base] = per_base
    comb_wanted = sought > 0  # Ensure picking at least one new comb
    while comb_wanted:
        comb = combs.pop()
        comb_wanted = False
        for base in comb:
            comb_wanted = avail[base] <= 0
            if comb_wanted:
                break
        if not comb_wanted:
            for base in comb:
                avail[base] -= 1
            yield comb
            sought -= 1
            comb_wanted = sought > 0

bases = 'ABCDEFGHIJKLM'
per_base = 8
sought = 20

comb_gens = comb_generator(bases, per_base, sought)
for i, comb in enumerate(comb_gens):
    print('{:2}: {}'.format(i, comb))
As can be seen, I have implemented the solution as a generator, since I find that to be rather useful when later on working with the entries.
N.B. The current solution does not care about keeping an equal number of samples from each donor. That could be done by adding a requirement that the availability of the different donors not vary too much, but it would add to the complexity, with discarded combinations being re-inserted at a later position in the 'deck'.
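For what it's worth, a hypothetical variation of my own (not the generator above) that keeps usage roughly balanced is to greedily pick, in each round, the 4 donors with the most pieces remaining:

def balanced_pads(bases='ABCDEFGHIJKLM', per_base=8, sought=20):
    avail = {b: per_base for b in bases}
    pads = []
    for _ in range(sought):
        # Donors with the most remaining pieces come first.
        top4 = sorted(avail, key=avail.get, reverse=True)[:4]
        if any(avail[b] == 0 for b in top4):
            break  # not enough distinct donors left
        for b in top4:
            avail[b] -= 1
        pads.append(tuple(top4))
    return pads

print(balanced_pads())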

Best way to hash ordered permutation of [1,9]

I'm trying to implement a method to keep the visited states of the 8 puzzle from generating again.
My initial approach was to save each visited pattern in a list and do a linear check each time the algorithm wants to generate a child.
Now I want to do this in O(1) time through list access. Each pattern in the 8-puzzle is an ordered permutation of the numbers 1 to 9 (9 being the blank block); for example 125346987 is:
1 2 5
3 4 6
_ 8 7
The number of all possible permutations of this kind is around 363,000 (9!). What is the best way to hash these numbers to indexes of a list of that size?
You can map a permutation of N items to its index in the list of all permutations of N items (ordered lexicographically).
Here's some code that does this, and a demonstration that it produces indexes 0 to 23 once each for all permutations of a 4-letter sequence.
import itertools

def fact(n):
    r = 1
    for i in range(n):
        r *= i + 1
    return r

def I(perm):
    if len(perm) == 1:
        return 0
    return sum(p < perm[0] for p in perm) * fact(len(perm) - 1) + I(perm[1:])

for p in itertools.permutations('abcd'):
    print(p, I(p))
The best way to understand the code is to prove its correctness. For an array of length n, there are (n-1)! permutations with the smallest element of the array appearing first, (n-1)! permutations with the second smallest element appearing first, and so on.
So, to find the index of a given permutation, count how many items are smaller than the first thing in the permutation and multiply that by (n-1)!. Then recursively add the index of the remainder of the permutation, considered as a permutation of (n-1) elements. The base case is a permutation of length 1. Obviously there's only one such permutation, so its index is 0.
A worked example: [1324].
[1324]: 1 appears first, and that's the smallest element in the array, so that gives 0 * (3!)
Removing 1 gives us [324]. The first element is 3. There's one element that's smaller, so that gives us 1 * (2!).
Removing 3 gives us [24]. The first element is 2. That's the smallest element remaining, so that gives us 0 * (1!).
Removing 2 gives us [4]. There's only one element, so we use the base case and get 0.
Adding up, we get 0*3! + 1*2! + 0*1! + 0 = 1*2! = 2. So [1324] is at index 2 in the sorted list of permutations of 4 elements. That's correct, because index 0 is [1234], index 1 is [1243], and the lexicographically next permutation is our [1324].
I believe you're asking for a function to map permutations to array indices. This dictionary maps all permutations of numbers 1-9 to values from 0 to 9!-1.
import itertools
index = itertools.count(0)
permutations = itertools.permutations(range(1, 10))
hashes = {h:next(index) for h in permutations}
For example, hashes[(1,2,5,3,4,6,9,8,7)] gives a value of 1445.
If you need them in strings instead of tuples, use:
permutations = [''.join(x) for x in itertools.permutations('123456789')]
or as integers:
permutations = [int(''.join(x)) for x in itertools.permutations('123456789')]
It looks like you are only interested in whether or not you have already visited the permutation.
You should use a set. It grants the O(1) look-up you are interested in.
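A tiny illustration (storing the states as strings here is just one possible choice):

visited = set()
visited.add('125346987')
print('125346987' in visited)   # True; membership tests are O(1) on average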
A structure that is efficient in both space and lookup for this problem is a trie, since it shares storage for common lexicographical prefixes across permutations.
i.e. the space used for "123" in 1234 and in 1235 is the same.
Let's assume 0 as a replacement for '_' in your example, for simplicity.
Storing
Your trie will be a tree of booleans. The root node is an empty node, and each node has 9 children, each with a boolean flag initially set to false; the 9 children correspond to the digits 0 to 8 (0 standing in for _).
You can build the trie on the go: as you encounter a permutation, walk its digits through the trie and mark the nodes along the path by setting their flags to true.
Lookup
The trie is traversed from the root to the children based on the digits of the permutation; if all the nodes along the path have already been marked true, the permutation has occurred before. The lookup is just 9 node hops.
Here is how the trie would look for a 4-digit example:
[Image: example trie]
This trie can easily be stored in a list of booleans, say myList, where myList[0] is the root, as explained in the concept here:
https://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html
The final trie in a list would be around 9 + 9^2 + 9^3 + ... + 9^8 bits, i.e. less than 10 MB for all lookups.
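A minimal sketch of the idea, using nested dicts instead of the flat boolean list described above (my own illustration):

def insert(trie: dict, perm: str) -> bool:
    """Insert perm into the trie; return True if it had already been inserted."""
    node = trie
    seen = True
    for digit in perm:
        if digit not in node:
            node[digit] = {}
            seen = False
        node = node[digit]
    return seen

trie = {}
print(insert(trie, '125346987'))  # False: first visit
print(insert(trie, '125346987'))  # True: already visited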
Use
I've developed a heuristic function for this specific case. It is not a perfect hashing, as the mapping is not onto [0, 9!-1] but into [1, 767359], but it is O(1).
Let's assume we already have a file / reserved memory / whatever with 767359 bits set to 0 (e.g., mem = [False] * 767359). Let an 8-puzzle pattern be mapped to a Python string (e.g., '125346987'). Then the hash function is determined by:
def getPosition(input_str):
    data = []
    opts = list(range(1, 10))
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind) + str(n)
    output = int(output_str)
    return output
I.e., in order to check whether an 8-puzzle pattern such as 125346987 has already been used, we need to:
pattern = '125346987'
pos = getPosition(pattern)
used = mem[pos-1] #mem starts in 0, getPosition in 1.
With a perfect hashing we would have needed 9! bits to store the booleans. In this case we need 2x more (767359/9! = 2.11), but recall that it is not even 1Mb (barely 100KB).
Note that the function is easily invertible.
Check
I could prove to you mathematically why this works and why there won't be any collisions, but since this is a programming forum let's just run it for every possible permutation and check that all the hash values (positions) are indeed different:
def getPosition(input_str):
    data = []
    opts = list(range(1, 10))
    n = int(input_str[0])
    opts.pop(opts.index(n))
    for c in input_str[1:len(input_str)-1]:
        k = opts.index(int(c))
        opts.pop(k)
        data.append(k)
    ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
    output_str = str(ind) + str(n)
    output = int(output_str)
    return output

# CHECKING PURPOSES
def addperm(x, l):
    return [l[0:i] + [x] + l[i:] for i in range(len(l)+1)]

def perm(l):
    if len(l) == 0:
        return [[]]
    return [x for y in perm(l[1:]) for x in addperm(l[0], y)]

# We generate all the permutations
all_perms = perm([i for i in range(1, 10)])
print("Number of all possible perms.: " + str(len(all_perms)))  # indeed 9! = 362880

# We execute our hash function over all the perms and store the output.
all_positions = []
for permutation in all_perms:
    perm_string = ''.join(map(str, permutation))
    all_positions.append(getPosition(perm_string))

# We want to check if there has been any collision, i.e., if there
# is one position that is repeated at least twice.
print("Number of different hashes: " + str(len(set(all_positions))))
# also 9!, so the hash works properly.
How does it work?
The idea behind this has to do with a tree: at the beginning it has 9 branches going to 9 nodes, each corresponding to a digit. From each of these nodes we have 8 branches going to 8 nodes, each corresponding to a digit except its parent, then 7, and so on.
We first store the first digit of our input string in a separate variable and pop it out from our 'node' list, because we have already taken the branch corresponding to the first digit.
Then we have 8 branches, and we choose the one corresponding to our second digit. Note that, since there are 8 branches, we need 3 bits to store the index of our chosen branch, and the maximum value it can take is 111 for the 8th branch (we map branches 1-8 to binary 000-111). Once we have chosen and stored the branch index, we pop that value out, so that the next node list doesn't include this digit again.
We proceed in the same way for branches 7, 6 and 5. Note that when we have 7 branches we still need 3 bits, though the maximum value will be 110. When we have 5 branches, the index will be at most binary 100.
Then we get to 4 branches and we notice that this can be stored just with 2 bits, same for 3 branches. For 2 branches we will just need 1bit, and for the last branch we don't need any bit: there will be just one branch pointing to the last digit, which will be the remaining from our 1-9 original list.
So, what we have so far: the first digit stored in a separate variable, and a list of 7 indexes representing branches. The first 4 indexes can be represented with 3 bits each, the following 2 indexes with 2 bits each, and the last index with 1 bit.
The idea is to concatenate all these indexes in their bit form to create a larger number. Since we have 17 bits, this number will be below 2^17 = 131072. Now we just append the first digit we had stored to the end of that number (this digit is at most 9), so the biggest number we can create is 1310729.
But we can do better: recall that when we had 5 branches we needed 3 bits, though the maximum value was binary 100. What if we arrange our bits so that those with more 0s come first? If so, in the worst case scenario our final bit number will be the concatenation of:
100 10 101 110 111 11 1
Which in decimal is 76735. Then we proceed as before (appending the 9 at the end) and we get that our biggest possible generated number is 767359, which is the amount of bits we need; it corresponds to the input string 987654321, while the lowest possible number is 1, which corresponds to the input string 123456789.
Just to finish: one might wonder why we stored the first digit in a separate variable and appended it at the end. The reason is that, had we kept it, the number of branches at the beginning would have been 9, so to store the first index (1-9) we would have needed 4 bits (0000 to 1000), which would have made our mapping much less efficient: in that case the biggest possible number (and therefore the amount of memory needed) would have been
1000 100 10 101 110 111 11 1
which is 1125311 in decimal (1.13M bits vs 0.77M bits). It is quite interesting to see that the ratio between the two, 1.13M/0.77M = 1.47, has something to do with the ratio of using four bits compared to just appending a decimal digit (2^4/10 = 1.6), which makes a lot of sense (the difference is due to the fact that with the first approach we are not fully using the 4 bits).
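For reference, the two bit-concatenations quoted above can be checked directly (adjacent string literals concatenate in Python):

print(int('100' '10' '101' '110' '111' '11' '1', 2))           # 76735
print(int('1000' '100' '10' '101' '110' '111' '11' '1', 2))    # 1125311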
First. There is nothing faster than a list of booleans. There are a total of 9! == 362880 possible permutations for your task, which is a reasonably small amount of data to store in memory:
import math
visited_states = [False] * math.factorial(9)
Alternatively, you can use an array of bytes, which is slightly slower (not by much though) and has a much lower memory footprint (by an order of magnitude at least). However, any memory savings from using an array will probably be of little value considering the next step.
Second. You need to convert your specific permutation to it's index. There are algorithms which do this, one of the best StackOverflow questions on this topic is probably this one:
Finding the index of a given permutation
You have fixed permutation size n == 9, so whatever complexity an algorithm has, it will be equivalent to O(1) in your situation.
However to produce even faster results, you can pre-populate a mapping dictionary which will give you an O(1) lookup:
import itertools
all_permutations = map(lambda p: ''.join(p), itertools.permutations('123456789'))
permutation_index = dict((perm, index) for index, perm in enumerate(all_permutations))
This dictionary will consume about 50 Mb of memory, which is... not that much actually. Especially since you only need to create it once.
After all this is done, checking your specific combination is done with:
visited = visited_states[permutation_index['168249357']]
Marking it to visited is done in the same manner:
visited_states[permutation_index['168249357']] = True
Note that using any of the permutation-index algorithms will be much slower than the mapping dictionary. Most of those algorithms are of O(n^2) complexity, and in your case that results in 81 times worse performance, even discounting the extra Python code itself. So unless you have heavy memory constraints, using the mapping dictionary is probably the best solution speed-wise.
Addendum. As has been pointed out by Palec, visited_states list is actually not needed at all - it's perfectly possible to store True/False values directly in the permutation_index dictionary, which saves some memory and an extra list lookup.
Notice that if you type hash(125346987) it returns 125346987. That is for a reason: there is no point in hashing an integer to anything other than an integer.
What you should do is, when you find a pattern, add it to a dictionary rather than a list. This will provide the fast lookup you need, rather than traversing the list like you are doing now.
So say you find the pattern 125346987 you can do:
foundPatterns = {}
#some code to find the pattern
foundPatterns[1] = 125346987
#more code
#test if there?
125346987 in foundPatterns.values()
True
If you must always have O(1), then it seems like a bit array would do the job. You'd only need to store 363,000 elements, which seems doable. Though note that in practice it's not always faster. The simplest implementation looks like:
Create data structure
visited_bitset = [False for _ in range(363000)]
Test current state and add if not visited yet
if not visited_bitset[current_state]:
    visited_bitset[current_state] = True
Paul's answer might work.
Elisha's answer is a perfectly valid hash function that guarantees no collisions. 9! would be a bare minimum for a guaranteed collision-free hash function, but (unless someone corrects me; Paul probably has) I don't believe there exists a function to map each board to a value in the domain [0, 9!], let alone a hash function that is nothing more than O(1).
If you have 1 GB of memory to support a Boolean array of 864,197,532 (i.e. 987654321 - 123456789) indices, you can guarantee (computationally) the O(1) requirement.
Practically speaking (i.e. when you run it on a real system), this isn't going to be cache friendly, but on paper this solution will definitely work. Even if a perfect function did exist, I doubt it would be cache friendly either.
Using prebuilts like set or hashmap (sorry, I haven't programmed Python in a while, so I don't remember the datatype) must have amortized O(1). But using one of these with a suboptimal hash function like n % RANDOM_PRIME_NUM_GREATER_THAN_100000 might give the best solution.

Methodology: Recursive datasets - tree branch refinement

I have a number of sets of data. These sets contain numbers that specify how many points a user gains upon passing to the next index:
A = (2,[2],2,6,6,10)
B = (2,4,[4],2,5,7,7,6,10,12,10,6)
C = (2,3,[4],5,6,7,7,8,10)
In this example I use three sets, but in the real problem there are far more sets (a variable amount). The [square] brackets mark the currently selected index, so the indexes specified above are: (1, 2, 2)
All these indexes together form a total that I can keep track of by grabbing it from a webpage (in this case the total is: (2+2) + (2+4+4) + (2+3+4) = 23). By keeping track of the total I know that it changes by a certain number; let's call this number X.
Total: 23 -> 25 -> 30
X: +2 +5 (these are the numbers X I can keep track of, they are given, but variable)
In this example the first X is +2, this either means that A went from 1->2 or B from 2->3:
Case 1: A passes on
A = (2,2,[2],6,6,10)
B = (2,4,[4],4,5,7,7,6,10,12,10,6)
C = (2,3,[4],5,6,7,7,8,10)
Case 2: B passes on
A = (2,[2],2,6,6,10)
B = (2,4,4,[2],5,7,7,6,10,12,10,6)
C = (2,3,[4],5,6,7,7,8,10)
We know the next increase is +5. This means either that, for case 1, C goes from index 2 -> 3, or, for case 2, B goes from 3 -> 4 or C goes from 2 -> 3.
Case 1: A increased => C increased
A = (2,2,[2],6,6,10)
B = (2,4,4,[2],5,7,7,6,10,12,10,6)
C = (2,3,4,[5],6,7,7,8,10)
Case 2: B increased => B increased
A = (2,[2],2,6,6,10)
B = (2,4,4,2,[5],7,7,6,10,12,10,6)
C = (2,3,[4],5,6,7,7,8,10)
Case 3: B increased => C increased
A = (2,[2],2,6,6,10)
B = (2,4,4,[2],5,7,7,6,10,12,10,6)
C = (2,3,4,[5],6,7,7,8,10)
Now what I need is an algorithm that displays EVERY possible combination of indexes resulting from the increases I=(+2,+5); note that in reality these are variables: I=(+X, +Y, +Z, ...), and the depth is also variable.
Now the problem looks quite easy, but imagine the next increase being +7, giving I=(+2,+5,+7): then only one case remains valid (B->B->B). In some way I thus need to write a big recursive function that re-evaluates all results and removes dead ends for every following increase, but I'm not sure how to write such a function.
For extra clarification: imagine the tracked data going +2, +5, +6, +6; then this diagram shows what I want accomplished: [diagram not included]
Summary: The full problem with all its variables is thus:
N Sets of data:
A = (a1,a2,a3,a4,...)
B = (b1,b2,b3,b4,...)
...
N = (n1,n2,n3,n4,...)
A given array Z with the current selected indexes:
Z = ([A], [B], [C], ... , [N])
A given array I with increases with depth N:
I = (+X,+Y,...,+N)
Asked: possible new arrays Z (possible ways to get to the new total with given intervals using only the increases specified in the data sets)
What I want: how to write an algorithm for this purpose. I don't need you to write the algorithm for me, but a starting point would be nice; I'm kinda lost in the problem.
Note: since this question is quite long and technical, it's possible that some minor mistakes slipped in; comment below and I'll try to fix them.
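Not part of the original question, but one hedged sketch of a starting point: represent each possibility as a tuple of current indexes, and for every observed increase branch over each set whose next value matches it; dead ends prune themselves because they produce no successors. The names below are just illustration.

def successors(sets, indexes, increase):
    # All index tuples reachable from `indexes` by advancing exactly one set
    # whose next value equals `increase`.
    out = []
    for i, (data, pos) in enumerate(zip(sets, indexes)):
        if pos + 1 < len(data) and data[pos + 1] == increase:
            new = list(indexes)
            new[i] = pos + 1
            out.append(tuple(new))
    return out

def possible_states(sets, start, increases):
    states = {tuple(start)}
    for inc in increases:
        states = {nxt for st in states for nxt in successors(sets, st, inc)}
    return states

A = (2, 2, 2, 6, 6, 10)
B = (2, 4, 4, 2, 5, 7, 7, 6, 10, 12, 10, 6)
C = (2, 3, 4, 5, 6, 7, 7, 8, 10)
# Start at indexes (1, 2, 2) and observe the increases +2, +5 from the example:
print(possible_states([A, B, C], (1, 2, 2), (2, 5)))
# three surviving index combinations, matching cases 1-3 above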

Can Python generate a random number that excludes a set of numbers, without using recursion?

I looked over the Python docs (I may have misunderstood), but I didn't see that there was a way to do this (see below) without calling a recursive function.
What I'd like to do is generate a random value which excludes values in the middle.
In other words,
Let's imagine I wanted X to be a random number that's not in
range(a - b, a + b)
Can I do this on the first pass,
or
1. Do I have to constantly generate a number,
2. check whether it's in range(),
3. wash, rinse, repeat?
As for why I don't wish to write a recursive function,
1. it 'feels like' I should not have to
2. the set of numbers I'm doing this for could actually end up being quite large, and
... I hear stack overflows are bad, and I might just be being overly cautious in doing this.
I'm sure that there's a nice, Pythonic, non-recursive way to do it.
Generate one random number and map it onto your desired ranges of numbers.
If you wanted to generate an integer between 1-4 or 7-10, excluding 5 and 6, you might:
Generate a random integer in the range 1-8
If the random number is greater than 4, add 2 to the result.
The mapping becomes:
Random number: 1 2 3 4 5 6 7 8
Result: 1 2 3 4 7 8 9 10
Doing it this way, you never need to "re-roll". The above example is for integers, but it can also be applied to floats.
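A small sketch of that mapping for the 1-10 example (my own illustration):

import random

def rand_excluding_middle():
    r = random.randint(1, 8)        # 8 allowed values in total
    return r if r <= 4 else r + 2   # shift the upper half past the 5-6 gap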
Use random.choice().
In this example, a is your lower bound, the range between b and c is skipped and d is your upper bound.
import random
numbers = list(range(a, b)) + list(range(c, d))
r = random.choice(numbers)
A possible solution would be to just shift the random numbers out of that range. E.g.
import random

def NormalWORange(a, b, sigma):
    r = random.normalvariate(a, sigma)
    if r < a:
        return r - b
    else:
        return r + b
That would generate a normal distribution with a hole in the range (a-b,a+b).
Edit: If you want integers then you will need a little bit more work. If you want integers that are in the range [c,a-b] or [a+b,d] then the following should do the trick.
import random

def RangeWORange(a, b, c, d):
    r = random.randrange(c, d - 2*b)  # 2*b because there are two intervals of length b to exclude
    if r >= a - b:
        return r + 2*b
    else:
        return r
I may have misunderstood your problem, but you can implement this without recursion
import random

def rand(exclude):
    r = None
    while r in exclude or r is None:
        r = random.randrange(1, 10)
    return r

rand([1, 3, 9])
though, you're still looping over results until you find new ones.
The fastest solution would be this (with a and b defining the exclusion zone, and c and d the full set of possible answers including the exclusion zone):
offset = b - a
maximum = d - offset
result = random.randrange(c, maximum)
if result >= a:
    result += offset
You still need some range, i.e., a min-max possible value excluding your middle values.
Why don't you first randomly pick which "half" of the range you want, then pick a random number in that range? E.g.:
import random

def rand_not_in_range(a, b):
    rangechoices = ((0, a-b-1), (a+b+1, 10000000))
    # Pick a half
    fromrange = random.choice(rangechoices)
    # return an int from that range
    return random.randint(*fromrange)
Li-aung Yip's answer makes the recursion issue moot, but I have to point out that it's possible to do any degree of recursion without worrying about the stack. It's called "tail recursion". Python doesn't support tail recursion directly, because GvR thinks it's uncool:
http://neopythonic.blogspot.com/2009/04/tail-recursion-elimination.html
But you can get around this:
http://paulbutler.org/archives/tail-recursion-in-python/
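For a flavor of the workaround described at that link (a hypothetical minimal example, not code taken from the article): a "trampoline" runs a tail-recursive function iteratively, so the Python call stack never grows.

def trampoline(fn, *args):
    result = fn(*args)
    while callable(result):
        result = result()
    return result

def countdown(n):
    if n == 0:
        return 'done'
    return lambda: countdown(n - 1)   # return a thunk instead of recursing directly

print(trampoline(countdown, 100000))  # completes without a RecursionError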
I find it interesting that stick thinks that recursion "feels wrong". In extremely function-oriented languages, such as Scheme, recursion is unavoidable. It allows you to do iteration without creating state variables, which the functional programming paradigm rigorously avoids.
http://www.pling.org.uk/cs/pop.html
