Optimally selecting n datapoints from k - python

Problem statement:
I have 32k strings that consist of 13 characters. Each character can take 3 values (a, b or c). I need to select n strings from the 32k that satisfy the following:
select minimal number of strings so that the selected strings are not different than any other string within the 32k by more than 2 characters
This means that the count of strings that needs to be selected is variable. Also, the strings are not randomly generated, so the average difference is less than 2/3*13 - meaning that the eventual count of strings to be selected is not astronomical.
What I tried so far:
Clustering with k++ initialization and then k-means using hamming distance - but this did not yield in the desired outcome, albeit the problem resembles a clustering problem in a sense that we are practically looking for cluster centers with cluster members within a radius of 2.
What I am thinking of is simply selecting that string which has the most other strings having a distance of 1 and then of 2 - afterwards take out all these from the 32k and then repeat the calculation until no strings are left, but this is likely to be a suboptimal solution, e.g. this way I would select more strings than what is required at minimum I believe (selecting additional strings is a cost)
Question:
What other algorithms should I consider or think of? Thanks!

Here are examples of each method from my previous post. I always have trouble working code into my posts, so I did this separately. The first method computes the percentage that the strings are identical; the second method returns the number of differences.
string1 = ('abcbacaacbaab')
string2 = ('abcbacaacbbbb')
from difflib import SequenceMatcher
a=string1
b=string2
x = SequenceMatcher(a=a,b=b).ratio()
print(x)
#output: 0.8462
#OR (I used pip3 install jellyfish first)
import jellyfish
x=jellyfish.damerau_levenshtein_distance
(a,b)
print(x)
#output: 2

You might be able to use one of the types of 'fuzzy string matching' explained at:
https://miguendes.me/python-compare-strings#how-to-compare-two-strings-for-similarity-fuzzy-string-matching
There's "difflib" which computes a ratio of the differences. (You're in luck, your strings are all the same length.)
There's also something called "jellyfish" that returns a character count of the differences. It sounds like an interesting assignment, good luck!

if I understood, you want the minimum subset such that all elements in the subset are not different by more than two characters to the elements outside of the subset (please, let me know if I misunderstood the problem).
If that is the problem, there is a simple ad hoc algorithm that solves it in O(m * max(n, k)), where n is the total number of elements in the set (32000 in this case), m is the number of characters of an element of the set (13 in this case) and k is the size of the alphabet (3 in this case).
You can precalculate the quantity of each unique character of the alphabet in each column in O(m * max(n, k)). It's O(m * k) for initialization of the precalculation matrix and O(m * n) to actually calculate it.
Each column can vote for the removal of a string of the set if the character of the string in that column is equal to the number of strings in the initial set. Notice that a column can vote in O(1) using the precalculation. For each string, iterate through its columns and let the column vote. If you get three votes, you are sure the string needs to be kicked out of the set. So there is no need to continue iterating through the columns, just go to the next string. Else, the string needs to remain, just append it to the answer.
A python code is attached:
def solve(s: list[str], n: int = 32000, m: int = 13, k: int = 3) -> list[str]:
pre_calc = [[0 for j in range(k)] for i in range(m)]
ans = []
for i in range(n):
for j in range(m):
pre_calc[j][ord(s[i][j]) - ord('a')] += 1
for i in range(n):
votes_cnt = 0
remove = False
for j in range(m):
if pre_calc[j][ord(s[i][j]) - ord('a')] == n:
votes_cnt += 1
if votes_cnt == 3:
remove = True
break
if remove is False:
ans.append(s[i])
if len(ans) == 0:
ans.append(s[0])
return ans

Related

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement something like this::
Given n, k and an array(a list) arr where n = len(arr) and k is an integer in set (1, n) inclusive.
For an array (or list) myList, The Unfairness Sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if mylist = [1, 2, 5, 5, 6] then Minimum unfairness sum or MUS. Please note that elements are considered unique by their index in list not their values
MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6|
If you actually need to look at the problem statement, It's HERE
My Objective
given n, k, arr(as described above), find the Minimum Unfairness Sum out of all of the unfairness sums of sub arrays possible with a constraint that each len(sub array) = k [which is a good thing to make our lives easy, I believe :) ]
what I have tried
well, there is a lot to be added in here, so I'll try to be as short as I can.
My First approach was this where i used itertools.combinations to get all the possible combinations and statistics.variance to check its spread of data (yeah, I know I'm a mess).
Before you see the code below, Do you think these variance and unfairness sum are perfectly related (i know they are strongly related) i.e. the sub array with minimum variance has to be the sub array with MUS??
You only have to check the LetMeDoIt(n, k, arr) function. If you need MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn
def LetMeDoIt(n, k, arr):
v = []
s = []
subs = [list(x) for x in list(cmb(arr, k))] # getting all sub arrays from arr in a list
i = 0
for sub in subs:
if i != 0:
var = varn(sub) # the variance thingy
if float(var) < float(min(v)):
v.remove(v[0])
v.append(var)
s.remove(s[0])
s.append(sub)
else:
pass
elif i == 0:
var = varn(sub)
v.append(var)
s.append(sub)
i = 1
final = []
f = list(cmb(s[0], 2)) # getting list of all pairs (after determining sub array with least MUS)
for r in f:
final.append(abs(r[0]-r[1])) # calculating the MUS in my messy way
return sum(final)
The above code works fine for n<30 but raised a MemoryError beyond that.
In Python chat, Kevin suggested me to try generator which is memory efficient (it really is), but as generator also generates those combination on the fly as we iterate over them, it was supposed to take over 140 hours (:/) for n=50, k=8 as estimated.
I posted the same as a question on SO HERE (you might wanna have a look to understand me properly - it has discussions and an answer by fusion which takes me to my second approach - a better one(i should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb
def myvar(arr): # a function to calculate variance
l = len(arr)
m = sum(arr)/l
return sum((i-m)**2 for i in arr)/l
def LetMeDoIt(n, k, arr):
sorted_list = sorted(arr) # i think sorting the array makes it easy to get the sub array with MUS quickly
variance = None
min_variance_sub = None
for i in range(n - k + 1):
sub = sorted_list[i:i+k]
var = myvar(sub)
if variance is None or var<variance:
variance = var
min_variance_sub=sub
final = []
f = list(cmb(min_variance_sub, 2)) # again getting all possible pairs in my messy way
for r in f:
final.append(abs(r[0] - r[1]))
return sum(final)
def MainApp():
n = int(input())
k = int(input())
arr = list(int(input()) for _ in range(n))
result = LetMeDoIt(n, k, arr)
print(result)
if __name__ == '__main__':
MainApp()
This code works perfect for n up to 1000 (maybe more), but terminates due to time out (5 seconds is the limit on online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).
=====
How would you approach this problem to take care of all the test cases in given time limits (5 sec) ? (problem was listed under algorithm & dynamic programming)
(for your references you can have a look on
successful submissions(py3, py2, C++, java) on this problem by other candidates - so that you can
explain that approach for me and future visitors)
an editorial by the problem setter explaining how to approach the question
a solution code by problem setter himself (py2, C++).
Input data (test cases) and expected output
Edit1 ::
For future visitors of this question, the conclusions I have till now are,
that variance and unfairness sum are not perfectly related (they are strongly related) which implies that among a lots of lists of integers, a list with minimum variance doesn't always have to be the list with minimum unfairness sum. If you want to know why, I actually asked that as a separate question on math stack exchange HERE where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected)
As far as the question is concerned overall, you can read answers by archer & Attersson below (still trying to figure out a naive approach to carry this out - it shouldn't be far by now though)
Thank you for any help or suggestions :)
You must work on your list SORTED and check only sublists with consecutive elements. This is because BY DEFAULT, any sublist that includes at least one element that is not consecutive, will have higher unfairness sum.
For example if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check for sublists with length 3, then solution must be one of [1,3,7] [3,7,10] [7,10,20] etc
Any other sublist eg [1,3,10] will have higher unfairness sum because 10>7 therefore all its differences with rest of elements will be larger than 7
The same for [1,7,10] (non consecutive on the left side) as 1<3
Given that, you only have to check for consecutive sublists of length k which reduces the execution time significantly
Regarding coding, something like this should work:
def myvar(array):
return sum([abs(i[0]-i[1]) for i in itertools.combinations(array,2)])
def minsum(n, k, arr):
res=1000000000000000000000 #alternatively make it equal with first subarray
for i in range(n-k):
res=min(res, myvar(l[i:i+k]))
return res
I see this question still has no complete answer. I will write a track of a correct algorithm which will pass the judge. I will not write the code in order to respect the purpose of the Hackerrank challenge. Since we have working solutions.
The original array must be sorted. This has a complexity of O(NlogN)
At this point you can check consecutive sub arrays as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer
The last check passage, to find the minimum "unfairness sum" can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating this for every step, done in O(k), which brings the complexity of this passage to O(k*N). It can be done in O(1) as the editorial you posted shows, including mathematic formulae. It requires a previous initialization of a cumulative array after step 1 (done in O(N) with space complexity O(N) too).
It works but terminates due to time out for n<=10000.
(from comments on archer's question)
To explain step 3, think about k = 100. You are scrolling the N-long array and the first iteration, you must calculate the US for the sub array from element 0 to 99 as usual, requiring 100 passages. The next step needs you to calculate the same for a sub array that only differs from the previous by 1 element 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) scrolling. Just figure the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(NlogN) due to the first sort.

Best way to hash ordered permutation of [1,9]

I'm trying to implement a method to keep the visited states of the 8 puzzle from generating again.
My initial approach was to save each visited pattern in a list and do a linear check each time the algorithm wants to generate a child.
Now I want to do this in O(1) time through list access. Each pattern in 8 puzzle is an ordered permutation of numbers between 1 to 9 (9 being the blank block), for example 125346987 is:
1 2 5
3 4 6
_ 8 7
The number of all of the possible permutation of this kind is around 363,000 (9!). what is the best way to hash these numbers to indexes of a list of that size?
You can map a permutation of N items to its index in the list of all permutations of N items (ordered lexicographically).
Here's some code that does this, and a demonstration that it produces indexes 0 to 23 once each for all permutations of a 4-letter sequence.
import itertools
def fact(n):
r = 1
for i in xrange(n):
r *= i + 1
return r
def I(perm):
if len(perm) == 1:
return 0
return sum(p < perm[0] for p in perm) * fact(len(perm) - 1) + I(perm[1:])
for p in itertools.permutations('abcd'):
print p, I(p)
The best way to understand the code is to prove its correctness. For an array of length n, there's (n-1)! permutations with the smallest element of the array appearing first, (n-1)! permutations with the second smallest element appearing first, and so on.
So, to find the index of a given permutation, see count how many items are smaller than the first thing in the permutation and multiply that by (n-1)!. Then recursively add the index of the remainder of the permutation, considered as a permutation of (n-1) elements. The base case is when you have a permutation of length 1. Obviously there's only one such permutation, so its index is 0.
A worked example: [1324].
[1324]: 1 appears first, and that's the smallest element in the array, so that gives 0 * (3!)
Removing 1 gives us [324]. The first element is 3. There's one element that's smaller, so that gives us 1 * (2!).
Removing 3 gives us [24]. The first element is 2. That's the smallest element remaining, so that gives us 0 * (1!).
Removing 2 gives us [4]. There's only one element, so we use the base case and get 0.
Adding up, we get 0*3! + 1*2! + 0*1! + 0 = 1*2! = 2. So [1324] is at index 2 in the sorted list of 4 permutations. That's correct, because at index 0 is [1234], index 1 is [1243], and the lexicographically next permutation is our [1324].
I believe you're asking for a function to map permutations to array indices. This dictionary maps all permutations of numbers 1-9 to values from 0 to 9!-1.
import itertools
index = itertools.count(0)
permutations = itertools.permutations(range(1, 10))
hashes = {h:next(index) for h in permutations}
For example, hashes[(1,2,5,3,4,6,9,8,7)] gives a value of 1445.
If you need them in strings instead of tuples, use:
permutations = [''.join(x) for x in itertools.permutations('123456789')]
or as integers:
permutations = [int(''.join(x)) for x in itertools.permutations('123456789')]
It looks like you are only interested in whether or not you have already visited the permutation.
You should use a set. It grants the O(1) look-up you are interested in.
A space as well lookup efficient structure for this problem is a trie type structure, as it will use common space for lexicographical matches in any
permutation.
i.e. the space used for "123" in 1234, and in 1235 will be the same.
Lets assume 0 as replacement for '_' in your example for simplicity.
Storing
Your trie will be a tree of booleans, the root node will be an empty node, and then each node will contain 9 children with a boolean flag set to false, the 9 children specify digits 0 to 8 and _ .
You can create the trie on the go, as you encounter a permutation, and store the encountered digits as boolean in the trie by setting the bool as true.
Lookup
The trie is traversed from root to children based on digits of the permutation, and if the nodes have been marked as true, that means the permutation has occured before. The complexity of lookup is just 9 node hops.
Here is how the trie would look for a 4 digit example :
Python trie
This trie can be easily stored in a list of booleans, say myList.
Where myList[0] is the root, as explained in the concept here :
https://webdocs.cs.ualberta.ca/~holte/T26/tree-as-array.html
The final trie in a list would be around 9+9^2+9^3....9^8 bits i.e. less than 10 MB for all lookups.
Use
I've developed a heuristic function for this specific case. It is not a perfect hashing, as the mapping is not between [0,9!-1] but between [1,767359], but it is O(1).
Let's assume we already have a file / reserved memory / whatever with 767359 bits set to 0 (e.g., mem = [False] * 767359). Let a 8puzzle pattern be mapped to a python string (e.g., '125346987'). Then, the hash function is determined by:
def getPosition( input_str ):
data = []
opts = range(1,10)
n = int(input_str[0])
opts.pop(opts.index(n))
for c in input_str[1:len(input_str)-1]:
k = opts.index(int(c))
opts.pop(k)
data.append(k)
ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
output_str = str(ind)+str(n)
output = int(output_str)
return output
I.e., in order to check if a 8puzzle pattern = 125346987 has already been used, we need to:
pattern = '125346987'
pos = getPosition(pattern)
used = mem[pos-1] #mem starts in 0, getPosition in 1.
With a perfect hashing we would have needed 9! bits to store the booleans. In this case we need 2x more (767359/9! = 2.11), but recall that it is not even 1Mb (barely 100KB).
Note that the function is easily invertible.
Check
I could prove you mathematically why this works and why there won't be any collision, but since this is a programming forum let's just run it for every possible permutation and check that all the hash values (positions) are indeed different:
def getPosition( input_str ):
data = []
opts = range(1,10)
n = int(input_str[0])
opts.pop(opts.index(n))
for c in input_str[1:len(input_str)-1]:
k = opts.index(int(c))
opts.pop(k)
data.append(k)
ind = data[3]<<14 | data[5]<<12 | data[2]<<9 | data[1]<<6 | data[0]<<3 | data[4]<<1 | data[6]<<0
output_str = str(ind)+str(n)
output = int(output_str)
return output
#CHECKING PURPOSES
def addperm(x,l):
return [ l[0:i] + [x] + l[i:] for i in range(len(l)+1) ]
def perm(l):
if len(l) == 0:
return [[]]
return [x for y in perm(l[1:]) for x in addperm(l[0],y) ]
#We generate all the permutations
all_perms = perm([ i for i in range(1,10)])
print "Number of all possible perms.: "+str(len(all_perms)) #indeed 9! = 362880
#We execute our hash function over all the perms and store the output.
all_positions = [];
for permutation in all_perms:
perm_string = ''.join(map(str,permutation))
all_positions.append(getPosition(perm_string))
#We wan't to check if there has been any collision, i.e., if there
#is one position that is repeated at least twice.
print "Number of different hashes: "+str(len(set(all_positions)))
#also 9!, so the hash works properly.
How does it work?
The idea behind this has to do with a tree: at the beginning it has 9 branches going to 9 nodes, each corresponding to a digit. From each of these nodes we have 8 branches going to 8 nodes, each corresponding to a digit except its parent, then 7, and so on.
We first store the first digit of our input string in a separate variable and pop it out from our 'node' list, because we have already taken the branch corresponding to the first digit.
Then we have 8 branches, we choose the one corresponding with our second digit. Note that, since there are 8 branches, we need 3 bits to store the index of our chosen branch and the maximum value it can take is 111 for the 8th branch (we map branch 1-8 to binary 000-111). Once we have chosen and store the branch index, we pop that value out, so that the next node list doesn't include again this digit.
We proceed in the same way for branches 7, 6 and 5. Note that when we have 7 branches we still need 3 bits, though the maximum value will be 110. When we have 5 branches, the index will be at most binary 100.
Then we get to 4 branches and we notice that this can be stored just with 2 bits, same for 3 branches. For 2 branches we will just need 1bit, and for the last branch we don't need any bit: there will be just one branch pointing to the last digit, which will be the remaining from our 1-9 original list.
So, what we have so far: the first digit stored in a separated variable and a list of 7 indexes representing branches. The first 4 indexes can be represented with 3bits, the following 2 indexes can be represented with 2bits and the last index with 1bit.
The idea is to concatenate all this indexes in their bit form to create a larger number. Since we have 17bits, this number will be at most 2^17=131072. Now we just add the first digit we had stored to the end of that number (at most this digit will be 9) and we have that the biggest number we can create is 1310729.
But we can do better: recall that when we had 5 branches we needed 3 bits, though the maximum value was binary 100. What if we arrange our bits so that those with more 0s come first? If so, in the worst case scenario our final bit number will be the concatenation of:
100 10 101 110 111 11 1
Which in decimal is 76735. Then we proceed as before (adding the 9 at the end) and we get that our biggest possible generated number is 767359, which is the ammount of bits we need and corresponds to input string 987654321, while the lowest possible number is 1 which corresponds to input string 123456789.
Just to finish: one might wonder why have we stored the first digit in a separate variable and added it at the end. The reason is that if we had kept it then the number of branches at the beginning would have been 9, so for storing the first index (1-9) we would have needed 4 bits (0000 to 1000). which would have make our mapping much less efficient, as in that case the biggest possible number (and therefore the amount of memory needed) would have been
1000 100 10 101 110 111 11 1
which is 1125311 in decimal (1.13Mb vs 768Kb). It is quite interesting to see that the ratio 1.13M/0.768K = 1.47 has something to do with the ratio of the four bits compared to just adding a decimal value (2^4/10 = 1.6) which makes a lot of sense (the difference is due to the fact that with the first approach we are not fully using the 4 bits).
First. There is nothing faster than a list of booleans. There's a total of 9! == 362880 possible permutations for your task, which is a reasonably small amount of data to store in memory:
visited_states = [False] * math.factorial(9)
Alternatively, you can use array of bytes which is slightly slower (not by much though) and has a much lower memory footprint (by a power of magnitude at least). However any memory savings from using an array will probably be of little value considering the next step.
Second. You need to convert your specific permutation to it's index. There are algorithms which do this, one of the best StackOverflow questions on this topic is probably this one:
Finding the index of a given permutation
You have fixed permutation size n == 9, so whatever complexity an algorithm has, it will be equivalent to O(1) in your situation.
However to produce even faster results, you can pre-populate a mapping dictionary which will give you an O(1) lookup:
all_permutations = map(lambda p: ''.join(p), itertools.permutations('123456789'))
permutation_index = dict((perm, index) for index, perm in enumerate(all_permutations))
This dictionary will consume about 50 Mb of memory, which is... not that much actually. Especially since you only need to create it once.
After all this is done, checking your specific combination is done with:
visited = visited_states[permutation_index['168249357']]
Marking it to visited is done in the same manner:
visited_states[permutation_index['168249357']] = True
Note that using any of permutation index algorithms will be much slower than mapping dictionary. Most of those algorithms are of O(n2) complexity and in your case it results 81 times worse performance even discounting the extra python code itself. So unless you have heavy memory constraints, using mapping dictionary is probably the best solution speed-wise.
Addendum. As has been pointed out by Palec, visited_states list is actually not needed at all - it's perfectly possible to store True/False values directly in the permutation_index dictionary, which saves some memory and an extra list lookup.
Notice if you type hash(125346987) it returns 125346987. That is for a reason, because there is no point in hashing an integer to anything other than an integer.
What you should do, is when you find a pattern add it to a dictionary rather than a list. This will provide the fast lookup you need rather than traversing the list like you are doing now.
So say you find the pattern 125346987 you can do:
foundPatterns = {}
#some code to find the pattern
foundPatterns[1] = 125346987
#more code
#test if there?
125346987 in foundPatterns.values()
True
If you must always have O(1), then seems like a bit array would do the job. You'd only need to store 363,000 elements, which seems doable. Though note that in practice it's not always faster. Simplest implementation looks like:
Create data structure
visited_bitset = [False for _ in xrange(373000)]
Test current state and add if not visited yet
if !visited[current_state]:
visited_bitset[current_state] = True
Paul's answer might work.
Elisha's answer is perfectly valid hash function that would guarantee that no collision happen in the hash function. The 9! would be a pure minimum for a guaranteed no collision hash function, but (unless someone corrects me, Paul probably has) I don't believe there exists a function to map each board to a value in the domain [0, 9!], let alone a hash function that is nothing more that O(1).
If you have a 1GB of memory to support a Boolean array of 864197532 (aka 987654321-12346789) indices. You guarantee (computationally) the O(1) requirement.
Practically (meaning when you run in a real system) speaking this isn't going to be cache friendly but on paper this solution will definitely work. Even if an perfect function did exist, doubt it too would be cache friendly either.
Using prebuilts like set or hashmap (sorry I haven't programmed Python in a while, so don't remember the datatype) must have an amortized 0(1). But using one of these with a suboptimal hash function like n % RANDOM_PRIME_NUM_GREATER_THAN_100000 might give the best solution.

How to make this nested summation efficient in python?

I have a problem making an efficient function to be used in a rootfinding algorithm. I need to make a function that is a triple summation over lables, which contains a triple summation over a pair of lists. I have already tried several implementations such as using nested lists, dictionaries, splitting the inner triple summation in two double summations (with an intermediate lists), keeping the 'm' & 'n' dictionaries/lists apart, using itertools.izip() to go over R and E together and probably a few others I'm forgetting.
The idea is that I need to be able to discriminate the labels for other functions, so I need an efficient way to store, access and sum over these sets of numbers.
Now, this function is part of an iteration, so the first time, most of these lists are empty. Then I need to use a rootfinding algorithm(pretty simple) for each value in the E-dictionary. After using a root finding algorithm (which depends on this function), the lists are refilled with its solutions. This means that in the second iteration each list will contain about the order of 1000 numbers. After that the rootfinding is again used with this new function, giving (after rebinning) 1000 new numbers in each list.
Clearly I have a problem if this rootfinding already takes several minutes in the first iteration. I have a specific implementation for the first iteration (which is reduced to 3 loops over the labels because of all the empty/only one value in list stuff) which finds all these roots in two seconds.
How can I do this summation efficiently, while still being able to discriminate between the various lists?
Thanks in advance
Note: this code is not my most beautifull attempt, but it is the most clear in what I'm trying to accomplish.
E = {}
R = {}
for i in labels:
E[i] = {'m':[i], 'n':[]} #the label happens to be the value that's in here
E[-i] = {'m':[], 'n':[-i]}
R[i] = {'m':[1], 'n':[]}
R[-i] = {'m':[], 'n':[1]}
def function(A, En):
temp = 0
for a in E:
if (not(ACTIVATE) or a != A):
for b in E:
for c in E:
if ( not(ACTIVATE) or b != c):
for i in xrange(len(R[a]['n'])):
for j in xrange(len(R[b]['m'])):
for k in xrange(len(R[c]['m'])):
temp += R[a]['n'][i]*R[b]['m'][j]*R[c]['m'][k]/(En - (-E[a]['n'][i] + E[b]['m'][j] + E[c]['m'][k]))
for i in xrange(len(R[a]['m'])):
for j in xrange(len(R[b]['n'])):
for k in xrange(len(R[c]['n'])):
temp += R[a]['m'][i]*R[b]['n'][j]*R[c]['n'][k]/(En + (E[a]['m'][i] - E[b]['n'][j] - E[c]['n'][k]))
return .5*temp

Permute a string to print all possible words

The code that i have written seems to be looking bad with asymptotic measure of running time and space
I am getting
T(N) = T(N-1)*N + O((N-1!)*N) where N is the size of input. I need advise to optimize it
Since it is an algorithm based interview question we are required to implement the logic in most efficient way without using any libraries
Here is my code
def str_permutations(str_input,i):
if len(str_input) == 1:
return [str_input]
comb_list = []
while i < len(str_input):
key = str_input[i]
if i+1 != len(str_input):
remaining_str = "".join((str_input[0:i],str_input[i+1:]))
else:
remaining_str = str_input[0:i]
all_combinations = str_permutations(remaining_str,0)
for index,value in enumerate(all_combinations):
all_combinations[index] = "".join((key,value))
comb_list.extend(all_combinations)
i = i+1
return comb_list
As I mentioned in a comment to the question, in the general case you won't get below exponential complexity since for n distinct characters, there are n! permutations of the input string, and O(2n) is a subset of O(n!).
Now the following won't improve the asymptotic complexity for the general case, but you can optimize the brute-force approach of producing all permutations for strings that have some characters with multiple occurrences. Take for example the string daedoid; if you blindly produce all permutations of it, you'll get every permutation 6 = 3! times since you have three occurrences of d. You can avoid that by first eliminating multiple occurrences of the same letter and instead remembering how often to use each letter. So if there is a letter c that has kc occurrences, you'll save kc! permutations. So in total, this saves you a factor of "product over kc! for all c".
If you don't need to write your own, see itertools.permutations and combinations.

Python: efficiently check if integer is within *many* ranges

I am working on a postage application which is required to check an integer postcode against a number of postcode ranges, and return a different code based on which range the postcode matches against.
Each code has more than one postcode range. For example, the M code should be returned if the postcode is within the ranges 1000-2429, 2545-2575, 2640-2686 or is equal to 2890.
I could write this as:
if 1000 <= postcode <= 2429 or 2545 <= postcode <= 2575 or 2640 <= postcode <= 2686 or postcode == 2890:
return 'M'
but this seems like a lot of lines of code, given that there are 27 returnable codes and 77 total ranges to check against. Is there a more efficient (and preferably more concise) method of matching an integer to all these ranges using Python?
Edit: There's a lot of excellent solutions flying around, so I have implemented all the ones that I could, and benchmarked their performances.
The environment for this program is a web service (Django-powered actually) which needs to check postcode region codes one-by-one, on the fly. My preferred implementation, then, would be one that can be quickly used for each request, and does not need any process to be kept in memory, or needs to process many postcodes in bulk.
I tested the following solutions using timeit.Timer with default 1000000 repetitions using randomly generated postcodes each time.
IF solution (my original)
if 1000 <= postcode <= 2249 or 2555 <= postcode <= 2574 or ...:
return 'M'
if 2250 <= postcode <= 2265 or ...:
return 'N'
...
Time for 1m reps: 5.11 seconds.
Ranges in tuples (Jeff Mercado)
Somewhat more elegant to my mind and certainly easier to enter and read the ranges. Particularly good if they change over time, which is possible. But it did end up four times slower in my implementation.
if any(lower <= postcode <= upper for (lower, upper) in [(1000, 2249), (2555, 2574), ...]):
return 'M'
if any(lower <= postcode <= upper for (lower, upper) in [(2250, 2265), ...]):
return 'N'
...
Time for 1m reps: 19.61 seconds.
Set membership (gnibbler)
As stated by the author, "it's only better if you are building the set once to check against many postcodes in a loop". But I thought I would test it anyway to see.
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((1000, 2249), (2555, 2574), ...)))):
return 'M'
if postcode in set(chain(*(xrange(start, end+1) for start, end in ((2250, 2265), ...)))):
return 'N'
...
Time for 1m reps: 339.35 seconds.
Bisect (robert king)
This one may have been a bit above my intellect level. I learnt a lot reading about the bisect module but just couldn't quite work out which parameters to give find_ge() to make a runnable test. I expect that it would be extremely fast with a loop of many postcodes, but not if it had to do the setup each time. So, with 1m repetitions of filling numbers, edgepairs, edgeanswers etc for just one postal region code (the M code with four ranges), but not actually running the fast_solver:
Time for 1m reps: 105.61 seconds.
Dict (sentinel)
Using one dict per postal region code pre-generated, cPickled in a source file (106 KB), and loaded for each run. I was expecting much better performance from this method, but on my system at least, the IO really destroyed it. The server is a not-quite-blindingly-fast-top-of-the-line Mac Mini.
Time for 1m reps: 5895.18 seconds (extrapolated from a 10,000 run).
The summary
Well, I was expecting someone to just give a simple 'duh' answer that I hadn't considered, but it turns out this is much more complicated (and even a little controversial).
If every nanosecond of efficiency counted in this case, I would probably keep a separate process running which implemented one of the binary search or dict solutions and kept the result in memory for an extremely fast lookup. However, since the IF tree takes only five seconds to run a million times, which is plenty fast enough for my small business, that's what I'll end up using in my application.
Thank you to everyone for contributing!
You can throw your ranges into tuples and put the tuples in a list. Then use any() to help you find if your value is within these ranges.
ranges = [(1000,2429), (2545,2575), (2640,2686), (2890, 2890)]
if any(lower <= postcode <= upper for (lower, upper) in ranges):
print('M')
Probably the fastest will be to check the membership of a set
>>> from itertools import chain
>>> ranges = ((1000, 2429), (2545, 2575), (2640, 2686), (2890, 2890))
>>> postcodes = set(chain(*(xrange(start, end+1) for start, end in ranges)))
>>> 1000 in postcodes
True
>>> 2500 in postcodes
False
But it does use more memory this way, and building the set takes time, so it's only better if you are building the set once to check against many postcodes in a loop
EDIT: seems that different ranges need to map to different letters
>>> from itertools import chain
>>> ranges = {'M':((1000,2429), (2545,2575), (2640,2686), (2890, 2890)),
# more ranges
}
>>> postcodemap = dict((k,v) for v in ranges for k in chain(*imap(xrange, *zip(*ranges[v]))))
>>> print postcodemap.get(1000)
M
>>> print postcodemap.get(2500)
None
Here is a fast and short solution, using numpy:
import numpy as np
lows = np.array([1, 10, 100]) # the lower bounds
ups = np.array([3, 15, 130]) # the upper bounds
def in_range(x):
return np.any((lows <= x) & (x <= ups))
Now for instance
in_range(2) # True
in_range(23) # False
you only have to solve for edge cases and for one number between edge cases when doing inequalities.
e.g. if you do the following tests on TEN:
10 < 20, 10 < 15, 10 > 8, 10 >12
It will give True True True False
but note that the closest numbers to 10 are 8 and 12
this means that 9,10,11 will give the answers that ten did.. if you don't have too many initial range numbers and they are sparse then this well help. Otherwise you will need to see if your inequalities are transitive and use a range tree or something.
So what you can do is sort all of your boundaries into intervals.
e.g. if your inequalities had the numbers 12, 50, 192,999
you would get the following intervals that ALL have the same answer:
less than 12, 12, 13-49, 50, 51-191, 192, 193-998, 999, 999+
as you can see from these intervals we only need to solve for 9 cases and we can then quickly solve for anything.
Here is an example of how I might carry it out for solving for a new number x using these pre-calculated results:
a) is x a boundary? (is it in the set)
if yes, then return the answer you found for that boundary previously.
otherwise use case b)
b) find the maximum boundary number that is smaller than x, call it maxS
find the minimum boundary number that is larger than x call it minL.
Now just return any previously found solution that was between maxS and minL.
see Python binary search-like function to find first number in sorted list greater than a specific value
for finding closest numbers. bisect module will help (import it in your code)
This will help finding maxS and minL
You can use bisect and the function i have included in my sample code:
def find_ge(a, key):
'''Find smallest item greater-than or equal to key.
Raise ValueError if no such item exists.
If multiple keys are equal, return the leftmost.
'''
i = bisect_left(a, key)
if i == len(a):
raise ValueError('No item found with key at or above: %r' % (key,))
return a[i]
ranges=[(1000,2429), (2545,2575), (2640,2686), (2890, 2890)]
numbers=[]
for pair in ranges:
numbers+=list(pair)
numbers+=[-999999,999999] #ensure nothing goes outside the range
numbers.sort()
edges=set(numbers)
edgepairs={}
for i in range(len(numbers)-1):
edgepairs[(numbers[i],numbers[i+1])]=(numbers[i+1]-numbers[i])//2
def slow_solver(x):
return #your answer for postcode x
listedges=list(edges)
edgeanswers=dict(zip(listedges,map(solver,listedges)))
edgepairsanswers=dict(zip(edgepairs.keys(),map(solver,edgepairs.values())))
#now we are ready for fast solving:
def fast_solver(x):
if x in edges:
return edgeanswers[x]
else:
#find minL and maxS using find_ge and your own similar find_le
return edgepairsanswers[(minL,maxS)]
The full data isn't there, but I'm assuming the ranges are non-overlapping, so you can express your ranges as a single sorted tuple of ranges, along with their codes:
ranges = (
(1000, 2249, 'M'),
(2250, 2265, 'N'),
(2555, 2574, 'M'),
# ...
)
This means we can binary search over them in one go. This should be O(log(N)) time, which should result in pretty decent performance with very large sets.
def code_lookup(value, ranges):
left, right = 0, len(ranges)
while left != right - 1:
mid = left + (right - left) // 2
if value <= ranges[mid - 1][1]: # Check left split max
right = mid
elif value >= ranges[mid][0]: # Check right split min
left = mid
else: # We are in a gap
return None
if ranges[left][0] <= value <= ranges[left][1]:
# Return the code
return ranges[left][2]
I don't have your exact values, but for comparison I ran it against some generated ranges (77 ranges with various codes) and compared it to a naive approach:
def get_code_naive(value):
if 1000 < value < 2249:
return 'M'
if 2250 < value < 2265:
return 'N'
# ...
The result for 1,000,000 was that the naive version ran in about 5 sec and the binary search version in 4 sec. So it's a bit faster (20%), the codes are a lot nicer to maintain and the longer the list gets, the more it will out-perform the naive method over time.
Recently I had a similar requirement and I used bit manipulation to test if an integer belongs to said range. It is definitely faster, but I guess not suitable if your ranges involve huge numbers. I liberally copied example methods from here
First we create a binary number which will have all bits in the range set to 1.
#Sets the bits to one between lower and upper range
def setRange(permitRange, lower, upper):
# the range is inclusive of left & right edge. So add 1 upper limit
bUpper = 1 << (upper + 1)
bLower = 1 << lower
mask = bUpper - bLower
return (permitRange | mask)
#For my case the ranges also include single integers. So added method to set single bits
#Set individual bits to 1
def setBit(permitRange, number):
mask = 1 << vlan
return (permitRange| mask)
Now time to parse the range and populate our binary mask. If the highest number in the range is n, we will be creating integer greater than 2^n in binary
#Example range (10-20, 25, 30-50)
rangeList = "10-20, 25, 30-50"
maxRange = 100
permitRange = 1 << maxRange
for range in rangeList.split(","):
if range.isdigit():
permitRange = setBit(permitRange, int(range))
else:
lower, upper = range.split("-",1)
permitRange = setRange(permitRange, int(lower), int(upper))
return permitRange
To check if a number 'n' belongs to the range, simply test the bit at n'th position
#return a non-zero result, 2**offset, if the bit at 'offset' is one.
def testBit(permitRange, number):
mask = 1 << number
return (permitRange & mask)
if testBit(permitRange,10):
do_something()
Warning - This is probably premature optimisation. For a large list of ranges it might be worthwhile, but probably not in your case. Also, although dictionary/set solutions will use more memory, they are still probably a better choice.
You could do a binary-search into your range end-points. This would be easy if all ranges are non-overlapping, but could still be done (with some tweaks) for overlapping ranges.
Do a find-highest-match-less-than binary search. This is the same as a find-lowest-match-greater-than-or-equal (lower bound) binary search, except that you subtract one from the result.
Use half-open items in your list of end points - that is if your range is 1000..2429 inclusive, use the values 1000 and 2430. If you get an end-point and a start-point with the same value (two ranges touching, so there is no gap between) exclude the end-point for the lower range from your list.
If you find a start-of-range end-point, your goal value is within that range. If you find an end-of-range end-point, your goal value isn't in any range.
The binary search algorithm is roughly (don't expect this to run without editing)...
while upperbound > lowerbound :
testpos = lowerbound + ((upperbound-lowerbound) // 2)
if item [testpos] > goal :
# new best-so-far
upperbound = testpos
else :
lowerbound = testpos + 1
Note - the "//" division operator is necessary for integer division in Python 3. In Python 2, the normal "/" will work, but it's best to be ready for Python 3.
At the end, both upperbound and lowerbound point to the found item - but for the "upper bound" search. Subtract one to get the required search result. If that gives -1, there is no matching range.
There's probably a binary search routine in the library that does the upper-bound search, so prefer that to this if so. To get a better understanding of how the binary search works, see How can I better understand the one-comparison-per-iteration binary search? - no, I'm not above begging for upvotes ;-)
Python has a range(a, b) function which means the range from (and including) a, to (but excluding) b. You can make a list of these ranges and check to see if a number is in any of them. It may be more efficient to use xrange(a, b) which has the same meaning but doesn't actually make a list in memory.
list_of_ranges = []
list_of_ranges.append(xrange(1000, 2430))
list_of_ranges.append(xrange(2545, 2576))
for x in [999, 1000, 2429, 2430, 2544, 2545]:
result = False
for r in list_of_ranges:
if x in r:
result = True
break
print x, result
A bit of a silly approach to an old question, but I was curious how well regex character classes would handle the problem, since this exact problem occurs frequently in questions about character validity.
To make a regex for the "M" postal codes you showed, we can turn the numbers into unicode using chr():
m_ranges = [(1000, 2249), (2545, 2575), (2640, 2686)]
m_singletons = [2890]
m_range_char_class_members = [fr"{chr(low)}-{chr(high)}" for (low, high) in m_ranges]
m_singleton_char_class_members = [fr"{chr(x)}" for x in m_singletons]
m_char_class = f"[{''.join(m_range_char_class_members + m_singleton_char_class_members)}]"
m_regex = re.compile(m_char_class)
Then a very rough benchmark on 1 million random postal codes for this method vs your original if-statement:
test_values = [random.randint(1000, 9999) for _ in range(1000000)]
def is_m_regex(num):
return m_regex.match(chr(num))
def is_m_if(num):
return 1000 <= num <= 2249 or 2545 <= num <= 2575 or 2640 <= num <= 2686 or num == 2890
def run_regex_test():
start_time = time.time()
for i in test_values:
is_m_regex(i)
print("--- REGEX: %s seconds ---" % (time.time() - start_time))
def run_if_test():
start_time = time.time()
for i in test_values:
is_m_if(i)
print("--- IF: %s seconds ---" % (time.time() - start_time))
...
running regex test
--- REGEX: 0.3418138027191162 seconds ---
--- IF: 0.19183707237243652 seconds ---
So this would suggest that for comparing one character at a time, using raw if statements is faster than character classes in regexes. No surprise here, since using regex is a bit silly for this problem.
BUT. When doing a an operation like sub to eliminate all matches from a string composed of all the original test values, it ran much quicker:
blob_value = ''.join([chr(x) for x in test_values])
def run_regex_test_char_blob():
start_time = time.time()
subbed = m_regex.sub('', blob_value)
print("--- REGEX BLOB: %s seconds ---" % (time.time() - start_time))
print(f"original blob length : {len(blob_value)}")
print(f"sub length : {len(subbed)}")
...
--- REGEX BLOB: 0.03655815124511719 seconds ---
original blob length : 1000000
sub length : 851928
The sub method here replaces all occurrences of M-postal-characters (~15% of this sample), which means it operated on all 1 million characters of the string. That would suggest to me that mass operations by the re package are MUCH more efficient than individual operations suggested in these answers. So if you've really got a lot of comparisons to do at once in a data pipeline, you may find the best performance by doing some string composition and using regex.
In Python 3.2 functools.lru_cache was introduced.
Your solution along with the beforementioned decorator should be pretty fast.
Or, Python 3.9's functools.cache could be used as well (which should be even faster).
Have you really made benchmarks? Does the performance of this piece of code influence the performance of the overall application? So benchmark first!
But you can also use a dict e.g. for storing all keys of the "M" ranges:
mhash = {1000: true, 1001: true,..., 2429: true,...}
if postcode in mhash:
print 'M'
Of course: the hashes require more memory but access time is O(1).

Categories

Resources