Permute a string to print all possible words - python

The code I have written looks bad in terms of asymptotic running time and space.
I am getting
T(N) = T(N-1)*N + O((N-1)! * N), where N is the size of the input. I need advice on optimizing it.
Since this is an algorithm-based interview question, we are required to implement the logic in the most efficient way without using any libraries.
Here is my code:
def str_permutations(str_input,i):
    if len(str_input) == 1:
        return [str_input]
    comb_list = []
    while i < len(str_input):
        key = str_input[i]
        if i+1 != len(str_input):
            remaining_str = "".join((str_input[0:i],str_input[i+1:]))
        else:
            remaining_str = str_input[0:i]
        all_combinations = str_permutations(remaining_str,0)
        for index,value in enumerate(all_combinations):
            all_combinations[index] = "".join((key,value))
        comb_list.extend(all_combinations)
        i = i+1
    return comb_list

As I mentioned in a comment to the question, in the general case you won't get below exponential complexity, since for n distinct characters there are n! permutations of the input string, and O(2^n) is a subset of O(n!).
Now the following won't improve the asymptotic complexity for the general case, but you can optimize the brute-force approach of producing all permutations for strings in which some characters occur more than once. Take for example the string daedoid; if you blindly produce all permutations of it, you'll get every permutation 6 = 3! times, since it has three occurrences of d. You can avoid that by first eliminating multiple occurrences of the same letter and instead remembering how often to use each letter. So if a letter c has k_c occurrences, you save a factor of k_c! for that letter. In total, this saves you a factor of "product of k_c! over all letters c".
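A minimal sketch of that idea, using only a plain dict so that it stays within the "no libraries" constraint. The name unique_permutations and the helper extend are hypothetical, chosen for illustration; this is one possible way to write it, not the only one.

```python
def unique_permutations(s):
    # Remember how often each letter may still be used.
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    letters = sorted(counts)  # each distinct letter appears once here
    result = []
    prefix = []

    def extend():
        # A full-length prefix is one finished permutation.
        if len(prefix) == len(s):
            result.append("".join(prefix))
            return
        # Try each distinct letter at most once per position,
        # so equal letters never produce duplicate permutations.
        for ch in letters:
            if counts[ch] > 0:
                counts[ch] -= 1
                prefix.append(ch)
                extend()
                prefix.pop()
                counts[ch] += 1

    extend()
    return result
```

For daedoid this produces 7!/3! = 840 strings instead of 7! = 5040, which is exactly the factor described above.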

If you don't need to write your own, see itertools.permutations and combinations.
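For reference, a short sketch of what the itertools route looks like. The wrapper name str_permutations_itertools is made up for this example; permutations() itself treats equal characters at different positions as distinct, so repeated letters still yield repeated strings.

```python
from itertools import permutations

def str_permutations_itertools(s):
    # permutations() yields tuples of characters; join each back into a string.
    return ["".join(p) for p in permutations(s)]

words = str_permutations_itertools("abc")
print(words)
```

Wrapping the result in set(...) removes duplicates for inputs with repeated letters, at the cost of still generating all n! tuples first.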

Related

Trying to calculate algorithm time complexity

So last night I solved this LeetCode question. My solution is not great, quite slow. So I'm trying to calculate the complexity of my algorithm to compare with standard algorithms that LeetCode lists in the Solution section. Here's my solution:
from typing import List  # needed for the List annotation when run standalone

class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        # Get lengths of all strings in the list and get the minimum
        # since common prefix can't be longer than the shortest string.
        # Catch ValueError if list is empty
        try:
            min_len = min(len(i) for i in strs)
        except ValueError:
            return ''
        # split strings into sets character-wise
        foo = [set(list(zip(*strs))[k]) for k in range(min_len)]
        # Now go through the resulting list and check whether resulting sets have length of 1
        # If true then add those characters to the prefix list. Break as soon as encounter
        # a set of length > 1.
        prefix = []
        for i in foo:
            if len(i) == 1:
                x, = i
                prefix.append(x)
            else:
                break
        common_prefix = ''.join(prefix)
        return common_prefix
I'm struggling a bit with calculating complexity. First step - getting minimum length of strings - takes O(n) where n is number of strings in the list. Then the last step is also easy - it should take O(m) where m is the length of the shortest string.
But the middle bit is confusing. set(list(zip(*strs))) should hopefully take O(m) again and then we do it n times so O(mn). But then overall complexity is O(mn + m + n) which seems way too low for how slow the solution is.
The other option is that the middle step is O(m^2*n), which makes a bit more sense. What is the proper way to calculate complexity here?
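Almost: the middle portion would be O(mn) if zip(*strs) were evaluated once, and O(mn) dwarfs the O(m) and O(n) terms for large values of m and n. As written, though, the comprehension rebuilds list(zip(*strs)) for every k, so that line costs O(m^2*n), which is your second guess.
Once the zip is built only once, your solution has an ideal order of runtime complexity, O(mn).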
Optimize: Short-Circuit
However, you are probably dismayed that others have faster solutions. I suspect that others likely short-circuit on the first non-matching index.
Let's consider a test case of 26 strings (['a'*500, 'b'*500, 'c'*500, ...]). Your solution would proceed to create a list that is 500 long, with each entry containing a set of 26 elements. Meanwhile, if you short-circuited, you would only process the first index, ie one set of 26 characters.
Try changing your list into a generator. This might be all you need to short-circuit.
foo = (set(x) for x in zip(*strs))
You can skip min_len check because default behaviour of zip is to iterate only as long as the shortest input.
Optimize: Generating Intermediate Results
I see that you append each letter to a list, then ''.join(lst). This is efficient, especially compared to the alternative of iteratively appending to a string.
However, we could just as easily save a counter match_len. Then when we detect the first mis-match, just:
return strs[0][:match_len]
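Putting the two suggestions together, a possible sketch looks like this. It is written as a standalone function named longest_common_prefix (a name chosen for this example) rather than as the LeetCode Solution method, and it relies on zip stopping at the shortest string.

```python
def longest_common_prefix(strs):
    if not strs:
        return ""
    match_len = 0
    # zip(*strs) yields one column of characters per index and stops
    # at the end of the shortest string, so no min_len is needed.
    for column in zip(*strs):
        if len(set(column)) != 1:
            break  # short-circuit on the first mismatching index
        match_len += 1
    return strs[0][:match_len]
```

With the 26-string test case above, the loop inspects a single column and returns immediately.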

how to calculate the minimum unfairness sum of a list

I have tried to summarize the problem statement like this:
Given n, k and an array (a list) arr, where n = len(arr) and k is an integer in the range 1 to n inclusive.
For an array (or list) myList, the Unfairness Sum is defined as the sum of the absolute differences between all possible pairs (combinations with 2 elements each) in myList.
To explain: if mylist = [1, 2, 5, 5, 6], then its unfairness sum (loosely called MUS below) is as follows. Please note that elements are considered unique by their index in the list, not by their values.
MUS = |1-2| + |1-5| + |1-5| + |1-6| + |2-5| + |2-5| + |2-6| + |5-5| + |5-6| + |5-6| = 26
If you actually need to look at the problem statement, It's HERE
My Objective
given n, k, arr(as described above), find the Minimum Unfairness Sum out of all of the unfairness sums of sub arrays possible with a constraint that each len(sub array) = k [which is a good thing to make our lives easy, I believe :) ]
what I have tried
well, there is a lot to be added in here, so I'll try to be as short as I can.
My First approach was this where i used itertools.combinations to get all the possible combinations and statistics.variance to check its spread of data (yeah, I know I'm a mess).
Before you see the code below, Do you think these variance and unfairness sum are perfectly related (i know they are strongly related) i.e. the sub array with minimum variance has to be the sub array with MUS??
You only have to check the LetMeDoIt(n, k, arr) function. If you need MCVE, check the second code snippet below.
from itertools import combinations as cmb
from statistics import variance as varn
def LetMeDoIt(n, k, arr):
    v = []
    s = []
    subs = [list(x) for x in list(cmb(arr, k))]  # getting all sub arrays from arr in a list
    i = 0
    for sub in subs:
        if i != 0:
            var = varn(sub)  # the variance thingy
            if float(var) < float(min(v)):
                v.remove(v[0])
                v.append(var)
                s.remove(s[0])
                s.append(sub)
            else:
                pass
        elif i == 0:
            var = varn(sub)
            v.append(var)
            s.append(sub)
            i = 1
    final = []
    f = list(cmb(s[0], 2))  # getting list of all pairs (after determining sub array with least MUS)
    for r in f:
        final.append(abs(r[0]-r[1]))  # calculating the MUS in my messy way
    return sum(final)
The above code works fine for n<30 but raised a MemoryError beyond that.
In Python chat, Kevin suggested that I try a generator, which is memory efficient (it really is), but since a generator also produces those combinations on the fly as we iterate over them, it was estimated to take over 140 hours (:/) for n=50, k=8.
I posted the same as a question on SO HERE (you might wanna have a look to understand me properly - it has discussions and an answer by fusion which takes me to my second approach - a better one(i should say fusion's approach xD)).
Second Approach
from itertools import combinations as cmb
def myvar(arr):  # a function to calculate variance
    l = len(arr)
    m = sum(arr)/l
    return sum((i-m)**2 for i in arr)/l
def LetMeDoIt(n, k, arr):
    sorted_list = sorted(arr)  # i think sorting the array makes it easy to get the sub array with MUS quickly
    variance = None
    min_variance_sub = None
    for i in range(n - k + 1):
        sub = sorted_list[i:i+k]
        var = myvar(sub)
        if variance is None or var<variance:
            variance = var
            min_variance_sub=sub
    final = []
    f = list(cmb(min_variance_sub, 2))  # again getting all possible pairs in my messy way
    for r in f:
        final.append(abs(r[0] - r[1]))
    return sum(final)
def MainApp():
    n = int(input())
    k = int(input())
    arr = list(int(input()) for _ in range(n))
    result = LetMeDoIt(n, k, arr)
    print(result)
if __name__ == '__main__':
    MainApp()
This code works perfect for n up to 1000 (maybe more), but terminates due to time out (5 seconds is the limit on online judge :/ ) for n beyond 10000 (the biggest test case has n=100000).
=====
How would you approach this problem to take care of all the test cases within the given time limit (5 sec)? (The problem was listed under algorithms & dynamic programming.)
For your reference, you can have a look at:
successful submissions (py3, py2, C++, java) on this problem by other candidates, so that you can explain that approach for me and future visitors
an editorial by the problem setter explaining how to approach the question
a solution code by the problem setter himself (py2, C++)
input data (test cases) and expected output
Edit1 ::
For future visitors of this question, the conclusions I have till now are,
that variance and unfairness sum are not perfectly related (they are strongly related) which implies that among a lots of lists of integers, a list with minimum variance doesn't always have to be the list with minimum unfairness sum. If you want to know why, I actually asked that as a separate question on math stack exchange HERE where one of the mathematicians proved it for me xD (and it's worth taking a look, 'cause it was unexpected)
As far as the question is concerned overall, you can read answers by archer & Attersson below (still trying to figure out a naive approach to carry this out - it shouldn't be far by now though)
Thank you for any help or suggestions :)
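To illustrate the first conclusion, here is a small check on two hand-picked lists. list_a and list_b are made-up examples for this sketch, not taken from the problem's test data, and population variance is used as in myvar above.

```python
from itertools import combinations

def unfairness_sum(values):
    # Sum of |a - b| over all index pairs.
    return sum(abs(a - b) for a, b in combinations(values, 2))

def population_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

list_a = [0, 0, 12]   # unfairness sum 24, variance 32
list_b = [0, 6, 13]   # unfairness sum 26, variance about 28.2

for name, values in (("a", list_a), ("b", list_b)):
    print(name, unfairness_sum(values), population_variance(values))
```

list_b has the smaller variance but the larger unfairness sum, so picking the sub array by minimum variance can pick the wrong one.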
You must work on your list SORTED and check only sublists of consecutive elements. This is because any sublist that skips an element of the sorted list cannot have a lower unfairness sum than a consecutive one.
For example if the list is
[1,3,7,10,20,35,100,250,2000,5000] and you want to check sublists of length 3, then the solution must be one of [1,3,7] [3,7,10] [7,10,20] etc.
Any other sublist, e.g. [1,3,10], will have a higher unfairness sum than [1,3,7], because 10>7, so all its differences with the rest of the elements are larger than those of 7.
The same goes for [1,7,10] (non-consecutive on the left side) compared with [3,7,10], as 1<3.
Given that, you only have to check the consecutive sublists of length k, which reduces the execution time significantly.
Regarding coding, something like this should work:
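import itertools

def myvar(array):
    # unfairness sum: sum of |a - b| over all pairs
    return sum(abs(a - b) for a, b in itertools.combinations(array, 2))

def minsum(n, k, arr):
    l = sorted(arr)  # only consecutive windows of the sorted list matter
    res = myvar(l[:k])  # start from the first window
    for i in range(1, n - k + 1):  # n-k+1 windows in total
        res = min(res, myvar(l[i:i+k]))
    return res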
I see this question still has no complete answer. I will outline a correct algorithm that will pass the judge. I will not write the code, in order to respect the purpose of the Hackerrank challenge, and since we already have working solutions.
1. The original array must be sorted. This has a complexity of O(NlogN).
2. At this point you only need to check consecutive sub arrays, as non-consecutive ones will result in a worse (or equal, but not better) "unfairness sum". This is also explained in archer's answer.
3. The last pass, to find the minimum "unfairness sum", can be done in O(N). You need to calculate the US for every consecutive k-long subarray. The mistake is recalculating it from scratch at every step, which takes O(k) and brings the complexity of this pass to O(k*N). Each step can be done in O(1), as the editorial you posted shows, including the mathematical formulae. It requires initializing a cumulative array after step 1 (done in O(N), with space complexity O(N) too).
It works but terminates due to time out for n<=10000.
(from comments on archer's question)
To explain step 3, think about k = 100. You are scrolling the N-long array and the first iteration, you must calculate the US for the sub array from element 0 to 99 as usual, requiring 100 passages. The next step needs you to calculate the same for a sub array that only differs from the previous by 1 element 1 to 100. Then 2 to 101, etc.
If it helps, think of it like a snake. One block is removed and one is added.
There is no need to perform the whole O(k) scrolling. Just figure the maths as explained in the editorial and you will do it in O(1).
So the final complexity will asymptotically be O(NlogN) due to the first sort.

why is my iterator implementation very inefficient?

I wrote the following python script to count the number of occurrences of a character (a) in the first n characters of an infinite string.
from itertools import cycle
def count_a(str_, n):
    count = 0
    str_ = cycle(str_)
    for i in range(n):
        if next(str_) == 'a':
            count += 1
    return count
My understanding of iterators is that they are supposed to be efficient, but this approach is super slow for very large n. Why is this so?
The cycle iterator might not be as efficient as you think, the documentation says
Make an iterator returning elements from the iterable and saving a
copy of each.
When the iterable is exhausted, return elements from the saved copy.
Repeats indefinitely
...Note, this member of the toolkit may require significant auxiliary
storage (depending on the length of the iterable).
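Why not simplify and not use the iterator at all? It adds unnecessary overhead and gives you no benefit. As long as n does not exceed len(str_), you can count the occurrences with a simple str_[:n].count('a'); for larger n the slice stops at the end of the string, so the repeats have to be accounted for separately.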
The first problem here is that despite using itertools, you're still doing explicit python-level for loop. To gain the C level speed boost when using itertools you want to keep all the iteration in the high speed itertools.
So let's do this step by step, first we want to get the number of characters in a finite string. To do this, you can use the itertools.islice method to get the first n characters in the string:
str_first_n_chars = islice(cycle(str_), n)
You next want to count the number of occurrences of the letter (a), to do this you can do some variation of either of these (you may want to experiment which variants is faster):
count_a = sum(1 for c in str_first_n_chars if c == 'a')
count_a = len(tuple(filter('a'.__eq__, str_first_n_chars)))
This is all well and good, but it is still slow for really large n, for example n = 10**10000, because you need to iterate through str_ many, many times. In other words, this algorithm is O(n).
There's one last improvement we can make. Notice that the number of (a) in str_ never changes from one repetition to the next. Rather than iterating through str_ multiple times for large n, we can be a little smarter with a bit of math so that we only need to iterate through str_ twice. First we count the number of (a) in a single stretch of str_:
count_a_single = str_.count('a')
Then we find out how many times we would have to iterate through str_ to get length n by using divmod function:
iter_count, remainder = divmod(n, len(str_))
then we can just multiply iter_count with count_a_single and add the number of (a) in the remaining length. We don't need cycle or islice and such here because remainder < len(str_)
count_a = iter_count * count_a_single + str_[:remainder].count('a')
With this method, the runtime performance of the algorithm grows only on the length of a single cycle of str_ rather than n. In other words, this algorithm is O(len(str_)).

Big O notation of simple anagram function

I've worked up the following code which finds anagrams. I had thought the big O notation for this was O(n) But was informed by my instructor that I am incorrect. I am confused on why this is not correct however, would anyone be able to offer any advice?
# Define an anagram.
def anagram(s1, s2):
    return sorted(s1) == sorted(s2)
# Main function.
def Question1(t, s):
    # use built in any function to check any anagram of t is substring of s
    return any(anagram(s[i: i+len(t)], t)
               for i in range(len(s)-len(t)+ 1))
Function Call:
# Simple test case.
print(Question1("app", "paple"))
# True
any anagram of t is substring of s
That's not what your code says.
You have "any substring of s is an anagram of t", which might be equivalent, but it's easier to understand that way.
As for complexity, you need to define what you're calling N... Is it len(s)-len(t)+ 1?
The function any() has complexity N, in that case, yes.
However, you've additionally called anagram over an input of T length, and you seem to have ignored that.
anagram calls sorted twice. Each call to sorted is closer to O(T * log(T)) itself assuming merge sort. You're also performing a list slice, so it could be slightly higher.
Let's say your complexity is somewhere on the order of (S-T) * 2 * (T * log(T)) where T and S are lengths of strings.
The answer depends on which string of your input is larger.
Best case is that they are the same length because then your range only has one element.
Big O notation is worst case, though, so you need to figure out which conditions generate the most complexity in terms of total operations. For example, what if T > S? Then len(s)-len(t)+ 1 will be non positive, so does the code run more or less than equal length strings? And what about S < T or S = 0?
This is not O(N), for a few reasons. First, sorted has O(n log n) complexity. Second, you potentially call it many times (sorting T and a slice of S for every window), if S is long enough.
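For comparison, here is a sketch of an alternative that avoids sorting every window by keeping character counts for a sliding window of s. This is a different technique from the code in the question, not a restatement of it, and has_anagram_substring is a hypothetical name. Each step updates two counts and compares two counters, so the cost is roughly O(S * A), where A is the number of distinct characters.

```python
from collections import Counter

def has_anagram_substring(t, s):
    k = len(t)
    if k == 0:
        return True   # matches the original code: the empty slice equals the empty t
    if k > len(s):
        return False  # no window of length k exists
    target = Counter(t)
    window = Counter(s[:k])
    if window == target:
        return True
    for i in range(k, len(s)):
        window[s[i]] += 1          # character entering the window
        old = s[i - k]
        window[old] -= 1           # character leaving the window
        if window[old] == 0:
            del window[old]        # keep the counter free of zero entries
        if window == target:
            return True
    return False

print(has_anagram_substring("app", "paple"))
```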

An efficient way to search similar words (with specified length) in two strings using python

My input is two strings of the same length and a number which represents the length of the common words I need to find in both strings. I wrote a very straightforward code to do so, and it works, but it is super super slow, considering the fact that each string is ~200K letters.
This is my code:
for i in range(len(X)):
    for j in range(len(Y)):
        if(X[i] == Y[j]):
            for k in range (kmer):
                if (X[i+k] == Y[j+k]):
                    count +=1
                else:
                    count=0
            if(count == int(kmer)):
                loc=(i,j)
                pos.append(loc)
            count=0
        if(Xcmp[i] == Y[j]):
            for k in range (kmer):
                if (Xcmp[i+k] == Y[j+k]):
                    count +=1
                else:
                    count=0
            if(count == int(kmer)):
                loc=(i,j)
                pos.append(loc)
            count=0
return pos
Where the first sequence is X and the second is Y and kmer is the length of the common words. (and when I say word, I just mean characters..)
I was able to create a X by kmer matrix (rather than the huge X by Y) but that's still very slow.
I also thought about using a trie, but thought that maybe it will take too long to populate it?
At the end I only need the positions of those common subsequences.
any ideas on how to improve my algorithm?
Thanks!! :)
Create a set of words like this
words = {X[i:i+kmer] for i in range(len(X)-kmer+1)}
for i in range(len(Y)-kmer+1):
    if Y[i:i+kmer] in words:
        print(Y[i:i+kmer])
This is fairly efficient as long as kmer isn't so large that you'd run out of memory for the set. I assume it isn't since you were creating a matrix that size already.
For the positions, create a dict instead of a set as Tim suggests
from collections import defaultdict
wordmap = defaultdict(list)
for i in range(len(X)-kmer+1):
    wordmap[X[i:i+kmer]].append(i)
for i in range(len(Y)-kmer+1):
    word = Y[i:i+kmer]
    if word in wordmap:
        print(word, wordmap[word], i)
The triple nested for loop gives you a runtime on the order of len(X) * len(Y) * kmer, which approaches n^3 as kmer grows, because you're literally comparing every pair of positions. Consider using a rolling hash. It has a linear average runtime and a worst case of n^2. It's best suited to finding substrings, which is more or less what you're doing. In this case you may end up closer to n^2, but that's still a big improvement over n^3.
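A rough sketch of the rolling-hash idea, in the style of Rabin-Karp. The function names, the base 257 and the modulus 2**61 - 1 are illustrative choices of mine, not something prescribed by the question; matching hashes are re-checked against the actual slices so that collisions cannot produce false positives.

```python
from collections import defaultdict

def rolling_hashes(text, k, base=257, mod=(1 << 61) - 1):
    """Yield (start, hash) for every length-k window of text."""
    if k <= 0 or k > len(text):
        return
    high = pow(base, k - 1, mod)  # weight of the character leaving the window
    h = 0
    for ch in text[:k]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    for start in range(1, len(text) - k + 1):
        h = (h - ord(text[start - 1]) * high) % mod        # drop the old character
        h = (h * base + ord(text[start + k - 1])) % mod    # append the new one
        yield start, h

def common_kmers_rolling(X, Y, k):
    # Map each window hash of X to the positions where it starts.
    table = defaultdict(list)
    for i, h in rolling_hashes(X, k):
        table[h].append(i)
    pos = []
    for j, h in rolling_hashes(Y, k):
        for i in table.get(h, ()):
            if X[i:i + k] == Y[j:j + k]:  # guard against hash collisions
                pos.append((i, j))
    return pos
```

Each window hash is derived from the previous one in constant time, so hashing both strings is linear; the total cost also depends on how many matching position pairs there are to report.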
