Code complexity Explanation | Powerset Generation

Code complexity Explanation | Powerset Generation - python

I am trying to understand difference/similarity in complexities when writing the code to generate powerset in 2 ways:
def powerset(s, i, cur):
if i==len(s):
print(cur)
return
powerset(s, i+1, cur+s[i]) # string addition is also possibly O(n^2) here?
powerset(s, i+1, cur)
powerset("abc", 0, "")
Output:
['abc', 'ab', 'ac', 'a', 'bc', 'b', 'c', '']
This is going into recursion, with 2 choices at each step (adding s[i] or not), creating 2 branches. Leading to 2^n and adding to the array/printing is another O(n), leading to O(n*2^n)
Also thinking about it in the terms of branches^depth = O(2^n)
The space complexity for this will be: O(n)? Considering max depth of the tree to go up to n by the above logic.
And with this:
s = "abc"
res = [""]
for i in s:
res += [j+i for j in res]
I get the same output.
But here, I see 2 for loops and the additional complexity for creating the strings -- which is possibly O(n^2) in Python. Leading to possible O(N^4) as opposed to O(n*2^n) in the solution above.
Space complexity here seems to me to be O(n) since we are reserving space for just the output. But no additional space, so overall: O(1)
Is my understanding for these solutions in time and space correct? I was under the impression that computing powerset is O(2^n). But I figured maybe its a more optimized solution? (even though the second solution seems more naive).
https://stackoverflow.com/questions/34008010/is-the-time-complexity-of-iterative-string-append-actually-on2-or-on
Here, they suggest using arrays to avoid the `O(n^2)` complexity of string concatenation.

It's clear that both of those solutions involving creating 2len(s) strings, which are all the subsets of s. The amount of time that takes is O(2len(s) * T) where T is the time to create and process each string. The processing time is impossible to know without seeing the code which consumes the subsets; if the processing consisting only of printing, then it's presumably O(len(s)) since printing has to be done character by character.
I think the core of your question is how much time constructing the string takes.
It's often said that constructing a string by repeated concatenation is O(n²), since the entire string needs to be copied in order to perform each concatenation. And that's basically true (although Python has some tricks up its sleeve which can optimise in certain cases). But that's assuming that you're constructing the string from scratch, tossing away the intermediate results as soon as they're no longer necessary.
In this case, however, the intermediate results are also part of the desired result. And you only copy the prefixes implicitly, when you append the next character. That means that the additional cost of producing each subset is O(n), not O(n²), because the cost of producing the immediate prefix was already accounted for. That's true in both of the solutions you provide.

Related

Compare strings in a list to another list of strings: pairwise string comparison vs check existence in set [duplicate]

This question already has answers here:
Can hash tables really be O(1)?
(10 answers)
Closed last year.
I'm comparing a list of strings words_to_lookup to a list of strings word_list and for every match I'll do something.
I wonder whether direct string comparison or checking existence in a set will be more time efficient.
String comparison is O(n)
for w in word_list:
for s in words_to_lookup:
if s == w:
# do something
Checking existence in a set is O(1) but getting the hash of the string (probably) takes O(n).
w_set = set(word_list)
for s in words_to_lookup:
if s in w_set:
# do something
So is string comparison an efficient approach? If the set approach is faster, how?
EDIT:
I was not thinking clearly when I first posted the question. Thank you for the comments. I find it hard to convert my real problem to a concise one suited for online discussions. Now I made the edit, the answer is obvious that the first approach is O(n^2) and the second approach is O(n).
My real question should have been this: Can hash tables really be O(1)?.
And the answer is:
The hash function however does not have to be O(m) - it can be O(1). Unlike a cryptographic hash, a hash function for use in a dictionary does not have to look at every bit in the input in order to calculate the hash. Implementations are free to look at only a fixed number of bits.

If you only need to search once, your first approach is more efficient.
One advantage of constructing a set is the following: if you need to search against the same set many times, you only need to build the set once.
In other words, suppose you have N words in the dictionary (dictionary_list) and you have a list of M words that you want to look up (words_to_lookup). If you go with the set approach, the complexity is O(N+M). If you don't build a set, the complexity is O(N*M) because you may have to go over the whole dictionary of N words for each of the M words that you are looking up.
For this problem, the following code is the more efficient approach.
w_set = set(dictionary_list)
for w in words_to_lookup:
if w in w_set:
# do something

EDIT
Ok. Now I see what you mean. In that case, the set version is definitely better. Note, you can also do:
for s in words_to_lookup:
if s in word_list:
# do something
That's the same thing as your set way, but the running time of the "in" operator will be worse.
list - Average: O(n)
set/dict - Average: O(1), Worst: O(n)
So the set way is probably best.

Time Complexity of Nested For Loop in Python

What is the time complexity of the following nested for loop please?
Edit. I think the answer to this question hinges on another question, to which I don't know if there is a "canonical" answer.
That question is whether the n in big-O expressions such as O(n), O(n^2) refers explicitly to an input parameter called n, or to a general value representing the size of the input.
Some of the answers given so far seem to contradict the answer given here: https://stackoverflow.com/a/23361893/3042018 I would appreciate some more clarity if possible.
for i in range(n):
for j in range(m):
print(i, j) # Output statement occurs n * m times.
I'm thinking O(n^2) as each loop is O(n), but I'm wondering if it might be O(nm), and if/whether these are in fact the same thing.

You have n loops by m loops. For example for each n loop you must do m loops . That means O(n) * O(m) equal O(n*m)

In your code, you introduce two variables, n and m. Without more information about them, the best we can do is say the time complexity is O(nm), as already stated by other answers.
You might be confused by the fact that other examples (such as in the link you posted) have more information that we can use. If we knew something else, for example that m = 2n, then we can simplify O(nm) = O(2n2) = O(n2), which might be a more informative answer for that question. However, it doesn't change the fact that your loop executes O(nm) times.
In your current algorithm description, there is nothing that tells us that we could simplify to O(n2) like you suggest. What if m is already n2? Then we get O(n3).
In general, the big-O notation does not place any requirements on what you are measuring. You decide yourself which variable or variables you want to express it in terms of. Whatever is useful to convey your message. Indeed, these variables often represent the size of the input when talking about the time complexity of algorithms, but they do not have to. They can be just numbers, given as input to the function. Or the number of bits required to represent a number given as input. As long as you define them in an unambiguous way and choose them such that they help convey your message, you're good.

For every element in list n the whole list m is iterated. So by this said, you have m * n printings and a time complexity of O(m * n)

The solution of this problem is o(n)
because here we are simple printing the value of i & J
if we do any Mathematical (addition, subtraction..etc then) answer for this problem would O(logn)
If we follow any type of complex iterative then answer would be O(n^2)

What is the time complexity of the below code in python?

Below is a code to remove the recurrent alphabets from a string in python. I would like to know the time complexity of this code. More specifically time complexity of line if string_1[i] not in char_found:. Searching in a list.
Also if possible can this be explained using space allocated by a list.
def remove_recorring_char(string_1):
result = ""
char_found = []
for i in range(0,len(string_1)):
if string_1[i] not in char_found:
char_found.append(string_1[i])
result = result+string_1[i]
return result
if __name__== "__main__":
print(remove_recorring_char("aabbbcc"))
print(remove_recorring_char("chdsgdsgggsggsjddaaxcvcj"))

if string_1[i] not in char_found:
This line does two things:
First, it accesses string_1[i]. That takes constant time, because strings are basically just arrays of characters.
Then it searches in a list char_found, comparing that character string_1[i] to each element until one matches. That takes (worst-case) linear time in the length of char_found. And, since char_found could (worst-case) be all of the characters in string_1[:i], that's linear in the length of string_1.
So, this line is O(N).
And of course this line is inside an outer loop that's even more obviously O(N): for i in range(0,len(string_1)):. So, that combination of the two is O(N**2).
Even if you fix that in test to be constant time, you also do result = result+string_1[i] inside the loop. String concatenation is worst-case linear in the length of the string. Recent versions of CPython and PyPy have some optimizations so it's sometimes amortized constant time, like appending to a list, but Python the language doesn't guarantee those optimizations. And result is, worst-case, also as long as string_1. So, the whole thing is still O(N**2), unless your interpreter is extra nice.
You could reduce the whole thing to O(N) by making two small changes.
First, use a set rather than a list for char_found. Searching a set, and adding to it, are both amortized constant-time operations.
Second, use a list rather than a str for result, then do result = ''.join(result) at the end. Appending to a list is amortized constant-time. Converting a list back to a string is of course linear time, but you're not doing it inside your loops, so that's fine.

How can you parallelize a regex search of one long string? [duplicate]

This question already has answers here:
How can I tell if a string repeats itself in Python?
(13 answers)
Closed 7 years ago.
I'm testing the output of a simulation to see if it enters a loop at some point, so I need to know if the output repeats itself. For example, there may be 400 digits, followed by a 400000 digit cycle. The output consists only of digits from 0-9. I have the following regex function that I'm using to match repetitions in a single long string:
def repetitions(s):
r = re.compile(r"(.+?)\1+")
for match in r.finditer(s):
if len(match.group(1)) > 1 and len(match.group(0))/len(match.group(1)) > 4:
yield (match.group(1), len(match.group(0))/len(match.group(1)))
This function works fantastically, but it takes far too long. My most recent test was 4 million digits, and it took 4.5 hours to search. It found no repetitions, so I now need to increase the search space. The code only concerns itself with subsequences that repeat themselves more than 4 times because I'm considering 5 repetitions to give a set that can be checked manually: the simulation will generate subsequences that will repeat hundreds of times. I'm running on a four core machine, and the digits to be checked are generated in real time. How can I increase the speed of the search?

Based on information given by nhahtdh in one of the other answers, some things have come to light.
First, the problem you are posing is called finding "tandem repeats" or "squares".
Second, the algorithm given in http://csiflabs.cs.ucdavis.edu/~gusfield/lineartime.pdf finds z tandem repeats in O(n log n + z) time and is "optimal" in the sense that there can be that many answers. You may be able to use parallelize the tandem searches, but I'd first do timings with the simple-minded approach and divide by 4 to see if that is in the speed range you expect.
Also, in order to use this approach you are going to need O(n) space to store this suffix tree. So if you have on the order of 400,000 digits, you are going to need on the order of 400,000 time to build and 400,000 bytes to and store this suffix tree.
I am not totally what is meant by searching in "real time", I usually think of it as a hard limit on how long an operation can take. If that's the case, then that's not going to happen here. This algorithm needs to read in the entire input string and processes that before you start to get results. In that sense, it is what's called an "off-line" algorithm,.
http://web.cs.ucdavis.edu/~gusfield/strmat.html has C code that you can download. (In tar file strmat.tar.gz look for repeats_tandem.c and repeats_tandem.h).
In light of the above, if that algorithm isn't sufficiently fast or space efficient, I'd look for ways to change or narrow the problem. Maybe you only need a fixed number of answers (e.g. up to 5)? If the cycles are a result of executing statements in a program, given that programming languages (other than assembler) don't have arbitrary "goto" statements, it's possible that this can narrow the kinds of cycles that can occur and somehow by make use of that structure might offer a way to speed things up.

When one algorithm is too slow, switch algorithms.
If you are looking for repeating strings, you might consider using a suffix tree scheme: https://en.wikipedia.org/wiki/Suffix_tree
This will find common substrings in for you in linear time.
EDIT: #nhahtdh inb a comment below has referenced a paper that tells you how to pick out all z tandem repeats very quickly. If somebody upvotes
my answer, #nhahtdh should logically get some of the credit.
I haven't tried it, but I'd guess that you might be able to parallelize the construction of the suffix tree itself.

I'm sure there's room for optimization, but test this algorithm on shorter strings to see how it compares to your current solution:
def partial_repeat(string):
l = len(string)
for i in range(2, l//2+1):
s = string[0:i]
multi = l//i-1
factor = l//(i-1)
ls = len(s)
if s*(multi) == string[:ls*(multi)] and len(string)-len(string[:ls*factor]) <= ls and s*2 in string:
return s
>>> test_string
'abc1231231231231'
>>> results = {x for x in (partial_repeat(test_string[i:]) for i in range(len(test_string))) if x}
>>> sorted(sorted(results, key=test_string.index), key=test_string.count, reverse=True)[0]
'123'
In this test string, it's unclear whether the non-repeating initial characters are 'abc' or 'abc1', so the repeating string could be either '123' or '231'. The above sorts each found substring by its earliest appearance in the test string, sorts again (sorted() is a stable sort) by the highest frequency, and takes the top result.
With standard loops and min() instead of comprehensions and sorted():
>>> g = {partial_repeat(test_string[i:]) for i in range(len(test_string))}
>>> results = set()
>>> for x in g:
... if x and (not results or test_string.count(x) >= min(map(test_string.count, results))):
... results.add(x)
...
>>> min(results, key=test_string.index)
'123'
I tested these solutions with the test string 'abc123123a' multiplied by (n for n in range(100, 10101, 500) to get some timing data. I entered these data into Excel and used its FORECAST() function to estimate the processing time of a 4-million character string at 430 seconds, or about seven minutes.

Fastest sorted string concatenation

What is the fastest and most efficient way to do this:
word = "dinosaur"
newWord = word[0] + ''.join(sorted(word[1:]))
output:
"dainorsu"
Thoughts:
Would something as converting the word to an array increase performance? I read somewhere that arrays have less overhead due to them being the same data type compared to a string.
Basically I want to sort everything after the first character in the string as fast as possible. If memory is saved that would also be a plus. The problem I am trying to solve needs to be within a certain time limit so I am trying to be as fast as possible. I dont know much about python efficiency under the hood so if you explain why this method is fast as well that would be AWESOME!

Here's how I'd approach it.
Create an array of size 26 (assuming that only lowercase letters are used). Then iterate through each character in the string. For the 1st letter of the alphabet, increment the 1st index of the array; for the 2nd, increment the 2nd. Once you've scanned the whole string (which is of complexity O(n)) you will be able to reconstruct it afterwards by repeating the 'a' array[0] times, 'b' array[1] times, and so on.
This approach would beat a fast sort algorithm like quicksort or partition sort, which have complexity O(nlogn).
EDIT: Finally you'd also want to reassemble the final string efficiently. The string concatenation provided by some languages using the + operator can be inefficient, so consider using an efficient string builder class.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.