What is the time complexity of the code below in Python?

Below is code that removes repeated letters from a string in Python. I would like to know its time complexity, and more specifically the time complexity of the line if string_1[i] not in char_found:, which searches a list.
If possible, could this also be explained in terms of the space allocated by the list?
def remove_recorring_char(string_1):
    result = ""
    char_found = []
    for i in range(0, len(string_1)):
        if string_1[i] not in char_found:
            char_found.append(string_1[i])
            result = result + string_1[i]
    return result

if __name__ == "__main__":
    print(remove_recorring_char("aabbbcc"))
    print(remove_recorring_char("chdsgdsgggsggsjddaaxcvcj"))

if string_1[i] not in char_found:
This line does two things:
First, it accesses string_1[i]. That takes constant time, because strings are basically just arrays of characters.
Then it searches in a list char_found, comparing that character string_1[i] to each element until one matches. That takes (worst-case) linear time in the length of char_found. And, since char_found could (worst-case) be all of the characters in string_1[:i], that's linear in the length of string_1.
So, this line is O(N).
And of course this line is inside an outer loop that's even more obviously O(N): for i in range(0,len(string_1)):. So, that combination of the two is O(N**2).
Even if you make that membership test (in) constant time, you also do result = result+string_1[i] inside the loop. String concatenation is worst-case linear in the length of the string. Recent versions of CPython and PyPy have some optimizations so it's sometimes amortized constant time, like appending to a list, but Python the language doesn't guarantee those optimizations. And result is, worst-case, also as long as string_1. So, the whole thing is still O(N**2), unless your interpreter is extra nice.
You could reduce the whole thing to O(N) by making two small changes.
First, use a set rather than a list for char_found. Searching a set, and adding to it, are both amortized constant-time operations.
Second, use a list rather than a str for result, then do result = ''.join(result) at the end. Appending to a list is amortized constant-time. Converting a list back to a string is of course linear time, but you're not doing it inside your loops, so that's fine.
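The two changes described above could look roughly like this. It is a minimal sketch, not the asker's code: the function name remove_recurring_chars and the variable names seen and kept are made up for illustration.

```python
def remove_recurring_chars(text):
    """Keep only the first occurrence of each character, in order."""
    seen = set()   # membership test and add are amortized O(1)
    kept = []      # list append is amortized O(1)
    for ch in text:
        if ch not in seen:
            seen.add(ch)
            kept.append(ch)
    # A single O(N) join at the end, outside the loop.
    return "".join(kept)


print(remove_recurring_chars("aabbbcc"))  # abc
```

Iterating over the characters directly also removes the need for range(0, len(...)) and indexing.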

Related

Code complexity Explanation | Powerset Generation

I am trying to understand difference/similarity in complexities when writing the code to generate powerset in 2 ways:
def powerset(s, i, cur):
    if i == len(s):
        print(cur)
        return
    powerset(s, i+1, cur+s[i])  # string addition is also possibly O(n^2) here?
    powerset(s, i+1, cur)

powerset("abc", 0, "")
Output:
['abc', 'ab', 'ac', 'a', 'bc', 'b', 'c', '']
This is going into recursion, with 2 choices at each step (adding s[i] or not), creating 2 branches. Leading to 2^n and adding to the array/printing is another O(n), leading to O(n*2^n)
Also thinking about it in the terms of branches^depth = O(2^n)
The space complexity for this will be: O(n)? Considering max depth of the tree to go up to n by the above logic.
And with this:
s = "abc"
res = [""]
for i in s:
    res += [j+i for j in res]
I get the same output.
But here, I see 2 for loops and the additional complexity for creating the strings -- which is possibly O(n^2) in Python. Leading to possible O(N^4) as opposed to O(n*2^n) in the solution above.
Space complexity here seems to me to be O(n) since we are reserving space for just the output. But no additional space, so overall: O(1)
Is my understanding of the time and space complexity of these solutions correct? I was under the impression that computing the powerset is O(2^n). But I figured maybe it's a more optimized solution? (even though the second solution seems more naive).
https://stackoverflow.com/questions/34008010/is-the-time-complexity-of-iterative-string-append-actually-on2-or-on
Here, they suggest using arrays to avoid the `O(n^2)` complexity of string concatenation.
It's clear that both of those solutions involve creating 2^len(s) strings, which are all the subsets of s. The amount of time that takes is O(2^len(s) * T), where T is the time to create and process each string. The processing time is impossible to know without seeing the code which consumes the subsets; if the processing consists only of printing, then it's presumably O(len(s)), since printing has to be done character by character.
I think the core of your question is how much time constructing the string takes.
It's often said that constructing a string by repeated concatenation is O(n²), since the entire string needs to be copied in order to perform each concatenation. And that's basically true (although Python has some tricks up its sleeve which can optimise in certain cases). But that's assuming that you're constructing the string from scratch, tossing away the intermediate results as soon as they're no longer necessary.
In this case, however, the intermediate results are also part of the desired result. And you only copy the prefixes implicitly, when you append the next character. That means that the additional cost of producing each subset is O(n), not O(n²), because the cost of producing the immediate prefix was already accounted for. That's true in both of the solutions you provide.
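To make that accounting concrete, here is a small sketch of the iterative solution that also counts how many characters get copied. It assumes each concatenation prefix + ch costs len(prefix) + 1 character copies; the function name powerset_with_cost and the counter are illustrative additions, not part of the question.

```python
def powerset_with_cost(s):
    """Build all subsets of s iteratively and count copied characters."""
    res = [""]
    copied = 0
    for ch in s:
        new = []
        for prefix in res:
            new.append(prefix + ch)
            # Assumed cost model: the new string is written out in full.
            copied += len(prefix) + 1
        res += new
    return res, copied


subsets, copied = powerset_with_cost("abc")
print(len(subsets), copied)  # 8 12
```

Under that cost model the total equals the sum of the lengths of all subsets, n * 2^(n-1), which is within the O(n * 2^n) bound and far from a per-subset O(n²).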

Compare strings in a list to another list of strings: pairwise string comparison vs check existence in set

I'm comparing a list of strings words_to_lookup to a list of strings word_list and for every match I'll do something.
I wonder whether direct string comparison or checking existence in a set will be more time efficient.
String comparison is O(n)
for w in word_list:
    for s in words_to_lookup:
        if s == w:
            pass  # do something
Checking existence in a set is O(1) but getting the hash of the string (probably) takes O(n).
w_set = set(word_list)
for s in words_to_lookup:
    if s in w_set:
        pass  # do something
So is string comparison an efficient approach? If the set approach is faster, how?
EDIT:
I was not thinking clearly when I first posted the question. Thank you for the comments. I find it hard to convert my real problem to a concise one suited for online discussions. Now I made the edit, the answer is obvious that the first approach is O(n^2) and the second approach is O(n).
My real question should have been this: Can hash tables really be O(1)?.
And the answer is:
The hash function however does not have to be O(m) - it can be O(1). Unlike a cryptographic hash, a hash function for use in a dictionary does not have to look at every bit in the input in order to calculate the hash. Implementations are free to look at only a fixed number of bits.
If you only need to search once, your first approach is more efficient.
One advantage of constructing a set is the following: if you need to search against the same set many times, you only need to build the set once.
In other words, suppose you have N words in the dictionary (dictionary_list) and you have a list of M words that you want to look up (words_to_lookup). If you go with the set approach, the complexity is O(N+M). If you don't build a set, the complexity is O(N*M) because you may have to go over the whole dictionary of N words for each of the M words that you are looking up.
For this problem, the following code is the more efficient approach.
w_set = set(dictionary_list)
for w in words_to_lookup:
    if w in w_set:
        pass  # do something
EDIT
Ok. Now I see what you mean. In that case, the set version is definitely better. Note, you can also do:
for s in words_to_lookup:
    if s in word_list:
        pass  # do something
That's the same thing as your set way, but the running time of the "in" operator will be worse.
list - Average: O(n)
set/dict - Average: O(1), Worst: O(n)
So the set way is probably best.
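As a rough illustration of the two approaches side by side, the sketch below collects the matches with nested loops and with a set. The function names and the sample word lists are invented for the example; "do something" is replaced by appending to a result list.

```python
def matches_nested(word_list, words_to_lookup):
    """O(N*M) string comparisons in the worst case."""
    found = []
    for s in words_to_lookup:
        for w in word_list:
            if s == w:
                found.append(s)
                break  # stop scanning once this word is matched
    return found


def matches_with_set(word_list, words_to_lookup):
    """O(N) to build the set, then average O(1) per lookup."""
    w_set = set(word_list)
    return [s for s in words_to_lookup if s in w_set]


words = ["apple", "banana", "cherry"]
lookups = ["cherry", "kiwi", "apple"]
print(matches_nested(words, lookups))    # ['cherry', 'apple']
print(matches_with_set(words, lookups))  # ['cherry', 'apple']
```

Both return the same matches; the difference is only in how the work grows as the lists get longer.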

Fastest sorted string concatenation

What is the fastest and most efficient way to do this:
word = "dinosaur"
newWord = word[0] + ''.join(sorted(word[1:]))
output:
"dainorsu"
Thoughts:
Would something like converting the word to an array increase performance? I read somewhere that arrays have less overhead than a string because all their elements are the same data type.
Basically I want to sort everything after the first character in the string as fast as possible. If memory is saved that would also be a plus. The problem I am trying to solve needs to be within a certain time limit so I am trying to be as fast as possible. I don't know much about Python efficiency under the hood so if you explain why this method is fast as well that would be AWESOME!
Here's how I'd approach it.
Create an array of size 26 (assuming that only lowercase letters are used). Then iterate through each character in the string. For the 1st letter of the alphabet, increment the 1st index of the array; for the 2nd, increment the 2nd. Once you've scanned the whole string (which is of complexity O(n)) you will be able to reconstruct it afterwards by repeating the 'a' array[0] times, 'b' array[1] times, and so on.
Asymptotically, this O(n) approach beats a fast comparison sort such as quicksort, which has complexity O(n log n).
EDIT: Finally you'd also want to reassemble the final string efficiently. The string concatenation provided by some languages using the + operator can be inefficient, so consider using an efficient string builder class.
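The counting approach described above might be sketched in Python as follows. It assumes, as the answer does, that the word contains only lowercase ASCII letters; the function name sort_tail_counting is made up for this example, and the pieces are assembled with a single join rather than repeated + concatenation.

```python
def sort_tail_counting(word):
    """Keep the first character and sort the rest with a counting sort.

    Assumes every character after the first is a lowercase ASCII letter.
    """
    if not word:
        return word
    counts = [0] * 26
    for ch in word[1:]:
        counts[ord(ch) - ord("a")] += 1
    parts = [word[0]]
    for idx, n in enumerate(counts):
        parts.append(chr(ord("a") + idx) * n)  # repeat each letter n times
    return "".join(parts)


print(sort_tail_counting("dinosaur"))  # dainorsu
```

Whether this actually beats word[0] + ''.join(sorted(word[1:])) in CPython is worth measuring: the built-in sorted runs in C, while this loop runs in the interpreter, so for short words the asymptotic advantage may not show up.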

Python's immutable strings and their slices

The strings in Python are immutable and support the buffer interface. It could be efficient to return not the new strings, but the buffers pointing to the parts of the old string when using slices or the .split() method. However, a new string object is constructed each time. Why? The single reason I see is that it can make garbage collection a bit more difficult.
True: in regular situations the memory overhead is linear and isn't noticeable. Copying is fast, and so is allocation. But there is already too much done in Python, so maybe such buffers are worth the effort?
EDIT:
It seems that forming substrings this way would make memory management much more complicated. The case where only 20% of an arbitrary string is used, and we can't deallocate the rest of the string, is a simple example. We could improve the memory allocator so that it is able to deallocate strings partially, but that would probably make things worse overall. All the standard functions can anyway be emulated with buffer or memoryview if memory becomes critical. The code wouldn't be that concise, but one has to give up something in order to get something.
The underlying string representation is null-terminated, even though it keeps track of the length, hence you cannot have a string object that references a substring that isn't a suffix. This already limits the usefulness of your proposal, since handling suffixes and non-suffixes differently would add a lot of complications (and giving up null-terminated strings has other consequences).
Allowing references to substrings of a string would complicate garbage collection and string handling a lot. For every string you would have to keep track of how many objects refer to each character, or to each range of indices. That would considerably complicate the struct of string objects and every operation that deals with them, which probably means a big slowdown.
Add the fact that, starting with Python 3.3, strings have three different internal representations, and things would become too messy to be maintainable; your proposal probably doesn't give enough benefits to be accepted.
Another problem with this kind of "optimization" arises when you want to deallocate "big strings":
a = "Some string" * 10 ** 7
b = a[10000]
del a
After these operations, the substring b prevents a, a huge string, from being deallocated. Surely you could copy small strings, but what if b = a[:10000] (or another big number)? 10000 characters looks like a big string which ought to use the optimization to avoid copying, but it is preventing the release of megabytes of data.
The garbage collector would have to keep checking whether it is worth deallocating a big string object and making copies or not, and all these operations must be as fast as possible, otherwise you end up hurting time performance.
99% of the time, the strings used in programs are "small" (at most 10k characters), hence copying is really fast, while the optimizations you propose only start to become effective with really big strings (e.g. taking substrings of size 100k from huge texts) and are much slower with really small strings, which is the common case, i.e. the case that should be optimized.
If you think it is important, you are free to propose a PEP, showing an implementation and the resulting changes in speed and memory usage. If it is really worth the effort, it may be included in a future version of Python.
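The "small view keeps a huge object alive" problem can be seen with Python 3's memoryview over bytes. This is only an analogy, since str offers no such view, and the variable names and sizes are arbitrary choices for the sketch.

```python
big = b"Some string" * 10**5

# A slice is an independent copy: it holds no reference to big.
copy = big[:10000]

# A memoryview slice shares big's buffer instead of copying it.
view = memoryview(big)[:10000]

print(view.obj is big)      # True: the view references the whole object
print(bytes(view) == copy)  # True: same content either way
```

As long as view exists, the whole of big stays allocated even if the name big is deleted, which is exactly the deallocation concern raised above.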
That's how slices work. Slices always perform a shallow copy, allowing you to do things like
>>> x = [1,2,3]
>>> y = x[:]
Now it would be possible to make an exception for strings, but is it really worth it? Eric Lippert blogged about his decision not to do that for .NET; I guess his argument is valid for Python as well.
See also this question.
If you are worried about memory (in the case of really large strings), use a buffer() (Python 2 only):
>>> a = "12345"
>>> b = buffer(a, 2, 2)
>>> b
<read-only buffer for 0xb734d120, size 2, offset 2 at 0xb734d4a0>
>>> print b
34
>>> print b[:]
34
Knowing about this opens up alternatives to string methods such as split().
If you want to split() a string, but keep the original string object (as you may still need it), you could do:
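def split_buf(s, needle):
    start = 0  # must be an int: buffer() does not accept None as offset
    add = len(needle)
    res = []
    while True:
        index = s.find(needle, start)
        if index < 0:
            break
        res.append(buffer(s, start, index - start))
        start = index + add
    res.append(buffer(s, start))  # remainder after the last separator
    return res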
or, using .index():
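def split_buf(s, needle):
    start = 0  # must be an int: buffer() does not accept None as offset
    add = len(needle)
    res = []
    try:
        while True:
            index = s.index(needle, start)
            res.append(buffer(s, start, index - start))
            start = index + add
    except ValueError:
        pass  # no more separators
    res.append(buffer(s, start))  # remainder after the last separator
    return res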
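buffer() no longer exists in Python 3. A comparable sketch there would use memoryview over a bytes object, since str does not expose the buffer protocol. The function name split_views is hypothetical, and the empty-separator check is added to mirror the behaviour of split().

```python
def split_views(data, needle):
    """Split bytes-like data on needle, returning zero-copy memoryview slices."""
    if not needle:
        raise ValueError("empty separator")
    view = memoryview(data)
    pieces = []
    start = 0
    step = len(needle)
    while True:
        index = data.find(needle, start)
        if index < 0:
            break
        pieces.append(view[start:index])  # shares data's buffer, no copy
        start = index + step
    pieces.append(view[start:])  # remainder after the last separator
    return pieces


parts = split_views(b"alpha,beta,gamma", b",")
print([bytes(p) for p in parts])  # [b'alpha', b'beta', b'gamma']
```

Each piece keeps the original bytes object alive, so this only pays off when the original has to stay around anyway.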

What could affect Python string comparison performance for strings over 64 characters?

I'm trying to evaluate whether comparing two strings gets slower as their length increases. My calculations suggest comparing strings should take amortized constant time, but my Python experiments yield strange results:
Here is a plot of string length (1 to 400) versus time in milliseconds. Automatic garbage collection is disabled, and gc.collect is run between every iteration.
I'm comparing 1 million random strings each time, counting matches as follows. The process is repeated 50 times before taking the min of all measured times.
for index in range(COUNT):
    if v1[index] == v2[index]:
        matches += 1
    else:
        non_matches += 1
What might account for the sudden increase around length 64?
Note: The following snippet can be used to try to reproduce the problem assuming v1 and v2 are two lists of random strings of length n and COUNT is their length.
timeit.timeit("for i in range(COUNT): v1[i] == v2[i]",
"from __main__ import COUNT, v1, v2", number=50)
Further note: I've made two extra tests: comparing string with is instead of == suppresses the problem completely, and the performance is about 210ms/1M comparisons.
Since interning has been mentioned, I made sure to add a white space after each string, which should prevent interning; that doesn't change anything. Is it something else than interning then?
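For anyone trying to reproduce this, a self-contained setup might look like the sketch below. The helper names make_strings and time_comparisons, the alphabet, the seeds and the reduced counts are all assumptions made for the example, not the original benchmark.

```python
import random
import string
import timeit


def make_strings(count, length, seed):
    """Return `count` random strings of `length` ASCII letters."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters
    return ["".join(rng.choice(alphabet) for _ in range(length))
            for _ in range(count)]


def time_comparisons(length, count=10000, repeats=5):
    """Best-of-`repeats` time in seconds to compare `count` string pairs."""
    v1 = make_strings(count, length, seed=1)
    v2 = make_strings(count, length, seed=2)

    def run():
        for i in range(count):
            v1[i] == v2[i]

    return min(timeit.repeat(run, number=1, repeat=repeats))


for n in (32, 64, 128):
    print(n, time_comparisons(n, count=1000, repeats=3))
```

The printed timings depend on the machine and interpreter, so no expected output is shown.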
Python can 'intern' short strings: it stores them in a special cache and re-uses string objects from that cache.
When comparing strings, it first tests whether both are the same pointer (e.g. an interned string):
if (a == b) {
    switch (op) {
    case Py_EQ:case Py_LE:case Py_GE:
        result = Py_True;
        goto out;
// ...
Only if that pointer comparison fails does it use a size check and memcmp to compare the strings.
However, interning normally only takes place for identifiers (function names, arguments, attributes, etc.), not for string values created at runtime.
Another possible culprit is string constants; string literals used in code are stored as constants at compile time and reused throughout; again only one object is created and identity tests are faster on those.
For string objects that are not the same, Python tests for equal length, equal first characters then uses the memcmp() function on the internal C strings. If your strings are not interned or otherwise are reusing the same objects, all other speed characteristics come down to the memcmp() function.
I am just making wild guesses but you asked "what might" rather than what does so here are some possibilities:
The CPU cache line size is 64 bytes and longer strings cause a cache miss.
Python might store strings of 64 bytes in one kind of structure and longer strings in a more complicated structure.
Related to the last one: it might zero-pad strings into a 64-byte array and is able to use very fast SSE2 vector instructions to match two strings.
