Understanding this solution to the longest substring without repetition problem in leetcode - python

So I'm trying to understand this solution for this problem. The goal here is to get the length of the longest substring from a string without having repeated characters.
As I understand it, it goes character by character. It subtracts the start position from the current index; the start position is initially 0 because indexing starts at 0. The addition of 1 compensates for starting at index 0.
If it encounters a duplicate character, it shifts the start position until no duplicates remain. This essentially separates the previous characters into a substring and starts a new substring at the duplicate, e.g. abcab => "abc" and "ab". It continues until the length of the longest substring with no duplicates is found.
The code for the solution is as seen below:
    class Solution(object):
        def lengthOfLongestSubstring(self, s):
            """
            :type s: str
            :rtype: int
            """
            used = {}
            max_length = start = 0
            for i, c in enumerate(s):
                if c in used and start <= used[c]:
                    start = used[c] + 1
                else:
                    max_length = max(max_length, i - start + 1)
                used[c] = i
            return max_length
What I don't understand is the start<=used[c] and used[c] = i part of this solution, what does it do? Can someone clarify with me?
EDIT: I understand that the dictionary is being used to keep track of the character count. I just don't understand the logic of it. Sorry, I should've clarified.
Thank you for reading.

If I understand correctly, your goal is to form a longest sub-string without repeating characters.
Algorithm Pseudocode:
Start with an empty string, with start as the begin index of the string. You want to extend the string until you reach a duplicate character.
There are two possibilities for each character: either it is seen for the first time, or it has been seen before. After each character we update the bookmark used, which keeps track of the index where each character was last seen.
a) If the character has not been seen before, you can safely extend the current string.
Or
b) If the character was seen before, you can only extend the string if that earlier occurrence is not part of the current string (start > used[c]). If it is part of the current string (start <= used[c]), you need to move the substring's begin index start to the index right after the last occurrence of the current character, since we don't want characters to repeat, i.e. start = used[c] + 1. Since you are shortening the string in the latter case, the maximal string cannot end at this position.
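To make the roles of start <= used[c] and used[c] = i concrete, here is a minimal sketch that runs the same loop as a standalone function and records the current window after each character. The function name trace_longest_substring and the recorded steps are assumptions added for illustration; they are not part of the original solution.

```python
def trace_longest_substring(s):
    """Return (max_length, steps); steps records the window after each character."""
    used = {}  # character -> index where it was last seen
    max_length = start = 0
    steps = []
    for i, c in enumerate(s):
        if c in used and start <= used[c]:
            # c already occurs inside the current window s[start:i],
            # so move the window start just past that occurrence.
            start = used[c] + 1
        else:
            # c is new, or its last occurrence lies before the window,
            # so the window s[start:i + 1] contains no repeats.
            max_length = max(max_length, i - start + 1)
        used[c] = i  # always remember the most recent position of c
        steps.append((c, s[start:i + 1], max_length))
    return max_length, steps


length, steps = trace_longest_substring("abba")
for c, window, best in steps:
    print(c, window, best)
```

With "abba", the final a was last seen at index 0, but by then the window starts at index 2. The check start <= used[c] fails, so the window is extended to "ba" rather than start being moved backwards to index 1, which would wrongly admit the window "bba".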

Related

First recurring character problem in Python

I'm trying to solve a problem that I have with a recurring character problem.
I'm a beginner in development so I'm trying to think of ways I can do this.
    thisWord = input()
    def firstChar(thisWord):
        for i in range(len(thisWord)):
            for j in range(i+1, len(thisWord)):
                if thisWord[i] == thisWord[j]:
                    return thisWord[i]
    print(firstChar(thisWord))
This is what I came up with. In plenty of use cases, the result is fine. The problem I found after some fiddling around is that with a word like "statistics", where the "t" is the first recurring letter rather than the "s" because of the distance between the letters, my code counts the "s" first and returns that as the result.
I've tried weird solutions like measuring the entire string first for each possible case, creating variables for string length, and then comparing it to another variable, but I'm just ending up with more errors than I can handle.
Thank you in advance.
So you want to find the first letter that recurs in your text, with "first" being determined by the recurrence, not the first occurrence of the letter? To illustrate that with your "statistics" example, the t is the first letter that recurs, but the s had its first occurrence before the first occurrence of the t. I understand that in such cases, it's the t you want, not the s.
If that's the case, then I think a set is what you want, since it allows you to keep track of letters you've already seen before:
    thisword = "statistics"
    set_of_letters = set()
    for letter in thisword:
        if letter not in set_of_letters:
            set_of_letters.add(letter)
        else:
            firstchar = letter
            break
    print(firstchar)
Whenever you're looking at a certain character in the word, you should not check whether the character will occur again at all, but whether it has already occurred. The algorithmically optimal way would be to use a set to store and look up characters as you go, but it could just as well be done with your double loop. The second one should then become for j in range(i).
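As a rough sketch of the double-loop variant just described, the inner loop below scans only the positions before i. The function name first_recurring_char and the None return value for words without a repeat are assumptions added for illustration.

```python
def first_recurring_char(word):
    """Return the character whose repeat is reached first, or None if nothing repeats."""
    for i in range(len(word)):
        # Look backwards: has word[i] already appeared before position i?
        for j in range(i):
            if word[i] == word[j]:
                return word[i]
    return None  # no character occurs twice


print(first_recurring_char("statistics"))  # prints: t
```

This keeps the shape of the original code; the set-based version does the same job with a single pass.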
This is not an answer to your problem (one was already provided), but a suggestion for a better solution:
    def firstChar(thisWord):
        occurrences: dict[str, int] = {char: 0 for char in thisWord}  # At the beginning every character has been seen zero times
        for char in thisWord:
            occurrences[char] += 1  # You found this char
            if occurrences[char] == 2:  # This char was already found once before
                return char  # So you return it as the first duplicate
This works as expected:
>>> firstChar("statistics")
't'
EDIT:
occurrences: dict[str, int] = {char: 0 for char in thisWord}
This line of code creates a dictionary with the chars from thisWord as keys and 0 as values, so that you can use it to count the occurrences starting from 0 (before finding a char its count is 0).

Python how to correct a misaligned substring position info from string

I have a list of strings and the start offset and end offset of substrings that need to be used for training a nlp model.
Some of these positions for substring are misaligned. Eg:
text = 'Car is blue'
start_offset = 0
end_offset = 2 #misaligned. should be 3.
substring = text[start_offset:end_offset] # should be 'Car' but misaligned to give substring as 'Ca'
The aim is to check if substring highlighted is a whole word from the whole string. If not, correct the start and end offset.
What python code could I use to get whole word substrings?
Just do end_offset + 1. Range selectors on strings are inclusive of the first element and exclusive of the last, so the letter "r" on index "2" in this case is not taken. If you want the whole word, the range should be 0:3.
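Adding 1 only fixes this particular off-by-one case. For the general question of snapping an offset pair to whole words, one possible sketch is to expand both offsets outward until they hit whitespace or the edge of the string. The function name snap_to_word is hypothetical, and the sketch assumes that words are delimited by whitespace only (punctuation attached to a word is treated as part of it) and that offsets should only ever grow, never shrink.

```python
def snap_to_word(text, start, end):
    """Expand text[start:end] outward until it is bounded by whitespace or the string edges."""
    # Move start left while the character before it belongs to the same word.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Move end right while the character at end still belongs to the word.
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = 'Car is blue'
start, end = snap_to_word(text, 0, 2)
print(start, end, text[start:end])  # prints: 0 3 Car
```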

Why does it recognize the second capital T as 0?

I'm trying to make a short program that will find all the capital letters in a single string. I got it to work for the first two capital letters but it won't return the correct position of the last capital letter. What did I do wrong?
    def capital_indexes(n):
        listOfUpperPlaces = []
        for x in n:
            print(x)
            if x.isupper():
                characterPlace = n.index(x)
                print(characterPlace)
                listOfUpperPlaces.append(characterPlace)
        return listOfUpperPlaces
    print(capital_indexes("TEsTo"))
That is because n.index(x) returns the index of the first occurrence of x in the string n. Because "T" occurs multiple times, n.index(x) always returns the position of the first "T".
You want to iterate through range(len(n)), like
    def capital_indexes(n):
        listOfUpperPlaces = []
        for x in range(len(n)):
            print(n[x])
            if n[x].isupper():
                print(x)
                listOfUpperPlaces.append(x)
        return listOfUpperPlaces
    print(capital_indexes("TEsTo"))
The issue is the call to n.index(x).
This searches the string for x, and it's able to find a capital T right at the beginning of the string.
A better way to do this would be to use enumerate, which gives you both the index and the item at the same time.
Can't code very well from a phone, but something like:
    for index, character in enumerate(n):
        if character.isupper():
            list_of_upper_places.append(index)
This will handle duplicates correctly, and will also be faster, since you don't need to search through the string just to count which character you are currently checking. It will be easier to read for most python programmers too.
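For reference, a complete version of the enumerate approach might look like the sketch below. Writing it as a list comprehension and naming the parameter text are choices made for this illustration, not something taken from the answers above.

```python
def capital_indexes(text):
    """Return the index of every uppercase character in text."""
    return [index for index, character in enumerate(text) if character.isupper()]


print(capital_indexes("TEsTo"))  # prints: [0, 1, 3]
```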

Very Beginner Python: Replacing Part of a String

I want to know how you can replace one occurrence of something in a string without replacing the other occurrences of it. For example, let the variable:
action = "play sports"
I could substitute "playing" for "play" by doing print(action.replace("play", "playing"))
But what if you have two of the same substring?
For example, what if you want to turn "honeyhoney" into "honeysweet" (replacing the last half of the string with "sweet")?
Sorry for the bad wording, I am new to coding and really unfamiliar with this. Thanks!
    def replaceLast(str, old, new):
        return str[::-1].replace(old[::-1], new[::-1], 1)[::-1]
    print(replaceLast("honeyhoney", "honey", "sweet"))
output
honeysweet
The idea is to reverse the string as well as the old and new substrings, so that the last occurrence becomes the first. Then do the replace and reverse the result once again. The argument 1 tells replace to substitute only one match, not both.
Another solution
    def replaceLast(str, old, new):
        ind = str.rfind(old)
        if ind == -1:
            return str
        return str[:ind] + new + str[ind + len(old):]
    print(replaceLast("honeyhoney", "honey", "sweet"))
output
honeysweet
Here we take the string from the beginning up to the index of the last occurrence, then add the new substring, then add the rest of the string from where the old substring ends, and return the result as the new string. str.rfind returns -1 when no match is found, so we need to check against that to make sure the output is correct even if there is nothing to replace.
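A third option, not taken from the answers above, is to split once from the right and join the pieces with the new substring. This is only a sketch: the name replace_last is made up for illustration, and it assumes old is a non-empty string, since str.rsplit raises ValueError for an empty separator.

```python
def replace_last(text, old, new):
    """Replace only the last occurrence of old in text; assumes old is non-empty."""
    # rsplit with maxsplit=1 splits at most once, starting from the right,
    # so only the last occurrence of old is removed before joining with new.
    return new.join(text.rsplit(old, 1))


print(replace_last("honeyhoney", "honey", "sweet"))  # prints: honeysweet
```

If old does not occur at all, rsplit returns a one-element list and the string comes back unchanged.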

Using nested for loop and if statement to replace character with integer

I need to output any repeated character to refer to the previous character.
For example: a(-1)rdv(-4)(-4)k or hel(-1)o
This is my code so far:
    text= 'aardvark'
    i=0
    j=0
    for i in range(len(text)-1):
        for j in range(i+1, len(text)):
            if text[j]==text[i]:
                sub= text[j]
                val2=text.find(sub, i+1, len(text))
                p=val2+1
                val=str(i-j)
                text= text[:val2] + val + text[p:]
                break
    print(text)
Output: a-1rdva-4k
The second 'a' is not recognised. And I'm not sure how to include brackets in my print.
By updating the text in-place each time you find a back-reference, you muck up your indices (your text gets longer each time) and you never process the last characters properly. You stop checking when you find the first repeat of the 'current' character, so the 3rd a is never processed. This applies to every 3rd repeat in an input string. In addition, if your input text contains any - characters or digits they'll end up being tested against the -offset references you inserted before them too!
For your specific example of aardvark, a string with 8 characters, what happens is this:
You find the second a and set text to a-1rdvark. The text is now 9 characters long, so the last r will never be checked (you loop to i = 6 at most); this would be a problem if your test string ended in a double letter. You break out of the loop, so the j for loop never comes to the 3rd a, and the second a can't be tested for anymore as it has already been replaced.
Your code finds - (not repeated), 1 (not repeated) and then r (repeated once), so now you replace text with a-1rdva-4k. Now you have a string of 10 characters, so -, and 4 will never be tested. Not a big problem anymore, but what if there was a repeat in just the last 3 positions of the string?
Build a new object for the output (adding both letters you haven't seen before and backreferences). That way you won't cause the text you are looping over to grow, and you will continue to find repeats; for the parentheses you could use more string concatenation. You'll need to scan the part of the string before i, not after, for this to work, and go backwards! Testing i - 1, i - 2, etc, down to 0. Naturally, this means your i loop should then range up to the full length:
    output = ''
    for i in range(len(text)):
        current = text[i]
        for j in range(i - 1, -1, -1):
            if text[j] == current:
                current = '(' + str(j - i) + ')'
                break
        output = output + current
    print(output)
I kept the fix to a minimum here, but ideally I'd also make some more changes:
Add all processed characters and references to a new list instead of a string, then use str.join() to join that list into the output afterwards. This is far more efficient than rebuilding the string each iteration.
Using two loops means that for each character you check the other characters in the string again, so the number of steps the algorithm takes grows quadratically with the length of the input. In Computer Science we talk about the time complexity of algorithms, and yours is an O(N^2) (N squared) quadratic algorithm. A text with 1000 letters would take up to 1 million steps to process! Rather than loop a quadratic number of times, you can use a dictionary to track indices of letters you have seen. If the current character is in the dictionary you can then trivially calculate the offset. Dictionary lookups take constant time on average (O(1)), making the whole algorithm take linear time (O(N)), meaning that the time the process takes is directly proportional to the length of the input string.
Use enumerate() to add a counter to the loop so you can just loop over the characters directly, no need to use range().
You can use string formatting to build a "(<offset>)" string; Python 3.6 and newer have formatted string literals, where f'...' strings take {} placeholders that are just expressions. f'({some - calculation + or * other})' will execute the expression and put the result in a string that has ( and ) characters in it too. For earlier Python versions, you can use the [str.format() method](https://docs.python.org/3/library/stdtypes.html#str.format) to get the same result; the syntax then becomes '({})'.format(some - calculation + or * other).
Put together, that becomes:
    def add_backrefs(text):
        output = []
        seen = {}
        for i, character in enumerate(text):
            if character in seen:
                # add a back-reference, we have seen this already
                output.append(f'({seen[character] - i})')
            else:
                # add the literal character instead
                output.append(character)
            # record the position of this character for later reference
            seen[character] = i
        return ''.join(output)
Demo:
>>> add_backrefs('aardvark')
'a(-1)rdv(-4)(-4)k'
>>> add_backrefs('hello')
'hel(-1)o'
    text = 'aardvark'
    d = {}  # dictionary mapping each character to the index where it was last seen
    new_text = ''  # new text to be generated
    for i in range(len(text)):  # iterate over the text from index 0 up to its length
        c = text[i]  # store the current character, since it is used several times
        if c not in d:  # check whether this character has been visited before
            d[c] = i  # first visit: just record its index in the dictionary
            new_text += c  # append the character itself to the result text
        else:  # visiting an already visited character
            new_text += '({0})'.format(d[c]-i)  # {0} is replaced by the offset from the current index back to the last occurrence
            d[c] = i  # update the last-seen index of this character
    print(new_text)
Output:
a(-1)rdv(-4)(-4)k
