Number of occurrences of a substring in a string - python

I need to count the number of times the substring 'bob' occurs in a string.
Example problem: Find the number of times 'bob' occurs in the string s, where
s = "xyzbobxyzbobxyzbob" # (here there are three occurrences)
Here is my code:
s = "xyzbobxyzbobxyzbob"
numBobs = 0
while(s.find('bob') >= 0):
    numBobs = numBobs + 1
    print numBobs
Since the find function in Python is supposed to return -1 when a substring is not found, I expected the while loop to end after printing the incremented number of bobs each time it finds the substring.
However, the program turns out to be an infinite loop when I run it.

str.find is not the best tool for this job. Use str.count instead, which counts non-overlapping occurrences:
>>> s = 'xyzbobxyzbobxyzbob'
>>> s.count('bob')
3
>>> s.count('xy')
3
>>> s.count('bobxyz')
2
>>>
Or, if you want to count overlapping occurrences, you can use a regex with a lookahead:
>>> from re import findall
>>> s = 'bobobob'
>>> len(findall('(?=bob)', s))
3
>>> s = "bobob"
>>> len(findall('(?=bob)', s))
2
>>>
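To make the difference concrete, here is a minimal sketch that wraps both approaches in one helper. The name count_occurrences and its overlapping flag are made up for illustration and are not part of the answer above; re.escape is used on the assumption that the substring should be matched literally even if it contains regex metacharacters.

```python
import re

def count_occurrences(s, sub, overlapping=False):
    """Count how many times sub occurs in s (hypothetical helper)."""
    if not overlapping:
        # str.count only counts non-overlapping matches
        return s.count(sub)
    # A zero-width lookahead matches at every position where sub starts,
    # so overlapping occurrences are counted as well
    return len(re.findall('(?=' + re.escape(sub) + ')', s))

print(count_occurrences('bobobob', 'bob'))                    # 2
print(count_occurrences('bobobob', 'bob', overlapping=True))  # 3
```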

When you call s.find('bob'), you search from the beginning of the string every time, so you end up finding the same bob again and again. You need to move the search position to the end of the bob you just found.
str.find takes a start argument that tells it where to begin searching. It also returns the position at which it found bob, so you can add the length of bob to that position and pass the result to the next s.find call.
So set start = 0 before the loop, since you want to search from the beginning. Inside the loop, if find returns a non-negative number, add the length of the search string to it to get the new start:
srch = 'bob'
start = numBobs = 0
while start >= 0:
    pos = s.find(srch, start)
    if pos < 0:
        break
    numBobs += 1
    start = pos + len(srch)
Here I am assuming that overlapping occurrences of the search string should not be counted.
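The same loop can be packaged as a function, roughly like this. This is only a sketch: the name count_non_overlapping is hypothetical, and rejecting an empty search string is my own assumption (with an empty string, start would never advance and the loop would not terminate).

```python
def count_non_overlapping(s, sub):
    """Count non-overlapping occurrences of sub in s using str.find."""
    if not sub:
        # Assumption: treat an empty search string as an error,
        # because start would never move forward
        raise ValueError("sub must be non-empty")
    count = 0
    start = 0
    while True:
        pos = s.find(sub, start)
        if pos < 0:
            return count
        count += 1
        start = pos + len(sub)  # resume right after the match just found

print(count_non_overlapping("xyzbobxyzbobxyzbob", "bob"))  # 3
```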

find doesn't remember where the previous match was and resume from there unless you tell it to. You need to keep track of the match location and pass it in as the optional start parameter. If you don't, find will just find the first bob over and over.
find(...)
    S.find(sub [,start [,end]]) -> int
    Return the lowest index in S where substring sub is found,
    such that sub is contained within s[start:end]. Optional
    arguments start and end are interpreted as in slice notation.
    Return -1 on failure.
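As a rough illustration of the start parameter, the generator below keeps feeding the last match position back into find. The name find_all is hypothetical, and advancing by one character (rather than by the length of the match) is a choice made here so that overlapping matches are reported too.

```python
def find_all(s, sub):
    """Yield the index of every (possibly overlapping) occurrence of sub in s."""
    start = 0
    while True:
        pos = s.find(sub, start)
        if pos == -1:
            return
        yield pos
        start = pos + 1  # step one past the match start so overlaps are found

print(list(find_all('xyzbobxyzbobxyzbob', 'bob')))  # [3, 9, 15]
```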

Here is a solution that returns number of overlapping sub-strings without using Regex:
(Note: the 'while' loop here is written presuming you are looking for a 3-character sub-string i.e. 'bob')
bobs = 0
start = 0
end = 3
while end <= len(s) + 1 and start < len(s)-2:
    if s.count('bob', start, end) == 1:
        bobs += 1
    start += 1
    end += 1
print(bobs)
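If the 3-character assumption is a problem, the same sliding-window idea can be written for a substring of any length. This is a sketch only; count_overlapping is a hypothetical name, and returning 0 for an empty substring is an assumption rather than something the answer above specifies.

```python
def count_overlapping(s, sub):
    """Count occurrences of sub in s, including overlapping ones."""
    n = len(sub)
    if n == 0:
        return 0  # assumption: an empty substring counts as zero matches
    # Test every window of length n; startswith(sub, i) avoids building slices
    return sum(1 for i in range(len(s) - n + 1) if s.startswith(sub, i))

print(count_overlapping('bobobob', 'bob'))  # 3
```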

Here you have an easy function for the task:
def countBob(s):
    number = 0
    # find returns 0 for a match at the very start, so test >= 0 rather than > 0
    while s.find('Bob') >= 0:
        s = s.replace('Bob', '', 1)
        number = number + 1
    return number
Then, you ask countBob whenever you need it:
countBob('This Bob runs faster than the other Bob dude!')

def count_substring(string, sub_string):
    count = a = 0
    while True:
        a = string.find(sub_string)
        string = string[a+1:]
        if a >= 0:
            count = count + 1
        else:
            break
    return count

Related

How to find the number of equal characters that are next to each other in a string?

I just started using Python and I'm a noob.
This is an example of the string I have to work with: "--+-+++----------------+-+"
The program needs to find the longest + "chain", that is, how many times + appears in a row. In this example it should find the chain of 3 + symbols, so that I can print that the longest + chain contains 3 + symbols.
a = "--+-+++----------------+-+"
count = 0
most = 0
for x in range(len(a)):
    if a[x] == "+":
        count += 1
    else:
        count = 0
    if count > most:
        most = count
print(f"longest + chain includes {most} symbols")
There might be a better way, but this one is more self-explanatory.
Try this. It uses regular expressions and a list comprehension, so you may need to read about them.
The idea is to find all the + chains, calculate their lengths, and take the maximum length:
import re
s = '+++----------------+-+'
occurs = re.findall(r'\++', s)
print(max([len(i) for i in occurs]))
Output:
3
You can use a regular expression to specify "one or more + characters". The character for specifying this kind of repetition in a regex is itself +, so to specify the actual + character you have to escape it.
import re
haystack = "--+-+++----------------+-+"
needle = re.compile(r"\++")
Now we can use findall to find all the occurrences of this pattern in the original string, and max to find the longest of these.
longest = max(len(x) for x in needle.findall(haystack))
If you instead need the position of the longest sequence in the target string, you can use:
pos = haystack.index(max(needle.findall(haystack), key=len))
A simple solution is to iterate over the string one character at a time. When the character is the same as the previous one, add one to a counter; each time the character differs from the previous one, restart the count at 1.
s = "--+-+++----------------+-+"
p = s[0]
longest, count = 0, 0
for c in s:
    if c == p:
        count = count + 1
    else:
        count = 1  # c starts a new run of length 1
    if count > longest:
        longest = count
    p = c
s is the string, c is the character being checked, p is the previous character, count is the counter, and longest is the highest value found. Note that this finds the longest run of any character, not only +.
If the only other character in your string is a minus sign, you can split the string on the minus sign and get maximum length of the resulting substrings:
a = "--+-+++----------------+-+"
r = max(map(len,a.split('-')))
print(r) # 3
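One more variation, not taken from the answers above: itertools.groupby groups adjacent equal characters, which also works when the string contains characters other than + and -. The function name longest_run is hypothetical, and returning 0 when the character never appears is an assumption.

```python
from itertools import groupby

def longest_run(s, ch):
    """Length of the longest run of consecutive ch characters in s (0 if none)."""
    # groupby yields one group per block of adjacent equal characters;
    # only the blocks made of ch are measured
    return max(
        (sum(1 for _ in group) for key, group in groupby(s) if key == ch),
        default=0,
    )

print(longest_run("--+-+++----------------+-+", "+"))  # 3
```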

finding the minimum window substring

The problem says to create a string, take 3 non-consecutive characters from it to form a sub-string, and print the positions in the string of the sub-string's first and last characters.
str="subliminal"
sub="bmn"
n = len(str)-3
for i in range(0, n):
    print(str1[i:i+4])
    if sub1 in str1:
        print(sub1[i])
This should print 3 to 8, because b is the third letter and n is the 8th letter.
I also don't know how to make the code work for substrings that aren't 3 characters long without rewriting it completely.
Not sure if this is what you meant. I assume that the substring is already valid, which means that it contains non-consecutive letters. I get the first and last letter of the substring and create a list of all the letters in the string using a list comprehension. Then I loop through the letters and save where the first and last letter occur. If anything is missing, let me know.
sub = "bmn"
str = "subliminal"
first_letter = sub[0]
last_letter = sub[-1]
start = None
end = None
letters = [let for let in str]
for i, letter in enumerate(letters):
    if letter == first_letter:
        start = i
    if letter == last_letter:
        end = i
# Compare with None explicitly: a match at index 0 is falsy but still valid
if start is not None and end is not None:
    print("From %s to %s." % (start + 1, end + 1))  # Output: From 3 to 8.
Some recursion for good health:
def minimum_window_substring(strn, sub, beg=0, fin=0, firstFound=False):
    if len(sub) == 0 or len(strn) == 0:
        return f'From {beg + 1} to {fin}'
    elif strn[0] == sub[0]:
        return minimum_window_substring(strn[1:], sub[1:], beg, fin + 1, True)
    if not firstFound:
        beg += 1
    return minimum_window_substring(strn[1:], sub, beg, fin + 1, firstFound)
Explanation:
The base case is when either the original string or the sub-string has been reduced to length 0; we then stop and return the beginning and the end of the sub-string within the original string.
If the first letter of the current string equals the first letter of the sub-string, we fix the beginning "beg" using the flag "firstFound", then keep incrementing "fin" until we finish (sub is an empty string or the original string is empty).
Something to think about / more explanation:
Suppose the original string is "sububusubulum" and the sub is "sbl". When we hit the first "s", the match starts there: if another "sbl" occurs later in the original string, its remaining letters also come after the first "s", so we can attribute them to that first "s". In other words, if the sub-string occurs more than once, we always pick the first occurrence.
Note: This function does not care whether the sub-string contains consecutive letters, and it does not check whether the characters are in the string itself, because you said the characters are taken from the original string. On the positive side, the function accepts sub-strings longer or shorter than 3 characters.
When I say "original string" I mean subliminal (or other inputs).
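For comparison, the same first-match scan can be written as a plain loop. This is a sketch under the same assumptions as the recursive version; the name subsequence_span is hypothetical, and unlike the recursion it returns None when the sub-string's characters cannot all be found in order.

```python
def subsequence_span(text, sub):
    """Return 1-based (first, last) positions of the earliest in-order
    match of sub's characters in text, or None if there is no match."""
    if not sub:
        return None
    first = None
    k = 0  # index of the next character of sub still to be matched
    for i, ch in enumerate(text):
        if ch == sub[k]:
            if k == 0:
                first = i + 1  # remember where the match begins
            k += 1
            if k == len(sub):
                return first, i + 1
    return None

print(subsequence_span("subliminal", "bmn"))  # (3, 8)
```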
There are many different ways you could do it; here is one solution:
import re
def Func(String, SubString):
    # one or more letters must separate each pair of consecutive sub-string characters
    patt = "".join([char + "[A-Za-z]" + "+" for char in SubString[:-1]] + [SubString[-1]])
    MatchedString = re.findall(patt, String)[0]
    FirstIndex = String.find(MatchedString) + 1
    LastIndex = FirstIndex + len(MatchedString) - 1
    return FirstIndex, LastIndex
string = "subliminal"
sub = "bmn"
FirstIndex, LastIndex = Func(string, sub)
This returns 3, 8. You can change the length of the substring, and it assumes you want the first match only.

how to reset counter when looking for a sequence of strings properly

I am doing the CS50 DNA problem and am having a hard time with the counter: I don't know how to count the highest number of times the sequence I am looking for repeats back to back, with nothing else in between. For example, if I am looking for the sequence AAT and the text is AATDHDHDTKSDHAATAAT, the result should be two, because the last two sequences are both AAT and there is nothing between them.
here is my code:
text="TCTAGTCTAGTCTAGTCTAGTCTAGACTTGTCGCTGACTCCGAGAAGATCCTAACATTAACCAATTCCCCCTAGTCTGAGGCACGGTTACCGATCGGGTTAATGGATCTCTCACCGTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTTTTTTTCTGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAGATAAACGTGTAACTGTAATAATCCGCCCGAAAAAACTGATCTTAGGGTTGCGGCATCTGCACGTGACAGTGTGCTACTGTTAGATAGAGGGATCAAACGAGGTTGCAAGGATTATATCTCTCCGTGCTCGATAAGACACAGCCGGTTGCGGGCTGCTTCCTCTGGATCCAATGCAGCCGTACGTACACCGTAGAGCAAATTTAGTGGTAAAGGAACTTGCTCAAACACTACGGCTTCGGGCTACTGTTGGCGCCGGTTGGGGATCCCATTCAACGCTGGCCCTTTCGCTATGGTTCGGTGATTTTACACCGAAGCGAACCTTGAACCGTGGATTTCGGGTGTCCTCCGTTTTTAGGTACTGCGTGCAGACATGGGCACCTGCCATAGTGCGATCAGCCAGAATCCATTGTATGGGAGTTGGACTCGTTTGAATTTACCGGAAACCTCATGCTTGGTCTGTAGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGTCTGAAACTGGGCGACTTGAAGTCGGCTTGCGTATTAATAGCTCTGCAATGTAACTCGGCCCTTGGCGGCGGGCAGCTTAGTATTGAACCGCGACACACCATAGGTGCGGCAAATATTAAAAGTACGCTCGAACCGGAACCTGTCTCCATGACTGGACGACCAGCCCGGCGTCTTCTACGTAACACAGGGGGCTGTCGAGGTAGGGCGTAGGAACTTCGGGGTCACTACGCCGTAACAGCACCGAATATCATATCATCCAACTTGCTTGGTACATGCCCCGTTCTGTATCAAAAGTTTACGGCCCCGGACATACCTGCTGTCAGTTGAATACCTATGCGAGTCTGAAACACGAATAGTTCAGGCGTGCAAAGACACGCTAAGCACACGCCGCAGGCAGGGGGGGTATTTTATAAGTCGTTTTTTGGAAGGGTAATGTAAACTTATCCCATAATACCCTTTGGCTTCCCCTCACTCGTGCACTTCTCATAATGATACGTCAGGGTGATTGTAGATTCACGCGTCATCAGATTGTCCCTTTCTCGAGTCTTAGTATCTTTCCTAATCCGCTCGACTCTGCGCCATGATCGAATTCCTGACAGGCTACAAGAATAAACTGCCAGCATACTCCTTACCGATTGGCGCCTACTAATTATACGCACATGGGCATCTTCGACGTCTAAACATAGGCTCTTAGTATTCCGTAGGATGTTGAGCCGACAGGAAAGTCAAACGTCGTGGGTGACCGTAGCCTGACTCGCCCGACGCAGGATTCGCTCATATGTGTGAACGGATGCTTATGTAACTTCCTAATTGCAGCGAATGGCAGTTCCGTAGTGAAGGTTCGAAACGTACGGGGTCCGGCCATGGATTAGATCTTTCAGTGCGCTAAACTCTTAACCGCAGATACTTGGCGGACCATCTTCGTG
TTGCTACTATGGTATAGACCAGGCTGTCGAATCTACTTAACACAGGTGAACCCCCAGATCGGCTAGAGCCTTCGAGGCTAGACCTTTAACAATCTTTAGACACTTCCAAATCGCGGCCGGATATGTCTCGTTGGCAGCCGCAGACAAGAGAAGAGGGTCGGCAGTGTCTGCCACGCGTGACCTGTATGATCTTAGCCTTTAAGATCACACTACTGATCACAATCTATTATGATTGCCTTAGCTAACTGAGTGATGCACCCCCACAGGCTGAGAGAAATCTGTAGTTTGACGACACGCCGTCTGGCTAAAAATGTGAATCCGCCGATCCGAGACGGTGGAAGCTTGAGACCAAATGCGGGAAACCAATGACTTCATTACGGAACAAGACATAACGGCGTGAGTTGACGACTGGGATTAACCCTTTCCCGAGTCTGTACTTCTGCTACACAATGAGGATGCGAATTATCTAAGACCTTGTACTACCTAAACTAACCCTGAGGCGGGCATTGAATTCCGGCCATCTTCAGCCCAAAGAAAGACCAAATGTGAGGAAAATGAGGGATCGGTATAAGCTTTTCACGATCTCAAGGTTCACGGCCGCCAGGGCCGTAGTTGGGGCTTCATGCACATTGCCAACCCGGACATCGACAGTCGGTACCGCAGGGGTTCGAGGAATACTCCCAGCTGTGACACCTGGTCGTCGACTGGACCCAGCTGGTGGGCGGCATAGGTAGTTAATACTGAATTAAAGCCGGGAACGTCTCTCTAACTAGAAACCTTGTGATAGGATACACAGACCTAGTGCCCCGACGTTAGCATTTGAATTCATCTATCTTGGCGTCTTTTAGTAGGCCTGGGTCAACTCCGGCGTTGGCCAAAATAACCGATCTGCGTTATGTGGCCACGCATCGAGTGACAGGGTGCATACAAATTGATGGTCAAAGAGTTTAAACAAGACAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCAGATCCCCACGCTTCTACATAGCCACACTGGAGCTAGTCCTCGTGTTAAATTTTTCGCTTGTTGCACGGTTATCATCAGAAGTGCCACTGGTATTCCTCTGTAGCTCCCGTATGCCGAAGGTTGCGGCTTAGGTACTGCTTATACACGTCTCTCAAGTTTGTCAGCCGCGTGATCTTTCTGCGGGGATAGGTGATCGTCCCTCGCTCCGGACATTGCATTAAAATTACCTAGTTGATAGGGCGGCGGAGTTGCATACCGGCGTTCAATCGCGGCTCCAGACTGGTTTGAGCTACGCGTCTGCCAGCGTGAAAAAGCTGATTTGTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCCAGGTATTATCATTTGAATCGTATGTTTTCTGCCGTACGTCAACTGCGTCGTCGGGGACTGAAATGGTCTGCCTCCAGACCCTTACCTCCCGATAAGCCATGACTAAGTATGTGAAGGATCACCTGAATTGCTGAAAGTTAACGGTAAGATATCTGAAAGAGCTCATTAGATCCAACACTTATCTACTCAAAAATTCGTCATATTTCGGTGACTTGCTAGAAAGGCTCTTGCACAGTAAGGTTATAGAGAATGCTACCGTTGAAGCACCAGCCGTTGAAGCCCGCCTTTAACCACGCGATATATCCAATTAACCAAGGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGAATGTCG
CCTTGTAATAATTACTTTGGCCCGGATTATAACGAAGGAACTCGCCATGAACTCGCAGCACGTTGTACTGGAACAATCTACTTTTTATAATATAGCGATAACTCCCAGCTTTTATGTGGGTGATATTGTCCTAGCTTTTTAAAGATACCCTCTGGCCCGGTCCAAGTAAGGTCCACATTGCCTGACGTAAGCGTACGGTCAACGGGTGCACCGGTTCCCGCTAAAGCTCGATCCTATTCTTTCAGTCGGGGGGAAATAAACTCGTATACTCTCCACCCACCCGTACGTCCCGGACTAGAATAACTACCGGGTATTTCCGGTTCGTAACACCACGCCATGACGTGTCAACATAAACGCTTCTTTTGAAAGGTGCACATGCAGATTGCACAAGCAGCAGGCACCGCCCTTATCCATATCCTGTTGAGGCCCTCGATCCTAGTGTTCCTTGTTATCAGGATATTTTCTCGCTGTACGTTATTGTCCTTTTCAAATTACAACTGACCGCTTCCTCACCCGCTAAACCCTACCTTACGCACAACCAAGGCCTTGTCCCGGATGAACCCGGCTGCTCCTATGGATAAGCAACCCAGCCCGGCAGTTTACTTCAGGTGTTATCGTCGACTGACACCCTCAGCTTTCTCCCATTACACAGCGAGTATTTTCCGCGTAGCAATGGCAGTGACTTTGAGCGCACACTCAGAAGCCGTTGGAATGGCACCGGGGACGGCCCGATTTAGCCCCGCACACCTCCTGGAATCTTAGATCGCACGGCGATCTCGGTTCAGGCACCAACCCCAAAGAGTGTTTTGAGTTTTTGGTATGGCTCGCCTCAATTATCGGTTTTCGCTGCTCTGTGCCTGTCAACTCGGCTAGCTGTCGTGTTTTGTCGATCAGTGCGTGGACACTCTCGGTCGATGGTCGTGGATGGGACTGTAGTAAGTTTCACCGAAGCAGGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAACTTCGCTTCATATAACGTAGCCATAGTGCTGTCTGCCATCAATAAGTCTTGCTCAGTGGTGCATACGTCGGGGAGGTTTGTTCCGCCTGGTCAGAACGAGTCTAGGGCGAGCCTATAGGCCAGTCGAGAGCCAAGATTCTATGAAATTAATACGACTACTGGGTGAGAGGTCATACAATTCCCGTGGAATCTGTACCTAAGATATTTCCAGATAGGGATGGCTACTGGTTAAGTTGACAGTGTCTAGATACGTGAGAGCACCTGAGAGGACGCCACGAGTCGGAGCGTGGGCGATCACCCTTCTGAGTCATAAGTCATGTCTATATATCCCTCACTAAAAAGGGCACACGACTATACATGCTTGAGCTTTACGGTCTGGCATGTGGAATGCCCGGAGCAACCCAGTCTTACCATCCTTTACGTACATTTACCGACCCGGCAGTGGCCGGCGCGGAAACCCAGGAGAACGTCGGTCATGATACGCGCCCTCCGCCGAAAGCGTGCTCACACCTCAGGATATCAGCGCTATTACCGGACGTCCCGCGTCCACCATCTAATAATTCAGGTGCTCCTAATAAGTGGGCTGGAGAGCGAGGATTGATATACGTTGAGGAGCTCCGACGGCCCTCTCGTGCGTTTGATGTAGATTGCGTTACCGACGGAGCACGCGTTTGTCAATTTCTGTCTAGGGACGTTTATGTCCTCAATACGAATACCAGGCCTATTTTAGTGTACAAATCACTTAGCAGTCGGAATTGGAAACCTGATGGAAGCGT"
counter=0
length=len(text)
search="AGATC"
tmp=0
for i in range(length):
    if text[i:i + len(search)] == search:
        tmp += 1
        if tmp > counter:
            counter = tmp
    if text[i:i + len(search)] != search:
        tmp = 0
print("done")
print(counter)
try this
import re
sequence = "AATDHDHDTKSDHAATAAT"
matches = re.findall(r'(?:AAT)+', sequence)
largest = max(matches, key=len)
print(len(largest)//len('AAT'))
Basically, this finds the list of substrings made of back-to-back repetitions of AAT, and then you choose the longest one. The number of repetitions is the length of that longest substring divided by the length of AAT.
First and foremost, the regex solution is the Python way to solve this. However, if you want your code repaired ...
The problem with your code is that your index fails to acknowledge that you've found a match. You have no way to recognize consecutive occurrences.
Consider the case where you've found the start of a triple match, AATAATAAT. You get to the first A, recognize the AAT, and increment tmp. You go to the next loop iteration, and now i points at the second A. You see that it's not AAT here (it's ATA, spanning the first two occurrences), so you record one instance and reset all your status variables.
Instead, you have to jump to the end of the first match and look for a second. Since your index does not move smoothly by increments of 1, you'll want a while loop instead.
Please learn to use meaningful variable names where the variables have any meaning. i is fine if all it does is manage your loop. As soon as you use it for anything else, give it a real name. Similarly, tmp and counter really need replacing.
snip_size = len(search)
pos = 0      # position in the genetic sequence
rep = 0      # number of consecutive repetitions
max_rep = 0  # longest repetition sequence found
while pos < length:
    if text[pos:pos + snip_size] == search:
        rep += 1
        pos += snip_size
    else:
        max_rep = max(max_rep, rep)
        rep = 0
        pos += 1
max_rep = max(max_rep, rep)  # record a run that reaches the end of the text
print(max_rep, "repetitions found")
Output:
15 repetitions found
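If you want this as a reusable function, one possible shape is sketched below. The name longest_repeat is hypothetical. It deliberately tries every start position instead of jumping past a match, which is slower but, as an assumption about what you want, also handles search strings that can overlap themselves.

```python
def longest_repeat(text, search):
    """Return the largest number of back-to-back copies of search in text."""
    if not search:
        return 0
    size = len(search)
    best = 0
    for pos in range(len(text)):
        reps = 0
        # Count copies of search that directly follow one another from pos
        while text.startswith(search, pos + reps * size):
            reps += 1
        best = max(best, reps)
    return best

print(longest_repeat("AATDHDHDTKSDHAATAAT", "AAT"))  # 2
```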

Python: Find a substring in a string and returning the index of the substring

I have:
a function: def find_str(s, char)
and a string: "Happy Birthday",
I essentially want to input "py" and get 3 back, but I keep getting 2 instead.
Code:
def find_str(s, char):
    index = 0
    if char in s:
        char = char[0]
        for ch in s:
            if ch in s:
                index += 1
            if ch == char:
                return index
    else:
        return -1
print(find_str("Happy birthday", "py"))
Not sure what's wrong!
There's a builtin method find on string objects.
s = "Happy Birthday"
s2 = "py"
print(s.find(s2))
Python is a "batteries included" language: there's already code written to do most of what you want (whatever you want), unless this is homework :)
find returns -1 if the string cannot be found.
Ideally you would use str.find or str.index like demented hedgehog said. But you said you can't ...
Your problem is that your code searches only for the first character of your search string, which is at index 2.
You are basically saying: if char[0] is in s, increment index until ch == char[0]. That returned 3 when I tested it, but it was still wrong, because the rest of the search string is never compared. Here's a way to do it.
def find_str(s, char):
    index = 0
    if char in s:
        c = char[0]
        for ch in s:
            if ch == c:
                if s[index:index+len(char)] == char:
                    return index
            index += 1
    return -1
print(find_str("Happy birthday", "py"))
print(find_str("Happy birthday", "rth"))
print(find_str("Happy birthday", "rh"))
It produced the following output:
3
8
-1
There is one other option using regular expressions: the search method.
import re
string = 'Happy Birthday'
pattern = 'py'
print(re.search(pattern, string).span()) ## this prints starting and end indices
print(re.search(pattern, string).span()[0]) ## this does what you wanted
By the way, if you would like to find all the occurrences of a pattern instead of just the first one, you can use the finditer method:
import re
string = 'i think that that that that student wrote there is not that right'
pattern = 'that'
print([match.start() for match in re.finditer(pattern, string)])
which will print all the starting positions of the matches.
Adding onto @demented hedgehog's answer on using find():
In terms of efficiency, it may be worth first checking whether s1 is in s2 before calling find().
This can be more efficient if you know that most of the time s1 won't be a substring of s2, since the in operator is very efficient:
s1 in s2
It can be more efficient to convert:
index = s2.find(s1)
to
index = -1
if s1 in s2:
    index = s2.find(s1)
This is useful when find() is going to return -1 a lot.
I found it substantially faster since find() was being called many times in my algorithm, so I thought it was worth mentioning.
Here is a simple approach:
my_string = 'abcdefg'
print(my_string.find('def'))
Output:
3
If the substring is not there, you will get -1.
For example:
my_string = 'abcdefg'
print(my_string.find('xyz'))
Output:
-1
Sometimes, you might want to throw an exception if the substring is not there:
my_string = 'abcdefg'
print(my_string.index('xyz')) # It returns an index only if it's present
Output:
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print(my_string.index('xyz'))
ValueError: substring not found
Late to the party; I was searching for the same thing. Since "in" is not allowed, I wrote the following:
def find_str(full, sub):
    sub_index = 0
    position = -1
    for ch_i, ch_f in enumerate(full):
        if ch_f.lower() != sub[sub_index].lower():
            position = -1
            sub_index = 0
        if ch_f.lower() == sub[sub_index].lower():
            if sub_index == 0:
                position = ch_i
            if (len(sub) - 1) <= sub_index:
                break
            else:
                sub_index += 1
    else:
        # the loop ended without a complete match, so discard any partial one
        position = -1
    return position
print(find_str("Happy birthday", "py"))
print(find_str("Happy birthday", "rth"))
print(find_str("Happy birthday", "rh"))
which produces
3
8
-1
Remove the lower() calls if a case-insensitive search is not needed. Note that the scan never backtracks, so it can miss a match that begins inside a failed partial match (for example "aab" in "aaab").
Not directly answering the question but I got a similar question recently where I was asked to count the number of times a sub-string is repeated in a given string. Here is the function I wrote:
def count_substring(string, sub_string):
    cnt = 0
    len_ss = len(sub_string)
    for i in range(len(string) - len_ss + 1):
        if string[i:i+len_ss] == sub_string:
            cnt += 1
    return cnt
The find() function returns the index of the first occurrence only. Storing the indices instead of just counting gives the distinct set of positions at which the sub-string occurs within the string.
Disclaimer: I am extremely new to Python programming.
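The index-storing idea mentioned above could look roughly like this. It is a sketch that reuses the same slicing loop; the name substring_indices is hypothetical and not from the answer.

```python
def substring_indices(string, sub_string):
    """Return the start index of every occurrence of sub_string, overlaps included."""
    len_ss = len(sub_string)
    return [i for i in range(len(string) - len_ss + 1)
            if string[i:i + len_ss] == sub_string]

indices = substring_indices("that that that", "that")
print(indices)       # [0, 5, 10]
print(len(indices))  # 3
```

The count is then simply the length of the returned list.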
