Python - Find words in string - python

I know that I can find a word in a string with
if word in my_string:
But I want to find all "word" in the string, like this.
counter = 0
while True:
    if word in my_string:
        counter += 1
How can I do it without "counting" the same word over and over again?

If you want to count only whole words, so that 'is' is counted once in 'this is' even though 'this' also contains 'is', you can split, filter, and count:
>>> s = 'this is a sentences that has is and is and is (4)'
>>> word = 'is'
>>> counter = len([x for x in s.split() if x == word])
>>> counter
4
However, if you just want to count all occurrences of a substring, i.e. 'is' should also match the 'is' in 'this', then:
>>> s = 'is this is'
>>> counter = len(s.split(word))-1
>>> counter
3
In other words, split the string at every occurrence of the word, then subtract one to get the count.
Edit - JUST USE COUNT:
It's been a long day, so I totally forgot that str has a built-in method for this: str.count(substring). It does the same as my second answer but is far more readable. Please consider using this method (and look at the other answers for how to).

Use the start argument of the .find method.
counter = 0
search_pos = 0
while True:
    found = my_string.find(word, search_pos)
    if found != -1: # find returns -1 when it's not found
        # update counter and move search_pos to look for the next word
        search_pos = found + len(word)
        counter += 1
    else:
        # the word wasn't found
        break
This is a fairly general-purpose solution. Specifically for counting occurrences in a string, you can just use my_string.count(word).
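As a rough sketch of how the find loop and str.count relate, the loop can be wrapped in a function. The name count_occurrences and the overlapping flag are made up for this illustration; they are not part of str. With the default settings the result should agree with str.count, which counts non-overlapping matches.

```python
def count_occurrences(text, word, overlapping=False):
    """Count occurrences of word in text using repeated str.find calls."""
    if not word:
        return 0  # an empty search string would never advance the search
    # after a match, skip the whole word, or only one character if overlaps count
    step = 1 if overlapping else len(word)
    counter = 0
    search_pos = text.find(word)
    while search_pos != -1:
        counter += 1
        search_pos = text.find(word, search_pos + step)
    return counter


print(count_occurrences('is this is', 'is'))               # 3
print(count_occurrences('aaaa', 'aa'))                     # 2, same as 'aaaa'.count('aa')
print(count_occurrences('aaaa', 'aa', overlapping=True))   # 3
```

The overlapping variant is the one case where the manual loop does something str.count cannot.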

String actually already has the functionality you are looking for. You simply need to use str.count(item) for example.
EDIT: This will search for all occurrences of said string including parts of words.
string_to_search = 'apple apple orange banana grapefruit apple banana'
number_of_apples = string_to_search.count('apple')
number_of_bananas = string_to_search.count('banana')
The following counts only complete words: just split the string you want to search first.
string_to_search = 'apple apple orange banana grapefruit apple banana'.split()
number_of_apples = string_to_search.count('apple')
number_of_bananas = string_to_search.count('banana')

Use regular expressions:
import re
word = 'test'
my_string = 'this is a test and more test and a test'
# Use escape in case your search word contains periods or symbols that are used in regular expressions.
re_word = re.escape(word)
# re.findall returns a list of matches
matches = re.findall(re_word, my_string)
# matches = ['test', 'test', 'test']
print(len(matches)) # 3
Be aware that this will also match inside other words that contain your word, like testing. You could change your regex to match only the exact word.
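One possible way to match only the exact word is to wrap the escaped word in \b word-boundary anchors. This is a minimal sketch; it assumes the search word starts and ends with a letter, digit, or underscore, since \b only marks boundaries next to such characters. The sample string is changed slightly to include testing.

```python
import re

word = 'test'
my_string = 'this is a test and more testing and a test'

# \b matches at a word boundary, so the 'test' inside 'testing' is not counted
pattern = r'\b' + re.escape(word) + r'\b'
matches = re.findall(pattern, my_string)
print(len(matches))  # 2
```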

Related

Reverse a specific word function

I'm having trouble doing the next task:
So basically, I need to build a function that receives a (sentence, word, occurrence)
and it will search for that word and reverse it only where it occurs
for example:
function("Dani likes bananas, Dani also likes apples", "lik", "2")
returns: "Dani likes bananas, Dani also kiles apples"
As you can see, the "word" is 'lik' and at the second time it occurred it reversed to 'kil'.
I wrote something but it's too messy and that part still doesn't work for me,
def q2(sentence, word, occurrence):
    count = 0
    reSentence = ''
    reWord = ''
    for char in word:
        if sentence.find(word) == -1:
            print('could not find the word')
            break
        for letter in sentence:
            if char == letter:
                if word != reWord:
                    reWord += char
                    reSentence += letter
                    break
                elif word == reWord:
                    if count == int(occurrence):
                        reWord = word[::-1]
                        reSentence += reWord
                    elif count > int(occurrence):
                        print("no such occurrence")
                    else:
                        count += 1
            else:
                reSentence += letter
    print(reSentence)
sentence = 'Dani likes bananas, Dani also likes apples'
word = 'li'
occurrence = '2'
q2(sentence,word,occurrence)
The main problem right now is that after it breaks, it goes back to checking from the start of the sentence, so it will find the i in "Dani". I couldn't think of a way to make it continue from where it stopped.
I tried using enumerate but still had no idea how.
This will work for the given scenario:
sentence = 'Dani likes bananas, Dani also likes apples'
word = 'lik'
st = word
occ = 2
lt = sentence.split(word)
op = ''
if (len(lt) > 1):
    for i,x in enumerate(lt[:-1]):
        if (i+1) == occ:
            word = ''.join(reversed(word))
        op = op + x + word
        word = st
print(op+lt[-1])
Please test it yourself for other scenarios.
The line for i,x in enumerate(lt[:-1]) loops over the list excluding the last element; enumerate gives the index of each element in i and its value in x. As the code loops, I re-join the split pieces with the same word I split on, but use the reversed word at the occurrence you specified. The last element is excluded from the loop because the loop appends the word after each piece; if I included the whole list, there would be an extra word at the end. Hope that explains it.
Your approach shows that you've clearly thought about the problem and are using the means you know well enough to solve it. However, your code has a few too many issues to simply fix, for example:
you only check for occurrence of the word once you're inside the loop;
you loop over the entire sentence for each letter in the word;
you only compare a character at a time, and make some mistakes in keeping track of how much you've matched so far;
you pass a string '2', which you intend to use as a number 2.
All of that and other problems can be fixed, but you would do well to use what the language gives you. Your task breaks down into:
find the n-th occurrence of a substring in a string
replace it with another word where found and return the string
Note that you're not really looking for a 'word' per se, as your example shows you replacing only part of a word (i.e. 'lik') and a 'word' is commonly understood to mean a whole word between word boundaries.
def q2(sentence, word, occurrence):
    # the first bit: find where the requested occurrence starts
    position = -1  # start at -1 so the first search begins at index 0
    count = 0
    while count < occurrence:
        position = sentence.find(word, position+1)
        count += 1
        if position == -1:
            print (f'Word "{word}" does not appear {occurrence} times in "{sentence}"')
            return None
    # and then using what was found for a result
    return sentence[0:position] + word[::-1] + sentence[position+len(word):]
print(q2('Dani likes bananas, Dani also likes apples','lik',2))
print(q2('Dani likes bananas, Dani also likes apples','nope',2))
A bit of explanation on that return statement:
sentence[0:position] gets sentence from the start 0 to the character just before position, this is called a 'slice'
word[::-1] gets word from start to end, but going in reverse (step -1). Leaving out the values in the slice implies 'from one end to the other'.
sentence[position+len(word):] gets sentence from the position position + len(word), which is the character after the found word, until the end (no index, so taking everything).
All those combined is the result you need.
Note that the function returns None if it can't find the word the right number of times - that may not be what is needed in your case.
import re
from itertools import islice
s = "Dani likes bananas, Dani also likes apples"
t = "lik"
n = 2
# re.escape keeps any regex metacharacters in t from being interpreted
x = re.finditer(re.escape(t), s)
try:
    i = next(islice(x, n - 1, n)).start()
except StopIteration:
    i = -1
if i >= 0:
    y = s[i: i + len(t)][::-1]
    print(f"{s[:i]}{y}{s[i + len(t):]}")
else:
    print(s)
This finds the starting index of the n-th match (if it exists) using a regex. In the worst case it may require two passes over the string s: one to find the index and one to build the output. It can also be done in one pass using two pointers, but I'll leave that to you. From what I see, no one has yet offered a solution that works in one pass.
index = find the index of the n-th occurrence
Use slice notation to get the part you are interested in (you have its beginning and length)
Reverse it
Construct your result string:
result = sentence[:index] + reversed_part + sentence[index+len(word):]
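The steps above can be sketched as a small function. The name reverse_nth is invented for this example, and returning the sentence unchanged when there are fewer than n occurrences is an assumption; the question does not say what should happen in that case.

```python
def reverse_nth(sentence, word, n):
    """Return sentence with the n-th (1-based) occurrence of word reversed."""
    if n < 1 or not word:
        return sentence  # nothing sensible to reverse
    index = -1
    for _ in range(n):
        # continue searching just past the start of the previous match
        index = sentence.find(word, index + 1)
        if index == -1:
            return sentence  # assumption: fewer than n occurrences leaves it unchanged
    reversed_part = sentence[index:index + len(word)][::-1]
    return sentence[:index] + reversed_part + sentence[index + len(word):]


print(reverse_nth('Dani likes bananas, Dani also likes apples', 'lik', 2))
# Dani likes bananas, Dani also kiles apples
```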

Find top frequency word in string and check if string only contains [a-z][A-Z] characters

I have written some Python code to find the top word frequency in a string. I am pretty new to Python and would like your help to see whether I could code this better and more efficiently.
(The code returns one integer: the frequency of the most frequent word in the string.)
I also want to make sure that the string only contains the characters [a-z] and [A-Z]. I tried, but I don't know how to do that check.
from collections import Counter
class WordCounter:
    def __init__(self, word, frequency):
        self.word = word
        self.frequency = frequency
    # find_top_frequency should return the highest frequency in the text
    def find_top_frequency(text: str) -> int:
        incoming_string = text.split()
        incoming_string= [x.lower() for x in incoming_string]
        Words_in_dict_count = {}
        for i in incoming_string:
            if i not in Words_in_dict_count.keys():
                Words_in_dict_count[i] = 0
            Words_in_dict_count[i] = Words_in_dict_count[i] + 1
        return (max(Words_in_dict_count.values()))
print("\nTop frequency of same word in string is: " +str(WordCounter.find_top_frequency("I would love to make this code better but hope you can help me with it. Thank you helping me out. Hope you can help me."))+"\n")
Use str.isalpha() to filter for words that only have letters in them, and use Counter.most_common to get the most common word.
>>> from collections import Counter
>>> def find_top_frequency(text: str) -> int:
...     return Counter(
...         word for word in text.split() if word.isalpha()
...     ).most_common(1).pop()[1]
...
>>> find_top_frequency("foo bar foo foo bar 1 1 1 1 1 1")
3
Note that most_common(1).pop() is a Tuple[str, int] of the word and the count (in this case it'd be ('foo', 3)), so if you want both of those instead of just the count, all you have to do is remove the [1].
Use a regular expression to split the text into words, count the occurrences of each word, and return the max:
from collections import Counter
import re
def find_top_frequency(text: str) -> int:
    words = re.findall(r'\w+', text)
    counter = Counter(words)
    return max(counter.values())
    # or use counter.most_common(1), which returns [(word, count)], if you want the word too.
# s = "I would love to make this code better but hope you can help me with it. Thank you helping me out. Hope you can help me."
>>> find_top_frequency(s)
3
Quite simple.
EDIT: I forgot to check the characters at first.
import re
test_string = 'bee movie bee movie bee movie script lelelelele-'
def word_counter(string):
    # only count when the string has nothing but letters and whitespace
    if re.search(r'[^a-zA-Z\s]+',string) is None:
        words = string.split()
        unique_words = set(words)
        word_count_dict = {word:words.count(word) for word in unique_words}
        return word_count_dict
    else:
        print('There are characters which aren\'t whitespaces or letters in your string.')
print(word_counter(test_string))
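For the character check the question asks about, one option is re.fullmatch with a character class. The sketch below is only an illustration: the helper names only_letters and top_word are hypothetical, and allowing whitespace between words is an assumption, since a string of several words cannot consist of [a-zA-Z] alone.

```python
import re
from collections import Counter


def only_letters(text):
    """True if text contains nothing but ASCII letters and whitespace."""
    return re.fullmatch(r'[a-zA-Z\s]*', text) is not None


def top_word(text):
    """Return (word, count) for the most frequent word, ignoring case."""
    words = text.lower().split()
    if not words:
        return None  # no words at all
    return Counter(words).most_common(1)[0]


print(only_letters('bee movie bee movie'))   # True
print(only_letters('lelelelele-'))           # False
print(top_word('Bee movie bee movie bee'))   # ('bee', 3)
```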

How to find the largest repeating substring given character in Python?

Given some string, say 'aabaaab', how would I go about finding the longest substring made up only of a? It should return 'aaa'. Any help would be greatly appreciated.
def sub_string(s):
    best_run = 0
    current_run = 0
    for char in s:
        if char == 'a':
            current_run += 1
        else:
            current_letter = char
    return(best_run)
I have something like the one above. Not sure where I can fix it up.
Not the most efficient, but a straightforward solution:
word = "aasfgaaassaasdsddaaaaaafff"
substr_count = 0
substr_counts = []
character = "f"
for i, letter in enumerate(word):
    if (letter == character):
        substr_count += 1
    else:
        substr_counts.append(substr_count)
        substr_count = 0
    if (i == len(word) - 1):
        substr_counts.append(substr_count)
print(max(substr_counts))
If you want a short method using standard Python tools (and want to avoid writing loops to reconstruct the string as you iterate), you can use a regex to split the string on any non-a character, then get the max() according to len:
import re
test_string = 'aabaaab'
split_string_list = re.split( '[^a]', test_string )
longest_string_subset = max( split_string_list, key=len )
print( longest_string_subset )
The re library is for regex, and '[^a]' is a regex pattern for any non-a character. Basically, 'aabaaab' is split into a list at every match of the pattern, so that it becomes ['aa', 'aaa', '']. Then max() picks the longest string based on len (i.e. length).
You can read more about functions like re.split() in the docs: https://docs.python.org/2/library/re.html
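For comparison, the loop from the question can be completed with two small changes: record the best run whenever the current run grows, and reset the current run when a different character appears. This is a sketch only; the function name longest_run and the target parameter are invented here.

```python
def longest_run(s, target='a'):
    """Return the longest substring of s made up only of target."""
    best_run = 0
    current_run = 0
    for char in s:
        if char == target:
            current_run += 1
            best_run = max(best_run, current_run)
        else:
            current_run = 0  # the run is broken, start counting again
    return target * best_run


print(longest_run('aabaaab'))  # aaa
```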

Check for words in a sentence

I am writing a program in Python. The user enters a text message, and I need to check whether a given sequence of words appears in it. For example, for the message "Hello world, my friend." and the two words "Hello", "world", the result is True. But for the message "Hello, beautiful world" the result is False. When only two words need to be checked, it can be done as in my code below, but with combinations of 5 or more words it gets difficult. Is there a compact solution to this problem?
s=message.text
s=s.lower()
lst = s.split()
if "hello" in lst and "world" in lst :
    if "hello" in lst:
        c=lst.index("hello")
        if lst[c+1]=="world" or lst[c-1]=="world":
            E=True
        else:
            E=False
The straightforward way is to use a loop. Split your message into individual words, and then check whether each of them appears in the sentence.
word_list = message.split() # this gives you a list of words to find
word_found = True
for word in word_list:
    if word not in message2:
        word_found = False
print(word_found)
The flag word_found is True if and only if all words were found in the sentence. There are many ways to make this shorter and faster, especially using the built-in all() function and providing the word list as an in-line expression.
word_found = all(word in message2 for word in message.split())
Now, if you need to restrict your "found" property to matching exact words, you'll need more preprocessing. The above code is too forgiving of substrings, such as finding "are you ok" in the sentence "your joke is only barely funny". For the more restrictive case, you should break message2 into words, strip those words of punctuation, convert them to lower-case (to make matching easier), and then look for each word (from message) in the list of words from message2.
Can you take it from there?
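One way the preprocessing described above might look is sketched here. The helper names normalize and all_words_found are invented for the illustration, and stripping only leading and trailing punctuation with string.punctuation is an assumption about what "strip those words of punctuation" should mean.

```python
import string


def normalize(text):
    """Lower-case text and return its words with surrounding punctuation stripped."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]  # drop tokens that were punctuation only


def all_words_found(message, message2):
    """True if every word of message appears as a whole word in message2."""
    wanted = normalize(message)
    available = set(normalize(message2))
    return all(word in available for word in wanted)


print(all_words_found('Hello world', 'Hello, beautiful world!'))         # True
print(all_words_found('Are you OK ?', 'your joke is only barely funny'))  # False
```

Note that this only checks that the words are present; it does not check that they are adjacent or in order.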
I will clarify your requirements first:
ignore case
consecutive sequence
match in any order, like a permutation or anagram
support duplicated words
If the number of words is not too large, you can try this easy-to-understand, though not the fastest, approach:
split the text message into words
join them with ' '
list all the permutations of the words and join each with ' ' too. For example, if you want to check the sequence ['Hello', 'beautiful', 'world'], the permutations will be 'Hello beautiful world', 'Hello world beautiful', 'beautiful Hello world', and so on.
then check whether any one permutation, such as 'hello beautiful world', is in the joined text.
The sample code is here:
import itertools
import re
# permutations brute-force, O(nk!)
def checkWords(text, word_list):
    # split all words without space and punctuation
    text_words= re.findall(r"[\w']+", text.lower())
    # list all the permutations of word_list, and match
    for words in itertools.permutations(word_list):
        if ' '.join(words).lower() in ' '.join(text_words):
            return True
    return False
    # or use any, just one line
    # return any(' '.join(words).lower() in ' '.join(text_words) for words in list(itertools.permutations(word_list)))
def test():
    # True
    print(checkWords('Hello world, my friend.', ['Hello', 'world', 'my']))
    # False
    print(checkWords('Hello, beautiful world', ['Hello', 'world']))
    # True
    print(checkWords('Hello, beautiful world Hello World', ['Hello', 'world', 'beautiful']))
    # True
    print(checkWords('Hello, beautiful world Hello World', ['Hello', 'world', 'world']))
But this is costly when the number of words is large: k words generate k! permutations, so the time complexity is O(nk!).
I think a more efficient solution is a sliding window, which reduces the time complexity to O(n):
import itertools
import re
import collections
# sliding window, O(n)
def checkWords(text, word_list):
    # split all words without space and punctuation
    text_words = re.findall(r"[\w']+", text.lower())
    counter = collections.Counter(map(str.lower, word_list))
    start, end, count, all_indexes = 0, 0, len(word_list), []
    while end < len(text_words):
        counter[text_words[end]] -= 1
        if counter[text_words[end]] >= 0:
            count -= 1
        end += 1
        # if you want all the index of match, you can change here
        if count == 0:
            # all_indexes.append(start)
            return True
        if end - start == len(word_list):
            counter[text_words[start]] += 1
            if counter[text_words[start]] > 0:
                count += 1
            start += 1
    # return all_indexes
    return False
I don't know if this is what you really need, but this worked; you can test it:
message= 'hello world'
message2= ' hello beautiful world'
if 'hello' in message and 'world' in message :
    print('yes')
else :
    print('no')
if 'hello' in message2 and 'world' in message2 :
    print('yes')
Output:
yes
yes

Removing words containing digits from a given string

I'm trying to write a simple program that removes all words containing digits from a received string.
Here is my current implementation:
import re
def checkio(text):
    text = text.replace(",", " ").replace(".", " ") .replace("!", " ").replace("?", " ").lower()
    counter = 0
    words = text.split()
    print(words)
    for each in words:
        if bool(re.search(r'\d', each)):
            words.remove(each)
    print(words)
checkio("1a4 4ad, d89dfsfaj.")
However, when I execute this program, I get the following output:
['1a4', '4ad', 'd89dfsfaj']
['4ad']
I can't figure out why '4ad' is printed in the second line as it contains digits and should have been removed from the list. Any ideas?
Assuming that your regular expression does what you want, you can do this to avoid removing while iterating.
import re
def checkio(text):
    text = re.sub(r'[,.?!]', ' ', text).lower()
    words = [w for w in text.split() if not re.search(r'\d', w)]
    print(words) ## prints [] in this case
Also, note that I simplified your text = text.replace(...) line.
Additionally, if you do not need to reuse your text variable, you can use a regex to split it directly.
import re
def checkio(text):
    # split on whitespace as well as punctuation so that each word is separate
    words = [w for w in re.split(r'[\s,.?!]+', text.lower()) if w and not re.search(r'\d', w)]
    print(words) ## prints [] in this case
If the words contain only letters and digits, why not use isalpha() instead of a regex? It is True only for words made up entirely of letters.
In [1695]: x = ['1a4', '4ad', 'd89dfsfaj']
In [1696]: [word for word in x if word.isalpha()]
Out[1696]: []
This would be possible using re.sub, re.search, and a list comprehension.
>>> import re
>>> def checkio(s):
...     print([i for i in re.sub(r'[.,!?]', '', s.lower()).split() if not re.search(r'\d', i)])
...
>>> checkio("1a4 4ad, d89dfsfaj.")
[]
>>> checkio("1a4 ?ad, d89dfsfaj.")
['ad']
So what happens is that you are modifying the list while iterating over it: you delete an element while traversing the list.
At the first iteration we have words = ['1a4', '4ad', 'd89dfsfaj']. Since '1a4' contains a digit, we remove it.
Now words = ['4ad','d89dfsfaj']. However, at the second iteration the current word is 'd89dfsfaj', and we remove it. We skip '4ad' because it is now at index 0 while the for loop's internal position is already at 1.
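If you want to keep the remove-in-a-loop structure, a minimal fix is to iterate over a copy of the list so that removals do not shift the positions the loop is walking over. The function name remove_words_with_digits is made up for this sketch.

```python
import re


def remove_words_with_digits(words):
    # words[:] is a shallow copy; the loop walks the copy while
    # removals are applied to the original list
    for each in words[:]:
        if re.search(r'\d', each):
            words.remove(each)
    return words


print(remove_words_with_digits(['1a4', '4ad', 'd89dfsfaj']))  # []
print(remove_words_with_digits(['abc', 'a1', 'def']))          # ['abc', 'def']
```

Building a new list with a list comprehension, as in the other answers, avoids the problem entirely and is usually the simpler choice.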
