I'm trying to determine whether a substring is in a string.
The issue I'm running into is that I don't want my function to return True if the substring is found within another word in the string.
For example, if the substring is "Purple cow"
and the string is "Purple cows make the best pets."
this should return False, since "cow" isn't plural in the substring.
And if the substring is "Purple cow"
and the string is "Your purple cow trampled my hedge!"
it should return True.
My code looks something like this:
def is_phrase_in(phrase, text):
    phrase = phrase.lower()
    text = text.lower()
    return phrase in text
text = "Purple cows make the best pets!"
phrase = "Purple cow"
print(is_phrase_in(phrase, text))
In my actual code I clean up unnecessary punctuation and spaces in 'text' before comparing it to phrase, but otherwise this is the same.
I've tried using re.search, but I don't understand regular expressions very well yet and have only gotten the same functionality from them as in my example.
Thanks for any help you can provide!
Since your phrase can have multiple words, doing a simple split and intersect won't work. I'd go with regex for this one:
import re
def is_phrase_in(phrase, text):
    # re.escape keeps any regex metacharacters in the phrase literal.
    return re.search(r"\b{}\b".format(re.escape(phrase)), text, re.IGNORECASE) is not None
phrase = "Purple cow"
print(is_phrase_in(phrase, "Purple cows make the best pets!")) # False
print(is_phrase_in(phrase, "Your purple cow trampled my hedge!")) # True
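One caveat that may be worth sketching: \b only matches next to a word character, so a phrase that starts or ends with punctuation (the hypothetical "C++" below) can never satisfy \b...\b. A possible variant, assuming "whole word" means "not touching another word character", escapes the phrase and uses lookarounds instead. The name is_phrase_in_escaped is made up for this illustration.

```python
import re

def is_phrase_in_escaped(phrase, text):
    # re.escape makes regex metacharacters in the phrase match literally.
    # (?<!\w) and (?!\w) reject a match that touches another word character.
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None

print(is_phrase_in_escaped("Purple cow", "Purple cows make the best pets!"))     # False
print(is_phrase_in_escaped("Purple cow", "Your purple cow trampled my hedge!"))  # True
print(is_phrase_in_escaped("C++", "I like C++ a lot"))                           # True
```

For plain alphanumeric phrases this should behave the same as the \b version above.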
Using PyParsing:
import pyparsing as pp
def is_phrase_in(phrase, text):
    phrase = phrase.lower()
    text = text.lower()
    rule = pp.ZeroOrMore(pp.Keyword(phrase))
    for t, s, e in rule.scanString(text):
        if t:
            return True
    return False
text = "Your purple cow trampled my hedge!"
phrase = "Purple cow"
print(is_phrase_in(phrase, text))
Which yields:
True
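One can do this very literally with a loop:
def is_phrase_in(phrase, text):
    phrase = phrase.lower()
    text = text.lower()
    n = len(phrase)
    for i in range(len(text) - n + 1):
        # Compare the phrase against the slice of text starting at i.
        if text[i:i + n] != phrase:
            continue
        # Require a space (or the edge of the text) on both sides.
        before_ok = i == 0 or text[i - 1] == " "
        after_ok = i + n == len(text) or text[i + n] == " "
        if before_ok and after_ok:
            return True
    return False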
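Or by splitting and looking for the phrase's words as a consecutive run:
def is_phrase_in(phrase, text):
    phrase_words = phrase.lower().split()
    text_words = text.lower().split()
    n = len(phrase_words)
    return any(text_words[i:i + n] == phrase_words for i in range(len(text_words) - n + 1))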
or using regular expressions
import re
pattern = re.compile(r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)")
pattern.search(text.lower())
to say that no word character may precede or follow the phrase, while whitespace and punctuation are fine.
Regular Expressions should do the trick
import re
def is_phrase_in(phrase, text):
    phrase = phrase.lower()
    text = text.lower()
    # re.escape keeps any regex metacharacters in the phrase literal.
    if re.findall(r'\b' + re.escape(phrase) + r'\b', text):
        found = True
    else:
        found = False
    return found
Here you go, hope this helps
# Declarations
string = "My name is Ramesh and I am cool. You are Ram ?"
sub = "Ram"
# Check the string for the substring
result = sub in string
# Condition check
if result:
    # Find the starting position
    start_position = string.index(sub)
    # Get the substring length
    length = len(sub)
    # Slice the substring out of the string
    output = string[start_position:start_position + length]
Related
The task is to get the text between two signs in a sentence.
The user inputs the sentence on one line and the signs on the next (in this case, [ and ]).
Example:
In this sentence [need to get] only [few words].
The output needs to look like:
need to get few words
Does anyone have a clue how to do this?
My idea is to split the input so that I can access every element of the list, and if an element starts with [ and ends with ], save that word to another list. The problem is that this fails if the word doesn't end with ].
P.S. The user will never input an empty string or nest one sign inside another, like [word [another] word].
You can use a regex:
import re
text = 'In this sentence [need to get] only [few words] and not [unbalanced'
' '.join(re.findall(r'\[(.*?)\]', text))
Output: 'need to get few words'
Or '(?<=\[).*?(?=\])' as regex using lookarounds
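To make the comparison concrete, here is a small sketch that runs both patterns on the same input. The variable names with_group and with_lookarounds are only illustrative.

```python
import re

text = 'In this sentence [need to get] only [few words] and not [unbalanced'

# Capturing group: the brackets are matched, but findall returns only the group.
with_group = re.findall(r'\[(.*?)\]', text)

# Lookarounds: the brackets are asserted but not consumed,
# so the whole match is already the inner text.
with_lookarounds = re.findall(r'(?<=\[).*?(?=\])', text)

print(with_group)        # ['need to get', 'few words']
print(with_lookarounds)  # ['need to get', 'few words']
```

On this input both return the same list, and the unclosed [unbalanced is skipped because no ] follows it.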
You can use regular expressions like this:
import re
your_string = "In this sentence [need to get] only [few words]"
matches = re.findall(r'\[([^\[\]]*)]', your_string)
print(' '.join(matches))
Solution without regex:
your_string = "In this sentence [need to get] only [few words]"
result_parts = []
current_square_brackets_part = ''
need_to_add_letter_to_current_square_brackets_part = False
for letter in your_string:
    if letter == '[':
        need_to_add_letter_to_current_square_brackets_part = True
    elif letter == ']':
        need_to_add_letter_to_current_square_brackets_part = False
        result_parts.append(current_square_brackets_part)
        current_square_brackets_part = ''
    elif need_to_add_letter_to_current_square_brackets_part:
        current_square_brackets_part += letter
print(' '.join(result_parts))
Here is a more classical solution using parsing.
It reads the string character by character and keeps it only if a flag is set. The flag is set when meeting a [ and unset on ]
text = 'In this sentence [need to get] only [few words] and not [unbalanced'
add = False
l = []
m = []
for c in text:
    if c == '[':
        add = True
    elif c == ']':
        if add and m:
            l.append(''.join(m))
        add = False
        m = []
    elif add:
        m.append(c)
out = ' '.join(l)
print(out)
Output: need to get few words
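The answers above hard-code [ and ], while the question says the user supplies the signs. A minimal sketch of that generalisation, assuming each sign is a short literal string, could look like this. The function name extract_between and its parameters are hypothetical.

```python
import re

def extract_between(sentence, open_sign, close_sign):
    # Escape the signs so characters such as [ ] ( ) * are taken literally.
    pattern = re.escape(open_sign) + r'(.*?)' + re.escape(close_sign)
    return ' '.join(re.findall(pattern, sentence))

print(extract_between('In this sentence [need to get] only [few words].', '[', ']'))
# need to get few words
```

The sentence and the two signs could just as well be read with input() before calling the function.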
I've written a function that searches a text for a certain word, but it reports that the word isn't there even when I know it is, so it doesn't work.
def search(text, item):
    list_ = []
    p = [';', '.', ' ', ',', ':']
    string = ''
    for i in range(len(text)):
        if text[i] not in p:
            string += text[i]
        else:
            list_ += string
            string = ''
    if item in list_:
        return True
    else:
        return False
Here is something that does work. It uses a regular expression to determine what the "words" in the text are, which in my opinion is the crux of the problem. It puts them all in a set and then uses the in operator to test whether the item passed is a member of that set.
import re
WORD_PATTERN = re.compile(r"([\w][\w']*\w)")  # Regex to find words in a string.
# see https://stackoverflow.com/a/12705513/355230
def search(text, item):
    words = {*WORD_PATTERN.findall(text)}  # Set for fast membership testing.
    return item in words
if __name__ == '__main__':
    s = "John's mom went there, but he wasn't there. So she said: 'Where are you?'"
    for word in ('mom', 'bug', 'so', 'So', "wasn't"):
        print(f'{word!r} in string s -> {search(s, word)}')
Printed output:
'mom' in string s -> True
'bug' in string s -> False
'so' in string s -> False
'So' in string s -> True
"wasn't" in string s -> True
As you can see, the search is case-sensitive. Also note there are some subtleties with respect to the treatment of apostrophes; see the answer where I got the regex from for details.
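If a case-insensitive lookup is wanted instead, one possible adjustment is to fold both the collected words and the item before comparing. This is only a sketch: search_ignore_case is a made-up name, and the pattern is the same idea as above written as a raw string.

```python
import re

WORD_PATTERN = re.compile(r"\w[\w']*\w")  # same word pattern, as a raw string

def search_ignore_case(text, item):
    # Fold both sides so that 'so' and 'So' compare equal.
    words = {w.casefold() for w in WORD_PATTERN.findall(text)}
    return item.casefold() in words

s = "John's mom went there, but he wasn't there. So she said: 'Where are you?'"
for word in ('so', 'So', 'bug', "WASN'T"):
    print(f'{word!r} in string s -> {search_ignore_case(s, word)}')
```

Note that this pattern needs at least two characters to match, so single-letter words such as "a" are never collected.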
This is an extension of this previous question.
I have a python dictionary, made like this
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}
I want to replace, as fast as possible, all the words in the dictionary values with their keys. The solution should scale to large texts. If a word ends with an asterisk, all words in the text that start with that prefix should be replaced.
So the following sentence "I've been bad but I aspire to be a better person, and behave like my dog and cat :)" should be transformed into "XXX bad but I XXX to be a better person, and behave like my animal XXX".
I am trying to use trrex for this, thinking it should be the fastest option. Is it? However, I cannot get it to work.
I also run into problems:
in handling entries that include punctuation (such as ":)" and "I've been");
when one entry is contained in another, like "dog" and "dog and cat".
Can you help me achieve my goal with a scalable solution?
You can tweak this solution to suit your needs:
Create another dictionary from a that will contain the same keys and the regex created from the values
If a * char is found, replace it with \w* if you mean any zero or more word chars, or use \S* if you mean any zero or more non-whitespace chars (please adjust the def quote(self, char) method), else, quote the char
Use unambiguous word boundaries, (?<!\w) and (?!\w), or remove them altogether if they interfere with matching non-word entries
The first regex here will look like (?<!\w)(?:cat|dog(?:\ and\ cat)?)(?!\w) and the second will look like (?<!\w)(?::\)|I've\ been|asp\w*)(?!\w)
Replace in a loop.
See the Python demo:
import re
# Input
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
a = {"animal": [ "dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}
    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1
    def dump(self):
        return self.data
    def quote(self, char):
        if char == '*':
            return r'\w*'
        else:
            return re.escape(char)
    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result
    def pattern(self):
        return self._pattern(self.dump())
# Creating patterns
a2 = {}
for k,v in a.items():
    trie = Trie()
    for w in v:
        trie.add(w)
    a2[k] = re.compile(fr"(?<!\w){trie.pattern()}(?!\w)", re.I)
for k,r in a2.items():
    text = r.sub(k, text)
print(text)
# => XXX bad but I XXX to be a better person, and behave like my animal XXX
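For comparison, and as a way to sanity-check the expected output, here is a much simpler sketch that does not use a trie at all: it builds a plain alternation per key with the longest entries first, so "dog and cat" wins over "dog". The helper names build_pattern and replace_all are hypothetical, and no claim is made about its speed on large word lists, which is what the trie is meant to help with.

```python
import re

def build_pattern(words):
    parts = []
    # Longest entries first, so "dog and cat" is tried before "dog".
    for word in sorted(words, key=len, reverse=True):
        # Escape each literal piece, then turn every * into "zero or more word characters".
        parts.append(r"\w*".join(re.escape(piece) for piece in word.split("*")))
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)", re.I)

def replace_all(text, mapping):
    for key, words in mapping.items():
        # A function replacement avoids any backslash processing of the key.
        text = build_pattern(words).sub(lambda _match, key=key: key, text)
    return text

a = {"animal": ["dog", "cat", "dog and cat"], "XXX": ["I've been", "asp*", ":)"]}
text = "I've been bad but I aspire to be a better person, and behave like my dog and cat :)"
print(replace_all(text, a))
# XXX bad but I XXX to be a better person, and behave like my animal XXX
```

Like the loop above, this replaces one key at a time, so the result depends on the dictionary's iteration order if a replacement could itself match a later pattern.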
import re
def step_through_with(s):
    pattern = re.compile(s + ',')
    if pattern == True:
        return True
    else:
        return False
The task is to find a word in a sentence, where the sentence is the input parameter of the function. What should the syntax look like?
If you want to find a word in a sentence, you have to take into account boundaries (so searching for 'fun' won't match 'function' for instance).
An example:
import re
def step_through_with(sentence, word):
    # re.escape keeps any regex metacharacters in the word literal.
    pattern = r'\b{}\b'.format(re.escape(word))
    if re.search(pattern, sentence):
        return True
    return False
sentence = 'we are looking for the input of a function'
print(step_through_with(sentence, 'input')) # True
print(step_through_with(sentence, 'fun')) # False
What is an elegant way to look for a string within another string in Python, but only if the substring is within whole words, not part of a word?
Perhaps an example will demonstrate what I mean:
string1 = "ADDLESHAW GODDARD"
string2 = "ADDLESHAW GODDARD LLP"
assert string_found(string1, string2) # this is True
string1 = "ADVANCE"
string2 = "ADVANCED BUSINESS EQUIPMENT LTD"
assert not string_found(string1, string2) # this should be False
How can I best write a function called string_found that will do what I need? I thought perhaps I could fudge it with something like this:
def string_found(string1, string2):
    if string2.find(string1 + " "):
        return True
    return False
But that doesn't feel very elegant, and also wouldn't match string1 if it was at the end of string2. Maybe I need a regex? (argh regex fear)
You can use regular expressions and the word boundary special character \b (emphasis mine):
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
import re
def string_found(string1, string2):
    if re.search(r"\b" + re.escape(string1) + r"\b", string2):
        return True
    return False
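To see what the quoted definition means in practice, here is a small sketch that runs the same \b search against three inputs. The test strings other than the question's own are invented for illustration.

```python
import re

def string_found(string1, string2):
    return re.search(r"\b" + re.escape(string1) + r"\b", string2) is not None

# A following letter is a word character, so there is no boundary after ADVANCE.
print(string_found("ADVANCE", "ADVANCED BUSINESS EQUIPMENT LTD"))  # False
# A comma is not a word character, so it ends the word.
print(string_found("ADVANCE", "ADVANCE, BUSINESS EQUIPMENT LTD"))  # True
# An underscore counts as a word character, so this is still one word.
print(string_found("ADVANCE", "ADVANCE_BUSINESS"))                 # False
```

The underscore case is the one most likely to surprise, since \w includes it.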
If spaces are the only word boundaries you care about, you could also get away with prepending and appending a space to both strings:
def string_found(string1, string2):
    string1 = " " + string1.strip() + " "
    string2 = " " + string2.strip() + " "
    # find() returns -1 (which is truthy) when there is no match, so compare explicitly.
    return string2.find(string1) != -1
The simplest and most pythonic way, I believe, is to break the strings down into individual words and scan for a match:
string = "My Name Is Josh"
substring = "Name"
for word in string.split():
    if substring == word:
        print("Match Found")
For a bonus, here's a oneliner:
any(substring == word for word in string.split())
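The word-by-word comparison above only works when the substring is a single word, whereas the question's first example ("ADDLESHAW GODDARD") has two. One way to extend the same split-based idea, sketched here under the assumption that whitespace is the only separator, is to compare runs of consecutive words. The name words_found is hypothetical.

```python
def words_found(needle, haystack):
    needle_words = needle.split()
    haystack_words = haystack.split()
    n = len(needle_words)
    # Slide a window of n words across the haystack and compare list slices.
    return any(
        haystack_words[i:i + n] == needle_words
        for i in range(len(haystack_words) - n + 1)
    )

print(words_found("ADDLESHAW GODDARD", "ADDLESHAW GODDARD LLP"))      # True
print(words_found("ADVANCE", "ADVANCED BUSINESS EQUIPMENT LTD"))      # False
```

Punctuation stays attached to words with this approach, so "GODDARD," would not equal "GODDARD".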
Here's a way to do it without a regex (as requested) assuming that you want any whitespace to serve as a word separator.
import string
def find_substring(needle, haystack):
    index = haystack.find(needle)
    if index == -1:
        return False
    if index != 0 and haystack[index-1] not in string.whitespace:
        return False
    L = index + len(needle)
    if L < len(haystack) and haystack[L] not in string.whitespace:
        return False
    return True
And here's some demo code (codepad is a great idea: Thanks to Felix Kling for reminding me)
I'm building off aaronasterling's answer.
The problem with the above code is that it will return false when there are multiple occurrences of needle in haystack, with the second occurrence satisfying the search criteria but not the first.
Here's my version:
import string
def find_substring(needle, haystack):
    search_start = 0
    while search_start < len(haystack):
        index = haystack.find(needle, search_start)
        if index == -1:
            return False
        is_prefix_whitespace = (index == 0 or haystack[index-1] in string.whitespace)
        search_start = index + len(needle)
        is_suffix_whitespace = (search_start == len(haystack) or haystack[search_start] in string.whitespace)
        if is_prefix_whitespace and is_suffix_whitespace:
            return True
    return False
One approach using the re, or regex, module that should accomplish this task is:
import re
string1 = "pizza pony"
string2 = "who knows what a pizza pony is?"
search_result = re.search(r'\b' + string1 + r'\W', string2)
print(search_result.group())
Excuse me REGEX fellows, but the simpler answer is:
text = "this is the esquisidiest piece never ever writen"
word = "is"
" {0} ".format(text).lower().count(" {0} ".format(word).lower())
The trick here is to put a space on each side of both the 'text' and the 'word' being searched for. That way only whole-word occurrences are counted, and words at the very beginning or end of the 'text' are handled without special cases.
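The space-padding trick assumes words are separated by spaces only, so "is" in "what it is!" is missed. A hedged sketch of one way around that, assuming ASCII punctuation can simply be treated as a separator, is to turn punctuation into spaces first. The names PUNCT_TO_SPACE and count_whole_word are invented for this example.

```python
import string

# Translation table that turns every ASCII punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def count_whole_word(word, text):
    # Replace punctuation with spaces, collapse runs of whitespace, then pad and count.
    cleaned = " ".join(text.lower().translate(PUNCT_TO_SPACE).split())
    return " {0} ".format(cleaned).count(" {0} ".format(word.lower()))

print(count_whole_word("is", "This is it. What it is!"))  # 2
print(" what it is! ".count(" is "))                      # 0, the unpadded punctuation case
```

Because str.count counts non-overlapping matches, immediately repeated words such as "is is" are counted once, so the result is safest to use as a presence check. Apostrophes are also turned into spaces, which splits words like "wasn't".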
Thanks to @Chris Larson's comment, I tested it and updated it as below:
import re
string1 = "massage"
string2 = "muscle massage gun"
try:
    re.search(r'\b' + string1 + r'\W', string2).group()
    print("Found word")
except AttributeError as ae:
    print("Not found")
def string_found(string1, string2):
    if string1 not in string2:
        return False
    start = string2.index(string1)
    end = start + len(string1)
    # Require a space, or the edge of string2, on each side of the first occurrence.
    before_ok = start == 0 or string2[start - 1] == " "
    after_ok = end == len(string2) or string2[end] == " "
    return before_ok and after_ok