check if string contains word in any variation - python

I'm trying to build a bad-words filter in Python (non-code answers are fine; I just need to know more about an algorithm that would work), and I need to know how to check whether a string contains a specific word in any variation.
For example, let's say my bad word array is:
['others','hello','banana']
And the String I need to check is:
Thinking alike or understanding something in a similar way with others.
For now, I loop over the array and check whether each element exists in the phrase. But what if I want to catch variations of those words, like 0th3rs or Oth3r5 for the first element? Currently I check these manually with multiple if statements, replacing a with # and so on, but that is not good enough for production code, since I cannot anticipate every possible character replacement. So I thought of something like a mapping keyed by letter, where A, for example, holds an array of its variations, and checking it dynamically against the string. Would this take too much time, since it needs to check every variation of every word? Or is this achievable and usable in a real scenario?

Have you tried using replace()?
For example:
text = "0th3rs"  # named text rather than input, to avoid shadowing the built-in input()
replace_pair = {'0': 'o', '3': 'e'}
for old, new in replace_pair.items():
    text = text.replace(old, new)
print(text)
will give you "others"
You still have to provide the replacement pairs, but that is better than a chain of "if" statements.

I cannot prevent every scenario of character replacing
That's true. However, you can handle the majority of scenarios.
I would consider declaring a mapping of replacements and their meaning:
REPLACEMENTS_DICT = {
    "#": "a",
    "4": "a",
    "3": "e",
    "0": "o",
    # ...
}
Then, before checking if a particular string is inside the bad_word_array, one should translate the string with regard to the replacement dict and then make a case-insensitive comparison:
def translate(word: str) -> str:
    return "".join(REPLACEMENTS_DICT.get(c, c) for c in word).lower()

def is_bad_word(word: str) -> bool:
    return translate(word) in BAD_WORDS
Example
BAD_WORDS = ["others", "hello", "banana"]
print(is_bad_word("0th3rs")) # True
print(is_bad_word("Oth3rs")) # True
For tokenizing the text into words you can use nltk.
import nltk
# word_tokenize needs NLTK's tokenizer data to be downloaded beforehand
sentence = "Thinking alike or understanding something in a similar way with others."
words = nltk.word_tokenize(sentence)
for word in words:
    if is_bad_word(word):
        print(word)  # prints: others
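As a minimal sketch of the same idea applied to a whole sentence without NLTK, the snippet below normalises each token with str.translate and tests it against a set. The extra pairs in LEET_TABLE ("@", "5", "1"), the punctuation set used for splitting, and the names normalize and find_bad_words are assumptions for illustration, not part of the answer above.

```python
import re

# Hypothetical substitution table; extend it with whatever variants you see in practice.
LEET_TABLE = str.maketrans(
    {"#": "a", "4": "a", "@": "a", "3": "e", "0": "o", "5": "s", "1": "i"}
)
BAD_WORDS = {"others", "hello", "banana"}


def normalize(token):
    """Lower-case a token and undo the known character substitutions."""
    return token.lower().translate(LEET_TABLE)


def find_bad_words(text):
    """Return the tokens of text whose normalised form is a bad word."""
    # Split on whitespace and common punctuation, but keep symbols such as '#' and '@'.
    tokens = re.findall(r"[^\s.,;:!?]+", text)
    return [token for token in tokens if normalize(token) in BAD_WORDS]


print(find_bad_words("Thinking alike with 0th3rs."))
```

Because the table is applied once per token and the lookup is a set membership test, the cost grows with the length of the text rather than with the number of variations.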

Can't you just extend your list of bad words to contain the different variations?
bad_words = ["others", "0th3rs", "banana"]
text = "this is the text about bananas and 0th3rs"
for word in bad_words:
    if word in text:
        text = text.replace(word, "*flowers*")
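The idea from the question, mapping each letter to its known variations and checking dynamically, can be sketched with a generated regular expression instead of an enumerated list. This is only one possible approach under stated assumptions: the VARIANTS table and the helper names word_pattern and compile_filter are hypothetical, and the variants listed are examples rather than a complete set.

```python
import re

# Hypothetical table: each letter maps to the characters that may stand in for it.
VARIANTS = {"a": "a4@#", "e": "e3", "o": "o0", "s": "s5", "i": "i1!"}


def word_pattern(word):
    """Turn a word into a regex where every letter is a class of its variants."""
    parts = []
    for ch in word.lower():
        chars = VARIANTS.get(ch, ch)
        parts.append("[" + re.escape(chars) + "]")
    return "".join(parts)


def compile_filter(bad_words):
    """Compile one case-insensitive pattern that matches any bad word as a whole word."""
    alternatives = "|".join(word_pattern(w) for w in bad_words)
    # Lookarounds are used instead of \b so that words starting with '#' or '@' still match.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


bad_word_filter = compile_filter(["others", "hello", "banana"])
print(bad_word_filter.findall("similar way with 0th3rs, Oth3r5 and b#n4na"))
```

All words are compiled into a single pattern once, so each string is scanned a single time regardless of how many variations each letter has.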

Related

Checking if any word in a string appears in a list using python

I have a pandas dataframe that contains a column of several thousands of comments. I would like to iterate through every row in the column, check to see if the comment contains any word found in a list of words I've created, and if the comment contains a word from my list I want to label it as such in a separate column. This is what I have so far in my code:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
def word_checker(row):
    for sentence in df['comments']:
        if any(word in re.findall(r'\w+', sentence.lower()) for word in retirement_words_list):
            return '401k/Retirement'
        else:
            return 'Other'

df['topic'] = df.apply(word_checker,axis=1)
The code is labeling every single comment in my dataframe as 'Other' even though I have double-checked that many comments contain one or several of the words from my list. Any ideas for how I may correct my code? I'd greatly appreciate your help.
It is probably more convenient to have a set version of retirement_words_list (for efficient inclusion testing) and then loop over the words in the sentence, checking inclusion in this set, rather than the other way round:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
retirement_words_set = set(retirement_words_list)
and then
if any(word in retirement_words_set for word in sentence.lower().split()):
    # .... etc ....
Testing whether any word in retirement_words_list is a substring of the sentence would not be enough: you must be looking for whole-word matches, or it wouldn't make sense to include 'matching' and 'retirement' on the list given that 'match' and 'retire' are already included. Hence the use of split, and the reason why we can then also reverse the logic. Note that split() leaves punctuation attached to words (for example 'retire.'), which the re.findall(r'\w+', ...) call from your code avoids.
NOTE: You may need some further changes because your function word_checker has a parameter called row which it does not use. Possibly what you meant to do was something like:
def word_checker(sentence):
    if any(word in retirement_words_set for word in sentence.lower().split()):
        return '401k/Retirement'
    else:
        return 'Other'
and:
df['topic'] = df['comments'].apply(word_checker)
where sentence is the contents of each row from the comments column.
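Putting the two points together (a per-comment function and whole-word matching against a set), a minimal sketch might look like the following. The name label_comment and the use of str() to tolerate missing values are assumptions for illustration; the function itself has no pandas dependency, so it can be tried on plain strings.

```python
import re

RETIREMENT_WORDS = frozenset(
    ['match', 'matching', '401k', 'retirement', 'retire', 'rsu', 'rrsp']
)


def label_comment(comment):
    """Label one comment by whole-word membership in RETIREMENT_WORDS."""
    # str() keeps the function from failing on non-string cells such as NaN.
    tokens = re.findall(r"\w+", str(comment).lower())
    if not RETIREMENT_WORDS.isdisjoint(tokens):
        return '401k/Retirement'
    return 'Other'


print(label_comment("Does the company match my 401k?"))
print(label_comment("Great matchbox collection"))
```

With a dataframe like the one in the question, this would be applied per comment as df['topic'] = df['comments'].apply(label_comment).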
Wouldn't this simplified version (without regex) work?
if any(word in sentence.lower() for word in retirement_words_list):

Given string and (list) of words, return words that contain string (optimal algorithm)

Let's say we have a list of unique words and a substring.
I am looking for an optimal algorithm, that returns words containing the substring.
The general application is: Given a database use search bar to filter the results.
A simple implementation in Python:
def search_bar(words, substring):
    ret = []
    for word in words:
        if substring in word:
            ret.append(word)
    return ret

words = ["abc", "bcd", "thon", "Python"]
substring = "on"
search_bar(words, substring)
this would return:
["thon", "Python"]
in time O(length_of_list * complexity_of_in), where complexity_of_in depends in some way on the length of the substring and the length of the individual words.
What I am asking is whether there is a faster implementation, given that we can preprocess the list into any structure we want.
Just redirection to the problem/answer would be amazing.
Note: It would be better if such structure doesn't take too long to add a new word. But primarily it doesn't have to be able to add anything, as the Python example doesn't.
Also, I am not sure about the tags with this question...
Maybe use
word.find(substring) != -1
instead of
substring in word
although both perform the same substring search, so this does not change the complexity. As a variant:
def search_bar(words, substring):
    return list(filter(lambda word: word.find(substring) != -1, words))
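Since the question allows preprocessing, one commonly used option is a k-gram inverted index: every length-k slice of every word points back to the words containing it, so a query only has to verify a small candidate set. The sketch below is an illustration under assumptions, not an answer from this thread; the class name KGramIndex and the default k=3 are hypothetical, and queries shorter than k fall back to the linear scan. Structures such as a generalized suffix tree or suffix array are another direction to look into if worst-case guarantees matter.

```python
from collections import defaultdict


class KGramIndex:
    """Inverted index from k-character slices to the words that contain them."""

    def __init__(self, words=(), k=3):
        self.k = k
        self.words = []
        self.index = defaultdict(set)
        for word in words:
            self.add(word)

    def _grams(self, s):
        return {s[i:i + self.k] for i in range(len(s) - self.k + 1)}

    def add(self, word):
        """Adding a word costs time proportional to its length."""
        word_id = len(self.words)
        self.words.append(word)
        for gram in self._grams(word):
            self.index[gram].add(word_id)

    def search(self, substring):
        grams = self._grams(substring)
        if not grams:
            # Query shorter than k: no k-grams to look up, so scan everything.
            candidates = range(len(self.words))
        else:
            postings = [self.index.get(gram, set()) for gram in grams]
            candidates = sorted(set.intersection(*postings))
        # Sharing all k-grams is necessary but not sufficient, so verify each candidate.
        return [self.words[i] for i in candidates if substring in self.words[i]]


index = KGramIndex(["abc", "bcd", "thon", "Python"])
print(index.search("tho"))
```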

Python - Capture string with or without specific character

I am trying to capture the part of a sentence that follows a specific word. Each sentence in my data is different, and a sentence doesn't necessarily contain the word to split by. If the word doesn't appear, I just need something like an empty string or list.
Example 1: working
my_string="Python is a amazing programming language"
print(my_string.split("amazing",1)[1])
programming language
Example 2:
my_string="Java is also a programming language."
print(my_string.split("amazing",1)[1]) # amazing word doesn't appear in the sentence.
Error: IndexError: list index out of range
Output needed: an empty string or list, etc.
I tried something like this, but it still fails.
my_string.split("amazing",1)[1] if my_string.split("amazing",1)[1] == None else my_string.split("amazing",1)[1]
The .split() method returns a list, and you can choose which part of it to use with an integer index or a slice. If you want to check whether a specific word is in your string, you can do something like this:
my_str = "Python is cool"
my_str_list = my_str.split()
if 'cool' in my_str_list:
    print(my_str)
output:
Python is cool
Otherwise, you can run a for loop over a list of strings to check whether the word appears in any of them.
You have some options here. You can split and check the result:
tmp = my_string.split("amazing", 1)
result = tmp[1] if len(tmp) > 1 else ''
Or you can check for containment up front:
result = my_string.split("amazing", 1)[1] if 'amazing' in my_string else ''
The first option is more efficient if most of the sentences have matches, the second one if most don't.
Another option similar to the first is
result = my_string.split("amazing", 1)[-1]
if result == my_string:
    result = ''
In all cases, consider doing something equivalent to
result = result.lstrip()
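A further option, not mentioned in the answers here, is str.partition, which always returns three parts and leaves the last one empty when the separator is missing, so no length check or IndexError handling is needed. The helper name text_after is hypothetical.

```python
def text_after(sentence, word):
    """Return the text following the first occurrence of word, or '' if absent."""
    # partition returns (before, separator, after); 'after' is '' when word is missing.
    _, _, tail = sentence.partition(word)
    return tail.lstrip()


print(repr(text_after("Python is a amazing programming language", "amazing")))
print(repr(text_after("Java is also a programming language.", "amazing")))
```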
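Instead of index 1, use index -1, which selects the last item in the list and never raises IndexError:
my_string="Java is also a programming language."
print(my_string.split("amazing",1)[-1])
When the word is absent, this prints the whole string, 'Java is also a programming language.', so compare the result with my_string if you need an empty string instead.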

How to print random items from a dictionary?

Hi everyone. I'm trying to complete a basic assignment. The program should allow a user to type in a phrase. If the phrase contains the word "happy" or "sad", that word should then be randomly replaced by a synonym (stored in a dictionary). The new phrase should then be printed out. What am I doing wrong? Every time I try to run it, the program crashes. This is the error I get:
0_part1.py", line 13, in <module>
phrase["happy"] = random.choice(thesaurus["happy"])
TypeError: 'str' object does not support item assignment
Here is what I have so far:
import random
thesaurus = {
    "happy": ["glad", "blissful", "ecstatic", "at ease"],
    "sad": ["bleak", "blue", "depressed"]
}
phrase = input("Enter a phrase: ")
phrase2 = phrase.split(' ')
if "happy" in phrase:
    phrase["happy"] = random.choice(thesaurus["happy"])
if "sad" in phrase:
    phrase["sad"] = random.choice(thesaurus["sad"])
print(phrase)
The reason for your error is that phrase is a string, and strings are immutable. On top of that, strings are sequences, not mappings; you can index them or slice them (e.g., happy_index = phrase.find("happy"); phrase[happy_index:happy_index+len("happy")]), but you can't use them like dictionaries.
If you want to create a new string, replacing the substring happy with another word, use the replace method.
And there's no reason to check first; if happy isn't found, replace will do nothing.
So:
phrase = phrase.replace("happy", random.choice(thesaurus["happy"]))
While we're at it, instead of explicitly looking up each key, you may want to loop over the dictionary and apply all the synonyms:
for key, replacements in thesaurus.items():
    phrase = phrase.replace(key, random.choice(replacements))
Finally, notice that this code will replace all instances of happy with the same replacement, which I think your intended code was also trying to do. If you want to replace each of them with a separately, randomly chosen synonym, that's a bit more complicated. You could loop over phrase.find("happy", offset) until it returns -1, but a neat trick might make it simpler: split the string around each instance of happy, append a different synonym to every part except the last, then join them all back together. Like this:
parts = phrase.split("happy")
parts[:-1] = [part + random.choice(thesaurus["happy"]) for part in parts[:-1]]
phrase = ''.join(parts)
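An alternative sketch for the per-occurrence case uses re.sub with a replacement function, which is called once per match and can therefore pick a fresh synonym each time. Unlike str.replace, this version matches whole words only (so "unhappy" is left alone) and is case-sensitive; both of those choices, and the name replace_with_synonyms, are assumptions rather than something the answer above prescribes.

```python
import random
import re

thesaurus = {
    "happy": ["glad", "blissful", "ecstatic", "at ease"],
    "sad": ["bleak", "blue", "depressed"],
}

# One pattern for all keys; \b restricts matches to whole words.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, thesaurus)) + r")\b")


def replace_with_synonyms(phrase):
    """Replace every thesaurus key with an independently chosen synonym."""
    return pattern.sub(lambda m: random.choice(thesaurus[m.group(0)]), phrase)


print(replace_with_synonyms("I was happy, then sad, then happy again"))
```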
Generate a random integer from 0 to len(list_name) - 1, then access that index of the list. To get the length of a list, just use len(list_name).

finding strings in one list and based on what it says, replacing it with strings from another list

basically I have a user enter a sentence
eg. "hello, how are you?"
and from a large list it replaces "are" with "am" and "you" with "I". to return:
"hello, how am i?"
problem is i have no idea how to do this.
so my list looks a bit like reflections = [["I", "you"], ["are", "am"]] ---> etc.
and so far i've got some code which collects raw input from the user and then calls this function to reply to it.
def reflects_users_string(reply):
    reply_list = reply.split()
    for _ in reply_list:
        if ????
            ????
            ????
        else:
            print("i don't understand")
from what I understand (noob here) it turns the user's input into a list and then compares each item in that list with items in the "reflections" list, then it replaces the identical string in one list with the string next to it, eg. "are" with "am"
I've been playing with all sorts of ways to do this but just can't seem to figure it out
Try learning to use list comprehensions; they are a powerful way to filter and transform lists while iterating.
Let's try to solve your problem with list comprehensions:
# First we need to create mappings in a dict for your reflections
reflect = {
    'you': 'I',
    'are': 'am'
}
# After we read user input
user_input = 'hello, how are you ?'
# Now look how we can replace all words in user_input from reflect with one line
reflected = [word for word in [reflect.get(key, key) for key in user_input.split()]]
print(' '.join(reflected))
Let's analyse the list comprehension:
First, we split the user input into a list: user_input.split()
Then we iterate through the user input words: for key in user_input.split()
For each word, we query the reflect dict. reflect.get(key, key) looks up key and, if it isn't found, returns key itself as the default value.
Finally, we wrap all of this in [word for word in [...]], where the inner list holds the reflected words (or the original word when it has no reflection). This outer wrapper only copies the inner list, so the inner comprehension alone gives the same result.
And voilà!
It's a good start so far. As for the next step, make a big dict of all the word mappings, look up each word in that dict, and replace it if it has a replacement.
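A minimal sketch of that dict-lookup step, assuming a small hypothetical REFLECTIONS table: it uses re.sub so that punctuation stays in place ("you?" still matches "you", which a plain split() would miss), and it makes a single pass so a word that has just been swapped is not swapped back. The function name reflect_sentence and the extra "i" and "am" entries are illustrative assumptions.

```python
import re

# Hypothetical mapping; a real one would be much larger.
REFLECTIONS = {"you": "I", "are": "am", "i": "you", "am": "are"}


def reflect_sentence(sentence):
    """Swap each word that has a reflection, leaving punctuation untouched."""
    def swap(match):
        word = match.group(0)
        # Look up case-insensitively; fall back to the original word.
        return REFLECTIONS.get(word.lower(), word)

    return re.sub(r"[A-Za-z']+", swap, sentence)


print(reflect_sentence("hello, how are you?"))
```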
