Using Python 3, I have a list of words like:
['foot', 'stool', 'carpet']
These lists vary in length from 1 to 6 or so. I have thousands and thousands of strings to check, and I need to make sure that all three words are present in a title, where:
'carpet stand upon the stool of foot balls.'
is a correct match, as all the words are present here, even though they are out of order.
I've wondered about this for a long time, and the only thing I could think of was some sort of iteration like:
for word in list: if word in title: match!
but this gives me results like 'carpet cleaner', which is incorrect. I feel as though there is some sort of shortcut to do this, but I can't seem to figure it out without using excessive list(), continue, break, or other methods/terminology that I'm not yet familiar with.
You can use all():
words = ['foot', 'stool', 'carpet']
title = "carpet stand upon the stool of foot balls."
matches = all(word in title for word in words)
Or, invert the logic with not any() and not in:
matches = not any(word not in title for word in words)
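Note that word in title is a substring test, so 'foot' would also count as present in a title that only contains 'football'. If you need whole-word matches, here is a minimal sketch (assuming the titles can be tokenized by splitting on whitespace; punctuation attached to words would need extra handling):

words = ['foot', 'stool', 'carpet']
title = "carpet stand upon the stool of foot balls."

# Compare against the set of whitespace-separated tokens instead of the raw
# string, so 'foot' does not match 'football'.
title_tokens = set(title.lower().split())
matches = all(word in title_tokens for word in words)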
I am reading a badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers like here, which do not work in a sentence.)
Edit: About the impossibility mentioned, I agree. One option is to preserve all possible options. In my case this edited text will be matched with another database that has the original (clean) text. This way, any wrong removal of spaces just gets tossed away.
You could use the PyEnchant package to check against a list of English words. I will assume that two adjacent tokens which have no meaning on their own, but do have meaning together, form one word, and use the following code to find words that were split by a single space:
import enchant

text = "int ernational trade is not good for economies"
fixed_text = []
d = enchant.Dict("en_US")

for i in range(len(words := text.split())):
    if fixed_text and not d.check(words[i]) and d.check(compound_word := ''.join([fixed_text[-1], words[i]])):
        fixed_text[-1] = compound_word
    else:
        fixed_text.append(words[i])

print(' '.join(fixed_text))
This will split the text on spaces and append words to fixed_text. When it finds a word that is not in the dictionary, but joining it onto the previously added word does make a valid word, it sticks those two words together.
This should help sanitize most of the invalid words, but as the comments mentioned it is sometimes impossible to find out if two words belong together without performing some sort of lexical analysis.
As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces in words by compounding previously added words which are not in the dictionary. However, since a lot of smaller words are valid in the English language, many spaced out words don't correctly concatenate.
import enchant

text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []
d = enchant.Dict("en_US")

for i in range(len(words := text.split())):
    if fixed_text and not d.check(compound_word := words[i]):
        for j, pending_word in enumerate(fixed_text[::-1], 1):
            if not d.check(pending_word) and d.check(compound_word := ''.join([pending_word, compound_word])):
                del fixed_text[-j:]
                fixed_text.append(compound_word)
                break
        else:
            fixed_text.append(words[i])
    else:
        fixed_text.append(words[i])

print(' '.join(fixed_text))
I'm facing this issue:
I need to remove duplications from the beginning of each word of a text, but only if all words in the text are duplicated. (And capitalize the result afterwards.)
Examples:
text = str("Thethe cacar isis momoving vvery fasfast")
So this text should be treated and printed as:
output:
"The car is moving very fast"
I've got these to treat the text:
import re

phrase = str("Thethe cacar isis momoving vvery fasfast")
phrase_up = phrase.upper()
text = re.sub(r'(.+?)\1+', r'\1', phrase_up)
text_cap = text.capitalize()
"The car is moving very fast"
Or:
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)
    return ''.join(sorted_letters)

words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
What I can't work out is HOW to determine whether a text needs this treatment.
Because if we get a text such as:
"This meme is funny, said Barbara"
Here, even though "meme" and "Barbara" (ar - ar) contain repeating substrings, not all of the words do, so this text shouldn't be treated.
Any pointers here?
I would suggest you adopt a solution that checks whether a word is legal, using something like what is described in this post's best answer. If the word is not an English word, then you should use the regex.
For example, a word like meme should be in the English dictionary, so you should not check it for repetitions.
So I would first split the string on spaces in order to get the tokens, then check whether each token is an English word. If it is, skip the regex check; otherwise, check it for repetitions.
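A minimal sketch of that idea, reusing the repetition regex from the question and assuming a PyEnchant en_US dictionary as the word list (punctuation attached to tokens and proper nouns would need extra handling):

import re
import enchant

d = enchant.Dict("en_US")

def clean_word(word):
    # Legal English words (e.g. "meme") are left untouched.
    if d.check(word):
        return word
    # Otherwise collapse repeated substrings, e.g. "Thethe" -> "The".
    return re.sub(r'(.+?)\1+', r'\1', word, flags=re.IGNORECASE)

phrase = "Thethe cacar isis momoving vvery fasfast"
print(' '.join(clean_word(w) for w in phrase.split()))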
I have a pandas dataframe that contains a column of several thousands of comments. I would like to iterate through every row in the column, check to see if the comment contains any word found in a list of words I've created, and if the comment contains a word from my list I want to label it as such in a separate column. This is what I have so far in my code:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
def word_checker(row):
    for sentence in df['comments']:
        if any(word in re.findall(r'\w+', sentence.lower()) for word in retirement_words_list):
            return '401k/Retirement'
        else:
            return 'Other'

df['topic'] = df.apply(word_checker,axis=1)
The code is labeling every single comment in my dataframe as 'Other' even though I have double-checked that many comments contain one or several of the words from my list. Any ideas for how I may correct my code? I'd greatly appreciate your help.
It's probably more convenient to have a set version of retirement_words_list (for efficient inclusion testing) and then loop over the words in the sentence, checking inclusion in this set, rather than the other way round:
retirement_words_list = ['match','matching','401k','retirement','retire','rsu','rrsp']
retirement_words_set = set(retirement_words_list)
and then
if any(word in retirement_words_set for word in sentence.lower().split()):
# .... etc ....
You must be looking for whole-word matches rather than substring matches, or it wouldn't make sense to include 'matching' and 'retirement' in the list given that 'match' and 'retire' are already included. Hence the use of split -- and the reason why we can then also reverse the logic.
NOTE: You will need some further changes, because your function word_checker takes a row parameter that it never uses; instead it loops over the whole comments column and returns on the very first sentence, which is why every row can end up with the same label. Possibly what you meant to do was something like:
def word_checker(sentence):
    if any(word in retirement_words_set for word in sentence.lower().split()):
        return '401k/Retirement'
    else:
        return 'Other'
and:
df['topic'] = df['comments'].apply(word_checker)
where sentence is the contents of each row from the comments column.
Wouldn't this simplified version (without regex) work?
if any(word in sentence.lower() for word in retirement_words_list):
I have a word-list library and a text containing spelling errors (typos), and I want to correct each misspelled word according to the list.
For example, in the list of words:
listOfWord = [..., "halo", "saya", "sedangkan", "semangat", "cemooh", ...]
This is my string:
string = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
I want to change the spelling errors so the string becomes:
string = "halo saya sedangkan cemooh, semangat semoga menyenangkan"
What is the best algorithm for checking each word against the list, given that I have millions of words in the list and many possible misspellings?
It depends on how your data is stored, but you'll probably want to use a pattern matching algorithm like Aho–Corasick. Of course, that assumes your input data structure is a Trie. A Trie is a very space-efficient storage container for words that may also be of interest to you (again, depending on your environment).
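As a rough illustration of the trie idea (a minimal dict-of-dicts sketch rather than a full Aho–Corasick automaton), assuming the word list fits in memory:

def build_trie(words):
    # Nested dicts; the '$' key marks the end of a complete word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

def contains(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

trie = build_trie(["halo", "saya", "sedangkan", "semangat", "cemooh"])
print(contains(trie, "halo"), contains(trie, "haaallllllooo"))  # True False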
You can use difflib's get_close_matches, though it is not that efficient.
words = ["halo","saya","sedangkan","semangat","cemooh"]
def get_exact_words(input_str):
exact_words = difflib.get_close_matches(input_str,words,n=1,cutoff=0.7)
if len(exact_words)>0:
return exact_words[0]
else:
return input_str
string = "haaallllllooo ssya sdngkan ceemoooh , smngat semoga menyenangkan"
string = string.split(' ')
exact = [get_exact_words(word) for word in string]
exact = ' '.join(exact)
print(exact)
Output (with difflib):
haaallllllooo saya sedangkan cemooh , semangat semangat menyenangkan
I am assuming you are writing a spell checker for some language.
You might want to tokenize the sentence into words.
Then shorten words like haaallllllooo to haalloo, assuming the language you're working with doesn't often have words with many repeated letters. That's easy to check, since you have the dictionary.
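For that shortening step, one possibility is a small regex that collapses runs of repeated letters down to at most two (a sketch; the dictionary check mentioned above would decide whether to collapse further to a single letter):

import re

def shorten(word):
    # Collapse any run of a repeated character to at most two occurrences,
    # e.g. "haaallllllooo" -> "haalloo".
    return re.sub(r'(.)\1+', r'\1\1', word)

print(shorten("haaallllllooo"))  # haalloo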
Then you can use this algorithm/implementation by Peter Norvig. All you have to do is replace his dictionary of correct words with your own dictionary.
You can use hashing techniques to check for the correct pattern, something along the lines of the Rabin-Karp algorithm.
You know what the hash values of the original strings in your list would be. For spell correction, you can try combinations of the misspelled words that give the same hash value before matching them against the original strings in the dictionary. This would still require iterating through all the characters of the misspelled text, but only once, so it should be efficient.
You can use pyenchant to check spelling with your list of words.
>>> import enchant
>>> d = enchant.request_pwl_dict("mywords.txt")
>>> d.check('helo')
False
>>> d.suggest("Helo")
['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
You need to split your string into words and check them one by one; if check() returns False, replace the word with the first suggestion.
There are more advanced features in the tutorial here:
http://pyenchant.readthedocs.io/en/latest/tutorial.html
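A minimal sketch of that split-check-suggest loop, assuming mywords.txt holds your word list (one word per line) and that simply taking the first suggestion is acceptable:

import enchant

d = enchant.request_pwl_dict("mywords.txt")

def correct(text):
    fixed = []
    for word in text.split():
        if d.check(word):
            fixed.append(word)
        else:
            suggestions = d.suggest(word)
            # Fall back to the original word if there is no suggestion.
            fixed.append(suggestions[0] if suggestions else word)
    return ' '.join(fixed)

print(correct("haaallllllooo ssya sdngkan ceemoooh smngat"))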
I think you should apply string distance algorithms to find the nearest word for each misspelled token. Those are mostly O(n) algorithms, so in the end the sentence replacement would cost you O(n) at most.
I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.
# Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

# Change '!' and '?' to '.'
for ch in ['!', '?']:
    if ch in pnp:
        pnp = pnp.replace(ch, ".")

# Remove the period after Dr., Mr., Mrs. (choosing not to include etc., as that
# often ends a sentence although it can also be in the middle)
pnp = pnp.replace("Dr.", "Dr")
pnp = pnp.replace("Mr.", "Mr")
pnp = pnp.replace("Mrs.", "Mrs")
To split a string into a list of strings on some character:
pnp = pnp.split('.')
Then we can split each of those sentences into a list of strings (words)
pnp = [sentence.split() for sentence in pnp]
Then we get the number of words in each sentence
pnp = [len(sentence) for sentence in pnp]
Then we can use statistics.mean to calculate the mean:
statistics.mean(pnp)
To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
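Putting those steps together, starting again from the pnp string (a minimal sketch; splitting on '.' usually leaves an empty trailing entry, which is filtered out here so it doesn't drag the mean down):

import statistics

sentences = pnp.split('.')
# Ignore empty entries produced by a trailing period or doubled periods.
words_per_sentence = [len(s.split()) for s in sentences if s.split()]
print(statistics.mean(words_per_sentence))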
You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.
Thus,
pnp.split('.')
is going to give you a list of all sentences. Once you have that list, for each sentence in the list,
sentence.split() # i.e., split according to whitespace by default
will give you a list of words in the sentence.
Is that enough of a start?
You can try the code below.
numbers_per_sentence = [len(element) for element in (element.split() for element in pnp.split("."))]
mean = sum(numbers_per_sentence)/len(numbers_per_sentence)
However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing periods after "Dr.", "Mr." and "Mrs.") is probably not enough to be 100% sure that a period is always a sentence separator and that there are no other sentence separators in your text (even if that happens to be true for Pride and Prejudice).
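A minimal sketch of the NLTK route, assuming the nltk package is installed and its punkt sentence tokenizer data has been downloaded:

import statistics
import nltk

# nltk.download('punkt')  # one-time download of the sentence tokenizer model

with open("PrideAndPrejudice.txt") as pride_file:
    text = pride_file.read()

sentences = nltk.sent_tokenize(text)
# Note: word_tokenize also counts punctuation tokens, so these counts run a
# little higher than a plain split() would give.
words_per_sentence = [len(nltk.word_tokenize(s)) for s in sentences]
print(statistics.mean(words_per_sentence))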