Elongated word check in sentence [closed] - python

Closed 9 years ago.
I want to check whether a sentence contains elongated words, for example soooo, toooo, thaaatttt, etc. I don't know what the user might type: I have a list of sentences which may or may not contain elongated words. How do I check for that in Python? I am new to Python.

Try this:
import re
s1 = "This has no long words"
s2 = "This has oooone long word"
# A letter followed by two or more copies of itself, i.e. three or more in a row
elong = re.compile(r"([a-zA-Z])\1{2,}")
def has_long(sentence):
    return bool(elong.search(sentence))
print(has_long(s1))
False
print(has_long(s2))
True

@HughBothwell had a good idea. As far as I know, there is not a single English word in which the same letter repeats three consecutive times. So, you can search for words that do this:
>>> from re import search
>>> mystr = "word word soooo word tooo thaaatttt word"
>>> [x for x in mystr.split() if search(r'(?i)([a-z])\1\1+', x)]
['soooo', 'tooo', 'thaaatttt']
Any you find will be elongated words.
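A minimal sketch of the same idea that also copes with punctuation attached to a word (split() would keep "doooing?" with its question mark). The names ELONGATED and find_elongated are made up for this example.

```python
import re

# Three or more of the same letter in a row, ignoring case.
ELONGATED = re.compile(r"([a-z])\1{2,}", re.IGNORECASE)

def find_elongated(sentence):
    """Return the alphabetic tokens of `sentence` that contain a run of 3+ identical letters."""
    tokens = re.findall(r"[A-Za-z]+", sentence)  # drops punctuation and digits
    return [token for token in tokens if ELONGATED.search(token)]

print(find_elongated("Hooow arrre you doooing?"))
# ['Hooow', 'arrre', 'doooing']
```

This treats any tripled letter as an elongation, so it relies on the assumption above that no ordinary English word contains one.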

Well, you could make a list of every elongated word you expect to see, then loop through the words in the sentence and compare each one against that list.
sentence = "Hoow arre you doing?"
elongated = ["hoow", "arre", "youu", "yoou", "meee"]  # You will need a much larger list
for word in sentence.split():  # split() gives words; iterating the string gives single characters
    word = word.lower()
    for e_word in elongated:
        if e_word == word:
            print("Found an elongated word!")
If you wanted to do what Hugh Bothwell said, then:
sentence = "Hooow arrre you doooing?"
elongations = ["aaa", "ooo", "rrr", "bbb", "ccc"]  # continue for all the letters
for word in sentence.split():
    for x in elongations:
        if x in word.lower():
            print('"' + word + '" is an elongated word')
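Rather than typing out the triple for every letter by hand, the list can be generated. This is only a sketch; the function name elongated_words is invented for the example.

```python
import string

# One triple per letter: "aaa", "bbb", ..., "zzz".
elongations = [letter * 3 for letter in string.ascii_lowercase]

def elongated_words(sentence):
    """Return the words of `sentence` that contain any tripled letter."""
    return [
        word for word in sentence.split()
        if any(triple in word.lower() for triple in elongations)
    ]

print(elongated_words("Hooow arrre you doooing?"))
# ['Hooow', 'arrre', 'doooing?']
```

Because the words come from split(), trailing punctuation stays attached to the word.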

You need to have a reference list of valid English words available. On *NIX systems, you could use /usr/share/dict/words or an equivalent file and store all the words in a set object.
Then, you'll want to check, for every word in a sentence,
That the word is not itself a valid word (i.e., word not in all_words); and
That, when you shorten all consecutive sequences to one or two letters, the new word is a valid word.
Here's one way you might try to extract all of the possibilities:
import re
regex = re.compile(r'(\w)\1{2,}')  # a run of three or more identical characters
all_words = set(get_all_words())  # get_all_words() is a placeholder: load your word list here
def without_elongations(word):
    # Base case: no run of three or more identical characters is left.
    if regex.search(word) is None:
        return [word]
    # Shorten the first run to one letter and to two letters, then recurse on both.
    replacing_with_one_letter = regex.sub(r'\1', word, count=1)
    replacing_with_two_letters = regex.sub(r'\1\1', word, count=1)
    return (without_elongations(replacing_with_one_letter)
            + without_elongations(replacing_with_two_letters))
for word in sentence.split():  # sentence is the string you want to check
    if word not in all_words:
        if any(w in all_words for w in without_elongations(word)):
            print('%s is elongated' % word)
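Here is a self-contained sketch of the two checks described above. The tiny all_words set is a hypothetical stand-in for a real word list, and the names RUN, candidates and is_elongated are invented for the example.

```python
import re

RUN = re.compile(r"(\w)\1{2,}")  # a run of three or more identical characters

# Hypothetical stand-in for a real word list such as /usr/share/dict/words.
all_words = {"so", "too", "that", "word", "good"}

def candidates(word):
    """Yield every spelling obtained by shrinking each long run to one or two letters."""
    match = RUN.search(word)
    if match is None:
        yield word
        return
    letter = match.group(1)
    head, tail = word[:match.start()], word[match.end():]
    for rest in candidates(tail):
        yield head + letter + rest
        yield head + letter * 2 + rest

def is_elongated(word):
    """True if `word` is not a known word but one of its shortened spellings is."""
    word = word.lower()
    return word not in all_words and any(c in all_words for c in candidates(word))

for w in "word soooo thaaatttt".split():
    print(w, is_elongated(w))
```

The number of candidates doubles with every long run in the word, which is fine for ordinary text but worth keeping in mind for adversarial input.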

Related

How to determine if all substrings in a string contains duplicates?

I'm facing this issue:
I need to remove duplications from the beginning of each word of a text, but only if
all words in the text are duplicated. (And capitalized after)
Examples:
text = str("Thethe cacar isis momoving vvery fasfast")
So this text should be treated and printed as:
output:
"The car is moving very fast"
I got these to treat the text:
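import re
phrase = "Thethe cacar isis momoving vvery fasfast"
phrase_up = phrase.upper()  # upper-case first so that "The" and "the" compare equal
text = re.sub(r'(.+?)\1+', r'\1', phrase_up)  # collapse any immediately repeated substring
text_cap = text.capitalize()
# text_cap is "The car is moving very fast"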
Or:
def remove_duplicates(word):
    unique_letters = set(word)
    sorted_letters = sorted(unique_letters, key=word.index)
    return ''.join(sorted_letters)
words = phrase.split(' ')
new_phrase = ' '.join(remove_duplicates(word) for word in words)
What I can't work out is HOW to determine whether a text needs this treatment.
Because if we get a text such as:
"This meme is funny, said Barbara"
Here "meme" and "Barbara" (ar - ar) contain repeating substrings, but not every word does, so this text shouldn't be treated.
Any pointers here?
I would suggest adopting a solution that checks whether a word is legal, using something like what is described in this post's best answer. If the word is not an English word, then you should use the regex.
For example, a word like meme should be in the English dictionary, so you should not check it for repetitions.
So I would first split the string on spaces to get the tokens, then check whether each token is an English word. If it is, skip the regex check. Otherwise, check for repetitions.
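As a rough sketch of the rule stated in the question (treat the text only if every word starts with a doubled substring), something like the following could work. It does not use a dictionary, so it is not the word-list approach suggested in the answer; the names STUTTER, needs_treatment and treat are invented here.

```python
import re

# A word "stutters" when it begins with some substring that is immediately
# repeated, e.g. "Thethe" or "cacar". IGNORECASE lets "The" match "the".
STUTTER = re.compile(r"(\w+?)\1", re.IGNORECASE)

def needs_treatment(text):
    """True only if every word in `text` starts with a doubled substring."""
    words = re.findall(r"\w+", text)
    return bool(words) and all(STUTTER.match(word) for word in words)

def treat(text):
    """Remove the doubled prefix from each word, but only when all words have one."""
    if not needs_treatment(text):
        return text
    cleaned = re.sub(r"\b(\w+?)\1", r"\1", text, flags=re.IGNORECASE)
    return cleaned.capitalize()

print(treat("Thethe cacar isis momoving vvery fasfast"))
print(treat("This meme is funny, said Barbara"))
```

A text made up entirely of words like "meme" or "papa" would still be treated by this rule, which is where a dictionary check becomes useful.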

how can I write a program in python that compares a given word with a text?

I need to write a function in Python that compares an input word with a text and gives me the word 'closest' (with the smallest distance) to the input word.
By distance I mean, for example, that the words 'the' and 'to' have distance 2, because I need to change 2 letters.
(PS: I just started learning to code, so I don't really know much about this.)
These two pieces of code are what I already have; I've tested them and they work.
This gets the words out of the text:
import re
def get_words():
    return re.findall(r'\w+', open('big.txt').read().lower())
Then I wrote this to put the words in a dictionary with the number of times each word occurs:
d = dict()
for token in get_words():
    if token in d:
        d[token] += 1
    else:
        d[token] = 1
for key in d:
    if d[key] > 5:
        print(key, d[key])
My problem starts with comparing the word with the text.
This is what I have:
if distance(word_dict, word) <= 1:
    word = input('give a word')
    return (word_dict)
else:
    return ('no match')
You need to rate every word in your text: first split the text into words, then rate each one by its proximity to the given word.
Now all you need to do is output the best-rated word.
You still have to figure out what to do if two words have the same rating, but that is the basic logic.
Good luck
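The logic above can be sketched as follows, assuming "distance" means the Levenshtein edit distance (insertions, deletions and substitutions), which gives 2 for 'the' and 'to' as in the question. The function names are invented for this example, and ties are handled by returning every word at the smallest distance.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def closest_words(target, text):
    """Return (distance, words): all distinct words of `text` at the smallest distance."""
    words = sorted(set(text.lower().split()))
    if not words:
        return None, []
    scored = {word: levenshtein(target, word) for word in words}
    best = min(scored.values())
    return best, [word for word in words if scored[word] == best]

print(levenshtein("the", "to"))
print(closest_words("to", "this is the sentence"))
```

Returning all tied words leaves the tie-breaking rule to the caller, for example preferring the most frequent word in the text.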
You can do something like:
>>> w1 = "the"
>>> w2 = "to"
>>> len([i for i in w1 if i not in w2])
2
to compare two words. Then:
>>> txt1 = "this is the sentence"
>>> w1 = "to"
>>> min(txt1.split(), key=lambda w2: len([i for i in w2 if i not in w1]))
'is'
Here "is" and "the" both score 2 against "to", and min() returns the first of them, so you still need a rule for ties. Calling split() with no argument also works when there is more than one space between words.
So adapt it to your text: get a list of words from the text, then compare each element of that list with your input word.

Searching for two or more words in string - Python Troubleshooting Program

I know the code for checking whether any word from a list appears in a string:
if any(word in problem for word in keyword_virus):
    pc_virus()  # some code here
But is there any code, or a modification of this code, that would let me check whether two or more words matched?
Thank you :)
keyword_virus = 'the brown fox jumps'
print([x for x in ['brown', 'jumps', 'notinstring'] if x in keyword_virus.split()])
#['brown', 'jumps']
That will return all matched words in keyword_virus.
If I understand your question correctly, I'd rewrite the for loop to check every word in your checklist and append each match.
matches = []
for word in check_list:
    if word in problem_list:
        matches.append(word)
You'll end up with a list of words that match, from which you can count the occurrence of each word.
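To answer the "two or more" part directly, the length of the match list can be tested. A minimal sketch, with a made-up keyword list, problem string and function name:

```python
def matching_keywords(problem, keywords):
    """Return the keywords that occur as whole words in `problem` (case-insensitive)."""
    words = set(problem.lower().split())
    return [keyword for keyword in keywords if keyword.lower() in words]

keyword_virus = ["virus", "infected", "malware", "popup"]  # hypothetical keywords
problem = "My pc is infected with a virus"

matched = matching_keywords(problem, keyword_virus)
if len(matched) >= 2:
    print("Two or more keywords matched:", matched)
```

Since the problem text is split on whitespace, a keyword followed by punctuation (such as "virus!") would not match without extra cleaning.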

Cannot prove that there are Words in my list for my WordSearch game

I have created a list of all possible lines for this specific word grid (diagonals, up, down, and all the reverses too).
I have called this allWords, but when I try to find specific words that I know are in allWords, the loop does not find the hidden words. I know what my problem is, but I do not know how to get around it (sorry for the terrible explanation; hopefully the example below shows it better).
An example follows. My WordList is the list of words that I know are hidden somewhere in the word grid. My allWords is a list of rows, columns, and diagonals from the word grid:
WordList = ['HAMMER','....']
allWords = ['ARBHAMMERTYU','...']
HAMMER is in allWords, but it is 'cloaked' by the other characters around it, so I am unable to show that HAMMER is in the word grid.
length = len(allWords)
for i in range(length):
    word = allWords[i]
    if word in wordList:
        print("I have found", word)
It does not find the word HAMMER in allWords.
Any help towards solving this problem would be great.
You are not comparing each word in wordList with the entries in allWords. The line if word in wordList tests for an exact match,
i.e.
if word in wordList will return True only if the whole entry (here 'ARBHAMMERTYU') is itself an element of wordList.
To match a substring you need another loop:
for i in range(len(allWords)):
    word = allWords[i]
    for w in WordList:
        if w in word:
            print("I have found", w)
If I understand your problem correctly, you probably need to implement a function that checks whether a token (e.g. 'HAMMER') is present in any of the entries in allWords. My best bet for a solution would be to use regular expressions.
import re
def findWordInWordList(word, allWords):
    # re.escape() keeps any special characters in word from being read as regex syntax
    pattern = re.compile(".*%s.*" % re.escape(word))
    for item in allWords:
        match = pattern.search(item)
        if match:
            return match
This returns the first occurrence; if you want more, it is easy to collect them in a list.
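Collecting every match instead of only the first might look like this. It is a sketch that uses plain substring tests rather than a regex, and the sample data and the name find_all are invented for the example.

```python
def find_all(words, lines):
    """Map each hidden word to the grid lines that contain it as a substring."""
    return {word: [line for line in lines if word in line] for word in words}

WordList = ["HAMMER", "SAW"]                       # sample hidden words
allWords = ["ARBHAMMERTYU", "QWASZX", "HAMMERXX"]  # sample rows/columns/diagonals

for word, lines in find_all(WordList, allWords).items():
    print(word, "found in", lines)
```

A word with an empty list was not found in any line of the grid.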
You could try something like this:
for word in allWords:
    if word in WordList:
        print("I have found", word)
Ah, or maybe the error is that you wrote wordList and you really defined WordList. Hope this helps.
If I understand correctly, you want to iterate over WordList and determine whether each word has a substring match inside allWords.
If so, your code is not quite doing that. Going through it step by step:
length = len(allWords)
for i in range(length):
Here you do not really want to loop over allWords. You want to iterate over WordList and check whether each word is inside allWords, so do this instead:
length = len(WordList)
for i in range(length):
That means you now want to reference WordList and not allWords, so change this:
    word = allWords[i]
to this:
    word = WordList[i]
Finally, to determine whether you in fact have a substring match, use the built-in function any(). It returns True if at least one item in the iterable it is given is true. It looks like this:
any("something" in word for word in words)
It returns True if "something" is in at least one word of words; otherwise it returns False.
So, putting this all together and running your code with your sample input, we get:
WordList = ['HAMMER','....']
allWords = ['ARBHAMMERTYU','...']
length = len(WordList)
for i in range(length):
    word = WordList[i]
    if any(word in w for w in allWords):
        print("I have found", word)
Output:
I have found HAMMER

Determine if a list of words is in a sentence?

Is there a way (Pattern, Python, NLTK, etc.) to detect whether a sentence has a list of words in it?
For example:
The cat ran into the hat, box, and house. | The list would be hat, box, and house
This could be handled with string processing, but we may have more generic lists, for example:
The cat likes to run outside, run inside, or jump up the stairs. |
List=run outside, run inside, or jump up the stairs.
The list could be in the middle of a paragraph or at the end of the sentence, which further complicates things.
I've been working with Pattern for Python for a while and I'm not seeing a way to go about this, so I was curious whether there is a way with Pattern or NLTK (Natural Language Toolkit).
From your question, I think you want to check whether all the words in your list are present in a sentence.
In general, to check for all the elements of a list in a sentence, you can use the all() function. It returns True if every item it is given is true.
listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"
if all(word in sentence for word in listOfWords):
    print("All words in sentence")
else:
    print("Missing")
OUTPUT:
All words in sentence
Note that in does substring matching on strings, so word3 is found inside fword3.
I think this might serve your purpose. If not, then you can clarify.
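If substring hits such as word3 inside fword3 should not count, one option is to require word boundaries. A minimal sketch; the function name missing_words is invented for this example.

```python
import re

def missing_words(sentence, words):
    """Return the entries of `words` that do not occur as whole words or phrases in `sentence`."""
    return [
        word for word in words
        # \b requires a word boundary on both sides of the entry
        if not re.search(r"\b" + re.escape(word) + r"\b", sentence)
    ]

listOfWords = ['word1', 'word2', 'word3', 'two words']
sentence = "word1 as word2 a fword3 af two words"

print(missing_words(sentence, listOfWords))
# ['word3']
```

An empty result means every entry, including multi-word entries such as 'two words', appears in the sentence.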
Using a trie, you can do this in O(n), where n is the number of words in the list, once the trie has been built from the list of words, which also takes O(n).
Algorithm
Split the sentence into a list of words separated by spaces.
For each word, check whether it is a key in the trie, i.e. whether that word exists in the list.
If it exists, add that word to the result, to keep track of how many words from the list appear in the sentence.
Also keep track of the words that have a subtrie, i.e. the current word is a prefix of a longer entry in the list of words.
For each of these pending words, check whether extending it with the current word gives a key or a subtrie in the trie.
If it is a subtrie, add it to the extended_words list and see whether concatenating it with the next words gives an exact match.
Code
import pygtrie
listOfWords = ['word1', 'word2', 'word3', 'two words']
separator = ' '
trie = pygtrie.StringTrie(separator=separator)
for word in listOfWords:
    trie[word] = True
sentence = "word1 as word2 a fword3 af two words"
sentence_words = sentence.split()
words_found = {}
extended_words = set()  # prefixes of multi-word entries seen so far
for possible_word in sentence_words:
    has_possible_word = trie.has_node(possible_word)
    if has_possible_word & trie.HAS_VALUE:
        words_found[possible_word] = True
    # Iterate over a copy, because extended_words is modified inside the loop.
    for extended_word in set(extended_words):
        extended_words.remove(extended_word)
        possible_extended_word = extended_word + separator + possible_word
        has_possible_extended_word = trie.has_node(possible_extended_word)
        if has_possible_extended_word & trie.HAS_VALUE:
            words_found[possible_extended_word] = True
        if has_possible_extended_word & trie.HAS_SUBTRIE:
            # add() stores the whole string; update() would add its characters one by one
            extended_words.add(possible_extended_word)
    if has_possible_word & trie.HAS_SUBTRIE:
        extended_words.add(possible_word)
print(words_found)
print(len(words_found) == len(listOfWords))
This is useful if your list of words is huge and you do not wish to iterate over it every time or you have a large number of queries that over the same list of words.
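For readers without pygtrie installed, the same idea can be sketched with nested dictionaries from the standard library. This is not the pygtrie API; END, build_trie and phrases_in are names invented for this illustration.

```python
END = object()  # marker key: an entry from the list ends at this node

def build_trie(phrases):
    """Build a nested-dict trie keyed by the space-separated tokens of each phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase.split():
            node = node.setdefault(token, {})
        node[END] = phrase
    return root

def phrases_in(sentence, trie):
    """Return the set of list entries that occur as token sequences in `sentence`."""
    tokens = sentence.split()
    found = set()
    for start in range(len(tokens)):
        node = trie
        for token in tokens[start:]:
            node = node.get(token)
            if node is None:  # no entry continues with this token
                break
            if END in node:
                found.add(node[END])
    return found

listOfWords = ['word1', 'word2', 'word3', 'two words']
trie = build_trie(listOfWords)
found = phrases_in("word1 as word2 a fword3 af two words", trie)
print(sorted(found))
print(len(found) == len(listOfWords))
```

The trie is built once and can be reused for many sentences, which is the case where this approach pays off.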
The code is here
What about using from nltk.tokenize import sent_tokenize ?
sent_tokenize("Hello SF Python. This is NLTK.")
["Hello SF Python.", "This is NLTK."]
Then you can use that list of sentences in this way:
for sentence in my_list:
    # test if this sentence contains the words you want
    # using the all() function
More info here
all(word in sentence for word in listOfWords)
