I am working on removing similar phrases in a list, but I have hit a small roadblock.
I have sentences and phrases; the phrases are related to the sentences, and all the phrases of a sentence are in a single list.
Let the phrase list be : p=[['This is great','is great','place for drinks','for drinks'],['Tonight is a good','good night','is a good','for movies']]
I want my output to be [['This is great','place for drinks'],['Tonight is a good','for movies']]
Basically, I want to get all the longest unique phrases of a list.
I took a look at the fuzzywuzzy library, but I am unable to arrive at a good solution.
Here is my code:
def remove_dup(arr, threshold=80):
    ret_arr = []
    for item in arr:
        if item[1] < threshold:
            ret_arr.append(item[0])
    return ret_arr

def find_important(sents=sents, phrase=phrase):
    import os, random
    from fuzzywuzzy import process, fuzz
    all_processed = []  # final array to be returned
    for i in range(len(sents)):
        new_arr = []  # reshaped phrases for a single sentence
        for item in phrase[i]:
            new_arr.append(item)
        new_arr.sort(reverse=True, key=lambda x: len(x))  # sort with highest length first
        important = []  # array to store terms
        important = process.extractBests(new_arr[0], new_arr)  # to get Levenshtein distance matches
        to_proc = remove_dup(important)  # remove_dup removes all relatively matching terms
        to_proc.append(important[0][0])  # the term with the highest match is obviously the important term
        all_processed.append(to_proc)  # add non-duplicates to all_processed
    return all_processed
Can someone point out what I am missing, or what is a better way to do this?
Thanks in advance!
I would use the difference between each phrase and all the other phrases.
If a phrase has at least one different word compared to all the other phrases then it's unique and should be kept.
I've also made it robust to exact duplicates and extra spaces:
sentences = [['This is great','is great','place for drinks','for drinks'],
             ['Tonight is a good','good night','is a good','for movies'],
             ['Axe far his favorite brand for deodorant body spray',' Axe far his favorite brand for deodorant spray','Axe is']]

new_sentences = []
s = " "
for phrases in sentences:
    new_phrases = []
    phrases = [phrase.split() for phrase in phrases]
    for i in range(len(phrases)):
        phrase = phrases[i]
        if all([len(set(phrase).difference(phrases[j])) > 0 or i == j for j in range(len(phrases))]):
            new_phrases.append(phrase)
    new_phrases = [s.join(phrase) for phrase in new_phrases]
    new_sentences.append(new_phrases)
print(new_sentences)
Output:
[['This is great', 'place for drinks'],
['Tonight is a good', 'good night', 'for movies'],
['Axe far his favorite brand for deodorant body spray', 'Axe is']]
I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for the analysis, and I want to find the words associated with them (and later make a word cloud and a counter of occurrences) to understand the topics/context associated with such keywords.
Use case:
df.text_column()
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water'.
Another entry is like: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola','coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me to do that [Python only]
First, you can optionally load nltk's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop words list, or even extend nltk's list with additional words. Then, you can use the nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations.
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list.
nltk.download('stopwords')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']

def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        [matches.append(m) for m in matching_list if m not in matches]
    return matches

all_matching_bigrams = find_matches(bigrams_list)    # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list)  # find all matching trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
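The question also mentions building a counter of occurrences; a minimal follow-up sketch with collections.Counter (nothing beyond the standard library, using the all_matches list from above):
from collections import Counter

# count how often each matching n-gram occurs
match_counts = Counter(all_matches)
print(match_counts.most_common())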
It's my first time posting, but I have a question about creating a function in Python that will search a list of strings and return any of the words I am looking for. Here is what I have so far:
def search_words(data, search_words):
    keep = []
    for data in data:
        if data in search_words:
            keep.append(data)
    return keep
Here is the data I am searching through and the words I am trying to find:
data = ['SHOP earnings for Q1 are up 5%',
'Subscriptions at SHOP have risen to all-time highs, boosting sales',
"Got a new Mazda, VROOM VROOM Y'ALL",
'I hate getting up at 8am FOR A STUPID ZOOM MEETING',
'TSLA execs hint at a decline in earnings following a capital expansion program']
words = ['earnings', 'sales']
When I run print(search_words(data=data, search_words=words)), my list (keep) comes back as empty brackets [] and I am unsure how to fix the issue. I know that searching for a word in a string is different from looking for a number in a list, but I cannot figure out how to modify my code to account for that. Any help would be appreciated.
You can use the following. This will keep all the sentences in data that contain at least one of the words:
keep = [s for s in data if any(w in s for w in words)]
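For the sample data above, that keeps the two SHOP sentences and the TSLA sentence (a quick check):
print(keep)
# ['SHOP earnings for Q1 are up 5%',
#  'Subscriptions at SHOP have risen to all-time highs, boosting sales',
#  'TSLA execs hint at a decline in earnings following a capital expansion program']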
Since they are all strings, instead of looping over them one by one you can combine them all and search the result (making words a set would also speed up the membership test):
[word for word in ' '.join(data).split() if word in words]
Using regex:
import re
re.findall('|'.join(words), ' '.join(data))
['earnings', 'sales', 'earnings']
You can use the following. It will keep every sentence in data that contains at least one of the words (a sentence that contains both words will be appended twice). This is a beginner-friendly version; data and search_words must both be lists.
def search_words(data, search_words):
    keep = []
    for dt in data:
        for sw in search_words:
            if sw in dt:
                keep.append(dt)
    return keep
I have the same problem that was discussed in this link
Python extracting sentence containing 2 words
but the difference is that I need to extract only sentences that contain the two words within a defined window size. For example:
sentences = [ 'There was peace and happiness','hello every one',' How to Find Inner Peace ,love and Happiness ','Inner peace is closely related to happiness']
search_words= ['peace','happiness']
windows_size = 3  # search only the three words after the first word, 'peace'
# output must be:
output= ['There was peace and happiness',' How to Find Inner Peace love and Happiness ']
Here is a crude solution.
def search(sentences, keyword1, keyword2, window=3):
    res = []
    for sentence in sentences:
        words = sentence.lower().split(" ")
        if keyword1 in words and keyword2 in words:
            keyword1_idx = words.index(keyword1)
            keyword2_idx = words.index(keyword2)
            if keyword2_idx - keyword1_idx <= window:
                res.append(sentence)
    return res
Given a sentences list and two keywords, keyword1 and keyword2, we iterate through the sentences list one by one. We split the sentence into words, assuming that the words are separated by a single space. Then, after performing a cursory check of whether or not both keywords are present in the words list, we find the index of each keyword in words to make sure that the indices are at most window apart, i.e. the words are close together within window words. We append only the sentences that satisfy this condition to the res list, and return that result.
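As a quick usage check with the example from the question (the keywords are passed lower-cased, since the function lower-cases each sentence):
sentences = ['There was peace and happiness', 'hello every one',
             ' How to Find Inner Peace ,love and Happiness ',
             'Inner peace is closely related to happiness']
print(search(sentences, 'peace', 'happiness'))
# ['There was peace and happiness', ' How to Find Inner Peace ,love and Happiness ']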
I'm doing an NLP project with my university, collecting data on words in Icelandic that exist both spelled with an i and with a y (they sound the same in Icelandic, fyi), where both variants are actual words but do not mean the same thing. Examples of this include leyti (an approximation in time) and leiti (a grassy hill), or kirkja (church) and kyrkja (choke). I have a dataset of 2 million words.
I have already collected two wordlists, one of which includes words spelled with a y and one of which includes the same words spelled with an i (although they don't seem to match up completely, as the y-list is a bit longer, but that's a separate issue).
My problem is that I want to end up with pairs of words like leyti - leiti, kyrkja - kirkja, etc. But, as y is much later in the alphabet than i, it's no good just sorting the lists and pairing them up that way. I also tried zipping the lists while checking the first few letters to see if I can find a match, but that leaves out all words that have y or i as the first letter. Do you have a suggestion on how I might implement this?
Try something like this:
s = "trydfydfgfay"
l = list(s)
candidateWords = []
for idx, c in enumerate(l):
    if c == 'y':
        newList = l.copy()
        newList[idx] = "i"
        candidateWord = "".join(newList)
        candidateWords.append(candidateWord)
print(candidateWords)
# ['tridfydfgfay', 'trydfidfgfay', 'trydfydfgfai']
# look up these words to see if they are real words
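The lookup step might then be something like the following, assuming the i-spelled word list has been loaded into a set (i_words and its contents here are purely illustrative):
i_words = {"tridfydfgfay", "leiti", "kirkja"}  # hypothetical set of real i-spelled words
real_pairs = [(s, cand) for cand in candidateWords if cand in i_words]
print(real_pairs)  # keeps only the candidates that are actual words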
I do not think this is a programming challenge, but looks more like an NLP challenge itself. The spelling variations are often a hurdle that one would face during preprocessing.
I would suggest that you use an edit-distance-based approach to identify word pairs while allowing for some variation. Specifically, for the kind of problem you have described above, I would recommend the Jaro-Winkler distance. This method gives a higher similarity score to word pairs that differ only in specific character pairs, say y and i.
All these approaches are implemented in the Jellyfish library.
You may also have a look at the fuzzywuzzy package. Hope this helps.
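For illustration, a minimal sketch that pairs each y-word with its most similar i-word using Jaro-Winkler similarity. This assumes the jellyfish package (in older releases the function is jellyfish.jaro_winkler rather than jaro_winkler_similarity), and the 0.85 cut-off is only a guess:
import jellyfish

y_words = ["leyti", "kyrkja"]
i_words = ["leiti", "kirkja", "kona"]

pairs = []
for yw in y_words:
    # pick the i-word with the highest Jaro-Winkler similarity to this y-word
    best = max(i_words, key=lambda iw: jellyfish.jaro_winkler_similarity(yw, iw))
    if jellyfish.jaro_winkler_similarity(yw, best) >= 0.85:
        pairs.append((yw, best))

print(pairs)  # expected to pair leyti with leiti and kyrkja with kirkja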
So this accomplishes my task; it's kind of an easy, not-that-pretty solution I suppose, but it works:
wordlist = open("data.txt", "r", encoding='utf-8')
y_wordlist = open("y_wordlist.txt", "w+", encoding='utf-8')  # output file for the pairs

all_words = []
y_words = []

for word in wordlist:
    word = word.lower().strip()
    all_words.append(word)

for word in all_words:
    if "y" in word:
        y_words.append(word)

word_dict = {}
for word in y_words:
    newwith1y = word.replace("y", "i", 1)        # first y -> i
    newwith2y = word.replace("y", "i", 2)        # first two y's -> i
    newyback = word[::-1].replace("y", "i", 1)   # last y -> i
    newyback = newyback[::-1]
    word_dict[word] = {newwith1y, newwith2y, newyback}  # keep all candidates, a set drops duplicates

for key, values in word_dict.items():
    for value in values:
        if value in all_words:
            y_wordlist.write(key)
            y_wordlist.write(" - ")
            y_wordlist.write(value)
            y_wordlist.write("\n")

wordlist.close()
y_wordlist.close()
I am trying to process various texts with regex and Python's NLTK, which is at http://www.nltk.org/book. I am trying to create a random text generator and I am having a slight problem. Firstly, here is my code flow:
1. Enter a sentence as input; this is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg database for sentences that contain this word, regardless of upper/lower case.
4. Return the longest sentence that contains the word I spoke about in step 3.
5. Append the sentences from step 1 and step 4 together.
6. Assign the sentence from step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in the second sentence and continue like that, and so on.
So far, I have been able to do this only once. When I try to make it continue, the program only keeps printing the first sentence my search yields. It should instead look for the longest word in this new sentence and keep applying the code flow described above.
Below is my code, along with a sample input/output:
Sample input
"Thane of code"
Sample output
"Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"
Now this should actually take the sentence that starts with 'Norway himselfe...', look for the longest word in it, and do the steps above, and so on, but it doesn't. Any suggestions? Thanks.
import nltk
from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")  # get input str
split_str = triggerSentence.split()  # split the sentence into words
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
    # code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)
    listOfSents = gutenberg.sents()  # all sentences of gutenberg are assigned -list of list format-
    listOfWords = gutenberg.words()  # all words in gutenberg books -list format-
    # I tip my hat to Mr. Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower()  # this line tells you whether word list has the longest word in a case-insensitive way
    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key=len)
    # get longest sentence -list format with every word of sentence being an actual element-
    longestSent = [longestSentence]
    for word in longestSent:  # convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " " + sstr
    triggerSentence = sstr
How about this?
1. You find the longest word in the trigger.
2. You find the longest word in the longest sentence containing the word found in 1.
3. The word from 1. is the longest word of the sentence from 2.
What happens? Hint: the answer starts with "Infinite". To correct the problem, you could find a set of lower-cased words useful.
BTW, when do you think montyPython becomes False and the program finishes?
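A minimal sketch of that idea (not the full generator): keep a set of the lower-cased trigger words already used, so the loop can stop instead of expanding the same word forever. The used_words name is just illustrative, and the Gutenberg lookup is left as a comment:
triggerSentence = "Thane of code"
used_words = set()  # lower-cased longest words already used as triggers

while True:
    longest_word = max(triggerSentence.split(), key=len).lower()
    if longest_word in used_words:
        break  # stop instead of expanding the same word forever
    used_words.add(longest_word)
    # here you would find the longest Gutenberg sentence containing
    # longest_word and assign it to triggerSentence, as in the question's code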
Rather than searching the entire corpus each time, it may be faster to construct a single map from word to the longest sentence containing that word. Here's my (untested) attempt to do this.
import collections
from nltk.corpus import gutenberg

def words_in(sentence):
    """Generate all words in the sentence (lower-cased)"""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()

def make_sentence_map(sentences):
    """Construct a map from words to the longest sentence containing the word."""
    result = collections.defaultdict(str)
    for tokens in sentences:
        sentence = " ".join(tokens)  # gutenberg.sents() yields lists of tokens
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence
    return result

def generate_random_text(sentence, sentence_map):
    while True:
        yield sentence
        longest_word = max(words_in(sentence), key=len)
        sentence = sentence_map[longest_word]

sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map):
    print sentence
Mr. Hankin's answer is more elegant, but the following is more in keeping with the approach you began with:
import sys
import string
import nltk
from nltk.corpus import gutenberg

def longest_element(p):
    """return the first element of p which has the greatest len()"""
    max_len = 0
    elem = None
    for e in p:
        if len(e) > max_len:
            elem = e
            max_len = len(e)
    return elem

def downcase(p):
    """returns a list of words in p shifted to lower case"""
    return map(string.lower, p)

def unique_words():
    """it turns out unique_words was never referenced so this is here
    for pedagogy"""
    # there are 2.6 million words in the gutenburg corpus but only ~42k unique
    # ignoring case, let's pare that down a bit
    words = set()
    for word in gutenberg.words():
        words.add(word.lower())
    print 'gutenberg.words() has', len(words), 'unique caseless words'
    return words

print 'loading gutenburg corpus...'
sentences = []
for sentence in gutenberg.sents():
    sentences.append(downcase(sentence))

trigger = sys.argv[1:]
target = longest_element(trigger).lower()
last_target = None

while target != last_target:
    matched_sentences = []
    for sentence in sentences:
        if target in sentence:
            matched_sentences.append(sentence)
    print '===', target, 'matched', len(matched_sentences), 'sentences'
    longestSentence = longest_element(matched_sentences)
    print ' '.join(longestSentence)
    trigger = longestSentence
    last_target = target
    target = longest_element(trigger).lower()
Given your sample sentence though, it reaches fixation in two cycles:
$ python nltkgut.py Thane of code
loading gutenburg corpus...
=== target thane matched 24 sentences
norway himselfe , with terrible
numbers , assisted by that most
disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
=== target bridegroome matched 1 sentences
norway himselfe , with
terrible numbers , assisted by that
most disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
Part of the trouble with the response to the last problem is that it did what you asked, but you asked a more specific question than you wanted an answer to. Thus the response got bogged down in some rather complicated list expressions that I'm not sure you understood. I suggest that you make more liberal use of print statements and don't import code if you don't know what it does. While unwrapping the list expressions I found (as noted) that you never used the corpus wordlist. Functions are a help also.
You are assigning "split_str" outside of the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop, so it changes each time.
import nltk
from nltk.corpus import gutenberg

triggerSentence = raw_input("Please enter the trigger sentence: ")  # get input str
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
    # so this is run every time through the loop
    split_str = triggerSentence.split()  # split the sentence into words
    # code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)
    listOfSents = gutenberg.sents()  # all sentences of gutenberg are assigned -list of list format-
    listOfWords = gutenberg.words()  # all words in gutenberg books -list format-
    # I tip my hat to Mr. Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower()  # this line tells you whether word list has the longest word in a case-insensitive way
    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key=len)
    # get longest sentence -list format with every word of sentence being an actual element-
    longestSent = [longestSentence]
    for word in longestSent:  # convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print triggerSentence + " " + sstr
    triggerSentence = sstr