Check whether there are any synonyms between two word sets - Python

I need to find the synonyms for a given word from a sentence. For example:
list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
Here I need to find whether there are any synonyms for the word 'happy' in list2 and print them. I'm new to Python. I tried the following code.
from nltk.corpus import wordnet

list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
list = []
for word1 in list1:
    for word2 in list2:
        wordFromList1 = wordnet.synsets(word1)
        wordFromList2 = wordnet.synsets(word2)
        if wordFromList1 and wordFromList2:
            s = wordFromList1[0].word1.lemmas(wordFromList2[0])
            s = wordFromList1[0].word2.lemmas(wordFromList2[0])
            list.append(s)
print(list)
But it does not work. Please help me.

When you use wordnet.synsets("happy") it returns synset entries (such as Synset('happy.a.01')) that contain part-of-speech info and an ID. You need to call lemma_names() on each synset to get the actual word forms. Try this:
from nltk.corpus import wordnet

def get_word_synonyms_from_sent(word, sent):
    word_synonyms = []
    for synset in wordnet.synsets(word):
        for lemma in synset.lemma_names():
            if lemma in sent and lemma != word:
                word_synonyms.append(lemma)
    return word_synonyms

word = "happy"
sent = ['I', 'am', 'glad', 'it', 'was', 'felicitous', '.']
word_synonyms = get_word_synonyms_from_sent(word, sent)
print("WORD:", word)
print("SENTENCE:", sent)
print("SYNONYMS FOR '" + word.upper() + "' FOUND IN THE SENTENCE: " + ", ".join(word_synonyms))
# OUTPUT
# >>> WORD: happy
# >>> SENTENCE: ['I', 'am', 'glad', 'it', 'was', 'felicitous', '.']
# >>> SYNONYMS FOR 'HAPPY' FOUND IN THE SENTENCE: felicitous, glad
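Applied to the original pair of lists from the question, the same helper works unchanged. A minimal sketch (note that none of WordNet's synsets for 'happy' list 'funny' as a lemma, so this particular pair will most likely print an empty list):
list1 = ['happy']
list2 = ['It', 'is', 'so', 'funny']
for word in list1:
    # reuses get_word_synonyms_from_sent from the answer above
    print(word, "->", get_word_synonyms_from_sent(word, list2))
# >>> happy -> []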

Related

How to iterate and apply text pre-processing on sublists

I have a corpus of 20k rows of Twitter data, which I have already lower-cased and tokenised using the tweet tokenizer.
For example:
X = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "'someurl"]
]
tweet_tokens = []
for tweet in tweets:
    tweet = tweet.lower()
    tweet_tokens.append(tweet)
This is how I lowercased my tokens.
How can I iterate through the sublists to remove stopwords, punctuation, blank tokens and URLs from each one, while keeping the content of #'s?
This is what I thought/tried, but it's not giving me the right results (only filtering stopwords, as an example):
filtered_sentence = []
filtered_word = []
for sent in X:
    for word in sent:
        if word not in stopwords:
            filtered_word.append(word)
            filtered_sentence.append(word)
What would be the correct way to iterate through each sublist and process it without disrupting the list structure?
Ideally the output should look like this:
Cleaned_X = [
    ["love", "play", "games"],
    ["favourite", "colour", "purple"],
    ["ladygaga", "love", "#stan"]
]
import validators

punctuation_list = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']

dirty = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "https://test.de"]
]

def tweet_check(elem):
    # keep a token only if it is not punctuation/blank and not a URL
    return elem not in punctuation_list and not validators.url(elem)

def clean_list_list(tweets):
    return [[elem for elem in tweet if tweet_check(elem)]
            for tweet in tweets]

clean_list_list(dirty)
I have tested this; it should be very close to the solution you are looking for.
Output:
[['i', 'love', 'to', 'play', 'games'],
 ['my', 'favourite', 'colour', 'is', 'purple'],
 ['#ladygaga', 'we', 'love', 'you', '#stan']]
You can write your own validation function if you want, or use validators:
pip install validators
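If you'd rather avoid the extra dependency, a rough home-grown check can be built on the standard library. A minimal sketch (looks_like_url is a hypothetical helper name; urlparse only checks that a scheme and network location are present, so it is far less thorough than validators):
from urllib.parse import urlparse

def looks_like_url(token):
    # hypothetical helper: treat a token as a URL if it parses with
    # both a scheme ('https') and a netloc ('test.de')
    parsed = urlparse(token)
    return bool(parsed.scheme) and bool(parsed.netloc)

print(looks_like_url("https://test.de"))  # True
print(looks_like_url("#ladygaga"))        # False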
This works:
from nltk.corpus import stopwords

x = [
    ["i", "love", "to", "play", "games", "", "."],
    ["my", "favourite", "colour", "is", "purple", "!"],
    ["#ladygaga", "we", "love", "you", "#stan", "'someurl"]
]
punctuations = ['(', ')', ';', ':', '[', ']', ',', '!', '?', '.', '']
stop_words = stopwords.words('english')

clean_list = []
for sub_list in x:
    for word in sub_list:
        if word not in stop_words and word not in punctuations:
            clean_list.append(word)
print(clean_list)
Output:
['love', 'play', 'games', 'favourite', 'colour', 'purple', '#ladygaga', 'love', '#stan', "'someurl"]
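Note that this flattens everything into one list, while the desired Cleaned_X keeps the sublist structure. To preserve it, filter each sub-list into a list of its own, for example with a nested comprehension (a minimal sketch reusing stop_words and punctuations from above; the URL token would still need a separate check such as the validators approach from the first answer):
clean_x = [
    [word for word in sub_list
     if word not in stop_words and word not in punctuations]
    for sub_list in x
]
print(clean_x)
# [['love', 'play', 'games'], ['favourite', 'colour', 'purple'],
#  ['#ladygaga', 'love', '#stan', "'someurl"]]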

How can I count the number of times any of an array of strings appears in text with Python?

I have some text, raw_text and I have an array of words:
VERBS = ['be', 'am', 'is', 'are', 'was', 'were', 'being', 'been']
I want to count the number of times ANY of those words is used in raw_text. Case doesn't matter, but word boundaries do.
I'm sure this is doable with Regex or NLTK. Any ideas?
VERBS = ['be', 'am', 'is', 'are', 'was', 'were', 'being', 'been']
raw_text = "This IS example text which we will use to count these words: am, be, is, are"

raw_text2 = " " + raw_text.lower() + " "
cnt = 0
for verb in VERBS:
    cnt += len(raw_text2.split(f" {verb} ")) - 1
    cnt += len(raw_text2.split(f" {verb},")) - 1
print(cnt)
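For the regex route the question mentions, \b word boundaries plus re.IGNORECASE handle case and punctuation in one pass. A minimal sketch:
import re

VERBS = ['be', 'am', 'is', 'are', 'was', 'were', 'being', 'been']
raw_text = "This IS example text which we will use to count these words: am, be, is, are"

# one alternation over all the verbs, anchored on word boundaries
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, VERBS)) + r")\b", re.IGNORECASE)
print(len(pattern.findall(raw_text)))  # 5
Unlike the split-based version, this also counts a verb followed by '.', '?' or any other non-word character.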

Return a list of words that contain a letter

I want to return a list of words containing a given letter, disregarding its case.
Say I have sentence = "Anyone who has never made a mistake has never tried anything new"; then f(sentence, 'a') would return
['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
This is what I have:
import re

def f(string, match):
    string_list = string.split()
    match_list = []
    for word in string_list:
        if match in word:
            match_list.append(word)
    return match_list
You don't need re. Use str.casefold:
[w for w in sentence.split() if "a" in w.casefold()]
Output:
['Anyone', 'has', 'made', 'a', 'mistake', 'has', 'anything']
You can use string splitting for it, if there is no punctuation.
match_list = [s for s in sentence.split(' ') if 'a' in s.lower()]
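If the sentence does contain punctuation, re can extract the word tokens first. A minimal sketch (note that \w+ also splits on apostrophes and hyphens, which may or may not be what you want):
import re
match_list = [w for w in re.findall(r"\w+", sentence) if 'a' in w.lower()]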
Here's another variation:
sentence = 'Anyone who has never made a mistake has never tried anything new'

def f(string, match):
    match_list = []
    for word in string.split():
        if match in word.lower():
            match_list.append(word)
    return match_list

print(f(sentence, 'a'))

How to only return actual tokens, rather than empty variables when tokenizing?

I have a function:
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc), min_len=2) if word not in stop_words] for doc in texts]
My input is a list with a tokenized sentence:
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
Assume that stop_words contains the words: 'this', 'is', 'an', 'of' and 'my', then the output I would like to get is:
desired_output = ['example', 'input']
However, the actual output that I'm getting now is:
actual_output = [[], [], [], ['example'], [], [], ['input']]
How can I adjust my code, to get this output?
There are two solutions to your problem:
Solution 1:
Your remove_stopwords requires a list of documents to work properly, so modify your input like this:
input = [['This', 'is', 'an', 'example', 'of', 'my', 'input']]
Solution 2:
You change your remove_stopwords function to work on a single document:
def remove_stopwords(text):
    return [word for word in simple_preprocess(str(text), min_len=2) if word not in stop_words]
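With Solution 2 the call site stays exactly as it was. A minimal sketch, assuming simple_preprocess is gensim.utils.simple_preprocess and stop_words is the lowercase list described in the question:
from gensim.utils import simple_preprocess

stop_words = ['this', 'is', 'an', 'of', 'my']
input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
print(remove_stopwords(input))  # ['example', 'input']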
You can use the code below for removing stopwords, if there is no specific reason to use your own (the filtered list is built inside the function so results don't accumulate across calls):
def remove_stopwords(text):
    wordsFiltered = []
    for w in text:
        if w not in stop_words:
            wordsFiltered.append(w)
    return wordsFiltered

input = ['This', 'is', 'an', 'example', 'of', 'my', 'input']
stop_words = ['This', 'is', 'an', 'of', 'my']
print(remove_stopwords(input))
Output:
['example', 'input']

Dictionary and position list back to sentence

I've managed to get my program to store a sentence or two in a dictionary and, at the same time, create a word-position list.
What I need to do now is recreate the original sentence from just the dictionary and the position list. I've done lots of searches, but the results I'm getting are either not what I need or are too confusing and beyond me.
Any help would be much appreciated, thanks.
Here is my code so far:
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print('This is the sentence:', sentence)
punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)
print('This is the sentence with spaces before the punctuations:', sentence)
words_list = sentence.split()
print('A list of the words in the sentence:', words_list)
dictionary = {}
word_pos_list = []
counter = 0
for word in words_list:
    if word not in dictionary:
        counter += 1
        dictionary[word] = counter
    word_pos_list.append(dictionary[word])
print('The positions of the words in the sentence are:', word_pos_list)
While, as mentioned in the comments, dictionaries are not sorted data structures, if you are breaking up a sentence, indexing it into a dictionary, and trying to put it back together, you can use an OrderedDict from the collections library to do what you're doing.
That said, this is without any further background on how you are splitting your sentence (punctuation etc.); I suggest looking into NLTK if you are doing any sort of natural language processing (NLP).
from collections import OrderedDict

def index_sentence(s):
    return {s.split(' ').index(i): i for i in s.split(' ')}

def build_sentence_from_dict(d):
    return ' '.join(OrderedDict(d).values())

s = 'See spot jump over the brown fox.'
id = index_sentence(s)
# {0: 'See', 1: 'spot', 2: 'jump', 3: 'over', 4: 'the', 5: 'brown', 6: 'fox.'}
build_sentence_from_dict(id)
# 'See spot jump over the brown fox.'
To reconstruct from your list you have to reverse the location mapping:
# reconstruct
reversed_dictionary = {x:y for y, x in dictionary.items()}
print(' '.join(reversed_dictionary[x] for x in word_pos_list))
This can be done more nicely using a defaultdict (a dictionary with a predefined default value, in your case a list of locations for each word):
#!/usr/bin/env python3.4
from collections import defaultdict

# preprocessing: iterate over the punctuation characters one by one
# (a single-element list of all of them would never match in replace())
sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
punctuation = '()?:;,.!/"\''
for punct in punctuation:
    sentence = sentence.replace(punct, " %s" % punct)

# using defaultdict this time
word_to_locations = defaultdict(list)
for position, word in enumerate(sentence.split()):
    word_to_locations[word].append(position)

# word -> list of locations
print(word_to_locations)

# location -> word
location_to_word = dict((y, x) for x in word_to_locations for y in word_to_locations[x])
print(location_to_word)

# reconstruct
print(' '.join(location_to_word[x] for x in range(len(location_to_word))))
It's not the randomness of dictionary keys that's the problem here; it's the failure to record every position at which a word was seen, duplicate or not. The following does that and then unwinds the dictionary to produce the original sentence, sans punctuation:
from collections import defaultdict

sentence = ("This Sentence is a very, very good sentence. Did you like my very good sentence?")
print('This is the sentence:', sentence)
punctuation = set('()?:;\\,.!/"\'')
sentence = ''.join(character for character in sentence if character not in punctuation)
print('This is the sentence with no punctuation:', sentence)
words = sentence.split()
print('A list of the words in the sentence:', words)
dictionary = defaultdict(list)
last_word_position = 0
for word in words:
    last_word_position += 1
    dictionary[word].append(last_word_position)
print('A list of unique words in the sentence and their positions:', dictionary.items())

# Now the tricky bit to unwind our random dictionary:
sentence = []
for position in range(1, last_word_position + 1):
    sentence.extend([word for word, positions in dictionary.items() if position in positions])
print(*sentence)
The output of the various print() statements:
This is the sentence: This Sentence is a very, very good sentence. Did you like my very good sentence?
This is the sentence with no punctuation: This Sentence is a very very good sentence Did you like my very good sentence
A list of the words in the sentence: ['This', 'Sentence', 'is', 'a', 'very', 'very', 'good', 'sentence', 'Did', 'you', 'like', 'my', 'very', 'good', 'sentence']
A list of unique words in the sentence and their positions: dict_items([('Sentence', [2]), ('is', [3]), ('a', [4]), ('very', [5, 6, 13]), ('This', [1]), ('my', [12]), ('Did', [9]), ('good', [7, 14]), ('you', [10]), ('sentence', [8, 15]), ('like', [11])])
This Sentence is a very very good sentence Did you like my very good sentence
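A slightly more direct unwind, offered as a sketch: invert dictionary into a position-to-word map once, then index it, rather than scanning every (word, positions) pair for each position:
# build {position: word} from the word -> positions dictionary above
position_to_word = {p: w for w, ps in dictionary.items() for p in ps}
print(' '.join(position_to_word[p] for p in range(1, last_word_position + 1)))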
