Synsets in WordNet for NLTK in Python

I am using this code to get all synonyms for the text in a document named "answer_tokens.txt", but it is only listing the words in the document without their synonyms. Can someone check it out?
from nltk.corpus import wordnet
from nltk import word_tokenize

with open('answer_tokens.txt') as a:  # opening the tokenised answer file
    wn_tokens = a.read()

# printing the answer tokens word by word as opened
print('==========================================')
synonyms = []
for b in word_tokenize(wn_tokens):
    print(str(b))
    for b in wordnet.synsets(b):
        for l in b.lemmas():
            synonyms.append(l.name())
print('==========================================')
print(set(synonyms))
This is the output it is giving:
['Compare', 'dynamic', 'case', 'data', 'changing', ',', ',', 'example', 'watching', 'video']
===================================================
set()
==================================================
This is the output we need:
['Compare', 'dynamic', 'case', 'data', 'changing', ',', ',', 'example', 'watching', 'video']
===================================================
'Compare'{'equate', 'comparison', 'compare', 'comparability', 'equivalence', 'liken'}
'dynamic'{'dynamic', 'active', 'dynamical', 'moral_force'}
'case' {'display_case', 'grammatical_case', 'example', 'event', 'causa', 'shell', 'pillow_slip', 'encase', 'character', 'cause', 'font', 'instance', 'type', 'casing', 'guinea_pig', 'slip', 'suit', "typesetter's_case", 'sheath', 'vitrine', 'typeface', 'eccentric', 'lawsuit', 'showcase', 'caseful', 'fount', 'subject', 'pillowcase', "compositor's_case", 'face', 'incase', 'case'}
'data' {'data', 'information', 'datum', 'data_point'}
'changing'{'modify', 'interchange', 'convert', 'alter', 'switch', 'transfer', 'commute', 'change', 'vary', 'deepen', 'changing', 'ever-changing', 'shift', 'exchange'}
'example ' {'example', 'exemplar', 'object_lesson', 'representative', 'good_example', 'exercise', 'instance', 'deterrent_example', 'lesson', 'case', 'illustration', 'model'}
'watching' {'watch', 'observation', 'view', 'watching', 'watch_out', 'check', 'look_on', 'ascertain', 'learn', 'watch_over', 'observe', 'follow', 'observance', 'take_in', 'look_out', 'find_out', 'keep_an_eye_on', 'catch', 'determine', 'see'}
'video' {'video_recording', 'video', 'television', 'picture', 'TV', 'telecasting'}
==================================================

First of all, the initialisation synonyms = [] should be done for each token separately, since you want to build a separate list of synonyms per token. So I've moved that statement into the for loop iterating over the tokens.
The second problem with your code is that you use the same variable name both for iterating over the tokens and for iterating over the synsets of the current token. This way you lose the reference to the token itself, so you can't print it afterwards.
Lastly, the set of synonyms should be printed for each token (as you yourself stated in the question), so I have moved the print statement to the end of the loop iterating over the tokens.
Here is the code:
from nltk.corpus import wordnet as wn
from nltk import word_tokenize as wTok

with open('answer_tokens.txt') as a:  # opening the tokenised answer file
    wn_tokens = a.read()

print(wn_tokens)
print('==========================================')
for token in wTok(wn_tokens):
    synonyms = []
    for syn in wn.synsets(token):
        for l in syn.lemmas():
            synonyms.append(l.name())
    print(token, set(synonyms))
print('==========================================')
Hope this helps!
P.S. It can be useful to give aliases to your imports in order to streamline the coding process. I have given the wn alias to the wordnet module and the wTok alias to the word_tokenize function.
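For what it's worth, the same per-token lookup can also be written as a dictionary comprehension. This is only a compact sketch of the loop above, assuming the same answer_tokens.txt input; note that repeated tokens collapse into a single entry:

from nltk.corpus import wordnet as wn
from nltk import word_tokenize as wTok

with open('answer_tokens.txt') as a:
    wn_tokens = a.read()

# Map every distinct token to the set of lemma names across all of its synsets.
token_synonyms = {
    token: {lemma.name() for syn in wn.synsets(token) for lemma in syn.lemmas()}
    for token in wTok(wn_tokens)
}
for token, syns in token_synonyms.items():
    print(token, syns)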

Related

How do I get the top 'realistic' hypernym from wordnet synset hyper_paths?

Python3.8 nltk wordnet
How can I find the highest reasonable hypernym for a given word?
I'm using WordNet's synset.hypernym_paths(), which, if I traverse it all the way to the top, gives me a hypernym that is way too abstract (i.e. entity).
I've tried creating a list of 'too high' hypernyms but the list is difficult to determine and way too long.
import sys
import nltk
from nltk.corpus import wordnet as wn
arTooHigh = ['act', 'event', 'action', 'artifact', 'instrumentality', 'furnishing', 'organism', 'cognition', 'content', 'discipline', 'humanistic_discipline', 'diversion', 'communication', 'auditory_communication', 'speech', 'state', 'feeling', 'causal_agent', 'think', 'reason','information', 'evidence', 'measure', 'fundamental_quantity', 'condition']
arWords = ['work', 'frog', 'bad', 'corpus', 'chair', 'dancing', 'gossip', 'love', 'jerry', 'compute', 'satisfied', 'gift', 'candy', 'bookkeeper', 'construction', 'amplified', 'party', 'dinner', 'family', 'sky', 'office', 'project', 'budget', 'price']
for word in arWords:
    hypernym = word
    last = ''
    synsets = wn.synsets(word)
    syn = synsets[0]
    for path in syn.hypernym_paths():
        for i, ss in reversed(list(enumerate(path))):
            test = ss.lemmas()[0].name()
            if test in arTooHigh:
                hypernym = last
                break
            else:
                last = test
    print(word + ' => ' + hypernym)
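One alternative worth sketching, instead of maintaining a hand-written blocklist of over-general hypernyms, is to cut the climb off by depth using Synset.min_depth() (the shortest distance of a synset from the root). The names realistic_hypernym and MIN_DEPTH below are hypothetical, and the cutoff value is an assumption that would need tuning per application:

from nltk.corpus import wordnet as wn

MIN_DEPTH = 4  # assumed cutoff: hypernyms closer to the root than this are "too abstract"

def realistic_hypernym(word, min_depth=MIN_DEPTH):
    synsets = wn.synsets(word)
    if not synsets:
        return word
    candidate = word
    path = synsets[0].hypernym_paths()[0]   # ordered from the root down to the synset
    for ss in reversed(path):               # walk from the synset up towards the root
        if ss.min_depth() < min_depth:
            break                           # everything above this point is too general
        candidate = ss.lemmas()[0].name()
    return candidate

for w in ['frog', 'chair', 'gossip', 'budget']:
    print(w, '=>', realistic_hypernym(w))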

Limiting the output

I made a dictionary using the .groupdict() function; however, I am having a problem eliminating certain output dictionaries.
For example, my code looks like this (tweet is a string that contains 5 elements separated by ||):
import re

def somefunction(pattern, tweet):
    pattern = r"^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
    for paper in tweet:
        for item in re.finditer(pattern, paper):
            item.groupdict()
This produces an output in the form:
{'username': 'yashrgupta ', 'botprob': ' 0.30794588629999997 '}
{'username': 'sterector ', 'botprob': ' 0.39391528649999996 '}
{'username': 'MalcolmXon ', 'botprob': ' 0.05630123819 '}
{'username': 'ryechuuuuu ', 'botprob': ' 0.08492567222000001 '}
{'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}
But I would like it to only return dictionaries whose botprob is above 0.7. How do I do this?
Specifically, as #WiktorStribizew notes, just skip iterations you don't want:
pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
for paper in tweet:
for item in re.finditer(pattern,paper):
item = item.groupdict()
if item["botprob"] < 0.7:
continue
print(item)
This could be wrapped in a generator expression to save the explicit continue, but there's enough going on as it is without making it harder to read (in this case).
UPDATE: since you are apparently in a function:
pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
items = []
for paper in tweet:
for item in re.finditer(pattern,paper):
item = item.groupdict()
if float(item["botprob"]) > 0.7:
items.append(item)
return items
Or using comprehensions:
groupdicts = (item.groupdict() for paper in tweet for item in re.finditer(pattern, paper))
return [item for item in groupdicts if float(item["botprob"]) > 0.7]
I would like it to only return dictionaries whose botprob is above 0.7.
entries = [{'username': 'yashrgupta ', 'botprob': ' 0.30794588629999997 '},
           {'username': 'sterector ', 'botprob': ' 0.39391528649999996 '},
           {'username': 'MalcolmXon ', 'botprob': ' 0.05630123819 '},
           {'username': 'ryechuuuuu ', 'botprob': ' 0.08492567222000001 '},
           {'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}]
filtered_entries = [e for e in entries if float(e['botprob'].strip()) > 0.7]
print(filtered_entries)
Output:
[{'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}]

Search for multiple words in a list using python

I'm currently working on my first Python project. The goal is to summarise a webpage's information by searching for and printing sentences that contain a specific word from a word list I generate. For example, the following (large) list contains 'business key terms' I generated by running cewl on business websites:
business_list = ['business', 'marketing', 'market', 'price', 'management', 'terms', 'product', 'research', 'organisation', 'external', 'operations', 'organisations', 'tools', 'people', 'sales', 'growth', 'quality', 'resources', 'revenue', 'account', 'value', 'process', 'level', 'stakeholders', 'structure', 'company', 'accounts', 'development', 'personal', 'corporate', 'functions', 'products', 'activity', 'demand', 'share', 'services', 'communication', 'period', 'example', 'total', 'decision', 'companies', 'service', 'working', 'businesses', 'amount', 'number', 'scale', 'means', 'needs', 'customers', 'competition', 'brand', 'image', 'strategies', 'consumer', 'based', 'policy', 'increase', 'could', 'industry', 'manufacture', 'assets', 'social', 'sector', 'strategy', 'markets', 'information', 'benefits', 'selling', 'decisions', 'performance', 'training', 'customer', 'purchase', 'person', 'rates', 'examples', 'strategic', 'determine', 'matrix', 'focus', 'goals', 'individual', 'potential', 'managers', 'important', 'achieve', 'influence', 'impact', 'definition', 'employees', 'knowledge', 'economies', 'skills', 'buying', 'competitive', 'specific', 'ability', 'provide', 'activities', 'improve', 'productivity', 'action', 'power', 'capital', 'related', 'target', 'critical', 'stage', 'opportunities', 'section', 'system', 'review', 'effective', 'stock', 'technology', 'relationship', 'plans', 'opportunity', 'leader', 'niche', 'success', 'stages', 'manager', 'venture', 'trends', 'media', 'state', 'negotiation', 'network', 'successful', 'teams', 'offer', 'generate', 'contract', 'systems', 'manage', 'relevant', 'published', 'criteria', 'sellers', 'offers', 'seller', 'campaigns', 'economy', 'buyers', 'everyone', 'medium', 'valuable', 'model', 'enterprise', 'partnerships', 'buyer', 'compensation', 'partners', 'leaders', 'build', 'commission', 'engage', 'clients', 'partner', 'quota', 'focused', 'modern', 'career', 'executive', 'qualified', 'tactics', 'supplier', 'investors', 'entrepreneurs', 'financing', 'commercial', 'finances', 'entrepreneurial', 'entrepreneur', 'reports', 'interview', 'ansoff']
And the following program allows me to copy all the text from a URL I specify and organise it into a list, in which the elements are separated by sentence:
from bs4 import BeautifulSoup
import urllib.request as ul

url = input("Enter URL: ")
html = ul.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
for script in soup(["script", "style"]):
    script.decompose()
strips = list(soup.stripped_strings)

# Joining list to form single text
text = " ".join(strips)
text = text.lower()

# Replacing substitutes of '.'
for i in range(len(text)):
    if text[i] in "?!:;":
        text = text.replace(text[i], ".")

# Splitting text by sentences
sentences = text.split(".")
My current objective is for the program to print all sentences that contain one (or more) of the key terms above; however, I've only been successful with a single word at a time:
# Word to search for in the text
word_search = input("Enter word: ")
word_search = word_search.lower()

sentences_with_word = []
for x in sentences:
    if x.count(word_search) > 0:
        sentences_with_word.append(x)

# Separating sentences into separate lines
sentence_text = "\n\n".join(sentences_with_word)
print(sentence_text)
Could somebody demonstrate how this could be achieved for an entire list at once? Thanks.
Edit
As suggested by MachineLearner, here is an example of the output for a single word. If I use Wikipedia's page on marketing for the URL and choose the word 'marketing' as the input for word_search, this is a segment of the output generated (the entire output is almost 600 lines long):
marketing mix the marketing mix is a foundational tool used to guide decision making in marketing
the marketing mix represents the basic tools which marketers can use to bring their products or services to market
they are the foundation of managerial marketing and the marketing plan typically devotes a section to the marketing mix
the 4ps [ edit ] the traditional marketing mix refers to four broad levels of marketing decision
Use a double loop to check multiple words contained in a list:
output = []
for sentence in sentences:
    for word in words:
        if sentence.count(word) > 0:
            output.append(sentence)
            # Do not forget to break out of the inner loop, otherwise the
            # same sentence ends up in the output array once for every
            # key term it contains.
            break
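For what it's worth, the inner loop and break can be folded into a single pass with any(). This is just a compact sketch of the same idea, assuming the sentences list and business_list from the question:

# Keep every sentence that contains at least one of the key terms.
matching = [s for s in sentences if any(term in s for term in business_list)]
print("\n\n".join(matching))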

How to avoid double quoted string , site URL and email address from tokenization

How can I stop word_tokenize from splitting strings like "pass_word", "https://www.gmail.com" and "tempemail#mail.com"? The quotes should prevent it, but they don't.
I have tried with different regex options.
import re
from nltk import word_tokenize

s = 'open "https://www.gmail.com" url. Enter "tempemail#mail.com" in email. Enter "pass_word" in password.'
for phrase in re.findall('"([^"]*)"', s):
    s = s.replace('"{}"'.format(phrase), phrase.replace(' ', '*'))
tokens = word_tokenize(s)
print(tokens)
Actual response:
['open', 'https', ':', '//www.gmail.com', 'url', '.', 'Enter',
'tempemail', '#', 'mail.com', 'in', 'email', '.', 'Enter',
'pass_word', 'in', 'password', '.']
Expected response:
['open', 'https://www.gmail.com', 'url', '.', 'Enter',
'tempemail#mail.com', 'in', 'email', '.', 'Enter',
'pass_word', 'in', 'password', '.']
You can try this:
First, tokenize the text into sentences. If a sentence contains a special character, tokenize it with the str.split() function, otherwise use word_tokenize.
import re
from nltk import sent_tokenize, word_tokenize

tokens = []
for sent in sent_tokenize(s):
    if re.match(r'^\w+$', sent):
        for token in word_tokenize(sent):
            tokens.append(token)
    else:
        for token in sent.split():
            tokens.append(token)
print(tokens)
Outputs:
['open', '"https://www.gmail.com"', 'url.', 'Enter', '"tempemail#mail.com"', 'in', 'email.', 'Enter', '"pass_word"', 'in', 'password.']
EDIT
You can tokenize the periods by further splitting each token on the period character.
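Another option, sketched below under the assumption that the double quotes reliably mark the strings that must survive intact: split the quoted substrings out first with a capturing group, keep them verbatim, and run word_tokenize only on the unquoted remainder.

import re
from nltk import word_tokenize

s = 'open "https://www.gmail.com" url. Enter "tempemail#mail.com" in email. Enter "pass_word" in password.'

# re.split with a capturing group keeps the captured (quoted) pieces in the
# result at the odd indices; the even indices are the unquoted remainder.
parts = re.split(r'"([^"]*)"', s)
tokens = []
for i, part in enumerate(parts):
    if i % 2:                      # quoted content: keep as a single token
        tokens.append(part)
    else:                          # ordinary text: tokenize normally
        tokens.extend(word_tokenize(part))
print(tokens)
# ['open', 'https://www.gmail.com', 'url', '.', 'Enter', 'tempemail#mail.com',
#  'in', 'email', '.', 'Enter', 'pass_word', 'in', 'password', '.']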

How to fix token pattern in scikit-learn?

I am using TfidfVectorizer from scikit-learn to extract features, and the settings are:
import re
import nltk
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for token in tokens:
        token = re.sub("[^a-zA-Z]", "", token)
        stems.append(EnglishStemmer().stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')
After feeding the training set to the vectorizer, I call
vectorizer.get_feature_names()
the output contains some duplicate words with spaces, e.g.:
u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'
And the acceptable output should be:
u'low', u'lower', u'lower high', u'lower low'
How can I solve that? Thank you.
You could do it like the below:
>>> l = ['lower low', 'lower high','lower ', ' lower', u'lower', ' ', '', 'low']
>>> list(set(i.strip() for i in l if i!=' ' and i))
['lower', 'lower low', 'lower high', 'low']
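Assuming the stray entries come from tokens that the re.sub call reduces to the empty string (which then show up on their own and inside space-joined n-gram features), another option is to drop those tokens inside the tokenizer itself, so no cleanup of the feature names is needed afterwards. A minimal sketch:

import re
import nltk
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = EnglishStemmer()  # build the stemmer once instead of once per token

def tokenize(text):
    stems = []
    for token in nltk.word_tokenize(text):
        token = re.sub("[^a-zA-Z]", "", token)
        if token:                       # skip tokens that became empty
            stems.append(stemmer.stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')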
