How to fix token pattern in scikit-learn? - python

I am using TfidfVectorizer from scikit-learn to extract features, and these are my settings:
def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for token in tokens:
        token = re.sub("[^a-zA-Z]", "", token)
        stems.append(EnglishStemmer().stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')
After feeding the training set to the vectorizer, I call
vectorizer.get_feature_names()
the output contains some duplicate words with spaces, e.g.:
u'', u' ', u' low', u' lower', u'lower', u'lower ', u'lower high', u'lower low'
whereas the expected output should be:
u'low', u'lower', u'lower high', u'lower low'
How can I solve that? Thank you.

You could do it like below:
>>> l = ['lower low', 'lower high','lower ', ' lower', u'lower', ' ', '', 'low']
>>> list(set(i.strip() for i in l if i!=' ' and i))
['lower', 'lower low', 'lower high', 'low']
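Alternatively, the empty and whitespace-only entries can be avoided at the source by filtering inside tokenize itself. Here is a minimal sketch, assuming the same NLTK and scikit-learn setup as in the question, that drops tokens which become empty once non-letter characters are stripped:
import re
import nltk
from nltk.stem.snowball import EnglishStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = EnglishStemmer()

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = []
    for token in tokens:
        token = re.sub("[^a-zA-Z]", "", token)
        if token:  # skip tokens that are empty after stripping non-letters
            stems.append(stemmer.stem(token))
    return stems

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=True, stop_words='english')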

Related

How do I get the top 'realistic' hypernym from wordnet synset hyper_paths?

Python3.8 nltk wordnet
How can I find the highest reasonable hypernym for a given word?
I'm using wordnet synset.hypernym_paths(), which, if I traverse it all the way to the top, gives me a hypernym that is way too abstract (i.e. entity).
I've tried creating a list of 'too high' hypernyms but the list is difficult to determine and way too long.
import sys
import nltk
from nltk.corpus import wordnet as wn

arTooHigh = ['act', 'event', 'action', 'artifact', 'instrumentality', 'furnishing', 'organism', 'cognition', 'content', 'discipline', 'humanistic_discipline', 'diversion', 'communication', 'auditory_communication', 'speech', 'state', 'feeling', 'causal_agent', 'think', 'reason', 'information', 'evidence', 'measure', 'fundamental_quantity', 'condition']
arWords = ['work', 'frog', 'bad', 'corpus', 'chair', 'dancing', 'gossip', 'love', 'jerry', 'compute', 'satisfied', 'gift', 'candy', 'bookkeeper', 'construction', 'amplified', 'party', 'dinner', 'family', 'sky', 'office', 'project', 'budget', 'price']

for word in arWords:
    hypernym = word
    last = ''
    synsets = wn.synsets(word)
    syn = synsets[0]
    for path in syn.hypernym_paths():
        for i, ss in reversed(list(enumerate(path))):
            test = ss.lemmas()[0].name()
            if test in arTooHigh:
                hypernym = last
                break
            else:
                last = test
    print(word + ' => ' + hypernym)
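One way to sidestep maintaining the arTooHigh list, sketched here as an assumption rather than a definitive fix, is to take the synset at a fixed depth from the root along the first hypernym path; depth 0 is always entity, so a small depth such as 4 tends to land on a mid-level category:
from nltk.corpus import wordnet as wn

def hypernym_at_depth(word, depth=4):
    # Return the lemma name at the given depth from the root of the first
    # hypernym path, or the word itself if WordNet has no synsets for it.
    synsets = wn.synsets(word)
    if not synsets:
        return word
    path = synsets[0].hypernym_paths()[0]  # runs from 'entity' down to the synset
    return path[min(depth, len(path) - 1)].lemmas()[0].name()

for word in ['frog', 'chair', 'dancing', 'price']:
    print(word + ' => ' + hypernym_at_depth(word))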

Limiting the output

I made a dictionary using the .groupdict() function; however, I am having a problem eliminating certain output dictionaries.
For example, my code looks like this (tweet is a string that contains 5 elements separated by ||):
import re

def somefunction(pattern, tweet):
    pattern = r"^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
    for paper in tweet:
        for item in re.finditer(pattern, paper):
            item.groupdict()
This produces an output in the form:
{'username': 'yashrgupta ', 'botprob': ' 0.30794588629999997 '}
{'username': 'sterector ', 'botprob': ' 0.39391528649999996 '}
{'username': 'MalcolmXon ', 'botprob': ' 0.05630123819 '}
{'username': 'ryechuuuuu ', 'botprob': ' 0.08492567222000001 '}
{'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}
But I would like it to only return dictionaries whose botprob is above 0.7. How do I do this?
Specifically, as @WiktorStribizew notes, just skip the iterations you don't want:
pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
for paper in tweet:
for item in re.finditer(pattern,paper):
item = item.groupdict()
if item["botprob"] < 0.7:
continue
print(item)
This could be wrapped in a generator expression to save the explicit continue, but there's enough going on as it is without making it harder to read (in this case).
UPDATE since you are apparently in a function:
pattern = "^(?P<username>.*?)(?:\|{2}[^|]+){2}\|{2}(?P<botprob>.*?)(?:\|{2}|$)"
items = []
for paper in tweet:
for item in re.finditer(pattern,paper):
item = item.groupdict()
if float(item["botprob"]) > 0.7:
items.append(item)
return items
Or using comprehensions:
groupdicts = (item.groupdict() for paper in tweet for item in re.finditer(pattern, paper))
return [item for item in groupdicts if float(item["botprob"]) > 0.7]
I would like it to only return dictionaries whose botprob is above 0.7.
entries = [{'username': 'yashrgupta ', 'botprob': ' 0.30794588629999997 '},
           {'username': 'sterector ', 'botprob': ' 0.39391528649999996 '},
           {'username': 'MalcolmXon ', 'botprob': ' 0.05630123819 '},
           {'username': 'ryechuuuuu ', 'botprob': ' 0.08492567222000001 '},
           {'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}]
filtered_entries = [e for e in entries if float(e['botprob'].strip()) > 0.7]
print(filtered_entries)
output
[{'username': 'dpsisi ', 'botprob': ' 0.8300337045 '}]

How to get dictionary keys to display in relation to values

I'm looking to categorize some sentences. To do this, I've created a couple of dictionary categories for "Price" and "Product Quality". So far, the code loops through the words within each category and displays the word it found.
I'd also like to add the actual category name, like "Price" or "Product Quality", depending on the values within those keys.
Is there a way to display the key for each category? Currently it's just displaying both "Price" and "Product Quality" for everything.
Here is the code:
data = ["Great price on the dewalt saw", "cool deal and quality", "love it! and the price percent off", "definitely going to buy"]
words = {'price': ['price', 'compare', '$', 'percent', 'money', '% off'],
'product_quality': ['quality', 'condition', 'aspect']}
for d in data:
for word in words.values():
for s in word:
if s in d:
print(id(d), ", ", d, ", ", s, ", ", words.keys())
Here is the output as well:
4398300496 , Great price on the dewalt saw , price , dict_keys(['price', 'product_quality'])
4399544552 , cool deal and quality , quality , dict_keys(['price', 'product_quality'])
4398556680 , love it! and the price percent off , price , dict_keys(['price', 'product_quality'])
4398556680 , love it! and the price percent off , percent , dict_keys(['price', 'product_quality'])
You can use items(), which unpacks into (key, value):
data = ["Great price on the dewalt saw", "cool deal and quality", "love it! and the price percent off", "definitely going to buy"]
words = {'price': ['price', 'compare', '$', 'percent', 'money', '% off'],
'product_quality': ['quality', 'condition', 'aspect']}
for d in data:
for category, word in words.items():
for s in word:
if s in d:
print(id(d), ", ", d, ", ", s, ", ", category)
Out:
(4338487344, ', ', 'Great price on the dewalt saw', ', ', 'price', ', ', 'price')
(4338299376, ', ', 'cool deal and quality', ', ', 'quality', ', ', 'product_quality')
(4338487416, ', ', 'love it! and the price percent off', ', ', 'price', ', ', 'price')
(4338487416, ', ', 'love it! and the price percent off', ', ', 'percent', ', ', 'price')

How to iterate through a TextBlob WordList and find the most common nouns?

I'm scraping tweets from Twitter and I'd like to gather a list of all of the nouns from all of the tweets I'm scraping so I can figure out which nouns occur the most frequently.
def sentiment_script():
    for tweet in tweepy.Cursor(api.search, q=hashtag_phrase + ' -filter:retweets', lang="en", tweet_mode='extended').items(7):
        text = tweet.full_text
        text = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
        blob = TextBlob(text)
        nouns = blob.noun_phrases
        print(nouns)
The output is this:
['covid', 'richmitch']
['uk', 'england', 'uk', 'johnson', 's approach']
['peoria']
['pa', 'surely', 'secretly trying', 'infect', 'covid', 'never wonkette']
['don t', 'full lockdown', 'cancer etc don t', 'full recovery', 'death rate', 'aren t', 'full lockdown']
['datascience team', 'weekly report', 'new data', 'covid', 'may', 'report sheds light', 'business impacts', 'covid', 'read', 'capraplus']
['osdbu', 'small businesses', 'linked', 'covid']
I'm not sure how to proceed next, because when I do this:
print(type(nouns))
the result is
<class 'textblob.blob.WordList'>
<class 'textblob.blob.WordList'>
<class 'textblob.blob.WordList'>
<class 'textblob.blob.WordList'>
<class 'textblob.blob.WordList'>
<class 'textblob.blob.WordList'>
<class 'textblob.blob.WordList'>
A WordList behaves like a regular Python list, so you can collect the noun phrases from every tweet into one list and then count them.
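Here is a minimal sketch of that approach, reusing the question's own Cursor loop and assuming api, hashtag_phrase, tweepy, re and TextBlob are set up exactly as in the original script:
from collections import Counter

all_nouns = []
for tweet in tweepy.Cursor(api.search, q=hashtag_phrase + ' -filter:retweets', lang="en", tweet_mode='extended').items(7):
    text = tweet.full_text
    text = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())
    all_nouns.extend(TextBlob(text).noun_phrases)  # WordList extends into a plain list

print(Counter(all_nouns).most_common(10))  # the ten most frequent noun phrases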

Synsets in Wordnet for NLTK in python

I am using this code to get all synonyms from the text in a document named "answer_tokens.txt", but it only lists the words in the document without their synonyms. Can someone check it out?
from nltk.corpus import wordnet
from nltk import word_tokenize

with open('answer_tokens.txt') as a:  # opening the tokenised answer file
    wn_tokens = a.read()

# printing the answer tokens word by word as opened
print('==========================================')
synonyms = []
for b in word_tokenize(wn_tokens):
    print(str(b))
    for b in wordnet.synsets(b):
        for l in b.lemmas():
            synonyms.append(l.name())
print('==========================================')
print(set(synonyms))
This is the output it is giving:
[
,
'Compare
'
,
'dynamic
'
,
'case
'
,
'data
'
,
'changing
'
,
'
,
'
,
'example
'
,
'watching
'
,
'video
'
]
===================================================
set()
==================================================
This is the output we need
[
,
'Compare
'
,
'dynamic
'
,
'case
'
,
'data
'
,
'changing
'
,
'
,
'
,
'example
'
,
'watching
'
,
'video
'
]
===================================================
'Compare'{'equate', 'comparison', 'compare', 'comparability', 'equivalence', 'liken'}
'dynamic'{'dynamic', 'active', 'dynamical', 'moral_force'}
'case' {'display_case', 'grammatical_case', 'example', 'event', 'causa', 'shell', 'pillow_slip', 'encase', 'character', 'cause', 'font', 'instance', 'type', 'casing', 'guinea_pig', 'slip', 'suit', "typesetter's_case", 'sheath', 'vitrine', 'typeface', 'eccentric', 'lawsuit', 'showcase', 'caseful', 'fount', 'subject', 'pillowcase', "compositor's_case", 'face', 'incase', 'case'}
'data' {'data', 'information', 'datum', 'data_point'}
'changing'{'modify', 'interchange', 'convert', 'alter', 'switch', 'transfer', 'commute', 'change', 'vary', 'deepen', 'changing', 'ever-changing', 'shift', 'exchange'}
'example ' {'example', 'exemplar', 'object_lesson', 'representative', 'good_example', 'exercise', 'instance', 'deterrent_example', 'lesson', 'case', 'illustration', 'model'}
'watching' {'watch', 'observation', 'view', 'watching', 'watch_out', 'check', 'look_on', 'ascertain', 'learn', 'watch_over', 'observe', 'follow', 'observance', 'take_in', 'look_out', 'find_out', 'keep_an_eye_on', 'catch', 'determine', 'see'}
'video' {'video_recording', 'video', 'television', 'picture', 'TV', 'telecasting'}
==================================================
First of all, the initialisation synonyms = [] should be made for each token separately, since you want to build a different list of synonyms for each token. So I've moved that instruction into the for loop iterating over the tokens.
The second problem with your code is that you use the same variable name both for iterating over the tokens and for iterating over the synsets of the current token. This way you lose the reference to the token itself, making it impossible to print afterwards.
Lastly, the printing of the set of synonyms should be done for each token (as you yourself stated in the question), so I have moved the print statement to the end of the for loop iterating over the tokens.
Here is the code:
from nltk.corpus import wordnet as wn
from nltk import word_tokenize as wTok

with open('answer_tokens.txt') as a:  # opening the tokenised answer file
    wn_tokens = a.read()

print(wn_tokens)
print('==========================================')
for token in wTok(wn_tokens):
    synonyms = []
    for syn in wn.synsets(token):
        for l in syn.lemmas():
            synonyms.append(l.name())
    print(token, set(synonyms))
print('==========================================')
Hope this helps!
P.S. It could be useful to give aliases to your imports in order to streamline the coding process. I have given the wn alias to the wordnet module and the wTok alias to the word_tokenize function.
