Extract product name from English text - Python

I want to extract the names of products being sold from English text.
For example:
"I'm selling my xbox brand new"
"Selling rarely used 27 inch TV"
Should give me "xbox" and "27 inch TV"
The only thing I can think of at the moment is to hardcode in a giant list of important nouns and important adjectives: ['tv', 'fridge', 'xbox', 'laptop', etc]
Is there a better approach?

It looks like NLTK will give you a list of words and their parts of speech. Since you are only interested in nouns, you can filter the tagged tokens down to those:
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]
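To get phrases such as "27 inch TV" rather than single nouns, one option is to group runs of adjacent noun, adjective, and number tags. The following is a minimal sketch, not a definitive method: noun_phrases and its modifiers parameter are hypothetical names, and the input is the tagged output shown above rather than a live call to the tagger.

```python
def noun_phrases(tagged, modifiers=("JJ", "CD")):
    """Group adjacent modifier/noun tokens into phrases that end in a noun.

    `tagged` is a list of (word, tag) pairs as returned by nltk.pos_tag.
    The choice of modifier tags (adjectives, numbers) is an assumption.
    """
    phrases, run = [], []

    def flush():
        # Drop trailing modifiers so every phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if run:
            phrases.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged:
        if tag.startswith("NN") or tag in modifiers:
            run.append((word, tag))
        else:
            flush()
    flush()
    return phrases


# Tagged tokens copied from the pos_tag output above.
tagged = [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'),
          ('is', 'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'),
          ('bad', 'JJ'), ('.', '.')]
print(noun_phrases(tagged))  # ['John', 'big idea']
```

How well this works on listings like "Selling rarely used 27 inch TV" depends on the tags the tagger assigns, so it is worth checking on real data before relying on it.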

Related

Incorrect POS Tagging with NLTK

I am trying to implement chunking using NLTK, but I am having problems assigning POS tags to the words. When the text is passed exactly as it appears in the file, the tagger gives different output and the tags are wrong in many cases. When I convert the text to lowercase, it assigns the tags correctly, but then the chunking is not correct.
My Code
import nltk
from nltk.corpus import state_union
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sample_text = state_union.raw("2006-GWBush.txt")
words = nltk.word_tokenize(sample_text.lower()) # it makes correct pos tags in this case
words = nltk.word_tokenize(sample_text) # it does not make correct pos tags in this case
tagged = nltk.pos_tag(words)
print(tagged)
chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
chunkParser = nltk.RegexpParser(chunkGram)
chunked = chunkParser.parse(tagged)
for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
    print(subtree)
Output when POS Tagging is correct:
[('president', 'NN'), ('george', 'NN'), ('w.', 'VBD'), ('bush', 'NN'), ("'s", 'POS'), ('address', 'NN'), ('before', 'IN'), ('a', 'DT'), ('joint', 'JJ'), ('session', 'NN'),
and so on....
No chunks are printed in this case.
Output when POS Tagging is incorrect:
[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'),
and so on...
Chunks are printed in this case, but because the POS tags are wrong, the chunks are not correct.
(Chunk PRESIDENT/NNP GEORGE/NNP W./NNP BUSH/NNP)
(Chunk ADDRESS/NNP)
(Chunk A/NNP JOINT/NNP SESSION/NNP)
(Chunk THE/NNP CONGRESS/NNP ON/NNP THE/NNP STATE/NNP)
Why does POS tagging produce different results when the words are in different cases? I am new to NLTK, so I am unable to work out the issue. Thanks
Data file
The data file used can be found here in the NLTK library. It is already there in NLTK when you download it in your system. You won't need to import it separately.

Extract action/task from text using nltk

Hi, I am using NLTK for the first time and I want to extract actions/tasks from text:
Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july
In the text above, "the speech to action" and "UI" are the actions.
I have started with tokenization but don't know what to do next. Please guide me.
from nltk import sent_tokenize
sample_text ="""Hi prakash, how are you ?. We need to complete the speech to action demo by 8 June then you will have to finish the Ui by 15 july"""
sentences = sent_tokenize(sample_text)
print(sentences)
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
sample_text = """Hi prakash, how are you ?. We need to complete the speech to action by today
then you will have to finish the UI by 15 july after that you may go finish the mobile view"""
sample_text = "need to complete the speech to action by today"
tokens = word_tokenize(sample_text.lower())
# lower() is required, as "June" and "june" get different tags (NNP vs NN)
pos_tags = pos_tag(tokens)
result = []
for i in range(len(tokens)):
    if (pos_tags[i][1] == 'VB') and (pos_tags[i][0] in ['complete', 'finish']):
        # Here we are looking for verbs like (finish, complete, done)
        owner = ''
        for back_tag in pos_tags[:i][::-1]:
            # traverse backwards to find the owner who will (finish, complete, done)
            if back_tag[1] == 'PRP':
                owner = back_tag[0]
                break
        message = ''
        date = ''
        for message_index, token in enumerate(pos_tags[i:], i):
            # traverse forwards to find what has to be done
            if token[1] == 'IN':
                for date_index, date_lookup in enumerate(pos_tags[message_index:], message_index):
                    if date_lookup[1] == 'NN':
                        date = pos_tags[date_index-1][0] + ' ' + pos_tags[date_index][0]
                    if date_lookup[1] == 'PRP':
                        # This is a trick to stop further propagation:
                        # a pronoun (I/we/you) marks the start of the next clause,
                        # so stop iterating here
                        break
                break
            else:
                message = message + ' ' + token[0]
        result += [dict(owner=owner, message=message, date=date)]
print(result)
Please guide me on how to extract the actions (action demo, UI) from the paragraph.
If you're using NLTK, you can get the POS tags of your tokens and come up with a regex or pattern over those tags. For example, an action will be a verb. (For better tagging, you may want spaCy. There is also a library called Pattern for these purposes.)
But I'm not sure if this is going to help you a lot for a scaled application.
N.B: There are well-trained Named Entity Recognizers available, you may try them.
Here are my thoughts:
If I try to identify the parts of speech for your sentence using nltk.tag.pos_tag, I get the following:
import nltk
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
s = 'Hi prakash, how are you ?. We need to complete the speech to action by 8 June then you will have to finish the UI by 15 july'
tokens = word_tokenize(s)
print(pos_tag(tokens))
Output:
[('Hi', 'NNP'), ('prakash', 'NN'), (',', ','), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('?', '.'), ('.', '.'), ('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'), ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'), ('by', 'IN'), ('8', 'CD'), ('June', 'NNP'), ('then', 'RB'), ('you', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('to', 'TO'), ('finish', 'VB'), ('the', 'DT'), ('UI', 'NNP'), ('by', 'IN'), ('15', 'CD'), ('july', 'NN')]
If you look closely, each action phrase ("speech to action", "UI") comes right after a verb tagged 'VB' ('complete' and 'finish' respectively).
I would suggest approaching the problem with the steps below:
1) Find a verb in the sentence (something like the following):
tagged = pos_tag(tokens)
for i in range(len(tagged)):
    if tagged[i][1] == 'VB':
        print(tagged[i][0])  # a candidate action verb
2) If one is found, fetch the following words based on their POS tags (for example, collect all following words until you reach an 'IN' tag).
This may work for your dataset.
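The two steps can be sketched roughly as follows. This is only an illustration under a few assumptions: extract_actions is a hypothetical helper, the list of trigger verbs is borrowed from the question's code, determiners are skipped so that "the UI" becomes "UI", and the input is the tagged output shown above rather than a fresh call to pos_tag.

```python
def extract_actions(tagged, verbs=("complete", "finish")):
    """Return (verb, action phrase) pairs from a list of (word, tag) pairs.

    For each trigger verb tagged 'VB', collect the following words until
    a preposition ('IN') is reached. Determiners ('DT') are skipped.
    """
    actions = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "VB" and word.lower() in verbs:
            phrase = []
            for next_word, next_tag in tagged[i + 1:]:
                if next_tag == "IN":
                    break
                if next_tag != "DT":
                    phrase.append(next_word)
            actions.append((word, " ".join(phrase)))
    return actions


# Tagged tokens copied from the output above.
tagged = [('Hi', 'NNP'), ('prakash', 'NN'), (',', ','), ('how', 'WRB'),
          ('are', 'VBP'), ('you', 'PRP'), ('?', '.'), ('.', '.'),
          ('We', 'PRP'), ('need', 'VBP'), ('to', 'TO'), ('complete', 'VB'),
          ('the', 'DT'), ('speech', 'NN'), ('to', 'TO'), ('action', 'NN'),
          ('by', 'IN'), ('8', 'CD'), ('June', 'NNP'), ('then', 'RB'),
          ('you', 'PRP'), ('will', 'MD'), ('have', 'VB'), ('to', 'TO'),
          ('finish', 'VB'), ('the', 'DT'), ('UI', 'NNP'), ('by', 'IN'),
          ('15', 'CD'), ('july', 'NN')]
print(extract_actions(tagged))
# [('complete', 'speech to action'), ('finish', 'UI')]
```

Without the verb list, 'have' (also tagged 'VB') would match as well, so some filtering of trigger verbs is probably needed in practice.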

Tokenizing and POS tagging in Python from CSV file

I am a newbie in Python and would like to do POS tagging after importing a CSV file from my local machine. I looked up some resources online and found that the following code works.
text = ('Senator Elizabeth Warren from Massachusetts announced her support of '
        'Social Security in Washington, D.C. on Tuesday. Warren joined other '
        'Democrats in support.')
import nltk
from nltk import tokenize
sentences = tokenize.sent_tokenize(text)
sentences
from nltk.tokenize import TreebankWordTokenizer
texttokens = []
for sent in sentences:
    texttokens.append(TreebankWordTokenizer().tokenize(sent))
texttokens
from nltk.tag import pos_tag
taggedsentences = []
for sentencetokens in texttokens:
    taggedsentences.append(pos_tag(sentencetokens))
taggedsentences
print(taggedsentences)
Since I printed it, the result from the code above looks like this.
[[('Senator', 'NNP'), ('Elizabeth', 'NNP'), ('Warren', 'NNP'), ('from',
'IN'), ('Massachusetts', 'NNP'), ('announced', 'VBD'), ('her', 'PRP$'),
('support', 'NN'), ('of', 'IN'), ('Social', 'NNP'), ('Security', 'NNP'),
('in', 'IN'), ('Washington', 'NNP'), (',', ','), ('D.C.', 'NNP'), ('on',
'IN'), ('Tuesday', 'NNP'), ('.', '.')], [('Warren', 'NNP'), ('joined',
'VBD'), ('other', 'JJ'), ('Democrats', 'NNPS'), ('in', 'IN'), ('support',
'NN'), ('.', '.')]]
This is the result I want, but I would like to get it after importing a CSV file that contains several rows (each row contains several sentences). For example, the CSV file looks like this:
---------------------------------------------------------------
I like this product. This product is beautiful. I love it.
---------------------------------------------------------------
This product is awesome. It have many convenient features.
---------------------------------------------------------------
I went this restaurant three days ago. The food is too bad.
---------------------------------------------------------------
In the end, I would like to save the POS tagging results shown above after importing the CSV file. I would like to write each POS-tagged sentence in each row out in CSV format.
Two formats might be possible. The first might be as follows (no header, each POS-tagged sentence in one row).
----------------------------------------------------------------------------
[[('I', 'PRON'), ('like', 'VBD'), ('this', 'PRON'), ('product', 'NN')]]
----------------------------------------------------------------------------
[[('This', 'PRON'), ('product', 'NN'), ('is', 'VERB'), ('beautiful', 'ADJ')]]
---------------------------------------------------------------------------
[[('I', 'PRON'), ('love', 'VERB'), ('it', 'PRON')]]
----------------------------------------------------------------------------
...
The second format might look like this (no header, each (token, POS tag) pair saved in one cell):
----------------------------------------------------------------------------
('I', 'PRON') | ('like', 'VBD') | ('this', 'PRON') | ('product', 'NN')
----------------------------------------------------------------------------
('This', 'PRON') | ('product', 'NN') | ('is', 'VERB') | ('beautiful', 'ADJ')
---------------------------------------------------------------------------
('I', 'PRON') | ('love', 'VERB') | ('it', 'PRON') |
----------------------------------------------------------------------------
...
I prefer the second format to the first one.
The Python code above works, but I would like to do the same thing for a CSV file and save the result on my local machine.
The final purpose is to extract only noun-type words (e.g., NN, NNP) from the sentences.
Can somebody help me fix the Python code?
Please refer to the question already answered here. You can tag the text and then filter out just the nouns, as described in that post (SO link).
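For the CSV side of the question, a rough sketch using the standard csv module might look like the following. The helper names (read_texts, keep_nouns, write_tagged) and the "token/TAG" cell format are assumptions made for illustration. The tagging step itself is left to the question's own code, and in-memory files stand in for real paths.

```python
import csv
import io


def read_texts(csv_file):
    """Return the first cell of every non-empty CSV row."""
    return [row[0] for row in csv.reader(csv_file) if row]


def keep_nouns(tagged_sentence):
    """Keep only (token, tag) pairs whose tag starts with 'NN'."""
    return [(tok, tag) for tok, tag in tagged_sentence if tag.startswith("NN")]


def write_tagged(tagged_sentences, csv_file):
    """Write one sentence per row, one 'token/TAG' string per cell."""
    writer = csv.writer(csv_file)
    for sentence in tagged_sentences:
        writer.writerow([f"{tok}/{tag}" for tok, tag in sentence])


# Reading: io.StringIO stands in for open("input.csv", newline="").
source = io.StringIO("I like this product. This product is beautiful.\n"
                     "This product is awesome.\n")
texts = read_texts(source)

# Writing: a tagged sentence copied from the output in the question.
tagged = [[('Warren', 'NNP'), ('joined', 'VBD'), ('other', 'JJ'),
           ('Democrats', 'NNPS'), ('in', 'IN'), ('support', 'NN'),
           ('.', '.')]]
out = io.StringIO()
write_tagged([keep_nouns(sentence) for sentence in tagged], out)
print(out.getvalue())
```

Each string in texts can then be passed through the sent_tokenize / pos_tag loop from the question before calling write_tagged.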

nltk tokenize measurement units

I'm trying to extract measurements from a messy dataset.
Some basic example entries would be:
1.5 gram of paracetamol
1.5g of paracetamol
1.5 grams. of paracetamol
I'm trying to extract the measurement and units for each entry so the result for all the above should be:
(1.5, g)
Some other questions proposed the use of NLTK for such a task but I'm running into trouble when doing the following:
import nltk
s1 = "1.5g of paracetamol"
s2 = "1.5 gram of paracetamol"
words_s1 = nltk.word_tokenize(s1)
words_s2 = nltk.word_tokenize(s2)
nltk.pos_tag(words_s1)
nltk.pos_tag(words_s2)
Which returns
[('1.5g', 'CD'), ('of', 'IN'), ('paracetamol', 'NN')]
[('1.5', 'CD'), ('gram', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
The problem is that the unit 'g' is being kept as part of the CD in the first example. How could I get the following result?
[('1.5', 'CD'), ('g', 'NN'), ('of', 'IN'), ('paracetamol', 'NN')]
On the real data set the units are much more varied (mg, miligrams, kg, kgrams. ...)
Thanks!
You need to tokenize the sentence yourself, for example with nltk.regexp_tokenize:
words_s1 = nltk.regexp_tokenize(s1, r'(?u)\d+(?:\.\d+)?|\w+')
Obviously, it needs to be improved to deal with more complicated cases.
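If only the (value, unit) pair is needed, the same regular-expression idea can be applied directly with the standard re module. The sketch below is a starting point, not a complete solution: UNIT_ALIASES is a hypothetical, hand-written mapping that would have to be extended for the real data, and extract_measurement is an illustrative name.

```python
import re

# Hypothetical alias table; extend it to cover the units in the real data.
UNIT_ALIASES = {
    "g": "g", "gram": "g", "grams": "g",
    "mg": "mg", "milligram": "mg", "milligrams": "mg", "miligrams": "mg",
    "kg": "kg", "kilogram": "kg", "kilograms": "kg", "kgrams": "kg",
}

# A number (optionally with a decimal part), optional spaces, then letters.
MEASURE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def extract_measurement(text):
    """Return (value, normalized unit) for the first measurement, or None."""
    match = MEASURE_RE.search(text)
    if match is None:
        return None
    value, unit = match.groups()
    unit = unit.lower()
    # Unknown units are passed through unchanged.
    return float(value), UNIT_ALIASES.get(unit, unit)


for entry in ["1.5 gram of paracetamol",
              "1.5g of paracetamol",
              "1.5 grams. of paracetamol"]:
    print(extract_measurement(entry))  # (1.5, 'g') for each entry
```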

How to add compound words to the tagger in NLTK?

So, I was wondering if anyone had any idea how to combine multiple terms into a single term in the taggers in NLTK.
For example, when I do:
nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))
It gives me:
[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
How do I make it put 'Apple' and 'Incorporated' together as ('Apple Incorporated', 'NNP')?
You could try taking a look at nltk.RegexpParser. It allows you to chunk part-of-speech-tagged content based on regular expressions.
In your example, you could do something like
pattern = "NP:{<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print(t)
This would give you:
Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
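If the goal is literally a single ('Apple Incorporated', 'NNP') tuple, one simple alternative is to merge consecutive tokens that share the NNP tag after tagging. This is a sketch of that idea using only the standard library. merge_proper_nouns is a hypothetical helper, not part of NLTK, and the input is the tagged output from the question.

```python
from itertools import groupby


def merge_proper_nouns(tagged, tag_to_merge="NNP"):
    """Join runs of adjacent tokens tagged `tag_to_merge` into one token."""
    merged = []
    # groupby yields runs of consecutive pairs that share the same tag.
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == tag_to_merge:
            merged.append((" ".join(words), tag))
        else:
            merged.extend((word, tag) for word in words)
    return merged


# Tagged tokens copied from the question.
tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
print(merge_proper_nouns(tagged))
# [('Apple Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'),
#  ('largest', 'JJS'), ('company', 'NN')]
```

Note that this merges any adjacent proper nouns, including unrelated ones that happen to sit next to each other, so it may over-merge on some text.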
The code is doing exactly what it is supposed to do. It is adding Part Of Speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, and as such can't have a single POS tag applied to it. This is the correct behaviour.
I wonder if you are trying to use the wrong tool for the job. What are you trying to do / Why are you trying to do it? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look here:
collocations module
