Named Entity Recognition from a personal gazetteer using Python

I am trying to do named entity recognition in Python using NLTK.
I want to extract a personal list of skills.
I have a list of skills and would like to search for them in job requisitions and tag the matches.
I noticed that NLTK has NER tags for predefined categories such as Person, Location, etc.
Is there an external gazetteer tagger in Python I can use?
Any idea how to do this in a more sophisticated way than a plain search for terms (which are sometimes multi-word)?
Thanks,
Assaf

I haven't used NLTK much recently, but if you already have the words you know are skills, you don't need NER, just a text search.
Maybe use Lucene or some other search library to find the text, and then annotate it? That's a lot of work, but if you are working with a lot of data it might be worth it. Alternatively, you could hack together a regex search, which will be slower but will probably work fine for smaller amounts of data and will be much easier to implement.

Have a look at RegexpTagger and possibly RegexpParser; I think that's exactly what you are looking for.
You can create your own POS tags, i.e. map skills to a tag, and then easily define a grammar.
Some sample code for the tagger is in this PDF.
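To show what the plain-search route could look like for multi-word terms, here is a minimal pure-Python sketch of a gazetteer tagger using longest-match over tokens; the skill list is a made-up example:

```python
# Minimal gazetteer tagger: longest-match search for (possibly multi-word)
# skills in tokenized text. The skill list is a hypothetical example.
SKILLS = {"python", "machine learning", "natural language processing"}

def tag_skills(text, skills=SKILLS):
    """Return (skill, start_token, end_token) for every gazetteer hit."""
    tokens = text.lower().split()
    max_len = max(len(s.split()) for s in skills)
    hits, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first so that
        # "natural language processing" beats a shorter overlap.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in skills:
                hits.append((phrase, i, i + n))
                i += n
                break
        else:
            i += 1
    return hits

print(tag_skills("Expert in Python and natural language processing"))
# [('python', 2, 3), ('natural language processing', 4, 7)]
```

Real text would need proper tokenization (punctuation, hyphens) and possibly lemmatization, but the longest-match loop is the core of handling multi-word terms.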

Related

Can I use Natural Language Processing while identifying given words in a paragraph, or do I need to use machine learning algorithms?

I need to identify certain given words in a sentence using NLP.
As an example, take the sentence:
Mary lives in France
Suppose the given words are Australia, Germany, and France. This sentence includes only France.
So among the three given words, I need to identify that the sentence includes only France.
I would comment, but I don't have enough reputation. It's a bit unclear exactly what you are trying to achieve here and how representative your example is, so please edit your question to make it clearer.
Anyhow, as Guy Coder says, if you know exactly which words you are looking for, you don't really need machine learning or NLP libraries at all. However, if that is not the case, and you don't have every example of what you are looking for, the following might help:
It seems like what you are trying to do is perform Named Entity Recognition (NER), i.e. identify the named entities (e.g. countries) in your sentences. If so, the short answer is: you don't need to use any machine learning algorithms. You can just use a Python library such as spaCy, which comes out of the box with a pretrained language model that can already perform a bunch of tasks, for instance NER, to a high degree of performance. The following snippet should get you started:
import spacy

nlp = spacy.load("en_core_web_sm")  # the old "en" shortcut was removed in spaCy v3
doc = nlp("Mary Lives in France")
for entity in doc.ents:
    if entity.label_ == "GPE":
        print(entity.text)
The output of the above snippet is "France". Named entities cover a wide range of possible things. In the snippet above I have filtered for Geopolitical entities (GPE).
Learn more about spaCy here: https://spacy.io/usage/spacy-101

Extracting and ranking keywords from short text

I am working on a project to extract keywords from short texts (3-4 sentences). Using the spaCy library, I extract noun phrases and named entities and use them as keywords. However, I would like to sort them based on their importance with respect to the original text.
I tried standard information retrieval approaches, like TF-IDF, and even a couple of graph-based algorithms, but with such short texts the results weren't great.
I was thinking that maybe using a neural network with an attention mechanism could help me rank those keywords. Is there any way to use the pre-trained models that come with spaCy to do some kind of ranking?
How about something like maximal marginal relevance? http://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf
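For anyone curious what MMR looks like in practice, here is a minimal sketch in plain Python. Jaccard overlap on tokens stands in for a real similarity measure (in practice you would use something like cosine similarity over spaCy vectors), and the document and candidate list are made-up examples:

```python
# Sketch of Maximal Marginal Relevance (MMR) for ranking candidate
# keywords: trade off relevance to the document against redundancy
# with keywords already selected.
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def mmr_rank(doc, candidates, k=3, lam=0.7):
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(c):
            relevance = jaccard(c, doc)
            redundancy = max((jaccard(c, s) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

doc = "spacy extracts noun phrases and named entities from short text"
candidates = ["noun phrases", "named entities",
              "noun phrases extraction", "short text"]
print(mmr_rank(doc, candidates, k=2))
# ['noun phrases', 'named entities']
```

Note how "noun phrases extraction" is penalized once "noun phrases" is selected; that redundancy term is what distinguishes MMR from plain relevance ranking.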

NLP : What are some common verbs surrounding organization names in text

I am trying to come up with some rules to detect named entities, specifically company or organization names, in text. I think it makes sense to focus on verbs. There are a lot of POS taggers that can easily detect proper nouns; I personally like StanfordPOSTagger. Now, once I have a proper noun, I know that it is a named entity. However, to be certain that it is the name of a company, I need to come up with rules and possibly gazetteers.
I was thinking of focusing on verbs. Is there a set of common verbs that occur frequently around company names?
I could create an annotated corpus and explicitly train a machine learning classifier to predict such verbs, but that is a LOT of work. It would be great if someone has already done some research on this.
Additionally, can some other POS tags give clues? Not just verbs.
The verbs approach seems the most promising. I've been working on something myself to identify sentient beings from folktales. See more about my approach here: http://www.aaai.org/ocs/index.php/INT/INT7/paper/viewFile/9253/9204
You may still need to do some annotations and training OR use web text and the method below to find the training data.
If you are looking for real companies (i.e. non-fictional), then I'd suggest you just extract referring expressions (i.e. nouns and also multi-word expressions) and then check against an online database (some with easy-to-use APIs), like:
https://angel.co/api (startups)
https://data.crunchbase.com/
http://www.metabase.com/
http://www.opencalais.com/ (paid options)
http://wiki.dbpedia.org/ (wikipedia)
Does the Stanford NER system fit this use-case? It already detects organizations, alongside people and other named entity types.
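If you go the verb route, one way to gather evidence is to count the verbs that appear in a small window around known company mentions in POS-tagged text. Here is a minimal sketch; the gazetteer and the hand-tagged sentences are hypothetical stand-ins for real tagger output:

```python
from collections import Counter

# Count verbs (Penn Treebank VB* tags) occurring within a token window
# around mentions of known organizations. The company list and tagged
# sentences below are made-up examples.
COMPANIES = {"Acme", "Globex"}

tagged_sentences = [
    [("Acme", "NNP"), ("acquired", "VBD"), ("a", "DT"), ("startup", "NN")],
    [("Globex", "NNP"), ("announced", "VBD"), ("record", "JJ"), ("profits", "NNS")],
    [("Investors", "NNS"), ("sued", "VBD"), ("Acme", "NNP")],
]

def verbs_near_orgs(sentences, companies, window=2):
    counts = Counter()
    for sent in sentences:
        org_positions = [i for i, (w, _) in enumerate(sent) if w in companies]
        for i in org_positions:
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for word, tag in sent[lo:hi]:
                if tag.startswith("VB"):
                    counts[word.lower()] += 1
    return counts

print(verbs_near_orgs(tagged_sentences, COMPANIES))
```

Run over a large corpus with a seed gazetteer, the most frequent verbs in that counter ("acquired", "announced", "sued", ...) are exactly the contextual cues the question is asking about.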

Twitter search word analysis

I'm attempting some word analysis on a large collection of tweets.
I'm pulling tweets based on a search query, and I then want to somehow find the keywords that appear often and that are related to the original query.
I'm not quite sure how to go about this in a reasonably effective manner though. I'm currently just removing stopwords then finding the words that occur most, but this is a bit more basic than I'd like.
Does anyone have any suggestions for this sort of thing (or even links to any reading on the topic)?
Any help greatly appreciated.
(my implementation is in Python, if that's relevant)
For semantic reasoning about the content of a tweet, you should definitely try the NLTK (Natural Language Toolkit Package). It's capable of quite sophisticated analysis of text.
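As a step up from raw frequency counts, you could score words by how often they co-occur in the same tweet as the query term, which keeps the keywords tied to the original search. A minimal sketch (the stopword list and tweets are illustrative placeholders; NLTK ships a much fuller stopword list):

```python
import re
from collections import Counter

# Rank words by co-occurrence with the query term: only tweets that
# contain the query contribute counts, so frequent-but-unrelated words
# from the rest of the collection don't dominate.
STOPWORDS = {"the", "a", "is", "in", "on", "and", "of", "to", "for"}

def related_keywords(tweets, query, top_n=5):
    counts = Counter()
    for tweet in tweets:
        words = re.findall(r"[a-z']+", tweet.lower())
        if query in words:
            counts.update(w for w in words if w not in STOPWORDS and w != query)
    return counts.most_common(top_n)

tweets = [
    "python is great for data analysis",
    "learning python for data science",
    "the weather is great today",
]
print(related_keywords(tweets, "python"))
```

Note that "great" appears in the weather tweet too, but only its co-occurrence with "python" is counted; scoring by pointwise mutual information instead of raw co-occurrence counts is a natural next refinement.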

Parsing Meaning from Text

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:
"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",
what's a lightweight/easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).
To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.
You need to look at the Natural Language Toolkit, which is for exactly this sort of thing.
This section of the manual looks very relevant: Categorizing and Tagging Words - here's an extract:
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
Here we see that and is CC, a coordinating conjunction; now and completely are RB, or adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.
Use the NLTK, in particular chapter 7 on Information Extraction.
You say you want to extract meaning, and there are modules for semantic analysis, but I think IE is all you need--and honestly one of the only areas of NLP computers can handle right now.
See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramirez as a person, the Dodgers as a sports organization, and the Houston Astros as another sports organization, or whatever suits your domain) and Relationship Extraction. There is an NER chunker that you can plug in once you have NLTK installed. From their examples, extracting a geopolitical entity (GPE) and a person:
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent))
(S
The/DT
(GPE U.S./NNP)
is/VBZ
one/CD
...
according/VBG
to/TO
(PERSON Brooke/NNP T./NNP Mossman/NNP)
...)
Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.
Natural Language Processing (NLP) is the name for parsing, well, natural language. Many algorithms and heuristics exist, and it's an active field of research. Whatever algorithm you will code, it will need to be trained on a corpus. Just like a human: we learn a language by reading text written by other people (and/or by listening to sentences uttered by other people).
In practical terms, have a look at the Natural Language Toolkit. For a theoretical underpinning of whatever you are going to code, you may want to check out Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze.
Here is the book I stumbled upon recently: Natural Language Processing with Python
What you want is called NP (noun phrase) chunking, or extraction.
Some links here
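As a rough illustration of what NP chunking does, here is a minimal rule-based sketch over already-POS-tagged tokens (Penn Treebank tags, as in the NLTK examples above). The grammar "optional determiner, any adjectives, one or more nouns" is a common starting point, not the only possible one:

```python
import re

# Rule-based NP chunking over (word, tag) pairs: match the tag sequence
# against the pattern DT? JJ* (NN|NNP|NNS)+ and map the matched tag
# positions back to the words.
def np_chunks(tagged):
    tag_string = " ".join(tag for _, tag in tagged) + " "
    chunks = []
    for m in re.finditer(r"(?:DT )?(?:JJ )*(?:NNP? |NNS )+", tag_string):
        start = tag_string[:m.start()].count(" ")
        end = start + m.group().count(" ")
        chunks.append(" ".join(w for w, _ in tagged[start:end]))
    return chunks

tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("dog", "NN")]
print(np_chunks(tagged))
# ['the quick brown fox', 'the dog']
```

NLTK's RegexpParser does the same thing more robustly (and returns a tree rather than strings); this sketch just makes the mechanism visible.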
As pointed out, this is very problem domain specific stuff. The more you can narrow it down, the more effective it will be. And you're going to have to train your program on your specific domain.
This is a really really complicated topic. Generally, this sort of stuff falls under the rubric of Natural Language Processing, and tends to be tricky at best. The difficulty of this sort of stuff is precisely why there still is no completely automated system for handling customer service and the like.
Generally, the approach to this stuff REALLY depends on precisely what your problem domain is. If you're able to winnow down the problem domain, you can gain some very serious benefits; to use your example, if you're able to determine that your problem domain is baseball, then that gives you a really strong head start. Even then, it's a LOT of work to get anything particularly useful going.
For what it's worth, yes, an existing corpus of words is going to be useful. More importantly, determining the functional complexity expected of the system is going to be critical; do you need to parse simple sentences, or is there a need for parsing complex behavior? Can you constrain the inputs to a relatively simple set?
Regular expressions can help in some scenarios. Here is a detailed example: What's the Most Mentioned Scanner on CNET Forum, which used a regular expression to find all mentioned scanners in CNET forum posts.
In the post, a regular expression as such was used:
(?i)((?:\w+\s\w+\s(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)\s(\w+\s){0,1}(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner))|(?:(?:scanner|photo scanner|flatbed scanner|adf scanner|scanning|document scanner|printer scanner|portable scanner|handheld scanner|printer\/scanner)\s(\w+\s){1,2}(?:(?:(?:[0-9]+[a-z\-]|[a-z]+[0-9\-]|[0-9])[a-z0-9\-]*)|all-in-one|all in one)))
in order to match either of the following:
two words, then a model number (including "all-in-one"), then "scanner"
"scanner", then one or two words, then a model number (including "all-in-one")
As a result, the text extracted from the post was like,
discontinued HP C9900A photo scanner
scanning his old x-rays
new Epson V700 scanner
HP ScanJet 4850 scanner
Epson Perfection 3170 scanner
This regular expression solution worked in a way.
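The core idea of that pattern can be illustrated with a much simpler regex: a model-number-like token (a word containing a digit) followed by the word "scanner". This is a stripped-down sketch, not the full pattern from the post:

```python
import re

# Simplified version of the approach above: optional brand word, then a
# token containing a digit (the model number), then an optional type
# word, then "scanner". Case-insensitive.
pattern = re.compile(
    r"\b(?:\w+\s)?[A-Za-z]*\d[\w-]*\s(?:photo\s|flatbed\s)?scanner\b",
    re.I,
)

text = ("I bought a new Epson V700 scanner after my "
        "HP C9900A photo scanner was discontinued.")
print(pattern.findall(text))
# ['Epson V700 scanner', 'HP C9900A photo scanner']
```

Requiring a digit in the model token is what keeps phrases like "new scanner" from matching; the full pattern in the post adds the reversed word order and the "all-in-one" variants.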
