How to classify unseen text data?

I am training a text classifier for addresses: given a sentence, it should decide whether the sentence is an address or not.
Example sentences:
(1) Mirdiff City Centre, DUBAI United Arab Emirates
(2) Ultron Inc. <numb> Toledo Beach Rd #1189 La Salle, MI <numb>
(3) Avenger - HEAD OFFICE P.O. Box <numb> India
As addresses can come in any number of formats, it's very difficult to build such a classifier. Is there a pre-trained model or database for this, or any other non-ML way?

As mentioned earlier, verifying that an address is valid is probably better formalized as an information retrieval problem than as a machine learning problem (e.g. by using a service).
However, from the examples you gave, it seems like you have several entity types that recur, such as organizations and locations.
I'd recommend enriching the data with an NER tool, such as spaCy, and using the entity types as either a feature or a rule.
Note that named-entity recognizers rely more on context than the typical bag-of-words classifier, and are usually more robust to unseen data.
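As a rough illustration of using entity types as a feature or a rule, here is a minimal sketch. The label set, the `min_hits` threshold and the helper names are assumptions made for this example; `spacy_labels` assumes spaCy and its `en_core_web_sm` model are installed, while the other helpers work on any list of labels.

```python
from collections import Counter

# Entity labels that often show up in postal addresses (an assumption, not a fixed rule).
ADDRESS_LABELS = {"ORG", "GPE", "LOC", "FAC"}


def entity_features(labels):
    """Turn a list of NER labels into simple count features."""
    counts = Counter(labels)
    features = {f"n_{label}": counts.get(label, 0) for label in sorted(ADDRESS_LABELS)}
    features["n_address_like"] = sum(features.values())
    return features


def looks_like_address(labels, min_hits=2):
    """Toy rule: flag the text when enough address-like entities were found."""
    return entity_features(labels)["n_address_like"] >= min_hits


def spacy_labels(text, model_name="en_core_web_sm"):
    """Run spaCy NER and return the entity labels (needs spaCy and the model installed)."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model_name)
    return [ent.label_ for ent in nlp(text).ents]
```

The count features could be fed to any classifier, while `looks_like_address` shows the rule-based variant.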

When I did this the last time, the problem was very hard, especially since I had international addresses and the variation across countries is enormous. Add to that the variation introduced by people, and the problem becomes quite hard even for humans.
I finally built a heuristic (does it contain something like PO BOX, a likely country name (grep Wikipedia), maybe city names) and then threw every remaining maybe-address at the Google Maps API. Google Maps is quite good at recognizing addresses, but even that will have false positives, so manual checking will most likely be needed.
I did not use ML because my address DB was "large" but not large enough for training; in particular, we lacked labeled training data.
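A heuristic of this kind could look roughly like the sketch below. The keyword lists, the regular expressions and the score cut-offs are made up for illustration; a real list of countries and cities would be much larger, and the "maybe" bucket is what would be sent to a geocoding service.

```python
import re

# Hypothetical keyword lists; a real system would load far larger ones
# (e.g. country and city names collected from Wikipedia).
COUNTRIES = {"india", "united arab emirates", "germany", "france"}
PO_BOX = re.compile(r"\bp\.?\s*o\.?\s*box\b", re.IGNORECASE)
STREET_WORDS = re.compile(r"\b(rd|road|st|street|ave|avenue|blvd)\b\.?", re.IGNORECASE)


def address_score(text):
    """Count how many cheap address hints appear in the text."""
    lowered = text.lower()
    score = 0
    if PO_BOX.search(text):
        score += 1
    if STREET_WORDS.search(text):
        score += 1
    # Plain substring test: simple, but "india" would also match "indiana".
    if any(country in lowered for country in COUNTRIES):
        score += 1
    return score


def triage(text):
    """Sort text into 'likely', 'maybe' (send to a geocoder) or 'unlikely'."""
    score = address_score(text)
    if score >= 2:
        return "likely"
    return "maybe" if score == 1 else "unlikely"
```

Only the "maybe" cases need the external lookup, which keeps the number of API calls and manual checks down.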

As you are asking for literature recommendations (btw, this question is probably too broad for this place), I can recommend a few links:
https://www.reddit.com/r/datasets/comments/4jz7og/how_to_get_a_large_at_least_100k_postal_address/
https://www.red-gate.com/products/sql-development/sql-data-generator/
https://openaddresses.io/
You need to build a labeled dataset, as @Christian Sauer has already mentioned, with examples of addresses. You probably need to create negative examples that are not addresses as well! For example, sentences that contain only telephone numbers or the like. Either way, this will be a rather imbalanced dataset, as you will have a lot of correct addresses and only a few non-addresses. In total you would need around 1000 examples as a starting point.
Another option is to identify some basic addresses manually and run a similarity analysis to find the sentences that are closest to them.
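One simple way to run such a similarity analysis is sketched below, using Jaccard similarity over character trigrams. The choice of measure and the helper names are assumptions for illustration; TF-IDF or embedding-based similarity would be alternatives.

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a string."""
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + n] for i in range(max(len(cleaned) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_address(sentence, known_addresses):
    """Return (best_score, best_match) against hand-picked reference addresses."""
    grams = char_ngrams(sentence)
    return max((jaccard(grams, char_ngrams(ref)), ref) for ref in known_addresses)
```

Sentences whose best score stays below some threshold chosen on a held-out sample would then be treated as non-addresses.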

As mentioned by Uri Goren, this is a named entity recognition problem, and there are a lot of trained models available. Still, the best one can get is the Stanford NER.
https://nlp.stanford.edu/software/CRF-NER.shtml
It is a conditional random field (CRF) NER, available in Java.
If you are looking for a Python implementation of the same, have a look at:
How to install and invoke Stanford NERTagger?
Here you can gather information from a sequence of multiple tags like
, , or any other sequence like that. If it doesn't give you the correct stuff, it will still somehow get you closer to any address in the whole document. That's a head start.
Thanks.

Related

Is there a way to use a pretrained doc2vec model to evaluate some document dataset?

Lately I have been doing research aimed at unsupervised clustering of a huge text database. First I tried bag-of-words followed by several clustering algorithms, which gave me good results, but now I am trying to move to a doc2vec representation and it does not seem to be working for me: I cannot load a prepared model and work with it, and training my own does not give useful results.
I tried to train my model on 10k texts
model = gensim.models.doc2vec.Doc2Vec(vector_size=500, min_count=2, epochs=100,workers=8)
(around 20-50 words each), but the similarity score provided by gensim, e.g.
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
works much worse with my model than the same approach with bag-of-words.
By much worse I mean that identical or almost identical texts have a similarity score comparable to texts that have no connection I can think of. So I decided to use a model from "Is there pre-trained doc2vec model?", hoping a pretrained model would have learned more connections between words. Sorry for the somewhat long preamble, but the question is: how do I plug it in? Can someone provide some ideas on how to use the loaded gensim model from https://github.com/jhlau/doc2vec to convert my own dataset of texts into vectors of the same length? My data is preprocessed (stemmed, no punctuation, lowercase, no nltk.corpus stopwords) and I can deliver it as a list, dataframe or file if needed. The code question is: how do I pass my own data to a pretrained model? Any help would be appreciated.
UPD: outputs that make me feel bad
Train Document (6134): «use medium paper examination medium habit one
week must chart daily use medium radio television newspaper magazine
film video etc wake radio alarm listen traffic report commuting get
news watch sport soap opera watch tv use internet work home read book
see movie use data collect journal basis analysis examining
information using us gratification model discussed textbook us
gratification article provided perhaps carrying small notebook day
inputting material evening help stay organized smartphone use note app
track medium need turn diary trust tell tell immediately paper whether
actually kept one begin medium diary soon possible order give ample
time complete journal write paper completed diary need write page
paper use medium functional analysis theory say something best
understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
Similar Document (6130, 0.6926988363265991): «use medium paper examination medium habit one week must chart daily use medium radio
television newspaper magazine film video etc wake radio alarm listen
traffic report commuting get news watch sport soap opera watch tv use
internet work home read book see movie use data collect journal basis
analysis examining information using us gratification model discussed
textbook us gratification article provided perhaps carrying small
notebook day inputting material evening help stay organized smartphone
use note app track medium need turn diary trust tell tell immediately
paper whether actually kept one begin medium diary soon possible order
give ample time complete journal write paper completed diary need
write page paper use medium functional analysis theory say something
best understood understanding used us gratification model provides
framework individual use medium basis analysis especially category
discussed posted dominick article apply concept medium usage expected
le medium use cognitive social utility affiliation withdrawal must
draw conclusion use analyzing habit within framework idea discussed
text article concept must clearly included articulated paper common
mistake student make assignment tell medium habit fail analyze habit
within context us gratification model must include idea paper»
This looks perfectly OK, but looking at other outputs
Train Document (1185): «photography garry winogrand would like paper
life work garry winogrand famous street photographer also influenced
street photography aim towards thoughtful imaginative treatment detail
referencescite research material academic essay university level»
Similar Document (3449, 0.6901006698608398): «tang dynasty write page
essay tang dynasty essay discus buddhism tang dynasty name artifact
tang dynasty discus them history put heading paragraph information
tang dynasty discussed essay»
This shows that the similarity score between two identical texts, which are the most similar in the system, and between two very distinct texts is almost the same, which makes it problematic to do anything with the data.
To get the most similar documents I use
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))

The models from https://github.com/jhlau/doc2vec are based on a custom fork of an older version of gensim, so you'd have to find/use that to make them usable.
Models from a generic dataset (like Wikipedia) may not understand the domain-specific words you need, and even where words are shared, the effective senses of those words may vary. Also, to use another model to infer vectors on your data, you should ensure you're preprocessing/tokenizing your text in the same way as the training data was processed.
Thus, it's best to use a model you've trained yourself – so you fully understand it – on domain-relevant data.
10k documents of 20-50 words each is a bit small compared to published Doc2Vec work, but might work. Trying to get 500-dimensional vectors from a smaller dataset could be a problem. (With less data, fewer vector dimensions and more training iterations may be necessary.)
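To make the point about fewer dimensions concrete, here is a hedged sketch of training a smaller model and running a common sanity check: infer a vector for each training document and see whether the document comes back as its own nearest neighbour. The values `vector_size=50` and `epochs=40` are illustrative guesses, not recommendations from this answer, and the gensim-dependent functions assume gensim is installed (the `dv` attribute is the gensim 4 name, `docvecs` the older one).

```python
def train_doc2vec(token_lists, vector_size=50, epochs=40):
    """Train a small Doc2Vec model; token_lists is a list of token lists."""
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument  # needs gensim installed

    corpus = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(token_lists)]
    model = Doc2Vec(vector_size=vector_size, min_count=2, epochs=epochs)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model


def self_rank(model, token_lists, doc_id):
    """Rank of a training document among the neighbours of its own inferred vector."""
    doc_vectors = model.dv if hasattr(model, "dv") else model.docvecs  # gensim 4 vs 3
    inferred = model.infer_vector(token_lists[doc_id])
    sims = doc_vectors.most_similar([inferred], topn=len(doc_vectors))
    return [tag for tag, _ in sims].index(doc_id)


def share_ranked_first(ranks):
    """Fraction of documents whose own vector came back as the top hit (rank 0)."""
    return sum(1 for r in ranks if r == 0) / len(ranks) if ranks else 0.0
```

If most documents do not rank themselves first (or near first), the model or the training setup is probably the issue rather than the downstream clustering.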
If the results of your self-trained model are unsatisfactory, there could be other problems in your training and inference code (which is not shown yet in your question). It would also help to see more concrete examples/details of how your results are unsatisfactory, compared to a baseline (like the bag-of-words representations you mention). If you add these details to your question, it might be possible to offer other suggestions.
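For the baseline comparison, a bag-of-words cosine similarity is easy to compute by hand. This is only a minimal sketch with raw term counts; the function name is made up here and TF-IDF weighting would be a natural refinement.

```python
import math
from collections import Counter


def cosine_bow(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0
```

Reporting this score next to the Doc2Vec score for the same document pairs makes it much easier to see where the two representations disagree.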

Identifying the subject of a sentence

I have been exploring NLP techniques with the goal of identifying the subject of survey comments (which I then use in conjunction with sentiment analysis). I want to make high level statements such as "10% of survey respondents made a positive comment (+ sentiment) about Account Managers".
My approach has used Named Entity Recognition (NER). Now that I am working with real data, I am getting visibility of some of the complexities & nuances associated with identifying the subject of a sentence. Here are six examples of sentences where the subject is the Account Manager. I have put the named entity in bold for demonstration purposes.
Our account manager is great, he always goes the extra mile!
Steve our account manager is great, he always goes the extra mile!
Steve our relationship manager is great, he always goes the extra mile!
Steven is great, he always goes the extra mile!
Steve Smith is great, he always goes the extra mile!
Our business mgr. is great, he always goes the extra mile!
I see three challenges that add complexity to my task:
Synonyms: Account manager vs relationship manager vs business mgr. This is somewhat domain specific and tends to vary with the survey target audience.
Abbreviations: Mgr. vs manager
Ambiguity: Whether “Steven” is “Steve Smith” & therefore an “account manager”.
Of these the synonym problem is the most frequent issue, followed by the ambiguity issues. Based on what I have seen, the abbreviation issue isn’t that frequent in my data.
Are there any NLP techniques that can help deal with any of these issues to a relatively high degree of confidence?

As far as I understood, what you call the "subject" is, given a sentence, the entity that a statement is made about - in your example, Steve the account manager.
Based on this assumption, here are a few techniques and how they might help you:
(Dependency) Parsing
Since you don't mean subject in the strict grammatical sense, the approach suggested by user7344209 based on dependency parsing probably won't help you. In a sentence such as "I like Steve", the grammatical subject is "I", although you probably want to find "Steve" as the "subject".
Named Entity Recognition
You already use this, and it is great for detecting names of persons such as Steve. What I'm not so sure about is the example of the "account manager". Neither the output provided by Daniel nor my own test with Stanford CoreNLP identified it as a named entity, which is correct: it really is not a named entity.
Something broader such as the suggested mention identification might be better, but it basically marks every noun phrase which is probably too broad. If I understood it correctly, you want to find one subject per sentence.
Coreference Resolution
Coreference Resolution is the key technique to detect that "Steve" and the "account manager" are the same entity. Stanford CoreNLP has such a module, for example.
In order for this to work in your example, you have to let it process several sentences at once, since you want to find the links between them. Here is an example with (shortened versions of) some of your examples:
The visualization is a bit messy, but it basically found the following coreference chains:
Steve <-> Steve Smith
Steve our account manager <-> He <-> Our account manager
Our <-> Our
the extra mile <-> the extra mile
Given the first two chains, and a bit of post-processing, you could figure out that all four statements are about the same entity.
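The post-processing could, for instance, merge chains that share a content word. The sketch below is one hypothetical way to do that, not something CoreNLP provides; the stopword list and function names are assumptions made for this example.

```python
# Words that should not link two chains on their own (an illustrative list).
STOPWORDS = {"our", "the", "he", "she", "his", "her", "a", "an"}


def content_tokens(mention):
    """Lowercased tokens of a mention without determiners and pronouns."""
    return {tok for tok in mention.lower().split() if tok not in STOPWORDS}


def merge_chains(chains):
    """Merge coreference chains that share at least one content token."""
    merged = []  # list of (token_set, mention_list) groups
    for chain in chains:
        tokens = set().union(*(content_tokens(m) for m in chain))
        mentions = list(chain)
        remaining = []
        for group_tokens, group_mentions in merged:
            if tokens & group_tokens:
                # Overlap found: absorb the existing group into the current one.
                tokens |= group_tokens
                mentions = group_mentions + mentions
            else:
                remaining.append((group_tokens, group_mentions))
        merged = remaining + [(tokens, mentions)]
    return [mentions for _, mentions in merged]
```

Applied to the four chains listed above, the first two collapse into a single group because both contain "Steve", while the "Our" and "the extra mile" chains stay separate.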
Semantic Similarity
In the case of account, business and relationship manager, I found that the CoreNLP coreference resolver actually already finds chains despite the different terms.
More generally, if you think that the coreference resolver cannot handle synonyms and paraphrases well enough, you could also try to include measures of semantic similarity. There is a lot of work in NLP on predicting whether two phrases are synonymous or not.
Some approaches are:
Looking up synonyms in a thesaurus such as Wordnet - e.g. with nltk (python) as shown here
Better, compute a similarity measure based on the relationships defined in WordNet - e.g. using SEMILAR (Java)
Using continuous representations for words to compute similarities, for example based on LSA or LDA - also possible with SEMILAR
Using more recent neural-network-style word embeddings such as word2vec or GloVe - the latter are easily usable with spaCy (python)
One way to use these similarity measures would be to identify the entities in two sentences, then make pairwise comparisons between the entities of both sentences, and if a pair has a similarity higher than a threshold, consider it to be the same entity.
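The pairwise comparison idea can be sketched as follows. The three-dimensional vectors are invented toy values standing in for real embeddings (word2vec, GloVe, ...), and the threshold of 0.8 is an arbitrary assumption that would need tuning on real data.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real ones would come from word2vec, GloVe, etc.
TOY_VECTORS = {
    "account manager": [0.9, 0.1, 0.0],
    "relationship manager": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def same_entity_pairs(entities_a, entities_b, vectors, threshold=0.8):
    """Return entity pairs whose vector similarity reaches the threshold."""
    return [
        (a, b)
        for a in entities_a
        for b in entities_b
        if a in vectors and b in vectors and cosine(vectors[a], vectors[b]) >= threshold
    ]
```

With these toy values, "account manager" pairs with "relationship manager" but not with "invoice".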

If you don't have much data to train, you probably can try a dependency analysis tool and extract dependency pairs which have SUBJECT identified (usually the nsubj if you use Stanford Parser).
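A minimal sketch of extracting such subjects is shown below. It uses spaCy as a stand-in for the Stanford Parser (an assumption; both emit `nsubj`-style labels), and `spacy_parse` needs spaCy plus the `en_core_web_sm` model installed, while `grammatical_subjects` works on any list of (text, label) pairs.

```python
def grammatical_subjects(parsed_tokens):
    """Pick tokens whose dependency label marks a nominal subject.

    parsed_tokens is a list of (text, dependency_label) pairs, e.g. produced
    by spacy_parse() below or by any other dependency parser.
    """
    return [text for text, dep in parsed_tokens if dep in {"nsubj", "nsubjpass"}]


def spacy_parse(sentence, model_name="en_core_web_sm"):
    """Parse with spaCy (needs spaCy and the model installed)."""
    import spacy  # lazy import so grammatical_subjects() works without spaCy

    nlp = spacy.load(model_name)
    return [(token.text, token.dep_) for token in nlp(sentence)]
```

For a hand-labelled parse of "I like Steve", i.e. `[("I", "nsubj"), ("like", "ROOT"), ("Steve", "dobj")]`, this returns `["I"]`, so the grammatical subject is not necessarily the entity a comment is about.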

I like your approach using NER. This is what I see in our system for your inputs:
Mention-Detection output might also be useful:
Your 2nd point, which involves abbreviations, is a hard problem. But we have an entity-similarity module here that might be useful. It takes into account things like honorifics etc.
About your 3rd point, co-reference problem, try the coref module:
Btw the above figures are from the demo here: http://deagol.cs.illinois.edu:8080

Name Entity Resolution Algorithm

I am trying to build an entity resolution system, where my entities are:
(i) General named entities, that is organization, person, location, date, time, money, and percent.
(ii) Some other entities like product, or title of a person such as president, CEO, etc.
(iii) Coreferred entities like pronoun, determiner phrase, synonym, string match, demonstrative noun phrase, alias, apposition.
Based on the literature and other references, I have limited the scope so that I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the Oxford of Oxford University as different from Oxford the place, as the former is the first word of an organization entity and the latter is a location entity.
My task is to construct a resolution algorithm that extracts and resolves the entities.
So, I am working on an entity extractor in the first place.
In the second place, when I try to relate the coreferences, I find in the literature, such as this seminal work, that a decision-tree-based algorithm is used, with features like distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).
I could work out an entity recognition system with HMM.
Now I am trying to work out a coreference as well as an entity resolution system. I was wondering: instead of using so many features, if I take an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction like,
*"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA
small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where, PERS-> PERSON
PPERS->PERSONAL PRONOUN TO PERSON
PoPERS-> POSSESSIVE PRONOUN TO PERSON
APPERS-> APPOSITIVE TO PERSON
LOC-> LOCATION
NA-> NOT AVAILABLE*
would I be wrong? I ran an experiment with around 10,000 words. Early results seem encouraging. With support from one of my colleagues, I am trying to add some semantic information to the tagset, like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE), to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better now.
Please see this new thought too.
I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
If anyone can suggest a different approach, please feel free to do so.
I use Python2.x on MS-Windows and try to use libraries like NLTK, Scikit-learn, Gensim, pandas, Numpy, Scipy etc.
Thanks in Advance.

It seems that you are pursuing three different paths, each of which could fill a stand-alone PhD. There is a lot of literature about them. My first advice: focus on the main task and outsource the rest. If you are going to develop this for a less common language, you can also build on the work of others.
Named Entity Recognition
Stanford NLP has gone really far in this, especially for English. They resolve named entities really well, are widely used and have a nice community.
Other solutions may exist in openNLP for Python.
Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases and the decision becomes much harder.
Edit: Stanford NER is available in NLTK for Python.
Named Entity Resolution/Linking/Disambiguation
This is concerned with linking the name to some knowledge base, and solves the problem of whether Oxford means Oxford University or the city of Oxford.
AIDA: one of the state-of-the-art systems for this. It uses different context information as well as coherence information. They have also tried supporting several languages, and they have a good benchmark.
Babelfy: offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, but it never worked very well.
Others like tagme and wikifi, etc.
Coreference Resolution
Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.

Identifying an entity in article

I am working with Python on a data science task. I have extracted some news articles, and now I want to pick only those articles that are about a specific person, i.e. determine whether the person mentioned in an article is the same person I am interested in.
Let's say a person can be identified by either his name or certain attributes that describe that person, for example, a person with name "X" is a political figure. When an article about that person is published, we 'know' that it is referring to that person only by reading the context of the article. By 'context' I mean that the article contains any (or a combination) of the following:
That person's name
The name of his political party
Names of other people closely associated with him mentioned in the article
Other attributes that describe that person
Because names are common, I want to determine the probability that a given article speaks about that person "X" and not about another person with the same name as "X".

Alright so this is my best shot.
Initial assumptions
First, we assume that you have articles that already contain mentions of people, and these mentions are either a) mentions of the particular person you are looking for or b) mentions of other people sharing the same name.
I think disambiguating each mention (as you would do in Entity Linking) is overkill, as you also assume that the articles are either about the person or not. So we'll say that any article that contains at least one mention of the person is an article about the person.
General solution: Text classification
You have to develop a classification algorithm that extracts features from an article and feeds those features to a model you obtained through supervised learning. The model will output one of two answers, for example True or False. This necessitates a training set. For evaluation purposes (knowing that your solution works), you will also need a testing set.
So the first step will be to tag these training and testing sets using one of two tags each time ("True" and "False" or whatever). You have to assign these tags manually, by examining the articles yourself.
What features to use
@eldams mentions using contextual clues. In my (attempt at a) solution, the article is the context, so basically you have to ask yourself what might give away that the article is about the person in particular. At this point you can either choose the features yourself or let a more complex model find specific features in a more general feature category.
Two examples, assuming we are looking for articles about Justin Trudeau, the newly elected Canadian Prime Minister, as opposed to anyone else who is also named Justin Trudeau.
A) Choosing features yourself
With a bit of research, you will learn that Justin Trudeau leads the Liberal Party of Canada, so some good features would be to check whether or not the article contains these strings:
Liberal Party of Canada, Parti Libéral du Canada, LPC, PLC, Liberals,
Libéraux, Jean Chrétien, Paul Martin, etc
Since Trudeau is a politician, looking for these might be a good idea:
politics, politician, law, reform, parliament, house of commons, etc
You might want to gather information about his personal life, close collaborators, name of wife and kids, and so on, and add these as well.
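Hand-picked features like these can be turned into a feature dictionary with a few lines of code. The sketch below uses a subset of the strings listed above; the function name and the `has[...]` key format are made up for this example, and short acronyms such as LPC or PLC are left out because a plain substring test would match them inside other words.

```python
PARTY_TERMS = ["liberal party of canada", "parti libéral du canada", "liberals",
               "jean chrétien", "paul martin"]
POLITICS_TERMS = ["politics", "politician", "parliament", "house of commons", "reform"]


def keyword_features(article):
    """Binary features: does the article contain each hand-picked string?"""
    text = article.lower()
    features = {f"has[{term}]": term in text for term in PARTY_TERMS + POLITICS_TERMS}
    # Total number of matched strings, usable as an extra numeric feature.
    features["n_hits"] = sum(features.values())
    return features
```

The resulting dictionaries can be fed to a classifier after vectorisation, or inspected directly to see which clues fire on a given article.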
B) Letting the learning algorithm do the work
Your other option would be to train an n-gram model using every n-gram in the training set (e.g. all unigrams and bigrams). This results in a more complex model that can be more robust, but is also heavier to train and use.
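Extracting all unigrams and bigrams is straightforward; a minimal sketch, assuming simple lowercasing and whitespace tokenisation (a real pipeline would use a proper tokenizer):

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigram_bigram_counts(text):
    """Count every unigram and bigram in a whitespace-tokenised, lowercased text."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
```

Each distinct n-gram becomes one feature, which is why the feature space (and the training cost) grows quickly with the size of the training set.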
Software resources
Whatever you choose to do, if you need to train a classifier, you should use scikit-learn. Its SVM classifier would be the most popular choice. Naive Bayes is the more classic approach to document classification.

This task is usually known as Entity Linking. If you are working on popular entities, e.g. those that have an article in Wikipedia, then you may have a look at DBpedia Spotlight or BabelNet, which address this issue.
If you'd like to implement your own linker, then you may have a look at related articles. In most cases, a named entity linker detects mentions (person names in your case), then a disambiguation step is required, which computes probabilities for the available references (and NIL, as a mention may not have a reference available) for any specific mention in the text, using contextual clues (e.g. words of the sentence, paragraph or whole article containing the mention).
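The disambiguation step can be illustrated with a deliberately crude sketch: each candidate reference is described by a set of profile words, the score is the fraction of those words found in the context, and NIL is returned when no candidate scores high enough. The scoring function, the `min_score` value and the candidate profiles are all assumptions for illustration; real linkers use proper probabilistic models.

```python
def link_mention(context_words, candidates, min_score=0.3):
    """Pick the candidate whose profile best matches the context, or 'NIL'.

    candidates maps a reference id to a set of lowercase words describing it.
    The score is the share of profile words that occur in the context.
    """
    context = {w.lower() for w in context_words}
    scores = {
        ref: len(context & profile) / len(profile)
        for ref, profile in candidates.items()
        if profile
    }
    if not scores:
        return "NIL", 0.0
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= min_score else ("NIL", scores[best])
```

The returned score is only a rough confidence value, not a calibrated probability.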

Named entity recognition : For new/latest entities

Sorry for the weird question title, but I couldn't think of an appropriate one.
I'm new to NLP concepts, so I tried an NER demo (http://cogcomp.cs.illinois.edu/demo/ner/results.php). Now the issue is how and in what ways I can use the tags produced by NER. I mean, what answers or inferences can one draw from named entities that have been tagged into groups such as location, person, organization, etc.? And if I have data with names of entirely new companies, places, etc., how do I do NER tagging on such data?
Please don't downvote or block me, I just need guidance/expert suggestions, that's it. Reading about a concept is one thing, while knowing where and when to apply it is another, which is where I'm asking for guidance. Thanks a ton!
A snippet from the demo:-
Dogs have been used in cargo areas for some time, but have just been introduced recently in
passenger areas at LOC Newark and LOC JFK airports. LOC JFK has one dog and LOC Newark has a
handful, PER Farbstein said.

Usually NER is a step in a pipeline. For example, once all entities have been tagged, if you have many sentences like [PER John Smith], CEO of [ORG IBM] said..., then you can set up a table of Companies and CEOs. This is a form of knowledge base population.
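As a small illustration of that kind of knowledge base population, the sketch below pulls (company, CEO) pairs out of bracket-tagged sentences with a regular expression. The pattern and the function name are assumptions for this example; it only handles the exact "[PER ...], CEO of [ORG ...]" phrasing.

```python
import re

# Matches text like "[PER John Smith], CEO of [ORG IBM]" in bracket-tagged NER output.
CEO_PATTERN = re.compile(r"\[PER ([^\]]+)\],? CEO of \[ORG ([^\]]+)\]")


def extract_ceos(tagged_sentences):
    """Build a company -> CEO table from bracket-tagged sentences."""
    table = {}
    for sentence in tagged_sentences:
        for person, company in CEO_PATTERN.findall(sentence):
            table[company] = person
    return table
```

Without the NER tags, the pattern would have to guess where the person and company names start and end, which is exactly the part the tagger takes care of.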
There are plenty of other uses, though, depending on the type of data you already have and what you are trying to accomplish.

I think there are two parts in your question:
What is the purpose of NER?
This is a vast question. Generally it is used for Information Retrieval (IR) tasks such as indexing, document classification and Knowledge Base Population (KBP), but also for many, many others (speech recognition, translation)... it is quite hard to come up with an exhaustive list.
How can NER be extended to also recognize new/unknown entities?
I.e. how can we recognize entities that have never been seen by the NER system? At a glance, two solutions are likely to work:
Let's say you have some linked database that is updated on a regular basis: then the system may rely on generic categories. For instance, let's say "Marina Silva" comes up in the news and is now added to the lexicon under the category "POLITICIAN". As the system knows that every POLITICIAN should be tagged as a person, i.e. it doesn't rely on lexical items but on categories, it will tag "Marina Silva" as a PERS named entity. You don't have to re-train the whole system, just update its lexicon.
Using morphological and contextual clues, the system may guess new named entities that have never been seen (and are not in the lexicon). For instance, a rule like "The presidential candidate XXX YYY" (or "Marina YYY") will guess that "XXX YYY" (or just "YYY") is a PERS (or part of a PERS). Most of the time, this involves probabilistic modeling.
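Both ideas can be sketched in a few lines. The lexicon contents, the category-to-tag mapping and the regular expression below are hypothetical and rule-based only; as noted above, real systems usually rely on probabilistic models for the contextual part.

```python
import re

# Hypothetical lexicon: surface form -> fine-grained category, updated without retraining.
LEXICON = {"Marina Silva": "POLITICIAN"}
# Generic mapping from fine-grained categories to NER tags.
CATEGORY_TO_TAG = {"POLITICIAN": "PERS", "CITY": "LOC"}
# Contextual rule: two capitalised words after "presidential candidate" are a person.
CANDIDATE_RULE = re.compile(r"presidential candidate ([A-Z]\w+ [A-Z]\w+)")


def tag_entities(text):
    """Return {entity: tag} using the lexicon first, then the contextual rule."""
    found = {name: CATEGORY_TO_TAG[cat] for name, cat in LEXICON.items() if name in text}
    for match in CANDIDATE_RULE.finditer(text):
        found.setdefault(match.group(1), "PERS")
    return found
```

Adding a new entry to `LEXICON` is enough to make the tagger recognise a new name, while the rule catches names that are not in the lexicon at all.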
Hope this helps :)
