I am trying to build a web scraper that can classify the content of a given URL into multiple categories, but I am currently confused about which method is best suited for my use case. Here's the overall use case:
I want to predict a researcher's interests from their biography and categorize them into one or more of the 17 SDG (Sustainable Development Goals) categories. I have three data points to work with:
The biography of each researcher (can be scraped and tokenized)
A list of keywords that are often associated with each of the SDG categories/goals (here's an example of said keywords)
Hundreds of categorizations done manually by students in the form of binary data (here's an example of said data)
So far, we have students read each researcher's biography and decide which SDG category/goal each researcher belongs to. One researcher can belong to one or more SDG categories. We usually categorize based on how often the SDG keywords listed in our database appear in each researcher's bio.
I have looked up machine learning models for NLP online but couldn't decide which method would work best for my use case. Any suggestions and references would be super appreciated because I'm a bit lost here.
The problem you have here is multi-label classification, and you can solve it with supervised learning since you have a labelled dataset.
A labelled dataset should look something like this:
article 1 - sdg1, sdg2, sdg4
article 2 - sdg4
.
.
.
The implementation is explained in detail here - keras - multi-label-classification
This one abstracts away a lot of the details and keeps the implementation simple - fasttext multi-label-classification
In-depth overviews of these libraries are here:
keras and fasttext
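If it helps as a starting point, here is a minimal sketch (mine, not taken from the linked tutorials) of the standard Keras setup for multi-label text classification: one sigmoid output per label trained with binary cross-entropy, so each SDG label is predicted independently. The vocabulary size, sequence length, label count and random training data are all placeholder assumptions.

```python
# Minimal multi-label sketch in Keras; all sizes and data below are placeholders.
import numpy as np
from tensorflow import keras

num_labels = 17      # assumed: one output per SDG goal
vocab_size = 5000    # assumed vocabulary size
max_len = 200        # assumed (padded) biography length

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 64),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(64, activation="relu"),
    # Sigmoid (not softmax): each label is an independent yes/no decision.
    keras.layers.Dense(num_labels, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X: tokenized, padded biographies; Y: multi-hot vectors (1 if the bio was tagged with that SDG).
X = np.random.randint(0, vocab_size, size=(8, max_len))
Y = np.random.randint(0, 2, size=(8, num_labels))
model.fit(X, Y, epochs=1, verbose=0)
```

At prediction time you would threshold each sigmoid output (e.g. at 0.5) to decide which SDG labels to assign to a biography.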
I'm in need of suggestions on how to extract keywords from a large document. The keywords should be in line with what we have defined as the intended search results.
For example,
given a document about a company, I need the owner's name, where the office is situated, and what the operating industry is, and the defined set of words would be,
{owner, director, office, industry...}-(1)
the intended output has to be something like,
{Mr.Smith James, ,Main Street, Financial Banking}-(2)
I was looking at a method based on semantic similarity: extract the sentences containing words similar to the given set (1), and then use POS tagging to extract the nouns from those sentences.
It would be useful if further resources supporting this approach could be provided.
What you want to do is referred to as Named Entity Recognition.
In Python there is a popular library called spaCy that can be used for that. The standard models are able to detect 18 different entity types, which is a fairly good number.
Person and company names should be easy to extract, while whole addresses and the industry might be more difficult; you may have to train your own model for those entity types. spaCy also provides an API for training your own models.
Please note that you need quite a lot of training data to get decent results. Start with 1000 examples per entity type and see whether that is sufficient for your needs. POS tags can be used as a feature.
If your data is unstructured, this is probably one of the most suitable approaches. If your data is more structured, you could perhaps take advantage of that.
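For reference, this is roughly what running spaCy's off-the-shelf NER looks like; the example sentence is made up, and the `en_core_web_sm` model has to be downloaded once beforehand (python -m spacy download en_core_web_sm).

```python
# Minimal spaCy NER sketch; the sentence is an invented example.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr. James Smith runs the office on Main Street and works in financial banking.")

# Print each detected entity span together with its predicted type label.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

For entity types the standard models do not cover (e.g. an "industry" label), you would annotate your own examples and train or fine-tune a model via spaCy's training API.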
I'm currently working on a project where I'm taking emails, stripping out the message bodies using the email package, then I want to categorize them using labels like sports, politics, technology, etc...
I've successfully stripped the message bodies out of my emails, and now I'm looking to start classifying. I've done the classic example of sentiment-analysis classification using the movie_reviews corpus, separating documents into positive and negative reviews.
I'm just wondering how I could apply this approach to my project. Can I create multiple classes like sports, technology, politics, entertainment, etc.? I have hit a roadblock here and am looking for a push in the right direction.
If this isn't an appropriate question for SO I'll happily delete it.
Edit: Hello everyone, I see that this post has gained a bit of popularity. I did end up successfully completing this project; here is a link to the code in the project's GitHub repo:
https://github.com/codyreandeau/Email-Categorizer/blob/master/Email_Categorizer.py
The task of text classification is a supervised machine learning problem. This means that you need to have labelled data. When you approached the movie_reviews problem, you used the +1/-1 labels to train your sentiment analysis system.
Getting back to your problem:
If you have labels for your data, approach the problem in the same manner. I suggest you use the scikit-learn library. You can draw some inspiration from here: Scikit-Learn for Text Classification
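If you already have labels, here is a hedged sketch of what that looks like in scikit-learn; the category names and toy emails below are invented purely for illustration.

```python
# Minimal supervised text classification sketch with scikit-learn; toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

emails = [
    "The game went into overtime and the crowd loved it",
    "The new phone ships with a faster chip and a better camera",
    "The senate passed the budget bill after a long debate",
]
labels = ["sports", "technology", "politics"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(emails, labels)

print(clf.predict(["The crowd loved the overtime game"]))  # likely 'sports', given the word overlap
```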
If you don't have labels, you can try an unsupervised learning approach. If you have any idea how many categories there are (call that number K), you can try a K-means approach: group the emails into K clusters based on how similar they are, so that similar emails end up in the same bucket. Then inspect the clusters by hand, come up with a label for each, and assign new emails to the most similar cluster. If you need help with K-means, check this quick recipe: Text Clustering Recipe
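A minimal sketch of that clustering idea, again on invented toy emails and with K chosen arbitrarily:

```python
# Minimal K-means text clustering sketch; the emails and K=2 are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = [
    "The game went into overtime and the crowd loved it",
    "The striker scored twice in the final game",
    "The new phone ships with a faster chip",
    "The new laptop ships with a better battery and screen",
]

X = TfidfVectorizer(stop_words="english").fit_transform(emails)
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Inspect which emails landed in which cluster, then name the clusters by hand.
for email, cluster in zip(emails, km.labels_):
    print(cluster, email)
```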
Suggestion: getting labels for emails can be easier than you think. For example, Gmail lets you export your emails together with their folder information, so if you have already been categorising your email, you can take advantage of that.
To create a classifier, you need a training data set with the classes you are looking for. In your case, you may need to either:
create your own dataset
use a pre-existing dataset
The Brown Corpus is a classic corpus with many of the categories you are talking about. It could be a starting point to help classify your emails, using a package like gensim to find semantically similar texts.
Once you classify your emails, you can then train a system to predict a label for each unseen email.
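For what it's worth, the Brown Corpus categories mentioned above can be listed directly from NLTK (assuming the corpus data has been downloaded):

```python
# List the Brown Corpus categories and peek at one of them via NLTK.
import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

print(brown.categories())
# e.g. ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', ...]
print(brown.words(categories="news")[:10])
```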
I am working with Python on a data-science-related task. What I need to do is this: I have extracted some news articles, and now I want to selectively pick only those articles that are about a specific person and determine whether the person mentioned in an article is the same person I am interested in.
Let's say a person can be identified either by his name or by certain attributes that describe him; for example, a person named "X" is a political figure. When an article about that person is published, we 'know' it refers to that person only by reading the context of the article. By 'context' I mean the article contains any (or a combination) of the following:
That person's name
The name of his political party
Names of other people closely associated with him mentioned in the article
Other attributes that describe that person
Because names are common, I want to determine the probability that a given article speaks about that person "X" only and not about any other person having the same name as "X".
Alright so this is my best shot.
Initial assumptions
First, we assume that you have articles that already contain mentions of people, and these mentions are either a) mentions of the particular person you are looking for or b) mentions of other people sharing the same name.
I think disambiguating each mention (as you would do in Entity Linking) is overkill, since you also assume that an article is either about the person or not. So we'll say that any article that contains at least one mention of the person is an article about the person.
General solution: Text classification
You have to develop a classification algorithm that extracts features from an article and feeds those features to a model obtained through supervised learning. The model will output one of two answers, for example True or False. This requires a training set. For evaluation purposes (to know that your solution works), you will also need a test set.
So the first step will be to tag these training and testing sets using one of two tags each time ("True" and "False" or whatever). You have to assign these tags manually, by examining the articles yourself.
What features to use
@eldams mentions using contextual clues. In my (attempt at a) solution, the article is the context, so basically you have to ask yourself what might give away that the article is about this particular person. At this point you can either choose the features yourself or let a more complex model find specific features within a more general feature category.
Two examples, assuming we are looking for articles about Justin Trudeau, the newly elected Canadian Prime Minister, as opposed to anyone else who is also named Justin Trudeau.
A) Choosing features yourself
With a bit of research, you will learn that Justin Trudeau leads the Liberal Party of Canada, so some good features would be to check whether or not the article contains these strings:
Liberal Party of Canada, Parti Libéral du Canada, LPC, PLC, Liberals,
Libéraux, Jean Chrétien, Paul Martin, etc
Since Trudeau is a politician, looking for these might be a good idea:
politics, politician, law, reform, parliament, house of commons, etc
You might want to gather information about his personal life, close collaborators, name of wife and kids, and so on, and add these as well.
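A toy sketch of option A: each hand-picked cue phrase becomes one binary feature. The phrase lists come from the examples above; the helper function itself is just an illustration.

```python
# Hand-crafted cue-phrase features; CUES combines the example phrase lists above.
CUES = [
    "liberal party of canada", "parti libéral du canada", "lpc", "plc",
    "liberals", "libéraux", "jean chrétien", "paul martin",
    "politics", "politician", "law", "reform", "parliament", "house of commons",
]

def cue_features(article):
    """Return one binary feature per cue phrase: 1 if the phrase appears in the article."""
    text = article.lower()
    return [int(cue in text) for cue in CUES]

print(cue_features("Justin Trudeau leads the Liberal Party of Canada in parliament"))
```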
B) Letting the learning algorithm do the work
Your other option would be to train an n-gram model using every n-gram in the training set (e.g. all unigrams and bigrams). This results in a more complex model that can be more robust, but is also heavier to train and use.
Software resources
Whatever you choose to do, if you need to train a classifier, you should use scikit-learn. Its SVM classifier would be the most popular choice; Naive Bayes is the more classic approach to document classification.
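A hedged sketch of option B with scikit-learn: unigram and bigram counts feeding a linear SVM. The toy articles and their True/False tags are invented purely for illustration.

```python
# n-gram features + linear SVM; the articles and labels are made-up examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

articles = [
    "Justin Trudeau addressed the House of Commons on the new reform bill",
    "The Liberal Party of Canada announced its candidates for the election",
    "A local athlete named Justin Trudeau won the regional boxing title",
]
labels = [True, True, False]  # True = the article is about the politician

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(articles, labels)

print(model.predict(["The Prime Minister spoke in the House of Commons about the reform"]))
```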
This task is usually known as Entity Linking. If you are working with popular entities, e.g. those that have an article on Wikipedia, then you may have a look at DBpedia Spotlight or BabelNet, which address this issue.
If you'd like to implement your own linker, then you may have a look at related articles. In most cases, a named entity linker first detects mentions (person names in your case); a disambiguation step then computes, for each mention in the text, probabilities over the available references (plus NIL, since a mention may have no reference available), using contextual clues (e.g. the words of the sentence, paragraph, or whole article containing the mention).
I am planning to do my final year project on Natural Language Processing (using NLTK) and my area of interest is Comment Summarization from Social media websites such as Facebook. For example, I am trying to do something like this:
Random Facebook comments in a picture :
Wow! Beautiful.
Looking really beautiful.
Very pretty, Nice pic.
Now, all these comments will get mapped (using a template based comment summarization technique) into something like this:
3 people find this picture to be "beautiful".
The output will consist of the word "beautiful" since it is used more often in the comments than the word "pretty" (and also because "beautiful" and "pretty" are synonyms). In order to accomplish this task, I am going to use approaches like tracking keyword frequency and keyword scores (in this scenario, "beautiful" and "pretty" have very close scores).
Is this the best way to do it?
So far with my research, I have been able to come up with the following papers but none of the papers address this kind of comment summarization :
Automatic Summarization of Events from Social Media
Social Context Summarization
What are the other papers in this field which address a similar issue?
Apart from this, I also want my summarizer to improve with every summarization task. How do I apply machine learning in this regard?
Topic model clustering is what you are looking for.
A search on Google Scholar for "topic model clustering" will give you lots of references on topic model clustering.
To understand them, you need to be familiar with approaches to the following tasks, apart from the basics of machine learning in general:
Clustering: Cosine distance clustering, k-means clustering
Ranking: PageRank, TF-IDF, Mutual Information Gain, Maximal Marginal Relevance
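To make the idea concrete, here is a minimal, illustrative topic-model sketch using scikit-learn's LDA; the toy comments and the number of topics are placeholders, and real comment streams would need far more data and preprocessing.

```python
# Tiny LDA topic-model sketch over toy comments; sizes and data are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    "Wow! Beautiful.",
    "Looking really beautiful.",
    "Very pretty, nice pic.",
    "Where was this photo taken?",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(comments)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {topic_id}: {top}")
```

Grouping synonyms such as "beautiful" and "pretty" into the same summary term would additionally need a resource like WordNet or word embeddings on top of this.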
First: Any recs on how to modify the title?
I am using my own named entity recognition algorithm to parse data from plain text. Specifically, I am trying to extract lawyer practice areas. A common sentence structure that I see is:
1) Neil focuses his practice on employment, tax, and copyright litigation.
or
2) Neil focuses his practice on general corporate matters including securities, business organizations, contract preparation, and intellectual property protection.
My entity extraction is doing a good job of finding the key words, for example, my output from sentence one might look like this:
Neil focuses his practice on (employment), (tax), and (copyright litigation).
However, that doesn't really help me. What would be more helpful is if I got an output that looked more like this:
Neil focuses his practice on (employment - litigation), (tax - litigation), and (copyright litigation).
Is there a way to accomplish this goal using an existing Python framework such as NLTK? After my algorithm extracts the practice areas, can I use NLTK to extract the other words that my "practice areas" modify, in order to get a more complete picture?
Named entity recognition (NER) systems typically use grammar-based rules or statistical language models. What you have described here, though, seems to be based only on keywords.
Typically, and much like most complex NLP tasks, NER systems should be trained on domain-specific data so that they perform well on previously unseen (test) data. You will require adequate knowledge of machine learning to go down that path.
In "normal" language, if you want to extract words or phrases and categorize them into classes defined by you (e.g. litigation), if often makes sense to use category labels in external ontologies. An example could be:
You want to extract words and phrases related to sports.
Such a categorization (i.e. detecting whether or not a word is indeed related to sports) is not a general enough problem, which means you will not find ready-made systems that solve it (e.g. algorithms in the NLTK library). You can, however, use an ontology like Wikipedia and exploit the category labels available there.
E.g., if you search Wikipedia for "football", you will see that it has the category label "ball games", which in turn falls under "sports".
Note that the Wikipedia category labels form a directed graph. If you build a system that exploits the category structure of such an ontology, you should be able to categorize terms in your texts as you see fit. Moreover, you can control the granularity of the categorization (e.g. do you want just "sports", or "individual sports" and "team sports"?).
I have built such a system for categorizing terms related to computer science, and it worked remarkably well. The closest freely available system that works in a similar way is the Wikifier built by the cognitive computing group at the University of Illinois at Urbana-Champaign.
Caveat: you may need to tweak a simple category-based approach to suit your needs. E.g. there is no Wikipedia page for "litigation"; instead, it redirects you to a page titled "Lawsuit". Such cases need to be handled separately.
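To show what such a lookup can look like, here is a hedged sketch that queries the public MediaWiki API for a page's category labels (following redirects, which covers the "litigation" → "Lawsuit" case above); the helper function is my own illustration, not part of any tool mentioned in this answer.

```python
# Look up Wikipedia category labels for a term via the MediaWiki API.
import requests

def wikipedia_categories(term):
    """Return the non-hidden category labels of the Wikipedia page for `term`, following redirects."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "titles": term,
            "prop": "categories",
            "clshow": "!hidden",   # skip hidden maintenance categories
            "format": "json",
            "redirects": 1,
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    return [c["title"] for page in pages.values() for c in page.get("categories", [])]

print(wikipedia_categories("litigation"))  # resolved to the "Lawsuit" page
```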
Final Note: This solution is really not in the area of NLP, but my past experience suggests that for some domains, this kind of ontology-based approach works really well. Also, I have used the "sports" example in my answer because I know nothing about legal terminology. But I hope my example helps you understand the underlying process.
I do not think your "algo" is even doing entity recognition... However, stretching the problem you presented quite a bit, what you want to do looks like coreference resolution in coordinated structures containing ellipsis. Not easy at all: start by googling for relevant literature in linguistics and computational linguistics. I use the standard terminology from the field below.
In practical terms, you could start by assigning the nearest antecedent (the most frequently used approach in English). Using your examples:
first extract all the "entities" in a sentence
from the entity list, identify antecedent candidates ("litigation", etc.). This is a very difficult task, involving many different problems... you might avoid it if you know in advance which "entities" will be interesting for you.
finally, you assign (resolve) each anaphora/cataphora to the nearest antecedent (see the toy sketch after this list).
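That toy sketch handles only the single pattern in your first example (a comma-separated coordination whose last conjunct carries the shared head noun) by copying the trailing head onto the earlier conjuncts. Real coreference/ellipsis resolution is far harder; this is purely illustrative.

```python
# Naive "nearest antecedent" distribution over a simple coordination pattern.
import re

def distribute_head(phrase):
    """Copy the trailing head noun of the last conjunct onto the earlier conjuncts."""
    parts = [p.strip() for p in re.split(r",\s*(?:and\s+)?|\s+and\s+", phrase) if p.strip()]
    head = parts[-1].split()[-1]          # e.g. "litigation"
    return [p if p.endswith(head) else f"{p} {head}" for p in parts]

print(distribute_head("employment, tax, and copyright litigation"))
# ['employment litigation', 'tax litigation', 'copyright litigation']
```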
Have a look at CogComp NER tagger:
https://github.com/CogComp/cogcomp-nlp/tree/master/ner