Imagine there is a column in a dataset representing universities. We need to classify the values so that the number of groups after classification matches the real number of universities as closely as possible. The problem is that the same university may appear under different names, e.g. University of Stanford = Stanford University = Uni of Stanford. Is there a particular NLP method/function/solution for this in Python 3?
Let's consider both cases: the data might be labelled as well as unlabelled.
Thanks in advance.
A very simple unsupervised approach would be k-means clustering. The advantage here is that you know exactly how many clusters (k) to expect, since you know the number of universities in advance.
Then you could use a package such as scikit-learn to create your feature vectors (most likely character n-grams, using a CountVectorizer with analyzer='char') and use the clustering to group similarly written university names together.
There is no guarantee that the groups will match perfectly, but I think that it should work quite well, as long as the different spellings are somewhat similar.
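For illustration, here is a minimal sketch of that idea, assuming scikit-learn is available; the names and k = 2 are made-up examples, not part of the original question:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Made-up example names; in practice, read the column from your dataset.
names = [
    "University of Stanford", "Stanford University", "Uni of Stanford",
    "Harvard University", "University of Harvard",
]
k = 2  # known number of real universities

# Character n-grams are robust to word reordering and small spelling changes.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vectorizer.fit_transform(names)

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(names, labels):
    print(label, name)
```

Each printed label is a cluster id; names sharing a label are treated as the same university.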
If you see another question with the same wording as this one, please ignore it; it contains unnecessary code.
I made a very basic chatbot/program in Python that simulates ordering from a restaurant. I was wondering if there is any way to use Natural Language Processing (NLP) to find out whether two words mean the same thing? For example, how can NLP find out that "I'm feeling bad" means the same thing as "I'm feeling horrible"?
Actually, your question is quite complex. You can measure the distance between two words in an embedding space (also known as Word2Vec) using something like Euclidean distance or cosine similarity. (You can download pre-trained word2vec vectors, such as the Google News model.)
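As a minimal sketch of that idea (assuming gensim is installed and can download a pre-trained model; the GloVe model name below is just one convenient choice, not a requirement):

```python
import gensim.downloader as api

# Downloads a small set of pre-trained word vectors on first use.
model = api.load("glove-wiki-gigaword-50")

# Cosine similarity between the two words that differ in the example sentences;
# values closer to 1.0 mean the words are more similar.
print(model.similarity("bad", "horrible"))
```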
However, you ask about the similarity between the sentences "I'm feeling bad" and "I'm feeling horrible", which in this case is easy: comparing the two sentences shows they differ in only one place, the words "horrible" and "bad". The simplest way is to use a vocabulary containing sets of synonyms. This is what we call a rule-based system (I suggest this rule-based method, something like nested if-else statements).
Things get more complicated when the sentence structure differs; then you need an algorithm that looks not only at words but also at structure. You may need something like Word Mover's Distance (WMD) to measure similarity between sentences. Moreover, if you want to build a model that is not rule-based, you need a lot of data: a parallel corpus, which is simply pairs of sentences with similar meaning to train on. Statistical methods also need a lot of data to estimate the probabilities, though less than a deep learning model.
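For reference, a rough sketch of WMD with gensim (an assumption on my part: it requires gensim plus its optimal-transport dependency, pyemd or POT depending on the gensim version, and the same pre-trained vectors as above):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

s1 = "i am feeling bad".split()
s2 = "i am feeling horrible".split()

# Word Mover's Distance: lower values mean the sentences are closer in meaning.
print(model.wmdistance(s1, s2))
```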
Lastly, do not look down on rule-based methods: this is essentially what Apple's Siri uses right now, a very large rule base. You can also add complexity to your rule base while you improve your model. Hope this answers your question.
I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence:
I am XYZ I want to execute I have a doubt
And I would like to detect that a comma and a full stop should be inserted, like this:
I am XYZ, I want to execute. I have a doubt.
Can anyone advise me on how to achieve this using Python and NLP concepts?
If I understand correctly, you want to improve the quality of a sentence by adding the appropriate punctuation. This is sometimes called punctuation restoration.
A good first step is to apply the usual NLP pipeline, namely tokenization, POS tagging, and parsing, using libraries such as NLTK or Spacy.
Once this preprocessing is done, you'll have to apply a rule-based or a machine learning approach to define where the punctuation should be, based on the features extracted from the NLP pipeline (e.g. sentence boundaries, parsing tree, POS, etc.).
However, this is not a trivial task. It can require strong NLP/AI skills if you want to customise your algorithm.
Some examples that can be reused:
Here is a simple approach using spaCy, mainly based on sentence boundaries; a minimal sketch of that idea is included below.
Here is a more complex solution, using the Theano deep learning library.
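As an illustration of the spaCy, sentence-boundary route, here is a rough sketch (assuming spaCy and its small English model are installed). It is not the linked solution, it only adds full stops rather than commas, and the parser's sentence splits on unpunctuated text are only a heuristic:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

text = "I am XYZ I want to execute I have a doubt"
doc = nlp(text)

# Re-join the detected sentences, appending a full stop to each one.
restored = " ".join(sent.text.strip().rstrip(".") + "." for sent in doc.sents)
print(restored)
```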
What I am trying to do is add Multiple Choice Question (MCQ) generation to our fill-in-the-gap style question generator. I need to generate distractors (wrong answers) from the key (correct answer). The MCQs are generated from educational texts that users input. We're trying to tackle this by combining contextual similarity, similarity of the sentences in which the keys and the distractors occur, and differences in term frequencies. Any help? I was thinking of using big datasets to generate related distractors, such as the ones provided by Google Vision, but I have no clue how to achieve this in Python.
This question is far too broad to be answered fully, but I will do my best to give you some pointers.
If you have a closed set of potential distractors, I would use word/phrase embeddings to find the closest distractor to the right answer.
Gensim's word2vec is a good starting point in Python.
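For example, a minimal sketch with gensim's pre-trained vectors (the model name and the key word "photosynthesis" are made-up illustrations, not from the question):

```python
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")

key = "photosynthesis"  # the correct answer
# Nearest neighbours in embedding space are plausible distractor candidates.
for word, score in model.most_similar(key, topn=5):
    print(word, round(score, 3))
```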
If you want your distractors to follow a template, for example replacing a certain word from the right answer with its opposite, I would use NLTK's WordNet implementation to find antonyms/synonyms.
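A minimal sketch of that, assuming nltk is installed and the WordNet corpus has been downloaded (nltk.download("wordnet")); the helper function name is mine, purely for illustration:

```python
from nltk.corpus import wordnet

def synonyms_and_antonyms(word):
    syns, ants = set(), set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            syns.add(lemma.name())
            for ant in lemma.antonyms():
                ants.add(ant.name())
    return syns, ants

print(synonyms_and_antonyms("correct"))
```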
I have a task to classify an unseen movie review as either a positive review or a negative one. I have two folders, neg and pos, each containing 1,000 files which are movie reviews that have already been classified.
So far, I have loaded the positive reviews and stored each word in a dictionary along with the frequency at which it occurs. I then divided each word's frequency by the total number of words in the positive folder's files. I have done the same thing with the negative folder.
I am currently stuck as to where to go next. In the end I will have to load an unseen review and determine whether it is positive or negative. I am not looking for any code, just guidance as to what I need to do next to achieve this. Any help is greatly appreciated, thanks!
The problem you are describing is a typical sentiment analysis problem, and what you've built from the reviews is a language model in (word, probability) format. I suggest you watch Professor Dan Jurafsky's video series on sentiment analysis, part of a Stanford course on NLP, here. Another great practical tutorial by Harrison Kinsley on NLTK [a Python module for NLP-related tasks] will show you how to use NLTK along with scikit-learn [a popular Python module for ML tasks] to do the classification using a Naive Bayes classifier, among others.
The best guidance here might be the Udacity ML course... They use the excellent scikit-learn library to classify emails using Naive Bayes, specifically the Gaussian flavour of NB; this sounds exactly like the problem you have:
https://www.udacity.com/course/intro-to-machine-learning--ud120
If you are already comfortable with the concepts and you are happy to use SK-learn then jump straight to the docs here:
http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes
Fitting the model and then making predictions is actually trivial with SK-learn, once you have the data in the right form.
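A minimal sketch of that with scikit-learn, assuming the reviews have already been read from the two folders into Python lists of strings (the variable names and placeholder texts are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pos_reviews = ["great film, loved it"]          # placeholder for the 1,000 pos files
neg_reviews = ["terrible plot, awful acting"]   # placeholder for the 1,000 neg files

texts = pos_reviews + neg_reviews
labels = ["pos"] * len(pos_reviews) + ["neg"] * len(neg_reviews)

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["an unseen review that was surprisingly good"]))
```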
I am trying to use phonetic algorithms like Soundex and/or Metaphone to generate words that sound similar to a given dictionary word. Do I need a corpus of all dictionary words to do that? Is there another way to generate words that sound similar to a given word without using a corpus? I am trying to do this in Python.
If you don't use a corpus, then you will probably have to manually define a set of rules to split a word into phonetic parts and then find a list of close phonemes. This can generate similar-sounding words, but most of them won't exist. If you want to generate close-sounding words that actually exist, then you necessarily need a corpus.
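With a corpus, a minimal sketch could look like this (assuming the jellyfish package for the phonetic codes and NLTK's "words" corpus as the word list, after nltk.download("words"); both choices are assumptions, any phonetic library and word list would do):

```python
import jellyfish
from nltk.corpus import words

def similar_sounding(word, vocab):
    # Words sharing the same Metaphone code sound roughly alike.
    code = jellyfish.metaphone(word)
    return [w for w in vocab if jellyfish.metaphone(w) == code and w != word]

vocab = words.words()  # the corpus of existing English words
print(similar_sounding("night", vocab)[:10])
```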
You didn't specify the goal of your task, but you may be interested in the work of Will Leben ("Sounder I", II and III) and in Jabberwocky sentences.