I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence:
I am XYZ I want to execute I have a doubt
And I would like to detect that there should be one comma and two full stops in the above example:
I am XYZ, I want to execute. I have a doubt.
Can anyone advise me on how to achieve this using Python and NLP concepts?
If I understand well, you want to improve the quality of a sentence by adding the appropriate punctuation. This is sometimes called punctuation restoration.
A good first step is to apply the usual NLP pipeline, namely tokenization, POS tagging, and parsing, using libraries such as NLTK or Spacy.
Once this preprocessing is done, you'll have to apply a rule-based or a machine learning approach to define where the punctuation should be, based on the features extracted from the NLP pipeline (e.g. sentence boundaries, parsing tree, POS, etc.).
However, this is not a trivial task. It can require strong NLP/AI skills if you want to customise your algorithm.
Some examples that can be reused:
Here is a simple approach using Spacy, mainly based on sentence boundaries.
Here is a more complex solution, using the Theano deep learning library.
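As a rough illustration of the sentence-boundary idea mentioned above, here is a minimal sketch using spaCy. It assumes the en_core_web_sm model is installed, and note that spaCy's default segmentation relies partly on punctuation, so its boundaries on unpunctuated text will be approximate at best:

```python
# Minimal sketch: add a full stop at every sentence boundary spaCy detects.
# Assumes spaCy and the "en_core_web_sm" model are installed; on text with
# no punctuation the detected boundaries will only be approximate.
import spacy

nlp = spacy.load("en_core_web_sm")

def restore_full_stops(text):
    doc = nlp(text)
    # Join the detected sentences, appending a period to each one.
    return " ".join(sent.text.strip() + "." for sent in doc.sents)

print(restore_full_stops("I am XYZ I want to execute I have a doubt"))
```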
If you see another question with the same wording as this one, please ignore it; it has unnecessary code.
I made a very basic chatbot/program in Python that simulates ordering from a restaurant. I was wondering if there is any way to use Natural Language Processing (NLP) to find out whether two words mean the same thing. For example, how can NLP tell that "I'm feeling bad" means the same thing as "I'm feeling horrible"?
Actually, your question is quite complex. You can measure the distance between two words in an embedding space such as Word2Vec, using something like Euclidean distance or cosine similarity (you can download pre-trained Word2Vec vectors, such as the Google News model).
However, you mention the similarity between the sentences "I'm feeling bad" and "I'm feeling horrible", and this case is easy: comparing the two sentences shows that they differ in only one part, the words "horrible" and "bad". The simplest way is to use a vocabulary containing sets of synonyms. This is what we call a rule-based system (I suggest this rule-based method, essentially nested if-else statements).
Things get more complicated when the sentence structure differs; then you need an algorithm that compares not only the words but also the structure. You may need something like Word Mover's Distance (WMD) to measure similarity between sentences. Moreover, if you want to build a model that is not rule-based, you need a lot of data, called a parallel corpus, which is simply pairs of sentences with similar meanings to train on. A statistical method also needs a lot of data to estimate probabilities, but less than a deep learning model.
Lastly, do not look down on the rule-based method: this is what Apple's Siri uses right now, a very large rule base. You can also add complexity to your rule base while you improve your model. Hope this answers your question.
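For instance, here is a small sketch of the embedding-distance idea, assuming gensim is installed; the "glove-wiki-gigaword-50" model is just one of gensim's downloadable pre-trained models, chosen only for illustration:

```python
# Sketch: compare two words, and two short sentences, with pre-trained
# embeddings via gensim. Assumes gensim is installed; the model is
# downloaded on first use and is chosen purely for illustration.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")

# Cosine similarity between single words.
print(model.similarity("bad", "horrible"))

# A crude sentence comparison: n_similarity compares the mean word vectors
# of the two token lists (fancier methods such as WMD do better).
s1 = "i am feeling bad".split()
s2 = "i am feeling horrible".split()
print(model.n_similarity(s1, s2))
```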
Imagine there is a column in a dataset representing universities. We need to classify the values, i.e. the number of groups after classification should be as close as possible to the real number of universities. The problem is that there might be different names for the same university. An example: University of Stanford = Stanford University = Uni of Stanford. Is there a particular NLP method/function/solution for this in Python 3?
Let's consider both cases: data might be tagged as well as untagged.
Thanks in advance.
A very simple unsupervised approach would be to use a k-means based approach. The advantage here is that you know exactly how many clusters (k) you expect, since you know the number of universities in advance.
Then you could use a package such as scikit-learn to create your feature vectors (most likely character n-grams, using a CountVectorizer with analyzer='char') and use the clustering to group together similarly written university names.
There is no guarantee that the groups will match perfectly, but I think that it should work quite well, as long as the different spellings are somewhat similar.
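For example, here is a minimal sketch of that idea with scikit-learn; the university names and the value of k are made up purely for illustration:

```python
# Sketch: cluster university name variants using character n-grams and
# k-means. Assumes scikit-learn is installed; names and k are illustrative.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

names = [
    "University of Stanford",
    "Stanford University",
    "Uni of Stanford",
    "Massachusetts Institute of Technology",
    "Massachusetts Inst. of Technology",
]

# Character n-grams tolerate abbreviations and word reordering.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
X = vectorizer.fit_transform(names)

# k = the number of distinct universities you expect (here, 2).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
for name, label in zip(names, kmeans.fit_predict(X)):
    print(label, name)
```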
I am using NLTK in Python. I understand that it uses regular expressions in its word tokenization functions, such as TreebankWordTokenizer.tokenize(), but that it uses trained models (pickle files) for sentence tokenization. I don't understand why a trained model isn't used for word tokenization as well. Does this imply that sentence tokenization is a harder task?
I'm not sure if you can say that sentence splitting is harder than (word) tokenisation. But tokenisation depends on sentence splitting, so errors in sentence splitting will propagate to tokenisation. Therefore you'd want to have reliable sentence splitting, so that you don't have to make up for it in tokenisation. And it turns out that once you have good sentence splitting, tokenisation works pretty well with regexes.
Why is that? – One of the major ambiguities in tokenisation (in Latin script languages, at least) is the period ("."): It can be a full stop (thus a token of its own), an abbreviation mark (belonging to that abbreviation token), or something special (like part of a URL, a decimal fraction, ...). Once the sentence splitter has figured out the first case (full stops), the tokeniser can concentrate on the rest. And identifying stuff like URLs is exactly what you would use a regex for, isn't it?
The sentence splitter's main job, on the other hand, is to find abbreviations with a period. You can create a list for that by hand – or you can train it on a big text collection. The good thing is, it's unsupervised training – you just feed in the plain text, and the splitter collects abbreviations. The intuition is: If a token almost always appears with a period, then it's probably an abbreviation.
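To make that last point concrete, here is a small sketch of the unsupervised training using NLTK's Punkt trainer; the tiny sample text is made up, and in practice you would feed in a large plain-text corpus:

```python
# Sketch: train a Punkt sentence splitter on raw text so it learns
# abbreviations like "Dr." on its own. Assumes NLTK is installed; the
# corpus here is far too small for good results and is only illustrative.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

corpus = (
    "Dr. Smith arrived at 9 a.m. and left at noon. "
    "Dr. Jones joined later. They discussed the results with Dr. Lee."
)

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True  # also learn frequent collocations
trainer.train(corpus)

# Build a tokenizer from the learned parameters and split new text.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Dr. Smith met Dr. Jones. They talked for an hour."))
```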
What I am trying to do is add Multiple Choice Question (MCQ) generation to our fill-in-the-gap style question generator. I need to generate distractors (wrong answers) from the key (correct answer). The MCQs are generated from educational texts that users input. We're trying to tackle this by combining contextual similarity, the similarity of the sentences in which the keys and the distractors occur, and differences in term frequencies. Any help? I was thinking of using big datasets to generate related distractors, such as the ones provided by Google Vision, but I have no clue how to achieve this in Python.
This question is way too broad to be answered, though I would do my best to give you some pointers.
If you have a closed set of potential distractors, I would use word/phrase embedding to find the closest distractor to the right answer.
Gensim's word2vec is a good starting point in Python.
If you want your distractors to follow a template, for example replacing a certain word in the right answer with its opposite, I would use NLTK's WordNet implementation to find antonyms/synonyms.
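As a rough sketch of both ideas, assuming gensim and NLTK are installed and the WordNet data has been downloaded; the model name and example words are only illustrative:

```python
# Sketch: (1) embedding-based distractors close to the key,
# (2) WordNet antonyms for template-style distractors.
# Assumes gensim and NLTK are installed and nltk.download("wordnet") was run;
# the model name and example words are illustrative only.
import gensim.downloader as api
from nltk.corpus import wordnet

# 1) Words near the key in embedding space make plausible distractors.
model = api.load("glove-wiki-gigaword-50")
print([w for w, _ in model.most_similar("photosynthesis", topn=5)])

# 2) Antonyms of a word from the key, via WordNet.
antonyms = {
    ant.name()
    for syn in wordnet.synsets("increase")
    for lemma in syn.lemmas()
    for ant in lemma.antonyms()
}
print(antonyms)
```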
I am trying to use phonetic algorithms like Soundex and/or Metaphone to generate words that sound similar to a given dictionary word. Do I have to have a corpus of all dictionary words for doing that? Is there another way to generate words that sound similar to a given word without using a corpus? I am trying to do it in Python.
If you don't use a corpus, then you will probably have to manually define a set of rules to split a word into phonetic parts and then find the list of close phonemes. This can generate similar-sounding words, but most of them won't exist. If you want to generate close-sounding words that actually exist, then you necessarily need a corpus.
You didn't specify the goal of your task, but you may be interested in Will Leben's work on "Sounder I" (and II and III) and in Jabberwocky sentences.
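For the corpus-based route mentioned above, here is a minimal sketch using the third-party jellyfish library, which provides Soundex and Metaphone; the tiny word list stands in for a real dictionary:

```python
# Sketch: index a word list by Metaphone code, then look up words that
# share a code with the query. Assumes the "jellyfish" package is installed;
# the word list is a stand-in for a full dictionary corpus.
from collections import defaultdict
import jellyfish

dictionary = ["night", "knight", "nite", "knit", "note", "nit"]

by_code = defaultdict(list)
for word in dictionary:
    by_code[jellyfish.metaphone(word)].append(word)

def sounds_like(word):
    """Existing words whose Metaphone code matches the query's."""
    return [w for w in by_code[jellyfish.metaphone(word)] if w != word]

print(sounds_like("night"))
```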