Python: Naive Bayes movie review [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
I have a task to classify an unseen movie review as either positive or negative. I have two folders, neg and pos, each containing 1,000 files of movie reviews that have already been classified.
So far, I have loaded the positive reviews and stored each word in a dictionary along with the number of times it occurs. I then divided each word's frequency by the total number of words across all files in the positive folder. I have done the same for the negative folder.
I am currently stuck on where to go next. In the end I will have to load an unseen review and determine whether it is positive or negative. I am not looking for any code, just guidance on what I need to do next to achieve this. Any help is greatly appreciated, thanks!

The problem you are describing is a typical sentiment analysis problem, and what you have built from the reviews is a unigram language model in (word, probability) format, one per class. I suggest you watch Professor Dan Jurafsky's video series on sentiment analysis, part of a Stanford course on NLP, here. Another great practical tutorial, by Harrison Kinsley on NLTK (a Python module for NLP-related tasks), shows how to use NLTK along with scikit-learn (a popular Python module for ML tasks) to do the classification using a Naive Bayes classifier and many others.
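Building on the per-class word frequencies described in the question, the usual next step in Naive Bayes is to score an unseen review under each class by adding the log of the class prior to the log-probability of every word, then pick the class with the higher score. The following is only a minimal sketch under some assumptions: the function names, the toy documents, and the use of add-one (Laplace) smoothing are illustrative choices, not something taken from the question.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a label to a list of tokenized documents."""
    counts = {
        label: Counter(word for doc in docs for word in doc)
        for label, docs in docs_by_class.items()
    }
    vocab = set().union(*counts.values())
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors = {label: len(docs) / total_docs for label, docs in docs_by_class.items()}
    return counts, vocab, priors


def classify(tokens, counts, vocab, priors):
    """Return the label with the highest log-probability score."""
    scores = {}
    for label, word_counts in counts.items():
        total_words = sum(word_counts.values())
        score = math.log(priors[label])
        for word in tokens:
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one smoothing keeps unseen (word, class) pairs from
            # producing a zero probability.
            score += math.log((word_counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data standing in for the pos and neg folders.
toy_docs = {
    "pos": [["great", "fun", "film"], ["great", "acting"]],
    "neg": [["boring", "film"], ["terrible", "boring", "plot"]],
}
counts, vocab, priors = train(toy_docs)
label_a = classify(["great", "film"], counts, vocab, priors)
label_b = classify(["boring", "plot"], counts, vocab, priors)
```

Summing logs rather than multiplying raw probabilities avoids numeric underflow on long reviews.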

The best guidance here might be the Udacity ML course. They use the excellent scikit-learn library to classify emails using Naive Bayes, specifically the Gaussian flavour of NB; this sounds very close to the problem you have:
https://www.udacity.com/course/intro-to-machine-learning--ud120
If you are already comfortable with the concepts and are happy to use scikit-learn, jump straight to the docs for the multinomial variant, which is the usual choice for word counts:
http://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes
Fitting the model and then making predictions is actually trivial with scikit-learn, once you have the data in the right form.
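As a rough illustration of how little code the fitting and prediction take, here is a sketch using scikit-learn's CountVectorizer and MultinomialNB. The four training sentences and the test sentence are made up for the example; in practice the texts would be read from the pos and neg folders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the contents of the pos and neg folders.
train_texts = [
    "a wonderful, moving film with great acting",
    "great story and wonderful cast",
    "a dull, boring film with terrible acting",
    "terrible plot and boring dialogue",
]
train_labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer turns each text into word counts;
# MultinomialNB fits a Naive Bayes model on those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

prediction = model.predict(["wonderful story, great cast"])[0]
```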

Related

Using machine learning to detect fish spawning in audio files [closed]

Closed 8 months ago.
My friend is doing his thesis on fish spawning in rivers. For this, he collects hours of audio, which he then analyzes manually in Audacity, looking for specific patterns in the spectrograms that might indicate the sound of fish spawning.
Since he has days' worth of data, I set myself a challenge: to create an algorithm that might help him detect these patterns.
I am fairly new to machine learning, but I am a junior in programming, and this sounds like a fun learning experience.
I identify the main problems as:
Samples are 1 hour in length.
Noise in the background (such as cars and the river).
Is this achievable with machine learning, or should I look into other options? If so, which ones?
Thank you for taking the time to read!
The first step would be to convert the sound signals into features that a machine learning algorithm can work with. Maybe look into MFCCs (mel-frequency cepstral coefficients) for that.
Given an appropriate feature representation of your problem domain, the main thing to consider is what kind of machine learning algorithm to apply. Unless you would like to sit and annotate hours of data, naive supervised learning is out of the window.
I think your best bet would be to adapt VAD (voice activity detection) algorithms or, better yet, speaker recognition/identification models.
You could also approach it by first building a representation rich enough to let you "see" the target sound, and then comparing it with every frame of that length in the test data. It might be useful to check out DTW (dynamic time warping) for this.
If you have not designed such models before, it will be a bit difficult and might take quite a long time.
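To make the DTW suggestion a little more concrete, here is a small dependency-free sketch of the classic dynamic-programming formulation for two one-dimensional sequences. The sequences are invented toy numbers, not real audio features; with real data each element would more likely be a feature vector (for example one MFCC frame) and the absolute difference would be replaced by a vector distance.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b element repeated
                cost[i][j - 1],      # b advances, a element repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical toy sequences: a template, a slowed-down copy, and an unrelated signal.
template = [0, 1, 2, 1, 0]
stretched = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]
unrelated = [2, 2, 2, 2, 2]

d_stretched = dtw_distance(template, stretched)
d_unrelated = dtw_distance(template, unrelated)
```

The slowed-down copy aligns with the template at zero cost, which is the property that makes DTW attractive when the same sound can occur at different speeds.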

How to use NLP to find out if two words have the same definition? [closed]

Closed 2 years ago.
If you see another question with the same wording as this one, please ignore it; it has unnecessary code.
I made a very basic chatbot/program in Python that simulates ordering from a restaurant. I was wondering if there is any way to use Natural Language Processing (NLP) to find out if two words are the same? For example, how can NLP find out that "I'm feeling bad" means the same thing as "I'm feeling horrible" ?
Actually, your question is quite complex. For individual words, you can measure how far apart two words are in an embedding space, such as one produced by Word2Vec, using something like Euclidean distance or cosine similarity. (You can download pre-trained Word2Vec vectors, such as the Google News ones.)
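For illustration, cosine similarity between two word vectors can be computed as below. The three-dimensional vectors are invented for the example and are not real Word2Vec output; real pre-trained vectors typically have hundreds of dimensions and would be loaded from a file.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, made up for this sketch.
toy_vectors = {
    "bad": [0.9, 0.1, 0.0],
    "horrible": [0.8, 0.2, 0.1],
    "pizza": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["bad"], toy_vectors["horrible"])
sim_far = cosine_similarity(toy_vectors["bad"], toy_vectors["pizza"])
```

A value near 1 means the vectors point in nearly the same direction; a value near 0 means they are unrelated.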
However, you mention the similarity between the sentences "I'm feeling bad" and "I'm feeling horrible". This case is easy: compare the two sentences and you will find that they differ in only one place, the words "bad" and "horrible". The simplest way to handle this is a vocabulary that contains sets of synonyms. This is what we call a rule-based system (I suggest this rule-based method, something like nested if-else statements).
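A minimal sketch of such a rule-based comparison might look like this. The synonym groups and the function names are hypothetical; a real chatbot would need a much larger vocabulary and proper tokenization.

```python
# Hypothetical synonym groups; extend these for a real application.
SYNONYM_GROUPS = [
    {"bad", "horrible", "awful", "terrible"},
    {"good", "great", "fine"},
]


def canonical(word):
    """Map a word to one fixed representative of its synonym group."""
    for group in SYNONYM_GROUPS:
        if word in group:
            return min(group)  # any deterministic choice works
    return word


def same_meaning(sentence_a, sentence_b):
    """True if the sentences match word by word after synonym replacement."""
    tokens_a = [canonical(w) for w in sentence_a.lower().split()]
    tokens_b = [canonical(w) for w in sentence_b.lower().split()]
    return tokens_a == tokens_b


match = same_meaning("I'm feeling bad", "I'm feeling horrible")
mismatch = same_meaning("I'm feeling bad", "I'm feeling good")
```

This only handles sentences with identical structure, which is exactly the easy case described above.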
Things get more complicated when the sentence structure differs, because you then need an algorithm that compares not only words but also structure. You may need something like WMD (Word Mover's Distance) to measure the similarity between sentences. Moreover, if you want to build a model that is not rule-based, you need a lot of training data: a parallel corpus, which is simply a collection of sentence pairs that have similar meanings. Statistical methods also need a lot of data to estimate the probabilities, though less than a deep learning model.
Lastly, do not look down on the rule-based method; this is said to be what Apple's Siri uses right now, a very large rule base. You can also add complexity to your rule base as you improve your model. I hope this answers your question.

How to restore punctuation using Python? [closed]

Closed 3 years ago.
I would like to restore commas and full stops in text without punctuation. For example, let's take this sentence:
I am XYZ I want to execute I have a doubt
And I would like to detect that there should be one comma and two full stops in the above example:
I am XYZ, I want to execute. I have a doubt.
Can anyone advise me on how to achieve this using Python and NLP concepts?
If I understand correctly, you want to improve the quality of a text by adding the appropriate punctuation. This is sometimes called punctuation restoration.
A good first step is to apply the usual NLP pipeline, namely tokenization, POS tagging, and parsing, using libraries such as NLTK or spaCy.
Once this preprocessing is done, you will have to apply a rule-based or a machine learning approach to decide where the punctuation should go, based on the features extracted from the NLP pipeline (e.g. sentence boundaries, parse tree, POS tags, etc.).
However, this is not a trivial task. It can require strong NLP/AI skills if you want to customise your algorithm.
Some examples that can be reused:
Here is a simple approach using spaCy, mainly based on sentence boundaries.
Here is a more complex solution, using the Theano deep learning library.

Text clustering/NLP [closed]

Closed 4 years ago.
Imagine a dataset with a column representing the university. We need to group the values so that the number of groups is as close as possible to the real number of universities. The problem is that the same university may be named in different ways, for example: University of Stanford = Stanford University = Uni of Stanford. Is there an NLP method, function, or solution for this in Python 3?
Let's consider both cases: the data might be tagged or untagged.
Thanks in advance.
A very simple unsupervised approach would be k-means clustering. The advantage here is that you know exactly how many clusters (k) to expect, since you know the number of universities in advance.
Then you could use a package such as scikit-learn to create your feature vectors (most likely character n-grams, using a CountVectorizer with the option analyzer='char') and use the clustering to group together similarly written university names.
There is no guarantee that the groups will match perfectly, but I think that it should work quite well, as long as the different spellings are somewhat similar.
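A sketch of this idea with scikit-learn could look as follows. The list of names, the value of k, the n-gram range of 2 to 3 characters, and the fixed random_state are all assumptions made for the example; as noted above, the resulting groups are not guaranteed to match the real universities.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical column values; in practice these come from the dataset.
names = [
    "University of Stanford",
    "Stanford University",
    "Uni of Stanford",
    "Harvard University",
    "University of Harvard",
    "Harvard Uni",
]
k = 2  # assumed known number of distinct universities

# Character n-grams make differently ordered or abbreviated names look similar.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3))
features = vectorizer.fit_transform(names)

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)

# Collect the names that ended up in each cluster.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)
```

Inspecting clusters by hand on a sample is a reasonable way to check whether the chosen n-gram range separates the names well enough.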

What are the steps to do handwritten character recognition in Python using OpenCV and scikit-learn? [closed]

Closed 7 years ago.
My project is recognition of handwritten Tamil characters using Python, OpenCV, and scikit-learn.
Input: images of handwritten Tamil characters.
Output: the recognised characters in a text file.
What are the basic steps for this project?
I know of three steps:
preprocessing, feature point extraction, and classification.
But I don't know exactly how to proceed with this project:
How do I do the preprocessing?
Where do I store the training data set images?
How do I extract feature points in OpenCV?
How do I implement this?
Please help.
I am working on a similar project, handwritten Arabic character recognition and generation, but I have not used OpenCV so far. With OpenCV you apply filters to the image and process it, and you get a processed image of the same size as the result every time. But in Arabic there is so much variation in every character that OpenCV is of no use for that purpose.
For your problem, I have some suggestions and helpful material too. Before starting, you have to do a lot of research on character recognition. Read the research papers of Alex Graves; he has done a lot of work on character recognition and generation. It will help you a lot.
I am using a neural network for this purpose. Initially it is a bit difficult to understand, but once you understand it, the rest follows. Python is a very good language for this too. I have a lot of material for learning about neural networks and how to train one on your dataset, and I have shared some useful links below:
Alex Graves's Profile: http://www.cs.toronto.edu/~graves/
Neural Network Understanding: http://nikhilbuduma.com/2014/12/29/deep-learning-in-a-nutshell/
Video: https://www.youtube.com/watch?v=q0pm3BrIUFo
Neural Network Code In Python: http://iamtrask.github.io/2015/07/12/basic-python-network/
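For orientation, the three steps named in the question (preprocessing, feature extraction, classification) can be sketched end to end with scikit-learn alone. This is only a stand-in under clear assumptions: it uses scikit-learn's bundled 8x8 handwritten digits instead of Tamil characters, raw pixels instead of engineered feature points, and an SVM instead of the neural network approach described above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: 8x8 grayscale images of handwritten digits.
digits = load_digits()

# Preprocessing: scale pixel intensities from the 0..16 range to 0..1.
images = digits.images / 16.0

# Feature extraction: flatten each 8x8 image into a 64-value vector.
features = images.reshape(len(images), -1)

# Hold out a quarter of the samples to estimate accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    features, digits.target, test_size=0.25, random_state=0
)

# Classification: fit a support vector machine and score it on the held-out set.
classifier = SVC(kernel="rbf", gamma="scale")
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)
```

For a real character set, the training images would usually live in one folder per class on disk, and the flattening step would be replaced by whatever features or network input the chosen method needs.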
Hope it helps you.
Thanks
