Sentiment analysis for Twitter in Python [closed]

I'm looking for an open-source implementation, preferably in Python, of textual sentiment analysis (http://en.wikipedia.org/wiki/Sentiment_analysis).
I'm writing an application that searches Twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets.
I'm using Google App Engine, so it's in Python. I'd like to be able to classify the search results returned from Twitter, and I'd like to do that in Python.
I haven't been able to find such a sentiment analyzer so far, specifically not in Python.
Are you familiar with an open-source implementation I can use? Preferably it's already in Python, but if not, hopefully I can translate it to Python.
Note that the texts I'm analyzing are VERY short: they are tweets. So ideally, this classifier would be optimized for such short texts.
BTW, Twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately the classification they provide isn't that great, so I figured I might give this a try myself.
Thanks!
BTW, an early demo is here and the code I have so far is here, and I'd love to open-source it with any interested developer.

Good luck with that.
Sentiment is enormously contextual, and tweeting culture makes the problem worse because you aren't given the context for most tweets. The whole point of twitter is that you can leverage the huge amount of shared "real world" context to pack meaningful communication in a very short message.
If they say the video is bad, does that mean bad, or "bad" (as in good)?
A linguistics professor was lecturing to her class one day. "In English," she said, "a double negative forms a positive. In some languages, though, such as Russian, a double negative is still a negative. However, there is no language wherein a double positive can form a negative."
A voice from the back of the room piped up, "Yeah... right."

With most of these kinds of applications, you'll have to roll much of your own code for a statistical classification task. As Lucka suggested, NLTK is the perfect tool for natural language manipulation in Python, so long as your goal doesn't interfere with the non-commercial nature of its license. However, I would suggest other software packages for modeling. I haven't found many strong advanced machine learning models available for Python, so I'm going to suggest some standalone binaries that cooperate with it easily.
You may be interested in The Toolkit for Advanced Discriminative Modeling, which can be easily interfaced with Python. This has been used for classification tasks in various areas of natural language processing. You also have a pick of a number of different models. I'd suggest starting with Maximum Entropy classification so long as you're already familiar with implementing a Naive Bayes classifier. If not, you may want to look into it and code one up to really get a decent understanding of statistical classification as a machine learning task.
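If you decide to code up the Naive Bayes baseline first, a minimal sketch with NLTK's built-in classifier (just to show the shape of the task) could look like the following; the feature extractor and the hand-labelled example tweets are placeholders for a real seed set.

    import nltk

    # Hypothetical hand-labelled seed set; in practice this needs to be much larger.
    train_tweets = [
        ("I love the new youtube layout", "happy"),
        ("this video made my day", "happy"),
        ("youtube keeps buffering, so annoying", "sad"),
        ("worst update ever", "sad"),
    ]

    def tweet_features(text):
        # Simple bag-of-words presence features.
        return {word.lower(): True for word in text.split()}

    train_set = [(tweet_features(text), label) for text, label in train_tweets]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(tweet_features("I really love this")))
    classifier.show_most_informative_features(5)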
The University of Texas at Austin computational linguistics group has held classes in which most of the projects used this tool. You can look at the course page for Computational Linguistics II to get an idea of how to make it work and what previous applications it has served.
Another great tool that works in the same vein is Mallet. The difference is that Mallet has a bit more documentation and some more models available, such as decision trees, and it's in Java, which, in my opinion, makes it a little slower. Weka is a whole suite of different machine learning models in one big package that includes some graphical stuff, but it's really mostly meant for pedagogical purposes, and isn't really something I'd put into production.
Good luck with your task. The really difficult part will probably be the amount of knowledge engineering required up front to label the 'seed set' from which your model will learn. It needs to be pretty sizeable, depending on whether you're doing binary classification (happy vs. sad) or a whole range of emotions (which will require even more). Make sure to hold out some of this engineered data for testing, or run some ten-fold or leave-one-out tests to make sure you're actually doing a good job predicting before you put it out there. And most of all, have fun! This is the best part of NLP and AI, in my opinion.

Thanks everyone for your suggestions, they were indeed very useful!
I ended up using a Naive Bayesian classifier, which I borrowed from here.
I started by feeding it a list of good/bad keywords and then added a "learn" feature based on user feedback. It turned out to work pretty nicely.
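For anyone curious, the borrowed classifier isn't reproduced here, but the general idea (seed a Naive Bayes model with keyword lists, then keep training it on user feedback) can be sketched roughly like this; all names and examples below are made up.

    import math
    from collections import defaultdict

    class TinyBayes:
        """Very small Naive Bayes text classifier with add-one smoothing."""
        def __init__(self):
            self.word_counts = {"happy": defaultdict(int), "sad": defaultdict(int)}
            self.label_counts = {"happy": 0, "sad": 0}

        def train(self, text, label):
            self.label_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1

        def classify(self, text):
            words = text.lower().split()
            total = sum(self.label_counts.values())
            best_label, best_score = None, float("-inf")
            for label in self.label_counts:
                vocab = len(self.word_counts[label]) + 1
                n = sum(self.word_counts[label].values())
                # log prior + sum of smoothed log likelihoods
                score = math.log((self.label_counts[label] + 1) / (total + 2))
                for word in words:
                    score += math.log((self.word_counts[label].get(word, 0) + 1) / (n + vocab))
                if score > best_score:
                    best_label, best_score = label, score
            return best_label

    clf = TinyBayes()
    # Seed with keyword lists.
    for word in ["love", "great", "awesome", ":)"]:
        clf.train(word, "happy")
    for word in ["hate", "awful", "terrible", ":("]:
        clf.train(word, "sad")

    print(clf.classify("I love this video"))
    # "Learn" feature: feed user corrections back in as new training examples.
    clf.train("this video is terrible", "sad")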
The full details of my work are in a blog post.
Again, your help was very useful, so thank you!

I have constructed a word list labeled with sentiment. You can access it from here:
http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6010/zip/imm6010.zip
You will find a short Python program on my blog:
http://finnaarupnielsen.wordpress.com/2011/06/20/simplest-sentiment-analysis-in-python-with-af/
This post displays how to use the word list with single sentences as well as with Twitter.
Word-list approaches have their limitations. You will find an investigation of the limitations of my word list in the article "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs". That article is available from my homepage.
Please note that a unicode(s, 'utf-8') call is missing from the code (for pedagogical reasons).
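A minimal sketch of the word-list approach, assuming an AFINN-style file with one tab-separated word and integer score per line (the filename here is a placeholder):

    import re

    # Load an AFINN-style word list: each line is "word<TAB>score".
    afinn = {}
    with open("AFINN-111.txt", encoding="utf-8") as f:
        for line in f:
            word, score = line.strip().split("\t")
            afinn[word] = int(score)

    def sentiment(text):
        """Sum the scores of the words that appear in the word list."""
        words = re.findall(r"[a-z']+", text.lower())
        return sum(afinn.get(word, 0) for word in words)

    print(sentiment("This is utterly excellent!"))   # positive score
    print(sentiment("I hate this stupid video"))     # negative score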

A lot of research papers indicate that a good starting point for sentiment analysis is looking at adjectives, e.g., are they positive adjectives or negative adjectives. For a short block of text this is pretty much your only option... There are papers that look at entire documents, or sentence level analysis, but as you say tweets are quite short... There is no real magic approach to understanding the sentiment of a sentence, so I think your best bet would be hunting down one of these research papers and trying to get their data-set of positively/negatively oriented adjectives.
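As a rough illustration (not any particular paper's method), you could POS-tag the tweet with NLTK and score only the adjectives against a seed lexicon; the word lists below are placeholders you would replace with a published data set.

    import nltk
    # Requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    # Placeholder seed lists; swap in an adjective lexicon from the literature.
    POSITIVE = {"good", "great", "awesome", "funny", "brilliant"}
    NEGATIVE = {"bad", "awful", "boring", "terrible", "annoying"}

    def adjective_score(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        score = 0
        for word, tag in tagged:
            if tag.startswith("JJ"):          # adjectives: JJ, JJR, JJS
                if word.lower() in POSITIVE:
                    score += 1
                elif word.lower() in NEGATIVE:
                    score -= 1
        return score

    print(adjective_score("That was a brilliant and funny video"))
    print(adjective_score("What a boring, awful clip"))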
Now, this having been said, sentiment is domain specific, and you might find it difficult to get a high level of accuracy with a general-purpose data set.
Good luck.

I think you may find it difficult to find what you're after. The closest thing that I know of is LingPipe, which has some sentiment analysis functionality and is available under a limited kind of open-source licence, but is written in Java.
Also, sentiment analysis systems are usually developed by training a system on product/movie review data which is significantly different from the average tweet. They are going to be optimised for text with several sentences, all about the same topic. I suspect you would do better coming up with a rule-based system yourself, perhaps based on a lexicon of sentiment terms like the one the University of Pittsburgh provide.
Check out We Feel Fine for an implementation of a similar idea with a really beautiful interface (and twitrratr).

Take a look at the Twitter sentiment analysis tool. It's written in Python, and it uses a Naive Bayes classifier with semi-supervised machine learning. The source can be found here.

Maybe TextBlob (based on NLTK and pattern) is the right sentiment analysis tool for you.
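Usage is about as simple as it gets; polarity runs from -1 (negative) to 1 (positive). A tiny sketch:

    from textblob import TextBlob

    blob = TextBlob("The new youtube design is really nice")
    print(blob.sentiment)            # Sentiment(polarity=..., subjectivity=...)
    print(blob.sentiment.polarity)   # > 0 suggests a positive tweet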

I came across Natural Language Toolkit a while ago. You could probably use it as a starting point. It also has a lot of modules and addons, so maybe they already have something similar.

Somewhat wacky thought: you could try using the Twitter API to download a large set of tweets, and then classifying a subset of that set using emoticons: one positive group for ":)", ":]", ":D", etc, and another negative group with ":(", etc.
Once you have that crude classification, you could search for more clues with frequency or ngram analysis or something along those lines.
It may seem silly, but serious research has been done on this (search for "sentiment analysis" and emoticon). Worth a look.
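A minimal sketch of the labelling step (the tweet list below stands in for whatever Twitter API call you use):

    POSITIVE_EMOTICONS = (":)", ":]", ":D", ":-)")
    NEGATIVE_EMOTICONS = (":(", ":[", ":-(")

    def label_by_emoticon(tweet):
        """Crude distant labelling: treat emoticons as noisy sentiment labels."""
        if any(e in tweet for e in POSITIVE_EMOTICONS):
            return "happy"
        if any(e in tweet for e in NEGATIVE_EMOTICONS):
            return "sad"
        return None  # unlabelled; skip, or classify later

    # tweets = get_tweets("youtube")  # placeholder for your Twitter API call
    tweets = ["loving the new layout :)", "my upload failed again :(", "just posted a video"]
    labelled = [(t, label_by_emoticon(t)) for t in tweets]
    labelled = [(t, y) for t, y in labelled if y is not None]
    print(labelled)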

There's a Twitter Sentiment API by TweetFeel that does advanced linguistic analysis of tweets, and can retrieve positive/negative tweets. See http://www.webservius.com/corp/docs/tweetfeel_sentiment.htm

For those interested in coding Twitter sentiment analysis from scratch, there is a Coursera course "Data Science" with Python code on GitHub (as part of assignment 1 - link). The sentiment scores come from the AFINN-111 word list.
You can find working solutions, for example here. In addition to the AFINN-111 sentiment list, there is a simple implementation of building a dynamic term list based on the frequency of terms in tweets that have a pos/neg score (see here).

Related

Machine Learning approach for automating document filing into the correct folder [closed]

I am wondering if anyone has any ideas about the correct approach and suitable algorithms for the scenario below:
There are thousands of distinct documents, each with its own categorical encoding. These documents arrive into the system and need to be manually filed by the user into the correct folder. E.g.
Document Code | Folder
------------- | --------
ABC123        | Folder 1
DEF456        | Folder 2
GHI789        | Folder 1
While we could create a mapping of document codes to folders, this may be very cumbersome for so many codes, and the set of codes may also grow. Furthermore, each customer may want to file the same type of document into a different folder.
Is there a good approach to build a supervised model that would essentially learn which folder a specific document tends to get filed under using weighting from historical manual filing, then decide to file this automatically for the user in future?
I understand this weighting may be difficult for a new document type, which would need to be manually filed the first time and would therefore be highly biased on the first occasion. But it may be easier than building a classifier for the contents of the document that would ignore the code itself.
If anyone can point out some algorithms would be much appreciated!
I contributed to a model that has been used on over 1 million documents, using the document name. The short answer is yes, BUT:
I know this is boring, but: don't use machine learning unless you really have to. Maintaining a production model ends up being a lot more work than you might expect if you have not had the pleasure. Furthermore, I would be very tempted to create the mapping as long as the number of codes is small, say fewer than 1000. Even if you want to create a model, in the long run, having a rules-based solution against which to benchmark it can be invaluable for getting the confidence of your stakeholders.
If you do go the modeling route, learning this type of mapping should be within reach of some elementary algorithms, such as decision trees, or their more sophisticated cousins, random forest classifiers and gradient boosting machines. With any algorithm, data science fundamentals (understanding the customers' real needs, thorough EDA, and sound experimental design) will really be the key to whether what you build ends up helping anyone.
No matter the approach you take, I'd advise keeping an iterative mindset: start simple, evaluate, and add complexity (such as customizing the model to each user) bit by bit, just like you would with a traditional software product/project.
Take a look at the XGBoost classifier as a fine place to start playing around: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier
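A minimal sketch of that route, one-hot encoding the document code and letting the model learn the historical code-to-folder mapping; the column names and data here are made up.

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from xgboost import XGBClassifier

    # Hypothetical historical filing data.
    history = pd.DataFrame({
        "document_code": ["ABC123", "DEF456", "GHI789", "ABC123", "DEF456"],
        "folder":        ["Folder 1", "Folder 2", "Folder 1", "Folder 1", "Folder 2"],
    })

    X = pd.get_dummies(history[["document_code"]])   # one-hot encode the code
    le = LabelEncoder()
    y = le.fit_transform(history["folder"])          # XGBoost wants integer labels

    model = XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X, y)

    new_doc = pd.get_dummies(pd.DataFrame({"document_code": ["ABC123"]}))
    new_doc = new_doc.reindex(columns=X.columns, fill_value=0)  # align columns
    print(le.inverse_transform(model.predict(new_doc)))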
To learn more about designing products that rely on machine learning, I HIGHLY recommend "Building Machine Learning Powered Applications: Going from Idea to Product" by Emmanuel Ameisen.

Merging many statistical methods for Text classification, starting with SVM multiclass classifier

Premise: I am not an expert in machine learning/maths/statistics. I am a linguist and I am entering the world of ML. When answering, please try to be as explicit as you can.
My problem: I have 3000 expressions containing some aspects (or characteristics, or features) that users usually comment on in online reviews. These expressions have been recognized and approved by human experts.
Example: “they play a difficult role”
The labels are: Acting (referring to the act of acting and also to actors), Direction, Script, Sound, Image.
The goal: I am trying to classify these expressions according to their aspects.
My system: I am using SkLearn and Python under a Jupyter environment.
Technique used until now:
1. I built a bag-of-words matrix (so I kept track of the presence/absence of stemmed words for each expression), and
2. I applied an SVM multiclass classifier with an RBF kernel and C = 1 (or tuned it according to the final accuracy). The code I used is from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/ (a minimal sketch of this setup follows below).
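A minimal sketch of that bag-of-words + RBF SVM setup with scikit-learn; the example expressions and labels are invented, and stemming is omitted.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    # Made-up examples; in practice these come from the 3000 labelled expressions.
    expressions = ["they play a difficult role", "the plot twist was unexpected",
                   "the soundtrack is haunting", "the photography is stunning"]
    labels = ["Acting", "Script", "Sound", "Image"]

    X_train, X_test, y_train, y_test = train_test_split(
        expressions, labels, test_size=0.25, random_state=0)

    model = make_pipeline(
        CountVectorizer(binary=True),   # presence/absence bag of words
        SVC(kernel="rbf", C=1),
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))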
My first attempt showed an accuracy of 0.63. When I tried to create more labels from the class Script, accuracy went down to 0.50. I was interested in doing that because I have some expressions that definitely describe the plot or the characters.
I think that the problem is due to the presence of some words that are shared among these aspects.
I searched for a solution to improve the model and found something called a "learning curve". I used the official code provided by the scikit-learn documentation: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
The result looks like the second picture (the right one). I can't tell whether it is good or not.
In addition to this, I would like to:
1. Import the expressions from a text file. For the moment I have just created an array and put the expressions inside it, and I don't feel comfortable with that.
2. Find a way, if possible, to tell the system that some words are very specific/important to an Aspect, to help it improve the classification.
How can I do this? I read that in some works researchers have combined several systems... How should I handle this? From where can I retrieve the resulting numbers of the first system in order to use them in the second one?
I would like to underline that there are some expressions, verbs, nouns, etc. that are used a lot in some contexts and not in others. There are some names that for sure are names of actors and not directors, for example. In the future I would like to add more linguistic pieces of information to the system and trying to improve it.
I hope to have expressed myself clearly enough and to have used appropriate and understandable language.

How can I use text analysis in order to investigate questionnaire responses?

I'm the "programmer" of a team of pupils that aims to investigate satisfaction and general problems in my grammar school. We have a questionary that is built upon a scale from 1-6 and we interpret these answers by a diagram software that I wrote in python.
Now there's a <textarea> at the end of our questionary that one can use as he likes.
I'm currently thinking of ways to make this data usable (we don't want to read more than 800+ answers).
How can I use text analysis in Python to investigate what pupils write?
I was thinking of a way to "tag" any sentence that is written down, like:
I don't like being in school. [wellbeing][negative]
I have way too much homework. [homework][much]
I think there should be more interesting projects. [projects][more]
Are there any usable approaches to obtain that? Does it make sense to use an existing tokenizer?
Thanks for your help!
Well, I am just throwing in ideas here, but one approach I can think of is:
1. Use a clustering algorithm to cluster the responses first (something like k-means), or do topic modelling using something like LDA.
2. Then use your tagging approach, doing text analysis to generate frequent/related keywords in each of the clusters/topics you get from step 1.
Why would step 1 be a good idea? Well, in my opinion, while doing text analysis, if you arbitrarily go around tagging sentences, you could generate a lot of tags, and many of them would be similar in context. Hence, usability might go down, because you would still have to analyze loads of tags for each sentence.
Using clustering/topic modelling first can help reduce this problem to some extent, and is hence more usable in my opinion.
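A minimal sketch of step 1 plus the keyword step, using scikit-learn; the example answers are made up.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    answers = [
        "I don't like being in school",
        "I have way too much homework",
        "I think there should be more interesting projects",
        "homework takes all my free time",
        "school projects are boring",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(answers)

    km = KMeans(n_clusters=3, random_state=0, n_init=10)
    km.fit(X)

    # Print the most strongly weighted terms per cluster as candidate "tags".
    terms = vectorizer.get_feature_names_out()
    for i, centre in enumerate(km.cluster_centers_):
        top = centre.argsort()[::-1][:3]
        print(f"cluster {i}:", [terms[j] for j in top])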
"NLTK Sentiment Analysis" is a good place to start searching. The Natural Language Toolkit is the package for doing text analysis in Python but it is not exactly simple because the task is quite complex. The first few results had some compelling demos but I didn't look at them in detail.
I won't quite answer your question. But if I understand correctly, you have a classic survey (with check boxes, ...) with a small text-area question at the end...
So you will have about 800+ answers, but I guess the answers will not be too long. Usually it will be a few lines or even a few words... I think that manual QDA software will serve you better than an algorithm that won't be perfect. For instance, you can use the open-source RQDA (an R package) or commercial software such as NVivo...
Thanks
This sounds a lot like AI programming, just because of the way they 'tag' questions and responses. Maybe take a look at http://pyaiml.sourceforge.net/ and the Artificial Intelligence Markup Language. I don't have much experience with it, but you might be able to tweak it to your needs instead of starting from scratch.

Acquiring basic skills working with visualizing/analyzing large data sets [closed]

I'm looking for a way to become comfortable with large data sets. I'm a university student, so everything I do is of "nice" size and complexity. I'm working on a research project with a professor this semester, and I've had to visualize relationships in a somewhat large (in my experience) data set: it was a 15 MB CSV file.
I wrote most of my data wrangling in Python, visualized using GNUPlot.
Are there any accessible books or websites on the subject out there? Bonus points for using Python, more bonus points for a more "basic" visualization system than relying on gnuplot. Cairo or something, I suppose.
Looking for something that takes me from data mining, to processing, to visualization.
EDIT: I'm more looking for something that will teach me the "big ideas". I can write the code myself, but looking for techniques people use to deal with large data sets. I mean, my 15 MB is small enough where I can put everything I would ever need into memory and just start crunching. What do people do to visualize 5 GB data sets?
I'd say the most basic skill is a good grounding in math and statistics. This can help you assess and pick from the variety of techniques for filtering data and reducing its volume and dimensionality while keeping its integrity. The last thing you'd want to do is make something pretty that shows patterns or relationships which aren't really there.
Specialized math
To tackle some types of problems you'll need to learn some math to understand how particular algorithms work and what effect they'll have on your data. There are various algorithms for clustering data, dimensionality reduction, natural language processing, etc. You may never use many of these, depending on the type of data you wish to analyze, but there are abundant resources on the Internet (and Stack Exchange sites) should you need help.
For an introductory overview of data mining techniques, Witten's Data Mining is good. I have the 1st edition, and it explains concepts in plain language with a bit of math thrown in. I recommend it because it provides a good overview and it's not too expensive -- as you read more into the field you'll notice many of the books are quite expensive. The only drawback is the number of pages dedicated to using WEKA, a Java data mining package, which might not be too helpful since you're using Python (but it is open source, so you may be able to glean some ideas from the source code). I also found Introduction to Machine Learning to provide a good overview, also reasonably priced, with a bit more math.
Tools
For creating visualizations of your own invention, on a single machine, I think the basics should get you started: Python, NumPy, SciPy, Matplotlib, and a good graphics library you have experience with, like PIL or Pycairo. With these you can crunch numbers, plot them on graphs, and pretty things up via custom drawing routines.
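For a data set of that size, the basic workflow can be as small as this (the CSV filename and column indices are placeholders):

    import numpy as np
    import matplotlib.pyplot as plt

    # Load two numeric columns from a CSV (skipping a header row).
    data = np.genfromtxt("measurements.csv", delimiter=",", skip_header=1)
    x, y = data[:, 0], data[:, 1]

    plt.scatter(x, y, s=5, alpha=0.5)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title("Quick look at the raw relationship")
    plt.savefig("scatter.png", dpi=150)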
When you want to create moving, interactive visualizations, tools like the Java-based Processing library make this easy. There are even ways of writing Processing sketches in Python via Jython, in case you don't want to write Java.
There are many more tools out there, should you need them, like OpenCV (computer vision, machine learning), Orange (data mining, analysis, viz), and NLTK (natural language, text analysis).
Presentation principles and techniques
Books by folks in the field like Edward Tufte and references like Information Graphics can help you get a good overview of the ways of creating visualizations and presenting them effectively.
Resources to find Viz examples
Websites like Flowing Data, Infosthetics, Visual Complexity and Information is Beautiful show recent, interesting visualizations from across the web. You can also look through the many compiled lists of visualization sites out there on the Internet. Start with these as a seed and start navigating around; I'm sure you'll find a lot of useful sites and inspiring examples.
(This was originally going to be a comment, but grew too long)
Check out Information is beautiful. It is not a technical book but it might give you a couple of ideas for visualising data.
And maybe have a look at the first 3 chapters of Principles of Data Mining; it goes through some concepts of visualizing data in a data mining context. I found some parts of it useful during university.
Hope this helps
If you are looking for visualization rather than data mining and analysis, The Visual Display of Quantitative Information by Edward Tufte is considered one of the best books in the field.
I like the book Data Analysis with Open Source Tools by Janert. It is a pretty broad survey of data analysis methods, focusing on how to understand the system that produced the data, rather than on sophisticated statistical methods. One caveat: while the mathematics used isn't especially advanced, I do think you will need to be comfortable with mathematical arguments to gain much from the book.

Programming a Self Learning Music Maker [closed]

I want to learn how to program a music application that will analyze songs.
How would I get started with this, and is there a library for analyzing sound waves?
I know C, C++, Java, Python, some assembly, and some Perl.
Related question: Algorithm for music imitation
Composition and analysis of music by computer is a huge field. There are two basic areas in this type of work, which overlap somewhat.
Algorithmic composition is concerned with the generation of music. This can be based on statistical approaches such as Markov chaining, mathematical models employing fractal or chaotic processes, or leveraging techniques from AI such as expert systems, neural networks and genetic algorithms.
Music information retrieval is concerned with identifying common grammars, commonalities and similarity metrics between pieces of music, and identifying uniqueness (sometimes called acoustic fingerprinting).
Many, many libraries, tools and specialised programming languages exist which can help with different parts of these problems. Here's a list of music-related programs and libraries for Python. There is a lot of technology available; you should be able to find something that will do the brunt of the work for you. Reimplementing a 'musical parser' through very low-level frequency analysis tools such as Fourier Transforms, as other answers have suggested, while possible, will be quite difficult and is almost certainly unnecessary.
For further advice and specific questions, the International Society for Music Information Retrieval has a mailing list which you would probably find very helpful.
Once you get past the FFT stuff that Lennart mentioned, you might want to have a look at Markov chains for analyzing intervals between notes, and aggregated patterns.
This is well-trodden ground, but Markov chains have been used in the past to build a kind of statistical model of melodies from various songs, which can then be used to generate new melodies. Markov chains can do the same with written English sentences. For an example of how that looks, have a play with the MegaHAL chatterbot to see how Markov chains can produce mangled output that statistically looks like its input (in MegaHAL's case, it looks like English sentences).
You could conceivably mash up the top 100 and have a Markov chain generator blast out the next big hit.
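A minimal sketch of a first-order Markov chain over notes; the toy melody is made up, and real input would come from parsed MIDI.

    import random
    from collections import defaultdict

    # Toy melody as note names; in practice, extract these from MIDI files.
    melody = ["C", "E", "G", "E", "C", "E", "G", "A", "G", "E", "C"]

    # Build first-order transition lists: which notes follow which.
    transitions = defaultdict(list)
    for current, nxt in zip(melody, melody[1:]):
        transitions[current].append(nxt)

    def generate(start, length=16):
        """Random-walk the chain to produce a new, statistically similar melody."""
        note, out = start, [start]
        for _ in range(length - 1):
            note = random.choice(transitions[note]) if transitions[note] else start
            out.append(note)
        return out

    print(generate("C"))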
On the other hand, you may want to consider the possibility that it is not any quality of the music itself that makes a song popular. Or perhaps it is a quality of music issue combined with marketing.
To analyze sound waves you need some sort of Fourier transform (FFT), so you can split the song up into its frequencies and see how they change over time. There is FFT support in NumPy; I haven't used it, so I don't know if it's any good, but it would be a great place to start.
After that you need to do some sort of statistical analysis on the frequencies and patterns, and beyond that I no longer have any clue what I'm talking about.
Cool stuff though, go for it!
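A minimal sketch of that first FFT step on a mono signal, using NumPy and SciPy's WAV reader (the filename is a placeholder):

    import numpy as np
    from scipy.io import wavfile

    rate, samples = wavfile.read("song.wav")      # placeholder file
    if samples.ndim > 1:
        samples = samples.mean(axis=1)            # mix stereo down to mono

    # Analyse one short window (here, roughly the first second).
    window = samples[:rate].astype(float)
    spectrum = np.abs(np.fft.rfft(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / rate)

    # The dominant frequency in that window:
    print(freqs[spectrum.argmax()], "Hz")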
You might like to start by looking at the MIDI format; it's reasonably simple compared to the compressed formats, and you can generate some nice things with it.
Depends what you want to do really.
There's the Echo Nest remix API that lets you analyze and manipulate music in Python. Some examples here: Where's the pow and here: You make me quantized miss lizzie. There's a nifty tutorial here: An overview of the Echo Nest API
