Automatically detect the programming language of given script in python

I have a python script that reads in a file, should then detect the language of the code in the file, get the language ID from https://ghostbin.com/languages.json and upload it to https://ghostbin.com with the language ID as a parameter.
The problem is detecting the programming language used. I haven't found any lib to help me out.

Here is a module that uses a Naive Bayes classifier to do what you want, with a corresponding discussion. The caveat is that the module needs to be trained on code samples. It should be easy enough to modify it to retain its training.
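To make the idea concrete, here is a minimal sketch of a Naive Bayes language guesser using only the standard library. It is not the linked module: the class name, the tokenizer and the two tiny training samples are assumptions made for illustration, and a usable classifier would need many samples per language.

```python
import math
import re
from collections import Counter, defaultdict

# Identifier-like words, or single punctuation characters (digits are ignored).
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(source):
    """Split source code into words and individual punctuation marks."""
    return TOKEN_RE.findall(source)


class NaiveBayesLanguageGuesser:
    """Hypothetical helper: multinomial Naive Bayes over code tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # language -> token frequencies
        self.doc_counts = Counter()               # language -> number of samples
        self.vocabulary = set()

    def train(self, language, source):
        tokens = tokenize(source)
        self.token_counts[language].update(tokens)
        self.doc_counts[language] += 1
        self.vocabulary.update(tokens)

    def guess(self, source):
        tokens = tokenize(source)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, -math.inf
        for language, counts in self.token_counts.items():
            # Log prior plus log likelihood of each token, with add-one smoothing.
            score = math.log(self.doc_counts[language] / total_docs)
            total_tokens = sum(counts.values())
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


guesser = NaiveBayesLanguageGuesser()
guesser.train("python", "def greet(name):\n    print('hello', name)\n\nimport sys\n")
guesser.train("c", '#include <stdio.h>\nint main(void) {\n    printf("hello\\n");\n    return 0;\n}\n')
print(guesser.guess("def add(a, b):\n    return a + b\n"))
```

Because the trained state is just counters and a set, it could be written to disk (for example as JSON) so that the training does not have to be repeated on every run.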

Related

Does Keras OCR support other languages?

I am trying to extract text from image using keras-ocr. Does it support other written languages?
I am not getting proper documentation for supporting other languages.

Unfortunately the answer is no. Keras is written in Python and can only be used in this language. What you are looking for is referred to as an API wrapper and there is none available at this time. As an alternative, you can look into converting your Keras model into another language such as by using keras2cpp explained in a similar question.

Merging many statistical methods for Text classification, starting with SVM multiclass classifier

Premise: I am not an expert in Machine Learning/Maths/Statistics. I am a linguist and I am entering the world of ML. When answering, please be as explicit as you can.
My problem: I have 3000 expressions containing some aspects (or characteristics, or features) that users usually review in online reviews. These expressions are recognized and approved by human beings and experts.
Example: “they play a difficult role”
The labels are: Acting (referring to the act of acting and also to actors), Direction, Script, Sound, Image.
The goal: I am trying to classify these expressions according to their aspects.
My system: I am using SkLearn and Python under a Jupyter environment.
Technique used until now:
I built a bag-of-words matrix (so I kept track of the presence/absence of stemmed words for each expression) and I applied an SVM multiclass classifier with an RBF kernel and C = 1 (or I tuned it according to the final accuracy). The code used is this one from https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/
First attempt showed 0.63 of accuracy. When I tried to create more labels from the class Script accuracy went down to 0.50. I was interested in doing that because I have some expressions that for sure describe the plot or the characters.
I think that the problem is due to the presence of some words that are shared among these aspects.
I searched for a solution to improve the model. I found something called “learning curve”. I use the official code provided by sklearn documentation http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html .
The result is like the second picture (the right one). I can't understand if it is good or not.
In addition to this, I would like to:
import the expressions from a text file. For the moment I have just created an array and put the expressions inside it, and I don't feel so comfortable with that.
find a way, if it is possible, to tell the system that some words are very specific / important to an Aspect, to help it improve the classification.
How can I do this? I read that in some works researchers have used more systems... How should I handle this? From where can I retrieve the resulting numbers from the first system to use them in the second one?
I would like to underline that there are some expressions, verbs, nouns, etc. that are used a lot in some contexts and not in others. There are some names that for sure are names of actors and not directors, for example. In the future I would like to add more linguistic pieces of information to the system and trying to improve it.
I hope I have expressed myself clearly enough and used appropriate, understandable language.
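For the first wish (loading the expressions from a text file instead of a hard-coded array), a minimal sketch could look like the following. It assumes a tab-separated layout with one expression and its label per line; that layout, the function name and the second sample row are invented for illustration.

```python
import csv
import io


def read_labeled_expressions(handle):
    """Read 'expression<TAB>label' rows into two parallel lists."""
    expressions, labels = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        text, label = (field.strip() for field in row)
        expressions.append(text)
        labels.append(label)
    return expressions, labels


# In-memory stand-in for a real file, so the sketch runs as-is.
# With a file on disk (hypothetical name), one would write:
#   with open("expressions.tsv", encoding="utf-8", newline="") as handle:
sample = io.StringIO(
    "they play a difficult role\tActing\n"
    "the plot drags in the middle\tScript\n"
)
expressions, labels = read_labeled_expressions(sample)
print(expressions, labels)
```

The two lists can then be passed to whatever vectorizer and classifier are already in use, in place of the hand-written array.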

Is there a supported implementation of Prolog which interfaces cleanly to Python?

I have been very happy with some language translations I've done with Prolog, but long ago. I'm now using Python for general purpose programming. The area is DNA sequencing data processing, but that's beside the point.
I am interested in using a DCG (definite clause grammar) for translation into a target language. (A DCG is very close to being a set of Prolog predicates, and a DCG to Prolog interpretation layer is almost trivial, as I recall.) The method I used was to parse an input language, and at the same time as parsing the input expressions, build a network structure to represent a deeper model of the expression. Another grammar then served to elaborate that model into a valid expression in the target language.
This time, though, I'm looking to do just the second half, to take an internal model (in a network of Python objects) and translate them into a target language. (This target language is a workflow configuration language, incidentally, and the network of objects are those used by a pre-existing less general workflow engine that I hope to abandon.)
So, are there any modern, supported Prolog implementations that cleanly interface to Python?

YAP provides a Python interface package:
http://www.dcc.fc.up.pt/~vsc/yap/
If you want to try it, I suggest you start with the current git version found at:
https://github.com/vscosta/yap-6.3
Some examples are provided with the distribution:
https://github.com/vscosta/yap-6.3/tree/master/packages/python/examples

Python based SVM library

Is there a Python based library providing an SVM implementation with a GPL or any other opensource license? I have come across a few that provide an SVM wrapper for the SVM logic encoded in C, but none that are coded entirely in Python.
Regards,
Mandar

libsvm has Python bindings.
Edit
Googling found PyML, but I haven't used it.

You might want to check out this link, it has a big collection of machine learning software, it lists 50+ libraries that have been written in Python:
http://mloss.org/software/language/python/

Sentiment analysis for Twitter in Python [closed]

I'm looking for an open source implementation, preferably in python, of Textual Sentiment Analysis (http://en.wikipedia.org/wiki/Sentiment_analysis). Is anyone familiar with such open source implementation I can use?
I'm writing an application that searches twitter for some search term, say "youtube", and counts "happy" tweets vs. "sad" tweets.
I'm using Google's appengine, so it's in python. I'd like to be able to classify the returned search results from twitter and I'd like to do that in python.
I haven't been able to find such sentiment analyzer so far, specifically not in python.
Are you familiar with such open source implementation I can use? Preferably this is already in python, but if not, hopefully I can translate it to python.
Note, the texts I'm analyzing are VERY short, they are tweets. So ideally, this classifier is optimized for such short texts.
BTW, twitter does support the ":)" and ":(" operators in search, which aim to do just this, but unfortunately, the classification provided by them isn't that great, so I figured I might give this a try myself.
Thanks!
BTW, an early demo is here and the code I have so far is here and I'd love to opensource it with any interested developer.

Good luck with that.
Sentiment is enormously contextual, and tweeting culture makes the problem worse because you aren't given the context for most tweets. The whole point of twitter is that you can leverage the huge amount of shared "real world" context to pack meaningful communication in a very short message.
If they say the video is bad, does that mean bad, or bad?
A linguistics professor was lecturing to her class one day. "In English," she said, "A double negative forms a positive. In some languages, though, such as Russian, a double negative is still a negative. However, there is no language wherein a double positive can form a negative."
A voice from the back of the room piped up, "Yeah . . .right."

With most of these kinds of applications, you'll have to roll much of your own code for a statistical classification task. As Lucka suggested, NLTK is the perfect tool for natural language manipulation in Python, so long as your goal doesn't interfere with the non commercial nature of its license. However, I would suggest other software packages for modeling. I haven't found many strong advanced machine learning models available for Python, so I'm going to suggest some standalone binaries that easily cooperate with it.
You may be interested in The Toolkit for Advanced Discriminative Modeling, which can be easily interfaced with Python. This has been used for classification tasks in various areas of natural language processing. You also have a pick of a number of different models. I'd suggest starting with Maximum Entropy classification so long as you're already familiar with implementing a Naive Bayes classifier. If not, you may want to look into it and code one up to really get a decent understanding of statistical classification as a machine learning task.
The University of Texas at Austin computational linguistics groups have held classes where most of the projects coming out of them have used this great tool. You can look at the course page for Computational Linguistics II to get an idea of how to make it work and what previous applications it has served.
Another great tool which works in the same vein is Mallet. The difference with Mallet is that there's a bit more documentation and some more models available, such as decision trees, and it's in Java, which, in my opinion, makes it a little slower. Weka is a whole suite of different machine learning models in one big package that includes some graphical stuff, but it's really mostly meant for pedagogical purposes, and isn't really something I'd put into production.
Good luck with your task. The real difficult part will probably be the amount of knowledge engineering required up front for you to classify the 'seed set' off of which your model will learn. It needs to be pretty sizeable, depending on whether you're doing binary classification (happy vs sad) or a whole range of emotions (which will require even more). Make sure to hold out some of this engineered data for testing, or run some tenfold or remove-one tests to make sure you're actually doing a good job predicting before you put it out there. And most of all, have fun! This is the best part of NLP and AI, in my opinion.
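The hold-out advice can be sketched as a plain k-fold cross-validation loop. This is an illustration only: the function names, the majority-class baseline and the toy examples are assumptions, and any real classifier would be plugged in through `train_fn`.

```python
import random
from collections import Counter


def k_fold_splits(examples, k=10, seed=0):
    """Yield (train, test) pairs; every example lands in exactly one test fold."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


def cross_validate(examples, train_fn, k=10):
    """Average accuracy over k held-out folds (needs at least k examples)."""
    accuracies = []
    for train, test in k_fold_splits(examples, k):
        classify = train_fn(train)  # train_fn returns a text -> label function
        correct = sum(1 for text, label in test if classify(text) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def train_majority_baseline(train):
    """Trivial stand-in classifier: always predict the most common label."""
    most_common = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: most_common


# Toy labeled data: 15 "happy" and 5 "sad" examples.
examples = [("text %d" % i, "happy" if i % 4 else "sad") for i in range(20)]
print(cross_validate(examples, train_majority_baseline, k=10))
```

A baseline like this is a useful reference point: a trained model that cannot beat the majority-class accuracy on held-out folds is not yet predicting anything.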

Thanks everyone for your suggestions, they were indeed very useful!
I ended up using a Naive Bayesian classifier, which I borrowed from here.
I started by feeding it with a list of good/bad keywords and then added a "learn" feature by employing user feedback. It turned out to work pretty nice.
The full details of my work are in a blog post.
Again, your help was very useful, so thank you!

I have constructed a word list labeled with sentiment. You can access it from here:
http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/6010/zip/imm6010.zip
You will find a short Python program on my blog:
http://finnaarupnielsen.wordpress.com/2011/06/20/simplest-sentiment-analysis-in-python-with-af/
This post displays how to use the word list with single sentences as well as with Twitter.
Word-list approaches have their limitations. You will find an investigation of the limitations of my word list in the article "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs". That article is available from my homepage.
Please note that a unicode(s, 'utf-8') call is missing from the code (for pedagogical reasons).
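As a rough illustration of the word-list approach (not the program from the blog post), the sketch below sums the scores of the words it recognizes. The sample scores are invented, the "word, tab, integer score" file layout is an assumption, and the code is written for Python 3, where strings are already Unicode.

```python
import re

# Illustrative entries only; a real list would be loaded from the downloaded file.
SAMPLE_WORD_SCORES = {"good": 3, "love": 3, "bad": -3, "awful": -3, "sad": -2}


def load_word_scores(lines):
    """Parse 'word<TAB>score' lines (assumed layout) into a dict."""
    scores = {}
    for line in lines:
        word, _, value = line.rstrip("\n").rpartition("\t")
        if word:
            scores[word] = int(value)
    return scores


def sentiment(text, word_scores):
    """Sum the scores of all known words; 0 means neutral or unknown."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(word_scores.get(word, 0) for word in words)


print(sentiment("I love this, it is good", SAMPLE_WORD_SCORES))
print(sentiment("What a sad and awful day", SAMPLE_WORD_SCORES))
```

Scoring word by word ignores negation and context ("not good" still scores positive), which is one of the limitations mentioned above.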

A lot of research papers indicate that a good starting point for sentiment analysis is looking at adjectives, e.g., are they positive adjectives or negative adjectives. For a short block of text this is pretty much your only option... There are papers that look at entire documents, or sentence level analysis, but as you say tweets are quite short... There is no real magic approach to understanding the sentiment of a sentence, so I think your best bet would be hunting down one of these research papers and trying to get their data-set of positively/negatively oriented adjectives.
Now, this having been said, sentiment is domain specific, and you might find it difficult to get a high-level of accuracy with a general purpose data-set.
Good luck.

I think you may find it difficult to find what you're after. The closest thing that I know of is LingPipe, which has some sentiment analysis functionality and is available under a limited kind of open-source licence, but is written in Java.
Also, sentiment analysis systems are usually developed by training a system on product/movie review data which is significantly different from the average tweet. They are going to be optimised for text with several sentences, all about the same topic. I suspect you would do better coming up with a rule-based system yourself, perhaps based on a lexicon of sentiment terms like the one the University of Pittsburgh provide.
Check out We Feel Fine for an implementation of a similar idea with a really beautiful interface (and twitrratr).

Take a look at the Twitter sentiment analysis tool. It's written in Python, and it uses a Naive Bayes classifier with semi-supervised machine learning. The source can be found here.

Maybe TextBlob (based on NLTK and pattern) is the right sentiment analysis tool for you.

I came across Natural Language Toolkit a while ago. You could probably use it as a starting point. It also has a lot of modules and addons, so maybe they already have something similar.

Somewhat wacky thought: you could try using the Twitter API to download a large set of tweets, and then classifying a subset of that set using emoticons: one positive group for ":)", ":]", ":D", etc, and another negative group with ":(", etc.
Once you have that crude classification, you could search for more clues with frequency or ngram analysis or something along those lines.
It may seem silly, but serious research has been done on this (search for "sentiment analysis" and emoticon). Worth a look.
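The emoticon idea can be sketched like this. It assumes the tweets have already been downloaded into a list of strings; the exact emoticon sets beyond ":)", ":]", ":D" and ":(" and the function names are made up for the example.

```python
POSITIVE_EMOTICONS = (":)", ":]", ":D", ":-)")
NEGATIVE_EMOTICONS = (":(", ":[", ":-(")


def label_by_emoticon(tweet):
    """Return 'positive', 'negative', or None when there is no clear signal."""
    has_positive = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_negative = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_positive == has_negative:
        return None  # no emoticon at all, or mixed signals
    return "positive" if has_positive else "negative"


def build_training_set(tweets):
    """Keep clearly labeled tweets and strip the emoticons from their text."""
    labeled = []
    for tweet in tweets:
        label = label_by_emoticon(tweet)
        if label is None:
            continue
        text = tweet
        for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
            text = text.replace(emoticon, "")
        labeled.append((" ".join(text.split()), label))
    return labeled


tweets = [
    "new youtube layout is great :)",
    "youtube is down again :(",
    "watching youtube",
    "love it :) hate it :(",
]
print(build_training_set(tweets))
```

Removing the emoticons from the text matters: otherwise a classifier trained on this set would simply learn the emoticons themselves rather than the words around them.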

There's a Twitter Sentiment API by TweetFeel that does advanced linguistic analysis of tweets, and can retrieve positive/negative tweets. See http://www.webservius.com/corp/docs/tweetfeel_sentiment.htm

For those interested in coding Twitter Sentiment Analysis from scratch, there is a Coursera course "Data Science" with Python code on GitHub (as part of assignment 1 - link). The sentiments are part of the AFINN-111.
You can find working solutions, for example here. In addition to the AFINN-111 sentiment list, there is a simple implementation of building a dynamic term list based on the frequency of terms in tweets that have a pos/neg score (see here).
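One way such a dynamic term list could be built is sketched below. This is not the linked solution: it is a guess at the general idea, where each unknown term gets the average score of the scored tweets it appears in. The function name and the sample scores are assumptions.

```python
from collections import defaultdict


def derive_term_scores(tweets, known_scores):
    """Score unknown terms by the mean score of the tweets that contain them."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tweet in tweets:
        words = tweet.lower().split()
        tweet_score = sum(known_scores.get(word, 0) for word in words)
        if tweet_score == 0:
            continue  # the tweet carries no usable pos/neg signal
        for word in set(words):
            if word not in known_scores:
                totals[word] += tweet_score
                counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


# Invented scores standing in for a real sentiment list.
known_scores = {"great": 3, "awesome": 4, "awful": -3}
tweets = ["great phone awesome", "awful phone", "great camera"]
print(derive_term_scores(tweets, known_scores))
```

Terms that occur in both positive and negative tweets drift toward zero, so only words with a consistent context end up with a strong derived score.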
