I'm building an Android application with OCR and Tensorflow. It scans price tags in supermarkets and has to put the scanned data into different fields. I've done the OCR part, so the image -> text recognition works fine and Tensorflow is only required to work with text input.
I'm new to Tensorflow and to machine learning in general. Is it possible to do the following work using Tensorflow, and if yes, could you share some ideas on how to do so?
The average input looks like this:
CARLSBERG
EESTI
HELE OLU 5%
1.59 +0.10
500 ml pudel
3.18 /I
4740019113419
The goal is to sort this data as follows:
Brand: CARLSBERG
Product name: HELE OLU 5%
Size: 500
Units: ml
The parameters that determine how a particular string will be classified are:
Case
Line number
Supermarket (it's known by default)
Total number of lines
Letters/numbers ratio
I think the first step would be to get your hands on, or generate, some labelled training data. You should then look into feature extraction; for example, if you notice that for a certain item the second line is usually the price, you could represent that as a feature. Or if a number is followed by a unit like ml/l/oz, it is likely to be the volume. What you want to know is how confident you are that a specific line/string is, say, the price.
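As a rough illustration only, a minimal sketch of hand-crafting such per-line features in Python (the feature names, thresholds, and unit list below are my assumptions, not anything from the question):

def line_features(line, line_number, total_lines, supermarket):
    # simple hand-crafted features for one OCR'd line of a price tag
    letters = sum(c.isalpha() for c in line)
    digits = sum(c.isdigit() for c in line)
    return {
        "is_upper": line.isupper(),                    # case
        "line_number": line_number,                    # position on the tag
        "total_lines": total_lines,
        "supermarket": supermarket,                    # known by default
        "letter_digit_ratio": letters / (digits + 1),  # letters/numbers ratio
        "has_unit": any(u in line.lower() for u in ("ml", "l", "g", "kg")),
    }

A classifier would then be trained on these dictionaries (or their numeric encoding) together with the correct label for each line (brand, product name, size, etc.).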
However, I think TensorFlow would be more suited for the OCR portion of the problem, which you have already solved. What you are asking is more towards text parsing, which could be better solved with an NLP approach.
As mentioned in 4d11's answer, one of the biggest challenges in machine learning is often getting a high quality, significantly sized set of training data.
In terms of feeding data into a Tensorflow network/model, I'd recommend you check out their 'get started' tutorial on feature columns:
https://www.tensorflow.org/get_started/feature_columns
Feature columns are used to represent data of different types numerically, in a form that can be fed into the model. The tutorial goes into some detail on how this works and why you may choose to represent different data in different ways. I found it pretty helpful as an intro.
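As a rough sketch of what that could look like for the price-tag features above (this assumes the TF 1.x estimator/feature-column API from the linked tutorial; the column names and vocabulary values are invented for illustration):

import tensorflow as tf

# numeric features from the question (line number, total lines, letters/numbers ratio)
line_number = tf.feature_column.numeric_column("line_number")
total_lines = tf.feature_column.numeric_column("total_lines")
letter_ratio = tf.feature_column.numeric_column("letter_digit_ratio")

# categorical feature (the supermarket is known by default), one-hot encoded
supermarket = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        "supermarket", ["store_a", "store_b", "store_c"]))  # placeholder store names

feature_columns = [line_number, total_lines, letter_ratio, supermarket]
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns, hidden_units=[32, 16], n_classes=5)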
There is also a TensorFlow model for text recognition (CNN + seq2seq with visual attention), available as a Python package and compatible with Google Cloud ML Engine:
https://github.com/emedvedev/attention-ocr
Related
I do side work writing/improving a research project web application for some political scientists. This application collects articles pertaining to the U.S. Supreme Court and runs analysis on them, and after nearly a year and a half, we have a database of around 10,000 articles (and growing) to work with.
One of the primary challenges of the project is being able to determine the "relevancy" of an article - that is, the primary focus is the federal U.S. Supreme Court (and/or its justices), and not a local or foreign supreme court. Since its inception, the way we've addressed it is to primarily parse the title for various explicit references to the federal court, as well as to verify that "supreme" and "court" are keywords collected from the article text. Basic and sloppy, but it actually works fairly well. That being said, irrelevant articles can find their way into the database - usually ones with headlines that don't explicitly mention a state or foreign country (the Indian Supreme Court is the usual offender).
I've reached a point in development where I can focus on this aspect of the project more, but I'm not quite sure where to start. All I know is that I'm looking for a method of analyzing article text to determine its relevance to the federal court, and nothing else. I imagine this will entail some machine learning, but I've basically got no experience in the field. I've done a little reading into things like tf-idf weighting, vector space modeling, and word2vec (+ CBOW and Skip-Gram models), but I'm not quite seeing a "big picture" yet that shows me just how applicable these concepts can be to my problem. Can anyone point me in the right direction?
Framing the Problem
When starting a novel machine learning project like this, there are a few fundamental questions to think through that can help you refine the problem and make your literature review and experiments more effective.
Do you have the right data to build a model? You have ~10,000 articles that will be your model input; however, to use a supervised learning approach you will need trustworthy labels for all articles used in model training. It sounds like you have already done this.
What metric(s) to use to quantify success? How can you measure whether your model is doing what you want? In your specific case this sounds like a binary classification problem: you want to be able to label articles as relevant or not. You could measure your success using a standard binary classification metric like area under the ROC curve. Or, since you have a specific issue with false positives, you could choose a metric like precision.
How well can you do with a random or naive approach? Once a dataset and metric have been established, you can quantify how well you can do at your task with a basic approach. This could be as simple as calculating your metric for a model that chooses at random, but in your case you already have your keyword-parser model, which is a perfect way to set a benchmark. Quantify how well your keyword-parsing approach does on your dataset so you can tell when a machine learning model is doing well.
Sorry if this was obvious and basic to you, but I wanted to make sure it was in the answer. In an innovative, open-ended project like this, diving straight into machine learning experiments without thinking through these fundamentals can be inefficient.
Machine Learning Approaches
As suggested by Evan Mata and Stefan G, the best approach is to first reduce your articles into features. This could be done without machine learning (e.g. a vector space model) or with machine learning (word2vec and the other examples you cited). For your problem, I think something like a bag-of-words (BOW) representation makes sense to try as a starting point.
Once you have a feature representation of your articles you are almost done and there are a number of binary classification models that will do well. Experiment from here to find the best solution.
Wikipedia has a nice example of a simple way to use this two step approach in spam filtering, an analogous problem (See the Example Usage section of the article).
Good luck, sounds like a fun project!
If you have sufficient labeled data - not only for "yes this article is relevant" but also for "no this article is not relevant" (you're basically building a binary relevant/not-relevant model, so I would research spam filters) - then you can train a fair model. I don't know if you actually have a decent quantity of "no" data. If you do, you could train a relatively simple supervised model along the following lines (a rough sketch using scikit-learn; assume corpus is a list of article texts, labels the 0/1 relevance labels, and new_document an unseen article):
from sklearn.feature_extraction.text import CountVectorizer  # BOW; swap in TfidfVectorizer if you prefer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

vectorizer = CountVectorizer(stop_words="english")            # preprocess (remove stop words, etc.) + BOW
vectors = vectorizer.fit_transform(corpus)                    # corpus: list of article texts
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.25)  # labels: 1 if relevant, 0 if not
model = MultinomialNB().fit(X_train, y_train)                 # train on ~3/4 of the data
print(model.score(X_test, y_test))                            # evaluate on the rest; make sure the model doesn't overfit
print(model.predict(vectorizer.transform([new_document])))    # classify a new document
The exact model will depend on your data. A simple Naive Bayes could (and probably will) work fine if you can get a decent number of "no" documents. One note: you imply that you have two kinds of "no" documents - those that are reasonably close (Indian Supreme Court) and those that are completely irrelevant (say, taxes). You should test training with only the "close" erroneous cases (the "far" ones filtered out, as you do now) versus training with both "close" and "far" erroneous cases, and see which one comes out better.
There are many many ways to do this, and the best method changes depending on the project. Perhaps the easiest way to do this is to keyword search in your articles and then empirically choose a cut off score. Although simple, this actually works pretty well, especially in a topic like this one where you can think of a small list of words that are highly likely to appear somewhere in a relevant article.
When a topic is broader, something like 'business' or 'sports', keyword search can be prohibitive and lacking. This is when a machine learning approach might start to become the better idea. If machine learning is the way you want to go, then there are two steps:
Embed your articles into feature vectors
Train your model
Step 1 can be something simple like a TFIDF vector. However, embedding your documents can also be deep learning on its own. This is where CBOW and Skip-Gram come into play. A popular way to do this is Doc2Vec (PV-DM). A fine implementation is in the Python Gensim library. Modern and more complicated character, word, and document embeddings are much more of a challenge to start with, but are very rewarding. Examples of these are ELMo embeddings or BERT.
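As a hedged sketch of the Doc2Vec route with Gensim (assuming articles is a list of already tokenized documents; the parameter values are just reasonable defaults, not a recommendation):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# articles: list of token lists, one per document (an assumption for this sketch)
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(articles)]
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)

# fixed-length feature vector for a new, unseen document
vector = model.infer_vector(["supreme", "court", "ruled", "today"])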
Step 2 can be a typical model, as it is now just binary classification. You can try a multilayer neural network, either fully-connected or convolutional, or you can try simpler things like logistic regression or Naive Bayes.
My personal suggestion would be to stick with TFIDF vectors and Naive Bayes. From experience, I can say that this works very well, is by far the easiest to implement, and can even outperform approaches like CBOW or Doc2Vec depending on your data.
I have a dataset, whose (updated) file you can find here, containing many different characteristics of different office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm that can be trained on the dataset above, in order to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the usual machine learning algorithms from the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC etc.) to predict this continuous variable. Surface_area and the number of workers had a correlation with the target variable of between 0.3 and 0.4, so I assumed they were good features for the model and included them in training. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or could examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of data samples is very small, but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community will understand (in this case English). After a bit of tinkering, I found out that the language being used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function, which (after using Google Translate) tells what the purpose of the building is. Intuitively, this should have a large correlation with the power consumption: industries tend to use more power than normal households. After translation, I found out that the main types were Residential, Office, Accommodation and Meeting. This feature thus has to be encoded as a nominal variable to train the model.
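A minimal sketch of encoding that column as a nominal (one-hot) variable with pandas (the translated column name building_function and the file name are assumptions):

import pandas as pd

df = pd.read_csv("buildings.csv")  # hypothetical file name for the linked dataset
# one-hot encode the nominal building_function column (Residential, Office, ...)
df = pd.get_dummies(df, columns=["building_function"])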
Another feature hoofsbi also seems to have some variance. But I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you some code to perform this regression task. It is very important in such tasks to understand what the data is and thus perform feature engineering.
I have a dataset of comments which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone's talking about pork, steak, wine, soda, or eating, it's classified into food & beverages.
Whereas if somebody's talking about, say, gold, a pendant, a locket etc., it's classified into jewelries.
I want to know what tags/tokens I should be looking for in a comment/tweet so as to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks
This answer can be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others have already mentioned, in the land of machine learning there are 2 main roads: supervised and unsupervised learning. As you probably know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values.
For instance if a text is related to clothes and shoes the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at a multi-label classification problem.
What to use?
I presume that the dataset is not yet labeled, since Twitter does not do this categorisation for you. So here comes a big decision on your part.
You label the data manually, which means you look at as many tweets/FB messages in your dataset as you can and, for each of them, consider the 5 categories and answer True/False.
You decide to use an unsupervised learning algorithm and hope that you discover these 5 categories. Approaches like clustering will just try to find categories on their own, and these don't have to match your 5 predefined categories by default.
I've used quite some supervised learning in the past and have had good experience with this type of learning, therefore I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the documents as a feature. A value of True means the word is present in the document; False means it is absent.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization: this will break your text up into a list of words. You can use this module.
Stopword removal: this will remove common words from the tokens, words like 'a', 'the', ... You can take a look at this.
Stemming: this will transform words to their stem form. For example, the words 'working', 'worked', 'works' will all be transformed to 'work'. Take a look at this.
Now that you have preprocessed the data, generate a feature set for each document, with one feature per word that exists in the corpus. There are automatic methods and filters for this, but I'm not sure how to do this in Python.
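A minimal sketch of these preprocessing steps and of building such boolean word features with NLTK (assuming documents is a list of raw comment strings; you may first need nltk.download('punkt') and nltk.download('stopwords')):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                            # tokenization
    tokens = [t for t in tokens if t.isalpha() and t not in stop]   # stopword removal
    return [stemmer.stem(t) for t in tokens]                        # stemming

# vocabulary over all documents, then one boolean feature per word
processed = [preprocess(doc) for doc in documents]
vocabulary = set(word for doc in processed for word in doc)

def featureset(tokens):
    return {word: (word in tokens) for word in vocabulary}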
Classification
There are multiple classifiers that you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You can use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried that one before. In the past I've used Logistic Regression and SVMs.
Training & testing
You will use part of your data for training and part for validating whether the trained model performs well. I suggest you use cross-validation, because you will have a small dataset (you have to label the data manually, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset into a single training set and testing set. Instead it runs in multiple rounds, each round using part of the data for training and part for testing, so that every data point gets used for both training and testing across the rounds.
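A hedged sketch of cross-validation with scikit-learn (logistic regression is picked here only as an example model; X and y are assumed to be your feature matrix and labels for one category):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: average score and its spread
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())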
Predicting
Once your model is built and the predictions on the test data look plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I've heard a lot of great things about the scikit-learn Python library. But to be honest, I prefer to use Weka, a data mining tool written in Java, which offers a great UI and speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However if your goal is to categorise the tweets/fb messages into non-predefined categories, topic modelling is the way to go.
Topic modelling is an unsupervised learning method where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). Now the cool thing is that the algorithm will find 40 topics, each containing words that are related to one another. It will also output, for each document, a distribution indicating which topics the document is related to. This way you can discover many more categories than your 5 predefined ones.
Now I'm not going to go much deeper into this, but just google it if you want more information. In addition, you could consider using MALLET, which is an excellent tool for topic modelling.
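If you do want a quick look, here is a minimal topic-modelling sketch with Gensim's LDA (an alternative to MALLET; processed is assumed to be a list of token lists like the one built in the preprocessing sketch above, and 40 topics is an arbitrary choice):

from gensim import corpora, models

dictionary = corpora.Dictionary(processed)                 # processed: list of token lists
bow_corpus = [dictionary.doc2bow(tokens) for tokens in processed]

lda = models.LdaModel(bow_corpus, num_topics=40, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=5, num_words=8):
    print(topic_id, words)                                 # most relevant words per topic

# topic distribution for one document, e.g. a new tweet
print(lda[dictionary.doc2bow(processed[0])])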
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which will map the words you retrieved to a certain class. NLTK also has tools for classification which are linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example, which uses NLTK and the WordNet database. You get access to the synsets, which seem to be pretty broad, and you can also look at the hypernyms (see for example list(dog.closure(hyper))).
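For reference, that WordNet lookup looks roughly like this in NLTK (nltk.download('wordnet') may be needed first):

from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
hyper = lambda s: s.hypernyms()
print(list(dog.closure(hyper)))   # walks up the hypernym hierarchy: canine, carnivore, ...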
Basically, you should consider using a multi-label classifier on the whole tokenized text (comments on Facebook and tweets are usually short; you might also decide to only consider FB comments below 200 characters, your choice). The choice of a multi-label classifier is motivated by the non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry, i.e. smartwatches, etc.). This is a fairly simple setup, but it's an interesting first step whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for is in the subject of
Natural Language Processing (NLP): processing text data, and
Machine learning (where the classification models are built)
First I would suggest going through NLP tutorials and then text classification tutorials, the most appropriate being https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have twitter/facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning, i suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least a training set (70-90%) and a testing set (10-30%).
Note: I say "at least" because there are other ways to split up your documents, e.g. for development or cross-validation. (If you don't understand this, it's all right, just follow step 2.)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply your model onto the test set and then see how well you performed.
3b. If unsupervised, use the training set to generate document clusters (that is, to group similar documents), but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To date there is no really good solution to this; even highly effective neural networks cannot tell you what their neurons are firing on, only that each neuron is firing on something specific.)
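A hedged sketch of option 3b with scikit-learn, clustering TF-IDF vectors of the documents (documents is assumed to be a list of tweet/comment strings, and 5 clusters is an arbitrary choice):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
kmeans = KMeans(n_clusters=5, random_state=0).fit(vectors)
print(kmeans.labels_[:10])   # cluster id per document; labelling the clusters is still up to you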
I'm trying to make an ANN to classify a PDF file as either malicious or clean, by utilising the 26,000 PDF samples (both clean and malicious) found on contagiodump. For each PDF file, I used PDFid.py to parse the file and return a vector of 42 numbers. The 26000 vectors are then passed into pybrain; 50% for training and 50% for testing. This is my source code:
https://gist.github.com/sirpoot/6805938
After much tweaking with the dimensions and other parameters I managed to get a false positive rate of about 0.90%. This is my output:
https://gist.github.com/sirpoot/6805948
My question is, is there any explicit way for me to decrease the false positive rate further? What do I have to do to reduce the rate to perhaps 0.05%?
There are several things you can try to increase the accuracy of your neural network.
Use more of your data for training. This will permit the network to learn from a larger set of training samples. The drawback of this is that having a smaller test set will make your error measurements more noisy. As a rule of thumb, however, I find that 80%-90% of your data can be used in the training set, with the rest for test.
Augment your feature representation. I'm not familiar with PDFid.py, but it only returns ~40 values for a given PDF file. It's possible that there are many more than 40 features that might be relevant in determining whether a PDF is malicious, so you could conceivably use a different feature representation that includes more values to increase the accuracy of your model.
Note that this can potentially involve a lot of work -- feature engineering is difficult! One suggestion I have if you decide to go this route is to look at the PDF files that your model misclassifies, and try to get an intuitive idea of what went wrong with those files. If you can identify a common feature that they all share, you could try adding that feature to your input representation (giving you a vector of 43 values) and re-train your model.
Optimize the model hyperparameters. You could try training several different models using training parameters (momentum, learning rate, etc.) and architecture parameters (weight decay, number of hidden units, etc.) chosen randomly from some reasonable intervals. This is one way to do what is called "hyperparameter optimization" and, like feature engineering, it can involve a lot of work. However, unlike feature engineering, hyperparameter optimization can largely be done automatically and in parallel, provided you have access to a lot of processing cores.
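As an illustration of this kind of random search (sketched with scikit-learn's MLP rather than pybrain, so treat it as an outline of the idea only; X_train and y_train are your 42-value vectors and labels):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

param_distributions = {
    "hidden_layer_sizes": [(20,), (40,), (80,), (40, 20)],
    "alpha": loguniform(1e-5, 1e-1),              # weight decay
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "momentum": [0.5, 0.9, 0.99],
}
search = RandomizedSearchCV(MLPClassifier(solver="sgd", max_iter=500),
                            param_distributions, n_iter=20, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)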
Try a deeper model. Deep models have become quite "hot" in the machine learning literature recently, especially for speech processing and some types of image classification. By using stacked RBMs, a second-order learning method (PDF), or a different nonlinearity like a rectified linear activation function, then you can add multiple layers of hidden units to your model, and sometimes this will help improve your error rate.
These are the ones that come to mind right off the bat. Good luck !
Let me first say I am in no way an expert in neural networks. But I played with pyBrain once, and I called the .train() method inside a loop that kept running while the error was above 0.001 to get the error rate I wanted. So you could try training on all your files with such a loop and then test on other files.
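If I remember the pyBrain API correctly, that loop looks roughly like this (treat it as a sketch; net and ds stand for the network and dataset already built in your gist):

from pybrain.supervised.trainers import BackpropTrainer

trainer = BackpropTrainer(net, ds)   # net, ds: network and dataset built elsewhere
error = trainer.train()              # one epoch; returns the average error
while error > 0.001:
    error = trainer.train()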
The title says it all; I have an SQL database bursting at the seams with online conversation text. I've already done most of this project in Python, so I would like to do this using Python's NLTK library (unless there's a strong reason not to).
The data is organized by Thread, Username, and Post. Each thread more or less focuses on discussing one "product" of the Category that I am interested in analyzing. Ultimately, when this is finished, I would like to have an estimated opinion (like/dislike sort of deal) from each user for any of the products they had discussed at some point.
So, what I would like to know:
1) How can I go about determining what product each thread is about? I was reading about keyword extraction... is that the correct method?
2) How do I determine a specific user's sentiment based on their posts? From my limited understanding, I must first "train" NLTK to recognize certain indicators of opinion, and then do I simply determine the context of those words when they appear in the text?
As you may have guessed by now, I have no prior experience with NLP. From my reading so far, I think I can handle learning it though. Even just a basic and crude working model for now would be great if someone can point me in the right direction. Google was not very helpful to me.
P.S. I have permission to analyze this data (in case it matters)
Training any classifier requires a training set of labeled data and a feature extractor to obtain feature sets for each text. After you have a trained classifier, you can apply it to previously unseen text (unlabeled) and obtain a classification based on the machine learning algorithm used. NLTK gives a good explanation and some samples to play around with.
If you are interested in building a classifier for positive/negative sentiment, using your own training dataset, I would avoid simple keyword counts, as they aren't accurate for a number of reasons (e.g. negation of positive words: "not happy"). An alternative, where you can still use a large training set without having to manually label anything, is distant supervision. Basically, this approach uses emoticons or other specific text elements as noisy labels. You still have to choose which features are relevant, but many studies have had good results with simply using unigrams or bigrams (individual words or pairs of words, respectively).
All of this can be done relatively easily with Python and NLTK. You can also choose to use a tool like NLTK-trainer, which is a wrapper for NLTK and requires less code.
I think this study by Go et al. is one of the easiest to understand. You can also read other studies for distant supervision, distant supervision sentiment analysis, and sentiment analysis.
There are a few built-in classifiers in NLTK with both training and classification methods (Naive Bayes, MaxEnt, etc.), but if you are interested in using Support Vector Machines (SVM) then you should look elsewhere. Technically NLTK provides you with an SVM class, but it's really just a wrapper for PySVMLight, which itself is a wrapper for SVMLight, written in C. I had numerous problems with that approach, though, and would instead recommend LIBSVM.
For determining the topic, many have used simple keywords but there are some more complex methods available.
You could train any classifier with similar datasets and see what the results are when you apply it to your data. For example, the NLTK contains the Movie Reviews Corpus that contains 1000 positive and 1000 negative reviews. Here is an example on how to train a Naive Bayes Classifier with it. Some other review datasets like Amazon Product Review data are available here.
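A rough sketch of that Naive Bayes training on the Movie Reviews Corpus, using simple word-presence features (nltk.download('movie_reviews') may be needed first; the 1600/400 split is arbitrary):

import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def features(words):
    return {word: True for word in words}   # word-presence features

docs = [(features(movie_reviews.words(fileid)), category)
        for category in movie_reviews.categories()
        for fileid in movie_reviews.fileids(category)]
random.shuffle(docs)

train_set, test_set = docs[:1600], docs[1600:]
classifier = NaiveBayesClassifier.train(train_set)
print(accuracy(classifier, test_set))
classifier.show_most_informative_features(10)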
Another possibility is to take a list of positive and negative words like this one and count their frequencies in your dataset. If you want a complete list, use SentiWordNet.
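Counting those word-list hits could be as simple as the following (the file names are placeholders for whichever positive/negative word lists you download, and post stands for one user's text):

positive = set(open("positive-words.txt").read().split())   # placeholder file name
negative = set(open("negative-words.txt").read().split())   # placeholder file name

tokens = post.lower().split()
score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)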