I need to write a program that, given an object with certain attributes, knows how to classify it. It should learn to classify new objects by being trained on a list of known objects with known attributes.
For example, I have object A with the following attributes: a=10 and b=1. I have also trained the program so that it knows that values in 5..15 for a and 0..2 for b classify an object as label1.
As the program evolves, I need to keep training it with known data so that the attribute intervals (and hence the classification) become more accurate.
Now, I haven't got any experience with machine learning or anything of this kind, and I would like to know how I should start with this. I've seen plenty of tutorials, but only for text classification, and only for two-way classification (that is, positive or negative, yes or no... only two values to choose from). I would have 5-6 labels to start with and their number will soon increase. Also, the object attributes are integers.
Any tip is highly appreciated!
Machine learning is a very broad field, so the first step would be knowing exactly what you're looking for and familiarizing yourself with the subproblem you're trying to solve.
By your description, you're trying to solve a classification problem using a supervised learning approach.
I'll paraphrase a bit from here:
The classification problem consists in identifying which class an observation belongs to.
Supervised learning is a way of "teaching" a machine. Basically, an algorithm is trained through examples (i.e.: this particular object belongs to class X). After training, the machine should be able to apply its acquired knowledge to new data.
The k-NN algorithm is one of the simplest algorithms for solving this kind of problem. I suggest you familiarize yourself with it.
You have an implementation of k-NN in scipy. Here's a link to a tutorial on using it.
Now, answering your specific questions:
only for two-way classification (that is, positive or negative, yes or no... only two values to choose from)
k-NN can handle any (finite) number of classes, so you're in the clear.
Also, the object attributes are integers
k-NN usually operates in a continuous space, so you'll have to convert those integers to floats.
Mapping the attribute values into points in the algorithm's space is not a trivial problem (see data pre-processing, especially the articles on normalization, feature extraction and selection).
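To make this concrete, here is a minimal sketch of multi-class k-NN (using scikit-learn's KNeighborsClassifier rather than raw SciPy, for brevity; the attribute values and labels are made up):

    from sklearn.neighbors import KNeighborsClassifier

    # Training data: each row is an object described by two integer
    # attributes (a, b); the labels can take any number of distinct values.
    X_train = [[10, 1], [12, 2], [30, 7], [28, 6], [50, 0]]
    y_train = ["label1", "label1", "label2", "label2", "label3"]

    clf = KNeighborsClassifier(n_neighbors=3)
    clf.fit(X_train, y_train)

    # Classify a new object with a=11, b=1.
    print(clf.predict([[11, 1]]))  # expected: ['label1']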
I am working on wastewater data. The data is collected every 5 minutes. This is the sample data.
The thresholds of the individual parameters are provided. My question is: what kind of model should I go for to classify the water as usable or not usable, and also to output the anomaly because of which it is unusable (if possible, since it is a combination of the variables)? The yes/no column does not exist yet and will be provided to me.
The other question I have is: how do I keep the model running, given that the data is collected every 5 minutes?
Your data and use case seem fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited for structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit-learn is super mature and easy to use, so you should be able to get something working without too much trouble.
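As a minimal sketch (the column names, thresholds and readings below are made up for illustration), training and inspecting a decision tree could look like this:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical sensor readings: one row per 5-minute sample.
    # Illustrative columns: pH, turbidity, dissolved oxygen.
    X = [[7.0, 1.2, 8.5],
         [6.4, 9.8, 3.1],
         [7.2, 0.9, 7.9],
         [5.9, 8.5, 2.6]]
    y = ["usable", "not usable", "usable", "not usable"]

    clf = DecisionTreeClassifier(max_depth=3)
    clf.fit(X, y)

    # export_text prints the learned rules, which tells you *why*
    # a sample was classified one way or the other.
    print(export_text(clf, feature_names=["pH", "turbidity", "dissolved_O2"]))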
As regards timing, I'm not sure how you or your employee will be taking samples. If you will be getting and reading samples at that rate, using your trained model to label each new sample should not be a problem, but I'm not sure I understood your situation.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix it?", and not so much at general questions such as this one. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!
I am new to machine learning and have a problem I want to solve; I'd like to see if anyone has ideas on what type of algorithm would be best to use. I am not looking for code, but rather a process.
Problem: I am classifying people into 2 categories: high risk and low risk. (This is a very basic starting point, and I will expand as I learn how to classify in more detail.)
Each person has 11 variables I am looking at, and each variable has a binary value (0 for no, 1 for yes). The variables are things like married, gun_owner, home_owner, etc. So I gather each person can have 2^11 or 2048 different combinations of these variables.
I have a data set that has this information and then the result (whether or not they committed a crime). I figured this data would be used for training and then the algorithm can make predictions on high risk individuals.
Does anyone have any ideas for what would be the best algorithm? Since there are so many variables, I am having trouble figuring out what may work best.
This is a binary classification problem, with each input a binary string of length 11. There are many algorithms for this problem. The simplest one is the naive Bayes model (https://en.wikipedia.org/wiki/Naive_Bayes_classifier). You could also try some linear classifiers such as logistic regression or SVM. They both work well for linearly separable data and binary classification.
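As an illustrative sketch (all feature values below are made up), Bernoulli naive Bayes fits this setting well because every feature is 0/1:

    from sklearn.naive_bayes import BernoulliNB

    # Each row: 11 binary attributes describing one person.
    X = [[1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
         [0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1],
         [1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
         [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1]]
    y = ["low risk", "high risk", "low risk", "high risk"]

    clf = BernoulliNB()
    clf.fit(X, y)

    # Class probability estimates for a new person.
    print(clf.predict_proba([[1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]]))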
It seems like you want to classify people based on a few features. It looks like a simple binary classification problem. However, it is not very clear whether the data you have is labeled or not.
So the first question is: in your dataset, do you know which person is 'high risk' and which person is 'low risk'? If you have that information, you can use a whole lot of machine learning models for this classification task.
However, if the labels are not present ('high risk' or 'low risk'), you cannot do that. Then you have to think about some unsupervised learning methods (clustering); see the sketch below.
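For example, a minimal clustering sketch (assuming scikit-learn's KMeans; the binary feature vectors are made up) could look like this:

    from sklearn.cluster import KMeans

    # Unlabeled binary feature vectors.
    X = [[1, 0, 1], [1, 1, 1], [0, 0, 0], [0, 1, 0]]

    # Ask for two clusters, hoping they align with high/low risk.
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    print(km.fit_predict(X))  # cluster ids per person, e.g. [0 0 1 1]

Hope this answers your question.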
I think this is kind of "blasphemy" for someone who comes from the AI world, but since I come from the world where we program and get a result, and where there is the concept of storing something in memory, here is my question:
Machine learning works by iterations: the more iterations there are, the better our algorithm becomes. But after those iterations, is there a result stored somewhere? Because if I think as a programmer, if I re-run the program, I must store the previous results somewhere, or they will be overwritten. Or do I need to use, for example, an array to store my results?
For example, if I train my image recognition algorithm on a bunch of cat picture data sets, what are the variables I need to add to my algorithm so that, when I use it on an image library, it succeeds every time it encounters a cat? What will I use, since there is nothing saved for my next step?
All the videos and tutorials I have seen only draw a graph to visualize the decision making, and never apply something that can be used in a future program.
For example, in this example, k-NN is used to teach how to detect a handwritten digit, but where is the explicit value to use?
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
NB: to the people clicking on the close request or downvoting, please at least give a reason.
the more iterations there are, the better our algorithm becomes, but after those iterations, is there a result stored somewhere
What you're alluding to here is the optimization part.
However, to optimize a model, we first have to represent it.
For example, if I'm creating a very simple linear model to predict house prices from surface in square meters, I might go for this model:
price = a * surface + b
That's the representation.
Now that you have represented the model, you want to optimize it, i.e. find the params a and b that minimize the prediction error.
is there a result stored somewhere?
In the above, we say that we have learned the params or weights a and b.
That's what you keep: the weights, which come from optimization (also called training), and of course the model itself.
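As a minimal sketch (the surface/price values are made up), this is what learning and keeping a and b looks like with scikit-learn:

    from sklearn.linear_model import LinearRegression

    # Made-up training data: surface in square meters -> price.
    surfaces = [[50], [80], [100], [120]]
    prices = [150000, 240000, 300000, 360000]

    model = LinearRegression()
    model.fit(surfaces, prices)

    # These two numbers ARE the stored result of training:
    a = model.coef_[0]    # the learned slope
    b = model.intercept_  # the learned offset
    print(a, b)           # roughly 3000.0 and 0.0 for this data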
I think there is some confusion. Let's clear it up.
Machine learning models usually have parameters, and these parameters are trainable. This means a training algorithm finds the "right" values of these parameters so that the model works properly for a given task.
This is the learning part. The actual parameter values are "inferred" from training data.
What you would call the result of the training process is a model. The model is represented by formulas with parameters, and these parameters must be stored. Typically when you use a ML/DL framework (like scikit-learn or Keras), the parameters are stored alongside some information about the type of model, so it can be reconstructed at runtime.
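For instance, with scikit-learn you would typically persist the trained model to disk and load it back in a later run (a minimal sketch; the file name and training data are arbitrary):

    import joblib
    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit([[50], [80], [100]], [150000, 240000, 300000])

    # Save the trained parameters (plus the model type) to disk...
    joblib.dump(model, "house_price_model.joblib")

    # ...and reload them later, in another run of the program.
    restored = joblib.load("house_price_model.joblib")
    print(restored.predict([[90]]))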
I had some trouble finding a good transductive SVM (semi-supervised support vector machine, or S3VM) implementation for Python. Finally I found the implementation by Fabian Gieseke of Oldenburg University, Germany (code is here: https://www.ci.uni-oldenburg.de/60506.html; paper title: Fast and Simple Gradient-Based Optimization for Semi-Supervised Support Vector Machines).
I am now trying to integrate the learned model into my scikit-learn code.
1) This works already:
I've got a binary classification problem. I defined a new method inside the S3VM code that returns the self.__c coefficients (these are needed for the decision function of the classifier).
I then assign these (in my own scikit-learn code, where clf stands for an svm.SVC classifier) to clf.dual_coef_ and properly change clf.support_ too (which holds the indices of the support vectors). It takes a while because sometimes you need numpy arrays and sometimes lists, etc., but it works quite well.
2) This doesn't work at all:
I want to adapt this now to a multi-class classification problem (again with an svm.SVC classifier).
I see the scheme for multi-class dual_coef_ in the docs at
http://scikit-learn.org/stable/modules/svm.html
I tried some things already but seem to mess it up all the time. My strategy is as follows:
    for all pairs of classes:
        calculate the coefficients with qns3vm for the properly binarized
        labeled training set (filling 0s into the coefficient vector at the
        positions of labeled training instances that are not in the current
        class pair) --> this gives a 1 x (l+u) np.array of coefficients
    stack these rows to get a (n_class*(n_class-1)/2) x (l+u) matrix
    (I do not have a clue why the docs say this should be of shape
    [n_class-1, n_SV(=l+u)]?)
    replace clf.dual_coef_ with this matrix
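For reference, here is a minimal sketch of how I inspect the shape I'm trying to reproduce (assuming a plain scikit-learn SVC on the 3-class iris dataset):

    from sklearn import datasets, svm

    # Fit an ordinary multi-class (one-vs-one) SVC on a 3-class dataset.
    X, y = datasets.load_iris(return_X_y=True)
    clf = svm.SVC(kernel="linear")
    clf.fit(X, y)

    # dual_coef_ has shape [n_class-1, n_SV], not one row per class pair.
    print(clf.dual_coef_.shape)  # (2, n_SV) for 3 classes
    print(clf.support_)          # indices of the support vectors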
Does anybody know the right way to replace dual_coef_ in the multi-class setting? Or is there a neat piece of example code someone can recommend? Or at least a better explanation of the shape of dual_coef_ in the one-vs-one multi-class setting?
Thanks!
Damian
I have a human-tagged corpus of over 5000 subject-indexed documents in XML. They vary in size from a few hundred kilobytes to a few hundred megabytes, ranging from short articles to manuscripts. They have all been subject-indexed as deep as the paragraph level. I am lucky to have such a corpus available, and I am trying to teach myself some NLP concepts. Admittedly, I've only begun: thus far I have read only the freely available NLTK book and streamhacker, and skimmed Jacob's(?) NLTK cookbook. I'd like to experiment with some ideas.
It was suggested to me that perhaps I could take bigrams and use naive Bayes classification to tag new documents. I feel as if this is the wrong approach: naive Bayes is proficient at a true/false sort of relationship, but to use it on my hierarchical tag set I would need to build a new classifier for each tag, nearly a thousand of them. I have the memory and processor power to undertake such a task, but am skeptical of the results. However, I will try this approach first, to appease someone's request. I should have this accomplished in the next day or two, but I predict the accuracy will be low.
So my question is a bit open-ended. Largely because of the nature of the discipline and the general unfamiliarity with my data, it will likely be hard to give an exact answer.
What sort of classifier would be appropriate for this task? Was I wrong: can naive Bayes be used for more than a true/false sort of operation?
What feature extraction should I pursue for such a task? I am not expecting much from the bigrams.
Each document also contains some citation information, including author(s), author gender (m, f, mix (m&f), or other (gov't institutions et al.)), document type, publication date (16th century to current), human analyst, and a few other general elements. I'd also appreciate some useful descriptive tasks to help investigate this data for gender bias, analyst bias, etc., but I realize that is a bit beyond the scope of this question.
What sort of classifier would be appropriate for this task? Was I wrong: can naive Bayes be used for more than a true/false sort of operation?
You can easily build a multilabel classifier by building a separate binary classifier for each class that can distinguish between that class and all others. The classes for which the corresponding classifier yields a positive value are the combined classifier's output. You can use naive Bayes for this, or any other algorithm. (You could also play tricks with NB's probability output and a threshold value, but NB's probability estimates are notoriously bad; only the ranking among them is what makes it valuable.)
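A minimal sketch of this one-classifier-per-class scheme with scikit-learn (the documents and tags below are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = ["the history of naval trade",
            "a manuscript on medieval medicine",
            "trade routes and medicine in the renaissance"]
    tags = [["history", "trade"], ["medicine"], ["trade", "medicine"]]

    # Turn tag lists into a binary indicator matrix, one column per tag.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(tags)

    # One binary naive Bayes classifier per tag, behind the scenes.
    clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                        OneVsRestClassifier(MultinomialNB()))
    clf.fit(docs, Y)

    pred = clf.predict(["a treatise on trade and medicine"])
    print(mlb.inverse_transform(pred))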
What feature extraction should I pursue for such a task?
For text classification, tf-idf vectors are known to work well, but you haven't specified what the exact task is. Any metadata on the documents might work as well; try doing some simple statistical analysis. If any feature of the data is more frequently present in some classes than in others, it may be a useful feature.
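For example, computing tf-idf vectors with scikit-learn takes only a few lines (the documents are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the history of naval trade",
            "a manuscript on medieval medicine"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

    print(X.shape)
    print(vectorizer.get_feature_names_out())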
I understand that you have two tasks to solve here. The first one is that you want to tag an article based on its topic(?), so the article can be classified into more than one category/class; thus you have a multi-label classification problem. There are several algorithms proposed for solving multi-label classification problems - please check the literature. I found this paper quite helpful when I was dealing with a similar problem: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.9401
The second problem you want to solve is to tag the paper with author, gender, and type of document. This is a multi-class problem: each class has more than two potential values, but every document has some value for each of these classes.
I think as a first step it is important to understand the differences between multi-class and multi-label classification.
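A tiny sketch of the difference in label representation (hypothetical labels):

    # Multi-class: each document gets exactly ONE label out of several.
    y_multiclass = ["history", "medicine", "trade"]

    # Multi-label: each document can get ANY subset of the labels.
    y_multilabel = [["history", "trade"], ["medicine"], []]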