Machine Learning - Classification or Clustering - python

I am new to machine learning and have a problem I want to solve, and I'd like to see if anyone has ideas about what type of algorithm would be best to use. I am not looking for code, but rather a process.
Problem: I am classifying people into 2 categories: high risk and low risk. (This is a very basic starting point, and I will expand as I learn how to classify in more detail.)
Each person has 11 variables I am looking at, and each variable has a binary value (0 for no, 1 for yes). The variables are things like is_married, gun_owner, home_owner, etc. So I gather each person can have one of 2^11, or 2,048, different combinations of these variables.
I have a data set that has this information and then the result (whether or not they committed a crime). I figured this data would be used for training and then the algorithm can make predictions on high risk individuals.
Does anyone have any ideas for what would be the best algorithm? Since there are so many variables, I am having trouble figuring out what may work best.

This is a binary classification problem, with each input a binary string of length 11. There are many algorithms for this problem. The simplest one is the naive Bayes model (https://en.wikipedia.org/wiki/Naive_Bayes_classifier). You could also try some linear classifiers such as logistic regression or SVM. Both work well for linearly separable data and binary classification.
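To make this concrete, here is a minimal sketch in scikit-learn, assuming your data is already loaded as a 0/1 feature matrix X and a label vector y (1 = committed a crime); the toy arrays below are stand-ins for your real dataset. BernoulliNB is the naive Bayes variant designed for binary features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression

# Toy stand-in data; replace with your real 11 binary variables and labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 11))
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Naive Bayes for binary features.
nb = BernoulliNB().fit(X_train, y_train)
print("naive Bayes accuracy:", nb.score(X_test, y_test))

# A linear baseline for comparison.
lr = LogisticRegression().fit(X_train, y_train)
print("logistic regression accuracy:", lr.score(X_test, y_test))
```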

It seems like you want to classify people based on a few features, which looks like a simple binary classification problem. However, it is not clear whether the data you have is labeled.
So the first question is: in your dataset, do you know which person is 'high risk' and which is 'low risk'? If you have that information, you can use a whole range of machine learning models for this classification task.
However, if the labels ('high risk' or 'low risk') are not present, you cannot do that. Then you have to consider unsupervised learning methods (clustering). Hope this answers your question.
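If you end up on the unsupervised path, here is a minimal clustering sketch, assuming the same kind of 0/1 feature matrix X but no labels (the toy data is made up). You would inspect the two clusters afterwards to decide which one corresponds to 'high risk'.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for an unlabeled 0/1 feature matrix.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 11))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:20])       # cluster assignment for the first 20 people
print(kmeans.cluster_centers_)   # per-cluster mean of each binary variable
```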

Related

How do I make a good feature using machine learning on a time series forecast that has only traffic volume as an input?

So I have a time series that only has traffic volume. I've tried FB Prophet and NeuralProphet. They work okay, but I would like to do something using machine learning. So far my problem is constructing features. Using the classical day-of-year, month, etc. does not give me good results. I have tried using shift to get the average, minimum, and max of the two previous days. That works, but my problem is that when I try to predict days in advance, the feature no longer applies, since I can't get the average for a day that hasn't happened yet. My main concern is finding a good feature that my future prediction dataframe will also have. A picture of my data is included. Does anyone know how I would do this?
First of all, you have to clarify some definitions. FBProphet works on the same mechanism as any machine learning algorithm, that is, fitting the model and then predicting the output. Being an additive regression model with a piecewise linear or logistic growth-curve trend, it can be considered a machine learning method that allows us to predict a continuous outcome variable.
Secondly, I think you missed the most important term your question is really about, namely: feature engineering.
Feature engineering includes:
- the process of using domain knowledge to extract features (characteristics, properties, attributes) from raw data;
- the process of transforming raw data into features that better represent the underlying problem to the predictive models, etc.
But it's very unlikely that you would use machine learning to do feature engineering; you do feature engineering in order to improve your machine learning model. Many techniques, such as imputation, handling outliers, binning, log transforms, one-hot encoding, grouping operations, feature splitting, and scaling, are hybrid methods using a statistical approach and/or domain knowledge.
Regarding your data, bearing in mind that the seasonality is already handled by FBProphet, I am not confident that feature engineering transformations such as adding the day of the week, adding holiday periods, etc. could really help improve performance.
To conclude, it is unlikely that you can create new features ex nihilo that would make your model outperform, whether you process/transform your data or add an external domain-knowledge dataset.
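That said, if you do want the lag features described in the question to work several days ahead, the usual trick is to only use lags that are at least as old as the forecast horizon, so the same features exist for future dates. A minimal sketch, assuming a DataFrame with a DatetimeIndex and a 'volume' column (all names hypothetical):

```python
import numpy as np
import pandas as pd

h = 3  # forecast horizon in days

# Toy stand-in for the traffic-volume series.
df = pd.DataFrame(
    {"volume": np.random.default_rng(0).integers(100, 200, size=60)},
    index=pd.date_range("2023-01-01", periods=60, freq="D"),
)

# Only lags >= h are known at prediction time.
df[f"lag_{h}"] = df["volume"].shift(h)
df[f"lag_{h + 1}"] = df["volume"].shift(h + 1)
# Rolling mean of the week ending h days before the target day.
df[f"rollmean_{h}"] = df["volume"].shift(h).rolling(7).mean()

train = df.dropna()  # drop the leading rows where the lags are undefined
print(train.head())
```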

What to do when only a portion of training/testing data generates confident predictions?

I have a general question on machine learning that can be applied to any algorithm. Suppose I have a particular problem, let us say predicting whether a soccer team wins or loses. The features I choose are the amount of sleep each player gets before the game, sentiment analysis on news coverage, etc.
In this scenario, there is a pattern or correlation (something only a machine learning algorithm can pick up on) that only occurs around 5% of the time. But when it occurs, it is very predictive of the upcoming match.
How do you set up a machine learning algorithm to handle such a case, in which it has the ability to discard most samples as noise? For example, consider a binary SVM. If there were a way to discard most of the "noisy" samples, a lot less overfitting would occur, because the hyperplane would not have to absorb error from these samples.
Regularization would help in this case, but due to the very low percentage of predictive information, is there a way we can code the algorithm to discard these samples in training and refuse to predict on certain test samples?
I have also read about confidence intervals, but they seem more of an analytic tool to me than something to use in the algorithm.
I was thinking that using another ML algorithm, trained on the same features, to decide which test samples are keepers might be a good idea.
Any answers using any machine learning algorithm (e.g. SVM, neural net, random forest) as an example would be much appreciated. Any suggestions on where to look would be great as well (Google is usually my friend, but not this time). Please let me know if I can rephrase the question better. Thanks.
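One common way to get the "refuse to predict" behavior described above is a reject option: only keep predictions whose estimated probability clears a threshold. A minimal sketch, assuming any classifier with predict_proba (the data and threshold below are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
confidence = clf.predict_proba(X_test).max(axis=1)

threshold = 0.9  # tune on a validation set, not on the test set
keep = confidence >= threshold
preds = np.where(keep, clf.predict(X_test), -1)  # -1 means "abstain"

print(f"predicted on {keep.mean():.0%} of the samples")
print("accuracy on kept samples:", (preds[keep] == y_test[keep]).mean())
```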

I want to implement a machine learning or deep learning model for text classification (100 classes)

I have a dataset that is similar to the one where we have movie plots and their genres. The number of classes is around 100. What algorithm should I choose for this 100-class classification? The classification is multi-label, because one movie can have multiple genres.
Please recommend one from the following. You are free to suggest any other model if you want to.
1. Naive Bayesian
2. Neural networks
3. SVM
4. Random forest
5. k-nearest neighbours
It would be helpful if you also mention the necessary Python library.
An important step in machine learning engineering is to properly inspect the data. This gives you insight that determines which algorithm to choose. Sometimes you might try out more than one algorithm and compare the models, in order to be sure that you did your best with the data.
Since you did not disclose your data, I can only give you the following advice: if your data is "easy", meaning that you need only a few features and a simple combination of them to solve the task, use naive Bayes or k-nearest neighbours. If your data is of "medium" difficulty, use Random Forest or SVM. If solving the task requires a very complicated decision boundary combining many feature dimensions in a non-linear fashion, choose a neural network architecture.
I suggest you use Python and the scikit-learn package for SVM, Random Forest, or k-NN.
For neural networks, use Keras.
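As a starting point, here is a minimal multi-label sketch with scikit-learn, assuming lists of plot strings and genre lists (the tiny example data below is made up). OneVsRestClassifier trains one binary classifier per genre, which turns a single-label model into a multi-label one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

plots = [
    "a cop chases a serial killer",
    "two friends fall in love",
    "aliens invade earth and a hero fights back",
]
genres = [["crime", "thriller"], ["romance", "comedy"], ["sci-fi", "action"]]

# One binary 0/1 column per genre.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)

model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(plots, Y)
print(mlb.inverse_transform(model.predict(["detective hunts a murderer"])))
```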
I am sorry that I cannot give you THE recipe you might expect for solving your problem. Your question is posed very broadly.

Predicting Energy Consumption of different buildings

I have a dataset, which you can find in the (updated) file here, containing many different characteristics of different office buildings, including their surface area and the number of people working there. In total there are about 200 records. I want to use an algorithm that can be trained on the dataset above in order to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms in the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) in order to predict a continuous variable. Surface_area and number of workers had a correlation with the target variable between 0.3 and 0.4, so I assumed them to be good features for the model and included them in training. However, I got a mean absolute error of about 13,350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or if you could examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small; yes, the number of data samples is very small; but it is important to do the best we can with what we have.
The dataset header is in a language other than English; it is important to convert it to a language most people in the community will understand (in this case English). After a bit of tinkering, I found out that the language used is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function, which (after using Google Translate) tells you what the purpose of the building is. Intuitively, this should have a large correlation with power consumption: industries tend to use more power than normal households. After translation, I found that the main types were Residential, Office, Accommodation, and Meeting. This feature thus has to be encoded as a nominal variable to train the model.
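Encoding it could look like the following minimal sketch, assuming a DataFrame whose headers have already been translated (the column names and values here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "surface_area": [1200, 800, 450],
    "workers": [60, 25, 10],
    "building_function": ["Office", "Residential", "Meeting"],
    "kwh": [95000, 42000, 18000],
})

# One 0/1 dummy column per building function.
df_encoded = pd.get_dummies(df, columns=["building_function"])
print(df_encoded.columns.tolist())
```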
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I would be able to provide you with some code to perform this regression task. In such tasks it is very important to understand the data and then perform feature engineering.

how to predict binary outcome with categorical and continuous features using scikit-learn?

I need advice choosing a model and machine learning algorithm for a classification problem.
I'm trying to predict a binary outcome for a subject. I have 500,000 records in my data set and 20 continuous and categorical features. Each subject has 10-20 records. The data is labeled with its outcome.
So far I'm thinking of a logistic regression model and kernel approximation, based on the cheat-sheet here.
I am unsure where to start when implementing this in either R or Python.
Thanks!
Choosing an algorithm and optimizing its parameters is a difficult task in any data mining project, because it must be customized for your data and problem. Try different algorithms like SVM, Random Forest, Logistic Regression, KNN, etc., run cross-validation for each of them, and then compare them.
You can use GridSearchCV in scikit-learn to try different parameters and optimize them for each algorithm. Also try this project, which tests a range of parameters with a genetic algorithm.
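A minimal GridSearchCV sketch, assuming a prepared feature matrix X and label vector y (the toy data and parameter grid below are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the real 500,000-record dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```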
Features
If your categorical features don't have too many possible different values, you might want to have a look at sklearn.preprocessing.OneHotEncoder.
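Combining one-hot encoded categorical features with scaled continuous ones fits naturally into a pipeline. A minimal sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, 47, 33, 51],
    "income": [30000, 72000, 45000, 88000],
    "region": ["north", "south", "south", "east"],
    "outcome": [0, 1, 0, 1],
})

# Scale the continuous columns, one-hot encode the categorical one.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = make_pipeline(pre, LogisticRegression())
model.fit(df.drop(columns="outcome"), df["outcome"])
print(model.predict(df.drop(columns="outcome")))
```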
Model choice
The choice of "the best" model depends mainly on the amount of available training data and the simplicity of the decision boundary you expect to get.
You can try dimensionality reduction to 2 or 3 dimensions. Then you can visualize your data and see if there is a nice decision boundary.
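For example, a quick PCA projection (a sketch; X and y are placeholders for your prepared data):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X2 = PCA(n_components=2).fit_transform(X)

# Color by label to see whether the classes separate in 2D.
plt.scatter(X2[:, 0], X2[:, 1], c=y, s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```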
With 500,000 training examples you can think about using a neural network. I can recommend Keras for beginners and TensorFlow for people who know how neural networks work.
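A minimal Keras sketch for a binary classifier, assuming 20 numeric columns after preprocessing (a toy setup, not a tuned architecture):

```python
import numpy as np
from tensorflow import keras

# Toy stand-in data with 20 features.
X = np.random.default_rng(0).normal(size=(1000, 20)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```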
You should also know that there are ensemble methods.
A nice cheat sheet on what to use is in the sklearn tutorial you already found (source: scikit-learn.org).
Just try it and compare the different results. Without more information it is not possible to give you better advice.
