Dependent variable is continuous but independent variables are categorical - python

I am working on a dataset where my dependent variable is continuous but all my independent variables are categorical (non-binary). I have tried one-hot encoding, i.e. creating dummy variables. I am getting a low R2 of about 0.4 but a high adjusted R2 of around 0.9. However, I am getting vertical lines in my regression plot and residual plot, even though my QQ plot seems to fit a straight line with some heavy tails at the ends. So may I know if a regression model is the right method to use in this kind of scenario? If yes, how should the plots be analyzed, and if no, what other methods and libraries can be employed to yield a better result?

I try to address some of your questions below:
"However I am getting vertical lines in my regression plot and residual plot"
This is expected if all your independent variables (IVs) are categorical. Each category is encoded as a binary indicator, so the prediction for each observation depends only on which combination of categories it falls into. As a simple illustration, imagine a prediction from 2 binary variables: there can only be 4 outcomes (0/0, 0/1, 1/0, 1/1). If you extend this to many binary variables, you still get that kind of discrete prediction.
In other words, there is no slope to speak of, so you should not expect a continuous prediction. You can read more about regression with categories here
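A minimal sketch of why the predictions come out discrete, assuming only NumPy and made-up data (the variable names and coefficients are illustrative, not taken from the question). With two binary predictors, a least-squares fit can only return one value per combination of categories:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical binary (dummy-coded) predictors.
a = rng.integers(0, 2, size=n)
b = rng.integers(0, 2, size=n)

# Continuous response with noise; the coefficients are arbitrary.
y = 1.0 + 2.0 * a - 1.5 * b + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), a, b])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

# The fitted values collapse onto at most 4 distinct numbers,
# one per (a, b) combination.
distinct_predictions = np.unique(np.round(y_pred, 8))
print(len(distinct_predictions))
```

Plotting y against y_pred (or the residuals against y_pred) for this data would show the same vertical bands described in the question.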
"even though my QQ line seems to fit into a straight line with some heavy tails at the end. So may I know if regression model is the right method to be used in this kind of scenario?"
Yes you can still use a linear model.
"If its a yes how should the plots be analyzed and if its a no, what are the other methods and libraries that can be employed to yield a better result?"
What you have is essentially an ANOVA, except that you are not doing inference. You can check for homogeneity of variance using Levene's test or a similar test. These tests can be extremely sensitive when you have a large number of observations. Looking at your QQ plot, which compares quantiles, I think it's fine.
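As a rough sketch of the variance check, assuming SciPy is available and using made-up residual groups (in practice you would split your own residuals by category level), Levene's test can be run like this:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical residuals for three levels of one categorical variable.
# The third group is given a deliberately larger spread.
group_a = rng.normal(0.0, 1.0, 50)
group_b = rng.normal(0.0, 1.0, 50)
group_c = rng.normal(0.0, 3.0, 50)

# center="median" is the robust (Brown-Forsythe) variant.
stat, p_value = stats.levene(group_a, group_b, group_c, center="median")
print(stat, p_value)
```

A small p-value suggests unequal variances across groups, but as noted above, with many observations even tiny differences will come out as significant.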

Related

Input data necessary for forecasting/ estimating trends for a given variable

This could be more of a theoretical question than a code-related one. In my current job I find myself estimating/ predicting (this last is more opportunistic) the water level for a given river in Africa.
The point is that I am developing a simplistic multiple regression model that takes more than 15 years of historical water levels and precipitation (from different locations) to generate water level estimates.
I am not that used to working with Machine Learning or whatever the correct name is. I am more used to modelling data and generating fits (the current data can be perfectly described with asymmetric Gaussians and sigmoid functions combined with low-order polynomials).
So the point is: once I have a multiple regression model, my colleagues advised me not to use fitted data for the estimation but all the raw data instead. Since they couldn't explain the reason to me, I attempted to use the fitted data as raw inputs (in my defense, a median of all the fitting models has a very low deviation error == nice fits). But what I don't understand is why I should use just the raw data, which could be noisy, inaccurate, and influenced by factors that are not directly related (biasing the regression?). What is the advantage of that?
My lack of theoretical knowledge in the field is what makes me wonder about that. Should I always use all the raw data to determine the variables of my multiple regression or can I use the fitted values (i.e. get a median of the different fitting models of each historical year)?
Thanks a lot!
Here are my two cents:
I think your colleagues are saying that because it would be better for the model to learn the correlations between the raw data and the actual rainfall.
In the field you will start with the raw data, so being able to predict directly from it is very useful. Any extra processing you apply to the raw data is work you will have to repeat every time you want to make a prediction.
However, if a simpler model works, e.g. the data are perfectly described by asymmetric Gaussians and sigmoid functions combined with low-order polynomials, then I would recommend doing that, as long as your squared error (y_pred - y_true) ** 2 is very small.
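For reference, a minimal sketch of that error check with NumPy. The numbers are invented for illustration, and ideally y_true would come from data the model was not fitted on:

```python
import numpy as np

# Hypothetical observed values and model estimates (illustrative only).
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])

# Per-sample squared error and its mean (the MSE).
squared_errors = (y_pred - y_true) ** 2
mse = squared_errors.mean()
print(mse)
```

Comparing this number for the raw-data model and the fitted-curve model on the same held-out years is one simple way to decide between them.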

What is the difference in interpretation of the "probability" returned by a kNN or a DNN algorithm

I have two datasets, each defined by the same two parameters. If you plot them on a scatter plot, there is some overlap. I'd like to classify them, but also get a probability that a given point is in one dataset or another. So in the overlap region, I would never expect the probability to be 100%.
I've implemented this using python's scikit-learn package and the kNN algorithm, KNeighborsClassifier. It looks pretty good! When I use predict_proba to return the probability, it looks like what I would expect!
So then I tried doing the same thing with TensorFlow and the DNNClassifier classifier, mostly as a learning exercise for myself. When I evaluated the test samples I used predict_proba to return the probabilities, but the distribution of probabilities looks very different from the kNN approach. It looks like the DNNClassifier is really trying to drive the probabilities to 1 or 0, rather than somewhere in between for the overlapping region.
I've not posted code here because my question is more basic: can I interpret the probabilities returned by these two approaches in the same way? Or is there a fundamental difference between them?
Thanks!
Yes. Provided you used sigmoid or softmax for prediction you should be getting values that are reasonable to interpret as probabilities (DNNClassifier will use softmax as far as I know).
Now, you didn't give us any details on the models. Depending on the complexity of the models and the training parameters, you might be getting more overfitting.
If you are seeing extreme (0 or 1) values for the overlapping area, it's probably overfitting. Use a test/validation set to keep a check on it.
From what you are describing, a very simple model should do; try using fewer layers and fewer parameters.
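One thing worth keeping in mind on the kNN side: with uniform weights, predict_proba in scikit-learn's KNeighborsClassifier is just the fraction of the k nearest neighbours that belong to each class, so it can only take k + 1 distinct values. A small sketch on synthetic overlapping data (the data and the choice k = 5 are assumptions for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two overlapping 2-D point clouds standing in for the two datasets.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

k = 5
knn = KNeighborsClassifier(n_neighbors=k, weights="uniform").fit(X, y)
proba = knn.predict_proba(X)

# Multiplying by k recovers integer neighbour counts per class.
votes = proba * k
print(np.unique(proba[:, 1]))
```

So the kNN output is a local vote count, whereas a softmax output is a learned, smooth function of the inputs; both sum to 1, but how well calibrated either one is has to be checked on held-out data.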

Python model targeting n variable prediction equation

I am looking to build a predictive model and am working with our current JMP model. Our current approach is to guess an nth degree polynomial and then look at which terms are not significant model effects. Polynomials are not always the best and this leads to a lot of confusion and bad models. Our data can have between 2 and 7 effects and always has one response.
I want to use python for this, but package documentation or online guides for something like this are hard to find. I know how to fit a specific nth degree polynomial or do a linear regression in python, but not how to 'guess' the best function type for the data set.
Am I missing something obvious or should I be writing something that probes through a variety of function types? Precision is the most important. I am working with a small (~2000x100) data set.
Potentially I can do regression on smaller training sets, test them against the validation set, then rank the models and choose the best. Is there something better?
Try using other regression models instead of the vanilla Linear Model.
You can use something like this for polynomial regression:
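from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)  # adds squared and pairwise interaction terms
X_ = poly.fit_transform(input_data)  # input_data: array of shape (n_samples, n_effects)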
You can then constrain the weights with Lasso regression:
from sklearn import linear_model
clf = linear_model.Lasso(alpha=0.5, positive=True)  # L1 penalty; positive=True forces non-negative weights
clf.fit(X_, Y_)
where Y_ is the output you want to train against.
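Setting alpha to 0 turns it into ordinary least-squares linear regression. alpha sets the strength of the L1 penalty on the size of the weights: the larger it is, the more weights are shrunk to exactly zero, which effectively removes the corresponding terms. You can also force the weights to be non-negative (positive=True). Check this out here.
Run it with a small degree and perform cross-validation to check how well it fits.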
Increasing the degree of the polynomial generally leads to over-fitting. So if you are forced to use degree 4 or 5, that means you should look for other models.
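The degree search with cross-validation could be sketched roughly as follows. This assumes scikit-learn and uses synthetic data; the alpha value, the degree range and the scaling step are illustrative choices, not recommendations from the answer:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 effects, one response with a quadratic term.
X = rng.uniform(-2, 2, size=(300, 3))
y = 1.0 + X[:, 0] ** 2 - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

scores = {}
for degree in (1, 2, 3, 4):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),                  # put all terms on a comparable scale
        Lasso(alpha=0.01, max_iter=10000), # L1 penalty prunes unneeded terms
    )
    # Mean R^2 over 5 folds; higher is better.
    scores[degree] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

best_degree = max(scores, key=scores.get)
print(scores, best_degree)
```

Because the scores come from held-out folds, a higher degree only wins if it actually generalizes, which addresses the over-fitting concern above.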
You should also take a look at this question. This explains how you can curve fit.
ANOVA (analysis of variance) partitions the variance of the response among the effects to determine which of them are statistically significant... you shouldn't have to choose terms at random.
However, if you are saying that your data is inhomogeneous (i.e., you shouldn't fit a single model to all the data), then you might consider using the scikit-learn toolkit to build a classifier that could choose a subset of the data to fit.

Semi-supervised Gaussian mixture model clustering in Python

I have images that I am segmenting using a gaussian mixture model from scikit-learn. Some images are labeled, so I have a good bit of prior information that I would like to use. I would like to run a semi-supervised training of a mixture model, by providing some of the cluster assignments ahead of time.
From the Matlab documentation, I can see that Matlab allows initial values to be set. Are there any python libraries, especially scikit-learn approaches that would allow this?
The standard GMM does not work in a semi-supervised fashion. The initial values you mentioned are likely the initial values for the mean vectors and covariance matrices of the Gaussians, which will then be updated by the EM algorithm.
A simple hack is to group your labeled data by label, estimate a mean vector and covariance matrix for each group, and pass these as the initial values to your MATLAB function (in scikit-learn, GaussianMixture accepts means_init, precisions_init and weights_init for the same purpose). Hopefully this will position your Gaussians at the "correct locations". The EM algorithm will then take it from there to adjust these parameters.
The downside of this hack is that it does not guarantee that your true label assignments are respected: even if a data point was given a particular cluster label, it might be re-assigned to another cluster. Noise in your feature vectors or labels could also cause your initial Gaussians to cover a much larger region than they are supposed to, wreaking havoc on the EM algorithm. Finally, if you do not have sufficient data points for a particular cluster, your estimated covariance matrices might be singular, breaking this trick altogether.
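In recent scikit-learn versions, GaussianMixture exposes means_init and precisions_init (the inverse covariances), so the initialisation hack can be sketched there as well. The data below is synthetic and the labelled subset is an assumption for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic "pixels": two well-separated 2-D clusters.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(6.0, 1.0, (200, 2))])
y_true = np.array([0] * 200 + [1] * 200)

# Pretend only the first 20 points of each cluster are labelled.
X_labeled = np.vstack([X[:20], X[200:220]])
y_labeled = np.array([0] * 20 + [1] * 20)

# Per-label mean and inverse covariance from the labelled points.
means_init = np.array([X_labeled[y_labeled == c].mean(axis=0) for c in (0, 1)])
precisions_init = np.array(
    [np.linalg.inv(np.cov(X_labeled[y_labeled == c], rowvar=False)) for c in (0, 1)]
)

gmm = GaussianMixture(
    n_components=2,
    covariance_type="full",
    means_init=means_init,
    precisions_init=precisions_init,
    random_state=0,
)
gmm.fit(X)  # EM refines the initial parameters using all points
pred = gmm.predict(X)
accuracy = (pred == y_true).mean()
```

As described above, this only seeds EM; nothing stops a labelled point from ending up in a different component after fitting.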
Unless you must use a GMM to cluster your data (e.g. you know for sure that Gaussians model your data well), you could instead try the semi-supervised methods in scikit-learn. These propagate the labels to your other data points based on feature similarity. However, I doubt this can handle a large dataset, as it requires the graph Laplacian matrix to be built from pairs of samples, unless there is some special implementation trick to handle this in scikit-learn.
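A minimal sketch of the label-propagation route, assuming scikit-learn's LabelSpreading with an RBF kernel. The data is synthetic and gamma=0.5 is an arbitrary choice for this scale; unlabelled points are marked with -1:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Same kind of synthetic two-cluster data as a stand-in for pixel features.
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(6.0, 1.0, (200, 2))])
y_true = np.array([0] * 200 + [1] * 200)

# Only 10 points per class are labelled; -1 means "unknown".
y_partial = np.full(400, -1)
y_partial[:10] = 0
y_partial[200:210] = 1

model = LabelSpreading(kernel="rbf", gamma=0.5)
model.fit(X, y_partial)

# transduction_ holds the label inferred for every training point.
pred = model.transduction_
accuracy = (pred == y_true).mean()
```

The RBF kernel builds a dense n_samples x n_samples affinity matrix, which is the scaling concern mentioned above; for whole images you would likely need to subsample pixels first.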

scikits.learn clusterization methods for curve fitting parameters

I would like some suggestions on the best clustering technique to use with python and scikits.learn. Our data comes from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a set of curve parameters by fitting a sigmoid function.
We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters for each sample. The number of samples can vary, but it is usually several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods or on the best way to assess the clustering quality.
Here are a couple of diagrams that may help:
[Figure: scatterplot of the input parameters (some of them are quite correlated); the color of each sample indicates its assigned cluster.]
[Figure: the sigmoid curves from which the input parameters were extracted, colored by assigned cluster.]
EDIT
Below are some elbow plots and the silhouette score for each number of clusters.
Have you noticed the striped pattern in your plots?
This indicates that you didn't normalize your data well enough.
"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
You absolutely must:
- perform careful preprocessing
- check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
- reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give. It just optimizes some number. It's up to you to check that the results are useful, and analyze what their semantic meaning is - and it might well be that it just is mathematically a local optimum, but meaningless for your task.
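As a sketch of the preprocessing point, assuming scikit-learn and invented curve parameters (one of them on a much larger scale, similar to "Area" in the plots), standardizing each column before k-means keeps a single attribute from dominating the Euclidean distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical curve parameters: the first column is on a far larger scale.
params = np.column_stack([
    rng.normal(5000.0, 1500.0, 500),  # e.g. area
    rng.normal(1.0, 0.3, 500),        # e.g. slope
    rng.normal(20.0, 5.0, 500),       # e.g. lag time
])

# Zero mean, unit variance per column.
scaled = StandardScaler().fit_transform(params)

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(scaled)
```

Standardizing does not remove the correlation between attributes such as area and height; dropping one of them or decorrelating them is a separate decision that depends on what "similar curves" should mean for the experiment.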
For 5000 samples, all methods should work without problem.
There is a pretty good overview here.
One thing to consider is whether you want to fix the number of clusters or not.
See the table for possible choices of the clustering algorithm depending on that.
I think spectral clustering is a pretty good method. You can use it for example together with the RBF kernel. You have to adjust gamma, though, and possibly restrict connectivity.
Choices that don't need n_clusters are Ward and DBSCAN, also solid choices.
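A minimal DBSCAN sketch, assuming scikit-learn and synthetic blobs. The eps and min_samples values are tuned to this toy data and would need adjusting for real, standardized curve parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three compact, well-separated synthetic groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=0.5,
    random_state=0,
)

# eps: neighbourhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# Label -1 marks noise points, so exclude it when counting clusters.
n_clusters = len(set(labels) - {-1})
print(n_clusters)
```

The number of clusters falls out of the density parameters rather than being fixed up front, which is the trade-off against k-means when a fixed number of ranks is required.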
You can also consult this chart of my personal opinion which I can't find the link to in the scikit-learn docs...
For judging the result: If you have no ground truth of any kind (which I imagine you don't have if this is exploratory) there is no good measure [yet] (in scikit-learn).
There is one unsupervised measure, silhouette score, but afaik that favours very compact clusters as found by k-means.
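For completeness, the silhouette score can be computed per candidate number of clusters roughly like this. The sketch assumes scikit-learn and uses synthetic blobs, so the clean result here says nothing about real data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

silhouettes = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette over all samples; ranges from -1 to 1.
    silhouettes[k] = silhouette_score(X, labels)

best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes, best_k)
```

Given the caveat above about the score favouring compact clusters, it is probably best read alongside a visual inspection rather than on its own.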
There are stability measures for clusters which might help, though they are not implemented in sklearn yet.
My best bet would be to find a good way to inspect the data and visualize the clustering.
Have you tried PCA and thought about manifold learning techniques?
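A small PCA sketch for the visualization idea, assuming scikit-learn and five made-up parameters, two of which are strongly correlated. The 2-D embedding can then be scatter-plotted and colored by cluster label:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))

# Five hypothetical curve parameters; the second is nearly a multiple of the first.
params = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    rng.normal(size=500),
    rng.normal(size=500),
])

# Standardize first so no parameter dominates because of its units.
scaled = StandardScaler().fit_transform(params)

pca = PCA(n_components=2)
embedding = pca.fit_transform(scaled)    # shape (n_samples, 2), ready to plot
explained = pca.explained_variance_ratio_
print(explained)
```

The explained variance ratio gives a rough idea of how much structure the 2-D picture keeps; if it is low, a manifold learning method may show the clusters better.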
