I'm a beginner with statsmodels, and I'm also open to other Python-based methods of solving my problem:
I have a data set with ~85 features, some of which are highly correlated.
When I run the OLS method I get a helpful 'strong multicollinearity problems' warning as I might expect.
I've previously run this data through Weka, which as part of the regression classifier has an eliminateColinearAttributes option.
How can I do the same thing - get the model to choose which attributes to use instead of having them all in the model?
Thanks!
Note that scipy.stats.linregress only fits simple regression with a single predictor, so on its own it won't handle your 85 features; for multiple regression, stay with statsmodels' OLS or use numpy.linalg.lstsq.
The eliminateColinearAttributes option in the software you've mentioned is just one algorithm implemented in that software to fight the problem. Here, you need to implement an iterative algorithm yourself: eliminate one of the highly correlated variables (say, the one with the highest p-value), run the regression again, and repeat until the multicollinearity is gone.
There's no one and only way here; there are different techniques. It is also good practice to look at each cluster of variables that are highly correlated with each other and choose manually which to omit, so that the choice also makes substantive sense. A common automated variant eliminates on the variance inflation factor (VIF) rather than p-values; see the sketch below.
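A minimal sketch of such a loop, assuming your predictors are in a pandas DataFrame X (all names here are placeholders, not part of the original question):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    def drop_high_vif(X, threshold=10.0):
        """Iteratively drop the predictor with the highest variance inflation
        factor (VIF) until every remaining VIF is below the threshold."""
        # note: add a constant column to X first if you want centred VIFs
        X = X.copy()
        while X.shape[1] > 1:
            vifs = pd.Series(
                [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
                index=X.columns,
            )
            worst = vifs.idxmax()
            if vifs[worst] < threshold:
                break
            X = X.drop(columns=worst)
        return X

    # X_reduced = drop_high_vif(X)                         # X: your ~85 features
    # results = sm.OLS(y, sm.add_constant(X_reduced)).fit()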
Problem statement: I'm working with a linear system of equations that correspond to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know if using sklearn in this context is using the wrong tool.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I have run it over impulse responses to create training data, and then supplied it to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the equation for ridge regression myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression using the equations myself, generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the equation for ridge regression, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in Python using the equations; sklearn does what it says in the documentation.
You need fit_intercept=True for sklearn.linear_model.Ridge to fit the Y intercept of your problem; with fit_intercept=False the intercept is assumed to be zero. (Note that fit_intercept=True is in fact the documented default for Ridge.)
If you fit with fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will of course get a bad solution.
This might lead a novice like me to the impression that you haven't supplied enough training data, which is incorrect.
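For anyone who was as confused as I was, here is a small sketch of the equivalence on synthetic data (everything here is made up for illustration); fit_intercept=False is used precisely because the bare ridge equation has no intercept term:

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 5))            # forward operator / design matrix
    y = A @ rng.normal(size=5) + 0.01 * rng.normal(size=100)

    alpha = 1.0
    # Closed-form Tikhonov/ridge solution: x = (A'A + alpha*I)^(-1) A'y
    x_closed = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

    # With fit_intercept=False, sklearn minimises the same objective
    model = Ridge(alpha=alpha, fit_intercept=False).fit(A, y)
    print(np.allclose(x_closed, model.coef_))  # True, up to solver tolerance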
For a project I am working on, I need to find a model for the data graphed below that includes a sine or cosine component (hard to tell from the image but the data does follow a trig-like function for each period, although the amplitude/max/mins are changing).
[figure: plot of the data]
I originally planned on finding a simple regression model for my data using Desmos before I saw how complex the data was, but alas, I do not think I am capable of determining what equation to use without the help of Python. I don't have much experience with regression in Python; I've only done basic linear modeling where I knew the type of equation and was just determining the coefficients/constants. Could anyone offer a guiding example, git code, or resources that would be useful for this?
Your question is pretty generic, and looking at the graph we cannot tell much about the data to give you a more detailed answer, but I'd say have a look at OLS:
https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html
You could also look at scikit learn for the various regression models it provides.
http://scikit-learn.org/stable/modules/linear_model.html
Essentially, these packages will help you figure out the equation you are looking for to describe your data.
Also, it looks like your graph has an outlier? Please note that regression is very sensitive to outliers, so you may want to handle those data points before fitting the model.
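If the data really is trig-like with a changing amplitude, you could also fit an explicit model with scipy.optimize.curve_fit. A minimal sketch; the amplitude envelope, frequencies, and starting guesses below are assumptions you would adapt to your own data:

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical model: a sinusoid whose amplitude drifts linearly in x
    def model(x, a0, a1, freq, phase, offset):
        return (a0 + a1 * x) * np.sin(2 * np.pi * freq * x + phase) + offset

    # Stand-in data; replace with your own x and y arrays
    x = np.linspace(0, 10, 200)
    y = (1.0 + 0.2 * x) * np.sin(2 * np.pi * 0.5 * x + 0.3)

    # Nonlinear fits need sensible starting guesses, especially for frequency
    p0 = [1.0, 0.0, 0.5, 0.0, 0.0]
    params, _ = curve_fit(model, x, y, p0=p0)
    print(params)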
My groupmates and I were doing an assignment that involves running a regression on the Fama-French 3-factor model. I used the Python Statsmodels module and they used Stata, and we share the same set of data. For ordinary least squares regression we got the same answers, but the robust regression results for some reason don't agree.
Here is the result from Stata: [screenshot of Stata output]
Here is the result from Statsmodels: [screenshot of statsmodels output]
Just wondering what could be the cause of this issue? Any way to resolve it? I also tried different methods (HuberT, RamsayE, etc.) in Statsmodels and none of them gave the same answers as the result from Stata. Any help is appreciated.
The equivalent of Stata's
regress ..., robust
in statsmodels is
OLS(...).fit(cov_type='HC1')
The options for the robust sandwich covariance matrices are here http://www.statsmodels.org/devel/generated/statsmodels.regression.linear_model.RegressionResults.get_robustcov_results.html, but the use is through the fit keywords.
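A minimal sketch of that call (the data and column names here are made up, standing in for the Fama-French factors):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["mkt", "smb", "hml"])
    df["ret"] = 0.5 * df["mkt"] + 0.2 * df["smb"] + rng.normal(size=100)

    # cov_type='HC1' gives the small-sample-corrected sandwich standard
    # errors that Stata's `regress ..., robust` reports
    res = smf.ols("ret ~ mkt + smb + hml", data=df).fit(cov_type="HC1")
    print(res.summary())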
There is an incomplete FAQ answer for differences in robust standard errors between Stata and statsmodels. https://github.com/statsmodels/statsmodels/issues/1923
statsmodels.robust and RLM refer to outlier-robust estimation. This is an M-estimator, and the covariance has the original Huber sandwich form.
Here is the main page for statsmodels.robust
http://www.statsmodels.org/devel/rlm.html
and the documentation for RLM
http://www.statsmodels.org/devel/generated/statsmodels.robust.robust_linear_model.RLM.html
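To make the contrast concrete, a small sketch of RLM on made-up data; this is a different estimator from OLS with a robust covariance, which is why its results will not match Stata's `regress ..., robust`:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(100, 2)))
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=100)

    # M-estimation with the Huber norm downweights outlying observations
    # instead of just adjusting the covariance of the OLS estimates
    res = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
    print(res.params)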
Is there a way to hand an x,y pair dataset to a function that will return a list of curve-fit models and their coefficients? The program DataFit does this with about 200 different models, from exponential to inverse polynomial etc., but we are looking for a Pythonic way.
I have seen many posts on manually specifying each model in scipy, but this is not feasible for the number of models we want to test.
The closest I found was pyeq2, but it does not return the list of functions, and it seems to be a rabbit hole to code for.
If R has this available, we could use that, but Python is really the goal.
Below is an example of the data; we want to find the best way to describe this curve. [figure: plot of the sample data]
You can try the splines library in R. I have used it for higher-order curve fitting of univariate data. You could adapt it to try a similar thing and compare the corresponding R^2 errors.
You can decide to do one of the following:
Choose a model and fit its parameters. The model should be based on a single independent variable. This can be done with scipy.optimize's curve_fit function; you could choose something like a hyperbola. (A sketch of looping curve_fit over several candidate models follows this list.)
Choose a model that is complex and likely represents an underlying mechanism at work, like the system of ODEs from an SIR disease model. Fitting the parameters will be no easy task; it is typically done with Markov chain Monte Carlo (MCMC) methods. This is VERY difficult.
Realise that you have data and can use machine learning via scikit-learn to predict from your data. This approach doesn't require a parametric model.
Machine learning and neural networks don't fit an equation and can't really tell you about the underlying mechanism, but they can make predictions just as a best-fit model would... dare I say even better.
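Since no standard Python library ships DataFit's catalogue of ~200 models, the usual workaround for option 1 is to loop curve_fit over your own dictionary of candidates and rank them by error. A minimal sketch; the model list and names are just examples:

    import numpy as np
    from scipy.optimize import curve_fit

    # A small candidate library; tools like DataFit simply carry a longer list
    MODELS = {
        "linear":      (lambda x, a, b: a * x + b,          [1.0, 0.0]),
        "exponential": (lambda x, a, b: a * np.exp(b * x),  [1.0, 0.1]),
        "hyperbola":   (lambda x, a, b: a / (x + b),        [1.0, 1.0]),
        "power":       (lambda x, a, b: a * np.power(x, b), [1.0, 1.0]),
    }

    def rank_models(x, y):
        """Fit every candidate; return (name, params, sse), best fit first."""
        results = []
        for name, (f, p0) in MODELS.items():
            try:
                params, _ = curve_fit(f, x, y, p0=p0, maxfev=10000)
                sse = float(np.sum((y - f(x, *params)) ** 2))
                results.append((name, params, sse))
            except (RuntimeError, ValueError):
                continue  # skip candidates that fail to converge
        return sorted(results, key=lambda r: r[2])

    # x = np.linspace(1, 10, 50)
    # print(rank_models(x, 3.0 / x)[0])   # ranks candidates on y = 3/x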
In the end, we found that Eureqa software was able to achieve this. https://www.nutonian.com/products/eureqa/
I'm wondering what the set_weights method of the Maxent class in NLTK is used for (or, more specifically, how to use it). As I understand it, it allows you to manually assign weights to certain features? Could somebody provide a basic example of the type of parameter that would be passed into it?
Thanks
Alex
It apparently allows you to set the coefficient matrix of the classifier. This may be useful if you have an external MaxEnt/logistic regression learning package from which you can export the coefficients. The train_maxent_classifier_with_gis and train_maxent_classifier_with_iis learning algorithms call this function.
If you don't know what a coefficient matrix is: it's the β mentioned in Wikipedia's treatment of MaxEnt.
(To be honest, it looks like NLTK is either leaking implementation details here, or has a very poorly documented API.)
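For what it's worth, here is a minimal sketch of the round trip on toy data; treat the details as assumptions and check them against your NLTK version:

    from nltk.classify import maxent

    # Toy training data: feature dicts paired with labels
    train = [
        ({"contains(good)": True}, "pos"),
        ({"contains(bad)": True}, "neg"),
    ]
    clf = maxent.MaxentClassifier.train(train, algorithm="iis", max_iter=5)

    # weights() returns the coefficient vector, one entry per encoded
    # feature/label pair; set_weights() replaces it wholesale, e.g. with
    # coefficients exported from an external logistic regression package
    w = clf.weights()
    clf.set_weights(w * 0.5)   # rescale, just to demonstrate the round trip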