Do we have to scale the polynomial features when creating a polynomial regression?
This question is already answered here and the answer is no. But when creating a model with scikit learn, I do observe a huge difference.
I also found this article about the Importance of Feature Scaling in Data Modeling, and its polynomial-features example shows that scaling does have an impact.
What did I miss?
Maybe it is because of numerical issues: polynomials are prone to ill-conditioning. I have run into such problems (outside of machine learning) with not-so-complex polynomial models, and the first remedy was to scale the feature values.
Symbolically the result is the same, but numerically it can be very different in some instances.
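To make this concrete, here is a minimal sketch (synthetic data and an arbitrary degree, chosen only for illustration) comparing the conditioning of the polynomial design matrix with and without scaling:

```python
# Minimal sketch: same polynomial model, very different numerical conditioning.
# The data and the degree are arbitrary choices for illustration.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=(200, 1))   # raw feature with a large numeric range

poly = PolynomialFeatures(degree=5)
X_raw = poly.fit_transform(x)                                      # unscaled features
X_scaled = poly.fit_transform(StandardScaler().fit_transform(x))   # scaled first

# The condition number of the design matrix explodes without scaling,
# which is what makes the least-squares solve numerically fragile.
print(f"unscaled: {np.linalg.cond(X_raw):.3e}")
print(f"scaled:   {np.linalg.cond(X_scaled):.3e}")
```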
I have implemented K-Means clustering on a dataset whose dimensionality I reduced to 2 features with PCA.
Now I am wondering how to interpret this analysis, since there is no reference for which variables end up on the axes. Given that doubt, I am also wondering whether it is good practice to run K-Means on a dataset reduced with PCA.
How can I interpret this kind of clustering?
Thank you!
It is hard to give an answer that addresses your question directly, since it is not specific enough and I do not know your data or the objective of your research. So let me answer it from a general perspective, in case that helps.
First of all, PCA strictly decreases the interpretability of the analysis, because it reduces the dimensions based on linear relations between the variables and you can no longer name the reduced components. In addition, check the correlation scores among the variables before PCA to get an intuition of how successful PCA will be, and check the variance explained by PCA. The lower the explained variance ratio, the greater the information loss, and the more it may mislead your interpretations.
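As a small illustration of those two checks (the DataFrame below is a synthetic stand-in for your three features, not your actual dataset):

```python
# Sketch: correlation check before PCA, and explained variance after it.
# The DataFrame is a synthetic stand-in for the real three-feature data.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({
    "x1": a,
    "x2": a + 0.1 * rng.normal(size=300),   # strongly correlated with x1
    "x3": rng.normal(size=300),             # independent
})

print(df.corr())                             # intuition for how well PCA will compress

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(df))
print(pca.explained_variance_ratio_.sum())   # information kept by 2 components
```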
If your objective is to analyse the data and make inferences, I would suggest not reducing the dimensionality at all. You only have 3 dimensions: you can apply K-Means without PCA and plot the clusters in 3D, and Matplotlib and Plotly provide interactive features for this (see the sketch at the end of this answer).
However, if your objective is to build a machine learning model, then reducing the dimensionality makes sense if the features are highly correlated. That would do your model a big favor.
Finally, applying K-Means after PCA is not something you must avoid, but it does make interpretation harder.
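Here is a minimal sketch of the 3D approach suggested above (random data stands in for your three features; k=3 is just a placeholder):

```python
# Sketch: K-Means on the raw three features, plotted in 3D with matplotlib.
# The data and the number of clusters are placeholders.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))                      # stand-in for your 3 features

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=labels)
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.set_zlabel("feature 3")
plt.show()
```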
Problem statement: I'm working with a linear system of equations that correspond to an inverse problem that is ill-posed. I can apply Tikhonov regularization or ridge regression by hand in Python, and get solutions on test data that are sufficiently accurate for my problem. I'd like to try solving this problem using sklearn.linear_model.Ridge, because I'd like to try other machine-learning methods in the linear models part of that package (https://scikit-learn.org/stable/modules/linear_model.html). I'd like to know if using sklearn in this context is using the wrong tool.
What I've done: I read the documentation for sklearn.linear_model.Ridge. Since I know the linear transformation corresponding to the forward problem, I ran it over impulse responses to create training data, and then supplied that data to sklearn.linear_model.Ridge to generate a model. Unlike when I apply the ridge regression equation myself in Python, the model from sklearn.linear_model.Ridge only works on impulse responses. On the other hand, applying ridge regression using the equations myself generates a model that can be applied to any linear combination of the impulse responses.
Is there a way to apply the linear methods of sklearn, without needing to generate a large test data set that represents the entire parameter space of the problem, or is this requisite for using (even linear) machine learning algorithms?
Should sklearn.linear_model.Ridge return the same results as solving the equation for ridge regression, when the sklearn method is applied to test cases that span the forward problem?
Many thanks to anyone who can help my understanding.
Found the answer through trial and error. Answering my own question in case anyone was thinking like I did and needs clarity.
Yes, if you use training data that spans the problem space, it is the same as running ridge regression in Python using the equations directly. sklearn does what it says in the documentation.
You need to use fit_intercept=True to get sklearn.linear_model.Ridge to fit the Y intercept of your problem, otherwise it is assumed to be zero.
If you set fit_intercept=False and your problem does NOT have a Y-intercept of zero, you will, of course, get a bad solution.
This might give a novice like me the impression that not enough training data was supplied, which is incorrect.
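For anyone checking this themselves, here is a small sketch (synthetic forward operator, arbitrary alpha) showing that sklearn.linear_model.Ridge with fit_intercept=False matches the closed-form Tikhonov/ridge solution:

```python
# Sketch: sklearn Ridge vs. the closed-form ridge/Tikhonov solution.
# A is a stand-in for the forward operator; alpha is arbitrary.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 20))                  # forward operator / design matrix
x_true = rng.normal(size=20)
b = A @ x_true + 0.01 * rng.normal(size=100)    # noisy observations

alpha = 1e-2
# Closed form: x = (A^T A + alpha I)^(-1) A^T b   (no intercept term)
x_closed = np.linalg.solve(A.T @ A + alpha * np.eye(20), A.T @ b)

model = Ridge(alpha=alpha, fit_intercept=False).fit(A, b)
print(np.allclose(model.coef_, x_closed, atol=1e-6))   # should print True
```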
For my university project I was given a dataset in which almost all features have a very weak correlation with the target (only 1 feature has a moderate correlation), and its distribution is not normal either. I already tried a simple linear regression model and it underfit; then I applied a plain random forest regressor but it overfit, and when I tuned the random forest with RandomizedSearchCV it took far too long. Is there any way to get a decent model out of a not-so-good dataset without underfitting or overfitting, or is it just not possible at all?
Well, to be blunt, if you could fit a model without underfitting or overfitting you would have solved AI completely.
Some suggestions, though:
Overfitting on random forests
Personally, I'd try to hack this route since you mention that your data is not strongly correlated. It's typically easier to fix overfitting than underfitting so that helps, too.
Try looking at your tree outputs. If you are using Python, scikit-learn's export_graphviz can be helpful.
Try reducing the maximum depth of the trees.
Try increasing the minimum number of samples a node must have in order to split (or, similarly, the minimum number of samples a leaf should have).
Try increasing the number of trees in the RF. (A sketch of these knobs in scikit-learn follows this list.)
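Here is what those knobs look like in scikit-learn (the dataset and the specific values are placeholders, not tuned recommendations):

```python
# Sketch of the overfitting-control knobs above (values are placeholders).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

rf = RandomForestRegressor(
    n_estimators=500,       # more trees -> more stable (lower-variance) averages
    max_depth=6,            # shallower trees -> less overfitting
    min_samples_split=10,   # a node needs at least 10 samples to be split
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    random_state=0,
).fit(X, y)

# Inspect one fitted tree; export_graphviz returns Graphviz .dot text.
dot_text = export_graphviz(rf.estimators_[0], max_depth=3)
print(dot_text[:200])
```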
Underfitting on linear regression
Add more parameters. If you have variables a, b, etc., adding their polynomial features, i.e. a^2, a^3, ..., b^2, b^3, etc., may help. If you add enough polynomial features you should be able to overfit -- although being able to drive down the training RMSE that way doesn't necessarily mean the model will do well on unseen data; see the sketch after this list.
Try plotting some of the variables against the value to predict (y). Perhaps you will be able to see a non-linear pattern (e.g. a logarithmic relationship).
Do you know anything about the data? Perhaps the product of two variables, or the ratio between two variables, may be a good indicator.
If you are regularizing your regression (or if the software applies regularization automatically), try reducing the regularization parameter.
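As a sketch of the polynomial-feature idea (synthetic data; the degree and the mild regularization are arbitrary illustration choices):

```python
# Sketch: expand the inputs with polynomial/interaction terms to fight
# underfitting; degree and alpha are arbitrary illustration values.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_regression(n_samples=400, n_features=5, noise=20.0, random_state=0)

model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=3, include_bias=False),   # adds a^2, a*b, b^3, ...
    Ridge(alpha=1.0),                                    # keep some regularization
)
scores = cross_val_score(model, X, y, scoring="neg_root_mean_squared_error")
print(-scores.mean())                                    # cross-validated RMSE
```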
I'm studying machine learning from this course, which uses scikit-learn for regression - https://www.udemy.com/machinelearning/
I can see that for some regression training algorithms the author uses feature scaling and for some he doesn't, because some scikit-learn regression algorithms take care of feature scaling by themselves.
How do I know which training algorithms need feature scaling and which don't?
Strictly speaking, no machine learning technique needs feature scaling; for some algorithms, though, scaled inputs make the optimization easier for the computer, which results in faster training times.
Typically, algorithms that leverage distance or assume normality benefit from feature scaling. https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
It depends on the algorithm you are using and your dataset.
Support Vector Machines (SVMs) converge faster if you scale your features. The main advantage of scaling is to avoid attributes with greater numeric ranges dominating those with smaller numeric ranges.
In K-Means clustering, you compute Euclidean distances to group data points together, so there is good reason to scale your features so that the centroids are not overly affected by large or abnormal values.
In the case of ordinary linear regression, scaling your features will not be of much help, since the coefficients fitted on the scaled dataset are just rescaled versions of those fitted on the original dataset.
Decision trees don't usually require feature scaling, since their splits are thresholds over individual features.
For models that involve learning rates and are trained with gradient descent, the input scale does affect the gradients, so feature scaling should be considered in this case.
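A minimal sketch of that distinction (the models, dataset, and hyperparameters are illustrative only): scale-sensitive estimators go behind a StandardScaler in a pipeline, while tree-based ones can be left unscaled.

```python
# Sketch: scale-sensitive model inside a scaling pipeline vs. an unscaled
# tree ensemble. Dataset and hyperparameters are placeholders.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

svm = make_pipeline(StandardScaler(), SVR())     # distance-based: benefits from scaling
forest = RandomForestRegressor(random_state=0)   # split-based: scaling not required

print("SVR   :", cross_val_score(svm, X, y).mean())
print("forest:", cross_val_score(forest, X, y).mean())
```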
A very simple answer: some algorithms (or their implementations) scale the features for you, and some do not. So, if the algorithm does not, you need to scale the features manually.
You can google which algorithms handle feature scaling themselves, but it is safer to scale the features manually. Always make sure the features are scaled where it matters; otherwise the algorithm's output can end up far from ideal.
I have a dataset of images that I would like to run nonlinear dimensionality reduction on. To decide what number of output dimensions to use, I need to be able to find the retained variance (or explained variance; I believe they are similar). Scikit-learn seems to have by far the best selection of manifold learning algorithms, but I can't see any way of getting a retained variance statistic. Is there a part of the scikit-learn API that I'm missing, or a simple way to calculate the retained variance?
I don't think there is a clean way to derive the "explained variance" of most non-linear dimensionality reduction techniques in the same way as it is done for PCA.
For PCA, it is trivial: you simply take the weight of each principal component in the eigendecomposition (i.e. its eigenvalue) and sum the weights of the ones you keep for the linear dimensionality reduction.
Of course, if you keep all the eigenvectors, then you will have "explained" 100% of the variance (i.e. perfectly reconstructed the covariance matrix).
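In code, that amounts to normalising the eigenvalues of the covariance matrix (toy correlated data below), which is exactly what PCA's explained_variance_ratio_ reports:

```python
# Sketch: explained-variance ratios computed directly from the eigenvalues
# of the covariance matrix (toy correlated data for illustration).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

cov = np.cov(X, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, largest first
ratios = eigvals / eigvals.sum()
print(ratios)                 # per-component explained variance ratio
print(ratios[:2].sum())       # variance "explained" by keeping the top 2 PCs
```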
Now, one could try to define a notion of explained variance in a similar fashion for other techniques, but it might not have the same meaning.
For instance, some dimensionality reduction methods might actively try to push apart more dissimilar points and end up with more variance than what we started with. Or much less if it chooses to cluster some points tightly together.
However, in many non-linear dimensionality reduction techniques, there are other measures that give notions of "goodness-of-fit".
For instance, in scikit-learn, Isomap has a reconstruction error, t-SNE can return its KL divergence, and MDS can return the reconstruction stress.
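For reference, here is roughly where those numbers live in scikit-learn (the digits data is just a convenient toy example to make the snippet self-contained):

```python
# Sketch: the goodness-of-fit measures mentioned above, on a small toy set.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]                                   # keep runtimes small

pca = PCA(n_components=2).fit(X)
print("PCA explained variance ratio :", pca.explained_variance_ratio_.sum())

iso = Isomap(n_components=2).fit(X)
print("Isomap reconstruction error  :", iso.reconstruction_error())

tsne = TSNE(n_components=2, random_state=0).fit(X)
print("t-SNE KL divergence          :", tsne.kl_divergence_)

mds = MDS(n_components=2, random_state=0).fit(X)
print("MDS stress                   :", mds.stress_)
```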