My dataset has over 200 variables and I am running a classification model on it, which is leading to overfitting. What is suggested for reducing the number of features? I started with feature importance, but due to the large number of variables I am unable to visualise it. Is there a way I can plot or display these values with respect to each variable?
Below is the code that I am trying:
from sklearn.ensemble import ExtraTreesClassifier

# Fit an extra-trees model and inspect its impurity-based feature importances.
F_Select = ExtraTreesClassifier(n_estimators=50)
F_Select.fit(X_train, y_train)
print(F_Select.feature_importances_)
You could try plotting the feature importances from largest to smallest and seeing which features capture a certain amount (say 95%) of the total importance, similar to a scree plot used in PCA. Ideally, this should be a small number of features:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(features, labels)

# Sort the importances from largest to smallest and compute their cumulative sum.
importances = np.sort(model.feature_importances_)[::-1]
cumsum = np.cumsum(importances)

# Bar chart of individual importances with the cumulative curve overlaid.
plt.bar(range(len(importances)), importances)
plt.plot(cumsum, color='red')
plt.show()
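If you then want a concrete cut-off rather than just the plot, a minimal follow-up sketch (reusing the sorted importances and cumsum arrays from above) could count how many of the top features are needed to reach 95%:

# Number of sorted features whose importances sum to at least 95% of the total.
n_features_95 = np.searchsorted(cumsum, 0.95) + 1
print(f"{n_features_95} features capture 95% of the total importance")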
This question is specific to the XGBClassifier API when using the "gblinear" booster.
As mentioned here, and as the xgboost docs say here, the .coef_ property returns an array of shape [n_classes, n_features].
Using this array, how can I order the features by importance?
The short answer is no: although the base learner is a linear model, the magnitude of the coefficients will not tell you how important the features are, even more so when the features are not scaled. You can look at it this way: the magnitude of a coefficient depends on the scale / variation of its predictor, but it does not tell you how useful that predictor is for predicting the correct value. You can check this post for more details on how the base learner works.
If you are using xgboost through its scikit-learn interface, there is a help page on plotting the importance of the variables, and you can work with that.
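That said, if you just want a mechanical ordering of the features by coefficient magnitude, here is a minimal sketch. It assumes a fitted classifier called model and a list feature_names (both hypothetical names), and keep in mind the caveat above that magnitude is not a reliable importance measure for unscaled features:

import numpy as np

coefs = np.atleast_2d(model.coef_)       # ensure shape [n_classes, n_features]
magnitude = np.abs(coefs).max(axis=0)    # largest absolute coefficient per feature
order = np.argsort(magnitude)[::-1]      # feature indices from largest to smallest

for idx in order:
    print(feature_names[idx], magnitude[idx])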
I am working on a dataset in which almost every feature has missing values. I want to impute the missing values with the KNN method. But since KNN works on distance metrics, it is advised to normalize the dataset before using it. I am using the scikit-learn library for this.
But how can I perform normalization with missing values?
For distance-based algorithms like KNN, we measure the distances between pairs of samples, and these distances are influenced by the measurement units of the features.
For example, let's say we are applying KNN on a data set with 3 features:
1st feature: range from 1 to 100
2nd feature: range from 1 to 200
3rd feature: range from 1 to 10,000
The distances will then be dominated by the 3rd feature, since differences in the 1st and 2nd features are small compared to those in the 3rd. To avoid this, we need to have normalization in place.
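As for normalizing data that still contains missing values: recent scikit-learn versions let the scalers ignore NaNs when fitting (the statistics are computed on the observed values and NaNs are passed through unchanged), so you can scale first and then impute. A minimal sketch, assuming a small toy array X and a reasonably recent scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

X = np.array([[1.0, 150.0, np.nan],
              [2.0, np.nan, 8000.0],
              [np.nan, 90.0, 5000.0],
              [4.0, 120.0, 9500.0]])

# Scale each feature to [0, 1]; NaNs are ignored during fit and kept in transform.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Impute the NaNs using nearest neighbours computed on the scaled features.
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_scaled)

# Map back to the original units if needed.
print(scaler.inverse_transform(X_imputed))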
I used Scikit Learn to implement a Support Vector Machine. Since I am dealing with class imbalance (96% to 4%), I would like the SVM to draw an equal number of samples from each class during training. How can I achieve this with Scikit Learn?
You might be interested in the imbalanced-learn package, which provides a number of oversampling and undersampling implementations to tackle the class imbalance problem.
An alternative approach is to adjust the class weights with the class_weight='balanced' argument; from the SVC docs (similar argument exists for other SVM models, too):
class_weight : {dict, 'balanced'}, optional
    Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
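A minimal sketch of both options (the X_train / y_train names are assumed, and the resampling route needs the imbalanced-learn package installed):

from sklearn.svm import SVC

# Option 1: keep all the data but reweight the classes instead of resampling.
clf = SVC(class_weight='balanced')
clf.fit(X_train, y_train)

# Option 2: undersample the majority class so both classes contribute
# equally many samples, then train on the resampled data.
from imblearn.under_sampling import RandomUnderSampler

X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
clf = SVC()
clf.fit(X_res, y_res)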
I have searched a lot, but I cannot find anything that shows how to build a network for a continuous (tabular) dataset such as breastCancer; all the documents are about image or text classification.
Can you please help me construct the neural network?
CNNs are useful for datasets where the features have strong temporal or spatial correlation. For instance, in the case of images, the value of a pixel is highly correlated to the neighboring pixels. If you randomly permute the pixels, then this correlation goes away, and convolution no longer makes sense.
For the breast cancer dataset, you have only 10 attributes which are not spatially correlated in this way. Unlike the previous image example, you can randomly permute these 10 features and no information is lost. Therefore, CNNs are not directly useful for this problem domain.
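A plain fully connected network is a more natural fit for tabular data. Below is a minimal sketch using scikit-learn's MLPClassifier on its built-in breast cancer dataset (note this built-in version has 30 features rather than the 10 attributes mentioned above, but the idea is the same):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then train a small two-layer feed-forward network.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))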
I have 2 arrays, one with sizes and one with prices. How can I train a model, or use a cost function (I'm a beginner, yeah), so I can predict the price for an arbitrary size?
Maybe I'm confusing the terms, but I hope someone can understand. Thanks.
You must use a regressor and fit it to your data. Once fitted, you can use this regressor to predict unseen samples.
Here is a link that shows all the regressors available on sklearn.
Amongst the regressors you could use, I can cite: OLS, Ridge, k-NN, decision trees, random forests, ...
The documentation is very clear so you won't find (a priori) any difficulty.
NB:
A training dataset with 14 elements is clearly not sufficient.
Try to find more samples to add to your dataset.
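As a concrete starting point, here is a minimal sketch (the sizes and prices values below are made-up placeholders; substitute your own arrays):

import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([50, 60, 72, 80, 95], dtype=float)        # placeholder data
prices = np.array([150, 175, 205, 230, 280], dtype=float)  # placeholder data

# scikit-learn expects a 2-D feature matrix, hence the reshape.
reg = LinearRegression()
reg.fit(sizes.reshape(-1, 1), prices)

# Predict the price for a new, unseen size.
print(reg.predict([[88.0]]))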