I am working on a dataset in which almost every feature has missing values. I want to impute the missing values with the KNN method, but since KNN works on distance metrics, it is advised to normalize the dataset before using it. I am using the scikit-learn library for this.
But how can I perform normalization when the data contains missing values?
For distance-based algorithms like KNN, we measure the distances between pairs of samples, and these distances are influenced by the measurement units of the features.
For example, say we are applying KNN to a dataset with 3 features:
1st feature: range from 1 to 100
2nd feature: range from 1 to 200
3rd feature: range from 1 to 10000
The neighbors will then be chosen almost entirely based on the 3rd feature, since differences in the 1st and 2nd features are tiny compared to differences in the 3rd. To avoid this distortion, we need normalization in place.
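In scikit-learn this combination is straightforward, because the scalers (MinMaxScaler, StandardScaler, RobustScaler, ...) disregard NaNs when fitting and pass them through when transforming, and KNNImputer computes NaN-aware euclidean distances. A minimal sketch on a small made-up array (the values are purely illustrative):

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# tiny illustrative array; np.nan marks the missing values
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 8.0, 9.0],
              [np.nan, 4.0, 3.0]])

# the scaler ignores NaNs in fit and keeps them in transform
X_scaled = MinMaxScaler().fit_transform(X)

# KNNImputer accepts the remaining NaNs and fills them from the nearest rows
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X_scaled)
print(X_imputed)

Scale first, then impute: that way the distances KNNImputer computes are not dominated by the wide-range features.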
I'm experimenting with machine learning regressors and I was using the dataset train.csv from the following webpage: https://www.kaggle.com/c/rossmann-store-sales/data?select=train.csv
I was trying to train an SVR, but it was taking a long time to fit, so I suspect the problem is that I haven't normalized the data.
I know it is normal practice to normalize the columns, but I'm not sure which ones to apply it to. There are some binary variables and some continuous ones, and it feels odd to normalize the binary variables. Is that correct?
The table columns are the following: Store, DayOfWeek, Date, Sales, Customers, Open, Promo, StateHoliday, SchoolHoliday.
Open, Promo and SchoolHoliday are binary. StateHoliday can take values from 0 to 4.
The other ones are ints (except Date, obviously).
Store, DayOfWeek, Open, Promo, StateHoliday and SchoolHoliday are categorical features. They can be encoded as one-hot vectors using OneHotEncoder.
Sales and Customers are numerical features and can be scaled, for example with StandardScaler, RobustScaler, etc.
See the scikit-learn preprocessing documentation for additional transformations.
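A minimal sketch of how the two treatments can be combined in a single ColumnTransformer pipeline (the column names are taken from the question; treating Sales as the target is an assumption on my part):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVR

df = pd.read_csv("train.csv")  # the Kaggle file from the question

categorical = ["Store", "DayOfWeek", "Open", "Promo", "StateHoliday", "SchoolHoliday"]
numerical = ["Customers"]  # Sales is used as the target below

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])

model = Pipeline([("prep", preprocess), ("svr", SVR())])
model.fit(df[categorical + numerical], df["Sales"])

Note that kernel SVR scales poorly with the number of samples, so on a dataset of this size you will likely want to fit on a subsample even after scaling.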
This question is specific to the XGBClassifier API with the "gblinear" booster.
As mentioned in the xgboost docs, the .coef_ property returns an array of shape [n_classes, n_features].
Using this array, how can I order the features by importance?
The short answer is no: although the base learner is a linear model, the magnitude of the coefficients does not indicate how important the features are, even more so when the features are not scaled. Think of it this way: the magnitude of a coefficient depends on the scale and variation of its predictor, and does not tell you how useful the feature is for predicting the correct value. You can check this post for more details on how the base learner works.
If you are already using scikit-learn with xgboost underneath, there is a help page on plotting variable importances, and you can work with that.
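That said, if you still want an ordering from .coef_, you can rank features by the absolute value of their coefficient, with the caveat above that this is only a rough proxy and is more defensible when the features were standardized first. A sketch for the binary case, assuming a fitted clf (an XGBClassifier with booster="gblinear") and a feature_names list, both hypothetical:

import numpy as np

coefs = np.asarray(clf.coef_).ravel()    # binary case: one coefficient per feature
order = np.argsort(np.abs(coefs))[::-1]  # indices sorted by |coefficient|, descending

for i in order:
    print(feature_names[i], coefs[i])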
My dataset has over 200 variables and I am running a classification model on it, which is leading to overfitting. What is suggested for reducing the number of features? I started with feature importances; however, due to the large number of variables, I am unable to visualise them. Is there a way I can plot or present these values per variable?
Below is the code I am trying:
from sklearn.ensemble import ExtraTreesClassifier

F_Select = ExtraTreesClassifier(n_estimators=50)
F_Select.fit(X_train, y_train)
print(F_Select.feature_importances_)
You could try plotting the feature importances from largest to smallest and seeing how many features capture a certain amount (say 95%) of the total importance, like a scree plot used in PCA. Ideally, this should be a small number of features:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(features, labels)  # your training data

# importances sorted from largest to smallest
importances = np.sort(model.feature_importances_)[::-1]
cumsum = np.cumsum(importances)

plt.bar(range(len(importances)), importances)
plt.plot(cumsum)  # cumulative importance; look for where it crosses 0.95
plt.show()
I have searched a lot, but I cannot find anything showing how to build a network for a continuous (tabular) dataset such as breast cancer; all the documents are about image or text classification.
Can you please help me construct a neural network for this kind of data?
CNNs are useful for datasets where the features have strong temporal or spatial correlation. For instance, in the case of images, the value of a pixel is highly correlated to the neighboring pixels. If you randomly permute the pixels, then this correlation goes away, and convolution no longer makes sense.
For the breast cancer dataset, you have only 10 attributes, which are not spatially correlated in this way. Unlike the image example, you can randomly permute these 10 features and no information is lost. Therefore, CNNs are not directly useful for this problem domain.
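A plain feed-forward network is the more natural fit for tabular data like this. A minimal sketch using scikit-learn's MLPClassifier on its built-in diagnostic breast cancer dataset (which has 30 features rather than the 10 mentioned above); the hidden layer size is just a placeholder:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scaling the inputs matters for gradient-based training
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0))
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))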
I have 2 arrays, one with sizes and one with prices. How can I train a model, or use a cost function (I'm a beginner, yeah), so I can predict the price for an arbitrary size?
Maybe I'm confusing the terms, but I hope someone can understand. Thanks.
You must use a regressor and fit it to your data. Once fitted, you can use this regressor to predict unseen samples.
Here is a link that shows all the regressors available in sklearn.
Amongst the regressors you could use, I can cite: OLS, Ridge, K-NN, decision trees, random forests...
The documentation is very clear, so you should not (a priori) have any difficulty.
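For instance, a minimal sketch with ordinary least squares (the sizes and prices below are made up for illustration; substitute your two arrays):

import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([50, 60, 70, 80, 90, 100]).reshape(-1, 1)  # one feature per sample
prices = np.array([150, 180, 210, 235, 270, 300])

model = LinearRegression()
model.fit(sizes, prices)

# predict the price for an unseen size
print(model.predict([[75]]))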
NB :
A training dataset with 14 elements is clearly not sufficient.
Try to find more samples to add to your dataset.