Error when imputing minimum values using SimpleImputer - python

I'm trying to use the minimum values of each column to replace missing values but keep getting an error. Below is my code:
from sklearn.impute import SimpleImputer
numeric_cols = [X_test.select_dtypes(exclude=['object']).columns]
numeric_df = X_test.select_dtypes(exclude=['object'])
for col in numeric_cols:
    my_imputer = SimpleImputer(strategy='constant', fill_value=X_test[col].min())
    imputed_numeric_X_test = pd.DataFrame(my_imputer.fit_transform(numeric_df))
    imputed_numeric_X_test.columns = numeric_df.columns
This is the error I get when I run it:
ValueError: 'fill_value'=MSSubClass 20.0
LotFrontage 21.0
LotArea 1470.0
OverallQual 1.0
OverallCond 1.0
YearBuilt 1879.0
YearRemodAdd 1950.0
MasVnrArea 0.0
BsmtFinSF1 0.0
BsmtFinSF2 0.0
BsmtUnfSF 0.0
TotalBsmtSF 0.0
1stFlrSF 407.0
2ndFlrSF 0.0
LowQualFinSF 0.0
GrLivArea 407.0
BsmtFullBath 0.0
BsmtHalfBath 0.0
FullBath 0.0
HalfBath 0.0
BedroomAbvGr 0.0
KitchenAbvGr 0.0
TotRmsAbvGrd 3.0
Fireplaces 0.0
GarageYrBlt 1895.0
GarageCars 0.0
GarageArea 0.0
WoodDeckSF 0.0
OpenPorchSF 0.0
EnclosedPorch 0.0
3SsnPorch 0.0
ScreenPorch 0.0
PoolArea 0.0
MiscVal 0.0
MoSold 1.0
YrSold 2006.0
dtype: float64 is invalid. Expected a numerical value when imputing numerical data
What is wrong and how can I fix it?

SimpleImputer only supports a single scalar for fill_value, not a per-column specification. Your loop also compounds the problem: numeric_cols is a list containing the whole column Index (note the extra brackets), so col is the entire Index, X_test[col] is the whole DataFrame, and .min() returns the Series of per-column minimums that the error message is printing. Per-column fill values were discussed in scikit-learn issue #19783, but passed on, and wouldn't support taking the columnwise minimum anyway. I can't find any discussion of adding a custom callable option for strategy, which would seem to be the clearest solution. So I think you're stuck doing it manually or with a custom transformer. To do it somewhat manually, you could use the ColumnTransformer approach described in the linked issue.
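For example, here is a minimal sketch of both routes, reusing numeric_df from the question (in practice you would compute the minimums on the training data rather than on X_test):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Simplest: pandas fills column-wise when given a Series of per-column minimums.
imputed_numeric_X_test = numeric_df.fillna(numeric_df.min())

# Or stay in sklearn: one SimpleImputer per column via ColumnTransformer.
ct = ColumnTransformer(
    [('min_' + col, SimpleImputer(strategy='constant', fill_value=numeric_df[col].min()), [col])
     for col in numeric_df.columns])
imputed_numeric_X_test = pd.DataFrame(ct.fit_transform(numeric_df),
                                      columns=numeric_df.columns)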

Related

How to get rid of urls while using TfidfVectorizer

I'm using TfidfVectorizer to extract features from my samples, which are all texts. However, my samples contain many URLs, and as a result http and https become important features. This also causes inaccurate predictions later with my Naive Bayes model.
The features I got are as follows. As you can see, https has high values.
good got great happy http https
0 0.18031992253877868 0.056537832999741425 0.0 0.13494772859235538 0.0 0.7206169458767526
1 0.062052081178508904 0.0 0.03348108448960768 0.03482887785597041 0.0 0.8266008657388199
2 0.066100442981558 0.0 0.03566543577965484 0.03710116101033473 0.0 0.9685823681046619
3 0.030596521808766947 0.028779865519712563 0.0 0.0 0.0 0.9781890670696571
4 0.0 0.03803344358481952 0.0 0.0 0.0 0.9964607105785932
5 0.0 0.0 0.0 0.07716693868942119 0.0 0.938602085540054
6 0.17689804723173405 0.033278959234969596 0.07635828939724364 0.15886424082427333 0.0 0.8718951596544265
7 0.0 0.0 0.02288252957804802 0.0 0.0 0.9603936784408945
8 0.08544543470034431 0.3214885842670747 0.09220660336028486 0.09591841408082484 0.0 0.39837897672993183
9 0.09492740119653752 0.02976370819366948 0.06829257573052833 0.0 0.0 0.9273261812039216
10 0.06892455146463301 0.0648321836892671 0.1859461187415361 0.0 0.0 0.8492883859345594
11 0.06407942255789043 0.02009157746015972 0.13829986166195216 0.023977862240478147 0.0 0.938967971292072
12 0.0 0.06353009389659953 0.03644231525495783 0.0 0.0 0.8772167495025313
13 0.0 0.0 0.044113599370101265 0.030592939021541497 0.0 0.34488252084969045
Could anyone help me get rid of these when I extract keywords using TF-IDF?
This is the vectorizer I initialized:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words='english', analyzer='word', max_features=50)
You can pass a list of stopwords to TfidfVectorizer:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=['http', 'https'], analyzer='word', max_features=50)
These words will be ignored when vectorizing the texts.
And you can add your words to the default list like this:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
my_stop_words = text.ENGLISH_STOP_WORDS.union(['http', 'https'])
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=my_stop_words, analyzer='word', max_features=50)
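Note that stop words only suppress the exact tokens http and https; the rest of each URL (domains, path fragments) can still surface as features. If that becomes a problem, one heavier-handed option (a sketch; strip_urls is a helper name introduced here) is to remove whole URLs with a custom preprocessor:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def strip_urls(text):
    # Drop anything that looks like a URL, then lowercase.
    # A custom preprocessor replaces the built-in one, so lowercasing
    # has to happen here rather than via the lowercase argument.
    return re.sub(r'https?://\S+', ' ', text).lower()

vectorizer = TfidfVectorizer(input='content', preprocessor=strip_urls,
                             stop_words='english', analyzer='word',
                             max_features=50)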

"During handling of the above exception, another exception occurred" when using SHAP to interpret a Keras neural network model

The x_train looks like this (22 features):
total_amount reward difficulty duration discount bogo mobile social web income ... male other_gender age_under25 age_25_to_35 age_35_to_45 age_45_to_55 age_55_to_65 age_65_to_75 age_75_to_85 age_85_to_105
0 0.006311 0.2 0.50 1.000000 1.0 0.0 1.0 1.0 1.0 0.355556 ... 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.015595 0.2 0.50 1.000000 1.0 0.0 1.0 1.0 1.0 0.977778 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
The label is 0 or 1; it's a binary classification problem. Here's the code for building the model, and I was following this page to implement SHAP:
# use SHAP
deep_explainer = shap.DeepExplainer(nn_model_2, x_train[:100])
# explain the first 10 predictions
# explaining each prediction requires 2 * background dataset size runs
shap_values = deep_explainer.shap_values(x_train)
This gave me error:
KeyError: 0
During handling of the above exception, another exception occurred
I have no idea what this message is complaining about. I tried using SHAP with an XGBoost model and a logistic regression model, and both work fine. I'm new to Keras and SHAP; can someone take a look and tell me how I can solve it? Many thanks.
I think SHAP (whatever it is) is expecting a NumPy array, so when it indexes x_train like a NumPy array it hits a KeyError. Try:
shap_values = deep_explainer.shap_values(x_train.values)
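If the background sample passed to DeepExplainer is also a DataFrame, the same conversion presumably applies there too (a sketch, reusing nn_model_2 and x_train from the question):
# Pass plain NumPy arrays for both the background data and the rows to explain.
deep_explainer = shap.DeepExplainer(nn_model_2, x_train.values[:100])
shap_values = deep_explainer.shap_values(x_train.values[:10])  # first 10 rows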

Using groupby().sum() on a dataframe, then plotting a pie chart with labels?

This is my first question here; I'm quite new to Python/pandas/matplotlib.
I have this line of code that creates a DataFrame:
repartition = sorted2017.groupby(by=sorted2017["Traitement"]).sum()
It works as I expected, except that the column title "Traitement" seems to appear on its own row:
Prix Coût net Manuvie CCQ SSQ
Traitement
masso (Véro) 213.86 0.0 144.0 69.86 0.0
ostéo (Véro) 80.00 0.0 64.0 16.00 0.0
physio (Danny) 415.00 0.0 265.0 150.00 0.0
physio (Véro) 269.00 0.0 204.8 64.20 0.0
psy (Simone) 500.00 0.0 150.0 350.00 0.0
psy (Véro) 300.00 0.0 240.0 60.00 0.0
I wanted to use the "Traitement" column as labels for my matplotlib pie chart, so I tried:
plt.pie(repartition["Prix"], labels=repartition["Traitement"])
plt.show()
But I get a KeyError. I've also tried iloc for the labels, but then I get:
ValueError: 'label' must be of length 'x'
How can I fix this?
After groupby, "Traitement" is no longer a regular column; it has become the index. Use the index for the labels:
plt.pie(x=repartition["Prix"], labels=repartition.index)
plt.show()
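If you would rather keep "Traitement" as a regular column, an equivalent sketch is to reset the index first:
repartition = repartition.reset_index()  # moves "Traitement" back into a column
plt.pie(x=repartition["Prix"], labels=repartition["Traitement"])
plt.show()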

Converting a matrix of categorical data to numeric for a decision tree?

I have some data in a form of numpy array as follows:
array([['vhigh', '2', '2', 'small', 'low', 'unacc'],
['vhigh', '2', '2', 'small', 'med', 'unacc'],
['vhigh', '2', '2', 'small', 'high', 'good']], dtype=object)
that is extracted from the car dataset available at:
https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
I want to use this data to fit a classification decision tree with scikit-learn, and I managed to convert the first column (the category) into a set of numbers with:
y=data[:,0]
y=le.fit_transform(y)
print y
because I was getting an error that said:
could not convert string to float
The problem I have is converting the rest of the array into a one-hot encoding. I have done the following:
X=data[:,1:]
enc=preprocessing.LabelEncoder()
enc.fit(X)
Xn=enc.transform(X)
Xn=Xn.reshape(-1,1)
ohe=preprocessing.OneHotEncoder(sparse=False)
and the error I get is:
bad input shape (1728L, 6L)
What am I doing wrong? Or is there another way to convert an array from categorical to numeric?
Thanks
Your "bad input shape" error comes from fitting LabelEncoder on the whole 2-D feature matrix: it only accepts a single 1-D column at a time. For recent scikit-learn versions (>= 0.20) you can just use OneHotEncoder on all the feature columns at once:
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', header=None)
X, y = df.iloc[:, 1:], df.iloc[:, 0]
encoded_y = preprocessing.LabelEncoder().fit_transform(y)
sklearn >= 0.20:
ohe = preprocessing.OneHotEncoder(sparse=False)
encoded_x = ohe.fit_transform(X)
>>> pd.DataFrame(encoded_x, columns=ohe.get_feature_names())
x0_high x0_low x0_med x0_vhigh x1_high x1_low ...
0 0.0 0.0 0.0 1.0 0.0 0.0 ...
1 0.0 0.0 0.0 1.0 0.0 0.0 ...
2 0.0 0.0 0.0 1.0 0.0 0.0 ...
3 0.0 0.0 0.0 1.0 0.0 0.0 ...
4 0.0 0.0 0.0 1.0 0.0 0.0 ...
5 0.0 0.0 0.0 1.0 0.0 0.0 ...
...
sklearn < 0.20:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
encoded_x = dv.fit_transform(X.to_dict(orient='records'))
pd.DataFrame(encoded_x, columns=dv.get_feature_names())
Fitting a classifier:
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier().fit(encoded_x, encoded_y)
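To sanity-check the tree, you can hold out a test split and score it (a sketch using the variables defined above):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    encoded_x, encoded_y, test_size=0.2, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out rows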

One hot encoding error python machine learning

I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class. I have used the label-encoding technique.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the output looks like this, i.e. where is the value of the 'height' column? Also, data.shape is (4, 8) instead of the expected (4, 7) (gender represented by 2 columns, class by 3, plus the 'age' and 'height' features).
Are you sure that you need to use LabelEncoder + OneHotEncoder? There is a much simpler method (it does not allow advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data = pd.get_dummies(X)
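For the sample above this produces exactly the seven columns you expected, with age and height passed through untouched:
print(data.columns.tolist())
# ['age', 'height', 'gender_f', 'gender_m', 'class_A', 'class_B', 'class_C']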
The problem with the current code is that after you have done the first OHE:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
the encoded columns are stacked in front and everything else shifts right, so column 3 is in fact the original height column instead of the label-encoded class column. Change the second encoder to use column 4 and you will get what you want.
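Alternatively, with the old OneHotEncoder API your code uses (categorical_features was removed in later scikit-learn releases), both categorical columns can be encoded in a single pass, which avoids the index bookkeeping entirely (a sketch):
onehotencoder = OneHotEncoder(categorical_features=[1, 3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)  # (4, 7): 2 gender + 3 class + age + height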
