Converting a matrix of categorical data to numeric for a decision tree in Python

I have some data in the form of a numpy array:
array([['vhigh', '2', '2', 'small', 'low', 'unacc'],
       ['vhigh', '2', '2', 'small', 'med', 'unacc'],
       ['vhigh', '2', '2', 'small', 'high', 'good']], dtype=object)
extracted from the Car Evaluation dataset available at:
https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
I want to use this data to fit a classification decision tree with scikit-learn, and I managed to convert the first column (the category) into a set of numbers with:
y=data[:,0]
y=le.fit_transform(y)
print y
because I was getting an error that said:
could not convert string to float
The problem I have is when I want to convert the rest of the array into a one-hot encoding. I have done the following:
X=data[:,1:]
enc=preprocessing.LabelEncoder()
enc.fit(X)
Xn=enc.transform(X)
Xn=Xn.reshape(-1,1)
ohe=preprocessing.OneHotEncoder(sparse=False)
and the error I get is:
bad input shape (1728L, 6L)
What am I doing wrong? Or is there another way to convert an array from categorical to numeric?
Thanks

For recent scikit-learn versions (>= 0.20) you can just use OneHotEncoder directly on the string data. Your LabelEncoder attempt fails because LabelEncoder only accepts a 1-D array, which is where the bad input shape error comes from when fitting it on the whole 2-D X:
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', header=None)
X, y = df.iloc[:, 1:], df.iloc[:, 0]
encoded_y = preprocessing.LabelEncoder().fit_transform(y)
sklearn >= 0.20:
ohe = preprocessing.OneHotEncoder(sparse=False)
encoded_x = ohe.fit_transform(X)
>>> pd.DataFrame(encoded_x, columns=ohe.get_feature_names())
   x0_high  x0_low  x0_med  x0_vhigh  x1_high  x1_low  ...
0      0.0     0.0     0.0       1.0      0.0     0.0  ...
1      0.0     0.0     0.0       1.0      0.0     0.0  ...
2      0.0     0.0     0.0       1.0      0.0     0.0  ...
3      0.0     0.0     0.0       1.0      0.0     0.0  ...
4      0.0     0.0     0.0       1.0      0.0     0.0  ...
5      0.0     0.0     0.0       1.0      0.0     0.0  ...
...
sklearn < 0.20:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
encoded_x = dv.fit_transform(X.to_dict(orient='records'))
pd.DataFrame(encoded_x, columns=dv.get_feature_names())
Fitting a classifier:
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier().fit(encoded_x, encoded_y)
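A small usage sketch (reusing the snippets above): keep the fitted LabelEncoder around, so integer predictions can be mapped back to the original string labels with inverse_transform:
le = preprocessing.LabelEncoder()
encoded_y = le.fit_transform(y)
tree = DecisionTreeClassifier().fit(encoded_x, encoded_y)
pred = tree.predict(encoded_x[:3])
print(le.inverse_transform(pred))  # e.g. ['vhigh' 'vhigh' 'vhigh']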


How to get rid of URLs while using TfidfVectorizer

I'm using TfidfVectorizer to extract features from my samples, which are all text. However, my samples contain many URLs, and as a result http and https become important features. This also causes inaccurate predictions later with my Naive Bayes model.
The features I got are as follows. As you can see, https has high values.
good got great happy http https
0 0.18031992253877868 0.056537832999741425 0.0 0.13494772859235538 0.0 0.7206169458767526
1 0.062052081178508904 0.0 0.03348108448960768 0.03482887785597041 0.0 0.8266008657388199
2 0.066100442981558 0.0 0.03566543577965484 0.03710116101033473 0.0 0.9685823681046619
3 0.030596521808766947 0.028779865519712563 0.0 0.0 0.0 0.9781890670696571
4 0.0 0.03803344358481952 0.0 0.0 0.0 0.9964607105785932
5 0.0 0.0 0.0 0.07716693868942119 0.0 0.938602085540054
6 0.17689804723173405 0.033278959234969596 0.07635828939724364 0.15886424082427333 0.0 0.8718951596544265
7 0.0 0.0 0.02288252957804802 0.0 0.0 0.9603936784408945
8 0.08544543470034431 0.3214885842670747 0.09220660336028486 0.09591841408082484 0.0 0.39837897672993183
9 0.09492740119653752 0.02976370819366948 0.06829257573052833 0.0 0.0 0.9273261812039216
10 0.06892455146463301 0.0648321836892671 0.1859461187415361 0.0 0.0 0.8492883859345594
11 0.06407942255789043 0.02009157746015972 0.13829986166195216 0.023977862240478147 0.0 0.938967971292072
12 0.0 0.06353009389659953 0.03644231525495783 0.0 0.0 0.8772167495025313
13 0.0 0.0 0.044113599370101265 0.030592939021541497 0.0 0.34488252084969045
Could anyone please help me get rid of these when I extract keywords using TF-IDF?
This is the vectorizer I initialized:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words='english', analyzer='word', max_features=50)
You can pass a list of stopwords to TfidfVectorizer:
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=['http', 'https'], analyzer='word', max_features=50)
These words will be ignored when vectorizing the texts.
And you can add your words to the default list like this:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
my_stop_words = text.ENGLISH_STOP_WORDS.union(['http', 'https'])
vectorizer = TfidfVectorizer(input='content', lowercase=True, stop_words=my_stop_words, analyzer='word', max_features=50)
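If the URLs leak other noisy tokens (domains, paths) into the vocabulary as well, another option is to strip whole URLs with a regex before vectorizing. This is just a sketch: strip_urls is a hypothetical helper, and note that supplying a custom preprocessor replaces the built-in lowercasing, so it lowercases the text itself:
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def strip_urls(text):
    # drop anything that looks like an http(s) URL, then lowercase
    return re.sub(r'https?://\S+', ' ', text).lower()

vectorizer = TfidfVectorizer(input='content', preprocessor=strip_urls, stop_words='english', analyzer='word', max_features=50)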

Error when imputing minimum values using SimpleImputer

I'm trying to use the minimum values of each column to replace missing values but keep getting an error. Below is my code:
from sklearn.impute import SimpleImputer
numeric_cols = [X_test.select_dtypes(exclude=['object']).columns]
numeric_df = X_test.select_dtypes(exclude=['object'])
for col in numeric_cols:
    my_imputer = SimpleImputer(strategy='constant', fill_value=X_test[col].min())
    imputed_numeric_X_test = pd.DataFrame(my_imputer.fit_transform(numeric_df))
    imputed_numeric_X_test.columns = numeric_df.columns
This is the error I get when I run it:
ValueError: 'fill_value'=MSSubClass 20.0
LotFrontage 21.0
LotArea 1470.0
OverallQual 1.0
OverallCond 1.0
YearBuilt 1879.0
YearRemodAdd 1950.0
MasVnrArea 0.0
BsmtFinSF1 0.0
BsmtFinSF2 0.0
BsmtUnfSF 0.0
TotalBsmtSF 0.0
1stFlrSF 407.0
2ndFlrSF 0.0
LowQualFinSF 0.0
GrLivArea 407.0
BsmtFullBath 0.0
BsmtHalfBath 0.0
FullBath 0.0
HalfBath 0.0
BedroomAbvGr 0.0
KitchenAbvGr 0.0
TotRmsAbvGrd 3.0
Fireplaces 0.0
GarageYrBlt 1895.0
GarageCars 0.0
GarageArea 0.0
WoodDeckSF 0.0
OpenPorchSF 0.0
EnclosedPorch 0.0
3SsnPorch 0.0
ScreenPorch 0.0
PoolArea 0.0
MiscVal 0.0
MoSold 1.0
YrSold 2006.0
dtype: float64 is invalid. Expected a numerical value when imputing numerical data
What is wrong and how can I fix it?
SimpleImputer only supports a single scalar for fill_value, not a per-column specification. Adding that was discussed in issue #19783 but passed on, and it wouldn't have supported taking the column-wise minimum anyway. I can't find any discussion of adding a custom callable option for strategy, which would seem to be the cleanest solution. So I think you're stuck doing it manually or with a custom transformer. To do it somewhat manually, you could use the ColumnTransformer approach described in the linked issue.
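For reference, a minimal sketch of the manual route, assuming X_test is a pandas DataFrame: fillna accepts a Series indexed by column name, so the per-column minimum can be applied directly without SimpleImputer:
numeric_df = X_test.select_dtypes(exclude=['object'])
# numeric_df.min() is a Series of per-column minimums; fillna aligns it by column
imputed_numeric_X_test = numeric_df.fillna(numeric_df.min())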

TypeError: Unable to build `Dense` layer with non-floating point dtype <dtype: 'string'> : My data has no String type data in it

I'm trying to build a Neural Network for classification. I preprocessed all the data and it looks like this:
0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 2 0.12436986167881312 -0.426405420419126 1
Although everything looks okay and every data type is int or float, I'm still getting the following error:
File "C:\Users\spark\anaconda3\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1002, in build
'dtype %s' % (dtype,))
TypeError: Unable to build `Dense` layer with non-floating point dtype <dtype: 'string'>
Most features are dummies or were scaled with StandardScaler, so they are floats. Just to be sure, I also checked the data types of the last and fourth-to-last columns (the integer values above); they are indeed integers.
So why am I getting this error, and how do I resolve it?
Below is the code I'm using:
X = dataset.iloc[:, 1:].values
y = dataset.iloc[:, 0].values
pred_set = prediction_set.values
temp_dataset = np.concatenate([X, pred_set], axis=0)
'''Encoding Features'''
index_list = [1,2,4,6,7]
reverse_index = []
for i in range(len(index_list)):
    reverse_index.append(index_list[i]-temp_dataset.shape[1])
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
for i in range(len(reverse_index)):
    index = reverse_index[i]
    ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [index])], remainder = 'passthrough')
    temp_dataset = ct.fit_transform(temp_dataset)
X = temp_dataset[:891]
pred_set = temp_dataset[891:]
'''Train - Test split'''
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2, random_state = 0)
'''Feature scaling'''
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
Xtrain[:, (-2,-3)] = sc.fit_transform(Xtrain[:, (-2,-3)])
Xtest[:, (-2,-3)] = sc.transform(Xtest[:, (-2,-3)])
pred_set[:, (-2,-3)] = sc.transform(pred_set[:, (-2,-3)])
Xtrain.astype(float)
Xtest.astype(float)
pred_set.astype(float)
import tensorflow as tf
classifier = tf.keras.models.Sequential()
classifier.add(tf.keras.layers.Dense(units= 20, activation='relu'))
classifier.add(tf.keras.layers.Dense(units= 20, activation='relu'))
classifier.add(tf.keras.layers.Dense(units= 1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss = 'binary_crossentropy', metrics=['accuracy'])
classifier.fit(Xtrain, ytrain)
Try casting your data to float; astype works with both pandas and numpy. Note that astype returns a new object instead of converting in place, so the result must be assigned back (in your code the results of the three astype calls are thrown away):
Xtrain = Xtrain.astype(float)
and likewise for Xtest and pred_set. As noted in the comments, do not try to convert just a column; convert the whole dataset.
In case there are some characters that are not digits, the following code should help:
df.iloc[:, -1] = pd.to_numeric(df.iloc[:, -1], errors='coerce')
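A quick sanity-check sketch before fitting, reusing the variables from the question (the float32 cast is just one choice; any float dtype works):
import numpy as np

Xtrain = np.asarray(Xtrain).astype(np.float32)  # object array -> float array
ytrain = np.asarray(ytrain).astype(np.float32)
print(Xtrain.dtype, ytrain.dtype)  # should now be float32, not object
classifier.fit(Xtrain, ytrain)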

OneHot vectors with feature names

Looking at the documentation of OneHotEncoder, there doesn't seem to be a way to include the feature names as a prefix of the one-hot vectors. Does anyone know a way around this? Am I missing something?
Sample dataframe:
df = pd.DataFrame({'a':['c1', 'c1', 'c2', 'c1', 'c3'], 'b':['c1', 'c4', 'c1', 'c1', 'c1']})
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
onehot.fit(df)
onehot.get_feature_names()
array(['x0_c1', 'x0_c2', 'x0_c3', 'x1_c1', 'x1_c4'], dtype=object)
Given that the encoder is fed a dataframe, I'd expect to be able to obtain something like:
array(['a_c1', 'a_c2', 'a_c3', 'b_c1', 'b_c4'], dtype=object)
Here is what you need to do to include your feature names, using get_feature_names:
onehot.get_feature_names(input_features=df.columns)
Output:
array(['a_c1', 'a_c2', 'a_c3', 'b_c1', 'b_c4'], dtype=object)
Per the docs:
get_feature_names(self, input_features=None)
    Return feature names for output features.
    Parameters:
        input_features : list of string, length n_features, optional
            String names for input features if available. By default, "x0", "x1", ..., "xn_features" is used.
    Returns:
        output_feature_names : array of string, length n_output_features
Let's create a dataframe with 3 columns, each having some categorical values.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df_dict = {'Sex': ['m', 'f', 'm', 'f'], 'City': ['C1', 'C2', 'C3', 'C4'], 'States': ['S1', 'S2', 'S3', 'S4']}
df = pd.DataFrame.from_dict(df_dict)
cat_enc = OneHotEncoder(handle_unknown = 'ignore')
transformed_array = cat_enc.fit_transform(df).toarray()
transformed_df = pd.DataFrame(transformed_array , columns= cat_enc.get_feature_names(df.columns))
transformed_df.head()
We will get the following output:
City_C1 City_C2 City_C3 City_C4 Sex_f Sex_m States_S1 States_S2 States_S3 States_S4
0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
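Note that in scikit-learn 1.0 and later, get_feature_names was replaced by get_feature_names_out, which picks up the column names automatically when the encoder is fitted on a DataFrame:
# scikit-learn >= 1.0: no need to pass the column names explicitly
transformed_df = pd.DataFrame(transformed_array, columns=cat_enc.get_feature_names_out())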

One-hot encoding error in Python machine learning

I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class, and I have used the LabelEncoder technique.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the output looks like this, i.e. where is the value of the 'height' column? Also, data.shape is (4, 8) instead of (4, 7), i.e. gender represented by 2 columns, class by 3, plus the 'age' and 'height' features.
Are you sure you need LabelEncoder + OneHotEncoder? There is a much simpler method (it does not allow advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data = pd.get_dummies(X)
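For the sample data above, get_dummies keeps the numeric columns and expands only the object columns, so you would get something like this (column order and dtype may vary with the pandas version):
   age  height  gender_f  gender_m  class_A  class_B  class_C
0   25      43         0         1        1        0        0
1   35      45         1         0        0        1        0
2   12      36         0         1        0        0        1
3   14      42         1         0        1        0        0
which is the (4, 7) shape you expected.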
The problem with the current code is that after the first one-hot encoding:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
the encoded columns are moved to the front and the remaining columns shift, so column 3 is in fact the original height column instead of the label-encoded class column. Change the second encoder to use column 4 and you will get what you want.
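Concretely, a sketch of that fix under the old categorical_features API (deprecated since scikit-learn 0.20 and later removed):
onehotencoder = OneHotEncoder(categorical_features=[4])
data = onehotencoder.fit_transform(data).toarray()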
