I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class. I have used the LabelEncoder technique.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the columns look like this, i.e. where are the values of the 'height' column? Also, data.shape is (4,8) instead of the (4,7) I expected (gender represented by 2 columns, class by 3, plus the 'age' and 'height' features).
Are you sure that you need to use LabelEncoder + OneHotEncoder? There is a much simpler method (it does not support more advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
X=df.drop(columns=['label'])
y=np.array(df['label'])
data = pd.get_dummies(X)
The problem with the current code is that after you have done the first OHE:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
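the columns get shifted: the encoder puts the two one-hot gender columns first, followed by the remaining columns (age, height, class). Column 3 is therefore the original height column, not the label-encoded class column, so the second encoder one-hot encodes the four distinct heights. Change the second encoder to use column 4 and you will get what you want.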
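As a hedged aside: the categorical_features argument no longer exists in recent scikit-learn releases, where ColumnTransformer is the usual replacement. The sketch below rebuilds the four sample rows inline instead of reading test.csv, and selects columns by name so that no index can shift between encoding steps. The variable names ct and encoded are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The four sample rows from the question, built inline instead of read from test.csv.
df = pd.DataFrame({
    "age": [25, 35, 12, 14],
    "gender": ["m", "f", "m", "f"],
    "height": [43, 45, 36, 42],
    "class": ["A", "B", "C", "A"],
    "label": [0, 1, 0, 0],
})
X = df.drop(columns=["label"])

# One-hot encode the two categorical columns by name; pass age and height through.
# sparse_threshold=0 asks for a dense array regardless of how sparse the result is.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), ["gender", "class"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = ct.fit_transform(X)

# Column order: gender_f, gender_m, class_A, class_B, class_C, age, height
print(encoded.shape)
```

Because both categorical columns are encoded in a single step, the shape comes out as (4, 7) and the height values survive unchanged in the last column.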
I'm trying to use the minimum values of each column to replace missing values but keep getting an error. Below is my code:
from sklearn.impute import SimpleImputer
numeric_cols = [X_test.select_dtypes(exclude=['object']).columns]
numeric_df = X_test.select_dtypes(exclude=['object'])
for col in numeric_cols:
    my_imputer = SimpleImputer(strategy='constant', fill_value=X_test[col].min())
imputed_numeric_X_test = pd.DataFrame(my_imputer.fit_transform(numeric_df))
imputed_numeric_X_test.columns = numeric_df.columns
This is the error I get when I run it:
ValueError: 'fill_value'=MSSubClass 20.0
LotFrontage 21.0
LotArea 1470.0
OverallQual 1.0
OverallCond 1.0
YearBuilt 1879.0
YearRemodAdd 1950.0
MasVnrArea 0.0
BsmtFinSF1 0.0
BsmtFinSF2 0.0
BsmtUnfSF 0.0
TotalBsmtSF 0.0
1stFlrSF 407.0
2ndFlrSF 0.0
LowQualFinSF 0.0
GrLivArea 407.0
BsmtFullBath 0.0
BsmtHalfBath 0.0
FullBath 0.0
HalfBath 0.0
BedroomAbvGr 0.0
KitchenAbvGr 0.0
TotRmsAbvGrd 3.0
Fireplaces 0.0
GarageYrBlt 1895.0
GarageCars 0.0
GarageArea 0.0
WoodDeckSF 0.0
OpenPorchSF 0.0
EnclosedPorch 0.0
3SsnPorch 0.0
ScreenPorch 0.0
PoolArea 0.0
MiscVal 0.0
MoSold 1.0
YrSold 2006.0
dtype: float64 is invalid. Expected a numerical value when imputing numerical data
What is wrong and how can I fix it?
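SimpleImputer only supports a single value for fill_value, not a per-column specification. In your loop, numeric_cols is a one-element list holding the whole column Index, so X_test[col].min() returns a Series of per-column minima rather than a scalar; that Series is what the error message prints. Adding per-column fill values was discussed in scikit-learn issue #19783, but it was not adopted, and it wouldn't support taking the columnwise minimum anyway. I can't find any discussion about adding a custom callable option for strategy, which would seem to be the clearest solution. So I think you're stuck doing it manually or with a custom transformer. To do it somewhat manually, you could use the ColumnTransformer approach described in the linked issue.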
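One way to do it manually is with plain pandas rather than SimpleImputer. This is a minimal sketch on a small made-up frame: the three rows and the Street column are invented for illustration, and select_dtypes(include="number") is used here in place of the exclude=['object'] filter from the question.

```python
import numpy as np
import pandas as pd

# Small made-up stand-in for X_test (values are illustrative only).
X_test = pd.DataFrame({
    "LotFrontage": [21.0, np.nan, 80.0],
    "LotArea": [1470.0, 9600.0, np.nan],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Keep only the numeric columns.
numeric_df = X_test.select_dtypes(include="number")

# Series of per-column minima; NaN entries are skipped by default.
col_min = numeric_df.min()

# fillna with a Series fills each column with the value at the matching label.
imputed_numeric_X_test = numeric_df.fillna(col_min)
print(imputed_numeric_X_test)
```

If the same fill values have to be reused on other data, col_min can be computed once (for example on the training set) and passed to fillna on each frame.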
I performed a PCA of my data. The data looks like the following:
df
Out[60]:
Drd1_exp1 Drd1_exp2 Drd1_exp3 ... M7_pppp M7_puuu Brain_Region
0 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
3 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
4 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
... ... ... ... ... ... ...
150475 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
150478 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
150479 -1.0 -1.0 -1.0 ... 0.0 0.0 BaGr
I then used every column up to 'Brain_Region' as a feature. I also standardized them.
These features are different experiments, that give me information about a 3D image of a brain.
I'll show you my code:
from sklearn.preprocessing import StandardScaler
x = df.loc[:, listend1].values
y= df.loc[:, 'Brain_Region'].values
x = StandardScaler().fit_transform(x)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
, columns = ['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, df[['Brain_Region']]], axis = 1)
I then plotted finalDF:
My question now is: how can I find out which features contribute to my components, and how can I interpret the result?
You can use pca.components_.
It has shape (n_components, n_features), in your case (2, n_features). Each row is one principal axis, i.e. a direction of maximum variance in the data, and each entry is the weight of the corresponding feature in that direction (the eigenvector coefficient). The larger the absolute value of an entry, the more that feature contributes to the component. You will have something like this:
[[0.522 0.26 0.58 0.56],
[0.37 0.92 0.02 0.06]]
implying that for the first component (first row) the first, third and last features have a higher importance, while for the second component only the second feature is important.
Have a look at the sklearn PCA attributes description or at this post.
By the way, you can also use a Random Forest Classifier including the labels, and after the training you can explore the feature importance, e.g. this post.
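To make the components_ matrix easier to read, it can be wrapped in a DataFrame labelled with the feature names. The sketch below uses synthetic data and hypothetical feature names (exp1 to exp4) rather than the brain data from the question; two of the features are made strongly correlated on purpose so that they dominate the first component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["exp1", "exp2", "exp3", "exp4"]  # hypothetical names

# Synthetic data: exp2 is almost a copy of exp1, exp3 and exp4 are independent noise.
x = rng.normal(size=(100, 4))
x[:, 1] = x[:, 0] + 0.1 * rng.normal(size=100)
x = StandardScaler().fit_transform(x)

pca = PCA(n_components=2).fit(x)

# One row per component, one column per feature.
loadings = pd.DataFrame(pca.components_, columns=feature_names, index=["PC1", "PC2"])

# Rank the features by absolute weight in the first component.
top_pc1 = loadings.loc["PC1"].abs().sort_values(ascending=False)
print(loadings)
print(top_pc1)
```

The sign of a weight only indicates direction along the axis, so the ranking uses absolute values. pca.explained_variance_ratio_ shows how much variance each component accounts for.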
Looking at the documentation of the OneHotEncoder there doesn't seem to be a way to include the feature names as a prefix of the OneHot vectors. Does anyone know of a way around this? Am I missing something?
Sample dataframe:
df = pd.DataFrame({'a':['c1', 'c1', 'c2', 'c1', 'c3'], 'b':['c1', 'c4', 'c1', 'c1', 'c1']})
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
onehot.fit(df)
onehot.get_feature_names()
array(['x0_c1', 'x0_c2', 'x0_c3', 'x1_c1', 'x1_c4'], dtype=object)
Where given that the encoder is fed a dataframe I'd expect the possibility to obtain something like:
array(['a_c1', 'a_c2', 'a_c3', 'b_c1', 'b_c4'], dtype=object)
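Here is what you need to do to include your feature names: pass them to get_feature_names. (From scikit-learn 1.0 on, the method is called get_feature_names_out; get_feature_names was removed in 1.2.)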
onehot.get_feature_names(input_features=df.columns)
Output:
array(['a_c1', 'a_c2', 'a_c3', 'b_c1', 'b_c4'], dtype=object)
Per docs:
get_feature_names(self, input_features=None)
Return feature names for output features.
Parameters:
input_features : list of string, length n_features, optional
String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.
Returns:
output_feature_names : array of string, length n_output_features
Let's create a dataframe with 3 columns, each having some categorical values.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df_dict= {'Sex' :['m', 'f' ,'m' ,'f'] , 'City' : ['C1' , 'C2' , 'C3' , 'C4'] , 'States' :['S1' , 'S2', 'S3', 'S4']}
df = pd.DataFrame.from_dict(df_dict)
cat_enc = OneHotEncoder(handle_unknown = 'ignore')
transformed_array = cat_enc.fit_transform(df).toarray()
transformed_df = pd.DataFrame(transformed_array , columns= cat_enc.get_feature_names(df.columns))
transformed_df.head()
We will get the following output -
City_C1 City_C2 City_C3 City_C4 Sex_f Sex_m States_S1 States_S2 States_S3 States_S4
0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0
3 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0
I have some data in a form of numpy array as follows:
array([['vhigh', '2', '2', 'small', 'low', 'unacc'],
['vhigh', '2', '2', 'small', 'med', 'unacc'],
['vhigh', '2', '2', 'small', 'high', 'good']], dtype=object)
that is extracted from the car dataset available at:
https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
I want to use this data to apply a classification decision tree by using scikit and I managed to convert the first column, or category, into a set of numbers with:
y=data[:,0]
y=le.fit_transform(y)
print y
because I was getting an error that said:
could not convert string to float
The problem I have is with converting the rest of the array into a one-hot encoding. I have done the following:
X=data[:,1:]
enc=preprocessing.LabelEncoder()
enc.fit(X)
Xn=enc.transform(X)
Xn=Xn.reshape(-1,1)
ohe=preprocessing.OneHotEncoder(sparse=False)
and the error I get is:
bad input shape (1728L, 6L)
What am I doing wrong? Or is there another way to convert an array from categorical to numeric?
Thanks
For recent sklearn versions (>= 0.20) you can use OneHotEncoder directly on the string columns:
import pandas as pd
from sklearn import preprocessing
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data', header=None)
X, y = df.iloc[:,1:] , df.iloc[:,0]
encoded_y = preprocessing.LabelEncoder().fit_transform(y)
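sklearn >= 0.20:
# From sklearn 1.2 on, sparse=False is spelled sparse_output=False
# and get_feature_names() is replaced by get_feature_names_out().
ohe = preprocessing.OneHotEncoder(sparse=False)
encoded_x = ohe.fit_transform(X)
>>> pd.DataFrame(encoded_x, columns=ohe.get_feature_names())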
x0_high x0_low x0_med x0_vhigh x1_high x1_low ...
0 0.0 0.0 0.0 1.0 0.0 0.0 ...
1 0.0 0.0 0.0 1.0 0.0 0.0 ...
2 0.0 0.0 0.0 1.0 0.0 0.0 ...
3 0.0 0.0 0.0 1.0 0.0 0.0 ...
4 0.0 0.0 0.0 1.0 0.0 0.0 ...
5 0.0 0.0 0.0 1.0 0.0 0.0 ...
...
sklearn < 0.20:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
encoded_x = dv.fit_transform(X.to_dict(orient='records'))
pd.DataFrame(encoded_x, columns=dv.get_feature_names())
Fitting a classifier:
from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier().fit(encoded_x, encoded_y)
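The steps above can be sketched end to end without downloading the dataset. The example below uses the three rows shown in the question plus one made-up fourth row (marked in the code) so that the target has two classes. It calls .toarray() on the encoder output instead of passing sparse=False, which sidesteps the parameter rename between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = np.array([
    ['vhigh', '2', '2', 'small', 'low', 'unacc'],
    ['vhigh', '2', '2', 'small', 'med', 'unacc'],
    ['vhigh', '2', '2', 'small', 'high', 'good'],
    ['low', '4', '4', 'big', 'high', 'good'],  # made-up row, for illustration only
], dtype=object)

# As in the question, the first column is used as the target.
X, y = data[:, 1:], data[:, 0]

# LabelEncoder works on a 1-D target; OneHotEncoder works on the 2-D feature array.
encoded_y = LabelEncoder().fit_transform(y)
encoded_x = OneHotEncoder().fit_transform(X).toarray()

clf = DecisionTreeClassifier(random_state=0).fit(encoded_x, encoded_y)
print(encoded_x.shape, clf.predict(encoded_x))
```

The "bad input shape" error in the question arises because LabelEncoder expects a 1-D array of labels, while X has several columns; OneHotEncoder accepts the 2-D array directly.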
I would like to compute a daily percentage change for this DataFrame (frame_):
import pandas as pd
import numpy as np
data_ = {
'A':[1,np.NaN,2,1,1,2],
'B':[1,2,3,1,np.NaN,1],
'C':[1,2,np.NaN,1,1,2],
}
dates_ = [
'06/01/2018','05/01/2018','04/01/2018','03/01/2018','02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A','B','C'])
The issue is that I get a DataFrame with this method:
returns_ = frame_.pct_change(periods=1, fill_method='pad')
dates,A,B,C
06/01/2018,,,
05/01/2018,,1.0,1.0
04/01/2018,1.0,0.5,
03/01/2018,-0.5,-0.6666666666666667,-0.5
02/01/2018,0.0,,0.0
01/01/2018,1.0,0.0,1.0
This is not what I am looking for, and the dropna() method doesn't give me the result I seek either. I would like to compute a value for each day that has a value, and NaN for each day where the value is missing. For example, for column A, I would like to see the following percentage changes:
dates,A
06/01/2018,1
05/01/2018,
04/01/2018,1.0
03/01/2018,-0.5
02/01/2018,0.0
01/01/2018,1.0
Many thanks in advance
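This is one way, using a bit of brute force.
import pandas as pd
import numpy as np
data_ = {
    'A':[1,np.nan,2,1,1,2],
    'B':[1,2,3,1,np.nan,1],
    'C':[1,2,np.nan,1,1,2],
}
dates_ = [
    '06/01/2018','05/01/2018','04/01/2018','03/01/2018','02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A','B','C'])
# Add empty result columns dA, dB, dC
frame_ = pd.concat([frame_, pd.DataFrame(columns=['dA', 'dB', 'dC'])])
for col in ['A', 'B', 'C']:
    # Percentage change relative to the previous row
    frame_['d'+col] = frame_[col].pct_change()
    # Where a value exists but no change could be computed, fall back to the value itself
    frame_.loc[pd.notnull(frame_[col]) & pd.isnull(frame_['d'+col]), 'd'+col] = frame_[col]
# Output as reported when this answer was written; how pct_change treats NaN differs between pandas versions.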
# A B C dA dB dC
# 06/01/2018 1.0 1.0 1.0 1.0 1.000000 1.0
# 05/01/2018 NaN 2.0 2.0 NaN 1.000000 1.0
# 04/01/2018 2.0 3.0 NaN 1.0 0.500000 NaN
# 03/01/2018 1.0 1.0 1.0 -0.5 -0.666667 -0.5
# 02/01/2018 1.0 NaN 1.0 0.0 NaN 0.0
# 01/01/2018 2.0 1.0 2.0 1.0 0.000000 1.0
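Since the way pct_change handles missing values has changed across pandas releases, here is a hedged variant that computes the change by hand and should not depend on those defaults. It forward-fills only to obtain the previous available value, masks the days that were originally NaN, and then applies the same fallback as above. This is a sketch of one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

frame_ = pd.DataFrame(
    {
        'A': [1, np.nan, 2, 1, 1, 2],
        'B': [1, 2, 3, 1, np.nan, 1],
        'C': [1, 2, np.nan, 1, 1, 2],
    },
    index=['06/01/2018', '05/01/2018', '04/01/2018',
           '03/01/2018', '02/01/2018', '01/01/2018'],
)

# Forward-fill so each row can be compared with the last available value above it.
filled = frame_.ffill()
returns_ = filled / filled.shift(1) - 1

# Days that had no value in the original data stay NaN.
returns_ = returns_.where(frame_.notna())

# Same fallback as the loop above: a value exists but no change could be
# computed (the first row), so use the value itself.
returns_ = returns_.where(returns_.notna() | frame_.isna(), frame_)
print(returns_)
```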