Sklearn feature selection - python

I have been unable to use any of the sklearn feature selection methods without getting the following error:
"TypeError: cannot perform reduce with flexible type"
Working from examples, the feature selection methods appear to work only for non-classification problems. I am, of course, trying to do a classification problem. How can I fix this?
Example code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import random
# Load data
boston = load_boston()
X = boston["data"]
Y = boston["target"]
# Make a classification problem
classes = ['a', 'b', 'c']
Y = [random.choice(classes) for entry in Y]
# Perform feature selection
names = boston["feature_names"]
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X, Y)
print "Features sorted by their rank:"
print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))

I guess the following will solve your problem.
import numpy as np
X = np.array(X, dtype='float_')
Y = np.array(Y, dtype='float_')
Do this before calling the fit method. You can also use int_ instead of float_; it depends entirely on the data type you need.
If your labels are strings, you can use LabelEncoder to encode them as integers:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Y_encoded = le.fit_transform(Y)
model.fit(X, Y_encoded)
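Putting both pieces together, here is a minimal end-to-end sketch. It uses make_classification instead of load_boston (which has been removed from recent scikit-learn releases), and swaps LinearRegression for LogisticRegression, since a classifier is the more natural RFE estimator once the labels are classes; treat the dataset and parameters as placeholders:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
# Synthetic data with string labels, mirroring the setup in the question
X, y_int = make_classification(n_samples=200, n_features=6, n_informative=4, n_classes=3, random_state=0)
Y = np.array(['a', 'b', 'c'])[y_int]
# Encode the string labels as integers before fitting
le = LabelEncoder()
Y_encoded = le.fit_transform(Y)
# Rank features with RFE wrapped around a classifier
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1)
rfe.fit(X, Y_encoded)
print("Features sorted by their rank:", sorted(zip(rfe.ranking_, range(X.shape[1]))))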

Related

LinearSVC: Equation of a straight line that separates two classes from a scatterplot graph and pandas DataFrame

I'm trying to create a straight line that separates two classes. I'm using a pandas DataFrame with a seaborn scatterplot.
Here is my code before I get you into my problem:
Libraries:
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.svm import LinearSVC
from sklearn.metrics import ConfusionMatrixDisplay
from scipy.io import arff
Data:
arquivo_arff = arff.loadarff(r"/content/Rice_MSC_Dataset.arff")
dados = pd.DataFrame(arquivo_arff[0])
Filter:
dados = dados[['MINOR_AXIS', 'MAJOR_AXIS', 'CLASS']]
Another filter:
dados = dados[dados['CLASS'].isin([b"Arborio", b"Ipsala"])]
Graph with two parameters:
sns.scatterplot(
    data=dados,
    x="MINOR_AXIS",
    y="MAJOR_AXIS",
    hue="CLASS")
plt.show()
My problem is here, when I use LinearSVC to find the parameters and coefficients of my equation:
model = LinearSVC()
model.fit(dados.drop('CLASS', axis=1), dados['CLASS'])
a, b = model.coef_[0]
d = model.intercept_[0]
print('a:', a)
print('b:', b)
print('d:', d)
This gives the following error:
You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
I didn't understand that error very well. Is there any way I can fix this in my code?
The documentation for MultiLabelBinarizer has some good examples for specific uses, but a general workflow for sklearn transformers is:
Split data into features and labels
X = dados.drop('CLASS', axis=1)
y = dados['CLASS']
#optionally, use train_test_split to split data into training and validation sets
#X_train,X_test,y_train,y_test=train_test_split(X,y)
Do transformations on input and target data
from sklearn.preprocessing import MultiLabelBinarizer
mb = MultiLabelBinarizer()
mb.fit(y)
y_bin = mb.transform(y)
#can also be done in one step with y_bin = mb.fit_transform(y)
#if using train_test_split: mb.fit_transform(y_train); mb.transform(y_test)
Fit your model
model = LinearSVC()
model.fit(X, y_bin) #or model.fit(X_train, y_train_bin) if using training and validation sets
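That said, in this particular dataset each row has exactly one class, and this "legacy multi-label" error is typically triggered by the byte-string labels (b"Arborio", b"Ipsala") that scipy.io.arff produces, which scikit-learn mistakes for sequences of labels. A minimal alternative sketch, assuming the dados DataFrame from the question, is to decode the bytes to plain strings and fit directly:
# Decode the byte-string labels read from the .arff file
dados['CLASS'] = dados['CLASS'].str.decode('utf-8')
model = LinearSVC()
model.fit(dados.drop('CLASS', axis=1), dados['CLASS'])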

ValueError: Expected 2D array, got 1D array instead: array=[-1]

Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Predicted_values = pd.DataFrame(california_model.predict(x_test), columns=['Pred'])
Predicted_values
Final = pd.concat([x_test.reset_index(drop=True), y_test.reset_index(drop=True), Predicted_values], axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred) / Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train, y_train)
I get an error right here, and when I try converting my 1D array to 2D as follows, I cannot:
import numpy as np
x1_train = x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so if you could explain a bit it would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method applies to numpy arrays.
Do this instead:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
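Alternatively, you can avoid the reshape entirely by selecting the column with a double-bracket indexer, which returns a 2D DataFrame instead of a 1D Series (a sketch assuming the x_train and x_test frames from the question):
# Double brackets keep the selection two-dimensional
x1_train = x_train[['median_income']]
x1_test = x_test[['median_income']]
california_model_new = LinearRegression().fit(x1_train, y_train)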

Python label encoding : Decision tree classification

I'm really new to Python and am trying to run a decision tree model with the code below:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import sklearn as skl
data_forecast = pd.read_excel("./Forcast_data_Analytics.xlsx")
x = data_forecast[['Name','Power', 'FirstEventID','AlleventIds']]
y = data_forecast[['Possible_fix','Changes_Required']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.8)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
sample data:
Name Power FirstEventID AlleventIds Possible_fix Changes_Required
India I3000 10130-1 10130-1, 134-00 yes Bug Fix
Can I do the decision tree classification without label encoding?
Or do I need to encode my data in order to run the classification?
What is the best way to do this?
I want to treat everything as a string and encode it.
After classification, I also want to decode the labels.
I tried the below encoding method, which did not work:
from sklearn.preprocessing import LabelEncoder
vals = np.array(data_forecast)
LabelEncoder = LabelEncoder()
integer_encoded = LabelEncoder.fit_transform(vals)
Error:
Exception has occurred: ValueError
y should be a 1d array, got an array of shape (59, 23) instead.
What is the right way to do this?
How do i encode/decode my labels and use this?
The question is already old, but I'll try to help; it may be useful for someone else.
The error is simple and happens even before the encoding reaches the classifier: y should be a single column (a 1-dimensional array), and you passed two here:
y = data_forecast[['Possible_fix','Changes_Required']]
As for the encoding part, I'm no specialist, but what has worked for me is to load the data as a DataFrame df and then split off every column except the class as df2 for X:
df2 = df.loc[:, df.columns != 'col_class']
And encode only X:
from sklearn.preprocessing import LabelEncoder
X = df2.apply(LabelEncoder().fit_transform)
y = df['col_class']
Hope it helps.
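Since the asker also wants to decode predictions after classification, here is a minimal sketch that keeps one fitted LabelEncoder per column in a dict, so inverse_transform stays available. Column names are taken from the question's sample data; treat them as placeholders:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
# data_forecast is the DataFrame loaded in the question
feature_cols = ['Name', 'Power', 'FirstEventID', 'AlleventIds']
encoders = {}
X_enc = pd.DataFrame()
for col in feature_cols:
    encoders[col] = LabelEncoder()
    X_enc[col] = encoders[col].fit_transform(data_forecast[col].astype(str))
y_encoder = LabelEncoder()
y_enc = y_encoder.fit_transform(data_forecast['Possible_fix'].astype(str))
classifier = DecisionTreeClassifier()
classifier.fit(X_enc, y_enc)
# Decode predictions back to the original string labels
y_pred = classifier.predict(X_enc)
print(y_encoder.inverse_transform(y_pred))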

Using sklearn voting ensemble with partial fit

Can someone please explain how to use ensembles in sklearn with partial_fit?
I don't want to retrain my model.
Alternatively, can we pass pre-trained models for ensembling?
I have seen that VotingClassifier, for example, does not support training with partial_fit.
The mlxtend library has an EnsembleVoteClassifier implementation which allows you to pass in pre-fitted models. For example, if you have three pre-trained models clf1, clf2, and clf3, the following code would work:
from mlxtend.classifier import EnsembleVoteClassifier
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[1, 1, 1], fit_base_estimators=False)
When set to False, the fit_base_estimators argument in EnsembleVoteClassifier ensures that the classifiers are not refit.
In general, when looking for more advanced features that scikit-learn does not provide, look to mlxtend as a first point of reference.
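A brief usage note (echoed by a later answer below): even with fit_base_estimators=False, you still call fit once, since it records the class labels rather than retraining the base models, and then predict as usual (X and y here stand for your own data):
# fit() here only records the labels; the pre-trained models are untouched
eclf = eclf.fit(X, y)
print(eclf.predict(X))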
Workaround:
VotingClassifier checks that estimators_ is set in order to understand whether it is fitted, and uses the estimators in the estimators_ list for prediction.
If you have pre-trained classifiers, you can put them in estimators_ directly, as in the code below.
However, it also uses LabelEncoder, so it assumes labels are like 0, 1, 2, ... and you also need to set le_ and classes_ (see below).
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
clf_list = [clf1, clf2, clf3]
eclf = VotingClassifier(estimators = [('1' ,clf1), ('2', clf2), ('3', clf3)], voting='soft')
eclf.estimators_ = clf_list
eclf.le_ = LabelEncoder().fit(y)
eclf.classes_ = eclf.le_.classes_
# Now it will work without calling fit
eclf.predict(X)
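Here is a minimal self-contained sketch of that workaround with two pre-trained toy classifiers; the data and model choices are placeholders, and the behavior of this internal-attribute trick may vary across scikit-learn versions:
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([0, 0, 1, 1])
# Pre-train the base models separately (e.g. elsewhere, via partial_fit)
clf1 = LogisticRegression().fit(X, y)
clf2 = GaussianNB().fit(X, y)
eclf = VotingClassifier(estimators=[('lr', clf1), ('nb', clf2)], voting='soft')
eclf.estimators_ = [clf1, clf2]
eclf.le_ = LabelEncoder().fit(y)
eclf.classes_ = eclf.le_.classes_
print(eclf.predict(X))  # works without calling eclf.fit()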
Unfortunately, this is currently not possible in scikit-learn's VotingClassifier.
But you can use http://sebastianraschka.com/Articles/2014_ensemble_classifier.html (from which VotingClassifier is implemented) to try to implement your own voting classifier that can take pre-fitted models.
We can also look at the source code and modify it for our use:
from sklearn.preprocessing import LabelEncoder
import numpy as np
le_ = LabelEncoder()
# When you do partial_fit, the first fit of any classifier requires
# all available labels (output classes); you should supply all of the
# same labels here in y.
le_.fit(y)
# Fill below list with fitted or partially fitted estimators
clf_list = [clf1, clf2, clf3, ... ]
# Fill weights -> array-like, shape = [n_classifiers] or None
weights = [clf1_wgt, clf2_wgt, ... ]
weights = None
# For hard voting:
pred = np.asarray([clf.predict(X) for clf in clf_list]).T
pred = np.apply_along_axis(
    lambda x: np.argmax(np.bincount(x, weights=weights)),
    axis=1,
    arr=pred.astype('int'))
# For soft voting:
pred = np.asarray([clf.predict_proba(X) for clf in clf_list])
pred = np.average(pred, axis=0, weights=weights)
pred = np.argmax(pred, axis=1)
# Finally, reverse transform the labels for correct output:
pred = le_.inverse_transform(pred)
It's not too hard to implement the voting. Here's my implementation:
import numpy as np

class VotingClassifier(object):
    """Implements a voting classifier for pre-trained classifiers"""

    def __init__(self, estimators):
        self.estimators = estimators

    def predict(self, X):
        # Collect each estimator's predictions as columns
        Y = np.zeros([X.shape[0], len(self.estimators)], dtype=int)
        for i, clf in enumerate(self.estimators):
            Y[:, i] = clf.predict(X)
        # Apply majority voting row by row
        y = np.zeros(X.shape[0])
        for i in range(X.shape[0]):
            y[i] = np.argmax(np.bincount(Y[i, :]))
        return y
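A short usage sketch, assuming clf1 and clf2 are classifiers already trained elsewhere (hypothetical names). Note that np.bincount assumes integer class labels, so string labels would need a LabelEncoder round-trip as in the earlier workaround:
# Majority vote across the pre-trained models
voter = VotingClassifier(estimators=[clf1, clf2])
y_pred = voter.predict(X_test)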
The mlxtend implementation works, but you still need to call the fit function on the EnsembleVoteClassifier. The fit function doesn't really modify any parameters; it only checks the possible label values. In the example below, you have to give eclf2.fit an array containing all the values that appear in the original y (in this case 1 and 2); the X argument doesn't matter.
import numpy as np
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
for clf in (clf1, clf2, clf3):
    clf.fit(X, y)
eclf2 = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], voting="soft", refit=False)
eclf2.fit(None, np.array([1, 2]))
print(eclf2.predict(X))

Identifying filtered features after feature selection with scikit learn

Here is my Code for feature selection method in Python:
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
X.shape
(150, 4)
X_new = LinearSVC(C=0.01, penalty="l1", dual=False).fit_transform(X, y)
X_new.shape
(150, 3)
But after getting the new X (X_new), how do I know which variables were removed and which were kept in this new, updated variable? (Which one was removed, and which three are present in the data?)
The reason I need this identification is to apply the same filtering to new test data.
I modified your code a little bit. For each class, the features used can be seen by looking at the coefficients of LinearSVC. According to the documentation: coef_ : array, shape = [n_features] if n_classes == 2 else [n_classes, n_features]
As for new data, you just need to apply transform to it.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)
X_new = lsvc.fit_transform(X, y)
print(X_new.shape)
print(lsvc.coef_)
newData = np.random.rand(100, 4)
newData_X = lsvc.transform(newData)
print(newData_X.shape)
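Note that LinearSVC's transform/fit_transform shortcut was removed in later scikit-learn releases; the modern equivalent wraps the estimator in SelectFromModel, whose get_support() directly answers which features were kept. A minimal sketch under that assumption:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
iris = load_iris()
X, y = iris.data, iris.target
selector = SelectFromModel(LinearSVC(C=0.01, penalty="l1", dual=False, max_iter=10000))
X_new = selector.fit_transform(X, y)
# Boolean mask and names of the features that survived selection
print(selector.get_support())
print([name for name, kept in zip(iris.feature_names, selector.get_support()) if kept])
# Apply the same filtering to new data
print(selector.transform(np.random.rand(100, 4)).shape)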
