How to use a new data set on a trained model? - python

I am trying to run a previously trained model on a new data set to see how accurate the model is. I use the following code and get the error below. Would another method solve this problem? Thanks.
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
df = pd.read_excel('xxxx.xlsx')
enc = LabelEncoder()
X = df[df.columns[1:]]
Y = df[df.columns[0]].values.ravel()
Y2 = enc.fit_transform(Y)
df.insert(0, "Unit Status", Y2, True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y2, random_state = 0, test_size = 0.25)
clf = LinearSVC(random_state=0,dual=False, tol=1e-5)
clf.fit(X_train, Y_train)  # fit on the training split only, so the test accuracy is not inflated
Y_pred = clf.predict(X_test)
confusion_matrix(Y_test, Y_pred)
classifier_predictions = clf.predict(X_test)
print(accuracy_score(Y_test, classifier_predictions)*100)
df2 = pd.read_excel('xxxx_v2.xlsx')
y_pred=clf.predict(df2)
ValueError: could not convert string to float: '20-002'

Every column passed to the model must be numeric, or at least convertible to float. In the new dataframe the first and second columns hold strings that cannot be converted to numbers (the '20-002' in the error message is one such value), so the model cannot train or predict on this data. From looking at the data, you could use LabelEncoder on the second column and decide whether or not to use OneHotEncoder, but it looks to me that the first column doesn't contain categorical data. If the model needs the first column's data, you need to convert it to numbers somehow; otherwise just drop the column.
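A minimal sketch of the "just drop the column" route, assuming the new file has the same numeric feature columns as the training file plus an identifier column. The column names ("Unit ID", "temp", "load") and the tiny in-memory frames are hypothetical stand-ins for the two Excel files, which are not shown in the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the training spreadsheet.
train = pd.DataFrame({
    "Unit Status": ["ok", "fail", "ok", "fail"],
    "temp": [1.0, 5.0, 1.5, 6.0],
    "load": [0.2, 0.9, 0.1, 0.8],
})
feature_columns = ["temp", "load"]  # assumed numeric features

enc = LabelEncoder()
clf = LinearSVC(random_state=0, dual=False, tol=1e-5)
clf.fit(train[feature_columns], enc.fit_transform(train["Unit Status"]))

# Hypothetical stand-in for the new spreadsheet: it carries an extra
# string column ("Unit ID") that the model was never trained on.
new_data = pd.DataFrame({
    "Unit ID": ["20-005", "20-006"],
    "temp": [1.2, 5.5],
    "load": [0.15, 0.85],
})

# Select exactly the training features, in the same order, before predicting.
predicted_codes = clf.predict(new_data[feature_columns])
predicted_labels = enc.inverse_transform(predicted_codes)
print(predicted_labels)
```

Selecting the features by name rather than passing the whole dataframe also guards against the new file having its columns in a different order.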

Related

How to get name of selected features when there are several feature selection methods in sklearn pipeline?

I want to use several feature selection methods in a sklearn pipeline as below:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
I want to get name or column index of selected features. The point is that the 2nd feature selection step gets the output of the 1st feature selection step (not original X_train). Therefore, when I use methods like get_support() or get_feature_names_out() for the 2nd feature selection step, the feature names or indices don't match with the original input features.
vt = model['vt']
vt.get_feature_names_out()
vt.get_support()
kbest = model['kbest']
kbest.get_feature_names_out()
kbest.get_support()
For example, when I run vt.get_support(), I get a boolean array with 30 entries. But when I run kbest.get_support(), I get a boolean array with 14 entries. This means that the names or column indices of the data passed to the 2nd feature selection method were reset, so they no longer match the input of the 1st feature selection method.
How to solve this issue?
In case it is enough for you to get the names of the selected features without caring about which features are selected in which step**, the following might be an easy way to go.
You can just return your input X as a dataframe via the parameter as_frame set to True (X, y = load_breast_cancer(return_X_y=True, as_frame=True)). This will allow you to get feature names as strings, which in turn allows method .get_feature_names_out() to return the selected features with the original names. The same does not happen in case you work with a numpy array as they do not have explicit column names.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = Pipeline([('vt', VarianceThreshold(0.01)),
('kbest', SelectKBest(chi2, k=5)),
('gbc', GradientBoostingClassifier(random_state=0))])
model.fit(X_train, y_train)
model[:-1].get_feature_names_out()
** By the way, this also lets you get the original names of the features kept by the first transformer, but unfortunately not from the second one on its own, because the dataframe has become a numpy array by the time it reaches that step.
vt = model['vt']
vt.get_feature_names_out()
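If you need column indices rather than names, or you are working with a plain numpy array, one possible approach is to chain the get_support() masks yourself. This is only a sketch: it leaves out the classifier and fits the two selectors on the full dataset to keep the example short, and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Only the selection steps; the final estimator is omitted for brevity.
selectors = Pipeline([
    ("vt", VarianceThreshold(0.01)),
    ("kbest", SelectKBest(chi2, k=5)),
])
selectors.fit(X, y)

# Start with all original column indices and narrow them step by step.
# Each mask refers to the columns that survived the previous step.
original_idx = np.arange(X.shape[1])
for name, step in selectors.steps:
    original_idx = original_idx[step.get_support()]

print(original_idx)  # indices into the original 30 columns
```

The resulting indices can be checked by comparing X[:, original_idx] with selectors.transform(X).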

Using Python to build a linear regression model and find R2'd; cannot get the model to fit or predict

Some imports for several reasons
import pandas as pd
import numpy as np
I successfully split the data into test (30%) and train (70%) sets and separated the target:
X_train = df_train.drop(columns='Rating')
y_train = df_train.Rating
from sklearn.linear_model import LinearRegression
X_test = df_test.drop(columns='Rating')
y_test = df_test.Rating
Everything is fine to this point, then
linreg = LinearRegression()
linreg.fit(X_train, y_train)
ValueError: could not convert string to float: 'GAME'
Am positive the Rating column is a float
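Check the first row of your df: the header may be repeated there, in which case you can just train from the second row onward. More generally, the error means that some cell passed to fit holds the string 'GAME', so the offending column is probably one of the features rather than Rating.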
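To find out which columns are responsible, something like the following may help. It is a sketch on a made-up frame: the columns "Category" and "Reviews" are assumptions, since the real df_train is not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for df_train.
df_train = pd.DataFrame({
    "Category": ["GAME", "TOOLS", "GAME"],
    "Reviews": ["10", "25", "7"],   # numbers stored as text
    "Rating": [4.1, 3.9, 4.5],
})
X_train = df_train.drop(columns="Rating")

# Columns that are not stored as numbers.
non_numeric = X_train.select_dtypes(exclude="number").columns.tolist()

# Of those, the ones holding values that cannot be parsed as numbers at all.
unconvertible = [
    col for col in non_numeric
    if pd.to_numeric(X_train[col], errors="coerce").isna().any()
]

print(non_numeric)
print(unconvertible)
```

Columns in the first list but not the second only need a type conversion; columns in the second list have to be encoded or dropped before calling fit.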

Python label encoding : Decision tree classification

I'm really new to Python and am trying to run a decision tree model with the code below:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import sklearn as skl
data_forecast = pd.read_excel("./Forcast_data_Analytics.xlsx")
x = data_forecast[['Name','Power', 'FirstEventID','AlleventIds']]
y = data_forecast[['Possible_fix','Changes_Required']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.8)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
sample data:
Name   Power  FirstEventID  AlleventIds      Possible_fix  Changes_Required
India  I3000  10130-1       10130-1, 134-00  yes           Bug Fix
Can I do the decision tree classification without label encoding,
or do I need to encode my data first?
What is the best way to do this?
I want to treat everything as a string and encode it.
After classification, I also want to decode the labels.
I tried the below encoding method, which did not work:
from sklearn.preprocessing import LabelEncoder
vals = np.array(data_forecast)
LabelEncoder = LabelEncoder()
integer_encoded = LabelEncoder.fit_transform(vals)
Error:
Exception has occurred: ValueError
y should be a 1d array, got an array of shape (59, 23) instead.
What is the right way to do this?
How do i encode/decode my labels and use this?
The question is already old, but I'll try to help, it may be useful for someone else.
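The error is simple and is raised before the classifier ever sees the data. LabelEncoder only accepts a single column (a 1-dimensional array), and np.array(data_forecast) hands it the whole table, hence the shape (59, 23) in the message. Separately, y is easiest to handle as one single column, and you selected two here: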
y = data_forecast[['Possible_fix','Changes_Required']]
As for the encoding part, I'm no specialist, but what has worked for me is to load the data as a DataFrame "df" and then split off the features as df2 for X:
df2 = df.loc[:, df.columns != 'col_class']
And encode only X:
from sklearn.preprocessing import LabelEncoder
X = df2.apply(LabelEncoder().fit_transform)
y = df['col_class']
Hope it helps.
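Since the question also asks about decoding, here is a rough sketch that keeps the fitted encoders around so the predictions can be turned back into strings. It swaps in OrdinalEncoder for the feature columns (it encodes several columns at once) and uses a single target column; the four-row frame is made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample in the spirit of the question's data.
df = pd.DataFrame({
    "Name": ["India", "India", "Japan", "Japan"],
    "Power": ["I3000", "I2000", "I3000", "I2000"],
    "Possible_fix": ["yes", "no", "yes", "no"],
})

# Encode all feature columns in one go and keep the encoder for later.
feature_encoder = OrdinalEncoder()
X = feature_encoder.fit_transform(df[["Name", "Power"]])

# Encode the single target column; LabelEncoder expects a 1-D input.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Possible_fix"])

classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(X, y)

# Decode the integer predictions back into the original strings.
decoded = label_encoder.inverse_transform(classifier.predict(X))
print(list(decoded))
```

feature_encoder.inverse_transform(X) recovers the original feature strings in the same way.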

ValueError: could not convert string to float: '?'

I have tried to run a SVM program, and I got the above error. The code is here below. Please point out the error in it.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
data = pd.read_csv('risk_factors_cervical_cancer.csv')
X = np.array(data[[#some data elements]])
y = np.array(data[#some data elements])
print(X)
print(y)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=30)
classifier = svm.SVC()
classifier.fit(X_train, y_train) #the error occurs here
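y_pred = classifier.predict(X_test)  # predict with the fitted classifier, not the svm module
from sklearn.metrics import accuracy_score  # this import was missing
acc = accuracy_score(y_test, y_pred)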
As @Guimoute wrote, you always need to preprocess your data before training a machine learning algorithm on it. Try data.head(10) to get a first look at the data you are using. Your error occurs because some cells in your X columns contain the value "?". Replace it with some reasonable number, e.g. the mean of the column, in order to get better results.
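A small sketch of that replacement, on a made-up frame (the column names "Age" and "Smokes" are assumptions, not taken from the real file). Note that errors="coerce" turns every unparseable cell into NaN, not only "?".

```python
import pandas as pd

# Hypothetical stand-in for the CSV, with "?" marking missing values.
data = pd.DataFrame({
    "Age": ["18", "?", "30"],
    "Smokes": ["0", "1", "?"],
})

# Convert every column to numbers; anything unparseable becomes NaN.
cleaned = data.apply(pd.to_numeric, errors="coerce")

# Fill each missing value with the mean of its column.
cleaned = cleaned.fillna(cleaned.mean())

print(cleaned)
```

Alternatively, pd.read_csv accepts na_values="?" so that the placeholders are read as NaN from the start.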

Python: How can we match values of predicted and truth values of a regression model

We are trying to plot the predicted values and the truth values on the same graph after fitting a RandomForestRegressor in Python to predict the truth value. The dataset is a three-column CSV formatted as follows:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
Here is how we do the prediction.
import pandas as pd
import numpy as np
import glob, os
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
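from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import matplotlib.pyplot as plt  # needed for the plotting calls further down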
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data*.csv"))))
for i in range(1,10):
    df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(10)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
reg = RandomForestRegressor()  # squared error is the default; 'mse' was renamed 'squared_error' in scikit-learn 1.0
reg.fit(X_train,y_train)
modelPred_test = reg.predict(X_test)
print(modelPred_test)
For comparison, we wish to generate a plot before prediction and after prediction. For the truth value, we tried it with
fig, ax = plt.subplots()
ax.plot(df['time'].values, df['Y'].values)
We wish to plot the ground truth and the predictions in the same graph, with time as the x-axis and the value of Y as the y-axis. When we do
ax.plot(df['time'].values, modelPred_test)
We are getting the following error.
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension
This means that we have fewer predicted values than time stamps in our dataset. To verify this, we ran print(df['time'].values.shape) and print(modelPred_test.shape), which output (258523,) and (103410,) respectively. How can we find which time values correspond to the predicted values, so that we can use that subset of the time values in the plot command?
You have to set your data like the following.
X = df.drop('Y', axis=1)
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_test = X_test.drop('time', axis=1)
and then sort the datasets
index_values=range(0,len(y_test))
y_test.sort_index(inplace=True)
X_test.sort_index(inplace=True)
modelPred_test = reg.predict(X_test)
ax.plot(pd.Series(index_values), y_test.values)
finally, do the same plot for the predicted values of y. Hope this helps.
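You need to keep track of the indices for training and testing datasets. For example, you could define
train_index, test_index = train_test_split(df.index, test_size=0.40)
and then X_train = X[train_index], etc. (this positional indexing assumes df has a default 0..n-1 index, e.g. after df.reset_index(drop=True)).
Then, you could plot the results via ax.plot(df['time'][test_index].values, modelPred_test). The predictions come back in the same order as test_index, so no further masking of modelPred_test is needed.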
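Here is a rough sketch of the index-tracking idea on synthetic data. The frame below is a made-up stand-in for the CSV files, and the single feature column is an assumption to keep things short. It splits row positions rather than the arrays themselves and sorts the test positions so the points come out in time order.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"time": np.linspace(0.0, 1.0, 50), "X": np.arange(50)})
df["Y"] = 2.0 * df["X"] + rng.normal(size=50)

# Split row positions so every prediction can be traced back to its row.
positions = np.arange(len(df))
train_pos, test_pos = train_test_split(positions, test_size=0.40, random_state=0)
test_pos = np.sort(test_pos)  # chronological order for plotting

features = df[["X"]].to_numpy()
target = df["Y"].to_numpy()

reg = RandomForestRegressor(n_estimators=20, random_state=0)
reg.fit(features[train_pos], target[train_pos])

# Both arrays have one entry per test row, in the same order.
pred_test = reg.predict(features[test_pos])
time_test = df["time"].to_numpy()[test_pos]

print(time_test.shape, pred_test.shape)
```

With these two arrays, ax.plot(time_test, pred_test) can be drawn on the same axes as ax.plot(df['time'].values, df['Y'].values).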
