Python label encoding: Decision tree classification

I'm really new to Python and am trying to run a decision tree model with the code below:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import sklearn as skl
data_forecast = pd.read_excel("./Forcast_data_Analytics.xlsx")
x = data_forecast[['Name','Power', 'FirstEventID','AlleventIds']]
y = data_forecast[['Possible_fix','Changes_Required']]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.8)
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
sample data:
Name   Power  FirstEventID  AlleventIds      Possible_fix  Changes_Required
India  I3000  10130-1       10130-1, 134-00  yes           Bug Fix
Can I do the decision tree classification without label encoding, or do I need to encode my data before classification? What is the best way to do this?
I want to treat everything as strings and encode them. After classification, I also want to decode them.
I tried the below encoding method, which did not work:
from sklearn.preprocessing import LabelEncoder
vals = np.array(data_forecast)
LabelEncoder = LabelEncoder()
integer_encoded = LabelEncoder.fit_transform(vals)
Error:
Exception has occurred: ValueError
y should be a 1d array, got an array of shape (59, 23) instead.
What is the right way to do this?
How do I encode/decode my labels and use them?

The question is already old, but I'll try to help; it may be useful for someone else.
The error is simple and happened even before the encoding reached the classifier: y should be a single column (a 1-dimensional array), and you passed two here:
y = data_forecast[['Possible_fix','Changes_Required']]
About the encoding part, I'm not a specialist, but what I've done before and what worked was to load the data as a DataFrame df and later split off df2 for X:
df2 = df.loc[:, df.columns != 'col_class']
And encode only X:
from sklearn.preprocessing import LabelEncoder
X = df2.apply(LabelEncoder().fit_transform)
y = df['col_class']
Hope it helps.
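Since you also want to decode the predictions afterwards, here is a minimal sketch (not the only way to do it) that keeps one LabelEncoder per column in a dict, so that inverse_transform is available later. It assumes the data_forecast DataFrame and the column names from the question:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

feature_cols = ['Name', 'Power', 'FirstEventID', 'AlleventIds']
target_col = 'Possible_fix'  # a single 1-d target, as explained above

encoders = {}                # one fitted encoder per column, kept for decoding
X = pd.DataFrame(index=data_forecast.index)
for col in feature_cols:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(data_forecast[col].astype(str))

y_encoder = LabelEncoder()
y = y_encoder.fit_transform(data_forecast[target_col].astype(str))

# ... fit the classifier on X, y as before, then decode the predictions:
# y_pred_labels = y_encoder.inverse_transform(y_pred)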

Related

How to use a new data set on a trained model?

I am trying to use a new data set on a previously trained model to see how accurate the model is. I use the following code and receive the error below. Would another method solve this problem? Thanks.
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
%matplotlib inline
df = pd.read_excel('xxxx.xlsx')
enc = LabelEncoder()
X = df[df.columns[1:]]
Y = df[df.columns[0]].values.ravel()
Y2 = enc.fit_transform(Y)
df.insert(0, "Unit Status", Y2, True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y2, random_state = 0, test_size = 0.25)
clf = LinearSVC(random_state=0,dual=False, tol=1e-5)
clf.fit(X, Y2)
Y_pred = clf.predict(X_test)
confusion_matrix(Y_test, Y_pred)
classifier_predictions = clf.predict(X_test)
print(accuracy_score(Y_test, classifier_predictions)*100)
df2 = pd.read_excel('xxxx_v2.xlsx')
y_pred=clf.predict(df2)
ValueError: could not convert string to float: '20-002'
The data in the new dataframe must all be floats, or at least be convertible to floats. The first and second columns contain string data which cannot be converted to numbers, so the model cannot train or predict on this data. Looking at the data, you could use a LabelEncoder on the second column and decide whether or not to also use a OneHotEncoder, but it looks to me like the first column doesn't contain categorical data. If the model needs the first column's data, then you need to convert it to numbers somehow; otherwise just drop the column.
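As a rough sketch of that suggestion (the column names here are hypothetical, since the real ones aren't shown): fit the encoder on the categorical training column, drop the unusable string column, and apply exactly the same steps to the new file before predicting:
# hypothetical names: 'unit_id' is the non-categorical string column,
# 'unit_type' is the categorical one; adjust to the real columns
col_enc = LabelEncoder()
X = X.drop(columns=['unit_id'])
X['unit_type'] = col_enc.fit_transform(X['unit_type'])  # fit once, on training data
# ... train clf on X as before ...

df2 = pd.read_excel('xxxx_v2.xlsx')
df2 = df2.drop(columns=['unit_id'])                     # same steps on the new data
df2['unit_type'] = col_enc.transform(df2['unit_type'])  # reuse the fitted encoder
y_pred = clf.predict(df2)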

Can encode categorical data in train set but not in the test set

I need to encode the categorical values in my test set, but somehow it throws TypeError: argument must be a string or number. I don't know why this happens, because I could do it on my train set. They are the train/test feature sets, so they're exactly the same; what differentiates them is just the number of rows, of course. I don't know how to fix this; I have tried using a different LabelEncoder for each, but that still doesn't fix the error. Please, someone help me.
For your information, the categorical data is in the 8th column of both the train and test feature sets.
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
import scipy.stats as ss
avo_sales = pd.read_csv('avocados.csv')
avo_sales.rename(columns = {'4046':'small PLU sold',
                            '4225':'large PLU sold',
                            '4770':'xlarge PLU sold'},
                 inplace= True)
avo_sales.columns = avo_sales.columns.str.replace(' ','')
x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.fit_transform(X_test[:,8])
On the test set you should never use fit_transform, but only transform. And it seems that you're not applying to your test data the preprocessing you did on the training data; that is also a mistake.
EDIT
When you use fit_transform, for example SimpleImputer(strategy='most_frequent'), on your training data, you're calculating the most frequent value and imputing it into the rows containing NaN. This is fine. If you call fit_transform on your test set, you're cheating, because you're assuming you have a lot of instances from which to calculate the most frequent value (whereas you might instead be predicting on only one instance). The right thing to do is to impute the missing data using the most frequent value you found on the training set. This is done by using only transform. The same logic applies to every other fit_transform / transform pair in sklearn, for example when applying PCA or a CountVectorizer.
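Concretely, the end of the question's preprocessing would become something like this (a sketch reusing the variable names from the question; everything is fitted on the training set only and merely applied to the test set):
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()   # transform only

X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])                      # transform only

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
# note: le.transform raises on categories never seen in the training set
X_test[:,8] = le.transform(X_test[:,8])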

Sklearn (NLP text classifier newbie) - issue with shape and vectorizer, X and Y not matching up

I want to create a text classifier that looks at research abstracts and determines whether they are focused on access to care, based on a labeled dataset I have. The data source is an Excel spreadsheet with three fields (project_number, abstract, and accessclass) and 326 rows of abstracts. The accessclass is 1 for access-related and 0 for not access-related (not sure if this is relevant). Anyway, I tried following along with a tutorial but wanted to make it relevant by adding my own data, and I'm having some issues with my X and Y arrays. Any help is appreciated.
import pandas as pd
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
df = pd.read_excel("accessclasses.xlsx")
df.head()
#TFIDF vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
                             strip_accents='ascii', stop_words=stopset)
y = df.accessclass
x = vectorizer.fit_transform(df)
print(x.shape)
print(y.shape)
#above and below seem to be where the issue is.
x_train, x_test, y_train, y_test = train_test_split(x, y)
You are using your whole DataFrame to encode your predictor. Remember to use only the abstract column in the transformation (you could also fit the vectorizer on the corpus first and transform it afterwards).
Here's a solution:
y = df.accessclass
x = vectorizer.fit_transform(df.abstract)
The rest looks ok.
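If you also want to avoid fitting the vectorizer on documents that end up in the test split, a common variant (a sketch, not the only way) is to split the raw text first and fit the vocabulary on the training fold only:
x_train_text, x_test_text, y_train, y_test = train_test_split(df.abstract, df.accessclass)
x_train = vectorizer.fit_transform(x_train_text)  # learn the vocabulary on train only
x_test = vectorizer.transform(x_test_text)        # reuse it on the test fold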

ValueError: could not convert string to float: '?'

I have tried to run an SVM program, and I got the above error. The code is below; please point out the error in it.
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
data = pd.read_csv('risk_factors_cervical_cancer.csv')
X = np.array(data[[#some data elements]])
y = np.array(data[#some data elements])
print(X)
print(y)
print(X.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
random_state=30)
classifier = svm.SVC()
classifier.fit(X_train, y_train) #the error occurs here
y_pred = svm.predict(X_test)
acc = accuracy_score(y_test, y_pred)
As @Guimoute wrote, preprocessing your data is always necessary before training any machine learning algorithm on it. Try data.head(10) to get an overview of the data you are using. Your error occurs because there is a value "?" in your X columns. Replace it with some reasonable number, e.g. the mean of the column, to get better results.
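For example, one way to do that replacement (a sketch; feature_cols stands in for the column list elided in the question):
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv('risk_factors_cervical_cancer.csv')
data = data.replace('?', np.nan)                   # turn the placeholder into a real NaN
data = data.apply(pd.to_numeric, errors='coerce')  # make every column numeric
imp = SimpleImputer(strategy='mean')               # fill missing values with column means
X = imp.fit_transform(data[feature_cols])          # feature_cols: your selected columns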

scikit learn logistic regression model tfidfvectorizer

I am trying to create a logistic regression model using scikit-learn with the code below. I am using 9 columns for the features (X) and one for the label (Y). When trying to fit, I get the error "ValueError: Found input variables with inconsistent numbers of samples: [9, 560000]", even though the lengths of X and Y were the same just beforehand. If I use x.transpose() I get a different error: "AttributeError: 'int' object has no attribute 'lower'". I am assuming this has to do with the TfidfVectorizer; I am using it because 3 of the columns contain single words, and it wasn't working without it. Is this the right way to be doing this, or should I be converting the words in the columns separately and then using train_test_split? If not, why am I getting the errors and how can I fix them? Here's an example of the csv.
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data, Y, test_size=0.2, random_state=7)
tfidf_vectorizer = TfidfVectorizer()
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
What you are trying to do is unusual, because TfidfVectorizer is designed to extract numerical features from text. But if you don't really care and just want to make your code work, one way to do it is by converting your numerical data to strings and configuring TfidfVectorizer to accept tokenized data:
import pandas as pd
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
cols = ['srcip','sport','dstip','dsport','proto','service','smeansz','dmeansz','attack_cat','Label']
df = pd.read_csv("UNSW-NB15_1.csv",header=None, names=cols, encoding = "UTF-8",low_memory=False)
df.to_csv('netraf.csv')
csv = 'netraf.csv'
my_df = pd.read_csv(csv)
# convert all columns to string like we don't care
for col in my_df.columns:
    my_df[col] = my_df[col].astype(str)
# replace nan with empty string like we don't care
for col in my_df.columns[my_df.isna().any()].tolist():
    my_df.loc[:, col].fillna('', inplace=True)
x_features = my_df.columns[1:10]
x_data = my_df[x_features]
Y = my_df["Label"]
x_train, x_validation, y_train, y_validation = model_selection.train_test_split(
    x_data.values, Y.values, test_size=0.2, random_state=7)
# configure TfidfVectorizer to accept tokenized data
# reference http://www.davidsbatista.net/blog/2018/02/28/TfidfVectorizer/
tfidf_vectorizer = TfidfVectorizer(
    analyzer='word',
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None)
lr = LogisticRegression()
tfidf_lr_pipe = Pipeline([('tfidf', tfidf_vectorizer), ('lr', lr)])
tfidf_lr_pipe.fit(x_train, y_train)
That being said, I'd recommend using another method for feature engineering on your dataset. For example, you can try encoding your nominal data (e.g. IP, port) as numerical values.
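As a sketch of that more conventional route, using the column names from the cols list above: one-hot encode the nominal columns and pass the already-numeric ones straight through, instead of forcing everything through TF-IDF:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

nominal_cols = ['srcip', 'dstip', 'proto', 'service']     # nominal features
numeric_cols = ['sport', 'dsport', 'smeansz', 'dmeansz']  # numeric features

pre = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), nominal_cols)],
    remainder='passthrough')  # numeric columns pass through unchanged

pipe = Pipeline([('pre', pre), ('lr', LogisticRegression())])
pipe.fit(x_data[nominal_cols + numeric_cols], Y)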
