I am trying to use the "class_weight" parameter in scikit-learn with the binary svm.SVC classifier. What I am basically trying to do is vary the precision of class 1 by changing the class weights.
Unfortunately, after weeks of trying, I have not been able to achieve this, which makes me wonder whether there might still be an inconsistency in sklearn...
Here is my minimal code example:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn import svm
from sklearn.metrics import confusion_matrix

data = pd.read_csv("...", header=0, delimiter=";", quoting=3, low_memory=False)

def Train_Test_Split(test_size, dataframe, name_y, name_X):
    # All columns from name_X onwards are the explanatory variables
    X = dataframe.loc[:, name_X:]  # .ix is deprecated; use .loc
    y = np.asarray(dataframe[name_y], dtype=int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y)
    return X_train, y_train, X_test, y_test

def Score(y_test, y_pred):
    # With labels=[1, 0], a[0][0] is the true positives of class 1
    a = confusion_matrix(y_test, y_pred, labels=[1, 0])
    Precision_stables = a[0][0] / (a[0][0] + a[1][0])    # precision for class 1
    Precision_instables = a[1][1] / (a[1][1] + a[0][1])  # precision for class 0
    return Precision_stables, Precision_instables

def Eval_svm(class_ponder, testsize, dataframe, name_y, name_X):
    X_train, y_train, X_test, y_test = Train_Test_Split(testsize, dataframe, name_y, name_X)
    clf_svm = svm.SVC(kernel='linear', class_weight=class_ponder, probability=True)
    clf_svm_optimal = clf_svm.fit(X_train, y_train)
    y_pred_svm = clf_svm_optimal.predict(X_test)
    PRS_svm, PRI_svm = Score(y_test, y_pred_svm)
    return PRS_svm, PRI_svm

name_y = "...variableofinterest..."
name_x = "...explanatoryvariables..."
a, b = Eval_svm({0: 100, 1: 1}, 0.3, data, name_y, name_x)
print(a, b)
No matter what weighting I choose, the precision of class 1 (or even class 0) does not change at all.
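For context, here is a minimal self-contained reproduction on synthetic imbalanced data (a hypothetical stand-in for my real dataset) where I would expect the confusion matrix to shift as the weights change:

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in: imbalanced, partially overlapping classes
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.8, 0.2],
                           class_sep=0.8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

# Sweep a few weightings and watch the confusion matrix move
for w in [{0: 1, 1: 1}, {0: 1, 1: 10}, {0: 10, 1: 1}]:
    clf = svm.SVC(kernel='linear', class_weight=w).fit(X_train, y_train)
    print(w, confusion_matrix(y_test, clf.predict(X_test), labels=[1, 0]))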
Could someone help me here? It's kind of exasperating...
Thank you very much in advance!
Best regards,
F
I want to merge the predicted results on my test data back onto X_test. I was able to merge them with y_test, but since my X_test is a corpus I'm not sure how to identify the indexes to merge on.
My code is as below:
def lr_model(df):
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import pandas as pd
    import numpy as np  # needed for np.concatenate below

    # Create corpus as a list
    corpus = df['text'].tolist()
    cv = CountVectorizer()
    X = cv.fit_transform(corpus).toarray()
    y = df.iloc[:, -1].values

    # Splitting into testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Train Logistic Regression on the training set
    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)

    # Predicting the test set results
    y_pred = classifier.predict(X_test)

    # Merge true vs predicted labels
    true_vs_pred = pd.DataFrame(np.concatenate((y_pred.reshape(len(y_pred), 1),
                                                y_test.reshape(len(y_test), 1)), 1))
    return true_vs_pred
This gives me y_test and y_pred, but I'm not sure how I can add X_test (the ids of the original data frame rows) to this.
Any guidance is much appreciated. Thanks.
Using a pipeline can help you link the original X_test with the prediction:
def lr_model(df):
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    import pandas as pd
    from sklearn.pipeline import Pipeline

    # Defining X and y: keep X as the raw text Series so it retains df's index
    cv = CountVectorizer()
    X = df['text']
    y = df.iloc[:, -1].values

    # Splitting into testing and training
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Create a pipeline that vectorizes inside, instead of vectorizing up front
    pipeline = Pipeline([
        ('CountVectorizer', cv),
        ('LogisticRegression', LogisticRegression(random_state=0)),
    ])

    # Train pipeline on the training set
    pipeline.fit(X_train, y_train)

    # Predicting the test set results
    y_pred = pipeline.predict(X_test)
    return X_test, y_test, y_pred
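From there, because X_test stays a pandas Series carrying the original index of df, the predictions line up with the source rows directly; a minimal usage sketch (assuming df has a 'text' column and the label in its last column, as above):

import pandas as pd

X_test, y_test, y_pred = lr_model(df)

# X_test.index holds the original row ids of df, so everything aligns
results = pd.DataFrame({'text': X_test,
                        'true_label': y_test,
                        'predicted_label': y_pred},
                       index=X_test.index)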
I was trying to make a program to predict the runs made by a cricketer, using a CSV file of data I made myself. The code is:
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split
#Data
data = pd.read_csv('Rohit Sharma.csv')
X = [['against','wickets','currentrun','weather','ball','over']]
Y = ['runsmade']
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.33, train_size=None, random_state=42)
reg = LinearRegression()
reg.fit(x_train,y_train)
a = reg.predict(x_test)
print(a)
print(data)
But it showed an error:
ValueError: With n_samples=1, test_size=0.33 and train_size=None, the resulting
train set will be empty. Adjust any of the aforementioned parameters
How to fix it?
Try this. It looks like you made an error while selecting the columns of the data: X and Y were plain Python lists of column names (X is a list containing one inner list, hence n_samples=1), not the actual columns. Select them from the DataFrame instead. See below.
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.model_selection import train_test_split
#Data
data = pd.read_csv('Rohit Sharma.csv')
X = data[['against','wickets','currentrun','weather','ball','over']].to_numpy()
Y = data['runsmade'].to_numpy()
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.33, random_state=42)
reg = LinearRegression()
reg.fit(x_train,y_train)
a = reg.predict(x_test)
print(a)
print(data)
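One caveat: if columns such as 'against' or 'weather' hold strings rather than numbers (an assumption, since the CSV is homemade), LinearRegression will fail on them. A hedged sketch of one-hot encoding those columns first:

# Hypothetical fix, only needed if 'against' and 'weather' are categorical strings
X = pd.get_dummies(data[['against', 'wickets', 'currentrun',
                         'weather', 'ball', 'over']],
                   columns=['against', 'weather']).to_numpy()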
I am starting to write a machine learning model. I have a Y_train dataset containing the labels, with 5 classes, and an X_train dataset containing the samples. I am trying to build the model using logistic regression.
X_train (shape (560, 20531)) and Y_train (shape (560, 5)) have the same number of rows.
I have seen a few posts about the same problem, but I have not been able to solve it.
I don't know how to correct this error; can you help me please?
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical  # assumed source of to_categorical

X = pd.read_csv('/Users/lottie/desktop/data.csv', header=None, skiprows=[0])
Y = pd.read_csv('/Users/lottie/desktop/labels.csv', header=None)

Y_encoded = list()
for i in Y.loc[0:, 1]:
    if i == 'BRCA': Y_encoded.append(0)
    if i == 'KIRC': Y_encoded.append(1)
    if i == 'COAD': Y_encoded.append(2)
    if i == 'LUAD': Y_encoded.append(3)
    if i == 'PRAD': Y_encoded.append(4)
Y_bis = to_categorical(Y_encoded)

# Separation of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_bis, test_size=0.30, random_state=42)
regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
X_train = X_train.iloc[:, 1:]

# Train model
train_train = regression_log.fit(X_train, Y_train)
You get that error because your labels are one-hot encoded: to_categorical turns them into a 2-D array, while scikit-learn's LogisticRegression expects a 1-D array of class labels. Use a label encoder to encode them into 0, 1, 2, ...; check out the LabelEncoder help page from scikit-learn. Below would be an implementation using an example dataset similar to yours:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Y = pd.DataFrame({'label':np.random.choice(['BRCA','KIRC','COAD','LUAD','PRAD'],560)})
X = pd.DataFrame(np.random.normal(0,1,(560,5)))
Y_encoded = le.fit_transform(Y['label'])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_encoded, test_size=0.30, random_state=42)
regression_log = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
X_train = X_train.iloc[:, 1:]  # kept from your code: drops the first feature column
train_train = regression_log.fit(X_train, Y_train)
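To map the integer predictions back to the original class names, LabelEncoder keeps the mapping; for example:

# Recover the original string labels for held-out predictions
Y_pred = regression_log.predict(X_test.iloc[:, 1:])  # match the column slice applied to X_train
print(le.inverse_transform(Y_pred)[:5])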
This is for an assignment where the SVM method has to be used for model accuracy.
There were 3 parts; I wrote the code below:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
print(X_train.shape)
print(X_test.shape)
from sklearn.svm import SVC
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
But after this, the next part of the question is as below:
Perform standardization of digits.data and store the transformed data in the variable digits_standardized.
Hint: Use the required utility from sklearn.preprocessing. Once again, split digits_standardized into two sets named X_train and X_test. Also, split digits.target into two sets Y_train and Y_test.
Hint: Use the train_test_split method from sklearn.model_selection; set random_state to 30; and perform stratified sampling. Build another SVM classifier from the X_train set and Y_train labels, with default parameters. Name the model svm_clf2.
Evaluate the model accuracy on the testing data set and print its score.
On top of the above code, I tried writing this, but it seems to be failing. Can anyone help with how the data can be standardized?
std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)
svm_clf2 = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
Tried the below. Seems to be working.
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
digits = datasets.load_digits()
X = digits.data
scaler = StandardScaler()
scaler.fit(X)
digits_standardized = scaler.transform(X)
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(digits_standardized, y, random_state=30, stratify=y)
#print(X_train.shape)
#print(X_test.shape)
from sklearn.svm import SVC
svm_clf2 = SVC().fit(X_train, y_train)
print("Accuracy ",svm_clf2.score(X_test,y_test))
Try this as the final code; it includes all the tasks:
import sklearn.datasets as datasets
import sklearn.model_selection as ms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=30, stratify=y)
print(X_train.shape)
print(X_test.shape)
svm_clf = SVC().fit(X_train, y_train)
print(svm_clf.score(X_test,y_test))
scaler = StandardScaler()
scaler.fit(X)
digits_standardized = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(digits_standardized, y, random_state=30, stratify=y)
svm_clf2 = SVC().fit(X_train, y_train)
print(svm_clf2.score(X_test,y_test))
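One caveat worth noting: both snippets fit the scaler on all of digits.data before splitting, so statistics from the test rows leak into the scaling. Outside the constraints of the assignment, you would fit the scaler on the training split only; a minimal sketch:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=30, stratify=digits.target)

scaler = StandardScaler().fit(X_train)  # fit on the training split only
svm_clf2 = SVC().fit(scaler.transform(X_train), y_train)
print(svm_clf2.score(scaler.transform(X_test), y_test))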
I am using a Random Forest classifier for classification, and in each iteration I get different results. My code is as follows:
input_file = 'sample.csv'
df1 = pd.read_csv(input_file)
df2 = pd.read_csv(input_file)
X=df1.drop(['lable'], axis=1) # Features
y=df2['lable'] # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
As suggested by other answers, I added the n_estimators and random_state parameters. However, it did not work for me.
I have attached the csv file here:
I am happy to provide more details if needed.
You need to set the random state for the train-test splitting as well.
The following code will give you reproducible results. Note that the recommended approach is not to tune the random_state value to improve performance.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
df1=pd.read_csv('sample.csv')
X=df1.drop(['lable'], axis=1) # Features
y=df1['lable'] # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=5)
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.6777777777777778
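If the goal is a stable estimate of performance rather than one fixed split, cross-validation averages out the split-to-split variation; a minimal sketch reusing the clf, X, and y defined above:

from sklearn.model_selection import cross_val_score

# Mean accuracy over 5 different train/test splits
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print("Accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))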