scikit-learn SVR giving same values for predictions (have tried scaling) - python

I have a battery dataframe whose rows represent various cycles, with a set of features for each cycle.
As an example, row 1:
df = pd.DataFrame(columns=['Ecell_V', 'I_mA', 'EnergyCharge_W_h', 'QCharge_mA_h',
                           'EnergyDischarge_W_h', 'QDischarge_mA_h', 'Temperature__C',
                           'cycleNumber', 'SOH', 'Cell'])
df.loc[0] = [3.730646, 2988.8713, 0.185061, 49.724845, 0.0, 0.0, 27.5, 2, 0.99, 'VAH11']
There are 600,000 rows.
I am trying to predict the value for SOH as follows:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression # for building a linear regression model
from sklearn.svm import SVR # for building SVR model
from sklearn.preprocessing import MinMaxScaler
train_data = pd.read_csv("train_data.csv")
train_cell = train_data.pop('Cell')
# reduce size of df train for comp purposes
train_data = train_data.iloc[::20, :]
train_data = train_data.reset_index(drop=True)
#remove unwanted features
train_data.pop('Ns')
train_data.pop('time_s')
#scale the data
scaler = MinMaxScaler()
train_data_scaled = scaler.fit_transform(train_data)
#return to df
train_data_scaled = pd.DataFrame(train_data_scaled,
                                 columns=['Ecell_V', 'I_mA', 'EnergyCharge_W_h', 'QCharge_mA_h',
                                          'EnergyDischarge_W_h', 'QDischarge_mA_h', 'Temperature__C',
                                          'cycleNumber', 'SOH'])
train_data_scaled
#unscale target
train_data_scaled['SOH'] = train_data['SOH']
train_data_scaled
#split target and input
X = train_data_scaled.drop('SOH', axis=1)
y = train_data_scaled['SOH'].values
#model
model = SVR(kernel='rbf', C=100, epsilon=1)
svr = model.fit(X, y)
#predict model
pred = model.predict(X)
Now returning `pred` gives the same prediction for each row:
array([0.89976814, 0.89976814, 0.89976814, ..., 0.89976814, 0.89976814,
0.89976814])
Why is this happening?

Using StandardScaler() on the X and y data corrected this issue, with inverse_transform called afterwards to return the predictions to the original scale. (A likely reason for the constant output: with epsilon=1 and SOH values all lying roughly in [0, 1], every training target fits inside the epsilon-insensitive tube, so the SVR is free to fit an essentially flat function; once y is standardized, its variation is larger than epsilon.)
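A minimal sketch of that fix, assuming the X and y arrays built above (the scaler variable names are mine, and epsilon is left at its default of 0.1):
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
# scale inputs and target separately so the target can be inverted later
x_scaler = StandardScaler()
y_scaler = StandardScaler()
X_std = x_scaler.fit_transform(X)
y_std = y_scaler.fit_transform(y.reshape(-1, 1)).ravel()
# with a standardized target, the default epsilon=0.1 no longer
# swallows all of the variation in y
model = SVR(kernel='rbf', C=100)
svr = model.fit(X_std, y_std)
# predict, then invert the target scaling to get SOH in original units
pred = y_scaler.inverse_transform(model.predict(X_std).reshape(-1, 1)).ravel()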

Related

NLP classification with sparse and numerical features crashes

I have a dataset of 10 million English shows, which has been cleaned and lemmatized, and their classification into different category types such as comedy, documentary, action, etc.
I also have a feature called duration, which is the length of the TV show.
Data can be found here
I perform TF-IDF vectorization on the titles, which returns a sparse matrix, and normalization on the duration column.
Then I want to feed the data to a logistic regression classifier.
Side question: I want to know if there's a better way to handle combining a sparse matrix and a numerical column (see the sparse hstack sketch after the code below).
When I try to combine them using todense() or toarray(), it works.
But when I then pass the result to the logistic regression function, the notebook crashes. If I don't include the duration column, so that I never have to call toarray() or todense(), it works perfectly. Is this a memory issue?
This is my code:
import os
import pandas as pd
from sklearn import metrics
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
def normalize(df, col=''):
    mms = MinMaxScaler()
    mms_col = mms.fit_transform(df[[col]])
    return mms_col
def tfidf(X, col=''):
    tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=10000)
    return tfidf_vectorizer.fit_transform(X[col])
def get_training_data(df):
    df = shuffle(pd.read_csv(df).dropna())
    data = df[['name_title', 'Duration']]
    X_duration = normalize(data, col='Duration')
    X_sparse = tfidf(data, col='name_title')
    X = pd.DataFrame(X_sparse.toarray())
    X['Duration'] = X_duration
    y = df['target']
    return X, y
def logistic_regression(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lr = LogisticRegression(C=100.0, random_state=1, solver='lbfgs', multi_class='ovr')
    lr.fit(X_train, y_train)
    y_predict = lr.predict(X_test)
    print(y_predict)
    print("Logistic Regression Accuracy %.3f" % metrics.accuracy_score(y_test, y_predict))
data_path = '../data/'
X, y = get_training_data(os.path.join(data_path, 'podcasts_en_processed.csv'))
print(X.shape) # this prints (971426, 10001)
logistic_regression(X, y)
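No answer is shown here, but the crash is very likely memory: densifying the TF-IDF matrix with toarray() turns a (971426, 10000) sparse matrix into roughly 78 GB of dense float64. One common way around it is to keep everything sparse and append the numeric column with scipy.sparse.hstack; a sketch reusing the helpers above (the function name get_training_data_sparse is my addition):
from scipy.sparse import csr_matrix, hstack
def get_training_data_sparse(path):
    # same steps as get_training_data, but never densify the TF-IDF matrix
    df = shuffle(pd.read_csv(path).dropna())
    data = df[['name_title', 'Duration']]
    X_duration = normalize(data, col='Duration')  # dense (n_samples, 1)
    X_sparse = tfidf(data, col='name_title')      # sparse (n_samples, 10000)
    # append the scaled duration as one extra sparse column
    X = hstack([X_sparse, csr_matrix(X_duration)]).tocsr()
    y = df['target']
    return X, y
LogisticRegression (and train_test_split) accept scipy sparse matrices directly, so logistic_regression(X, y) should work unchanged on this input.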

how to use numpy mutual information correctly

I want to use principal component analysis-mutual information (PCA-MI) to obtain a data representation from a source input (values from a smart insole) and an output variable (values from a force plate). PCA is used to determine the principal components, keeping enough of them that the cumulative variance exceeds 98% of the source information measured from the 89 insole sensors. MI is generally used in the selection of input variables for predictive models because it is a good indicator of the relationship between input and output variables. I want to get results like the flowchart below.
I then tried the code below, but I can't reproduce what's in the flowchart.
import numpy as np
import pandas as pd
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load the dataset as a pandas DataFrame and return the target array
def load_dataset(filename):
    data = read_csv(filename, header=None)
    # retrieve numpy array
    y = data.values
    return y
# load the dataset as a pandas DataFrame and return the input array
def load_dataset2(filename):
    data2 = read_csv(filename, header=None)
    # retrieve numpy array
    X = data2.values
    return X
# feature selection
def select_features(X_train, y_train, X_test):
    # configure to select a subset of features
    fs = SelectKBest(score_func=mutual_info_classif, k=4)
    # learn relationship from training data
    fs.fit(X_train, y_train)
    # transform train input data
    X_train_fs = fs.transform(X_train)
    # transform test input data
    X_test_fs = fs.transform(X_test)
    return X_train_fs, X_test_fs, fs
# load the dataset
Insole = pd.read_csv('1119_Rwalk40s1_list.txt', header=None, low_memory=False)
SIData = np.asarray(Insole)
df = pd.read_csv('1119_Rwalk40s1.csv', low_memory=False)
columns = ['Fx','Fy','Fz','Mx','My','Mz']
selected_df = df[columns]
FCDatas = selected_df
SmartInsole = np.array(SIData)
FCData = np.array(FCDatas)
scaler_x = MinMaxScaler(feature_range=(0, 1))
scaler_x.fit(SmartInsole)
xscale = scaler_x.transform(SmartInsole)
scaler_y = MinMaxScaler(feature_range=(0, 1))
scaler_y.fit(FCData)
yscale = scaler_y.transform(FCData)
SIDataPCA = xscale
pca = PCA(n_components=89)
pca.fit(SIDataPCA)
SIdata_pca = pca.transform(SIDataPCA)
X = SIdata_pca
y = yscale
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LogisticRegression(solver='liblinear')
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))
How can I get the correct PCA-MI result data?
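No answer is shown above, but two things in the posted code stand out: mutual_info_classif (and accuracy_score) expect a discrete 1-D target, while yscale here is continuous with six columns, and PCA(n_components=89) keeps every component rather than applying the 98% cumulative-variance cut-off described. A sketch of how those two steps could look for a continuous target, assuming xscale, yscale, and columns as defined above (the per-channel loop is my addition, using sklearn's mutual_info_regression):
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
# keep just enough components to explain 98% of the variance
pca = PCA(n_components=0.98)
X_pca = pca.fit_transform(xscale)
print('components kept:', pca.n_components_)
# mutual information is defined per scalar target, so score each
# force-plate channel (Fx, Fy, ..., Mz) separately
for i, name in enumerate(columns):
    mi = mutual_info_regression(X_pca, yscale[:, i])
    print(name, mi.round(3))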

confidence interval for random forest regressor

I'm using a Kaggle dataset (https://www.kaggle.com/datasets/harlfoxem/housesalesprediction) to make a prediction on house prices.
This is the code I used, and so far so good.
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.ensemble import RandomForestRegressor
#import dataset
dataset = pd.read_csv(path_to_dataset)
dataset.head()
dataset['date'] = pd.to_datetime(dataset['date']) #convert date in datetime
#house_age is a new feature
dataset["house_age"] = dataset["date"].dt.year - dataset['yr_built']
#drop features that are no longer needed
dataset=dataset.drop('date', axis=1)
dataset=dataset.drop('yr_built', axis=1)
dataset = dataset.drop(["id"],axis=1)
train, test = train_test_split(dataset, test_size=0.3, random_state=43)
xtrain = train.drop(['price'], axis = 1) #train array without price
ytrain = train['price'] #train array with price
xtest = test.drop(['price'], axis = 1) #test array without price
ytest = test['price'] #test array with price
reg = RandomForestRegressor()
reg.fit(xtrain,ytrain)
pred = reg.predict(xtest)
print("Score: ",r2_score(ytrain, reg.predict(xtrain)))
print("Score: ",r2_score(ytest, pred))
print('MSE: ', metrics.mean_squared_error(ytest, pred))
Now, however, I would like to calculate and draw a confidence interval for the predictions made with my model.
I have already tried to look at many articles and libraries for several hours but I have not yet been able to find a solution that works for my case.
These are a couple of the references I followed but with little success:
http://contrib.scikit-learn.org/forest-confidence-interval/auto_examples/plot_mpg.html#sphx-glr-auto-examples-plot-mpg-py
https://scikit-garden.github.io/examples/QuantileRegressionForests/#quantile-regression-forests_1
Does anyone know how to create a confidence interval for this situation?
To construct confidence intervals, you can use the quantile-forest package. Using the RandomForestQuantileRegressor method in the package, you can specify quantiles to estimate during training, which can then be used to construct intervals.
Here's an example that extends your code with the above package to do this:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from quantile_forest import RandomForestQuantileRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
# import dataset
dataset = pd.read_csv(path_to_dataset)
dataset.head()
dataset['date'] = pd.to_datetime(dataset['date']) # convert date in datetime
# house_age is a new feature
dataset['house_age'] = dataset["date"].dt.year - dataset['yr_built']
# drop features that are no longer needed
dataset = dataset.drop('date', axis=1)
dataset = dataset.drop('yr_built', axis=1)
dataset = dataset.drop(['id'], axis=1)
train, test = train_test_split(dataset, test_size=0.3, random_state=43)
x_train = train.drop(['price'], axis=1) # train array without price
y_train = train['price'] # train array with price
x_test = test.drop(['price'], axis=1) # test array without price
y_test = test['price'] # test array with price
reg = RandomForestQuantileRegressor(n_estimators=100, random_state=0)
reg.fit(x_train, y_train)
# Get predictions at 95% prediction intervals and median.
y_pred = reg.predict(x_test, quantiles=[0.025, 0.5, 0.975])
def plot_intervals(y_true, y_pred_lower, y_pred_upper):
    fig = plt.figure(figsize=(10, 4))
    y_pred_interval = y_pred_upper - y_pred_lower
    sort_idx = np.argsort(y_pred_interval)
    y_true = y_true[sort_idx]
    y_pred_lower = y_pred_lower[sort_idx]
    y_pred_upper = y_pred_upper[sort_idx]
    # Center data, with the mean of the prediction interval at 0.
    mean = (y_pred_lower + y_pred_upper) / 2
    y_true -= mean
    y_pred_lower -= mean
    y_pred_upper -= mean
    plt.plot(y_true, marker='.', ms=5, c='r', lw=0)
    plt.fill_between(
        np.arange(len(y_pred_upper)),
        y_pred_lower,
        y_pred_upper,
        alpha=0.2,
        color='gray',
    )
    plt.plot(np.arange(len(y_true)), y_pred_lower, marker='_', c='0.2', lw=0)
    plt.plot(np.arange(len(y_true)), y_pred_upper, marker='_', c='0.2', lw=0)
    plt.xlim([0, len(y_true)])
    plt.xlabel('Ordered Samples')
    plt.ylabel('Observed Values and Prediction Intervals (Centered)')
    plt.show()
plot_intervals(y_test.values, y_pred[:, 0], y_pred[:, 2])
print('Score: ', r2_score(y_train, reg.predict(x_train)))
print('Score: ', r2_score(y_test, y_pred[:, 1]))
print('MSE: ', mean_squared_error(y_test, y_pred[:, 1]))
The code plots the generated intervals from smallest to largest along with the observed values.

How to output Shap values in probability and make force_plot from binary classifier

I need to plot how each feature impacts the predicted probability for each sample from my LightGBM binary classifier. So I need to output SHAP values in probability, instead of raw SHAP values, but there does not appear to be any option to output probabilities.
The example code below is what I use to generate a dataframe of SHAP values and do a force_plot for the first data sample. Does anyone know how I should modify the code to change the output?
I'm new to SHAP values and the shap package. Thanks a lot in advance.
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[:,:,class_idx].values[row_idx]
shap.force_plot(base_value=expected_value, shap_values=shap_value, features=X_train.iloc[row_idx, :], matplotlib=True)
# dataframe of shap values for class 1
shap_df = pd.DataFrame(shap_values[:, :, 1].values, columns=shap_values.feature_names)
TL;DR:
You can achieve plotting results in probability space with link="logit" in the force_plot method:
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit
shap.initjs()
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer_raw = shap.TreeExplainer(model)
shap_values = explainer_raw(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
expected_value = explainer_raw.expected_value[class_idx]
shap_value = shap_values[:, :, class_idx].values[row_idx]
shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    link="logit",
)
Expected output:
Alternatively, you may achieve the same by explicitly specifying the model_output="probability" you're interested in explaining when constructing the explainer:
explainer = shap.TreeExplainer(
    model,
    data=X_train,
    feature_perturbation="interventional",
    model_output="probability",
)
shap_values = explainer(X_train)
# force plot of first row for class 1
class_idx = 1
row_idx = 0
shap_value = shap_values.values[row_idx]
shap.force_plot(
    base_value=explainer.expected_value,  # probability-space base value from this explainer
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :]
)
Expected output:
However, it might be more interesting for understanding what's happening here to find out where these figures come from:
Our target proba for the point of interest:
model_proba = model.predict_proba(X_train.iloc[[row_idx]])
model_proba
# array([[0.00275887, 0.99724113]])
Base case raw from model given X_train as background (note, LightGBM outputs raw for class 1):
model.predict(X_train, raw_score=True).mean()
# 2.4839751932445577
Base case raw from SHAP (note, they are symmetric):
bv = explainer_raw(X_train).base_values[0]
bv
# array([-2.48397519, 2.48397519])
Raw SHAP values for the point of interest:
sv_0 = explainer_raw(X_train).values[row_idx].sum(0)
sv_0
# array([-3.40619584, 3.40619584])
Proba inferred from SHAP values (via sigmoid):
shap_proba = expit(bv + sv_0)
shap_proba
# array([0.00275887, 0.99724113])
Check:
assert np.allclose(model_proba, shap_proba)
Please ask questions if something is not clear.
Side notes
Proba might be misleading if you're analyzing the raw size effect of different features, because the sigmoid is non-linear and saturates after reaching a certain threshold.
Some people expect to see SHAP values in probability space as well, but this is not feasible, because:
SHAP values are additive by construction (to be precise, SHapley Additive exPlanations are average marginal contributions over all possible feature coalitions), and
exp(a + b) != exp(a) + exp(b), so additivity does not survive a non-linear transform (see the quick check below).
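A quick numeric illustration of both side notes (the numbers are illustrative only, not taken from the model above):
import numpy as np
from scipy.special import expit
# saturation: equal raw-score increments give shrinking probability gains
print(expit(2.0), expit(4.0), expit(6.0))
# 0.8807970779778823 0.9820137900379085 0.9975273768433653
# additivity in raw space does not survive a non-linear transform
a, b = 1.0, 2.0
print(np.exp(a + b), np.exp(a) + np.exp(b))  # 20.0855... vs 10.1073...
print(expit(a + b), expit(a) + expit(b))     # 0.9525... vs 1.6118...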
You may find useful:
Feature importance in a binary classification and extracting SHAP values for one of the classes only answer
How to interpret base_value of GBT classifier when using SHAP? answer
You can consider running your output values through a softmax() function. For reference, it is defined as:
def get_softmax_probabilities(x):
    return np.exp(x) / np.sum(np.exp(x))
and there is a scipy implementation as well:
from scipy.special import softmax
The output from softmax() will be probabilities proportional to the exponentials of the (relative) values in vector x, which here are your SHAP values.
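For instance, a quick check with the scipy implementation, using the two per-class raw sums (-9.40 and 9.40) that appear in the verification below:
import numpy as np
from scipy.special import softmax
x = np.array([-9.40, 9.40])  # per-class raw sums from the verification below
print(softmax(x))
# [6.83e-09 1.00e+00] -> essentially all probability mass on class 1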
import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('X_train: ',X_train.shape)
print('X_test: ',X_test.shape)
model = lgbm.LGBMClassifier()
model.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)
# plot
# shap.summary_plot(shap_values[class_idx], X_train, plot_type='bar')
# shap.summary_plot(shap_values[class_idx], X_train)
# shap_value = shap_values[:,:,class_idx].values[row_idx]
# shap.force_plot (base_value = expected_value, shap_values = shap_value, features = X_train.iloc[row_idx, :], matplotlib=True)
# # dataframe of shap values for class 1
# shap_df = pd.DataFrame(shap_values[:,:, 1 ].values, columns = shap_values.feature_names)
# verification
def verification(index_number, class_idx):
    print('-----------------------------------')
    print('index_number: ', index_number)
    print('class_idx: ', class_idx)
    print('')
    y_base = explainer.expected_value[class_idx]
    print('y_base: ', y_base)
    player_explainer = pd.DataFrame()
    player_explainer['feature_value'] = X_train.iloc[index_number].values
    player_explainer['shap_value'] = shap_values[class_idx][index_number]
    print('verification: ')
    print('y_base + sum_of_shap_values: %.2f' % (y_base + player_explainer['shap_value'].sum()))
    print('y_pred: %.2f' % (y_train[index_number]))
j = 10 # index
verification(j,0)
verification(j,1)
# show:
# X_train: (455, 30)
# X_test: (114, 30)
# -----------------------------------
# index_number: 10
# class_idx: 0
# y_base: -2.391423081639827
# verification:
# y_base + sum_of_shap_values: -9.40
# y_pred: 1.00
# -----------------------------------
# index_number: 10
# class_idx: 1
# y_base: 2.391423081639827
# verification:
# y_base + sum_of_shap_values: 9.40
# y_pred: 1.00
# Between -9.40 and 9.40, the maximum corresponds to class_idx 1, which equals y_pred, so the result is obviously correct.
This achieves what you want and verifies the reliability of the results.

Naive Bayes MultinomialNB scikit-learn/sklearn

I am building a naive Bayes classifier, following the tutorial on the scikit-learn website.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import csv
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Importing dataset
data = pd.read_csv("test.csv", quotechar='"', delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True,error_bad_lines=False)
df2 = data.set_index("name", drop = False)
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
train, test = train_test_split(df2, test_size=0.2)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train['review'])
test_matrix = count_vect.transform(test['review'])
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
The first argument is the vocabulary dictionary and it returns a Document-Term matrix.
What should be the second argument,twenty_train.target?
Edit Data example
Name, review,rating
film1,......,1
film2, the film is....,5
film3, film about..., 4
with this instruction I created a new column , if the rating is >3 so the review is positive, else it is negative
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
The fit method of MultinomialNB expects as input the x and y.
Now, x should be the training vectors (training data) and y should be the target values.
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
In more detail:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    Training vectors, where n_samples is the number of samples and
    n_features is the number of features.
y : array-like, shape = [n_samples]
    Target values.
Note: Make sure that shape = [n_samples, n_features] and shape = [n_samples] of x and y are defined correctly. Otherwise, the fit will throw an error.
Toy example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
newsgroups_train = fetch_20newsgroups(subset='train')
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
vectorizer = TfidfVectorizer()
# the following will be the training data
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
# this is the test data
vectors_test = vectorizer.transform(newsgroups_test.data)
clf = MultinomialNB(alpha=.01)
# the fitting is done using the TRAINING data
# Check the shapes before fitting
vectors.shape
#(2034, 34118)
newsgroups_train.target.shape
#(2034,)
# fit the model using the TRAINING data
clf.fit(vectors, newsgroups_train.target)
# the PREDICTION is done using the TEST data
pred = clf.predict(vectors_test)
EDIT:
The newsgroups_train.target is just a numpy array that contains the labels (or targets or classes).
import numpy as np
newsgroups_train.target
array([1, 3, 2, ..., 1, 0, 1])
np.unique(newsgroups_train.target)
array([0, 1, 2, 3])
So in this example we have 4 different classes/targets.
This variable is needed in order to fit a classifier.
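To close the loop on the toy example, the test predictions can be scored against newsgroups_test.target; a minimal sketch using the metrics module already imported above:
# score the TEST predictions against the true test labels
pred = clf.predict(vectors_test)
print(metrics.f1_score(newsgroups_test.target, pred, average='macro'))
print(metrics.accuracy_score(newsgroups_test.target, pred))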
