rfecv.fit() in Python not accepting my X and y arguments

I am really new to Python, so I don't know a lot of the basics, but I have a college report that has to be done in Python and I'm struggling to resolve the issues in my code.
First I created my training data for X and y, then transformed it into a pandas DataFrame so I can call ols from statsmodels on it for my initial model. Now I want to use RFE to reduce my model, and I'm starting with RFECV so I can determine how many features I want RFE to select. But every time I run the code I have an issue with rfecv.fit().
Here is my code:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe = make_pipeline(StandardScaler())
pipe.fit(X_train, y_train)
#recombine training dataset in order to call ols
X_train_df = pd.DataFrame(X_train)
y_train_df = pd.DataFrame(y_train)
traindata = pd.concat([X_train_df.reset_index(drop=True), y_train_df.reset_index(drop=True)], axis=1)
#create first linear model
from statsmodels.formula.api import ols
model1 = ols('Tenure ~ Population + Children + Age + Income + Outage_sec_perweek + Email + Contacts + Yearly_equip_failure + MonthlyCharge + Bandwidth_GB_Year + Area_Suburban + Area_Urban + Marital_Married + Marital_Never_Married + Marital_Separated + Marital_Widowed + Gender_Male + Gender_Nonbinary + Churn_Yes + Contract_One_Year + Contract_Two_Year', data=traindata).fit()
print(model1.params)
#RFECV to determine number of variables to include for the optimal model
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
svc = SVC(kernel="linear")
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold, scoring = "accuracy")
rfecv.fit(X_train_df, y_train_df)
The output error looks like this:
TypeError: Singleton array array(None, dtype=object) cannot be considered a valid collection.
Any help or resources would be really appreciated! Thanks

You need to pass cv = StratifiedKFold() instead of cv = StratifiedKFold, so the below will work:
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold(), scoring = "accuracy")
Or if you want 10 folds (default is 5):
rfecv = RFECV(estimator = svc, step = 1, cv = StratifiedKFold(n_splits=10), scoring = "accuracy")
You can check out the difference between having and not having the parentheses (passing an instance vs. the class itself) in posts like this or this.
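For reference, a minimal self-contained sketch of the fixed call on synthetic data (the make_classification shapes are made up). Note that fit() also expects a 1-D y, so a single-column DataFrame like y_train_df is better passed as a Series or raveled:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# synthetic stand-in for the question's training data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(n_splits=10), scoring="accuracy")
rfecv.fit(X, y)  # with a DataFrame target, use y_train_df.values.ravel()
print(rfecv.n_features_)  # number of features RFECV selected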

Related

NLP classification with sparse and numerical features crashes

I have a dataset of 10 million English shows, which has been cleaned and lemmatized, along with their classification into category types such as comedy, documentary, action, etc.
I also have a feature called duration, which is the length of the TV show.
Data can be found here
I perform tf-idf vectorization on the titles, which returns a sparse matrix, and normalization on the duration column.
Then I want to feed the data to a logistic regression classifier.
Side question: I want to know if there's a better way to handle combining a sparse matrix and a numerical column (see the sparse hstack sketch after the code).
When I try to do it using todense() or toarray(), it works. But when I pass the result to the logistic regression function, the notebook crashes. If I don't include the duration column, which means I don't have to apply toarray() or todense(), it works perfectly. Is this a memory issue?
This is my code:
import os
import pandas as pd
from sklearn import metrics
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
def normalize(df, col = ''):
    mms = MinMaxScaler()
    mms_col = mms.fit_transform(df[[col]])
    return mms_col

def tfidf(X, col = ''):
    tfidf_vectorizer = TfidfVectorizer(max_df = 0.8, max_features = 10000)
    return tfidf_vectorizer.fit_transform(X[col])
def get_training_data(df):
    df = shuffle(pd.read_csv(df).dropna())
    data = df[['name_title', 'Duration']]
    X_duration = normalize(data, col = 'Duration')
    X_sparse = tfidf(data, col = 'name_title')
    X = pd.DataFrame(X_sparse.toarray())
    X['Duration'] = X_duration
    y = df['target']
    return X, y
def logistic_regression(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    lr = LogisticRegression(C = 100.0, random_state = 1, solver = 'lbfgs', multi_class = 'ovr')
    lr.fit(X_train, y_train)
    y_predict = lr.predict(X_test)
    print(y_predict)
    print("Logistic Regression Accuracy %.3f" % metrics.accuracy_score(y_test, y_predict))
data_path = '../data/'
X, y = get_training_data(os.path.join(data_path, 'podcasts_en_processed.csv'))
print(X.shape) # this prints (971426, 10001)
logistic_regression(X, y)
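To the side question: one sparse-friendly way to combine the two (a sketch reusing the normalize and tfidf helpers above) is scipy.sparse.hstack, which appends the numeric column without ever densifying the tf-idf matrix; LogisticRegression accepts sparse input directly, so the huge dense array never has to exist:
from scipy.sparse import hstack, csr_matrix

X_sparse = tfidf(data, col = 'name_title')      # sparse tf-idf matrix
X_duration = normalize(data, col = 'Duration')  # dense (n, 1) array
# stack without calling toarray()/todense(); keep CSR for fast row slicing
X = hstack([X_sparse, csr_matrix(X_duration)]).tocsr()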

Alternative to Using Repeated Stratified K Fold with Multiple Outputs?

I am exploring the number of features that would be best to use for my models. I understand that Repeated Stratified K Fold requires a 1D array target, while I am trying to evaluate the number of features for a target that has multiple outputs. Is there a way to use Repeated Stratified K Fold with multiple outputs? Or is there an alternative to accomplish what I need?
from sklearn import datasets
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold, KFold
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
def get_models():
    models = dict()
    for i in range(4, 20):
        rfe = RFE(estimator = DecisionTreeClassifier(), n_features_to_select = i)
        model = DecisionTreeClassifier()
        models[str(i)] = Pipeline(steps=[('s', rfe), ('m', model)])
    return models
from sklearn.utils.multiclass import type_of_target
x = imp_data.iloc[:,:34]
y = imp_data.iloc[:,39]
model = DecisionTreeClassifier()
def evaluate_model(model, x, y):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
    scores = cross_val_score(model, x, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score = 'raise')
    return scores
models = get_models()
results, names = list(), list()
for name, model in models.items():
    scores = evaluate_model(model, x, y)
    results.append(scores)
    names.append(name)
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
As far as I know, you can use cross_validate() as an alternative to StratifiedKFold with multiple outputs. You can define the cross-validation technique and the scoring metrics as you prefer. You can check the link below for more detail!
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
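For example, a minimal sketch on synthetic multilabel data (the sizes are made up): a plain KFold, unlike the stratified splitters, accepts a multi-output y, so cross_validate can score it fold by fold:
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# synthetic multi-output target: y has shape (n_samples, n_classes)
X, y = make_multilabel_classification(n_samples=200, n_features=20, n_classes=3, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(DecisionTreeClassifier(), X, y, scoring='accuracy', cv=cv)
print(scores['test_score'].mean())  # subset accuracy averaged over folds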

Poor accuracy score for Semi-Supervised Support Vector Machine

I am using a semi-supervised approach with a Support Vector Machine in Python for image classification on the PASCAL VOC 2007 data.
I have tried the default parameters from the libraries and also tuned them, but I get extremely bad accuracy of only about 2%.
Below is my code:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn import decomposition
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import train_test_split
from numpy import concatenate
import warnings
warnings.filterwarnings("ignore")
color_layout_features = pd.read_pickle("color_layout_descriptor.pkl")
bow_surf = pd.read_pickle("bow_surf.pkl")
color_hist_features = pd.read_pickle("hist.pkl")
labels = pd.read_pickle("labels.pkl")
# Feat. Scaling
def scale(X, x_min, x_max):
    nom = (X - X.min(axis=0)) * (x_max - x_min)
    denom = X.max(axis=0) - X.min(axis=0)
    denom[denom == 0] = 1
    return x_min + nom / denom

# normalization
def normalize(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))
color_layout_features_scaled = scale(color_layout_features, 0, 1)
color_hist_features_scaled = scale(color_hist_features, 0, 1)
bow_surf_scaled = scale(bow_surf, 0, 1)
features = np.hstack([color_layout_features_scaled, color_hist_features_scaled, bow_surf_scaled])
# define dataset
X, Y = features, labels
X = normalize(X)
pca = decomposition.PCA(n_components=100)
pca.fit(X)
X = pca.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1, stratify=Y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.30, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
from semisupervised import S3VM
model = S3VM(kernel="Linear", C = 1e-2, gamma = 0.5, lamU = 1.0, probability=True)
#model.fit(X_train_mixed, _train_mixed)
model.fit(np.vstack((X_train_lab, X_test_unlab)), np.append(y_train_lab, nolabel))
#model.fit(np.vstack((label_X_train, unlabel_X_train)), np.append(label_y_train, unlabel_y))
# predict
predict = model.predict(X_test)
acc = metrics.accuracy_score(y_test, predict)
# metric
print("accuracy", acc*100)
accuracy 2.6692291266282298
I am using a transductive version of SVM (TSVM) from the semisupervised library, but I am not sure what I am doing wrong, since even after tweaking the parameters I still get the same result.
I referred to https://github.com/rosefun/SemiSupervised/blob/master/semisupervised/TSVM.py for the implementation. Any inputs would be helpful.
Please consider that, according to the linked documentation, "The unlabeled samples should be labeled as -1".

Added StandardScaler but receive errors in cross validation and the correlation matrix

This is the code I built to apply a multiple linear regression. I added StandardScaler to fix the intercept's p-value, which was not significant, but now the CV RMSE results at the end have changed and no longer make sense, and the code for plotting the correlation matrix raises an error: AttributeError: 'numpy.ndarray' object has no attribute 'corr'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
from scipy.stats.stats import pearsonr
# Import Excel File
data = pd.read_excel("C:\\Users\\AchourAh\\Desktop\\Multiple_Linear_Regression\\SP Level Reasons Excels\\SP000273701_PL14_IPC_03_09_2018_Reasons.xlsx",'Sheet1') #Import Excel file
# Replace null values of the whole dataset with 0
data1 = data.fillna(0)
print(data1)
# Extraction of the independent and dependent variables
X = data1.iloc[0:len(data1),[1,2,3,4,5,6,7]] #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),9] #Extract the column of the PAUS SP
# Data Splitting to train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.25,random_state=1)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
# Statistical Analysis of the training set with Statsmodels
X = sm.add_constant(X_train) # add a constant to the model
est = sm.OLS(Y_train, X).fit()
print(est.summary()) # print the results
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm = LinearRegression() # create an lm object of LinearRegression Class
lm.fit(X_train,Y_train) # train our LinearRegression model using the training set of data - dependent and independent variables as parameters. Teaching lm that Y_train values are all corresponding to X_train.
print(lm.intercept_)
print(lm.coef_)
mse_test = mean_squared_error(Y_test, lm.predict(X_test))
print(math.sqrt(mse_test))
# Data Splitting to train and test set of the reduced data
X_1 = data1.iloc[0:len(data1),[1,2]] #Extract the column of the COPCOR SP we are going to check its impact
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(X_1, Y, test_size =0.25,random_state=1)
X_train2 = ss.fit_transform(X_train2)
X_test2 = ss.transform(X_test2)
# Statistical Analysis of the reduced model with Statsmodels
X_reduced = sm.add_constant(X_train2) # add a constant to the model
est_reduced = sm.OLS(Y_train2, X_reduced).fit()
print(est_reduced.summary()) # print the results
# Fitting a Linear Model for the reduced model with Scikit-Learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import math
lm1 = LinearRegression() #create an lm object of LinearRegression Class
lm1.fit(X_train2, Y_train2)
print(lm1.intercept_)
print(lm1.coef_)
mse_test1 = mean_squared_error(Y_test2, lm1.predict(X_test2))
print(math.sqrt(mse_test1))
#Cross Validation and Training again the model
from sklearn.model_selection import KFold
from sklearn import model_selection
kf = KFold(n_splits=6, random_state=1)
for train_index, test_index in kf.split(X_train2):
    print("Train:", train_index, "Validation:", test_index)
    X_train1, X_test1 = X[train_index], X[test_index]
    Y_train1, Y_test1 = Y[train_index], Y[test_index]
results = -1 * model_selection.cross_val_score(lm1, X_train1, Y_train1,scoring='neg_mean_squared_error', cv=kf)
print(np.sqrt(results))
#RMSE values interpretation
print(math.sqrt(mse_test1))
print(math.sqrt(results.mean()))
# Good model built, no overfitting or underfitting (test and training RMSE are barely different, the goal of cross validation), but low prediction accuracy since the value is big
import seaborn
Corr=X_train2.corr(method='pearson')
mask=np.zeros_like(Corr)
mask[np.triu_indices_from(mask)]=True
seaborn.heatmap(Corr,cmap='RdYlGn_r',vmax=1.0,vmin=-1.0,mask=mask, linewidths=2.5)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
Do you have an idea how to fix the issue?
I'm guessing the problem lies with:
Corr=X_train2.corr(method='pearson')
.corr is a pandas DataFrame method, but X_train2 is a numpy array at that stage: when a DataFrame/Series is passed into StandardScaler, a numpy array is returned. Try replacing the above with:
Corr=pd.DataFrame(X_train2).corr(method='pearson')
or make use of numpy.corrcoef or numpy.correlate, whichever fits your case.
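If you also want readable axis labels in the heatmap, one variant (assuming X_1 still holds the original columns at that point) rebuilds the DataFrame with its column names:
Corr = pd.DataFrame(X_train2, columns=X_1.columns).corr(method='pearson')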

How to print estimated coefficients after fitting a model with GridSearchCV? (SGDRegressor)

I am new to scikit-learn, but it did what I was hoping for. Now, maddeningly, the only remaining issue is that I can't find how to print (or, even better, write to a small text file) all the coefficients it estimated and all the features it selected. What is the way to do this?
Same with SGDClassifier, but I think it is the same for all base objects that can be fit, with or without cross-validation. Full script below.
import scipy as sp
import numpy as np
import pandas as pd
import multiprocessing as mp
from sklearn import grid_search
from sklearn import cross_validation
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
def main():
    print("Started.")
    # n = 10**6
    # notreatadapter = iopro.text_adapter('S:/data/controls/notreat.csv', parser='csv')
    # X = notreatadapter[1:][0:n]
    # y = notreatadapter[0][0:n]
    notreatdata = pd.read_stata('S:/data/controls/notreat.dta')
    notreatdata = notreatdata.iloc[:10000,:]
    X = notreatdata.iloc[:,1:]
    y = notreatdata.iloc[:,0]
    n = y.shape[0]
    print("Data loaded.")
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
    print("Data split.")
    scaler = StandardScaler()
    scaler.fit(X_train)  # Don't cheat - fit only on training data
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)  # apply same transformation to test data
    print("Data scaled.")
    # build a model
    model = SGDClassifier(penalty='elasticnet', n_iter = np.ceil(10**6 / n), shuffle=True)
    #model.fit(X,y)
    print("CV starts.")
    # run grid search
    param_grid = [{'alpha' : 10.0**-np.arange(1,7), 'l1_ratio':[.05, .15, .5, .7, .9, .95, .99, 1]}]
    gs = grid_search.GridSearchCV(model, param_grid, n_jobs=8, verbose=1)
    gs.fit(X_train, y_train)
    print("Scores for alphas:")
    print(gs.grid_scores_)
    print("Best estimator:")
    print(gs.best_estimator_)
    print("Best score:")
    print(gs.best_score_)
    print("Best parameters:")
    print(gs.best_params_)

if __name__ == '__main__':
    mp.freeze_support()
    main()
The SGDClassifier instance fitted with the best hyperparameters is stored in gs.best_estimator_. The coef_ and intercept_ are the fitted parameters of that best model.
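For example, assuming gs.fit(X_train, y_train) has already run as in the script (GridSearchCV refits the best model by default, so best_estimator_ is fitted):
print(gs.best_estimator_.coef_)
print(gs.best_estimator_.intercept_)
np.savetxt('coefs.txt', gs.best_estimator_.coef_)  # or write them to a small text file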
From an estimator, you can get the coefficients with the coef_ attribute.
From a pipeline, you can get the model with the named_steps attribute, then get the coefficients with coef_.
From a grid search, you can get the best model with best_estimator_, then reach the model through named_steps as above, and then get its coef_.
Example:
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearSVC())
])

# from pipe:
pipe.fit(X, y)
coefs = pipe.named_steps.model.coef_

# from grid search:
gs_svc_model = GridSearchCV(estimator=pipe,
                            param_grid={'model__C': [.01, .1, 10, 100, 1000]},
                            cv=5,
                            n_jobs=-1)
gs_svc_model.fit(X, y)
coefs = gs_svc_model.best_estimator_.named_steps.model.coef_
I think you might be looking for the estimated parameters of the "best" model rather than the hyper-parameters determined through grid search. You can plug the best hyper-parameters from the grid search ('alpha' and 'l1_ratio' in your case) back into the model ('SGDClassifier' in your case) and train again. You can then read the parameters off the fitted model object.
The code could be something like this:
model2 = SGDClassifier(penalty='elasticnet', n_iter = np.ceil(10**6 / n), shuffle=True, alpha = gs.best_params_['alpha'], l1_ratio = gs.best_params_['l1_ratio'])
model2.fit(X_train, y_train)  # coef_ only exists after fitting
print(model2.coef_)
