Object not found while trying to write a pickle file - python

I am trying to do cancer detection using a random forest classifier. I am trying to create a pickle file with the command pickle.dump(forest,open("model.pkl","wb")), but I am getting a NameError:
NameError Traceback (most recent call last)
c:\Users\hp\newtest\pcancer.ipynb Cell 6' in <cell line: 1>()
----> 1 pickle.dump(forest,open("model.pkl","wb"))
NameError: name 'forest' is not defined
This is my source code for detection:
import numpy as np
import pandas as pd
import warnings as wr
#Ignoring warnings
from sklearn.exceptions import UndefinedMetricWarning
wr.filterwarnings("ignore", category=UndefinedMetricWarning)
import pickle
df=df.dropna(axis=1)#Drop the column with empty data
#Encoding first column
from sklearn.preprocessing import LabelEncoder
#Splitting data for dependence
#Train-Test split
from sklearn.model_selection import train_test_split
#Standard scaling
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
def models(X_train,Y_train):
#Random forest classifier
print("Random Forest:",forest.score(X_train,Y_train))
return forest

This part of the code is not indented correctly. As a result, forest is only a local name inside models() and is not defined at the top level, where pickle.dump is called.

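There is an indentation problem in the last section of your code. Below is the correctly indented version. When you create the pickle file, write the model object to it rather than forest, because the forest returned by models() is stored in the variable named model.
from sklearn.ensemble import RandomForestClassifier
def models(X_train,Y_train):
    #Random forest classifier (constructor arguments are an assumption; that line is missing from the post)
    forest=RandomForestClassifier(random_state=0)
    forest.fit(X_train,Y_train)
    print("Random Forest:",forest.score(X_train,Y_train))
    return forest
model=models(X_train,Y_train)
pickle.dump(model,open("model.pkl","wb"))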
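A minimal, self-contained sketch of the same idea. It assumes synthetic data from make_classification in place of the cancer dataset and a temporary directory for model.pkl; both are stand-ins for illustration, not part of the original post. The point is that the name passed to pickle.dump must be the top-level name the return value was assigned to:

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def models(X_train, Y_train):
    # forest is a local name; it only exists inside this function
    forest = RandomForestClassifier(n_estimators=10, random_state=0)
    forest.fit(X_train, Y_train)
    print("Random Forest:", forest.score(X_train, Y_train))
    return forest


# Synthetic stand-in for the real training data (assumption)
X_train, Y_train = make_classification(n_samples=200, n_features=5, random_state=0)

# Bind the returned estimator to a top-level name, then pickle that name
model = models(X_train, Y_train)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back to confirm the file holds a usable estimator
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X_train) == model.predict(X_train)).all())
print("Restored model gives the same predictions:", same_predictions)
```

Using a with block also closes the file handle, which the one-line open(...) form leaves to the garbage collector.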


AttributeError: 'Pipeline' object has no attribute 'fit_resample'

Based on the linked documentation on pipelines and imbalanced data, I tried to apply the code to a dataset. Here is the code:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
data =pd.read_csv('aug_train.csv')
mymodel =GaussianNB()
y =data['Response'].values
X =data.drop('Response',axis=1).values
#X,y =SMOTE().fit_resample(X,y)
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(X, y)
The main problem in this code is with the line:
X, y = pipeline.fit_resample(X, y)
The error says AttributeError: 'Pipeline' object has no attribute 'fit_resample'. How can I fix this issue? Thanks in advance.
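The tutorial uses imblearn.pipeline.Pipeline, while your code uses sklearn.pipeline.Pipeline (check the import statements). These are different classes: the scikit-learn pipeline has no fit_resample method and does not accept samplers as steps, while the imbalanced-learn one does.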
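A short sketch of the difference, assuming synthetic data from make_classification in place of aug_train.csv (an assumption for illustration). The first part only needs scikit-learn; the resampling part runs only if imbalanced-learn is installed:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline as SkPipeline

# scikit-learn's Pipeline has no resampling method
sk_has_resample = hasattr(SkPipeline, "fit_resample")
print("sklearn Pipeline has fit_resample:", sk_has_resample)

try:
    # imbalanced-learn is a separate install (pip install imbalanced-learn)
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as ImbPipeline
    from imblearn.under_sampling import RandomUnderSampler
except ImportError:
    ImbPipeline = None

if ImbPipeline is not None:
    # Synthetic imbalanced data (assumption): roughly 1% minority class
    X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
    steps = [
        ("o", SMOTE(sampling_strategy=0.1, random_state=0)),
        ("u", RandomUnderSampler(sampling_strategy=0.5, random_state=0)),
    ]
    pipeline = ImbPipeline(steps=steps)
    # Works because the imblearn Pipeline forwards to each sampler's fit_resample
    X_res, y_res = pipeline.fit_resample(X, y)
    print(Counter(y), "->", Counter(y_res))
```

In the question's code, changing only the import line to from imblearn.pipeline import Pipeline should be enough; the rest can stay as written.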

Random Forest Error (input variables with inconsistent numbers of samples)

After reading so many examples with 'inconsistent number of samples' errors, I am still not able to see what is wrong with my code.
In an Excel file, sheet 1 contains the data and sheet 2 contains a shortlisted list of variables.
I saved the variables in sheet 2 into an array and fed it to a Random Forest model to evaluate their impact on a parameter in sheet 1.
But I am getting "Found input variables with inconsistent numbers of samples: [54, 2016]"
54 is the number of variables in sheet 2.
2016 is the number of rows of data in sheet 1.
I am trying to see how these 54 variables impact 'Target' variable in sheet 1.
How should I manipulate my data to make this work?
Many thanks in advance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
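allvar = list()
for each_var in df2.columns:
    allvar.append(each_var)  # assumed loop body (missing from the post): collect the column names from sheet 2
allvar = np.array(allvar)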
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)  # assumed loop body (missing from the post)
[Images in the original post: Sheet 1 (saved as df); Sheet 2 (saved as df2); error log; error log 2, which reads "Unknown label type: 'continuous'"; target_train]
The issue is with train_test_split: you are passing only the feature column names, not the data. Use the list of columns to select the data from the DataFrame, like this:
allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)
You don't necessarily need to convert 'allvar' and 'target' to NumPy arrays; a DataFrame and a Series can be passed directly to train_test_split.
Note: This issue has nothing to do with Random Forest.
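A small sketch of both the error and the fix, using a made-up DataFrame in place of the Excel sheets (the column names var_a, var_b, var_c and Target are hypothetical). It also uses RandomForestRegressor, on the assumption that the target is continuous, which is what the "Unknown label type: 'continuous'" message suggests:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for sheet 1: 200 rows, 3 features and a continuous target
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["var_a", "var_b", "var_c"])
df["Target"] = 2.0 * df["var_a"] - df["var_b"] + rng.normal(scale=0.1, size=200)

# Hypothetical stand-in for sheet 2: only the shortlisted column names
allvar = ["var_a", "var_b", "var_c"]
target = df["Target"]

# Passing the names themselves reproduces the error: 3 names vs 200 rows
try:
    train_test_split(np.array(allvar), target, random_state=0)
except ValueError as exc:
    print(exc)

# Passing the data selected by those names works
X_train, X_test, y_train, y_test = train_test_split(
    df[allvar], target, random_state=0, test_size=0.6
)

# A continuous target calls for a regressor rather than a classifier
reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_train, y_train)
for name, importance in zip(allvar, reg.feature_importances_):
    print(name, round(importance, 3))
```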
Here is the code that works for me.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)  # assumed loop body (missing from the post): collect the column names from sheet 2
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)  # the model must be fitted before feature_importances_ exists
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.barh(range(allvar_train.shape[1]), importances, color="r")

Unable to run logit model/ logistic regression

I'm trying to run a logistic regression. The data has been scrubbed and the categorical variables changed to dummies. However, when I run the code I get an error message from the statsmodels package, outside of my own code, and I am not sure how to correct it.
A friend of mine ran the same code and got an output. As I'm using Spyder with Python 3.6, he thinks it might be a version issue; he is using Python 3.5.
The code is below. Any ideas on how to fix it, or on how better to run a logistic regression, are appreciated.
The error message I'm getting is in the statsmodels library:
File "C:\Users\sebas\Anaconda3\lib\site-packages\statsmodels\discrete\discrete_model.py", line 2405, in llr_pvalue
return stats.chisqprob(self.llr, self.df_model)
AttributeError: module 'scipy.stats' has no attribute 'chisqprob'
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
## Logistic regression
data = pd.read_csv(r"log reg test Lending club 2007-2011 car only.csv")
#data = data.dropna()
print(data['Distressed'].value_counts()) ## number of defaulted car loans is binary
sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary
##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
##testing for nulls in dataset
scrub_data=data.drop(['mths_since_last_delinq'],axis=1) ## this variable is not statistically significant
print('Here is the logit model data')
print(scrub_data.isnull().sum()) ## removed records of missing info, sample still sufficiently large
##convert categorical variables to dummies completed in csv file
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=25)
from sklearn.metrics import classification_report
print('alternative method using RFE')
#x=[i for i in data if i not in y]
## check for independance between features
correlation=sns.heatmap(data.corr()) ## heatmap showing correlations of the variables
from sklearn.svm import LinearSVC
#logreg = LogisticRegression()
#rfe = RFE(logreg,10)
import statsmodels.api as sm
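This looks like a mismatch between the installed statsmodels and SciPy versions rather than a Python version issue: stats.chisqprob was removed from newer SciPy releases, but the installed statsmodels still calls it. Upgrading statsmodels should also resolve it. As a workaround, the error can be fixed by assigning the missing function back into the scipy.stats namespace: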
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
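A quick check that the workaround behaves as expected, as a minimal sketch. The hasattr guard is an addition not in the answer above; it avoids overwriting the function on an older SciPy that still has it. The patch needs to run before the statsmodels code that calls stats.chisqprob:

```python
from scipy import stats

# Restore the removed helper only if this SciPy version lacks it
if not hasattr(stats, "chisqprob"):
    stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)

# The 95th percentile of a chi-square with 1 degree of freedom
# should map back to a p-value of 0.05
critical_value = stats.chi2.ppf(0.95, 1)
p_value = stats.chisqprob(critical_value, 1)
print(critical_value, p_value)
```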

Scikit-learn binarize categorical data

I've been trying to load a CSV file into scikit via pandas and setting the target column to be a list of 20 categorical variables. I've tried using label_binarize but that didn't seem to do any good so after some reading I've switched to LabelEncoder but it doesn't appear to change much.
from io import StringIO
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import permutation_test_score
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, confusion_matrix
from sklearn.model_selection import train_test_split, ShuffleSplit
from sklearn.preprocessing import label_binarize, MultiLabelBinarizer, LabelEncoder
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
#loading the data
y = data.iloc[:,19]
X = data.iloc[:,1:18+20:22]
#Binarize the output
le = LabelEncoder()
y = label_binarize(y, le)
n_classes = y.shape[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
model3 = KNeighborsClassifier(n_neighbors=7)
yet when I run this I get:
Traceback (most recent call last):
File "file, line 30, in <module>
File "C:\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 149, in transform
classes = np.unique(y)
File "\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 198, in unique
TypeError: '>' not supported between instances of 'str' and 'float'
Is this kind of target data even possible for scikit?
OK, so to solve this issue I found that the categorical values themselves need to be surrounded with quotation marks in the CSV, like this: "0-1"
Otherwise the value was apparently not read as a string (the traceback shows str and float values mixed in the same column). With the quotes, the data loads correctly.
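The TypeError in the traceback can be reproduced with a column that mixes strings and floats, for example when missing values are read as NaN. The sketch below uses made-up labels and is one possible way to handle it in code rather than in the CSV; the "missing" placeholder is an assumption, not something from the original post:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up target column: string categories plus one missing value (NaN is a float)
y = pd.Series(["0-1", "2-3", np.nan, "0-1"])

# np.unique has to sort the values, and str cannot be ordered against float
try:
    np.unique(np.array(y.tolist(), dtype=object))
    mixed_types_sortable = True
except TypeError as exc:
    mixed_types_sortable = False
    print(exc)

# Make every value a string before encoding
y_clean = y.fillna("missing").astype(str).tolist()
le = LabelEncoder()
encoded = le.fit_transform(y_clean)
print(list(le.classes_), list(encoded))
```

Passing dtype=str for the target column in pd.read_csv is another option for keeping values such as 0-1 as text, although missing cells would still need handling.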

AttributeError: LinearRegression object has no attribute 'coef_'

I've been attempting to fit this data by a Linear Regression, following a tutorial on bigdataexaminer. Everything was working fine up until this point. I imported LinearRegression from sklearn, and printed the number of coefficients just fine. This was the code before I attempted to grab the coefficients from the console.
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
boston = load_boston()
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
X = bos.drop('PRICE', axis = 1)
lm = LinearRegression()
After I had all this set up I ran the following command, and it returned the proper output:
In [68]: print('Number of coefficients:', len(lm.coef_))
Number of coefficients: 13
However, now if I ever try to print this same line again, or use 'lm.coef_', it tells me coef_ isn't an attribute of LinearRegression, right after I JUST used it successfully, and I didn't touch any of the code before I tried it again.
In [70]: print('Number of coefficients:', len(lm.coef_))
Traceback (most recent call last):
File "<ipython-input-70-5ad192630df3>", line 1, in <module>
print('Number of coefficients:', len(lm.coef_))
AttributeError: 'LinearRegression' object has no attribute 'coef_'
The coef_ attribute is created when the fit() method is called. Before that, it will be undefined:
>>> import numpy as np
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> from sklearn.linear_model import LinearRegression
>>> boston = load_boston()
>>> lm = LinearRegression()
>>> lm.coef_
AttributeError Traceback (most recent call last)
<ipython-input-22-975676802622> in <module>()
8 lm = LinearRegression()
----> 9 lm.coef_
AttributeError: 'LinearRegression' object has no attribute 'coef_'
If we call fit(), the coefficients will be defined:
>>> lm.fit(boston.data, boston.target)
>>> lm.coef_
array([ -1.07170557e-01, 4.63952195e-02, 2.08602395e-02,
2.68856140e+00, -1.77957587e+01, 3.80475246e+00,
7.51061703e-04, -1.47575880e+00, 3.05655038e-01,
-1.23293463e-02, -9.53463555e-01, 9.39251272e-03,
My guess is that somehow you forgot to call fit() when you ran the problematic line.
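The same behaviour can be checked without the Boston dataset, which is no longer shipped with recent scikit-learn releases. This is a minimal sketch on synthetic data with 13 features (the data is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 13))  # 13 synthetic features (assumption)
y = X @ np.arange(1, 14) + rng.normal(scale=0.1, size=100)

lm = LinearRegression()
fitted_before = hasattr(lm, "coef_")  # nothing has been fitted yet

lm.fit(X, y)
fitted_after = hasattr(lm, "coef_")  # fit() creates coef_
print(fitted_before, fitted_after)
print("Number of coefficients:", len(lm.coef_))
```

One way to end up in the situation from the question is to re-run lm = LinearRegression() after fitting: that creates a fresh, unfitted object, so coef_ disappears until fit() is called again.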
I also ran into the same problem with linear regression: the object has no attribute 'coef_'.
Only a slight change in the code is needed: call fit() before accessing the coefficients.
linreg = LinearRegression()
linreg.fit(X,y) # fit the linear model to the data
I hope this helps. Thanks.

