I was using StandardScaler to scale my data, as shown in a tutorial, but it is not working.
I copied the same code that was used in the tutorial, but an error was still displayed.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
The error is as follows:
TypeError: fit() missing 1 required positional argument: 'X'
I tried to recreate your problem, and the code itself appears to be correct and runs without errors. Here is a stand-alone example I created to test it:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
data = load_iris()
df = pd.DataFrame(data.data, columns=['TARGET CLASS', 'a', 'b', 'c'])
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS', axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
I suggest you examine your variable df by printing it. For instance, you could try to transform it into a NumPy array before passing it and print its contents:
import numpy as np
X = df.drop('TARGET CLASS',axis=1).values
print(X)
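One situation that can produce a "missing 1 required positional argument" error is calling fit on the StandardScaler class itself rather than on an instance, for example if the parentheses in scaler = StandardScaler() were dropped. Whether that happened here is only an assumption, since the posted code does include them. The sketch below uses a small made-up array to show the difference; the exact exception type and message depend on the Python and scikit-learn versions, so both TypeError and AttributeError are caught.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data, only for illustration
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

scaler_class = StandardScaler    # no parentheses: this is the class itself
scaler = StandardScaler()        # parentheses: this is an instance

print(isinstance(scaler_class, type))  # True
print(isinstance(scaler, type))        # False

# Calling fit on the class treats X as `self`, so the call fails
call_on_class_failed = False
try:
    scaler_class.fit(X)
except (TypeError, AttributeError) as exc:
    call_on_class_failed = True
    print(type(exc).__name__, exc)

# Calling fit/transform on the instance works
scaled = scaler.fit_transform(X)
print(scaled)
```

If type(scaler) prints a class such as type rather than StandardScaler in your session, the instance was never created.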
Related
The following is my code.
I am running it in IDLE with Python 3.8.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn import trees
from sklearn.metrics import accuracy_score,classification_report
import warnings
from sklearn.preprocessing import StandardScalar
from sklearn.neural_networks import MLPClassifier
warnings.filterwarnings(action='ignore',category=DeprecationWarning)
data=pd.read_csv('data.csv')
cols_to_retain=[]
x-feature=data[cols_to_retain]
x_dict=x_feature.T.to_dict.values()
vect=DictVectorizer(sparse=False)
x_vector=vect.fit_transform(x_dict)
print(x_vector)
x_train=[:-1]
x_test=[-1:]
print('Train set')
print(x_train)
print('Test set')
print(x_test)
le=LabelEncoder
y_train=le.fit_transform(data['Goal'][:-1])
clf=tree.DecisionTreeClassifier(criteron='entropy')
clf=clf.fit_transform(x_train,y_train)
print('Test Data')
print(le.inverse_transform(clf.predict(x_test)))
It shows an error for these particular lines.
It only says "invalid syntax":
x_train=[:-1]
x_test=[-1:]
The packages are imported correctly.
Your code contains multiple issues:
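The import should be StandardScaler, not StandardScalar,
You have unused imports like MLPClassifier (and its module is sklearn.neural_network, not sklearn.neural_networks),
The import should be from sklearn import tree, not trees, because the code later calls tree.DecisionTreeClassifier,
cols_to_retain is empty. Thus, data[cols_to_retain] will return an empty data frame,
to_dict should be to_dict(),
variable names x-feature and x_feature do not match (and x-feature is not a valid name, because of the hyphen),
LabelEncoder is missing brackets (),
The keyword argument criteron should be criterion,
DecisionTreeClassifier has no fit_transform method; use fit instead,
x_train=[:-1] and x_test=[-1:] are not valid syntax. You probably wanted to select a subset like x_train = x_vector[:-1] or x_test = x_vector[-1:]. Please add additional sample data if you need help with this selection.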
Here is an updated version of your code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv("data.csv")
print(data)
cols_to_retain = []
x_feature = data[cols_to_retain]
x_dict = x_feature.T.to_dict().values()
vect = DictVectorizer(sparse=False)
x_vector = vect.fit_transform(x_dict)
print(x_vector)
x_train = x_vector[:-1]
x_test = x_vector[-1:]
print("Train set")
print(x_train)
print("Test set")
print(x_test)
le = LabelEncoder()
y_train = le.fit_transform(data["Goal"][:-1])
clf = DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(x_train, y_train)
print("Test Data")
print(le.inverse_transform(clf.predict(x_test)))
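Because data.csv is not available, the following is a minimal end-to-end sketch of the same steps on a small made-up table. The column names Outlook and Windy and all the values are hypothetical; only the Goal column name comes from the question. It also uses to_dict(orient="records") as a shorter alternative to .T.to_dict().values().

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for data.csv
data = pd.DataFrame({
    "Outlook": ["sunny", "rainy", "sunny", "overcast", "rainy"],
    "Windy": ["no", "yes", "yes", "no", "no"],
    "Goal": ["play", "stay", "stay", "play", "play"],
})

cols_to_retain = ["Outlook", "Windy"]  # must not be empty
x_dict = data[cols_to_retain].to_dict(orient="records")

# One-hot encode the categorical features
vect = DictVectorizer(sparse=False)
x_vector = vect.fit_transform(x_dict)

# All rows except the last for training, the last row for testing
x_train = x_vector[:-1]
x_test = x_vector[-1:]

le = LabelEncoder()
y_train = le.fit_transform(data["Goal"][:-1])

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(x_train, y_train)

prediction = le.inverse_transform(clf.predict(x_test))
print(prediction)
```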
I am learning scikit-learn and have been working with it for a while, but when I started with KNN I ran into a problem. Here is my code:
# import required packages
import numpy as np # later use
import pandas as pd
from sklearn import neighbors, metrics # later use
from sklearn.model_selection import train_test_split # later use
from sklearn.preprocessing import LabelEncoder
# start training
data = pd.read_csv('data/car.data')
X = data[[
'buying',
'maint',
'safety'
]]
y = data[['class']]
Le = LabelEncoder()
for i in range(len[X[0]]):
    X[:, i] = Le.fit_transform(X[:, i])
After running the debugger, the problem seems to be at X = data[[
I have no idea how to solve it.
I saw the same error once before, and that time I fixed it by adding .values to the end of the variable X.
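The .values idea mentioned above can be sketched as follows. X = data[[...]] is a DataFrame, and positional indexing such as X[:, i] only works on a NumPy array, which is what .values returns. The rows below are made up, since data/car.data is not shown, and .copy() is added as a precaution so the array is writable regardless of the pandas version. Note that len needs parentheses, not square brackets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows standing in for data/car.data
data = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["low", "med", "high", "vhigh"],
    "safety": ["low", "med", "high", "med"],
    "class": ["unacc", "unacc", "acc", "good"],
})

# .values turns the DataFrame into a NumPy array, so X[:, i] works
X = data[["buying", "maint", "safety"]].values.copy()
y = data["class"]

# Encode each column separately; a fresh encoder per column keeps them independent
for i in range(X.shape[1]):
    X[:, i] = LabelEncoder().fit_transform(X[:, i])

print(X)
```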
I am trying to preprocess the iris dataset, but at the imputation step I get an error when I use SimpleImputer and then try to print the mean of every column.
Here is the full code for reference. The error occurs in the last part.
import numpy as np
from sklearn.datasets import load_iris
from sklearn import preprocessing
iris = load_iris()
X = iris.data
iris_normalized = preprocessing.normalize(iris.data,norm='l2')
print(iris_normalized.mean(axis=0))
enc = preprocessing.OneHotEncoder()
iris_target_onehot = enc.fit_transform(iris.target.reshape(-1,1))
print(iris_target_onehot.toarray()[[0,50,100]])
X[:50,:] = np.nan
from sklearn.impute import SimpleImputer
iris_imputed = SimpleImputer(missing_values=np.nan,strategy='mean')
iris_imputed.fit_transform(X)
print(iris_imputed.mean(axis=0))
Sorry, I am new to machine learning.
Assign the result of fit_transform to a variable before calling mean on it. The SimpleImputer object itself has no mean method, but the NumPy array returned by fit_transform does:
iris_imputed = iris_imputed.fit_transform(iris.data)
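A slightly longer sketch of the same fix, using separate names for the imputer and its output so the two are not confused. The names imputer and X_imputed are chosen here for illustration, and the data is copied so that iris.data itself is not overwritten with NaN.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

# Copy so the original dataset array is left untouched
X = load_iris().data.copy()
X[:50, :] = np.nan  # blank out the first 50 rows

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform returns a NumPy array; the imputer object has no .mean()
X_imputed = imputer.fit_transform(X)

print(X_imputed.mean(axis=0))
```

Since each missing value is replaced by its column mean, the column means after imputation equal the means of the non-missing values.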
I'm trying to do RFECV on the transformed data using SciKit.
For that, I create a pipeline and pass the pipeline to the RFECV. It works fine unless I have ColumnTransformer as a pipeline step. It gives me the following error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
I have checked the answer for this Question, but I'm not sure if they are applicable here. The code is as follows:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
class CustomPipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_
X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i*2 for i in range(100)],
})
y = pd.DataFrame({'out': [i*3 for i in range(100)]})
ct = ColumnTransformer([("norm", Normalizer(norm='l1'), ['col1'])])
pipe = CustomPipeline([
    ('col_transform', ct),
    ('lr', LinearRegression())
])
rfecv = RFECV(
    estimator=pipe,
    step=1,
    cv=3,
)
#pipe.fit(X,y) # pipe can fit, no problems
rfecv.fit(X,y)
Obviously, I can do this transformation step outside the pipeline and then use the transformed X, but I was wondering if there is any workaround.
I'd also like to raise this as an RFECV's design issue (it converts X to numpy array first thing, while other approaches with built-in cross-validation e.g. GridSearchCV do not do that)
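For reference, the fallback mentioned above (transforming outside the pipeline and then running RFECV on the transformed array) could look roughly like this. It is only a sketch: remainder='passthrough' is an assumption added here so that col2 is kept and RFECV has more than one feature to rank, and y is passed as a 1-D Series rather than a DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer

X = pd.DataFrame({
    'col1': [i for i in range(100)],
    'col2': [i * 2 for i in range(100)],
})
y = pd.Series([i * 3 for i in range(100)], name='out')

# Column names are resolved here, while X is still a DataFrame.
# remainder='passthrough' (an assumption) keeps col2 in the output.
ct = ColumnTransformer(
    [("norm", Normalizer(norm='l1'), ['col1'])],
    remainder='passthrough',
)
X_transformed = ct.fit_transform(X)  # plain NumPy array, shape (100, 2)

# RFECV now only sees an array and a plain estimator
rfecv = RFECV(estimator=LinearRegression(), step=1, cv=3)
rfecv.fit(X_transformed, y)

print(rfecv.support_)
print(rfecv.ranking_)
```

Normalizer works row by row and learns nothing from the data, so applying it before cross-validation does not leak information between folds; a transformer that learns statistics, such as StandardScaler, would need more care.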
After reading so many examples with 'inconsistent number of samples' errors, I am still not able to see what is wrong with my code.
In an excel file, sheet 1 contains data. Sheet 2 contains a shortlisted list of variables.
I saved the variables in sheet 2 into an array and fed it to a Random Forest model to evaluate their impact on a parameter in sheet 1.
But I am getting "Found input variables with inconsistent numbers of samples: [54, 2016]"
54 is the number of variables in sheet 2.
2016 is the number of rows of data in sheet 1.
I am trying to see how these 54 variables impact 'Target' variable in sheet 1.
How should I manipulate my data to make this work?
Many thanks in advance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev2.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvar = list()
for each_var in df2.columns:
    allvar.append(each_var)
allvar = np.array(allvar)
print(allvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.6)
clf = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=-1)
clf.fit(allvar_train, target_train)
for feature in zip(feat_labels, clf.feature_importances_):
    print(feature)
Sheet 1 (saved as df) looks like this: [image: Sheet 1]
Sheet 2 (saved as df2) looks like this: [image: Sheet 2]
The error log is as shown: [image: Error log]
Error log 2: Unknown label type: 'continuous' [image: Error log 2]
[images: allvar_train, target_train]
The issue is with train_test_split: you are passing only the feature column names, not the data. Use the list of columns to select the data from the DataFrame, like this:
allvar_train,allvar_test,target_train,target_test= train_test_split(df[allvar],target, random_state=0, test_size=0.6)
You don't necessarily need to convert allvar and target to NumPy arrays; a DataFrame and a Series can be passed directly to train_test_split.
Note: this issue has nothing to do with Random Forest.
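The difference between passing column names and passing the data can be seen on a small synthetic frame. The column names, sizes, and test_size value below are made up for illustration. With three names and 20 target rows, train_test_split reports inconsistent sample counts [3, 20], the same kind of message as the [54, 2016] in the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sheet 1: 20 rows, 3 feature columns and a target
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(20, 4)),
    columns=["var_a", "var_b", "var_c", "target"],
)
allvar = ["var_a", "var_b", "var_c"]  # column names only, like sheet 2
target = df["target"]

# Passing the names: 3 "samples" versus 20 target rows
raised = False
try:
    train_test_split(allvar, target, random_state=0, test_size=0.25)
except ValueError as exc:
    raised = True
    print(exc)

# Passing the data selected by those names: 20 rows on both sides
X_train, X_test, y_train, y_test = train_test_split(
    df[allvar], target, random_state=0, test_size=0.25
)
print(X_train.shape, X_test.shape)
```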
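Here is the code that works for me. It selects the data with df[allvarlist] and uses RandomForestRegressor instead of RandomForestClassifier, because the target is continuous (a classifier raises the "Unknown label type: 'continuous'" error on such a target).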
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
df = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=0)
df2 = pd.read_excel(r'C:\Users\ngks\Desktop\TP Course\Project Module\ProjectDataSetrev3.xlsx',sheet_name=1)
df['DateTime']=pd.to_datetime(df['Time Stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index(df['DateTime'], inplace=True)
print(len(df2.columns))
allvarlist = list()
for each_var in df2.columns:
    allvarlist.append(each_var)
countvar = len(allvarlist)
allvar = df[allvarlist]
allvar = allvar.values.reshape(len(allvar),countvar)
target = df['(CUP) Chiller Optimization Plant Efficiency [kW/RT]']
target=target.values.reshape(len(target),1)
allvar_train,allvar_test,target_train,target_test= train_test_split(allvar,target, random_state=0, test_size=0.7)
clf = RandomForestRegressor(n_estimators=10000, random_state=0, n_jobs=-1)
#print(allvar_train)
#print(target_train)
clf.fit(allvar_train,np.ravel(target_train))
for feature in zip(allvarlist, clf.feature_importances_):
    print(feature)
importances = clf.feature_importances_
#indices = np.argsort(importances)
plt.figure().set_size_inches(14,16)
plt.barh(range(allvar_train.shape[1]), importances, color="r")
plt.yticks(range(allvar_train.shape[1]),allvarlist)
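The switch to RandomForestRegressor in the code above matters because the target is a continuous value. A small sketch on synthetic data (the array sizes, coefficients, and n_estimators values are made up and much smaller than in the original) shows the classifier rejecting a continuous target while the regressor fits it, and then sorts the importances with np.argsort, as the commented-out line hints at.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data: the target depends almost entirely on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=60)

# A classifier expects discrete labels and rejects a continuous target
classifier_failed = False
try:
    RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
except ValueError as exc:
    classifier_failed = True
    print(exc)

# A regressor accepts it
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = reg.feature_importances_

# Feature indices ordered from most to least important
order = np.argsort(importances)[::-1]
for idx in order:
    print(idx, importances[idx])
```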