ExtraTreesClassifier missing argument y - Python

I was trying to implement an Extra Trees Classifier in order to find the feature importances in my database. I wrote this simple code, but for some reason I keep getting this error.
My Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import ExtraTreesClassifier
df = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\5-FINAL2\\Final After Simple Filtering.csv')
extra_tree_forest = ExtraTreesClassifier(n_estimators = 5, criterion ='entropy', max_features = 2)
extra_tree_forest.fit(df)
feature_importance = extra_tree_forest.feature_importances_
feature_importance_normalized = np.std([tree.feature_importances_ for tree in extra_tree_forest.estimators_], axis = 1)
plt.bar(X.columns, feature_importance_normalized)
plt.xlabel('Lbale')
plt.ylabel('Feature Importance')
plt.title('Parameters Importance')
plt.show()
The Error:
TypeError Traceback (most recent call last)
<ipython-input-7-4aad8882ce6d> in <module>
16 extra_tree_forest = ExtraTreesClassifier(n_estimators = 5, criterion ='entropy', max_features = 2)
17
---> 18 extra_tree_forest.fit(df)
19
20 feature_importance = extra_tree_forest.feature_importances_
TypeError: fit() missing 1 required positional argument: 'y'
Thank you

Usually, the fit method needs both the attributes (X) and the labels (y), so you need to call extra_tree_forest.fit(X, y) to train this classifier.
I recommend splitting the labels and attributes into two separate objects when you import
Final After Simple Filtering.csv.
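For example, here is a minimal sketch of that split, assuming the target column in your CSV is named 'Label' (a hypothetical name, substitute your actual column):
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
df = pd.read_csv('Final After Simple Filtering.csv')
# 'Label' is a hypothetical target column name - replace it with yours.
X = df.drop('Label', axis=1)
y = df['Label']
extra_tree_forest = ExtraTreesClassifier(n_estimators=5, criterion='entropy', max_features=2)
extra_tree_forest.fit(X, y)
With X defined this way, the later plt.bar(X.columns, ...) call will also have a defined X to work with.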

Related

I can't run a linear regression and cross-validation. Can someone enlighten me? I get errors such as "could not convert string to float"

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
import statsmodels.api as sm
import scipy
import scipy.stats
import seaborn as sns
import numpy.random as npr
import math
from scipy.stats import norm
import sqlite3 as sql
import seaborn
from numba import jit, prange
df = pd.read_csv('ODI-2021.edited.csv')
df.info()
sr_targets = pd.Series(df['What is your stress level (0-100)?'])
sr_targets.describe()
df_features = df.drop('What is your stress level (0-100)?', axis=1)
print (df_features)
df_features.describe()
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
df.isnull().sum()
df_missing = df.dropna()
df_missing.shape
X = df["What is your stress level (0-100)?"]
y = df["Time you went to be Yesterday"]
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
model.summary()
-AttributeError: 'LinearRegression' object has no attribute 'summary'
from sklearn import preprocessing
def convert(df):
    number = preprocessing.LabelEncoder()
    data['Date'] = number.fit_transform(df['Date'])
    data = data.fillna(-999)
    return data
model = LinearRegression(fit_intercept=True)
result = model.fit(df_features, sr_targets)
-ValueError: could not convert string to float: '3/16/2021'
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
-NameError: name 'result' is not defined
sr_endog = sr_targets.copy()
df_exog = sm.add_constant(df_features)
model = sm.OLS(sr_endog, df_exog)
result = model.fit()
-ValueError: Pandas data cast to numpy dtype of object. Check input data with
np.asarray(data)
result.summary()
-NameError: name 'result' is not defined
cross validation
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
nb_folds = 10
cv = KFold(n_splits=nb_folds)
model = LassoCV(fit_intercept=True, cv=cv, n_alphas=200, max_iter=2000)
result = model.fit(df_features_rescaled, sr_targets)
-NameError: name 'df_features_rescaled' is not defined
fig = plt.figure(figsize=[16,15])
xvalues = np.log10(result.alphas_)
rmse_path = np.sqrt(result.mse_path_)
for k in range(nb_folds):
    yvalues = rmse_path[:,k]
    plt.plot(xvalues, yvalues)
    pos_ymin = yvalues.argmin()
    plt.plot(xvalues[pos_ymin], yvalues[pos_ymin], marker='o')
plt.axvline(np.log10(result.alpha_))
plt.title('RMSE for different alpha', fontsize=20)
plt.grid()
-NameError: name 'result' is not defined
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
This is my code and these are the errors I'm getting. Could someone help me with what I'm doing wrong? I have looked up the errors and I have no clue how to fix them. My data set has numbers, but also dates and answers such as yes/no and education-level responses, which I have no idea how to convert to float. I have been trying to run a regression with two columns that consist of numbers, and I get these errors. For the cross-validation I am dropping one column and using the rest, and I'm getting the error that I haven't defined the variable result, which I have. I'm clueless.
Thanks in advance!
Each of your errors means something. Learning to read the errors is extremely important in understanding what is going on. For example,
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
This appears to suggest that someone placed the phrase "over 9000" in a cell of the CSV file you are opening. Hence, Python is having trouble figuring out how to convert that to a float. The same thing would happen if you tried to run
float("over 9000")
The data needs to be cleaned up a bit before statsmodels can use it. Python is trying to tell you the same thing here too:
-ValueError: could not convert string to float: '3/16/2021'
The string "3/16/2021" has symbols that are not apart of a float(), namely "/" symbol.
I think it would be helpful if you broke up your errors and concerns into separate questions, then people could tackle them one at a time for you.
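In the meantime, here is a minimal sketch of that kind of cleanup, assuming the column names match your CSV; pd.to_numeric with errors='coerce' turns unparseable strings such as 'over 9000' into NaN instead of raising:
import pandas as pd
from sklearn import preprocessing
df = pd.read_csv('ODI-2021.edited.csv')
# Coerce free-text numeric answers: 'over 9000' becomes NaN instead of raising.
df['stress'] = pd.to_numeric(df['What is your stress level (0-100)?'], errors='coerce')
# Encode categorical answers (yes/no, education level, date strings) as integers.
encoder = preprocessing.LabelEncoder()
df['Date'] = encoder.fit_transform(df['Date'].astype(str))  # 'Date' as in your convert() function
# Drop rows that could not be parsed before fitting a model.
df = df.dropna(subset=['stress'])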

2D output on Linear regression model

I'm getting the following error from my code:
ValueError: Expected 2D array, got scalar array instead:
array=99.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is the code used:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
Physical_activity_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/Physical_activity_nopass.xlsx')
prediction_df = Physical_activity_df[['Activity_Score','Calories']]
prediction_df.plot(kind='scatter', x= 'Activity_Score', y= 'Calories')
plt.show()
#change to df variables
activity_score = pd.DataFrame(prediction_df['Activity_Score'])
calories = pd.DataFrame(prediction_df['Calories'])
lm = linear_model.LinearRegression()
model = lm.fit(activity_score,calories)
#predict new values for calories (FROM HERE COMES THE ERROR)
activity_score_new = 99
calories_predict = model.predict(activity_score_new)
calories_predict
Any idea about how to fix this issue? Thanks!
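The error message itself spells out the fix: predict expects a 2-D array, so a single sample with a single feature must be shaped (1, 1). A minimal sketch:
import numpy as np
# One sample, one feature: shape (1, 1), as the reshape hint suggests.
activity_score_new = np.array([[99]])  # equivalent to np.array(99).reshape(1, -1)
calories_predict = model.predict(activity_score_new)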

TypeError: string indices must be integers. Don't know what I'm doing wrong

I have a dataset with two columns: one identifies whether an email is classified as spam or not, and the other contains the email's content. I've been trying to implement naive Bayes with PSO as well as ABC. However, I get the error TypeError: string indices must be integers.
email_train,email_test,spam_train,spam_test = train_test_split(dfTotal.Email,dfTotal.Spam,test_size=0.3,random_state=0)
email_test_dtm = cv.transform(email_test)
# convert to TFIDF form
email_test_tf = tf.fit_transform(email_test_dtm)
email_test_tf
Artificial Bee Colony
from Hive import Hive
from Hive import Utilities
from sklearn.metrics import log_loss
# ---- SOLVE TEST CASE WITH ARTIFICIAL BEE COLONY ALGORITHM
def run(lowBounds,upBounds,evaluator):
    model = Hive.BeeHive(lower = lowBounds,  # MUST BE A LIST !
                         upper = upBounds,   # MUST BE A LIST !
                         fun = evaluator,
                         numb_bees = 100,
                         max_itrs = 2)
    # runs model
    cost,sol = model.run()
    # plots convergence
    Utilities.ConvergencePlot(cost)
    # prints out best solution
    print("Fitness Value ABC: {0}".format(model.best))
    ABC_model = MultinomialNB(alpha=10**sol[0]).fit(email_train_tf,spam_train)  # Create the optimized model with best parameter
    result = ABC_model.predict(email_test_tf)  # predict with the ABC_model
    return sol,result
Import Optunity
import optunity
import optunity.metrics
Naive Bayes
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.get_params()
# fit tf-idf representation to NB model
nb.fit(email_train_tf, spam_train)
# class predictions for testing set
result1 = nb.predict(email_test_tf)
def evaluator(params):
    nBayes = MultinomialNB(alpha=10**params[0]).fit(email_train_tf,spam_train)
    pred_proba = nBayes.predict_proba(email_test_tf)
    return log_loss(spam_test,pred_proba)
sol,result3 = run([-2],[1],evaluator)
The traceback I receive is as follows:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-44-006c296828ed> in <module>
8
9
---> 10 sol,result3 = run([-2],[1],evaluator)
<ipython-input-32-6852d973eb15> in run(lowBounds, upBounds, evaluator)
19
20 # plots convergence
---> 21 Utilities.ConvergencePlot(cost)
22
23 # prints out best solution
c:\users\lidak\article\src\hive\Hive\Utilities.py in ConvergencePlot(cost)
55 labels = ["Best Cost Function", "Mean Cost Function"]
56 plt.figure(figsize=(12.5, 4));
---> 57 plt.plot(range(len(cost["best"])), cost["best"], label=labels[0]);
58 plt.scatter(range(len(cost["mean"])), cost["mean"], color='red', label=labels[1]);
59 plt.xlabel("Iteration #");
TypeError: string indices must be integers
<Figure size 900x288 with 0 Axes>

'Dataset' object is not callable problems

I'm trying to impute NaN values, but first I want to check the best method for calculating those values. I'm new to these methods, so I want to use some code I found to compare the different regressors and choose the best one. The original code is this:
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
fetch_california_housing is the example's dataset.
So, when I tried to adapt this code to my case, I wrote this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy import genfromtxt
data = genfromtxt('documents/datasets/df.csv', delimiter=',')
features = data[:, :2]
targets = data[:, 2]
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = data(return_X_y= True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
I always get the same error:
AttributeError: 'numpy.ndarray' object is not callable
and before, when I used my DF as a CSV (df.csv), the error was the same:
AttributeError: 'Dataset' object is not callable
The complete error is this:
TypeError Traceback (most recent call last)
<ipython-input-8-3b63ca34361e> in <module>
      3 rng = np.random.RandomState(0)
      4
----> 5 X_full, y_full = df(return_X_y=True)
      6 # ~2k samples is enough for the purpose of the example.
      7 # Remove the following two lines for a slower run with different error bars.
TypeError: 'DataFrame' object is not callable
and I don't know how to make either error go away.
I hope I explained my problem well; my English is not very good.
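The quoted code points at the cause: return_X_y is a parameter of scikit-learn dataset loaders such as fetch_california_housing, not of a NumPy array or a DataFrame, so calling data(return_X_y=True) fails. Since the features and targets are already sliced out of the array, a minimal sketch of the adaptation, assuming (as in the quoted code) the first two columns are features and the third is the target:
import numpy as np
from numpy import genfromtxt
data = genfromtxt('documents/datasets/df.csv', delimiter=',')
# Slice the array directly instead of calling it like a dataset loader.
X_full = data[:, :2]  # features
y_full = data[:, 2]   # target
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape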

Linear Regression issues

I'm trying to run a linear regression for 2 columns of data (IMF_VALUES, BBG_FV)
I have this code:
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
import pandas as pd
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
ISO_TH = raw_data[["IMF_VALUE","BBG_FV"]]
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])]
npMatrix = np.matrix(filtered_TH)
IMF_VALUE, BBG_FV = npMatrix[:,0], npMatrix[:,1]
regression = linear_model.LinearRegression
regression.fit(IMF_VALUE, BBG_FV)
When I run this as a test, I get this error and I really have no idea why:
TypeError Traceback (most recent call last)
<ipython-input-28-1ee2fa0bbed1> in <module>()
1 regression = linear_model.LinearRegression
----> 2 regression.fit(IMF_VALUE, BBG_FV)
TypeError: fit() missing 1 required positional argument: 'y'
The error comes from the line regression = linear_model.LinearRegression, which assigns the class itself instead of creating an instance. Calling fit on the class makes Python treat IMF_VALUE as self, so it reports 'y' as missing. Instantiate the estimator with parentheses:
regression = linear_model.LinearRegression()
Also make sure that both inputs are shaped as columns:
regression.fit(np.array(IMF_VALUE).reshape(-1,1), np.array(BBG_FV).reshape(-1,1))
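Putting it together, a minimal corrected sketch of the quoted code:
import numpy as np
import pandas as pd
from sklearn import linear_model
raw_data = pd.read_csv("IMF and BBG Fair Values.csv")
ISO_TH = raw_data[["IMF_VALUE", "BBG_FV"]]
filtered_TH = ISO_TH[np.isfinite(raw_data['BBG_FV'])]
# Note the parentheses: instantiate the estimator before fitting.
regression = linear_model.LinearRegression()
regression.fit(filtered_TH[['IMF_VALUE']], filtered_TH['BBG_FV'])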
