I'm trying to impute NaN values, but first I want to check which method is best for calculating these values. I'm new to these methods, so I want to use some code I found to compare the different regressors and choose the best one. The original code is this:
import numpy as np  # needed for np.random.RandomState below
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
fetch_california_housing is the dataset used in that example.
So, when I tried to adapt this code to my case, I wrote this:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from numpy import genfromtxt
data = genfromtxt('documents/datasets/df.csv', delimiter=',')
features = data[:, :2]
targets = data[:, 2]
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = data(return_X_y= True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
I always get the same error:
AttributeError: 'numpy.ndarray' object is not callable
Before that, when I used my DataFrame loaded from the CSV (df.csv), I got the same kind of error:
AttributeError: 'Dataset' object is not callable
The complete error is this:
TypeError Traceback (most recent call last) <ipython-input-8-3b63ca34361e> in <module>
3 rng = np.random.RandomState(0) 4
----> 5 X_full, y_full = df(return_X_y=True)
6 # ~2k samples is enough for the purpose of the example.
7 # Remove the following two lines for a slower run with different error bars.
TypeError: 'DataFrame' object is not callable
and I don't know how to make either of these errors go away.
I hope I have explained my problem well, because my English is not very good.
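One possible reading of the errors above: fetch_california_housing is a function, so it can be called with return_X_y=True, whereas an array or DataFrame loaded from a CSV is already data and cannot be called. A minimal sketch, assuming the first two columns are the features and the third is the target (as in the features/targets lines of the question), with a small in-memory CSV standing in for df.csv:

```python
import io
import numpy as np

# Hypothetical stand-in for 'documents/datasets/df.csv':
# two feature columns and one target column, with one missing value.
csv_text = "1.0,2.0,3.0\n4.0,,6.0\n7.0,8.0,9.0\n10.0,11.0,12.0\n"
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# Slice the array instead of calling it; data(return_X_y=True) raises TypeError
# because a numpy array is not a function.
X_full = data[:, :2]
y_full = data[:, 2]

n_samples, n_features = X_full.shape
print(n_samples, n_features)
print("missing values in X:", np.isnan(X_full).sum())
```

With X_full and y_full defined this way, the rest of the original example should be able to proceed unchanged, although that has not been checked against the real file.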
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
import statsmodels.api as sm
import scipy
import scipy.stats
import seaborn as sns
import numpy.random as npr
import math
from scipy.stats import norm
import sqlite3 as sql
import seaborn
from numba import jit, prange
df = pd.read_csv('ODI-2021.edited.csv')
df.info()
sr_targets = pd.Series(df['What is your stress level (0-100)?'])
sr_targets.describe()
df_features = df.drop('What is your stress level (0-100)?', axis=1)
print (df_features)
df_features.describe()
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
df.isnull().sum()
df_missing = df.dropna()
df_missing.shape
X = df["What is your stress level (0-100)?"]
y = df["Time you went to be Yesterday"]
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
model.summary()
-AttributeError: 'LinearRegression' object has no attribute 'summary'
from sklearn import preprocessing
def convert(df):
    number = preprocessing.LabelEncoder()
    data['Date'] = number.fit_transform(df['Date'])
    data=data.fillna(-999)
    return data
model = LinearRegression(fit_intercept=True)
result = model.fit(df_features, sr_targets)
-ValueError: could not convert string to float: '3/16/2021'
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
-NameError: name 'result' is not defined
sr_endog = sr_targets.copy()
df_exog = sm.add_constant(df_features)
model = sm.OLS(sr_endog, df_exog)
result = model.fit()
-ValueError: Pandas data cast to numpy dtype of object. Check input data with
np.asarray(data)
result.summary()
-NameError: name 'result' is not defined
cross validation
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
nb_folds = 10
cv = KFold(n_splits=nb_folds)
model = LassoCV(fit_intercept=True, cv=cv, n_alphas=200, max_iter=2000)
result = model.fit(df_features_rescaled, sr_targets)
-NameError: name 'df_features_rescaled' is not defined
fig = plt.figure(figsize=[16,15])
xvalues = np.log10(result.alphas_)
rmse_path = np.sqrt(result.mse_path_)
for k in range (nb_folds):
    yvalues = rmse_path[:,k]
    plt.plot(xvalues, yvalues)
    pos_ymin = yvalues.argmin()
    plt.plot(xvalues[pos_ymin], yvalues[pos_ymin], marker='o')
plt.axvline(np.log10(result.alpha_))
plt.title('RMSE for different alpha', fontsize=20)
plt.grid()
-NameError: name 'result' is not defined
sr_coef = pd.Series(result.coef_, index=df_features.columns)
sr_coef
This is my code, and these are the errors I'm getting. Could someone help me with what I'm doing wrong? I have looked up the errors and have no clue how to fix them. My data set has numbers, but also dates and answers such as yes/no and education-level responses (for example, university level), which I have no idea how to convert to float. I have been trying to run a regression with two columns that consist of numbers, and I get these errors. For the cross validation I am dropping one column and using the rest, and I get an error saying that I haven't defined the variable result, even though I have. I'm clueless.
Thanks in advance!
Each of your errors means something. Learning to read the errors is extremely important in understanding what is going on. For example,
est = sm.OLS(y, X.astype(float)).fit()
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
-ValueError: could not convert string to float: 'over 9000'
This suggests that someone placed the phrase "over 9000" in a cell of the CSV file you are opening, so Python has trouble converting it to a float. The same thing would happen if you tried to run
float("over 9000")
It appears the data needs to be cleaned up a bit before it can be used by statsmodels (sm). Python is telling you the same thing here:
-ValueError: could not convert string to float: '3/16/2021'
The string "3/16/2021" contains characters that are not part of a float, namely the "/" symbol.
I think it would be helpful if you broke up your errors and concerns into separate questions, so that people could tackle them one at a time for you.
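As an illustration of the kind of clean-up described above, here is a minimal sketch. The tiny DataFrame and its column names (stress, date) are made up for the example and only mimic the values from the error messages; the real survey columns will differ.

```python
import pandas as pd

# Hypothetical miniature of the survey data; column names are illustrative only.
df = pd.DataFrame({
    "stress": ["50", "over 9000", "20"],
    "date": ["3/16/2021", "3/17/2021", "3/18/2021"],
})

# float("over 9000") raises ValueError; errors="coerce" turns such cells into NaN instead.
df["stress"] = pd.to_numeric(df["stress"], errors="coerce")

# Parse the date strings, then derive a numeric feature that a regressor can use.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
df["day_of_year"] = df["date"].dt.dayofyear

# Drop the rows whose stress value could not be parsed.
clean = df.dropna(subset=["stress"])
print(clean[["stress", "day_of_year"]])
```

Whether to drop or impute the unparseable rows is a modelling choice; dropping is used here only because it is the shortest option.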
I'm getting the following error from my code:
ValueError: Expected 2D array, got scalar array instead:
array=99.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Here is the code used:
#importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
Physical_activity_df = pd.read_excel('C:/Users/Usuario/Desktop/LW_docs/Physical_activity_nopass.xlsx')
prediction_df = Physical_activity_df[['Activity_Score','Calories']]
prediction_df.plot(kind='scatter', x= 'Activity_Score', y= 'Calories')
plt.show()
#change to df variables
activity_score = pd.DataFrame(prediction_df['Activity_Score'])
calories = pd.DataFrame(prediction_df['Calories'])
lm = linear_model.LinearRegression()
model = lm.fit(activity_score,calories)
#predict new values for calories (FROM HERE COMES THE ERROR)
activity_score_new = 99
calories_predict = model.predict(activity_score_new)
calories_predict
Any idea about how to fix this issue? Thanks!
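The error message itself points at the cause: predict expects a 2D input (one row per sample, one column per feature), while 99 is a scalar. A minimal sketch of one way to pass a single new value, assuming made-up data in place of the Excel file:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data standing in for Physical_activity_nopass.xlsx.
prediction_df = pd.DataFrame({
    "Activity_Score": [10, 20, 30, 40],
    "Calories": [100, 200, 300, 400],
})

X = prediction_df[["Activity_Score"]]  # 2D: shape (n_samples, 1)
y = prediction_df["Calories"]          # 1D target

model = LinearRegression().fit(X, y)

# Wrap the new value in a one-row, one-column frame so it is 2D like the training data.
activity_score_new = pd.DataFrame({"Activity_Score": [99]})
calories_predict = model.predict(activity_score_new)
print(calories_predict)
```

Using a DataFrame with the same column name as the training data keeps the feature names consistent; a 2D array built with reshape(1, -1), as the error message suggests, is another option.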
So I was trying to use an Extra Trees classifier to find the feature importances in my database. I wrote this simple code, but for some reason I keep getting this error.
My Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import ExtraTreesClassifier
df = pd.read_csv('C:\\Users\\ali97\\Desktop\\Project\\Database\\5-FINAL2\\Final After Simple Filtering.csv')
extra_tree_forest = ExtraTreesClassifier(n_estimators = 5, criterion ='entropy', max_features = 2)
extra_tree_forest.fit(df)
feature_importance = extra_tree_forest.feature_importances_
feature_importance_normalized = np.std([tree.feature_importances_ for tree in extra_tree_forest.estimators_], axis = 1)
plt.bar(X.columns, feature_importance_normalized)
plt.xlabel('Lbale')
plt.ylabel('Feature Importance')
plt.title('Parameters Importance')
plt.show()
The Error:
TypeError Traceback (most recent call last)
<ipython-input-7-4aad8882ce6d> in <module>
16 extra_tree_forest = ExtraTreesClassifier(n_estimators = 5, criterion ='entropy', max_features = 2)
17
---> 18 extra_tree_forest.fit(df)
19
20 feature_importance = extra_tree_forest.feature_importances_
TypeError: fit() missing 1 required positional argument: 'y'
Thank you
Usually, the fit function needs both the attributes (X) and the labels (Y), so you need to call extra_tree_forest.fit(X, Y) to train this classifier.
I recommend splitting the labels and attributes into two separate objects when you import Final After Simple Filtering.csv.
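A minimal sketch of that split, assuming a made-up DataFrame whose target column is called label (the real column name in the CSV is not known):

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical data; "label" is an assumed name for the target column.
df = pd.DataFrame({
    "f1": [0, 1, 0, 1, 0, 1, 0, 1],
    "f2": [5, 3, 6, 2, 7, 1, 8, 0],
    "f3": [1, 1, 2, 2, 3, 3, 4, 4],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns="label")  # attributes
y = df["label"]               # labels

forest = ExtraTreesClassifier(n_estimators=5, criterion="entropy",
                              max_features=2, random_state=0)
forest.fit(X, y)  # fit needs both X and y

# One importance value per feature column.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances)
```

The importances series can then be passed to plt.bar(importances.index, importances.values) in place of the X.columns call from the question.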
This code is for data preprocessing, which I am learning in an online ML course.
import numpy as np
import matplotlib.pyplot as plt #pyplot is a sublibrary of matplotlib
import pandas as pd
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1]
Y = dataset.iloc[:,-1]
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan,strategy = 'mean',verbose = 0)
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
But it gives this error: TypeError: unhashable type: 'slice'.
Please help me with this.
X is a DataFrame, so you can't index it like X[:,1:3]; you should use iloc.
Try this:
imputer = imputer.fit(X.iloc[:,1:3])
X.iloc[:,1:3] = imputer.transform(X.iloc[:,1:3])
I would also advise using sklearn.pipeline.Pipeline and sklearn.compose.ColumnTransformer to perform these preprocessing transformations if your final goal is to predict: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py
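A rough sketch of what such a pipeline could look like. The column names (Country, Age, Salary) and values are assumptions, since the contents of Data.csv are not shown in the question:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the feature columns of Data.csv.
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, np.nan, 30.0],
    "Salary": [72000.0, 48000.0, np.nan],
})

# Impute the numeric columns with their mean and one-hot encode the text column.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["Age", "Salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])

# An estimator could be appended as a final step of this pipeline.
pipeline = Pipeline([("preprocess", preprocess)])
X_ready = pipeline.fit_transform(X)
print(X_ready)
```

Selecting columns by name inside the ColumnTransformer avoids the positional slicing that caused the original error.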
I'm using the Titanic data set to predict whether a passenger survived, using a random forest. This is my code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()
data1.isnull().any()
pd.get_dummies(data.Sex)
# choosing the predictive variables
x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
modelrandom=cross_validation.cross_val_score(modelrandom,x,y,cv=5)
But I keep getting this error:
ValueError: could not convert string to float: 'female'
and I don't understand what the problem is, because I changed the Sex feature to a dummy variable.
Thanks:)
pd.get_dummies returns a new DataFrame and does not operate in place. Therefore you really are passing strings in the Sex column.
So you would need something like X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass']), and this should fix your problem. I think PClass is also a string column that needs dummy variables, since you fill it with '3rd'.
There are also several places where you call data.isnull().any(), which does nothing to the underlying DataFrame. I left those as they were, but just FYI, they may not be doing what you intended.
Full code would be:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
import matplotlib.pyplot as plt
%matplotlib inline
data=pd.read_csv("C:\\Users\\kabala\\Downloads\\Titanic.csv")
data.isnull().any()  # <----- Beware: this is not doing anything to the data
data["Age"]=data1["Age"].fillna(data1["Age"].median())
data["PClass"]=data["PClass"].fillna("3rd")
data["PClass"].isnull().any()  # <----- Beware: this is not doing anything to the data
data1.isnull().any()  # <----- Beware: this is not doing anything to the data
#********Fix for your code*******
X = pd.get_dummies(data[['Sex','PClass','Age']], columns=['Sex','PClass'])
# choosing the predictive variables
# x=data[["PClass","Age","Sex"]]
# the target variable is y
y=data["Survived"]
modelrandom=RandomForestClassifier(max_depth=3)
# pass the dummy-encoded X (capital), not the original x
modelrandom=cross_validation.cross_val_score(modelrandom,X,y,cv=5)
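For readers on newer scikit-learn versions, here is a self-contained sketch of the same idea. It rests on two assumptions: the rows below are made up to stand in for Titanic.csv, and cross_val_score is imported from sklearn.model_selection, because the sklearn.cross_validation module was removed in scikit-learn 0.20.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical rows standing in for Titanic.csv.
data = pd.DataFrame({
    "PClass": ["1st", "2nd", "3rd", None, "1st", "3rd", "2nd", "3rd", "1st", "2nd"],
    "Age": [29.0, None, 25.0, 40.0, 35.0, None, 18.0, 50.0, 22.0, 31.0],
    "Sex": ["female", "male", "male", "female", "male",
            "female", "male", "female", "female", "male"],
    "Survived": [1, 0, 0, 1, 0, 1, 0, 1, 1, 0],
})

# Fill the missing values, assigning the results back to the frame.
data["Age"] = data["Age"].fillna(data["Age"].median())
data["PClass"] = data["PClass"].fillna("3rd")

# get_dummies returns a new frame; keep the result instead of discarding it.
X = pd.get_dummies(data[["Sex", "PClass", "Age"]], columns=["Sex", "PClass"])
y = data["Survived"]

model = RandomForestClassifier(max_depth=3, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(scores)
```

Storing the result in scores rather than overwriting the model variable keeps the classifier available for a later fit.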