I'm a beginner in Python and machine learning. I get the error below when I try to fit data with statsmodels.formula.api OLS.fit():
Traceback (most recent call last):
File "", line 47, in
regressor_OLS = sm.OLS(y , X_opt).fit()
File "E:\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in fit
self.pinv_wexog, singular_values = pinv_extended(self.wexog)
File "E:\Anaconda\lib\site-packages\statsmodels\tools\tools.py", line 342, in pinv_extended
u, s, vt = np.linalg.svd(X, 0)
File "E:\Anaconda\lib\site-packages\numpy\linalg\linalg.py", line 1404, in svd
u, s, vt = gufunc(a, signature=signature, extobj=extobj)
TypeError: No loop matching the specified signature and casting was found for ufunc svd_n_s
code
#Importing Libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt #Visualization
#Importing the dataset
dataset = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
#dataset.head(10)
#Encoding categorical data using the pandas get_dummies function. Easier and more straightforward than OneHotEncoder in sklearn
#dataset = pd.get_dummies(data = dataset , columns=['Platform' , 'Genre' , 'Rating' ] , drop_first = True ) #drop_first is used to avoid the dummy variable trap
dataset=dataset.replace('tbd',np.nan)
#Separating Independent & Dependent Variables
#X = pd.concat([dataset.iloc[:,[11,13]], dataset.iloc[:,13: ]] , axis=1).values #Getting important variables
X = dataset.iloc[:,[10,12]].values
y = dataset.iloc[:,9].values #Dependent Variable (Global sales)
#Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN' , strategy = 'mean' , axis = 0)
imputer = imputer.fit(X[:,0:2])
X[:,0:2] = imputer.transform(X[:,0:2])
#Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2 , random_state = 0)
#Fitting Multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
#Predicting the Test set Result
y_pred = regressor.predict(X_test)
#Building the optimal model using Backward Elimination (p=0.050)
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((16719,1)).astype(float) , values = X , axis = 1)
X_opt = X[:, [0,1,2]]
regressor_OLS = sm.OLS(y , X_opt).fit()
regressor_OLS.summary()
Dataset
dataset link
I couldn't find anything helpful for this issue on Stack Overflow or Google.
Try specifying dtype='float' when the matrix is created.
Example:
a=np.matrix([[1,2],[3,4]], dtype='float')
Hope this works!
I faced a similar problem and solved it by specifying the dtype and flattening the array.
numpy version: 1.17.3
a = np.array(a, dtype=float)  # plain float works in every NumPy version (the np.float alias was removed in NumPy 1.24)
a = a.flatten()
As suggested previously, you need to ensure X_opt is a float type.
For example in your code, it would look like this:
X_opt = X[:, [0,1,2]]
X_opt = X_opt.astype(float)
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
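To see why the cast matters, here is a minimal sketch with a made-up two-column frame (the column names and values are assumptions, not taken from the real CSV). A column that once held the string 'tbd' stays object-typed even after the replacement with NaN, so .values yields an object array, which is the kind of input np.linalg.svd rejects. Casting with astype(float) produces a float64 matrix instead.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset. The second column mimics what is left
# after replace('tbd', np.nan): strings plus NaN, stored with object dtype.
df = pd.DataFrame({
    "Critic_Score": [76.0, 82.0, np.nan, 80.0],
    "User_Score": pd.Series(["8", np.nan, "8.3", "7.4"], dtype=object),
})

X_raw = df.values
print(X_raw.dtype)   # object

# Cast to float, then fill NaN with the column mean (a stand-in for the Imputer step).
X = X_raw.astype(float)
X = np.where(np.isnan(X), np.nanmean(X, axis=0), X)

# Prepend the intercept column, as in the question.
X_opt = np.append(np.ones((X.shape[0], 1)), X, axis=1)
print(X_opt.dtype)   # float64

# The same kind of SVD call that appears in the traceback now accepts the matrix.
u, s, vt = np.linalg.svd(X_opt, full_matrices=False)
```

Checking X.dtype right after building X is a cheap way to catch this before the model is fitted.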
I was facing a similar problem. I had used df.values[]:
y = df.values[:, 4]
I fixed the issue by using df.iloc[].values instead:
y = dataset.iloc[:, 4].values
df.values[] returns an array with the object datatype (the DataFrame mixes column types, so .values falls back to object):
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6, 152211.77, 149759.96, 146121.95, 144259.4,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9, 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8, 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4], dtype=object)
but df.iloc[:, 4].values returns a float array, which is what OLS() accepts:
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
Alternatively, you can change the datatype of y before passing it to OLS():
y = np.array(y, dtype = float)
Downgrading from NumPy 1.18.4 to 1.15.2 worked for me:
pip install --upgrade numpy==1.15.2
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
dataset = pd.read_csv('C:/Users/seemarahul/Downloads/adult-1.csv')
X = dataset.iloc[:,0:15].values
Y = dataset['income']
y_train: object
X_train,X_test,y_train,y_test= train_test_split(X,Y,shuffle=True,test_size=0.3)
lin = LinearRegression()
lin.fit(X_train,y_train)
y_pred = lin.predict(X_test)
coef = lin.coef_
components = pd.DataFrame(zip(X.columns,coef), columns=['component','value'])
components = components.append({'components':'intercept','value':lin.intercept_}, ignore_index=True )
This is my code. It raises an error and I get redirected to base.py.
This line raises the error:
lin.fit(X_train,y_train)
I have tried multiple ways of adding data to the X and Y variables, but none of them works.
The image is of the traceback error.
Look at the error. It tells you that the value 'Private' exists in either your X or your Y variable. Based on the fifth line of the traceback, it looks like it is in X.
'Private' is a string and cannot be cast to a float, so it raises the error.
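One common way around this is to turn string columns into numeric indicator columns before fitting. The sketch below is only an illustration: the rows, the column names and the numeric target are made up, and pd.get_dummies is one option among several (it is not something the question's code already uses).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for a few rows of the data; 'workclass' holds strings
# such as 'Private' that cannot be cast to float.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "workclass": ["State-gov", "Private", "Private", "Self-emp", "Private", "State-gov"],
    "hours_per_week": [40, 13, 40, 45, 40, 50],
})

# Replace the string column with one 0/1 indicator column per category.
X = pd.get_dummies(df[["age", "workclass"]], columns=["workclass"], dtype=float)
y = df["hours_per_week"]  # hypothetical numeric target

model = LinearRegression().fit(X, y)
print(list(X.columns))
print(model.coef_)
```

Every column in X is numeric after this step, so fit no longer has to convert a string.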
Good morning! I'm new to Python, and I use Spyder 4.0 to build a neural network.
In the script below I use a random forest to compute feature importances. The values in importances tell me how important each feature is. Unfortunately I can't upload the dataset, but I can tell you that there are 18 features and 1 label, both are physical quantities, and it's a regression problem.
I want to export the variable importances to an Excel file, but when I do it (simply copying the vector) the numbers use a dot as the decimal separator (e.g. 0.012, 0.015, etc.). To use them in the Excel file I would prefer a comma instead of the dot.
I tried to use .replace('.',',') but it doesn't work. The error is:
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
I think this happens because the vector importances is an array of float64 with shape (18,).
What can I do?
Thanks.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
dataset = pd.read_csv('Dataset.csv', decimal=',', delimiter = ";")
label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)
def denormalize(y):
final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
return final_value
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = 0.20, shuffle = True)
y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)
scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)
sel = RandomForestRegressor(n_estimators = 200,max_depth = 9, max_features = 5, min_samples_leaf = 1, min_samples_split = 2,bootstrap = False)
sel.fit(X_train, y_train)
importances = sel.feature_importances_
# sel.fit(X_train, y_train)
# a = []
# for feature_list_index in sel.get_support(indices=True):
# a.append(feat_labels[feature_list_index])
# print(feat_labels[feature_list_index])
# X_important_train = sel.transform(X_train1)
# X_important_test = sel.transform(X_test1)
I will try to show you an example of what you should do, using some random values. I ran this in the Python shell, which is why you also see the ">>>" prompts.
>>> import numpy as np # first I import numpy as "np"
# I generate 10 random values and I store them in "importance"
>>> importance=np.random.rand(10)
# here I just want to see the content of "importance"
>>> importance
array([0.77609076, 0.97746829, 0.56946118, 0.23986983, 0.93655692,
0.22003531, 0.7711095 , 0.36083248, 0.58277805, 0.57865248])
# here is your error, which I reproduce for teaching purposes
>>> importance.replace(".", ",")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'replace'
What you need to do is convert the elements of "importance" to a list of strings:
>>> imp_astr=[str(i) for i in importance]
>>> imp_astr
['0.7760907642658763', '0.9774682868805988', '0.569461184647781', '0.23986982589422634', '0.9365569207431337', '0.22003531170279356', '0.7711094966708247', '0.3608324767276052', '0.5827780487688116', '0.5786524781334242']
# at the end, for each string, you can use the "replace" function
>>> imp_astr=[i.replace(".", ",") for i in imp_astr]
>>> imp_astr
['0,7760907642658763', '0,9774682868805988', '0,569461184647781', '0,23986982589422634', '0,9365569207431337', '0,22003531170279356', '0,7711094966708247', '0,3608324767276052', '0,5827780487688116', '0,5786524781334242']
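If the goal is a file that Excel opens with comma decimals, another option is to let pandas write the separator for you. The sketch below assumes made-up importance values and feature names, and writes to an in-memory buffer rather than a real file. It relies on the decimal and sep parameters of DataFrame.to_csv, mirroring the decimal=',' and delimiter=';' already used with read_csv in the question.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical values standing in for sel.feature_importances_.
importances = np.array([0.012, 0.015, 0.25])
feature_names = ["f1", "f2", "f3"]  # hypothetical feature names

table = pd.DataFrame({"feature": feature_names, "importance": importances})

# Write semicolon-separated text with a comma as the decimal mark.
buffer = io.StringIO()
table.to_csv(buffer, sep=";", decimal=",", index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name instead of the buffer writes the same text to disk. The numbers stay floats in Python and only the written text uses the comma.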
Whenever I try to predict, I get an error.
I am stuck at the line y_pred = regressor.predict(6.5) in the code.
I am getting the error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
spyder
# SVR
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
Error: y_pred = regressor.predict(sc_X.transform(6.5))
Traceback (most recent call last):
File "<ipython-input-11-64bf1bca4870>", line 1, in <module>
y_pred = regressor.predict(sc_X.transform(6.5))
File "C:\Users\achiever\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 758, in transform
force_all_finite='allow-nan')
File "C:\Users\achiever\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 514, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got scalar array instead: array=6.5. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Well, regressor.predict() expects a list/array of values to make predictions for, and you're passing it a single float, so it won't work:
# Predicting a new result
y_pred = regressor.predict(6.5)
At the very least, pass a 2D array:
# Predicting a new result
y_pred = regressor.predict(np.array([[6.5]]))
But presumably you have more stuff you want to pass to it, so more like:
# Predicting a new result
y_pred = regressor.predict(some_data_array)
EDIT:
You need to arrange the 2D array you pass to the predictor so that it looks like this:
data = [[1,0,0,1],[0,1,12,5],....]
where [1,0,0,1] is ONE set of parameters for ONE data point for which you want a prediction, and [0,1,12,5] is ANOTHER data point.
At any rate, they should all have the same number of features (e.g. 4 in my example), and that number should match the number of features in the data you used to train your predictor.
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform(np.array([[6.5]]))).reshape(-1, 1))
Use the reshape function:
sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(1,-1))
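Putting the answers together, the whole flow can be sketched as follows. The level and salary numbers are invented stand-ins for Position_Salaries.csv, and the reshape calls are there because the scalers expect 2D input. Treat this as an illustration under those assumptions rather than the one correct setup.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Hypothetical data: position level -> salary.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)   # shape (10, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000], dtype=float) * 1000

sc_X = StandardScaler()
sc_y = StandardScaler()
X_scaled = sc_X.fit_transform(X)
# The scaler needs a 2D column; SVR wants a 1D target, hence ravel().
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel="rbf")
regressor.fit(X_scaled, y_scaled)

# One sample with one feature -> shape (1, 1), scaled like the training data.
x_new = sc_X.transform(np.array([[6.5]]))
y_pred_scaled = regressor.predict(x_new)            # shape (1,)
y_pred = sc_y.inverse_transform(y_pred_scaled.reshape(-1, 1))
print(y_pred)
```

The double brackets in [[6.5]] are what make the input 2D: the outer list is the set of samples, the inner list is the features of one sample.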
I'm using scikit's logistic regression but I keep getting the message:
Found input variables with inconsistent numbers of samples: [90000, 5625]
In the code below, I've removed the columns with text in them and then I've split the data into a training and testing set.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
dataset = pd.read_csv("/Users/An/Desktop/data/telco.csv", na_values = ' ')
dataset = dataset.dropna(axis = 0)
dataset = dataset.replace({'Yes':1, 'Fiber optic': 1, 'DSL':1, 'No':0, 'No phone service':0, 'No internet service':0})
dataset = dataset.drop('Contract', axis =1)
dataset = dataset.drop('PaymentMethod',axis =1)
dataset = dataset.drop('customerID',axis =1)
dataset = dataset.drop('gender',axis =1)
for i in list(['tenure', 'MonthlyCharges', 'TotalCharges']):
sd = np.std(dataset[i])
mean = np.mean(dataset[i])
dataset[i] = (dataset[i] - mean) / sd
total = pd.DataFrame(dataset)
data_train, data_test = train_test_split(total, test_size=0.2)
data_train = data_train.values
data_test = data_test.values
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1e9)
clf = clf.fit(data_train[:,0:16], data_train[:,16])
print(clf.intercept_, clf.coef_)
Could someone please explain what the error message means and help me figure out why I'm getting it?
In the second-to-last line, data_train.reshape(-1, 1) is causing your problem. Removing the reshape fixes it.
Reason
LogisticRegression.fit expects x and y to have the same shape[0], but you are reshaping your x from (n, m) to (n*m, 1).
Here are the reproduced shapes:
import numpy as np
df = np.ndarray((2000,10))
x, y = df[:, 2:9], df[:, 9]
x.shape, y.shape # << what you should give to `clf.fit`
# ((2000, 7), (2000, ))
x.reshape(-1, 1).shape, y.shape # << what you ARE giving to `clf.fit`,
# ((14000, 1), (2000,)) # << which is causing the problem
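The same mismatch can be reproduced end to end with a fitted model. This sketch uses randomly generated data (an assumption, not the telco dataset) and shows that the correctly shaped arrays fit, while the reshaped X triggers the "inconsistent numbers of samples" error.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 200 samples, 4 features, a binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

# Rows of X and entries of y line up, so this fits.
assert X.shape[0] == y.shape[0]
clf = LogisticRegression().fit(X, y)

# Flattening X into a single column multiplies the number of rows by 4.
X_bad = X.reshape(-1, 1)
message = ""
try:
    clf.fit(X_bad, y)
except ValueError as exc:
    message = str(exc)
print(message)
```

Comparing X.shape[0] with y.shape[0] just before calling fit is usually the quickest way to locate where the two got out of step.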
I was just trying out data preprocessing, where I frequently get this error. Can anyone explain to me what is wrong in this particular code for the given dataset?
Thanks in advance!
# STEP 1: IMPORTING THE LIBRARIES
import numpy as np
import pandas as pd
# STEP 2: IMPORTING THE DATASET
dataset = pd.read_csv("https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/datasets/Data.csv", error_bad_lines=False)
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:,1:3].values
# STEP 3: HANDLING THE MISSING VALUES
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN",strategy = "mean",axis = 0)
imputer = imputer.fit(X[ : , 1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
# STEP 4: ENCODING CATEGORICAL DATA
from sklearn.preprocessing import LaberEncoder,OneHotEncoder
labelencoder_X = LabelEncoder() # Encode labels with value between 0 and n_classes-1.
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0]) # All the rows and first columns
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
# Step 5: Splitting the datasets into training sets and Test sets
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split( X , Y , test_size = 0.2, random_state = 0)
# Step 6: Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.fit_transform(X_test)
Returns Error:
ValueError: Found array with 0 feature(s) (shape=(546, 0)) while a minimum of 1 is required.
Your link in this line
dataset = pd.read_csv("https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/datasets/Data.csv", error_bad_lines=False)
is wrong.
The current link returns the webpage on github where this csv is shown, but not the actual csv data. So whatever data is present in dataset is invalid.
Change that to:
dataset = pd.read_csv("https://raw.githubusercontent.com/Avik-Jain/100-Days-Of-ML-Code/master/datasets/Data.csv", error_bad_lines=False)
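The difference between the two links is mechanical, so it can be sketched as a small helper. The function name to_raw_github_url is made up for this illustration, and the rewrite rule is only assumed to hold for links of the github.com/<user>/<repo>/blob/<branch>/<path> form shown above.

```python
def to_raw_github_url(page_url: str) -> str:
    """Turn a GitHub 'blob' page link into the matching raw-file link.

    Hypothetical helper: it only handles URLs shaped like
    https://github.com/<user>/<repo>/blob/<branch>/<path>.
    """
    prefix = "https://github.com/"
    if not page_url.startswith(prefix) or "/blob/" not in page_url:
        raise ValueError("not a GitHub blob URL: " + page_url)
    rest = page_url[len(prefix):]
    # Drop the first '/blob/' segment and switch to the raw-content host.
    return "https://raw.githubusercontent.com/" + rest.replace("/blob/", "/", 1)


page = "https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/datasets/Data.csv"
raw = to_raw_github_url(page)
print(raw)
```

The printed link is the one to hand to pd.read_csv. The page link returns HTML, which pandas then tries to parse as if it were CSV.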
Other than that, there is a spelling mistake in LabelEncoder import.
Now even if you correct these, there will still be errors, because of
Y = labelencoder_Y.fit_transform(Y)
LabelEncoder only accepts a single column array as input, but your current Y will be of 2 columns due to
Y = dataset.iloc[:,1:3].values
Please explain more clearly what you want to do.
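For reference, this is roughly what LabelEncoder expects. The labels below are invented, and the assumption is that the target is meant to be a single column (for example dataset.iloc[:, -1]) rather than the two-column slice.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical single label column, i.e. a 1D array.
Y = np.array(["No", "Yes", "No", "Yes"])

labelencoder_Y = LabelEncoder()
Y_encoded = labelencoder_Y.fit_transform(Y)

print(labelencoder_Y.classes_)  # classes are stored in sorted order
print(Y_encoded)                # one integer per label
```

A slice such as iloc[:, 1:3] has shape (n, 2), which is why it has to be narrowed to one column before it is passed to fit_transform.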