I am learning scikit-learn and have been practicing with it, but when I started with KNN I ran into a problem. Here is my code:
# import required packages
import numpy as np # later use
import pandas as pd
from sklearn import neighbors, metrics # later use
from sklearn.model_selection import train_test_split # later use
from sklearn.preprocessing import LabelEncoder
# start training
data = pd.read_csv('data/car.data')
X = data[[
'buying',
'maint',
'safety'
]]
y = data[['class']]
Le = LabelEncoder()
for i in range(len[X[0]]):
X[:, i] = Le.fit_transform(X[:, i])
After running the debugger, the problem seems to be at the line X = data[[
and I have no idea how to solve it.
I once saw the same error before, and that time I just added .values to the end of the variable X.
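A minimal sketch of the encoding loop, assuming the file really has columns named buying, maint, safety and class (if car.data has no header row, selecting columns by name raises a KeyError). The small DataFrame below is a hypothetical stand-in for data/car.data. Converting to a NumPy array with .values is what makes the X[:, i] indexing valid, and len needs parentheses rather than square brackets.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical in-memory stand-in for data/car.data (values are made up)
data = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["low", "med", "high", "vhigh"],
    "safety": ["low", "med", "high", "med"],
    "class": ["unacc", "acc", "good", "vgood"],
})

# .values turns the DataFrame into an ndarray, so X[:, i] is valid indexing
X = data[["buying", "maint", "safety"]].values
y = data["class"].values

encoder = LabelEncoder()
for i in range(X.shape[1]):  # len(X[0]) would also work here
    # Refit the encoder on each column and overwrite it with integer codes
    X[:, i] = encoder.fit_transform(X[:, i])

# The array still has object dtype after the loop; cast it to integers
X = X.astype(int)
```

Reusing one encoder for every column keeps the loop short, but the fitted mapping of earlier columns is lost; keeping one encoder per column would be needed to decode them later.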
Here is the problem:
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. The following is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value-
Final.Pred)/Final.median_house_value
Here is my dataset: https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to reshape my 1-D data into a 2-D array as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_test.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so an explanation would be really helpful.
x1_train and x1_test are pandas Series objects, whereas the reshape() method belongs to NumPy arrays.
Convert each Series to an array first:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
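To make the shape change concrete, here is a small sketch with a made-up Series standing in for x_train.median_income. It shows the to_numpy().reshape(-1, 1) route and, as an alternative, to_frame(), which keeps the data in pandas and also yields the single-column 2-D shape that scikit-learn estimators expect for features.

```python
import pandas as pd

# Hypothetical values standing in for x_train.median_income
x1 = pd.Series([8.3252, 7.2574, 5.6431], name="median_income")

# A Series is one-dimensional
series_shape = x1.shape                      # (3,)

# Option 1: convert to a NumPy array, then reshape to a single column
as_array = x1.to_numpy().reshape(-1, 1)      # shape (3, 1)

# Option 2: stay in pandas and turn the Series into a one-column DataFrame
as_frame = x1.to_frame()                     # shape (3, 1)
```

Either object can be passed as the feature matrix to LinearRegression().fit; the target y_train can stay one-dimensional.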
I'm trying to fit the dataset to a logistic regression model, but I'm getting the error below:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64')
I've tried filling the missing values of the Age column before fitting the model, but it still isn't working. Note: I am using Python 3.7.1.
train = pd.read_csv('titanic_train.csv')
X = train.drop('Survived',axis=1)
y = train['Survived']
from sklearn.model_selection import train_test_split
train['Age'].isnull().values.any()
train['Age'].fillna(train['Age'].mean())
X_train, X_test, y_train,y_test = train_test_split(train.drop('Survived',axis=1),train['Survived'],test_size=0.3,random_state=101)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
The model should fit and we should be able to get the confusion matrix
The reason is this line:
train['Age'].fillna(train['Age'].mean())
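Most pandas methods return a new object; they do not modify the object they are called on unless you explicitly tell them to. The line above therefore computes the filled column and discards it. You need to do one of the following:
Set inplace=True (recent pandas versions warn about this chained form, and with Copy-on-Write enabled it does not update train, so reassignment is the safer choice):
train['Age'].fillna(train['Age'].mean(), inplace=True)
Reassign:
train['Age'] = train['Age'].fillna(train['Age'].mean())
Note that doing both will not work, because fillna returns None when inplace=True.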
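The difference is easy to check on a tiny frame. The sketch below uses a made-up three-row Age column rather than the Titanic file; it calls fillna without keeping the result, confirms the NaN is still there, and then reassigns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Age column, with one missing value
train = pd.DataFrame({"Age": [22.0, np.nan, 30.0]})

# The filled Series is returned and immediately thrown away
train["Age"].fillna(train["Age"].mean())
still_missing = bool(train["Age"].isnull().any())   # the NaN is still present

# Reassigning stores the filled column back into the DataFrame
train["Age"] = train["Age"].fillna(train["Age"].mean())
missing_after = bool(train["Age"].isnull().any())
```

Note that the fit in the question can still fail after this step if other columns contain NaN or non-numeric values, since the error message does not name the offending column.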
I'm a beginner in Python and machine learning. I get the error below when I try to fit data using statsmodels.formula.api OLS.fit():
Traceback (most recent call last):
File "", line 47, in
regressor_OLS = sm.OLS(y , X_opt).fit()
File "E:\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 190, in fit
self.pinv_wexog, singular_values = pinv_extended(self.wexog)
File "E:\Anaconda\lib\site-packages\statsmodels\tools\tools.py", line 342, in pinv_extended
u, s, vt = np.linalg.svd(X, 0)
File "E:\Anaconda\lib\site-packages\numpy\linalg\linalg.py", line 1404, in svd
u, s, vt = gufunc(a, signature=signature, extobj=extobj)
TypeError: No loop matching the specified signature and casting was found for ufunc svd_n_s
Code:
#Importing Libraries
import numpy as np # linear algebra
import pandas as pd # data processing
import matplotlib.pyplot as plt #Visualization
#Importing the dataset
dataset = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
#dataset.head(10)
#Encoding categorical data using the pandas get_dummies function. Easier and more straightforward than OneHotEncoder in sklearn
#dataset = pd.get_dummies(data = dataset , columns=['Platform' , 'Genre' , 'Rating' ] , drop_first = True ) #drop_first is used to avoid the dummy variable trap
dataset=dataset.replace('tbd',np.nan)
#Separating independent & dependent variables
#X = pd.concat([dataset.iloc[:,[11,13]], dataset.iloc[:,13: ]] , axis=1).values #Getting important variables
X = dataset.iloc[:,[10,12]].values
y = dataset.iloc[:,9].values #Dependent variable (Global sales)
#Taking care of missing data
from sklearn.impute import SimpleImputer #Imputer is no longer available in sklearn.preprocessing in current scikit-learn releases
imputer = SimpleImputer(missing_values = np.nan , strategy = 'mean')
imputer = imputer.fit(X[:,0:2])
X[:,0:2] = imputer.transform(X[:,0:2])
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split #sklearn.cross_validation is no longer available in current scikit-learn releases
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2 , random_state = 0)
#Fitting Multiple Linear Regression to the Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)
#Predicting the Test set Result
y_pred = regressor.predict(X_test)
#Building the optimal model using Backward Elimination (p=0.050)
import statsmodels.formula.api as sm
X = np.append(arr = np.ones((16719,1)).astype(float) , values = X , axis = 1)
X_opt = X[:, [0,1,2]]
regressor_OLS = sm.OLS(y , X_opt).fit()
regressor_OLS.summary()
Dataset
dataset link
I couldn't find anything helpful for this issue on Stack Overflow or Google.
Try specifying
dtype = 'float'
when the matrix is created.
Example:
a=np.matrix([[1,2],[3,4]], dtype='float')
Hope this works!
I faced a similar problem and solved it by specifying the dtype and flattening the array.
numpy version: 1.17.3
a = np.array(a, dtype=float)  # builtin float; the np.float alias no longer exists in NumPy 1.24+
a = a.flatten()
As suggested previously, you need to ensure X_opt is a float type.
For example in your code, it would look like this:
X_opt = X[:, [0,1,2]]
X_opt = X_opt.astype(float)
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
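The dtype issue can be seen without statsmodels at all, since the traceback ends in np.linalg.svd. The sketch below builds a small, made-up design matrix with object dtype (the kind of array you get from .values on a DataFrame that contained strings such as 'tbd'), casts it with astype(float), and runs the same SVD call on the result.

```python
import numpy as np

# Hypothetical design matrix: numeric values stored in an object-dtype array
X_obj = np.array([[1.0, 2.0], [1.0, 3.5], [1.0, 5.0]], dtype=object)
obj_dtype = X_obj.dtype            # object: the linear algebra routines reject this

# Casting to float gives NumPy a dtype it has an SVD implementation for
X_float = X_obj.astype(float)

# Reduced SVD, the decomposition the traceback shows being requested
u, s, vt = np.linalg.svd(X_float, full_matrices=False)
```

The cast only succeeds when every entry is numeric or NaN; a leftover string would raise a ValueError at astype, which at least points at the real problem.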
I was facing a similar problem when I used df.values[]:
y = df.values[:, 4]
I fixed the issue by using df.iloc[].values instead:
y = dataset.iloc[:, 4].values
df.values[] returns an array of object dtype (this happens when the DataFrame mixes column types):
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6, 152211.77, 149759.96, 146121.95, 144259.4,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9, 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8, 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4], dtype=object)
but df.iloc[:, 4].values returns an array of floats, which is what OLS() accepts:
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
Alternatively, you can just change the datatype of y before passing it to OLS():
y = np.array(y, dtype = float)
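A short sketch of why the two expressions differ, using a made-up two-column frame (the column names State and Profit are assumptions for illustration). When a DataFrame mixes strings and numbers, df.values has to fall back to a single object dtype for the whole array, whereas selecting the column first keeps that column's own dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a text column with a numeric one
df = pd.DataFrame({
    "State": ["New York", "California", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39],
})

# Whole-frame conversion: one common dtype for all columns, so object
y_obj = df.values[:, 1]

# Column-first selection: the numeric column keeps its float64 dtype
y_float = df.iloc[:, 1].values

# Explicit cast, the other fix suggested above
y_cast = np.array(y_obj, dtype=float)
```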
Downgrading from NumPy 1.18.4 to 1.15.2 worked for me:
pip install --upgrade numpy==1.15.2
I have been unable to use any of the Sklearn feature extraction methods without getting the following error:
"TypeError: cannot perform reduce with flexible type"
Working from examples, the feature extraction methods appear to work only for non-classification problems. I am, of course, trying to solve a classification problem. How can I fix this?
Example code:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import random
# Load data
boston = load_boston()
X = boston["data"]
Y = boston["target"]
# Make a classification problem
classes = ['a', 'b', 'c']
Y = [random.choice(classes) for entry in Y]
# Perform feature selection
names = boston["feature_names"]
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X, Y)
print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names)))
I guess the following will solve your problem.
X = np.array(X, dtype = 'float_')
Y = np.array(Y, dtype = 'float_')
Do this before calling the fit method. You can also use int_ instead of float_; it depends entirely on the data type you need.
If your labels are strings, you can use LabelEncoder to encode them as integers.
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Y_encoded = le.fit_transform(Y)
model.fit(X, Y_encoded)
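Put together with the RFE setup from the question, the LabelEncoder approach might look like the sketch below. The feature matrix is random, made-up data (an assumption replacing the Boston dataset), and the names rng, Y_encoded and original_labels are introduced only for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))                  # hypothetical feature matrix
Y = rng.choice(["a", "b", "c"], size=30)      # string class labels

# Map each distinct string label to an integer code
le = LabelEncoder()
Y_encoded = le.fit_transform(Y)

# With numeric targets, the estimator inside RFE can be fitted
rfe = RFE(LinearRegression(), n_features_to_select=1)
rfe.fit(X, Y_encoded)

ranking = rfe.ranking_                        # rank 1 marks the selected feature
original_labels = le.inverse_transform(Y_encoded)
```

Fitting a regressor on integer class codes treats the classes as ordered numbers; for a real classification problem, passing a classifier such as LogisticRegression to RFE is usually the more natural choice.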