Expected 2D array but got scalar array instead - Python

I am getting this error
ValueError: Expected 2D array, got scalar array instead: array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
while executing this code
# SVR
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
# Load dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Fitting the SVR to the data set
regressor = SVR(kernel = 'rbf', gamma = 'auto')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)

You need to understand how SVR works. Your training data is a matrix of shape (n_samples, n_features), so the SVR operates in a feature space of n_features dimensions. It can therefore only predict values for vectors of dimension n_features, not for scalars: if your data set has 5 columns, you can predict values for an arbitrary row vector with 5 columns. On top of that, scikit-learn requires the input to predict to be a 2D array of shape (n_samples, n_features), even when n_features is 1. See the example below.
import numpy as np
from sklearn.svm import SVR
# Data: 200 instances of 5 features each
X = np.random.randint(1, 100, size=(200, 5))
y = np.random.randint(0, 2, size=200)
reg = SVR()
reg.fit(X, y)
y_test = np.array([[0, 1, 2, 3, 4]]) # Input to .predict must be 2-dimensional
reg.predict(y_test)
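
For the question's model, which was trained on a single feature, the input to predict therefore has to be a 2D array of shape (1, 1); a minimal sketch of the fix, reusing the fitted regressor from the question:
# one sample with one feature: shape (1, 1), not a bare scalar
y_pred = regressor.predict(np.array([[6.5]]))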

# Predicting a new result with Linear Regression
# (lin_reg is assumed to be a LinearRegression already fitted on the data)
X_test = np.array([[6.5]])  # one sample, one feature: shape (1, 1)
print(lin_reg.predict(X_test))
# Predicting a new result with Polynomial Regression
# (poly_reg and lin_reg_2 are assumed to be a fitted PolynomialFeatures
# transformer and a LinearRegression fitted on the transformed data)
print(lin_reg_2.predict(poly_reg.fit_transform(X_test)))

Related

How to fix the error that I got in linear regression in scikit-learn

I am new to linear regression in Python. I am using LinearRegression from scikit-learn to find the predicted value of y, here called y_new. The code below is what I have scripted so far:
import numpy as np
#creating data for the run
x=spendings = np.linspace(0,5,4000)
y=sales = np.linspace(0,0.5,4000)
#defining the training function
def train(x, y):
    from sklearn.linear_model import LinearRegression
    model = LinearRegression().fit(x, y)
    return model
model = train(x,y)
x_new = 23.0
y_new = model.predict([[x_new]])
print(y_new)
I can't get the value of the y_new due to this error message:
Expected 2D array, got 1D array instead:
array=[0.00000000e+00 1.25031258e-03 2.50062516e-03 ... 4.99749937e+00
4.99874969e+00 5.00000000e+00].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
According to the documentation, LinearRegression's fit method expects X to have shape (n_samples, n_features) and y to have shape (n_samples,) or (n_samples, n_targets).
If you check your x and y shapes, they look like this:
x=spendings = np.linspace(0,5,4000)
y=sales = np.linspace(0,0.5,4000)
print(x.shape)
print(y.shape)
(4000,)
(4000,)
As the error says, you need to reshape your x (and, to keep things consistent, y) to shape (n_samples, n_features) using arr.reshape(-1, 1).
So reshape x and y before fitting the LinearRegression:
x = x.reshape(-1,1)
y = y.reshape(-1,1)
print(x.shape)
print(y.shape)
(4000, 1)
(4000, 1)
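
Putting it together, a minimal sketch of the corrected script with the reshapes applied before fitting:
import numpy as np
from sklearn.linear_model import LinearRegression
# synthetic data as in the question, reshaped to (n_samples, n_features)
x = np.linspace(0, 5, 4000).reshape(-1, 1)
y = np.linspace(0, 0.5, 4000).reshape(-1, 1)
model = LinearRegression().fit(x, y)
# one sample with one feature: shape (1, 1)
y_new = model.predict([[23.0]])
print(y_new)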

y_pred = regressor.predict(sc_X.transform(6.5)) not working

Whenever I try to predict, I get an error.
I am stuck at the line y_pred = regressor.predict(6.5) in the code.
I am getting the error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
# SVR
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
Error: y_pred = regressor.predict(sc_X.transform(6.5))
Traceback (most recent call last):
File "<ipython-input-11-64bf1bca4870>", line 1, in <module>
y_pred = regressor.predict(sc_X.transform(6.5))
File "C:\Users\achiever\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 758, in transform
force_all_finite='allow-nan')
File "C:\Users\achiever\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 514, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got scalar array instead: array=6.5. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Well, obviously, since regressor.predict() expects a 2D array of samples to predict on, and you're passing it a single float, it won't work:
# Predicting a new result
y_pred = regressor.predict(6.5)
At the very least:
# Predicting a new result
y_pred = regressor.predict(np.array([[6.5]]))
But presumably you have more stuff you want to pass to it, so more like:
# Predicting a new result
y_pred = regressor.predict(some_data_array)
EDIT:
you need to arrange the shape of the 2d array you pass to the predictor so it looks like this:
data = [[1,0,0,1],[0,1,12,5],....]
where [1,0,0,1] is ONE set of parameters for ONE data point for which you want a prediction, and [0,1,12,5] is ANOTHER data point.
At any rate, they should all have the same number of features (e.g. 4 in my example), and it must be the same number of features as in the data you used to train your predictor.
y_pred = sc_y.inverse_transform(regressor.predict(sc_X.transform(np.array([[6.5]]))))
Use the reshape function:
sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])).reshape(1,-1))
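
Put together, a sketch of the full scaled-SVR prediction, reusing X, sc_X and sc_y from the question; note that StandardScaler itself expects 2D input, so y has to be reshaped before scaling:
# scale y as a column vector, then flatten it back to 1D for SVR
y = sc_y.fit_transform(dataset.iloc[:, 2].values.reshape(-1, 1)).ravel()
regressor = SVR(kernel='rbf')
regressor.fit(X, y)
# scale the input, predict, then undo the scaling of the output
y_scaled = regressor.predict(sc_X.transform([[6.5]]))
y_pred = sc_y.inverse_transform(y_scaled.reshape(-1, 1))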

Print predict ValueError: Expected 2D array, got 1D array instead

The error occurs in my last two lines of code.
ValueError: Expected 2D array, got 1D array instead: array=[0 1].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
%matplotlib inline
df = pd.read_csv('.......csv')
df.drop(['Company'], 1, inplace=True)
x = pd.DataFrame(df.drop(['R&D Expense'],1))
y = pd.DataFrame(df['R&D Expense'])
X_test = x.index[[0,1]]
y_test = y.index[[0,1]]
X_train = x.drop(x.index[[0,1]])
y_train = y.drop(y.index[[0,1]])
from sklearn.metrics import r2_score
def performance_metric(y_true, y_predict):
    score = r2_score(y_true, y_predict)
    return score
from sklearn.metrics import make_scorer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
def fit_model_shuffle(x, y):
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    regressor = KNeighborsRegressor()
    params = {'n_neighbors': range(3, 10)}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
    grid = grid.fit(x, y)
    return grid.best_estimator_
reg = fit_model_shuffle(X_train, y_train)
for i, y_predict in enumerate(reg.predict(X_test), 1):
    print(i, y_predict)
The error message is self-explanatory: the library expects the input to be a 2D matrix, with one pattern per row. So, if you are doing regression with just one input feature, reshape your data before passing it to the regressor:
my_data = my_data.reshape(-1, 1)
to make a 2x1 matrix.
On the other hand (unlikely here), if you have a single two-feature sample such as [0, 1], do
my_data = my_data.reshape(1, -1)
to make a 1x2 matrix.
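
A quick sketch of the two reshapes applied to the 1D array from the error message:
import numpy as np
v = np.array([0, 1])    # shape (2,): the 1D array from the error
col = v.reshape(-1, 1)  # shape (2, 1): two samples, one feature
row = v.reshape(1, -1)  # shape (1, 2): one sample, two features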

Naivebayes MultinomialNB scikit-learn/sklearn

I am building a naive Bayes classifier, following the tutorial on the scikit-learn website.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import csv
import string
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
# Importing dataset
data = pd.read_csv("test.csv", quotechar='"', delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True,error_bad_lines=False)
df2 = data.set_index("name", drop = False)
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
train, test = train_test_split(df2, test_size=0.2)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train['review'])
test_matrix = count_vect.transform(test['review'])
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
The first argument is the document-term matrix returned by the vectorizer. What should the second argument be, twenty_train.target?
Edit: data example
Name,review,rating
film1,......,1
film2,the film is....,5
film3,film about...,4
With this instruction I created a new column: if the rating is > 3, the review is positive, otherwise it is negative:
df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)
The fit method of MultinomialNB expects x and y as input.
Now, x should be the training vectors (training data) and y should be the target values.
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
In more detail:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is
the number of features.
y : array-like, shape = [n_samples]
Target values.
Note: Make sure that shape = [n_samples, n_features] and shape = [n_samples] of x and y are defined correctly. Otherwise, the fit will throw an error.
Toy example:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
newsgroups_train = fetch_20newsgroups(subset='train')
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
vectorizer = TfidfVectorizer()
# the following will be the training data
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
# this is the test data
vectors_test = vectorizer.transform(newsgroups_test.data)
clf = MultinomialNB(alpha=.01)
# the fitting is done using the TRAINING data
# Check the shapes before fitting
vectors.shape
#(2034, 34118)
newsgroups_train.target.shape
#(2034,)
# fit the model using the TRAINING data
clf.fit(vectors, newsgroups_train.target)
# the PREDICTION is done using the TEST data
pred = clf.predict(vectors_test)
EDIT:
The newsgroups_train.target is just a numpy array that contains the labels (or targets or classes).
import numpy as np
newsgroups_train.target
array([1, 3, 2, ..., 1, 0, 1])
np.unique(newsgroups_train.target)
array([0, 1, 2, 3])
So in this example we have 4 different classes/targets.
This variable is needed in order to fit a classifier.
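
Applied to the question's data, the second argument to fit is simply the sentiment column; a minimal sketch using the names from the question's code:
# x: the document-term matrix from CountVectorizer, y: the sentiment labels
clf = MultinomialNB().fit(X_train_counts, train['sentiment'])
# predict sentiment for the held-out reviews
pred = clf.predict(test_matrix)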

Found input variables with inconsistent numbers of samples when fitting LogisticRegression

I am creating LogisticRegression classifier with the following code:
regressor = LogisticRegression()
regressor.fit(x_train, y_train)
Both x_train and y_train shapes are
<class 'tuple'>: (32383,)
x_train contains values around range [0..1], and y_train contains only 0s and 1s.
Unfortunately, fit fails with error
ValueError: Found input variables with inconsistent numbers of samples: [1, 32383]
Adding transpose to arguments doesn't help.
To continue the solution that I proposed in my comment:
The problem is the shape of x_train, so we need to reshape it.
From the documentation:
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
Example using scikit-learn and numpy:
from sklearn.linear_model import LogisticRegression
import numpy as np
# create the tuple data
x_train = tuple(range(32383))
x_train = np.asarray(x_train)
# same for y_train, but with binary labels, as in the question
y_train = tuple(i % 2 for i in range(32383))
y_train = np.asarray(y_train)
# convert tuples to ndarrays and reshape x_train to (n_samples, n_features)
x_train = x_train.reshape(32383, 1)
# check that the shape of y_train is (32383,)
y_train.shape
#create the model
lg = LogisticRegression()
#Fit the model
lg.fit(x_train, y_train)
This should work fine.
Hope it helps
I guess a little reshaping is necessary. I tried it like this:
from sklearn.linear_model import LogisticRegression
import numpy as np
#x_train = np.random.randn(10,1)
x_train = np.asarray(x_train).reshape(32383,1)
con = np.ones_like(x_train)
x_train = np.concatenate((con,x_train), axis =1)
#y = np.random.randn(10,1)
#y_train = np.where(y<0.5,1,0)
y_train = np.asarray(y_train).reshape(32383,1)
regressor = LogisticRegression()
regressor.fit(x_train,y_train)
The commented-out lines are just what I did to create some test data. Note that sklearn's LogisticRegression fits an intercept by default (fit_intercept=True), so the constant column is strictly needed only for statsmodels, which does not add one for you. Statsmodels can also be helpful if you are interested in statistical tests and a pretty print of the results:
from statsmodels.api import Logit
logit =Logit(y_train, x_train)
fit= logit.fit()
fit.summary()
That will give you a little more statistical intel without much effort.
