Python Linear Regression Predict Error - Array Issue

When I try to use .predict on my linear regression, I get thrown the following error:
ValueError: Expected 2D array, got scalar array instead:
array=80.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't really understand reshape or why it's needed. Can somebody please explain what it does, and how to apply it to get a prediction from my model?
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([95,85,80,70,60])
y = np.array([85,95,70,65,70])
x = x.reshape(-1,1)
y = y.reshape(-1,1)
plt.scatter(x,y)
plt.show()
reg = LinearRegression()
reg.fit(x,y)
reg.predict(80)

The input to predict() must be a 2D array, but you are passing a scalar; that's why you are getting the error. You need to pass 80 as a nested (2D) list, [[80]]:
reg.predict([[80]])
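Equivalently, you can wrap the value in a NumPy array and reshape it, as the error message suggests. A minimal sketch, reusing reg from the code above:
import numpy as np
x_new = np.array([80]).reshape(1, -1)  # one sample with one feature: shape (1, 1)
print(reg.predict(x_new))              # same result as reg.predict([[80]])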

Related

How to normalize-scale data in attribute in range <-1;1>

Hello, I have tried many options to normalize the data in my DataFrame column elnino_1["air_temp"], but it always shows me an error like "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample." or "'int' object is not callable".
I try this code:
(1)
elnino_1["air_temp"].min=-1
elnino_1["air_temp"].max=1
elnino_1_std = (elnino_1["air_temp"] - elnino_1["air_temp"].min(axis=0)) / (elnino_1["air_temp"].max(axis=0) - elnino_1["air_temp"].min(axis=0))
elnino_1_scaled = elnino_1_std * (max - min) + min
(2)
XD=elnino_1["air_temp"]
scaler = MinMaxScaler(feature_range=(-1, 1))
In both options I use these libraries:
from sklearn.preprocessing import scale
from sklearn import preprocessing
What should I do to normalize this data?
As I do not have access to your dataset, here I'm using make_classification to generate some synthetic data. Please run through it in a notebook to build understanding. (Do note there may be slight differences, as I'm using a NumPy array as the dataset while yours is a DataFrame.)
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=1)
pd.DataFrame(X).head()
Thereafter, we fit a MinMaxScaler to the data. MinMaxScaler expects a 2D array as input, in other words a 'table'. Throughout these steps, call X.shape to understand how array shapes work. For example, in the above, X.shape is (100, 2), i.e. (num_rows, num_columns).
scaler = MinMaxScaler(feature_range=(-1, 1))
X_norm = scaler.fit_transform(X)
pd.DataFrame(X_norm).head()
In your case, you are only trying to fit/scale a single column. When you fit only elnino_1["air_temp"], it is a 1D array with a shape like (100,).
So we have to reshape it into a 2d array.
x1_norm = scaler.fit_transform(X[:, 1].reshape(-1,1))
pd.DataFrame(x1_norm)
For example, if xyz.shape is (100,) and I want it to be (100, 1), I can use xyz.reshape(100, 1) if I'm being specific.
The length of a dimension set to -1 is automatically inferred from the specified values of the other dimensions, which is useful when you don't want to hard-code an array's length. Thus xyz.reshape(-1, 1) achieves the same as above.
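Applied to the question's own data, a minimal sketch might look like this (assuming elnino_1 is a DataFrame with an "air_temp" column, as in the question):
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
# reshape(-1, 1) turns the 1D column of shape (n,) into shape (n, 1)
air_temp_scaled = scaler.fit_transform(elnino_1["air_temp"].values.reshape(-1, 1))
# ravel() back to 1D to store the result as a column
elnino_1["air_temp_scaled"] = air_temp_scaled.ravel()
Alternatively, selecting with double brackets, elnino_1[["air_temp"]], returns a one-column DataFrame that is already 2D and can be passed to fit_transform directly.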

Found input variables with inconsistent numbers of samples: [14559, 1455900]

I am facing some problems when I try to fit the model. This happens when I use LogisticRegression, Naive Bayes, or SVM models, but I get results with random forest regression or a decision tree.
The error says:
ValueError: y should be a 1d array, got an array of shape (20799, 100) instead.
The solution is to use y_train.ravel() when I fit the model. But then again, the error below appears:
Found input variables with inconsistent numbers of samples: [14559, 1455900]
Here's my code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
df=pd.read_csv('../input/filteredbymany.csv',low_memory=False,usecols=['county','crashalcoh','drvrsex','developmen','lightcond','drvrvehtyp','drvrage','pedage','city','crashloc','crashtype','pedpos'])
df.dropna(inplace=True)
dummies= pd.get_dummies(df)
merged=pd.concat([df,dummies],axis='columns')
X = merged
X = X.drop(['county','crashalcoh','city','developmen','drvrage','drvrsex','drvrvehtyp','lightcond','pedage','crashloc','crashtype','pedpos'],axis='columns')
y = X.loc[:, X.columns.str.startswith('county')]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)
model = LogisticRegression()
model.fit(X_train,y_train.values.ravel())
model.predict(X_test)
I have been struggling with this for around 80 hours or so. Please help.
The problem
You want to have an array X with N rows. Each row is a sample of something and each column is a feature of these samples. And then you want to have an array y with N values. The i'th value of y is the value ("label") you want to predict for the i'th row of X.
The first error
Your y is two-dimensional (shape is (N, 100)), but it should be one-dimensional (shape (N,)). So you have 100 labels for each instance in X, but the model you chose can only predict one label per instance.
The second error
Then you ravel it to a one-dimensional array with shape (100*N,). Now you have one dimension, but still 100 times too many values, which is why the sample counts no longer match: 14559 rows in X_train versus 14559 * 100 = 1455900 values in y_train.
Solution
Look at your tables X and y and see which column of y you actually want.
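For example, a minimal sketch (the dummy-column name here is hypothetical; pick whichever single label you actually want to predict):
# Keep a single label column instead of all 100 dummy columns.
# 'county_WAKE' is a hypothetical column name for illustration.
y_single = y['county_WAKE']    # shape (N,): one label per row
X_train, X_test, y_train, y_test = train_test_split(
    X, y_single, test_size=0.3, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)    # y_train is already 1D; no ravel() needed
model.predict(X_test)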

Linear Regression Model predict function not working [duplicate]

This question already has answers here:
Error in Python script "Expected 2D array, got 1D array instead:"?
(11 answers)
Closed 3 years ago.
I have this code. It basically works all the way until I try to use predict(x-value) to get the y-value answer.
The code is below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('linear_data.csv')
x = df.iloc[:,:-1].values
y = df.iloc[:,1:].values
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=1/3,random_state=0)
reg = LinearRegression()
reg.fit(x_train, y_train)
y_predict = reg.predict(x_test)
y_predict_res = reg.predict(11)  # <-- This is the error! 11 is the number of years to predict the salary
print(y_predict_res)
The error I get is:
ValueError: Expected 2D array, got scalar array instead: array=11.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
The error message doesn't help me much as I don't understand why I need to reshape it.
Please note that the parameter X that predict expects is array-like or a sparse matrix of shape (n_samples, n_features), meaning it can't be an individual number. The number/value has to be part of a 2D array.
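A minimal sketch of the fix, reusing the question's variable names:
# Wrap the scalar as one sample with one feature: shape (1, 1)
y_predict_res = reg.predict([[11]])
# equivalently, with NumPy (np is already imported in the question's code):
y_predict_res = reg.predict(np.array(11).reshape(1, -1))
print(y_predict_res)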

Python PolynomialFeatures transforms data into different shape from the original one

I'm using sklearn's PolynomialFeatures to preprocess data into various degree transformations in order to compare their model fit.
Below is my code:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
np.random.seed(0)
# x and y are the original data
n = 100
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+n/6 + np.random.randn(n)/10
# using .PolynomialFeatures and fit_transform to transform original data to degree 2
poly1 = PolynomialFeatures(degree=2)
x_D2_poly = poly1.fit_transform(x)
#check out their dimensions
x.shape
x_D2_poly.shape
However, the above transformation returned an array of (1, 5151) from the original x of (100, 1). This is not what I expected. I couldn't figure out what's wrong with my code. It would be great if someone could point out the error in my code or a misconception on my part.
Should I use alternative methods to transform original data instead?
Thank you.
[update]
So after I used x = x.reshape(-1, 1) to transform the original x, Python does give me the desired output dimension (100, 1) via poly1.fit_transform(x). However, when I did a train_test_split, fitted the data, and tried to obtain predicted values:
x_poly1 = poly1.fit_transform(x)
x_poly1_train, x_poly1_test, y_train, y_test = train_test_split(x_poly1, y, random_state=0)
linreg = LinearRegression().fit(x_poly1_train, y_train)
poly_predict = LinearRegression().predict(x)
Python returned an error message:
shapes (1,100) and (2,) not aligned: 100 (dim 1) != 2 (dim 0)
Apparently, there must be somewhere I got the dimensional thing wrong again. Could anyone shed some light on this?
Thank you.
I think you need to reshape your x like:
x = x.reshape(-1, 1)
Your x had shape (100,), not (100, 1), and fit_transform expects 2 dimensions.
The reason you were getting 5151 features is that your 100 values were treated as 100 features of a single sample, so PolynomialFeatures produced one feature for each distinct pair (100*99/2 = 4950), one for each feature squared (100), one for the first power of each feature (100), and one for the 0th power (the bias, 1): 4950 + 100 + 100 + 1 = 5151.
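A quick sketch to verify that count (assuming a single sample with 100 features):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
# 1 bias + 100 linear + 100 squared + 4950 pairwise products = 5151
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(np.zeros((1, 100))).shape)  # (1, 5151)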
Response to your edited question:
You need to call transform (on the already-fitted poly1) to convert the data you wish to predict on, and call predict on the fitted model linreg, not on a fresh LinearRegression().
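A minimal sketch, reusing the question's names (x_new is a hypothetical new input):
# Transform new inputs with the same fitted PolynomialFeatures object,
# then predict with the fitted model.
x_new = np.array([[5.0]])            # hypothetical input, shape (1, 1)
x_new_poly = poly1.transform(x_new)  # shape (1, 3) for degree=2
poly_predict = linreg.predict(x_new_poly)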

data dimension of scikit learn linear regression

I just started using the Python scikit-learn package to do linear regression. I am confused about the data dimensions it requires. For example, I want to regress x on y using the following code:
from sklearn import linear_model
x=[0,1,2]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients: \n', regr.coef_)
The system returned the error: tuple index out of range.
According to the scikit-learn website, valid arrays should look like:
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
(http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares)
from sklearn import linear_model
x=[[0,0],[1,1],[2,2]]
y=[0,1,2]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients: \n', regr.coef_)
So does that mean the package cannot regress X[i] on Y[i] for two single numbers? Must each element of X be an array rather than a single number, like [0,0] in X mapping to 0 in Y?
Thanks in advance.
You can.
Simply reshape your data to be x = [[0], [1], [2]].
In this case, every point in your data will have a single feature: a single number.
Scikit requires your x to be a 2-dimensional array. It need not be a numpy array. You can always use a simple python list.
In case if you have your x as a 1-dimensional array like you just mentioned in your question, you can simply do the following:
x = [[value] for value in [0,1,2]]
This stores a 2D version of your 1D list in x, i.e. every individual value of your list is wrapped in its own single-element list.
x can also be converted into a numpy array, and then reshaped as follows:
import numpy as np
x = np.array(x).reshape(-1, 1)
This converts your data into a 2D array so that you can use it for fitting the linear regression model from sklearn.
array([[0],
       [1],
       [2]])
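Putting it together, a minimal sketch of the working one-feature fit:
from sklearn import linear_model
import numpy as np
x = np.array([0, 1, 2]).reshape(-1, 1)  # shape (3, 1): 3 samples, 1 feature
y = [0, 1, 2]
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Coefficients:', regr.coef_)      # [1.]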
