This question already has answers here:
Error in Python script "Expected 2D array, got 1D array instead:"?
(11 answers)
Closed 3 years ago.
I have this code. It basically works all the way until I try to use predict(x-value) to get the y-value answer.
The code is below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('linear_data.csv')
x = df.iloc[:,:-1].values
y = df.iloc[:,1:].values
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=1/3,random_state=0)
reg = LinearRegression()
reg.fit(x_train, y_train)
y_predict = reg.predict(x_test)
y_predict_res = reg.predict(11)  # This line raises the error! 11 is the number of years to predict the salary for
print(y_predict_res)
The error I get is:
ValueError: Expected 2D array, got scalar array instead: array=11.
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
The error message doesn't help me much as I don't understand why I need to reshape it.
Note that the parameter X that predict() expects is array-like or a sparse matrix of shape (n_samples, n_features), so it cannot be a single number. The value has to be wrapped in a 2D array, for example [[11]].
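A minimal sketch of that fix, assuming made-up years/salary numbers in place of linear_data.csv (the data and the names `single` and `same` are illustrative, not from the question). The single value is wrapped so that it has shape (1, 1): one sample, one feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for linear_data.csv: years of experience -> salary
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (5, 1): 5 samples, 1 feature
y = 30000.0 + 5000.0 * x                            # exactly linear, shape (5, 1)

reg = LinearRegression().fit(x, y)

# One sample with one feature must still be 2D: shape (1, 1)
single = np.array([[11]])
y_predict_res = reg.predict(single)

# Equivalent: reshape a scalar into a single row
same = reg.predict(np.array(11).reshape(1, -1))
print(y_predict_res)
```

Passing a nested list, `reg.predict([[11]])`, should behave the same way, since scikit-learn converts array-like input to an array.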
Here is the problem
Extract just the median_income column from the independent variables (from X_train and X_test).
Perform Linear Regression to predict housing values based on median_income.
Predict output for test dataset using the fitted model.
Plot the fitted model for training data as well as for test data to check if the fitted model satisfies the test data.
I did a linear regression earlier. Here is the code:
import pandas as pd
import os
os.getcwd()
os.chdir('/Users/saurabhsaha/Documents/PGP-AI:ML-Purdue/New/datasets')
df=pd.read_excel('California_housing.xlsx')
df.total_bedrooms=df.total_bedrooms.fillna(df.total_bedrooms.mean())
x = df.iloc[:,2:8]
y = df.median_house_value
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=.20)
from sklearn.linear_model import LinearRegression
california_model = LinearRegression().fit(x_train,y_train)
california_model.predict(x_test)
Prdicted_values = pd.DataFrame(california_model.predict(x_test),columns=['Pred'])
Prdicted_values
Final = pd.concat([x_test.reset_index(drop=True),y_test.reset_index(drop=True),Prdicted_values],axis=1)
Final['Err_pct'] = abs(Final.median_house_value - Final.Pred)/Final.median_house_value
Here is my dataset- https://docs.google.com/spreadsheets/d/1vYngxWw7tqX8FpwkWB5G7Q9axhe9ipTu/edit?usp=sharing&ouid=114925088866643320785&rtpof=true&sd=true
Following is my code.
x1_train=x_train.median_income
x1_train
x1_train.shape
x1_test=x_test.median_income
x1_test
type(x1_test)
x1_test.shape
from sklearn.linear_model import LinearRegression
california_model_new = LinearRegression().fit(x1_train,y_train)
I get an error right here, and when I try to convert my 1D data to 2D as follows, I cannot:
import numpy as np
x1_train= x1_train.reshape(-1, 1)
x1_test = x1_train.reshape(-1, 1)
This is the error I get
AttributeError: 'Series' object has no attribute 'reshape'
I am new to data science, so a bit of explanation would be really helpful.
x1_train and x1_test are pandas Series objects, whereas reshape() is a method of NumPy arrays, so the Series have to be converted to arrays first.
Do this instead:
x1_train = x1_train.to_numpy().reshape(-1, 1)
x1_test = x1_test.to_numpy().reshape(-1, 1)
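As a rough illustration of the difference, the sketch below uses a tiny made-up frame in place of the housing data (the values and the names `as_series`, `as_frame` and `as_array` are assumptions). It shows that a Series is 1D, and that selecting with double brackets is another way to get a 2D input without calling reshape at all.

```python
import pandas as pd

# Hypothetical stand-in for the housing training data
x_train = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6],
    "housing_median_age": [41, 21, 52],
})

as_series = x_train["median_income"]             # 1D Series
as_frame = x_train[["median_income"]]            # 2D DataFrame with one column
as_array = as_series.to_numpy().reshape(-1, 1)   # 2D ndarray with one column

print(as_series.shape, as_frame.shape, as_array.shape)
```

Either `as_frame` or `as_array` can be passed to `LinearRegression().fit(...)`; keeping the DataFrame has the side benefit of preserving the column name.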
When I try to use .predict on my linear regression, I get thrown the following error:
ValueError: Expected 2D array, got scalar array instead:
array=80.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't really understand the reshape feature and why it's needed. Can somebody please explain to me what it does, and how to apply it to get a prediction from my model?
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array([95,85,80,70,60])
y = np.array([85,95,70,65,70])
x = x.reshape(-1,1)
y = y.reshape(-1,1)
plt.scatter(x,y)
plt.show()
reg = LinearRegression()
reg.fit(x,y)
reg.predict(80)
predict() expects a 2D array, but you are passing a plain integer, which is why you get the error. You need to pass 80 as a 2D list, [[80]]:
reg.predict([[80]])
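To address the "what does reshape do" part, here is a small sketch using the x values from the question (the names `column`, `row` and `single` are illustrative). In reshape, -1 means "work this dimension out from the number of elements".

```python
import numpy as np

x = np.array([95, 85, 80, 70, 60])   # 1D array with five elements

# -1 asks NumPy to infer that dimension from the array's size
column = x.reshape(-1, 1)   # five samples, one feature each
row = x.reshape(1, -1)      # one sample with five features

# A single value to predict for is one sample with one feature
single = np.array(80).reshape(1, -1)

print(column.shape, row.shape, single.shape)
```

scikit-learn reads the first axis as samples and the second as features, which is why the error message offers both reshape variants and leaves the choice to you.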
This question already has answers here:
Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)
(5 answers)
Closed 3 years ago.
I have a large data frame with MANY columns. I want to normalize a few columns which are all numeric, and then plot two using regression. I thought the code below would do it for me.
from sklearn import preprocessing
# Create x, where x the 'scores' column's values as floats
modDF = df[['WeightedAvg','Score','Co','Score', 'PeerGroup', 'TimeT', 'Ter', 'Spread']].values.astype(float)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Create an object to transform the data to fit minmax processor
x_scaled = min_max_scaler.fit_transform(modDF)
# Run the normalizer on the dataframe
df_normalized = pd.DataFrame(x_scaled)
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="WeightedAvg", y="Spread", data=modDF)
However, I am getting the following error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
I did a regression without normalizing, using sns.regplot and it worked, but it looked weird, so I want to see it with normalization applied. I know how the regression works. I just don't know how the regression works.
There is no need for the command df_normalized = pd.DataFrame(x_scaled).
If you want to run a linear regression, this should work:
import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# feature columns (the duplicated 'Score' entry is listed only once)
cols = ['WeightedAvg', 'Score', 'Co', 'PeerGroup', 'TimeT', 'Ter', 'Spread']
# convert the feature columns to numeric, turning unparseable entries into NaN
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce', axis=1)
X = df[cols]
#select your target variable
y = df[['target']]
#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Create a minimum and maximum processor object
min_max_scaler = preprocessing.MinMaxScaler()
# Fit the scaler on the training data only, then apply it to the test data
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)
#start linear regression
reg = LinearRegression().fit(X_train_scaled, y_train)
#predict for test
y_predict = reg.predict(X_test_scaled)
If you work with a train/test split, it is important to fit the scaler on the training data only, because the test data is unknown at that point in time. On the test data you are only allowed to call transform.
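One way to make that rule hard to break is to put the scaler and the regressor in a scikit-learn Pipeline. The sketch below is only an illustration on synthetic data (the data, the coefficients and the name `model` are assumptions): calling fit on the pipeline fits the scaler on the training rows only, and predict only transforms the test rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption for illustration): 3 features, linear target
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(60, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 10.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# fit() fits the scaler on X_train only; predict() only transforms X_test
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# The scaler's learned minimum comes from the training rows alone
scaler = model.named_steps["minmaxscaler"]
print(scaler.data_min_, X_train.min(axis=0))
```

The manual fit_transform/transform version does the same thing; the pipeline just removes the chance of accidentally calling fit_transform on the test set.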
I am working on a multi-target (binary) classification problem. There are 11 targets and I am using sklearn's MultiOutputClassifier. I am having difficulty with the predict_proba method. See a snippet of the dataset, and the code, below:
import pandas as pd
import numpy as npy
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import MultiOutputClassifier
data = pd.read_csv("123.csv")
target = ['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72']
train, test = train_test_split(data, test_size=0.2)
X_train = train.drop(['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72','FORMULA_NUMBER'],axis=1)
X_test = test.drop(['H67BC97','H67GC93','H67LC63','H67WC103','H67RC91','H67YC73','H67RC92','H67GC94','H67LC64','H67NC60','H67YC72','FORMULA_NUMBER'],axis=1)
Y_train = train[target]
Y_test = test[target]
model = MultiOutputClassifier(GradientBoostingClassifier())
model.fit(X_train, Y_train)
target_probabilities = model.predict_proba(X_test)
print(target_probabilities)
The probabilities output does not seem to be in the correct form. I get 11 arrays of shape 565x2 (565 is the length of my test set). I'd like to save target_probabilities to a CSV file, but I get the error: ValueError: Expected 1D or 2D array, got 3D array instead. My question is essentially the same as the one at this link:
https://datascience.stackexchange.com/questions/22762/understanding-predict-proba-from-multioutputclassifier, but the answer there only explains why the output is a set of arrays.
EDIT: I have simplified the problem.
target_probabilities = npy.array(target_probabilities)
Now target_probabilities is an (11, 565, 2) array. I need to change it to shape (565, 11), where row i is target_probabilities[:, i][:, 1], for i in range(0, 565).
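A possible way to do that reshaping is sketched below on synthetic probabilities rather than the real model output. It assumes every per-target array has exactly two columns and that column 1 is the probability of class 1 (that is, each target saw both classes during training); the helper name `positive_class_matrix` and the CSV file name are hypothetical.

```python
import numpy as np
import pandas as pd

def positive_class_matrix(proba_list):
    """Turn a list of (n_samples, 2) arrays into one (n_samples, n_targets) array."""
    stacked = np.array(proba_list)   # shape (n_targets, n_samples, 2)
    return stacked[:, :, 1].T        # keep column 1, then put samples on the rows

# Synthetic stand-in for model.predict_proba(X_test): 11 targets, 5 samples
rng = np.random.default_rng(0)
p1 = rng.uniform(size=(11, 5))
fake_probabilities = [np.column_stack([1 - p, p]) for p in p1]

result = positive_class_matrix(fake_probabilities)
table = pd.DataFrame(result, columns=[f"target_{i}" for i in range(11)])
# table.to_csv("target_probabilities.csv", index=False)  # hypothetical file name
print(table.shape)
```

With the real model, the column names could be taken from the `target` list so that the CSV columns line up with the 11 targets.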
Hi there. I just started with machine learning and am using a simple example to learn. I want to classify the files on my disk by file type using a classifier. The code I have written is:
import sklearn
import numpy as np
#Importing a local data set from the desktop
import pandas as pd
mydata = pd.read_csv('file_format.csv',skipinitialspace=True)
print mydata
x_train = mydata.script
y_train = mydata.label
#print x_train
#print y_train
x_test = mydata.script
from sklearn import tree
classi = tree.DecisionTreeClassifier()
classi.fit(x_train, y_train)
predictions = classi.predict(x_test)
print predictions
And I am getting the error as,
   script  class  div   label
0       5      6    7    html
1       0      0    0  python
2       1      1    1     csv
Traceback (most recent call last):
  File "newtest.py", line 21, in <module>
    classi.fit(x_train, y_train)
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 790, in fit
    X_idx_sorted=X_idx_sorted)
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 116, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/home/initiouser2/.local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 410, in check_array
    "if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[ 5. 0. 1.].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
If anyone can help me with the code, it would be so helpful to me !!
When passing your input to the classifier, pass 2D arrays (of shape (M, N) where N >= 1), not 1D arrays (which have shape (N,)). The error message is pretty clear:
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
from sklearn.model_selection import train_test_split
# X.shape should be (N, M) where M >= 1
X = mydata[['script']]
# y.shape should be (N,)
y = mydata['label']
# perform label encoding if "label" contains strings
# y = pd.factorize(mydata['label'])[0].reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42)
...
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
Some other helpful tips -
Split your data into separate train and test portions. Do not test on your training data, because that leads to an overly optimistic estimate of your classifier's strength.
I'd recommend factorizing your labels, so that you're dealing with integers. It's just easier.
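Both tips can be sketched together as follows. The data is made up in the same layout as file_format.csv, and `DecisionTreeClassifier` is used because it is the classifier from the question; the names `codes`, `uniques` and `predicted_labels` are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data in the same layout as file_format.csv
mydata = pd.DataFrame({
    "script": [5, 0, 1, 6, 0, 1, 5, 0, 2, 7, 0, 1],
    "label": ["html", "python", "csv"] * 4,
})

X = mydata[["script"]]                           # 2D: one column of features
codes, uniques = pd.factorize(mydata["label"])   # integer codes plus the original labels
y = codes                                        # 1D: one integer per sample

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Map the predicted integer codes back to the original strings
predicted_labels = uniques[clf.predict(X_test)]
print(predicted_labels)
```

Keeping `uniques` around is what makes the integer predictions readable again after the classifier has run.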
X=dataset.iloc[:, 0].values
y=dataset.iloc[:, 1].values
regressor=LinearRegression()
X=X.reshape(-1,1)
regressor.fit(X,y)
The code above is what worked for me. reshape is not an in-place operation, so you have to assign its result back to X, as shown above.
You have to pass a 2D array.
You might be giving input like this:
model.predict([1,2,0,4])
But this is wrong.
You have to give input like this:
model.predict([[1,2,0,4]])
There are two pairs of square brackets, not one.
A simple solution that keeps the array 2D is, instead of using:
X=dataset.iloc[:, 0].values
to use:
X=dataset.iloc[:, :-1].values
This selects every column except the last one as a 2D array, so if you only have two columns it returns the first column with shape (n, 1).
Suppose initially you have:
X = dataset.iloc[:, 1].values
which selects the column at index 1 for all rows, as a 1D array.
Now change it to the following:
X = dataset.iloc[:, 1:2].values
Here 1:2 means the half-open range [1, 2), so it selects the same single column but keeps the result 2D.
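The difference between integer and slice indexing can be checked directly on a small made-up frame (the column names and values below are assumptions for illustration):

```python
import pandas as pd

# Hypothetical two-column dataset
dataset = pd.DataFrame({"years": [1, 2, 3], "salary": [35000, 40000, 45000]})

one_d = dataset.iloc[:, 1].values           # integer index: 1D result
two_d = dataset.iloc[:, 1:2].values         # slice: 2D result with one column
all_but_last = dataset.iloc[:, :-1].values  # slice: 2D result, every column but the last

print(one_d.shape, two_d.shape, all_but_last.shape)
```

An integer position drops the column axis, while a slice keeps it, even when the slice covers only one column.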
The easy fix is to make the selection 2D when you select the column:
x_train = mydata[['script']]
y_train = mydata[['label']]