Linear Regression fitting - Python

I first did a train/test split and then fitted the training data to a LinearRegression model, as shown below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
Log_m = LinearRegression()
Log_m.fit(X_train, y_train)
predictions = Log_m.predict(X_test)
I have been given another test DataFrame and want to generate predictions on it with the Log_m model that was just trained, so I did:
predictions_t = Log_m.predict(fin_df1_t)
But I get the error message:
ValueError: shapes (1450,262) and (282,) not aligned: 262 (dim 1) != 282 (dim 0)
These are the shapes of the DataFrames and arrays involved:
fin_df1_t (1450,262)
X_test (556,282)
X_train (834,282)
y_test (556,)
y_train (834,)

The new test data has 262 feature columns, while X_train and X_test have 282, so predicting will always raise this error: the data passed to predict() must have exactly the same feature columns as the data the model was trained on.
X_train and X_test share the same 282 columns, which is why there is no error at that step.
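One common way to handle this, assuming fin_df1_t was built from the same raw data (for example via pd.get_dummies()) and the training columns are still available as X_train.columns, is to reindex the new frame to exactly those columns before predicting. This is only a sketch, not part of the original answer:
# Align the new frame to the columns the model was trained on:
# columns missing from fin_df1_t are added and filled with 0,
# and columns the model never saw are dropped.
fin_df1_t_aligned = fin_df1_t.reindex(columns=X_train.columns, fill_value=0)
predictions_t = Log_m.predict(fin_df1_t_aligned)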

Related

Using MinMaxScaler on arrays and getting scaling parameters for later use: missing values

I have this issue when trying to use sklearn.preprocessing.MinMaxScaler on a large array and obtain the scaling parameters so that I can redo (invert) the normalization after working with the array for a while.
The problem is that after calling MinMaxScaler.fit_transform(data), where data is a NumPy array with shape (8, 412719), the scaling parameters obtained from MinMaxScaler.scale_ form an array of length 412719.
How do I obtain an array with the scaling parameters instead? I'm missing 7 columns' worth of scaling parameters, if I haven't misunderstood something.
I build my X DataFrame and y target, then scale the X DataFrame:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB

df3.dropna(inplace=True)
# Keep every column except the target and the excluded features.
X_Columns = [column for column in df3.columns if column not in ["Target", "DateTime", "Date", "CO2Intensity", "ActualWindProduction", "ORKWindspeed", "ForecastWindProduction"]]
#print(X_Columns)
X = df3[X_Columns]
#print(X)
y = df3["Target"]
scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train_scaled = scaler.fit_transform(X_train)
classifier = GaussianNB()
classifier.fit(X_train_scaled, y_train)
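On the scaling-parameter question itself: MinMaxScaler always scales per column (feature), so scale_ has one entry per column. With an array of shape (8, 412719) that means 412719 entries, not 8. A minimal sketch (with random stand-in data) showing this, and the transpose you would need if each of the 8 rows should get its own scaling parameter:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(8, 412719)       # stand-in for the array from the question

scaler = MinMaxScaler()
scaler.fit_transform(data)
print(scaler.scale_.shape)             # (412719,) -- one parameter per column

# If the 8 rows are really the features to be scaled, transpose first:
scaler_t = MinMaxScaler()
scaler_t.fit_transform(data.T)
print(scaler_t.scale_.shape)           # (8,) -- one parameter per original row

# Either way, scaler.inverse_transform(...) undoes the normalization later.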

Error when calculating predicted values of a polynomial regression in Python

I am trying to calculate predicted values after running a polynomial regression in Python using the following code:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n) / 5
y = np.sin(x) + x / 6 + np.random.randn(n) / 10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
X = X_train.reshape(-1, 1)
X_predict = np.linspace(0, 10, 100)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X)
model = LinearRegression()
reg_poly = model.fit(X_train_poly, y_train)
y_predict = model.predict(X_predict)
After running it I get the following error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 0.  0.1010101  0.2020202  0.3030303  0.4040404  0.50505051 ......
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample.
I tried reshaping the array as suggested in the error message, so the last line of code became:
y_predict = model.predict(X_predict.reshape(-1,1))
But as a result I got this error:
ValueError: shapes (100,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
Can someone please explain what I am doing wrong?
You forgot to prepare the data for prediction the same way you prepared the training data for the model. In particular, you never ran X_predict through PolynomialFeatures.
Since the shape of the data you predict on has to exactly match the shape used for training, you need to apply to X_predict everything you did to create X_train_poly (which the model was trained on). Your last line should therefore look like:
y_predict = model.predict(poly.fit_transform(X_predict.reshape(-1, 1)))
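Putting it together, the prediction step can also be written like this (poly.transform is sufficient here, since poly was already fitted on the training data; for PolynomialFeatures the result is identical either way):
# Expand the prediction grid with the same PolynomialFeatures object used in
# training, then predict with the fitted linear model.
X_predict_poly = poly.transform(X_predict.reshape(-1, 1))   # shape (100, 3): [1, x, x^2]
y_predict = reg_poly.predict(X_predict_poly)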

How to convert pandas.core.series.Series type to 2D array?

I am trying to train a model using KNeighborsClassifier. I split the data as follows:
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.30, random_state=42)
y_train = y_train.astype('int')
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
When I try to test it, I get a ValueError.
pre = neigh.predict(y_test)
Expected 2D array, got 1D array instead:
array=[0. 1. 1. ... 0. 0. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a
single feature or array.reshape(1, -1) if it contains a single sample.
My y_test is of type pandas.core.series.Series
So how do I convert pandas.core.series.Series to array of 2D to make this testing work?
I have tried to convert y_test to dataframe and then to array, but I get another value error and I am stuck.
y_test = pd.DataFrame(y_test)
y_test = y_test.as_matrix().reshape(-1,1)
pre = neigh.predict(y_test)
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 1 while Y.shape[1] == 6038
I guess you need to use your X_test variable/array, not y_test.
X_test contains the independent variables/features used to evaluate the model, while y_test holds the actual target values that are compared against the predictions. That is also why the reshape attempt fails: the classifier was fitted on 6038 feature columns (the width of X_bow), whereas the reshaped y_test has only one column.
Example:
pre = neigh.predict(X_test)
To measure accuracy:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, pre)
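As an aside, if you ever do need a 2-D array from a pandas Series (the literal question in the title), a small sketch:
# Convert a Series to a 2-D NumPy array with a single column.
arr_2d = y_test.to_numpy().reshape(-1, 1)    # shape (n_samples, 1)
Note that this still would not be valid input for neigh.predict() here, because the classifier was fitted on 6038 feature columns, not one.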

pandas - non-aligned dataframes

I have two data frames:
df_train
Data types in the dataset: ['uint8', 'int64', 'float64']
Number of features: 233
Shape: (1457, 233)
df_test
Data types in the dataset: ['uint8', 'int64', 'float64']
Number of features: 216
Shape: (1447, 216)
The difference in the number of columns (233 vs 216) is due to the dummy variables I created in them both with pd.get_dummies() - fewer were created in df_test. Prior to that, df_train originally contained just one extra variable "SalePrice" which is the target variable to be predicted on df_test.
X = df_train.drop(["SalePrice"], axis=1)
y = df_train["SalePrice"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train, y_train)
This results in a solid test set score and all's well. But when I try to predict on df_test as follows
y_pred = ridge.predict(df_test)
it gives the following error:
ValueError: shapes (1447,216) and (232,) not aligned: 216 (dim 1) != 232 (dim 0)
I understand I messed up the shapes somewhere. Can you help me sort this out? I have basically no experience with this. Much appreciated.
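Not part of the original thread, but the usual fix is to make df_test carry exactly the same feature columns the Ridge model was trained on. One sketch, assuming the training features are still available as the DataFrame X:
# Reindex df_test to the 232 training columns: dummy columns that never
# appeared in df_test are added and filled with 0, extra columns are dropped.
df_test_aligned = df_test.reindex(columns=X.columns, fill_value=0)
y_pred = ridge.predict(df_test_aligned)
Another option is to run pd.get_dummies() on the concatenated train and test frames and split them apart again afterwards, so both end up with identical dummy columns.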

Python ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13)

I would like to normalize a training and test data set using MinMaxScaler in sklearn.preprocessing. However, the package does not appear to be accepting my test data set.
import pandas as pd
import numpy as np
# Read in data.
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
# Split into train/test data.
from sklearn.model_selection import train_test_split
X = df_wine.iloc[:, 1:].values
y = df_wine.iloc[:, 0].values
X_train, y_train, X_test, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
When executing this, I get:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
along with:
ValueError: operands could not be broadcast together with shapes (124,) (13,) (124,)
Reshaping the data still yields an error.
X_test_norm = mms.transform(X_test.reshape(-1, 1))
This reshaping yields an error ValueError: non-broadcastable output operand with shape (124,1) doesn't match the broadcast shape (124,13).
Any input on how to fix this error would be helpful.
train_test_split() returns the splits in a fixed order: the train and test parts of each input array, in the order the arrays were passed in. For train_test_split(X, y) that is X_train, X_test, y_train, y_test, and you have to unpack them in exactly that order.
Because the order was specified as X_train, y_train, X_test, y_test, the variable y_train actually received X_test (54 rows) and X_test received y_train (a 1-D array of length 124), which is what produces the DeprecationWarning and the ValueError.
Instead, you must:
# Split into train/test data.
# train_test_split(X, y) returns X_train, X_test, y_train, y_test, in that order:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# (or, with the inputs swapped)
# y_train, y_test, X_train, X_test = train_test_split(y, X, test_size=0.3, random_state=0)
# Normalize features using min-max scaling.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
produces:
X_train_norm[0]
array([ 0.72043011, 0.20378151, 0.53763441, 0.30927835, 0.33695652,
0.54316547, 0.73700306, 0.25 , 0.40189873, 0.24068768,
0.48717949, 1. , 0.5854251 ])
X_test_norm[0]
array([ 0.72849462, 0.16386555, 0.47849462, 0.29896907, 0.52173913,
0.53956835, 0.74311927, 0.13461538, 0.37974684, 0.4364852 ,
0.32478632, 0.70695971, 0.60566802])
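A quick shape check right after the split (a small addition, not in the original answer) makes this kind of unpacking mistake obvious immediately:
# With the 178 wine samples and test_size=0.3 this prints
# (124, 13) (54, 13) (124,) (54,)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)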
