I have two data frames:
df_train
Data types in the dataset: ['uint8', 'int64', 'float64']
Number of features: 233
Shape: (1457, 233)
df_test
Data types in the dataset: ['uint8', 'int64', 'float64']
Number of features: 216
Shape: (1447, 216)
The difference in the number of columns (233 vs. 216) comes from the dummy variables I created in both frames with pd.get_dummies(); fewer dummy columns were created in df_test. Before encoding, df_train contained just one extra column, "SalePrice", which is the target variable to be predicted on df_test.
X = df_train.drop(["SalePrice"], axis=1)
y = df_train["SalePrice"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
from sklearn.linear_model import Ridge
ridge = Ridge().fit(X_train, y_train)
This results in a solid test set score and all's well. But when I try to predict on df_test as follows
y_pred = ridge.predict(df_test)
it gives the following error:
ValueError: shapes (1447,216) and (232,) not aligned: 216 (dim 1) != 232 (dim 0)
I understand I messed up the shapes somewhere. Can you help me fix this problem? I have zero experience with this. Much appreciated.
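One common fix (a sketch, not the only option) is to align df_test's columns with the 232 feature columns the model was trained on, filling dummy columns missing from df_test with zeros and dropping any extras:
# Align df_test with the training feature columns (X from above);
# dummies absent from df_test become 0, extra columns are dropped.
df_test_aligned = df_test.reindex(columns=X.columns, fill_value=0)
y_pred = ridge.predict(df_test_aligned)
A more robust alternative is to fit a single encoder on the training data and reuse it on the test data, which is what the related question below gets into.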
Related
I've seen quite a lot of conflicting views on if one-hot encoding (dummy variable creation) should be done before/after the training/test split.
Responses seem to state that one-hot encoding before leads to "data leakage".
This example states it's the industry norm to one-hot encode the entire dataset before the training/test split:
Industry Example
This example from Kaggle states that it should be done after the training/test split to avoid data leakage:
kaggle response - after split
My question is the following:
Do we perform one-hot encoding before or after the Train/Test Split?
Where is the data leakage occurring in the following example?
In the following example we have two columns, web_views and website (a non-ordinal categorical feature), and we assume we one-hot encode across the entire column without dropping any dummies.
Our dataframe:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
df = pd.DataFrame({'web_views': [100, 200, 300, 400],
                   'website': ['Youtube', 'Facebook', 'Instagram', 'Google']})
Scenario 1: One-Hot Encoding/Dummy Variables before splitting into Train/Test:
np.random.seed(123)
df_before_split = pd.concat([df.drop('website', axis = 1), pd.get_dummies(df['website'])], axis=1)
# create your X and y dataframes
X_before_split = df_before_split.drop('web_views', axis = 1)
y_before_split = df_before_split['web_views']
# perform train test split
X_train_before_split, X_test_before_split, y_train_before_split, y_test_before_split = train_test_split(X_before_split, y_before_split, test_size = 0.20)
Now viewing the dataframes we have:
# view X train dataset (this is encoding before split)
X_train_before_split
and then for test
# View X test dataset (this is encoding before split)
X_test_before_split
Scenario 2: One-Hot Encoding/Dummy Variables AFTER splitting into Train/Test:
# Perform One Hot encoding after the train/test split instead
X = df.drop('web_views', axis = 1)
y = df['web_views']
# perform data split:
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# perform one hot encoding on the train and test datasets:
X_train = pd.concat([X_train.drop('website', axis = 1), pd.get_dummies(X_train['website'])], axis=1)
X_test = pd.concat([X_test.drop('website', axis = 1), pd.get_dummies(X_test['website'])], axis=1)
Viewing the X_train and X_test dataframes:
# encode after train/test split - train dataframe
X_train
# encode after train/test split - test dataframe
X_test
Performing Linear Regression Modelling
Now that we have split our data both ways, we will create a simple linear model to demonstrate:
from sklearn.linear_model import LinearRegression
Before split linear model
regressor_before_split = LinearRegression()
regressor_before_split.fit(X_train_before_split, y_train_before_split)
y_pred_before_split = regressor_before_split.predict(X_test_before_split)
y_pred_before_split
y_pred_before_split returns predicted values, as we would expect.
After split linear model
regressor_after_split = LinearRegression()
regressor_after_split.fit(X_train, y_train)
y_pred_after_split = regressor_after_split.predict(X_test)
y_pred_after_split
Error message from Scenario 2:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-92-c63978a198c8> in <module>()
2 regressor_after_split.fit(X_train, y_train)
3
----> 4 y_pred_after_split = regressor_after_split.predict(X_test)
5 y_pred_after_split
C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
C:\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (1,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
My thoughts:
Encoding with dummies before splitting ensures that the test data we pass in (e.g. X_test) has the same shape as the training data the model was trained on, so the model knows how to predict when it encounters these features. With encoding after splitting, X_test ends up with only one feature to make predictions with, whereas X_train has three.
Maybe I've introduced data leakage?
I'd be happy for someone to correct me if I've got things wrong or misinterpreted anything, but I'm stuck scratching my head over whether to encode before or after splitting!
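For what it's worth, one way to get the best of both worlds is to fit the encoder on the training split only and reuse it on the test split, so the test frame always comes out with the training columns. A minimal sketch with scikit-learn's OneHotEncoder, starting from the raw split in Scenario 2 (before the get_dummies lines); handle_unknown='ignore' means categories unseen in training encode as all zeros instead of raising:
from sklearn.preprocessing import OneHotEncoder

# Fit on the training categories only (no information flows from test to train)...
enc = OneHotEncoder(handle_unknown='ignore')
X_train_enc = enc.fit_transform(X_train[['website']]).toarray()
# ...then reuse the fitted encoder on the test split: the output has the
# same columns as the training encoding, regardless of which websites
# happen to land in the test split.
X_test_enc = enc.transform(X_test[['website']]).toarray()
This sidesteps the shape mismatch of Scenario 2 without encoding the combined data.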
I have this issue when trying to use sklearn.preprocessing.MinMaxScaler on a large array and obtaining the scaling parameters so I can redo the normalization after handling the array for a while.
The issue is that after calling MinMaxScaler.fit_transform(data), where data is a numpy array with shape (8, 412719), the scaling parameters obtained from MinMaxScaler.scale_ form an array of length 412719.
How do I obtain an array with scaling parameters instead? I'm missing 7 columns worth of scaling parameters if I've not misunderstood something.
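If I understand MinMaxScaler correctly, nothing is missing: it computes one scaling parameter per column (feature), so an array of shape (8, 412719) yields 412719 entries in scale_ (a numpy array, not a list). If you want 8 parameters, one per row, a sketch is to fit on the transpose:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.random.rand(8, 412719)  # stand-in for the real array

scaler = MinMaxScaler()
scaler.fit_transform(data)
print(scaler.scale_.shape)      # (412719,): one parameter per column

# To scale along the other axis, treat rows as features by transposing:
row_scaler = MinMaxScaler()
data_scaled = row_scaler.fit_transform(data.T).T
print(row_scaler.scale_.shape)  # (8,): one parameter per row
Either way, the fitted scaler's inverse_transform() can undo the normalization later.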
I build my X dataframe and y target, then scale the X dataframe:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

df3.dropna(inplace=True)
X_Columns = [column for column in df3.columns
             if column not in ["Target", "DateTime", "Date", "CO2Intensity",
                               "ActualWindProduction", "ORKWindspeed",
                               "ForecastWindProduction"]]
#print(X_Columns)
X=df3[X_Columns]
#print(X)
y=df3["Target"]
scaler=MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
X_train_scaled = scaler.fit_transform(X_train)
classifier = GaussianNB()
classifier.fit(X_train_scaled, y_train)
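Note that X_test is never scaled in the snippet above. To evaluate or predict, transform it with the scaler already fitted on the training data; use transform (not fit_transform) so the training min/max are reused and no test information leaks into the scaling. A sketch of the missing step:
# Reuse the training-set scaling parameters on the test set, then score.
X_test_scaled = scaler.transform(X_test)
print(classifier.score(X_test_scaled, y_test))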
I first did a train/test split and then fitted the data to a LinearRegression model, shown below:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.4, random_state = 101)
Log_m = LinearRegression()
Log_m.fit(X_train,y_train)
predictions = Log_m.predict(X_test)
I have been given another test data frame and wanted to predict on it with the Log_m model that was created. So I did:
predictions_t = Log_m.predict(fin_df1_t)
But I get the error message :
ValueError: shapes (1450,262) and (282,) not aligned: 262 (dim 1) != 282 (dim 0)
These are the shapes of dataframes
fin_df1_t (1450,262)
X_test (556,282)
X_train (834,282)
y_test (556,)
y_train (834,)
The new test data has 262 feature columns, which does not match the 282 feature columns of X_train and X_test, so prediction will always fail. Both must have the same feature columns.
X_train and X_test themselves have the same columns (282), which is why there was no error at that step.
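If the mismatch comes from pd.get_dummies() producing different dummy columns on the two frames, one possible fix (a sketch, assuming the 262 columns are a subset of the 282) is to reindex the new frame to the training columns:
# Align fin_df1_t with the columns the model was trained on; dummy
# columns missing from fin_df1_t are filled with 0, extras are dropped.
fin_df1_t_aligned = fin_df1_t.reindex(columns=X_train.columns, fill_value=0)
predictions_t = Log_m.predict(fin_df1_t_aligned)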
I am building an application in Python which can predict the values for PM2.5 pollution from a dataframe. I am using the values for November, and I am first trying to build the linear regression model. How can I make the linear regression work without using the dates? I only need predictions for PM2.5; the dates are known.
Here is what I tried so far:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(data['day'], data['pm25'],
                                                    test_size=0.3, random_state=0)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(data['day'], data['pm25'])
This code throws the following error:
ValueError: Expected 2D array, got 1D array instead:
array=['2019-11-01T00:00:00.000000000' '2019-11-01T00:00:00.000000000'
'2019-11-01T00:00:00.000000000' ... '2019-11-30T00:00:00.000000000'
'2019-11-30T00:00:00.000000000' '2019-11-30T00:00:00.000000000'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You need to pass a pandas DataFrame instead of a pandas Series for the X values, so you might want to do something like this:
UPDATE:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import datetime
data = pd.read_csv("https://raw.githubusercontent.com/iulianastroia/csv_data/master/final_dataframe.csv")
data['day'] = pd.to_datetime(data['day'], dayfirst=True)
print(data.head())
x_data = data[['day']]
y_data = data['pm25']
#Splitting the dataset into training(70%) and test(30%)
X_train, X_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3,
                                                    random_state=0)
# linear regression does not work on date type of data, convert it into numerical type
X_train['day'] = X_train['day'].map(datetime.datetime.toordinal)
X_test['day'] = X_test['day'].map(datetime.datetime.toordinal)
#Fitting Linear Regression to the dataset
lin_reg = LinearRegression()
lin_reg.fit(X_train[["day"]], y_train)
Now you can predict the data using,
print(lin_reg.predict(X_test[["day"]])) #-->predict the data
This is just something to add about why you need the "[[", and how to avoid the frustration.
The reason data[['day']] works and data['day'] doesn't is that the fit method expects X to be two-dimensional, with shape (n_samples, n_features), but not y; see the docstring:
fit(self, X, y, sample_weight=None)
    Fit linear model.
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Training data
    y : array-like of shape (n_samples,) or (n_samples, n_targets)
        Target values. Will be cast to X's dtype if necessary.
So for example:
data[['day']].shape
(43040, 1)
data['day'].shape
(43040,)
np.resize(data['day'],(len(data['day']),1)).shape
(43040, 1)
These work because they have the structure required:
lin_reg.fit(data[['day']], data['pm25'])
lin_reg.fit(np.resize(data['day'],(len(data['day']),1)), data['pm25'])
While this doesn't:
lin_reg.fit(data['day'], data['pm25'])
Hence before running the function, check that you are providing input in the required format :)
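Two more equivalent ways to get the required two-dimensional structure from a Series, for reference:
data['day'].to_frame().shape              # (43040, 1): Series -> one-column DataFrame
data['day'].values.reshape(-1, 1).shape   # (43040, 1): the reshape the error message suggests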
So I am creating my training and test sets for use in a Multiple Linear Regression model using sklearn.
My dataset contains 182 features and looks like the following:
id feature1 feature2 .... feature182 Target
D24352 145 8 7 1
G09340 10 24 0 0
E40988 6 42 8 1
H42093 238 234 2 1
F32093 12 72 1 0
I then have the following code:
import pandas as pd
dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
y = dataset0.iloc[:, 31:32].values
dataset2.pop('Target')
X = dataset2.iloc[:, :180].values
Once I use DataFrame.iloc, however, I lose my indices (which I have set to be my IDs). I would like to keep these, as I currently have no way of telling which records in my results relate to which records in my original dataset when I do the following step:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
It looks like your data is stored with the object dtype. You should convert it to float64 (assuming all of your data is numeric; otherwise, convert only the columns you want as numeric). Since your index is of string type, you need to set the dtype of your dataframe after setting the index (and generating the dummies). Again, assuming the rest of your data is numeric:
import numpy as np

dataset = pd.read_csv('C:\\mylocation\\myfile.csv')
dataset0 = dataset.set_index('t1.id')
dataset2 = pd.get_dummies(dataset0)
dataset0 = dataset0.astype(np.float64)  # add this line to explicitly set the dtype
Now you should be able to just leave out .values when slicing the DataFrame:
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
With .values you access the underlying numpy arrays of the DataFrame. These do not have an index column. Since sklearn is, in most cases, compatible with pandas, you can simply pass a pandas DataFrame to sklearn.
If this does not work, you can still apply reset_index to your DataFrame. This will add the index as a new column, which you will have to drop when passing the training data to sklearn:
dataset0.reset_index(inplace=True)
dataset2.reset_index(inplace=True)
y = dataset0.iloc[:, 31:32]
dataset2.pop('Target')
X = dataset2.iloc[:, :180]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.drop('index', axis=1), y_train.drop('index', axis=1))
y_pred = regressor.predict(X_test.drop('index', axis=1))
In this case you'll still have to change the slicing [:, 31:32] and [:, :180] to the correct columns, so that the index will be included in the slice.
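Alternatively, a sketch that builds on the point above that sklearn accepts DataFrames directly: skip .values entirely and split the DataFrames, and the ID index travels with every split, so no reset_index/drop bookkeeping is needed (assuming Target has already been popped from dataset2 as above):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = dataset2                 # features as a DataFrame, index = IDs
y = dataset0['Target']       # select the target by name instead of position
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
# X_test.index still carries the original IDs, so predictions can be
# matched back to the records they belong to:
y_pred = pd.Series(regressor.predict(X_test), index=X_test.index)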