Multi-output regression from two input datasets - python

Is it possible to regress a dataset Y from two datasets X1 and X2, if all
X1, X2 and Y are matrices. So, it's a multi-output regression problem.
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(x1, x2, y, test_size=0.2)
Lasso_Regr = Lasso(alpha=0.05, normalize=True)
Lasso_Regr.fit([x1_train, x2_train], y_train)
y_pred = Lasso_Regr.predict([x1_test, x2_test])
I am getting the following error:
Found array with dim 3. Estimator expected <= 2.*

This is misleading if you split the predictors of training set separately, as the mapping between two predictors is necessary for an accurate prediction.
Since you have importing a csv, first transform it to get it to vertical format, convert to a data frame and do the analysis as below.
EDITED:
Sample Code:
import pandas as pd
import csv
from itertools import izip
from sklearn import linear_model, model_selection
a = izip(*csv.reader(open("input.csv", "rb")))
csv.writer(open("output.csv", "wb")).writerows(a)
df = pd.read_csv("output.csv")
print(df)
x = df[['x1', 'x2', 'x3']]
y = df['y']
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2)
Lasso_Regr = linear_model.Lasso(alpha=0.05, normalize=True)
Lasso_Regr.fit(x_train, y_train)
y_pred = Lasso_Regr.predict(x_test)
print y_pred
You can add any number of predictors.

Related

How to apply KNN from large datatset to small dataset or to just one test data

I have trained and tested a KNN model on a supervised dataset of about 180 samples (6 classes of 30 samples each) in Python. I would like to apply these results to a small unsupervised dataset of 21 samples (3 classes of 7 samples).
The problem is datasets have different number of raws. So either I getting an error with inconsistent numbers of samples, or matching target in a new datasets and getting not representative result.
I want to see which classes datas from new small dataset corespond in large dataset. Is there a way to do that?
Here is my code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import utils
data, y = utils.load_data() #utils consist large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
clf = KNeighborsClassifier()
for key in data:
scores = cross_val_score(clf, data[key], y, cv=5)
print("Accuracy for {:5s} : {:0.2f} (+/- {:0.2f})".format(
key, scores.mean(), scores.std() * 2))
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
df = pd.read_csv('small dataset')
X = df.drop(columns=['subject', 'sessionIndex', 'rep'])
y = df['subject']
Y = pd.get_dummies(y).values
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.2, random_state=1, stratify=y)
n_neighbors = [2, 3, 4, 5, 6]
parameters = dict(n_neighbors=n_neighbors)
clf = KNeighborsClassifier()
grid = GridSearchCV(clf, parameters, cv=5)
grid.fit(X_train, Y_train)
results = grid.cv_results_
for i in range(1, 4):
candidates = np.flatnonzero(results['rank_test_score'] == i)
for candidate in candidates:
print("Model with rank: {}".format(i))
print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
results['mean_test_score'][candidate],
results['std_test_score'][candidate]))
print("Parameters: {}".format(results['params'][candidate]))
print()
from sklearn.metrics import accuracy_score, roc_curve, auc
Y_pred = grid.predict(X[1:2])
print(Y_pred)`
So I'm getting an array [[0 0 1]] which is correct, only it doesn't check any classes in large dataset of 6 classes like if I matching X and Y to datas from it, not from small dataset
data, y = utils.load_data() #utils consist large dataset
Y = pd.get_dummies(y).values
n_classes = Y.shape[1]
X = data['large dataset']
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.2, random_state=1, stratify=y)
Y_pred = grid.predict(X[1:2])
print(Y_pred)`
This way the result an a array of 6 numbers like [[0 0 0 0 0 1]]. And I want to see the same when testing new small dataset.

Trying to understand an example script on ML

I'm trying to work through an example script on machine learning: Common pitfalls in interpretation of coefficients of linear models but I'm having trouble understanding some of the steps. The beginning of the script looks like this:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_openml
survey = fetch_openml(data_id=534, as_frame=True)
# We identify features `X` and targets `y`: the column WAGE is our
# target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")
X.head()
# Our target for prediction is the wage.
y = survey.target.values.ravel()
survey.target.head()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
My problem is in the lines
y = survey.target.values.ravel()
survey.target.head()
If we examine survey.target.head() immediately after these lines, the output is
Out[36]:
0 5.10
1 4.95
2 6.67
3 4.00
4 7.50
Name: WAGE, dtype: float64
How does the model know that WAGE is the target variable? Does is not have to be explicitly declared?
The line survey.target.values.ravel() is meant to flatten the array, but in this example it is not necessary. survey.target is a pd Series (i.e 1 column data frame) and survey.target.values is a numpy array. You can use both for train/test split since there is only 1 column in survey.target .
type(survey.target)
pandas.core.series.Series
type(survey.target.values)
numpy.ndarray
If we use just survey.target, you can see that the regression will work:
y = survey.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
If you have another dataset, for example iris, I want to regress petal width against the rest. You would call the column of the data.frame using the square brackets [] :
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
dat = load_iris(as_frame=True).frame
X = dat[['sepal length (cm)','sepal width (cm)','petal length (cm)']]
y = dat[['petal width (cm)']]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
LR = LinearRegression()
LR.fit(X_train,y_train)
plt.scatter(x=y_test,y=LR.predict(X_test))

Can't predict values using Linear Regression

there!
I'm studying the IBM Data Science course by Coursera and I'm trying to create some snippets to practice. I've created the following code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Import and format the dataframes
ibov = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ibov.csv')
ifix = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ifix.csv')
ibov['DATA'] = pd.to_datetime(ibov['DATA'], format='%d/%m/%Y')
ifix['DATA'] = pd.to_datetime(ifix['DATA'], format='%d/%m/%Y')
ifix = ifix.sort_values(by='DATA', ascending=False)
ibov = ibov.sort_values(by='DATA', ascending=False)
ibov = ibov[['DATA','FECHAMENTO']]
ibov.rename(columns={'FECHAMENTO':'IBOV'}, inplace=True)
ifix = ifix[['DATA','FECHAMENTO']]
ifix.rename(columns={'FECHAMENTO':'IFIX'}, inplace=True)
# Merge datasets
df_idx = ibov.merge( ifix, how='left', on='DATA')
df_idx.set_index('DATA', inplace=True)
df_idx.head()
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = np.array([x_train])
y_train = np.array([y_train])
x_test = np.array([x_test])
y_test = np.array([y_test])
# Plot the result
regr.fit(x_train, y_train)
y_pred = regr.predict(y_train)
plt.scatter(x_train, y_train)
plt.plot(x_test, y_pred, color='blue', linewidth=3) # This line produces no result
I experienced some issues with the output values returned by the train_test_split() method. So I converted them to Numpy arrays, then my code worked. I can plot my scatter plot normally, but I can't plot my prediction line.
Running this code on my IBM Data Cloud Notebook produces the following warning:
/opt/conda/envs/Python36/lib/python3.6/site-packages/matplotlib/axes/_base.py:380: MatplotlibDeprecationWarning:
cycling among columns of inputs with non-matching shapes is deprecated.
cbook.warn_deprecated("2.2", "cycling among columns of inputs "
I searched on Google and here on StackOverflow, but I can't figure what is wrong.
I'll appreciate some assistance. Thanks in advance!
There are several issues in your code, like y_pred = regr.predict(y_train) and the way you draw a line.
The following code snippet should set you in the right direction:
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values
# Plot the result
plt.scatter(x_train, y_train)
regr.fit(x_train.reshape(-1,1), y_train)
idx = np.argsort(x_train)
y_pred = regr.predict(x_train[idx].reshape(-1,1))
plt.plot(x_train[idx], y_pred, color='blue', linewidth=3);
To do the same for the test subset with already fitted model:
# Plot the result
plt.scatter(x_test, y_test)
idx = np.argsort(x_test)
y_pred = regr.predict(x_test[idx].reshape(-1,1))
plt.plot(x_test[idx], y_pred, color='blue', linewidth=3);
Feel free to ask questions if you have any.

ValueError : x and y must be the same size

I have a dataset which i'm trying to calculate Linear regression using sklearn.
The dataset i'm using is already made so there are not suppose to be problems with it.
I have used train_test_split in order to split my data into train and test groups.
When I try to use matplotlib in order to create scatter plot between my ttest and prediction group, I get the next error:
ValueError: x and y must be the same size
This is my code:
y=data['Yearly Amount Spent']
x=data[['Avg. Session Length','Time on App','Time on Website','Length of Membership','Yearly Amount Spent']]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=101)
#training the model
from sklearn.linear_model import LinearRegression
lm=LinearRegression()
lm.fit(x_train,y_train)
lm.coef_
predictions=lm.predict(X_test)
#here the problem starts:
plt.scatter(y_test,predictions)
Why does this error occurs?
I have seen previous posts here and the suggestions for this was to use x.shape and y.shape but i'm not sure what is the purpose of that.
Thanks
It seems that you are using the EcommerceCustomers.csv dataset (link here)
In your original post the column 'Yearly Amount Spent' is also included in the y as well as in x but this is wrong.
The following should work fine:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
data = pd.read_csv("EcommerceCustomers.csv")
y = data['Yearly Amount Spent']
X = data[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
# ## Training the Model
lm = LinearRegression()
lm.fit(X_train,y_train)
# The coefficients
print('Coefficients: \n', lm.coef_)
# ## Predicting Test Data
predictions = lm.predict( X_test)
See also this

Train test dataset regression results

My problem is that I am applying a simple linear regression on my data: when I split the data to train and test data I don't find significant model when bad p-value and r squared and adjusted r squared results while there is good results in train data.
Here's the code for more explanations :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
data = pd.read_excel ("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx",'Sheet1') #Import Excel file
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
print(data1)
X = data1.iloc[0:len(data1),5].values.reshape(-1, 1) #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),6].values.reshape(-1, 1) #Extract the column of the PAUS SP
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.3, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
X2 = sm.add_constant(X_train)
est = sm.OLS(Y_train, X2)
est2 = est.fit()
print(est2.summary())
X3 = sm.add_constant(X_test)
est3 = sm.OLS(Y_test, X3)
est4 = est3.fit()
print(est4.summary())
At the end, when trying to display statistical results, I always find good results in train data but not in test data. Probably something wrong in my code.
To notice I am a beginner with python
Try running this model a few times, without specifying random_state in train_test_split or changing the test_size parameter.
I.e.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.2)
As of now, every time you run the model, you do the same split of data, so it is possible that you overfit the model just because of the split.

Categories

Resources