Can't predict values using Linear Regression - python

Hi there!
I'm taking the IBM Data Science course on Coursera, and I'm writing some snippets to practice. I've created the following code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
# Import and format the dataframes
ibov = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ibov.csv')
ifix = pd.read_csv('https://raw.githubusercontent.com/thiagobodruk/datasets/master/ifix.csv')
ibov['DATA'] = pd.to_datetime(ibov['DATA'], format='%d/%m/%Y')
ifix['DATA'] = pd.to_datetime(ifix['DATA'], format='%d/%m/%Y')
ifix = ifix.sort_values(by='DATA', ascending=False)
ibov = ibov.sort_values(by='DATA', ascending=False)
ibov = ibov[['DATA','FECHAMENTO']]
ibov.rename(columns={'FECHAMENTO':'IBOV'}, inplace=True)
ifix = ifix[['DATA','FECHAMENTO']]
ifix.rename(columns={'FECHAMENTO':'IFIX'}, inplace=True)
# Merge datasets
df_idx = ibov.merge(ifix, how='left', on='DATA')
df_idx.set_index('DATA', inplace=True)
df_idx.head()
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = np.array([x_train])
y_train = np.array([y_train])
x_test = np.array([x_test])
y_test = np.array([y_test])
# Plot the result
regr.fit(x_train, y_train)
y_pred = regr.predict(y_train)
plt.scatter(x_train, y_train)
plt.plot(x_test, y_pred, color='blue', linewidth=3) # This line produces no result
I experienced some issues with the output values returned by the train_test_split() method, so I converted them to NumPy arrays, and then my code worked. I can plot my scatter plot normally, but I can't plot my prediction line.
Running this code on my IBM Data Cloud Notebook produces the following warning:
/opt/conda/envs/Python36/lib/python3.6/site-packages/matplotlib/axes/_base.py:380: MatplotlibDeprecationWarning:
cycling among columns of inputs with non-matching shapes is deprecated.
cbook.warn_deprecated("2.2", "cycling among columns of inputs "
I searched on Google and here on Stack Overflow, but I can't figure out what is wrong.
I'd appreciate some assistance. Thanks in advance!

There are several issues in your code, such as y_pred = regr.predict(y_train) (predictions should be made from x values, not y values) and the way you draw the line.
The following code snippet should set you in the right direction:
# Split training and testing samples
x_train, x_test, y_train, y_test = train_test_split(df_idx['IBOV'], df_idx['IFIX'], test_size=0.2)
# Convert the samples to Numpy arrays
regr = linear_model.LinearRegression()
x_train = x_train.values
y_train = y_train.values
x_test = x_test.values
y_test = y_test.values
# Plot the result
plt.scatter(x_train, y_train)
regr.fit(x_train.reshape(-1,1), y_train)
idx = np.argsort(x_train)
y_pred = regr.predict(x_train[idx].reshape(-1,1))
plt.plot(x_train[idx], y_pred, color='blue', linewidth=3);
To do the same for the test subset with the already fitted model:
# Plot the result
plt.scatter(x_test, y_test)
idx = np.argsort(x_test)
y_pred = regr.predict(x_test[idx].reshape(-1,1))
plt.plot(x_test[idx], y_pred, color='blue', linewidth=3);
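The reshape(-1, 1) calls matter because scikit-learn estimators expect the features as a 2-D array of shape (n_samples, n_features), while .values on a Series gives a 1-D array. A minimal sketch of the shape difference (the numbers are made up, just for illustration):
import numpy as np
x = np.array([86000., 85200., 87100.])  # 1-D array, shape (3,)
X = x.reshape(-1, 1)                    # 2-D column, shape (3, 1)
print(x.shape, X.shape)                 # (3,) (3, 1)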
Feel free to ask questions if you have any.

Related

Logistic regression gives me precision of 0.55. Whats wrong with my code?

I copied columns from data frame z, created dummies, and I'm trying to predict click, a 0/1 variable, with balanced classes in the train and test sets. Where did I go wrong?
df = z[['user_state', 'device_maker', 'day_of_week', 'device_area_zscore', 'Age_zscore', 'consumption_zscore', 'click']].copy()
day_dummy = pd.get_dummies(df["day_of_week"])
state_dummy = pd.get_dummies(df["user_state"])
maker_dummy = pd.get_dummies(df["device_maker"])
combined_df = pd.concat([df, day_dummy, state_dummy, maker_dummy], axis=1)
click_rows = combined_df[combined_df.click == 1]
no_click_rows = combined_df[combined_df.click == 0]
no_click_rows = no_click_rows.sample(frac=1, replace=False, random_state=1)
final_df = pd.concat([click_rows, no_click_rows], axis = 0)
final_df = final_df.reset_index(drop=True)
from sklearn.model_selection import train_test_split
final_df = final_df.drop(['user_state', 'device_maker', 'day_of_week'], axis = 1)
x_train, x_test, y_train, y_test = train_test_split(final_df.drop(['click'], axis = 1), final_df['click'], test_size=0.2, random_state=2)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(x_train,y_train)
predictions = logmodel.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
Following are my suggestions:
You are missing the stratify argument in the train_test_split function; it ensures the target distribution is similar across the train and test data.
Logistic regression doesn't do well at detecting non-linear patterns in data. Try a tree-based model such as RandomForestClassifier; a sketch of both changes follows.
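A minimal sketch of both suggestions, assuming final_df as prepared above (the forest's hyperparameters are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

features = final_df.drop(['click'], axis=1)
target = final_df['click']

# stratify keeps the 0/1 ratio of `click` the same in both splits
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=2, stratify=target)

# a tree-based model can capture non-linear feature interactions
forest = RandomForestClassifier(n_estimators=200, random_state=2)
forest.fit(x_train, y_train)
print(classification_report(y_test, forest.predict(x_test)))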

Incorrect labels in confusion matrix

I have tried to create a confusion matrix for a kNN classifier in Python, but the class labels are wrong.
The class attribute of the dataset is 2 (for benign) and 4 (for malignant), but when I plot the confusion matrix, all the labels are 2. The code I use is:
Data source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
KNN classifier on Breast Cancer Wisconsin (Diagnostic) Data Set from UCI:
data = pd.read_csv('/breast-cancer-wisconsin.data')
data.replace('?', 0, inplace=True)
data.drop('id', 1, inplace=True)
X = np.array(data.drop(' class ', 1))
Y = np.array(data[' class '])
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, Y_train)
accuracy = clf.score(X_test, Y_test)
Plot confusion matrix
from sklearn.metrics import plot_confusion_matrix
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             display_labels=Y,
                             cmap=plt.cm.Blues)
Confusion matrix
The problem is that you're specifying the display_labels argument with Y, whereas it should just be the target names used for plotting. As it stands, it uses the first two values that appear in Y, which happen to be 2, 2. Note too that, as mentioned in the docs, the displayed labels will be the same as labels if that argument is provided, so you just need:
from sklearn.metrics import plot_confusion_matrix
fig, ax = plt.subplots(figsize=(8, 8))
disp = plot_confusion_matrix(clf, X_test, Y_test,
                             labels=np.unique(Y),
                             cmap=plt.cm.Blues, ax=ax)
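Note that plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2; on a recent version the equivalent call is ConfusionMatrixDisplay.from_estimator. A sketch, assuming clf, X_test and Y_test from above:
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 8))
ConfusionMatrixDisplay.from_estimator(clf, X_test, Y_test,
                                      labels=np.unique(Y),
                                      cmap=plt.cm.Blues, ax=ax)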

Train test dataset regression results

My problem is that I am applying a simple linear regression to my data: when I split the data into train and test sets, the model is not significant on the test data (poor p-value, R-squared, and adjusted R-squared), while the results on the train data are good.
Here's the code for more explanations :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
data = pd.read_excel ("C:\\Users\\AchourAh\\Desktop\\PL14_IPC_03_09_2018_SP_Level.xlsx",'Sheet1') #Import Excel file
data1 = data.fillna(0) #Replace null values of the whole dataset with 0
print(data1)
X = data1.iloc[0:len(data1),5].values.reshape(-1, 1) #Extract the column of the COPCOR SP we are going to check its impact
Y = data1.iloc[0:len(data1),6].values.reshape(-1, 1) #Extract the column of the PAUS SP
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.3, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
plt.scatter(X_train, Y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
plt.scatter(X_test, Y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('SP00114585')
plt.xlabel('COP COR Quantity')
plt.ylabel('PAUS Quantity')
plt.show()
X2 = sm.add_constant(X_train)
est = sm.OLS(Y_train, X2)
est2 = est.fit()
print(est2.summary())
X3 = sm.add_constant(X_test)
est3 = sm.OLS(Y_test, X3)
est4 = est3.fit()
print(est4.summary())
At the end, when I try to display the statistical results, I always find good results on the train data but not on the test data. There is probably something wrong in my code.
Note that I am a beginner with Python.
Try running this model a few times, without specifying random_state in train_test_split or changing the test_size parameter.
I.e.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size =0.2)
As of now, every time you run the model you get the same split of the data, so it is possible that the model only appears to overfit because of that particular split; the cross-validation sketch below is a more systematic check.
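A more systematic check is k-fold cross-validation, which averages the fit over several different splits. A minimal sketch, assuming X and Y as defined in the question (cv=5 is an arbitrary choice):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Y.ravel() flattens the (n, 1) column vector to the 1-D shape sklearn prefers
scores = cross_val_score(LinearRegression(), X, Y.ravel(), cv=5, scoring='r2')
print(scores)         # R-squared on each of the 5 held-out folds
print(scores.mean())  # average out-of-sample R-squared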

Multi-output regression from two input datasets

Is it possible to regress a dataset Y on two datasets X1 and X2, if X1, X2, and Y are all matrices? So it's a multi-output regression problem.
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(x1, x2, y, test_size=0.2)
Lasso_Regr = Lasso(alpha=0.05, normalize=True)
Lasso_Regr.fit([x1_train, x2_train], y_train)
y_pred = Lasso_Regr.predict([x1_test, x2_test])
I am getting the following error:
Found array with dim 3. Estimator expected <= 2.
Passing [x1_train, x2_train] stacks the two matrices into a 3-D array, which the estimator rejects. Splitting the predictors separately is also misleading, because the row-wise pairing between the two predictor sets is necessary for an accurate prediction. Since you are importing a CSV, first transpose it into a vertical format, convert it to a data frame, and do the analysis as below.
EDITED:
Sample Code:
import pandas as pd
import csv
from sklearn import linear_model, model_selection

# Transpose the CSV so each predictor ends up in its own column
with open("input.csv", newline="") as src, open("output.csv", "w", newline="") as dst:
    csv.writer(dst).writerows(zip(*csv.reader(src)))

df = pd.read_csv("output.csv")
print(df)
x = df[['x1', 'x2', 'x3']]
y = df['y']
x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2)
Lasso_Regr = linear_model.Lasso(alpha=0.05)  # normalize= was removed from recent scikit-learn; scale features beforehand if needed
Lasso_Regr.fit(x_train, y_train)
y_pred = Lasso_Regr.predict(x_test)
print(y_pred)
You can add any number of predictors.
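If x1 and x2 are already in memory as NumPy arrays, the same idea works without the CSV round-trip: stack them column-wise into a single 2-D feature matrix before splitting, so each row keeps its x1 and x2 values paired. A minimal sketch with made-up shapes (the arrays here are hypothetical stand-ins for your data):
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 3))   # first predictor matrix
x2 = rng.normal(size=(100, 2))   # second predictor matrix
y = rng.normal(size=(100, 4))    # multi-output target

X = np.column_stack([x1, x2])    # shape (100, 5): one 2-D matrix, rows stay paired
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Lasso(alpha=0.05)
model.fit(X_train, y_train)
print(model.predict(X_test).shape)  # (20, 4): one prediction per output column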

Python: How can we match values of predicted and truth values of a regression model

We are trying to plot the predicted values and the truth values on the same graph, after fitting a RandomForestRegressor in Python to predict the truth value from a three-column dataset (the full CSV dataset was linked in the original post; it is formatted as in the following sample):
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
Here is how we do the prediction.
import pandas as pd
import numpy as np
import glob, os
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
import math
from math import sqrt
from sklearn.model_selection import train_test_split
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data*.csv"))))
for i in range(1,10):
df['X_t'+str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(10)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
reg = RandomForestRegressor(criterion='mse')
reg.fit(X_train,y_train)
modelPred_test = reg.predict(X_test)
print(modelPred_test)
For comparison, we wish to generate a plot before prediction and after prediction. For the truth value, we tried it with
fig, ax = plt.subplots()
ax.plot(df['time'].values, df['Y'].values)
We wish to plot (on the same graph) the ground truth, with time on the x-axis and the value of Y on the y-axis. When we do
ax.plot(df['time'].values, modelPred_test)
We are getting the following error.
raise ValueError("x and y must have same first dimension")
ValueError: x and y must have same first dimension
This means that we have fewer prediction values than we have time stamps in our dataset. To verify this, I printed df['time'].values.shape and modelPred_test.shape, which output (258523,) and (103410,) respectively. How can we match which of the time values correspond to the prediction values, so that I can use the matching subset of time values in my plot command?
You have to set up your data like the following.
X = df.drop('Y', axis=1)
y = df['Y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
X_train = X_train.drop('time', axis=1)
X_test = X_test.drop('time', axis=1)
and then sort the datasets
index_values = range(0, len(y_test))
y_test.sort_index(inplace=True)
X_test.sort_index(inplace=True)
modelPred_test = reg.predict(X_test)
ax.plot(pd.Series(index_values), y_test.values)
finally, do the same plot for the predicted values of y. Hope this helps.
You need to keep track of the indices of the training and testing datasets. For example, you could define
train_index, test_index = train_test_split(df.index, test_size=0.40)
and then X_train = X[train_index], etc.
Then, you could plot the results via ax.plot(df['time'][test_index].values, modelPred_test), since the predictions are already aligned row-by-row with test_index.
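A fuller sketch combining these steps, assuming df has the columns from the question and a time column named 'time' (the sample above calls it t_stamp):
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# split on the index so each prediction can be traced back to its row
train_index, test_index = train_test_split(df.index, test_size=0.40)
feature_cols = [c for c in df.columns if c not in ('time', 'Y')]

reg = RandomForestRegressor()
reg.fit(df.loc[train_index, feature_cols], df.loc[train_index, 'Y'])

# sort the test rows by time so both curves draw left to right
test_sorted = df.loc[test_index].sort_values('time')
pred = reg.predict(test_sorted[feature_cols])

fig, ax = plt.subplots()
ax.plot(test_sorted['time'].values, test_sorted['Y'].values, label='truth')
ax.plot(test_sorted['time'].values, pred, label='predicted')
ax.legend()
plt.show()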
