My code functions properly but I am repeating a block several times to vary the polynomial variable, degree. I assume this can and should be looped to allow quicker iterations, but I'm not sure how to do it. Prior to the code below I generate the train_test split which I keep for plotting.
After several iterations, I use np.vstack on the y_predictions to create a single array.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
### degree 1 ####
poly1 = PolynomialFeatures(degree=1)
x1_poly = poly1.fit_transform(X_train)
linreg1 = LinearRegression().fit(x1_poly, y_train)
pred_1= poly1.transform(x_prediction_data)
y1_poly_pred=linreg1.predict(pred_1)
### degree 3 #####
poly3 = PolynomialFeatures(degree=3)
x3_poly = poly3.fit_transform(X_train)
linreg3 = LinearRegression().fit(x3_poly, y_train)
pred_3= poly3.transform(x_prediction_data)
y3_poly_pred=linreg3.predict(pred_3)
#### ect... will make several other degree = 6, 9 ...
I would recommend collecting your answers in a dictionary, but I created a list for simplicity.
The code iterates over i, which is the degree of your polynomials. Trains the model, etc..., then collects its answers.
prediction_collector = []
for i in [1,3,6,9]:
poly = PolynomialFeatures(degree=i)
x_poly = poly.fit_transform(X_train)
linreg = LinearRegression().fit(x_poly, y_train)
pred= poly.transform(x_prediction_data)
y_poly_pred=linreg.predict(pred)
# to collect the answer after each iteration/increase of degrees
predictions_collector.append(y_poly_pred)
Related
I've seen that it's common practice to delete input features that demonstrate co-linearity (and leave only one of them).
However, I've just completed a course on how a linear regression model will give different weights to different features, and I've thought that maybe the model will do better than us giving a low weight to less necessary features instead of completely deleting them.
To try to solve this doubt myself, I've created a small dataset resembling a x_squared function and applied two linear regression models using Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features, and let the model decide the best weights instead. However, I would like to ask the community if the rationale of my exercise is right, and whether you've found this doubt in other places.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces this:
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15,5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
# Initiate model, fit and make predictions
lr = LinearRegression()
lr.fit(df[model], df["y"])
predicted = lr.predict(df[model])
# Calculate mean squared error of the model
mse = mean_squared_error(all_Y, predicted)
# Create a subplot for each model
plt.subplot(1, 2, i)
plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
plt.scatter(df["x"], df["y"], c="red", label="y")
plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
plt.legend()
# Display results
plt.show()
which generate this:
What do you think about this issue? This difference in the Mean Squared Error can be of high importance on certain contexts.
Because x and x^2 are not linear anymore, that is why deleting one of them is not helping the model. The general notion for regression is to delete those features which are highly co-linear (which is also highly correlated)
So x2 and y are highly correlated and you are trying to predict y with x2? A high correlation between predictor variable and response variable is usually a good thing - and since x and y are practically uncorrelated you are likely to "dilute" your model and with that get worse model performance.
(Multi-)Colinearity between the predicor variables themselves would be more problematic.
I wanted to create my own Transformer using scikit-learn FunctionTransformer and followed their example as a dry run. It worked, but then I wanted to take the inverse of that transformation just to see the end result. However, when I tried the inverse_transform, it returned the same thing as the transformation. How do I get the original values? I ask this because I plan on using this transformation to transform a target variable, then make predictions. Those predictions will need be inversely transformed after I predict.
As a side bar, should I fit on y_train and transform on my y_test? Or can I transform y all at once?
My transformer:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
import random
randomlist = []
for i in range(0,100):
n = random.randint(1,100)
randomlist.append(n)
y = pd.Series(randomlist)
y_train = y[:80]
y_test = y[80:]
target_trans = FunctionTransformer(np.log, validate=True, check_inverse = True)
logy_train = target_trans.fit_transform(y_train.values.reshape(-1,1))
logy_test = target_trans.transform(y_test.values.reshape(-1,1))
target_trans.inverse_transform(y_train.values.reshape(-1,1))
Within FunctionTransformer() you not only need to define check_inverse=True but also define the actual inverse function itself.
So for the above,
target_trans = FunctionTransformer(np.log, inverse_func = np.exp
,validate=True, check_inverse = True)
which yields the desired result.
I wrote a Python script that uses scikit-learn to fit Gaussian Processes to some data.
IN SHORT: the problem I am facing is that while the Gaussian Processses seem to learn very well the training dataset, the predictions for the testing dataset are off, and it seems to me there is a problem of normalization behind this.
IN DETAIL: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learnt by the Gaussian Processes is between a set of three coordinates x,y,z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x,y,z and one time series, and the GPs learn this mapping. The idea is that, by giving to the trained GPs new coordinates, they should be able to give me the predicted time series associated to those coordinates.
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
coordinates_training = np.loadtxt(...) # read coordinates training x, y, z from file
coordinates_testing = np.loadtxt(..) # read coordinates testing x, y, z from file
# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset
mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)
for i in range(3):
mean_coords_training[i] = coordinates_training[:, i].mean()
std_coords_training[i] = coordinates_training[:, i].std()
coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i])/std_coords_training[i]
coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i])/std_coords_training[i]
time_series_training = np.loadtxt(...)# reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1] # 100 time components
# z_score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
mean_time_series_training[i] = time_series_training[:, i].mean()
std_time_series_training[i] = time_series_training[:, i].std()
time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i])/std_time_series_training[i]
time_series_testing = np.loadtxt(...)# reading test data from file
# the number of time components is the same for training and testing dataset
# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i])/std_time_series_training[i]
# GPs
pred_time_series_training = np.zeros((np.shape(time_series_training)))
pred_time_series_testing = np.zeros((np.shape(time_series_testing)))
# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
for i in range(number_of_time_components):
print("time component", i)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(coordinates_training, time_series_training[:,i])
# Make the prediction on the meshed x-axis (ask for MSE as well)
y_pred_train, sigma_train = gp.predict(coordinates_train, return_std=True)
y_pred_test, sigma_test = gp.predict(coordinates_test, return_std=True)
pred_time_series_training[:,i] = y_pred_train*std_time_series_training[i] + mean_time_series_training[i]
pred_time_series_testing[:,i] = y_pred_test*std_time_series_training[i] + mean_time_series_training[i]
# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
ax[i].plot(time_series_training[100*i], color='blue', label='Original training')
ax[i].plot(pred_time_series_training[100*i], color='black', label='GP predicted - training')
# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
ax[i].plot(features_time_series_testing[100*i], color='blue', label='Original testing')
ax[i].plot(pred_time_series_testing[100*i], color='black', label='GP predicted - testing')
Here examples of performance on the training data.
Here examples of performance on the testing data.
first you should use the sklearn preprocessing tool to treat your data.
from sklearn.preprocessing import StandardScaler
There are other useful tools to organaize but this specific one its to normalize the data.
Second you should normalize the training set and the test set with the same parameters¡¡ the model will fit the "geometry" of the data to define the parameters, if you train the model with other scale its like use the wrong system of units.
scale = StandardScaler()
training_set = scale.fit_tranform(data_train)
test_set = scale.transform(data_test)
this will use the same tranformation in the sets.
and finaly you need to normalize the features not the traget, I mean to normalize the X entries not the Y output, the normalization helps the model to find the answer faster changing the topology of the objective function in the optimization process the outpu doesnt affect this.
I hope this respond your question.
I am working with healthcare insurance claims data and would like to identify fraudulent claims. Have been reading online to try and find a better method. I came across the following code on scikit-learn.org
Does anyone know how to select the outliers? the code plot them in a graph but I would like to select those outliers if possible.
I have tried appending the y_predictions to the x dataframe but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code i tried.
X['outliers'] = y_pred
The first 200 data are inliers while the last 20 are outliers. When you did fit_predict on X, you will get either outlier (-1) or inlier(1) in y_pred. So to get the predicted outliers, you need to get those y_pred = -1 and get the corresponding value in X. Below script will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I combine y_pred and X into an array and check if y=-1, if yes then collect X values.
However, there are eight errors on the predictions (8 out of 220). These errors are -1 values in y_pred[:200] and 1 in y_pred[201:220]. Please be aware of the errors as well.
Thank you for taking your time to read my problem.
I have to run Ordinal Ridge and Lasso regression on my dataset. The values that I want to predict are ordinal (5 levels) and I have many predictors (over 60) that are continuous but not all of them are logically significant. So, I would like to run the Ordinal Regression using Lasso and Ridge to find the significant ones.
I am very new to python and I don't know really what to do and appreciate any help from the community.
I have found the mord module (and even if I am using it right), it doesn't provide Ordinal Lasso.
Could anyone help me with this, please?
Thanks in advance.
Update:
I have written the following code, I don't get any error and I get an accuracy lower than previous analyses. So, I assume I am making mistake at a point in how I am doing it. I would appreciate it if someone helps me with it. I guess it could be in scaling, but I don't know how.
"rel" has five values: 1,2,3,4,5 which are my predicted values.
import numpy as np
import pandas as pd
import mord
from sklearn.preprocessing import scale, StandardScaler
from sklearn.metrics import mean_squared_error
import csv
#defining a function to rotate numbers in an array
def leftRotatebyOne(arr, n):
temp = arr[0]
for i in range(n-1):
arr[i] = arr[i+1]
arr[n-1] = temp
#defining OR to do Ordinal Ridge Regression
OR = mord.OrdinalRidge()
#definign the loop to go through all participants
for s in range(17):
#reading the data for each participant
df = pd.read_csv("Complete{0}.csv".format(s+1), index_col=0, header=None).dropna()
df.index.name = 'subject{0}'.format(s+1)
df.columns = ["ch{0}".format(i+1) for i in range(64)] +["irrel", "rel"]
#defining output and predictors
y = df.rel
X = df.drop(['rel', 'irrel'], axis=1).astype('float64')
#an array containig trial numbers
T = np.array(range(480))
#defining a matrix to hold the models of all runs(480 one-leave_out) for each participants
out=np.empty((67,480))
#runing the model for all trials (each time keeping one out)
for t in range(480):
T1 = T[:479]
T2 = T[479:] #the last one which is going to be out
## Always the last one is going to be out, how it works is that we rotate T, so the last trail changes
#train samples
X_train = X.iloc[T1,:]
y_train = np.array(y.iloc[T1])
scaler = StandardScaler().fit(X_train)
#test sample
X_test = X.iloc[T2,:]
y_test = np.array(y.iloc[T2])
#rotating T
leftRotatebyOne(T,480)
#runing ordinal ridge regression from the module mord
OR.fit(scaler.transform(X_train), y_train)
predicted = OR.predict(scaler.transform(X_test))
error = mean_squared_error(y_test, predicted)
coeff = pd.Series(OR.coef_, index=X.columns)
#getting the accuracy of each prediction
if predicted == y_test:
accuracy = 1
else:
accuracy = 0
#having all results in a matrix (each column is for leaving out one of the trials)
out[:,t]=np.hstack((coeff,predicted,error, accuracy))
#saving the results for each participant
np.savetxt("reg{0}.csv".format(s+1), out, delimiter=',')
#saving all results in one file
filenames = ["reg{0}.csv".format(i+1) for i in range(17)]
dataframes = [pd.read_csv(p) for p in filenames]
merged_dataframe = pd.concat(dataframes, axis=1)
merged_dataframe.to_csv("merged.csv", index=False)
#reading the file that contains all the models for all the
participants
cl = pd.read_csv("merged.csv", header=None).dropna()
#naming the rows
cl.index = ["ch{0}".format(i+1) for i in range(64)]["predicted","error","accuracy"]
#calculating the mean of each row
print(pd.Series.mean(cl, axis=1))
#getting teh mean of accuracy for each participant
for s in range(17):
regg = pd.read_csv("reg{0}.csv".format(s+1), header=None).dropna()
regg.index = ["ch{0}".format(i+1) for i in range(64)]["predicted","error","accuracy"]
print(pd.Series.mean(regg, axis=1)[66])
I didn't find anything other than mord module.
I want to do a leave-one-out cross validation, and I have to just keep one of the samples for the test.
PS.
I am following instructions in this link:
http://nbviewer.jupyter.org/github/JWarmenhoven/ISL-python/blob/master/Notebooks/Chapter%206.ipynb
I get the following error with doing exactly as they have done:
module 'glmnet' has no attribute 'ElasticNet'
*However, they do not cover ordinal regression.
You can use sklearn for this,
from sklearn import linear_model
regr_lasso = linear_model.Lasso(alpha=0.1)
regr_ridge = linear_model.Ridge(alpha=1.0)
regr_elasticnet = linear_model.ElasticNet(random_state=0)
Refer below links for more details,
http://scikit-learn.org/stable/auto_examples/linear_model/plot_lasso_coordinate_descent_path.html