Predicting values using an array in Python [duplicate]

I am looking for a solution to the following problem. I am using a regression model to predict some values. Currently I can predict a single value, i.e. the model predicts the y-value for an x-value of 10, for example. What I would like to do is pass an array as the input, so the model returns a new array of predicted values, one for each element of the input array. Is this possible? I have attached some code in the hope that it helps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=3)
wspoly = poly_reg.fit_transform(ws)
lin_reg2 = LinearRegression()
lin_reg2.fit(wspoly, elec)

plt.figure(figsize=(25, 15))
x_grid = np.arange(min(ws), max(ws), 0.1)
x_grid = x_grid.reshape(len(x_grid), 1)
plt.scatter(ws, elec, color='red')
# transform (not fit_transform): the polynomial features are already fitted on ws
plt.plot(x_grid, lin_reg2.predict(poly_reg.transform(x_grid)), color='blue')
plt.title("Polynomial Regression of Wind Speed & Electricity Generated")
plt.xlabel('Wind Speed (m/s)')
plt.ylabel('Electricity Generated (kWh)')
plt.show()
Output from the above code showing the polynomial regression model
prediction = lin_reg2.predict(poly_reg.transform([[10]]))

Start out by creating a function to return the prediction. The function will look like this:
def my_func(x_value):
    # your code goes here
    return y_value
Then you can create an array of predictions like this:
y_values = [my_func(x) for x in list_of_x_values]
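Alternatively, scikit-learn's predict already accepts a 2-D array and returns one prediction per row, so no Python-level loop is needed. A minimal sketch, reusing the fitted poly_reg and lin_reg2 from the question:
import numpy as np

# Several wind speeds at once, shaped (n_samples, 1) as scikit-learn expects
new_ws = np.array([[5.0], [10.0], [15.0]])

# One transform plus one predict call returns an array with one value per input row
predictions = lin_reg2.predict(poly_reg.transform(new_ws))
print(predictions)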

Related

Should features that correlate be deleted from ML models?

I've seen that it's common practice to delete input features that exhibit collinearity (keeping only one of them).
However, I've just completed a course on how a linear regression model assigns different weights to different features, and it occurred to me that the model might do better than us by giving a low weight to less useful features instead of our deleting them completely.
To resolve this doubt myself, I created a small dataset resembling an x-squared function and fit two linear regression models in Python:
A model that keeps only the x_squared feature
A model that keeps both the x and x_squared features
The results suggest that we shouldn't delete features and should instead let the model decide the best weights. However, I would like to ask the community whether the rationale of my exercise is right, and whether this question has come up elsewhere.
Here's my code to generate the dataset:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate the data
all_Y = [10, 3, 1.5, 0.5, 1, 5, 8]
all_X = range(-3, 4)
all_X_2 = np.square(all_X)
# Store the data into a dictionary
data_dic = {"x": all_X, "x_2": all_X_2, "y": all_Y}
# Generate a dataframe
df = pd.DataFrame(data=data_dic)
# Display the dataframe
display(df)
which produces this:
   x  x_2     y
0 -3    9  10.0
1 -2    4   3.0
2 -1    1   1.5
3  0    0   0.5
4  1    1   1.0
5  2    4   5.0
6  3    9   8.0
and this is the code to generate the ML models:
# Create the lists to iterate over
ids = [1, 2]
features = [["x_2"], ["x", "x_2"]]
titles = ["$x^{2}$", "$x$ and $x^{2}$"]
colors = ["blue", "green"]
# Initiate figure
fig = plt.figure(figsize=(15, 5))
# Iterate over the necessary lists to plot results
for i, model, title, color in zip(ids, features, titles, colors):
    # Initiate model, fit and make predictions
    lr = LinearRegression()
    lr.fit(df[model], df["y"])
    predicted = lr.predict(df[model])
    # Calculate mean squared error of the model
    mse = mean_squared_error(all_Y, predicted)
    # Create a subplot for each model
    plt.subplot(1, 2, i)
    plt.plot(df["x"], predicted, c=color, label="f(" + title + ")")
    plt.scatter(df["x"], df["y"], c="red", label="y")
    plt.title("Linear regression using " + title + " --- MSE: " + str(round(mse, 3)))
    plt.legend()
# Display results
plt.show()
which generates this figure (two subplots, one per model, each titled with its MSE):
What do you think about this issue? This difference in mean squared error can be of high importance in certain contexts.
Because x and x^2 are not linearly related, deleting one of them does not help the model here. The general guidance for regression is to delete features that are highly collinear with each other (which also means highly correlated).
So x^2 and y are highly correlated, and you are trying to predict y with x^2? A high correlation between a predictor variable and the response variable is usually a good thing. Since x and y are practically uncorrelated here, including x is likely to "dilute" your model and worsen its performance.
(Multi-)collinearity between the predictor variables themselves would be more problematic.
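To make the distinction concrete, here is a minimal sketch, reusing the dataframe df from the question, that checks both kinds of correlation: predictor-to-response and predictor-to-predictor. Over the symmetric range -3..3, x and x_2 are almost uncorrelated, so there is no collinearity to remove:
# Correlation of each predictor with the response and with each other
print(df.corr().round(3))
# Expect: corr(x_2, y) is high, while corr(x, x_2) is zero over the symmetric
# range, so x and x_2 are not collinear and the usual drop-one rule does not apply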

How to loop through multiple polynomial fits changing the degree

My code functions properly but I am repeating a block several times to vary the polynomial variable, degree. I assume this can and should be looped to allow quicker iterations, but I'm not sure how to do it. Prior to the code below I generate the train_test split which I keep for plotting.
After several iterations, I use np.vstack on the y_predictions to create a single array.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
### degree 1 ####
poly1 = PolynomialFeatures(degree=1)
x1_poly = poly1.fit_transform(X_train)
linreg1 = LinearRegression().fit(x1_poly, y_train)
pred_1= poly1.transform(x_prediction_data)
y1_poly_pred=linreg1.predict(pred_1)
### degree 3 #####
poly3 = PolynomialFeatures(degree=3)
x3_poly = poly3.fit_transform(X_train)
linreg3 = LinearRegression().fit(x3_poly, y_train)
pred_3= poly3.transform(x_prediction_data)
y3_poly_pred=linreg3.predict(pred_3)
#### etc... several more blocks for degree = 6, 9, ...
I would recommend collecting your answers in a dictionary, but I created a list for simplicity.
The code iterates over i, the degree of the polynomial, trains the model, and then collects its predictions.
predictions_collector = []
for i in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=i)
    x_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(x_poly, y_train)
    pred = poly.transform(x_prediction_data)
    y_poly_pred = linreg.predict(pred)
    # collect the answer after each iteration/increase of degrees
    predictions_collector.append(y_poly_pred)
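For completeness, a minimal sketch of the dictionary variant mentioned above, followed by the np.vstack step from the question; it assumes the same X_train, y_train, and x_prediction_data as before:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

predictions = {}  # maps degree -> prediction array, so results stay labeled
for degree in [1, 3, 6, 9]:
    poly = PolynomialFeatures(degree=degree)
    linreg = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    predictions[degree] = linreg.predict(poly.transform(x_prediction_data))

# Stack into one 2-D array with one row per degree, as the question does
stacked = np.vstack([predictions[d] for d in [1, 3, 6, 9]])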

Gaussian Processes in scikit-learn: good performance on training data, bad performance on testing data

I wrote a Python script that uses scikit-learn to fit Gaussian Processes to some data.
IN SHORT: the problem I am facing is that while the Gaussian Processes seem to learn the training dataset very well, the predictions on the testing dataset are off, and it seems to me there is a normalization problem behind this.
IN DETAIL: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learnt by the Gaussian Processes is between a set of three coordinates x, y, z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x, y, z and one time series, and the GPs learn this mapping. The idea is that, given new coordinates, the trained GPs should be able to predict the time series associated with those coordinates.
Here is my code:
from __future__ import division
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

coordinates_training = np.loadtxt(...)  # read training coordinates x, y, z from file
coordinates_testing = np.loadtxt(..)  # read testing coordinates x, y, z from file

# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset
mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)
for i in range(3):
    mean_coords_training[i] = coordinates_training[:, i].mean()
    std_coords_training[i] = coordinates_training[:, i].std()
    coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i]) / std_coords_training[i]
    coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i]) / std_coords_training[i]

time_series_training = np.loadtxt(...)  # reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1]  # time components per series

# z-score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
    mean_time_series_training[i] = time_series_training[:, i].mean()
    std_time_series_training[i] = time_series_training[:, i].std()
    time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i]) / std_time_series_training[i]

time_series_testing = np.loadtxt(...)  # reading test data from file
# the number of time components is the same for the training and testing datasets
# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
    time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i]) / std_time_series_training[i]

# GPs
pred_time_series_training = np.zeros(np.shape(time_series_training))
pred_time_series_testing = np.zeros(np.shape(time_series_testing))

# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)
for i in range(number_of_time_components):
    print("time component", i)
    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp.fit(coordinates_training, time_series_training[:, i])
    # Make the prediction (ask for the standard deviation as well)
    y_pred_train, sigma_train = gp.predict(coordinates_training, return_std=True)
    y_pred_test, sigma_test = gp.predict(coordinates_testing, return_std=True)
    # Undo the z-scoring of the predictions
    pred_time_series_training[:, i] = y_pred_train * std_time_series_training[i] + mean_time_series_training[i]
    pred_time_series_testing[:, i] = y_pred_test * std_time_series_training[i] + mean_time_series_training[i]

# plot training
fig, ax = plt.subplots(5, figsize=(10, 20))
for i in range(5):
    ax[i].plot(time_series_training[100 * i], color='blue', label='Original training')
    ax[i].plot(pred_time_series_training[100 * i], color='black', label='GP predicted - training')

# plot testing
fig, ax = plt.subplots(5, figsize=(10, 20))
for i in range(5):
    ax[i].plot(time_series_testing[100 * i], color='blue', label='Original testing')
    ax[i].plot(pred_time_series_testing[100 * i], color='black', label='GP predicted - testing')
Here are examples of performance on the training data.
Here are examples of performance on the testing data.
First, you should use the scikit-learn preprocessing tools to treat your data:
from sklearn.preprocessing import StandardScaler
There are other useful tools there, but this specific one standardizes the data.
Second, you should normalize the training set and the test set with the same parameters. The model fits the "geometry" of the data to define its parameters; if you feed it data on another scale, it is like using the wrong system of units.
scale = StandardScaler()
training_set = scale.fit_transform(data_train)
test_set = scale.transform(data_test)
This applies the same transformation to both sets.
Finally, you should normalize the features, not the target: normalize the X entries, not the Y output. Normalization helps the model find the answer faster by changing the topology of the objective function in the optimization process; the output does not affect this.
I hope this answers your question.
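As a minimal sketch of that advice applied to the code in the question (assuming the coordinates_training, coordinates_testing, and time_series_training arrays defined above), the scaler is fit on the training coordinates only and then reused for the test coordinates:
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Fit the scaler on the training coordinates only, then reuse it for testing,
# so both sets are expressed in the same "units"
coord_scaler = StandardScaler()
coords_train_scaled = coord_scaler.fit_transform(coordinates_training)
coords_test_scaled = coord_scaler.transform(coordinates_testing)

# One GP per time component, as in the question; shown here for component 0
gp = GaussianProcessRegressor(kernel=1.0 * Matern(nu=1.5))
gp.fit(coords_train_scaled, time_series_training[:, 0])
y_pred_test = gp.predict(coords_test_scaled)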

Shap force plot and decision plot giving wrong output for XGBClassifier model

I'm trying to produce SHAP decision plots for a small subset of predictions, but the outputs found by SHAP differ from what I get when simply using the model to predict, even with link = 'logit' in the call. Every decision plot I'm trying to produce should show a value greater than the expected value, due to the subset I'm plotting. However, every single plot produced has a predicted value lower than the expected value.
I have two models that are in a minimum ensemble so I'm using a for loop to determine which model to produce a plot for. I'm having no issue creating the correct plots for the RandomForestClassifier model, but the issue is occurring for the XGB model.
rf_explainer = shap.TreeExplainer(RF_model)
xgb_explainer = shap.TreeExplainer(XGB_model)

for i in range(flagged.shape[0]):
    if flagged_preds.RF_Score[i] == flagged_preds.Ensemble_Score[i]:
        idx = flagged.index[i]
        idxstr = idx[1].astype('str') + ' -- ' + idx[2].date().strftime('%Y-%m-%d') + ' -- ' + idx[0].astype('str')
        shap_value = rf_explainer.shap_values(flagged.iloc[i, :])
        shap.decision_plot(rf_explainer.expected_value[1], shap_value[1], show=False)
        plt.savefig(f'//PathToFolder/{idxstr} -- RF.jpg', format='jpg', bbox_inches='tight', facecolor='white')
    if flagged_preds.XGB_Score[i] == flagged_preds.Ensemble_Score[i]:
        idx = flagged.index[i]
        idxstr = idx[1].astype('str') + ' -- ' + idx[2].date().strftime('%Y-%m-%d') + ' -- ' + idx[0].astype('str')
        shap_value = xgb_explainer.shap_values(flagged.iloc[i, :])
        shap.decision_plot(xgb_explainer.expected_value, shap_value, link='logit', show=False)
        plt.savefig(f'//PathToFolder/{idxstr} -- XGB.jpg', format='jpg', bbox_inches='tight', facecolor='white')
    plt.close()
As mentioned before, when scoring, each observation (of the ones I'm concerned with) should have a score > .5, but that isn't what I'm seeing in my SHAP plots. Here is an example:
This plot shows an output of about .1, but when scoring this observation with predict_proba, I get a value of .608.
I can't really provide a reprex due to the sensitive nature of the data and I'm not sure what the underlying issue is.
Any feedback would be very welcome, Thank you.
Relevant pip freeze items:
Python 3.7.3
matplotlib==3.0.3
shap==0.30.1
xgboost==0.90
I suggest making a direct comparison between your model output and your SHAP output.
The second code example in Section "Changing the SHAP base value" in the SHAP Decision Plots documentation shows how to sum SHAP values to match the model output for a LightGBM model. You can use the same approach for any other model. If the summed SHAP values don't match the model output, it's not a plotting issue. The code example is copied below. Both lines should print the same value.
# The model's raw prediction for the first observation.
print(model.predict(features.iloc[[0]].values, raw_score=True)[0].round(4))
# The corresponding sum of the mean + shap values
print((expected_value + shap_values[0].sum()).round(4))
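For the XGBClassifier in the question, the same check might look like the following sketch (hypothetical, assuming a binary logistic objective and the XGB_model, xgb_explainer, and flagged objects from the question). SHAP's TreeExplainer for XGBoost works in log-odds space, so the summed SHAP values should match the raw margin, and their logistic transform should match predict_proba:
import xgboost
from scipy.special import expit  # logistic sigmoid

row = flagged.iloc[[0], :]  # one observation, kept 2-D

# Raw margin (log-odds) from the booster for that row
raw = XGB_model.get_booster().predict(xgboost.DMatrix(row), output_margin=True)[0]

# Base value + SHAP values for the same row, also in log-odds space
shap_sum = xgb_explainer.expected_value + xgb_explainer.shap_values(row)[0].sum()

print(round(float(raw), 4), round(float(shap_sum), 4))  # should match
print(expit(raw), XGB_model.predict_proba(row)[0, 1])   # should also match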

Outlier detection with Local Outlier Factor (LOF)

I am working with healthcare insurance claims data and would like to identify fraudulent claims. I have been reading online to find a better method, and I came across the following code on scikit-learn.org.
Does anyone know how to select the outliers? The code plots them in a graph, but I would like to select those outliers if possible.
I have tried appending the y predictions to the X dataframe, but that has not worked.
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor
np.random.seed(42)
# Generate train data
X = 0.3 * np.random.randn(100, 2)
# Generate some abnormal novel observations
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.r_[X + 2, X - 2, X_outliers]
# fit the model
clf = LocalOutlierFactor(n_neighbors=20)
y_pred = clf.fit_predict(X)
y_pred_outliers = y_pred[200:]
Below is the code I tried.
X['outliers'] = y_pred
The first 200 data points are inliers while the last 20 are outliers. When you call fit_predict on X, you get either outlier (-1) or inlier (1) for each row in y_pred. So to get the predicted outliers, you need to find where y_pred is -1 and take the corresponding values in X. The script below will give you the outliers in X.
X_pred_outliers = [each[1] for each in list(zip(y_pred, X.tolist())) if each[0] == -1]
I zip y_pred and X together and, wherever y_pred is -1, collect the X values.
However, there are eight errors in the predictions (8 out of 220): some -1 values appear in y_pred[:200] and some 1 values in y_pred[200:]. Please be aware of these errors as well.
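As a side note, a numpy boolean mask is a more idiomatic way to make the same selection. A minimal sketch using the X and y_pred arrays from above:
import numpy as np

# Rows of X where the model predicted an outlier (-1)
X_pred_outliers = X[y_pred == -1]
print(X_pred_outliers.shape)

# The same mask also works on a pandas DataFrame, e.g. df[y_pred == -1]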
