I am doing a project and trying to show some BASIC elements of scikit in python. My goal is to create a 3ish simple examples and show how it learns and predicts. I am applying a simple sine wave type pattern and have been playing with a good example online from
https://mclguide.readthedocs.io/en/latest/sklearn/regression.html
My problem is that since I am new to this library and ML in general, I don't understand what I have in front of me and how to transform it into the output I am going for. The two problems I am struggling with is a linear regression on a sine wave and a guassian regression on a more complicated wave. The output I am getting per the article is the accuracy and that works like intended but what I am trying to get to is how to plot the predicted output on top of (or as an extension) of the training data to visually show how it did. I think the data is in here, I am either just using the wrong methods to return the appropriate information or I am not understanding how to extract the information from what is already being returned.
Here are some additional questions
I do not completely understand the "features = x[:, np.newaxis]" line
When plotting, what does '-*' and '-o'do? I looked through the documentation and it appears to be formatting but I couldn't find these two examples exactly.
What do I need to do to get access to the 20% predicted values so that I can plot it against the original?
Is there a simple way to apply the most amount of this code to apply to simple and gaussian examples?
Here is the skeletal code. Most of the scikit from the article is unchanged.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import random
from operator import add
N = 200 # 10 samples
randomlist = []
x = np.linspace(0, 12, N)
sine_wave = np.sin(1*x)
#plot the source data
plt.figure(figsize=(20,5))
plt.plot(x, sum_vector, 'o');
plt.show()
# convert features in 2D format i.e. list of list
# print('Before: ', x.shape)
features = x[:, np.newaxis]
# print('After: ', features.shape)
# save sine wave in variable 'targets'
targets = sine_wave
# split the training and test data
train_features, test_features, train_targets, test_targets = train_test_split(
features, targets,
train_size=0.8,
test_size=0.2,
# random but same for all run, also accuracy depends on the
# selection of data e.g. if we put 10 then accuracy will be 1.0
# in this example
random_state=23,
# keep same proportion of 'target' in test and target data
# stratify=targets # can not used for single feature
)
# training using 'training data'
regressor = LinearRegression()
regressor.fit(train_features, train_targets) # fit the model for training data
# predict the 'target' for 'training data'
prediction_training_targets = regressor.predict(train_features)
# note that 'score' uses 'feature and target (not predict_target)'
# for scoring in Regression
# whereas 'accuracy_score' uses 'features and predict_targets'
# for scoring in Classification
self_accuracy = regressor.score(train_features, train_targets)
print("Accuracy for training data (self accuracy):", self_accuracy)
# predict the 'target' for 'test data'
prediction_test_targets = regressor.predict(test_features)
test_accuracy = regressor.score(test_features, test_targets)
print("Accuracy for test data:", test_accuracy)
# plot the predicted and actual target for test data
plt.figure(figsize=(20,5))
plt.plot(test_targets, color = "red")
plt.show()
plt.plot(prediction_test_targets, '-*', color = "red")
plt.plot(test_targets, '-o' )
plt.show()
Related
I am learning Linear regression, I wrote this Linear Regression code using scikit-learn , after making the prediction, how to do prediction for new data points which are not there in my original data set.
In this data set you are given the salaries of people according to their work experience.
For example , The predicted salary for a person with work experience of 15 years should be [167005.32889087]
Here is image of data set
Here is my code ,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = pd.read_csv('project_1_dataset.csv')
X = data.iloc[:,0].values.reshape(-1,1)
Y = data.iloc[:,1].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X,Y)
Y_pred = linear_regressor.predict(X)
plt.scatter(X,Y)
plt.plot(X, Y_pred, color = 'red')
plt.show()
After fitting and training your model with your existed dataset (i.e. after linear_regressor.fit(X,Y)), you could make predictions in new instances in the same way:
new_prediction = linear_regressor.predict(new_data)
print(new_prediction)
where new_data is your new data point.
If you want to make predictions on particular random new data points, the above way should be enough. If your new data points belong to another dataframe, then you could replace new_data with the respective dataframe containing the new instances to be predicted.
I have monthly temperature data from several stations in eastern siberia. However, the one station which is necessary for my work is missing a lot of data, while other stations in the vicinity have good coverage. Is there a way to interpolate missing data based on the behavior of another dataset? Can't provide any code, since I don't know where to start and the datasets look like this:
The red dots is the data from the station with missing values while the green graph is from a station with good coverage
I would appreciate if anyone could point me in the right direction
There are methods to do this, for instance, apply a FFT on the dataset with good coverage and see how well it fits your dataset with poor coverage while removing high-frequency terms.
However, I highly doubt that this will be any useful: your dataset with high coverage fits almost perfectly your dataset with poor coverage. Whatever is the method you want to apply, the best function that resembles your dataset with high coverage while fitting your dataset with poor coverage, is the dataset with high coverage itself.
Let's create a trial dataset to work on your problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
t = np.linspace(0, 30*2*np.pi, 30*24*2)
td = pd.date_range("2020-01-01", freq='30T', periods=t.size)
T0 = np.sin(t)*8 - 15 + np.random.randn(t.size)*0.2
T1 = np.sin(t)*7 - 13 + np.random.randn(t.size)*0.1
T2 = np.sin(t)*9 - 10 + np.random.randn(t.size)*0.3
T3 = np.sin(t)*8.5 - 11 + np.random.randn(t.size)*0.5
T = np.vstack([T0, T1, T2, T3]).T
features = pd.DataFrame(T, columns=["s1", "s2", "s3", "s4"], index=td)
It looks like:
axe = features[:"2020-01-04"].plot()
axe.legend()
axe.grid()
Then if your time series linearly correlates well, you can simply predict missing values by the mean of Ordinary Least Square regression. SciKit-Learn provides a convenient interface to perform this kind of computations:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
# Remove target site from features:
target = features.pop("s4")
# Split dataset into train (actual data) and test (missing temperatures):
x_train, x_test, y_train, y_test = train_test_split(features, target, train_size=0.25, random_state=123)
# Create a Linear Regressor and train it:
reg = linear_model.LinearRegression()
reg.fit(x_train, y_train)
# Assess regression score with test data:
reg.score(x_test, y_test) # 0.9926150729585087
# Predict missing values:
ypred = reg.predict(x_test)
ypred = pd.DataFrame(ypred, index=x_test.index, columns=["s4p"])
The result looks like:
axe = features[:"2020-01-04"].plot()
target[:"2020-01-04"].plot(ax=axe)
ypred[:"2020-01-04"].plot(ax=axe, linestyle='None', marker='.')
axe.legend()
axe.grid()
error = (y_test - ypred.squeeze())
axe = error.plot()
axe.legend(["Prediction Error"])
axe.grid()
I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used the sklearn.linear_model.Ridge as my baseline and after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought, maybe the linear model is too simplistic so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.Kernel_Ridge) but they would take too much time to fit (>1hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training and then used a simple linear regression model. However, even that took a lot of time to transform and fit if I increased the n_components parameter which I had to to get any meaningful increase in the accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!
This is a great question but as it often happens there is no simple answer to complex problems. Regression is not a simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes couple university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regressions:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs etc. Many of these features are also moving targets, as they tend to change over time. Even 0.12 R2 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that dependent variable (price) is linearly related to independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still addressable by other models, like non-parametric regressions and neural networks.
So, why people still use (linear) regression?
it is the simplest and fastest model. There are a lot of implications for real-time systems and statistical analysis, so it does matter
often it is used as a baseline model. Before trying a fancy neural network architecture, it would be helpful to know how much we improve comparing to a naive method.
sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables
To summarize, regression is definitely not the ultimate tool in most cases, but this is usually the cheapest solution to try first
UPD, to illustrate the point about non-linearity.
After building a regression you calculate residuals, i.e. regression error predicted_value - true_value. Then, for each feature you make a scatter plot, where horizontal axis is feature value and vertical axis is the error value. Ideally, residuals have normal distribution and do not depend on the feature value. Basically, errors are more often small than large, and similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as different residuals mean depending on feature value. Other problems include but not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is an great read on regression diagnostics for beginners.
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.datasets import load_boston
#boston = load_boston()
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# get only 2 fields and convert non-numerics to numerics
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
from sklearn.ensemble import RandomForestClassifier
features = df_new.columns.values
clf = RandomForestClassifier()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to rowk with numeric data so let's try lng and lat coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat, is performing slightly better, but this too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you to predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did make a mistake, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (just around 70%, give or take, but still much better than what we're seeing with the Air BNB data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
I'm trying to figure out how to unscale my data (presumably using inverse_transform) for predictions when I'm using a pipeline. The data below is just an example. My actual data is much larger and complicated, but I'm looking to use RobustScaler (as my data has outliers) and Lasso (as my data has dozens of useless features). I am new to pipelines in general.
Basically, if I try to use this model to predict anything, I want that prediction in unscaled terms. Is this possible with a pipeline? How can I do this with inverse_transform?
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df[['Cost']]
#Split
X_train,X_test,y_train,y_test = train_test_split(X,y)
#Pipeline
pipeline = Pipeline([('scale', RobustScaler()),
('alg', Lasso())])
clf = pipeline.fit(X_train,y_train)
train_score = clf.score(X_train,y_train)
test_score = clf.score(X_test,y_test)
print ("training score:", train_score)
print ("test score:", test_score)
#Predict example
example = [[10,100]]
clf.predict(example)
Simple Explanation
Your pipeline is only transforming the values in X, not y. The differences you are seeing in y for predictions are related to the differences in the coefficient values between two models fitted using scaled vs. unscaled data.
So, if you "want that prediction in unscaled terms" then take the scaler out of your pipeline. If you want that prediction in scaled terms you need scale the new prediction data prior to passing it to the .predict() function. The Pipeline actually does this for you automatically if you have included a scaler step in it.
Scaling and Regression
The practical purpose of scaling here would be when people and supplies have different dynamic ranges. Using the RobustScaler() removes the median and scales the data according to the quantile range. Typically you would only do this if you thought that your people or supply data has outliers that would influence the sample mean / variance in a negative way. If this is not the case, you would likely use the StandardScaler() to remove the mean and scale to unit variance.
Once the data is scaled, you can compare the regression coefficients to better to understand how the model is making its predictions. This is important since the coefficients for unscaled data may be very misleading.
An Example Using Your Code
The following example shows:
Predictions using both scaled and unscaled data with and without the pipeline.
The predictions match in both cases.
You can see what the pipeline is doing in the background by looking at the non-pipeline examples.
I have also included the model coefficients in both cases. Note that the coefficients or weights for the scaled vs. unscaled fitted models are very different.
These coefficients are used to generate each prediction value for the variable example.
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
data = [[100, 1, 50],[500 , 3, 25],[1000 , 10, 100]]
df = pd.DataFrame(data,columns=['Cost','People', 'Supplies'])
X = df[['People', 'Supplies']]
y = df[['Cost']]
#Split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
#Pipeline
pipeline_scaled = Pipeline([('scale', RobustScaler()),
('alg', Lasso(random_state=0))])
pipeline_unscaled = Pipeline([('alg', Lasso(random_state=0))])
clf1 = pipeline_scaled.fit(X_train,y_train)
clf2 = pipeline_unscaled.fit(X_train,y_train)
#Pipeline predict example
example = [[10,100]]
print('Pipe Scaled: ', clf1.predict(example))
print('Pipe Unscaled: ',clf2.predict(example))
#------------------------------------------------
rs = RobustScaler()
reg = Lasso(random_state=0)
# Scale the taining data
X_train_scaled = rs.fit_transform(X_train)
reg.fit(X_train_scaled, y_train)
# Scale the example
example_scaled = rs.transform(example)
# Predict using the scaled data
print('----------------------')
print('Reg Scaled: ', reg.predict(example_scaled))
print('Scaled Coefficents:',reg.coef_)
#------------------------------------------------
reg.fit(X_train, y_train)
print('Reg Unscaled: ', reg.predict(example))
print('Unscaled Coefficents:',reg.coef_)
Outputs:
Pipe Scaled: [1892.]
Pipe Unscaled: [-699.6]
----------------------
Reg Scaled: [1892.]
Scaled Coefficents: [199. -0.]
Reg Unscaled: [-699.6]
Unscaled Coefficents: [ 0. -15.9936]
For Completeness
You original question asks about "unscaling" your data. I don't think this is what you actually need, since the X_train is your unscaled data. Howver, the following example shows how you could do this as well using the scaler object from your pipeline.
#------------------------------------------------
pipeline_scaled['scale'].inverse_transform(X_train_scaled)
Output
array([[ 3., 25.],
[ 1., 50.]])
It's an old problem about prediction using regression exploring Gapminder data. They used "prediction space" to compute prediction.
Q1. Why should I be creating "prediction space"? What is the use of it?
Q2. The relation of computing predictions over the "prediction space"?
import numpy as np
import pandas as pd
# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')
The data seems like this;
Country,Year,life,population,income,region
Afghanistan,1800,28.211,3280000,603.0,South Asia
Slovak Republic,1960,70.47800000000001,4137224,8693.0,Europe & Central Asia
# Create arrays for features and target variable
y = df.life.values
X = df.fertility.values
# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(min(X_fertility), max(X_fertility)).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
I believe that you are taking a course from DataCamp
I stumbled upon this too, and the answer is prediction_space and y_pred are used to construct the straight line in the graph
NOTE: for those who are reading this and don't understand what I'm talking about, the code snippet is actually missing the graph plotting code
# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
It comes with the y_pred to make a baseline for you to calculate the residuals and further get the R^2 value.