I'm trying to use sklearn's Lasso regression for some analysis, but I'm getting strange results. I've tried to narrow down the problem, but it appears the issue is that I just don't understand what sklearn is trying to do. For example, in the code below I would expect the coefficient for the 5th power of x to be 2, or at least close to it. However, no matter what I do, I keep getting values around 16.
Any ideas about what I'm missing/doing wrong?
import matplotlib
matplotlib.use('Qt4Agg')
from matplotlib import pyplot as plt
import numpy as np
from sklearn.linear_model import Lasso
x_data = np.reshape(np.linspace(-3, 3, 20), (-1, 1))
y_data = 2*x_data**5 # + np.random.normal(0, 2, x_data.shape)
X = np.hstack((np.ones(x_data.shape), x_data, x_data**2, x_data*3, x_data**4, x_data*5))
c5 = list()
for alpha in np.logspace(0, 2, num=100):
    model = Lasso(alpha=alpha, max_iter=15000, fit_intercept=True, warm_start=False, selection='cyclic', tol=1e-5)
    model.fit(X, y_data)
    coefficient = model.coef_[-1]
    c5.append(coefficient)
fig = plt.figure()
plt.plot(np.logspace(0, 2, num=100), c5, 'r-')
plt.xlabel('alpha')
plt.ylabel('coefficient of the last column')
plt.grid(True)
fig.canvas.manager.window.raise_()
plt.show()
I'm trying to run kernel ridge regression on a simple artificial dataset. When I run the code, I get two plots. The first, for the linear regression fit, looks normal. However, the kernel one is very erratic. Is this expected behavior, or am I not calling the functions properly?
(Screenshots omitted: the first plt.show() gives a clean linear fit; the second gives a jagged kernel ridge curve.)
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
w = 5
x = np.random.randn(10, 1)
x_to_draw_line = np.random.randn(1000, 1)
y = w * x
lr = LinearRegression()
lr.fit(x, y)
lr_preds = lr.predict(x_to_draw_line)
plt.figure()
plt.plot(x_to_draw_line, lr_preds, color="C1")
plt.scatter(x, y, color="C0")
plt.show()
krr = KernelRidge(kernel="polynomial")
krr.fit(x, y)
krr_preds = krr.predict(x_to_draw_line)
plt.figure()
plt.plot(x_to_draw_line, krr_preds, color="C1")
plt.scatter(x, y, color="C0")
plt.show()
The line plot appears jumbled because matplotlib draws a connecting line between each pair of points in the order they appear in the input array.
The solution is to sort the array of randomly generated x-values for which to generate and draw predictions:
x_to_draw_line = np.sort(np.random.randn(1000, 1), axis=0)
(np.sort is used here because ndarray.sort() sorts in place and returns None.)
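Putting it together, a minimal sketch of the corrected second plot, reusing the imports, data, and fitted krr model from the question:
x_to_draw_line = np.sort(np.random.randn(1000, 1), axis=0)  # sorted copy along axis 0
krr_preds = krr.predict(x_to_draw_line)  # predictions follow the sorted x-order
plt.figure()
plt.plot(x_to_draw_line, krr_preds, color="C1")  # line is now drawn left to right
plt.scatter(x, y, color="C0")
plt.show()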
I have a pandas dataframe (not shown) covering more than 10k answers for 150 questions.
I am struggling to find a way to see the correlation between fields.
In particular, I would like to understand how I can graphically show whether there is any correlation between Q015 and Q008, knowing that each respondent might have selected multiple answers (1, 2, 3).
Any ideas?
One option is to fit a linear regression and inspect its coefficient, which is closely related to the Pearson correlation.
Necessary libraries:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Code
list_variables, list_COEF, list_MSE, list_RMSE, list_R2SCORE = ([] for i in range(5))
# initialize the linear regression
xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3)
lr = LinearRegression()
lr_baseline = lr.fit(xtrain, ytrain)
pred_baseline = lr_baseline.predict(xtest)
list_variables.append("Q015 & Q008")
list_COEF.append(round(lr_baseline.coef_[0,0], 4))
list_MSE.append(round(mean_squared_error(ytest, pred_baseline), 2))
list_RMSE.append(round(math.sqrt(mean_squared_error(ytest, pred_baseline)), 2))
list_R2SCORE.append(round(r2_score(ytest, pred_baseline), 2))
# Plotting the graph
plt.figure(figsize=(12,8))
ax = plt.gca()
plt.suptitle("Q015 & Q008", fontsize=24, y=0.96)
plt.plot(xtest, ytest, 'bo', markersize = 5)
plt.plot(xtest, pred_baseline, color="red", linewidth = 2)
plt.xlabel("Q015", size=14)
plt.ylabel("Q008", size=14)
plt.tight_layout()
plt.show()
You will get something like the following, where the Coef. column tells you how strongly the variables are correlated.
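As a small sketch, the metric lists collected above could then be tabulated; the exact column layout here is an assumption:
results = pd.DataFrame({
    "Variables": list_variables,
    "Coef.": list_COEF,
    "MSE": list_MSE,
    "RMSE": list_RMSE,
    "R2 score": list_R2SCORE,
})
print(results)  # one row per variable pair, e.g. "Q015 & Q008"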
Another way is to look at the correlation matrix:
df_corr = pd.DataFrame(df[["Q015", "Q008"]].corr()).round(2)
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(10,8))
plt.title("Pearson correlation between features", size=20)
ax = sns.heatmap(df_corr, mask=mask, vmin=-1, cmap="mako_r")
plt.xticks(rotation=25, size=14, horizontalalignment="right")
plt.yticks(rotation=0, size=14)
plt.tight_layout()
plt.show()
An example with numeric columns:
df = pd.DataFrame(np.random.randint(0, 15, size=(100, 6)), columns=["Q01", "Q02", "Q03", "Q07", "Q015", "Q008"])
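With a frame like this, the same heatmap code extends to all six question columns:
df_corr = df.corr().round(2)  # 6x6 Pearson correlation matrix
sns.heatmap(df_corr, vmin=-1, cmap="mako_r")
plt.show()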
I want to build a chart similar to this (image omitted): a bar chart with the fitted logistic curve drawn on top.
I have created a bar chart, and I have the logistic regression completed.
#imports
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
plt.bar(prob_df['diff'], prob_df['full_win_prob'])
plt.show()
#logistic regression
X = dfx['home_diff'].values
y = dfx['away_win'].values
X = X.reshape(-1, 1)
logreg = LogisticRegression()
logreg.fit(X, y)
print(logreg.intercept_, logreg.coef_)
# output: [-0.67032214] [[0.04948131]]
I have the chart and I have the model, but I can't figure out how to plot the model on top of the chart. It's a bit frustrating; I'm sure the answer is simple. I would prefer an answer in matplotlib, but seaborn is also fine.
You can use yy = logreg.predict_proba(XX)[:,1] to plot the logistic curve for an array of x-values. logreg.predict_proba(XX)[:,0] gives the inverted curve, the probability of being 0. logreg.predict(XX) gives the predictions, i.e. the logistic curve rounded to 0 or 1.
Here is an example starting from some generated test data.
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import numpy as np
np.random.seed(123)
# logistic regression
X = np.random.uniform(1, 9, 20) # dfx['home_diff'].values
y = np.random.choice([0, 1], 20, p=[0.8, 0.2])
y = np.where(X < 5, y, 1 - y) # dfx['away_win'].values
X = X.reshape(-1, 1)
plt.scatter(X[:, 0], y, color='black', label='Given values')
logreg = LogisticRegression()
logreg.fit(X, y)
XX = np.linspace(0, 10, 1000).reshape(-1, 1)
prediction = logreg.predict(XX)
probability_0, probability_1 = logreg.predict_proba(XX).T
plt.plot(XX[:, 0], prediction, color='limegreen', lw=2, alpha=0.7, label='Predicted values')
plt.plot(XX[:, 0], probability_0, color='crimson', lw=2, alpha=0.7, ls='--', label='Probability of being 0')
plt.plot(XX[:, 0], probability_1, color='deepskyblue', lw=2, alpha=0.7, label='Probability of being 1')
plt.legend()
plt.show()
Thanks to JohanC's help I was able to get it. Clearly my answer is heavily inspired by his posting. I adjusted the code a bit so it displays the model over the bar graph rather than the scatter plot. I tried to upvote you, but I don't have enough karma.
Here is what I used and the result:
# logistic regression
X = dfx['home_diff'].values
y = dfx['away_win'].values
X = X.reshape(-1, 1)
logreg = LogisticRegression()
logreg.fit(X, y)
probability_1 = logreg.predict_proba(X)[:, 1]  # only the probability of class 1 is needed
plt.plot(X[:, 0], probability_1, color='blue', lw=2, alpha=0.6, label='Probability of being 1')
plt.bar(prob_df['diff'], prob_df['full_win_prob'], color='orange')
plt.legend()
plt.show()
The following code results in an x axis that ranges from 8 to 18, while the data for the x axis actually ranges from 1,000 to 50 million. I would expect a log scale to show 10,000, 100,000, 1,000,000, 10,000,000, etc.
How do I fix the x axis?
import pandas
dataset = pandas.DataFrame(Transactions, Price)
dataset = dataset.drop_duplicates()
import numpy as np
import matplotlib.pyplot as plt
X=dataset[['Transactions']]
y=dataset[['Price']]
log_X =np.log(X)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(log_X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
def viz_polymonial():
    plt.scatter(log_X, y, color='red')
    plt.plot(log_X, pol_reg.predict(poly_reg.fit_transform(log_X)), color='blue')
    plt.title('Price Curve')
    plt.xlabel('Transactions')
    plt.ylabel('Price')
    plt.grid(linestyle='dotted')
    plt.show()
    return
viz_polymonial()
(Plot omitted: the x axis runs from about 8 to 18, i.e. the logged values.)
You plot the values of log_X on a log scale. It's double-logged. Plot just X with a log scale, or np.exp(log_X).
No, you are not even using a log scale. Plot X with a log scale, via plt.xscale("log"), rather than log_X on a normal scale.
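A minimal sketch of that fix, assuming the X, y, poly_reg and pol_reg objects defined above (the function name is just for illustration):
def viz_polynomial_logscale():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(np.log(X))), color='blue')
    plt.xscale("log")  # tick labels now read 10^3, 10^4, ... instead of 8 to 18
    plt.title('Price Curve')
    plt.xlabel('Transactions')
    plt.ylabel('Price')
    plt.show()
viz_polynomial_logscale()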
I am using Python 2.7.15rc1 on Ubuntu 18.04 LTS. I was trying to plot a graph of Support Vector Regression, but I didn't get any output.
import matplotlib
matplotlib.use("Agg")
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
#Generate Sample data
x = np.sort(5 * np.random.rand(40, 1), axis = 0)
y = np.sin(x).ravel()
#Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))
#create classifier regression model
svr_rbf = SVR(kernel="rbf", C=1000, gamma=0.1)
svr_lin = SVR(kernel="linear", C=1000, gamma=0.1)
svr_poly = SVR(kernel="poly", C=1000, gamma=0.1)
#Fit regression model
y_rbf = svr_rbf.fit(x,y).predict(x)
y_lin = svr_lin.fit(x,y).predict(x)
y_poly = svr_poly.fit(x,y).predict(x)
#Plotting of results
lw = 2
plt.scatter(x, y, color="darkorange", label="data")
plt.plot(x, y_rbf, color="navy", lw=lw, label="RBF Model")
plt.plot(x, y_lin, color="c", lw=lw, label="Linear Model")
plt.plot(x, y_poly, color="cornflowerblue", lw=lw, label="Polynomial Model")
plt.xlabel("data")
plt.ylabel("target")
plt.title("Support Vector Regression")
plt.legend()
plt.show()
python svm.py outputs nothing. Did I miss an import? Or can this not be plotted at all? I am new to machine learning.
You just need to add %matplotlib inline at the top of your code if you are running in a Jupyter/IPython notebook.
Otherwise, I copied your code and removed matplotlib.use("Agg"); it works for me on Ubuntu 18.04 with matplotlib version 2.2.2. Can you specify which version you are using?
Also, here is the code:
import numpy as np
from sklearn.svm import SVR
import matplotlib.pyplot as plt
#Generate Sample data
x = np.sort(5 * np.random.rand(40, 1), axis = 0)
y = np.sin(x).ravel()
#Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))
#create classifier regression model
svr_rbf = SVR(kernel="rbf", C=1000, gamma=0.1)
svr_lin = SVR(kernel="linear", C=1000, gamma=0.1)
svr_poly = SVR(kernel="poly", C=1000, gamma=0.1)
#Fit regression model
y_rbf = svr_rbf.fit(x,y).predict(x)
y_lin = svr_lin.fit(x,y).predict(x)
y_poly = svr_poly.fit(x,y).predict(x)
#Plotting of results
lw = 2
plt.scatter(x, y, color="darkorange", label="data")
plt.plot(x, y_rbf, color="navy", lw=lw, label="RBF Model")
plt.plot(x, y_lin, color="c", lw=lw, label="Linear Model")
plt.plot(x, y_poly, color="cornflowerblue", lw=lw, label="Polynomial Model")
plt.xlabel("data")
plt.ylabel("target")
plt.title("Support Vector Regression")
plt.legend()
plt.show()
Matplotlib can use one of several "backends" for producing graphs, and those backends do different things. In your case you specified the Agg backend, which is used to write PNG files:
matplotlib.use("Agg")
So the solution is to remove that line to use the default backend for your system, or to choose a backend that produces graphs on screen. You might try these first:
matplotlib.use("GTK3Agg")
matplotlib.use("WXAgg")
matplotlib.use("TkAgg")
matplotlib.use("Qt5Agg")
See https://matplotlib.org/faq/usage_faq.html#what-is-a-backend for the complete list of backends.
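Alternatively, if you keep the Agg backend, you can save the figure to a file instead of calling plt.show(); a minimal sketch, where the filename is just an example:
plt.savefig("svr.png")  # with the Agg backend this writes a PNG to disk
You can check which backend is currently active with matplotlib.get_backend().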