https://www.kaggle.com/paree24/development-index
I am trying to plot my LinearRegression line, but it doesn't show correctly. Any ideas?
Also, if I remove the [:,0] from:
plt.scatter(x_test[:,0], y_test)
I get the error ValueError: x and y must be the same size
I have another question about the graph: since the population column is in the billions, all the other figures (as shown in the picture below) end up close or equal to 0. Can I fix this?
Also, right now I have 3 different pictures showing my graphs at the beginning (because of the same problem: the population column is too big):
plot1 = dataset.plot(x= "GDP ($ per capita)", y='Infant mortality ', style='o')
plot2 = dataset.plot(x= "Literacy (%)", y='Infant mortality ', style='o')
plot3 = dataset.plot(x= "Population", y='Infant mortality ', style='o')
plt.tight_layout()
plt.show()
Is there any way I can show these graphs in 1 picture instead of 3 different pictures?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import shapiro
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import normalize
from math import sqrt
from sklearn.model_selection import cross_val_score
import seaborn as sns
dataset = pd.read_csv(r"C:\Users\coolh\Desktop\machine learning\lab1\Development.csv")
x = np.array(dataset.drop(columns= ["Area (sq. mi.)", "Pop. Density ", "Development Index", "Infant mortality "]))
y = np.array(dataset["Infant mortality "])
plot1 = dataset.plot(x= "GDP ($ per capita)", y='Infant mortality ', style='o')
plot2 = dataset.plot(x= "Literacy (%)", y='Infant mortality ', style='o')
plot3 = dataset.plot(x= "Population", y='Infant mortality ', style='o')
plt.tight_layout()
plt.show()
stat, p = shapiro(y)
print(f"p-value {p}")
print(f"statistic {stat}")
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=10)
regressor = LinearRegression()
regressor.fit(x_train, y_train)
print(f"regressor.intercept_ {regressor.intercept_}")
print(f"regressor.coef_{regressor.coef_}")
scores = cross_val_score(regressor, x, y, cv=5)
print(scores)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
y_pred = regressor.predict(x_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df1 = df.head(50)
df1.plot(kind = 'bar')
plt.grid(which ='major', color='green')
plt.grid(which='minor', color='red')
plt.show()
#RMSE
print(sqrt(mean_squared_error(y_test, y_pred)))
plt.scatter(x_test[:,0], y_test)
plt.plot(x_test, y_pred, color='green', linewidth=1)
plt.show()
The plot method accepts the ax keyword, which lets you draw onto an existing Axes instead of opening a new figure each time. See the example below:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({**{"x": range(10)}, **{f"y{i}": np.random.normal(size=10) for i in range(5)}})
fig = plt.figure()
df.plot("x", "y0", ax=plt.gca())
df.plot("x", "y1", ax=plt.gca())
df.plot("x", "y2", ax=plt.gca())
plt.show()
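If you would rather have each scatter in its own panel of a single figure, one option is to create the axes with plt.subplots and pass one of them to each plot call. The sketch below uses a small made-up DataFrame in place of Development.csv (the values are invented for illustration; only the column names come from the question) and puts Population on a log scale so that its magnitude does not squash the points together.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for Development.csv; only the column names match the question.
dataset = pd.DataFrame({
    "GDP ($ per capita)": [700, 4500, 6000, 19000, 27000],
    "Literacy (%)": [36.0, 86.5, 70.0, 97.0, 99.0],
    "Population": [31_000_000, 3_600_000, 32_900_000, 69_000, 20_300_000],
    "Infant mortality ": [163.1, 21.5, 31.0, 4.1, 4.7],
})

features = ["GDP ($ per capita)", "Literacy (%)", "Population"]

# One row, three columns: each panel keeps its own x-axis range.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, features):
    dataset.plot(x=col, y="Infant mortality ", style="o", ax=ax, legend=False)
    ax.set_ylabel("Infant mortality")

# Population spans several orders of magnitude, so a log axis spreads it out.
axes[2].set_xscale("log")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. Because every feature gets its own x axis, the large Population values no longer affect the other two panels.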
Note: since you're new, a piece of advice that has served me well is to include a minimal working example (MWE). The code you posted is quite long and most of it is unrelated to the problem.
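On the regression-line part of the question: plt.plot(x_test, y_pred) receives every feature column of x_test at once, so matplotlib draws one jagged line per column instead of a single fitted line. A minimal sketch of one way around this, assuming you only want to visualise the target against a single feature, is to fit a separate one-feature model and draw its prediction over a sorted grid. The data here is synthetic, and the names x_one, grid and line are illustrative, not from the original code.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)

# Synthetic single feature (think "Literacy (%)") and a noisy linear target.
x_one = rng.uniform(20, 100, size=(200, 1))
y = 150 - 1.4 * x_one[:, 0] + rng.normal(scale=10, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    x_one, y, test_size=0.25, random_state=10
)

regressor = LinearRegression().fit(x_train, y_train)

# Evenly spaced, sorted x values give a clean straight line.
grid = np.linspace(x_test.min(), x_test.max(), 100).reshape(-1, 1)
line = regressor.predict(grid)

fig, ax = plt.subplots()
ax.scatter(x_test[:, 0], y_test, label="test data")
ax.plot(grid[:, 0], line, color="green", linewidth=1, label="fitted line")
ax.legend()
```

With a model trained on several features there is no single line to draw in a 2-D scatter, which is why this sketch refits on one column only.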
I'm writing a script that uses GPR to analyze and predict burn properties of different fuels. I've got good outputs for my test set, and now want to add a 95% confidence interval. When I try to implement the interval I get terrible results. Please send help.
#Gaussian Predictions for Ignition Delay
#September 14 2021
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
#gpr = GaussianProcessRegressor()
kernel = C(1.0, (1e-3, 1e3))*RBF(10, (1e-2, 1e2))
gpr = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 9, alpha = 0.1, normalize_y = True)
gpr.fit(x_train, y_train)
y_prediction, std = gpr.predict(x_test, return_std = True)
confidence = std*1.96/np.sqrt(len(x_test))
confidence = confidence.reshape(-1,1)
# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure()
plt.plot(x_train, y_train, "b.", markersize=10, label="Observations")
plt.fill(x_test,
y_prediction-confidence,
y_prediction+confidence,
alpha=0.3,
fc="b",
ec="None",
label="95% confidence interval",
) #this plots confidence interval and fit it to my data
plt.plot(x_test, y_prediction, "r.", markersize=10, label="Prediction")
Looking at this example from the sklearn docs:
https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py
it looks like you need to adapt your plotting call: use plt.fill_between rather than plt.fill, and use 1.96 * std directly as the half-width of the band. For me, the following worked:
plt.fill_between(
x_test.ravel(),
y_prediction - 1.96 * std,
y_prediction + 1.96 * std,
alpha=0.5,
label=r"95% confidence interval",
)
Here, I generated data like in the sklearn example:
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
test_indices = [x for x in np.arange(y.size) if x not in training_indices]
x_train, y_train = X[training_indices], y[training_indices]
x_test, y_test = X[test_indices], y[test_indices]
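Putting the pieces together, here is a minimal end-to-end sketch on toy data (not the fuel data from the question). The std returned by predict(..., return_std=True) is already the per-point predictive standard deviation, so the band is mean ± 1.96 * std with no division by sqrt(len(x_test)). The kernel length scale and the names y_mean, y_std, lower and upper are choices made for this illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# Toy 1-D data in the style of the sklearn example.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
train_idx = rng.choice(np.arange(y.size), size=6, replace=False)
x_train, y_train = X[train_idx], y[train_idx]

kernel = C(1.0, (1e-3, 1e3)) * RBF(1.0, (1e-2, 1e2))
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9, random_state=0)
gpr.fit(x_train, y_train)

# Predict on the full, already sorted grid so the band is drawn left to right.
y_mean, y_std = gpr.predict(X, return_std=True)
lower = y_mean - 1.96 * y_std
upper = y_mean + 1.96 * y_std

fig, ax = plt.subplots()
ax.plot(x_train, y_train, "b.", markersize=10, label="Observations")
ax.plot(X.ravel(), y_mean, "r-", label="Prediction")
ax.fill_between(X.ravel(), lower, upper, alpha=0.3, label="95% confidence interval")
ax.legend()
```

If your own x_test is not sorted, sort it (and reorder the predictions the same way) before calling fill_between, otherwise the shaded region will fold back on itself.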
I have the following pandas dataframe covering more than 10k answers for 150 questions.
I am struggling to find a way to see the correlation between fields.
In particular I would like to understand how I can graphically show the correlation between Q015 and Q008, knowing that each respondent might have selected multiple answers (1,2,3).
So I am trying to figure out how to graphically display whether there is any correlation between Q015 and Q008 for each selected option of the survey.
Any ideas?
You can fit a simple linear regression between the two columns.
Necessary libraries:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Code
list_variables, list_COEF, list_MSE, list_RMSE, list_R2SCORE = ([] for i in range(5))
# Split the data, then initialize the linear regression model
xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3)
lr = LinearRegression()
lr_baseline = lr.fit(xtrain, ytrain)
pred_baseline = lr_baseline.predict(xtest)
list_variables.append("Q015 & Q008")
list_COEF.append(round(lr_baseline.coef_[0,0], 4))
list_MSE.append(round(mean_squared_error(ytest, pred_baseline), 2))
list_RMSE.append(round(math.sqrt(mean_squared_error(ytest, pred_baseline)), 2))
list_R2SCORE.append(round(r2_score(ytest, pred_baseline), 2))
# Plotting the graph
plt.figure(figsize=(12,8))
ax = plt.gca()
plt.suptitle("Q015 & Q008", fontsize=24, y=0.96)
plt.plot(xtest, ytest, 'bo', markersize = 5)
plt.plot(xtest, pred_baseline, color="red", linewidth = 2)
plt.xlabel("Q015", size=14)
plt.ylabel("Q008", size=14)
plt.tight_layout()
plt.show()
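You will get something like the following, where the Coef. column holds the regression slope. The slope shows the direction and steepness of the linear relationship; note that it equals Pearson's correlation coefficient only when both variables are standardized.
Another way is to look at the correlation matrix: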
df_corr = pd.DataFrame(df[["Q015", "Q008"]].corr()).round(2)
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(10,8))
plt.title("Pearson correlation between features", size=20)
ax = sns.heatmap(df_corr, mask=mask, vmin=-1, cmap="mako_r")
plt.xticks(rotation=25, size=14, horizontalalignment="right")
plt.yticks(rotation=0, size=14)
plt.tight_layout()
plt.show()
An example DataFrame with numeric columns:
df = pd.DataFrame(np.random.randint(0,15, size=(100, 6)), columns=["Q01", "Q02", "Q03", "Q07", "Q015", "Q008"])
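For the multiple-answer part of the question, one possible approach is to turn every selectable option into its own 0/1 column and correlate those. The sketch below assumes, purely for illustration, that each cell stores the selected options as a comma-separated string such as "1,3"; the real layout of the survey data is not shown in the question, so the sample values and the names q015, q008 and option_corr are hypothetical.

```python
import pandas as pd

# Hypothetical layout: each cell lists the selected options, e.g. "1,3".
df = pd.DataFrame({
    "Q015": ["1", "1,2", "2,3", "3", "1,3", "2"],
    "Q008": ["1", "1,2", "2", "3", "1,3", "2,3"],
})

# One 0/1 indicator column per (question, option) pair.
q015 = df["Q015"].str.get_dummies(sep=",").add_prefix("Q015_")
q008 = df["Q008"].str.get_dummies(sep=",").add_prefix("Q008_")

# Pearson correlation between every Q015 option and every Q008 option.
option_corr = pd.concat([q015, q008], axis=1).corr().loc[q015.columns, q008.columns]
print(option_corr.round(2))
```

The resulting table can be passed to sns.heatmap(option_corr, vmin=-1, vmax=1, annot=True) to show at a glance which options tend to be selected together.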
The following code results in an x axis that ranges from 8 to 18. The data for the x axis actually ranges from 1,000 to 50 million. I would expect a log scale to show (10,000), (100,000), (1,000,000) (10,000,000) etc.
How do I fix the x axis?
dataset = pandas.DataFrame(Transactions, Price)
dataset = dataset.drop_duplicates()
import numpy as np
import matplotlib.pyplot as plt
X=dataset[['Transactions']]
y=dataset[['Price']]
log_X =np.log(X)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(log_X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
def viz_polymonial():
    plt.scatter(log_X, y, color='red')
    plt.plot(log_X, pol_reg.predict(poly_reg.fit_transform(log_X)), color='blue')
    plt.title('Price Curve')
    plt.xlabel('Transactions')
    plt.ylabel('Price')
    plt.grid(linestyle='dotted')
    plt.show()
    return
viz_polymonial()
You plot the values of log_X, which are already logged, so putting a log scale on top would double-log them. Plot just X with a log scale, or np.exp(log_X).
Actually, you are not even using a log scale. Plot X with a log scale via plt.xscale("log"), not log_X on a linear scale.
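As a rough sketch of that suggestion: keep fitting the model on log_X, but pass the raw X to the plotting calls and switch the axis to a log scale. The data below is synthetic (generated to span roughly 1,000 to 50 million) because the original Transactions and Price values are not shown.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic stand-in for the Transactions/Price data, sorted so the line is smooth.
X = np.sort(10 ** rng.uniform(3, np.log10(5e7), size=80)).reshape(-1, 1)
y = 5 + 2 * np.log(X[:, 0]) + rng.normal(scale=1.0, size=80)

# Fit on log(X), as in the question.
log_X = np.log(X)
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(log_X)
pol_reg = LinearRegression().fit(X_poly, y)

fig, ax = plt.subplots()
# Plot against the raw X and let matplotlib apply the log scaling.
ax.scatter(X[:, 0], y, color="red")
ax.plot(X[:, 0], pol_reg.predict(poly_reg.transform(log_X)), color="blue")
ax.set_xscale("log")
ax.set_title("Price Curve")
ax.set_xlabel("Transactions")
ax.set_ylabel("Price")
ax.grid(linestyle="dotted")
```

The tick labels then show powers of ten in the original units rather than values between 8 and 18.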
I did a cubic regression on the data below. How can I plot the regression line with x value starting from 0 rather than the minimum x?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({'x':list(range(3,18)),'y':[-4,-2,0,3,5,8,12,17,21,23,24,25,26,26,24]})
x = df['x'].values.reshape(-1,1)
y = df['y'].values.reshape(-1,1)
cubic = PolynomialFeatures(degree=3)
x_cubic = cubic.fit_transform(x)
cubic.fit(x_cubic, y)
model = LinearRegression()
model.fit(x_cubic, y)
fig, ax = plt.subplots()
ax.scatter(x, y, color = 'blue')
pred = model.predict(cubic.fit_transform(x))
ax.plot(x, pred, color = 'red')
ax.set_xlim(0)
ax.set_ylim(-20)
This is what I have now.
How can I get a plot like this?
Try creating an extended x range like this and predicting with your existing model. Add this to the bottom of your code.
ex_x = np.arange(0,4).reshape(-1,1)
ex_pred = model.predict(cubic.fit_transform(ex_x))
ax.plot(ex_x, ex_pred, color='red', linestyle='--')
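For a smoother curve, a variation on the same idea is to predict on a dense grid from 0 up to the largest x. This is only a sketch: it swaps the separate PolynomialFeatures and LinearRegression steps for a make_pipeline, and the names grid and inside are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(3, 18).reshape(-1, 1)
y = np.array([-4, -2, 0, 3, 5, 8, 12, 17, 21, 23, 24, 25, 26, 26, 24])

# The pipeline expands x to cubic terms and fits the linear model in one step.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)

# Dense grid from 0 to the largest observed x.
grid = np.linspace(0, x.max(), 200).reshape(-1, 1)
inside = grid[:, 0] >= x.min()

fig, ax = plt.subplots()
ax.scatter(x, y, color="blue")
ax.plot(grid[inside], model.predict(grid[inside]), color="red")
# The dashed part is extrapolation below the observed range.
ax.plot(grid[~inside], model.predict(grid[~inside]), color="red", linestyle="--")
ax.set_xlim(0)
```

Keep in mind that the dashed segment is an extrapolation, so it reflects the shape of the cubic rather than any data.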
I am trying to use linear and polynomial regression for the data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def f(x):
    return np.sin(2 * np.pi * x)
x = np.random.uniform(0, 1, size=100)[:, np.newaxis]
y = f(x) + np.random.normal(scale=0.3, size=100)[:, np.newaxis]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)
poly_model = make_pipeline(PolynomialFeatures(degree=2), linear_model.LinearRegression())
poly_model.fit(x_train, y_train)
linear_model_1 = linear_model.LinearRegression()
linear_model_1.fit(x_train, y_train)
fig = plt.figure()
ax = plt.axes()
ax.set(xlabel='X', ylabel='Y', title='X vs Y')
ax.scatter(x,y, alpha=0.5, cmap='viridis')
ax.plot(x_test, linear_model_1.predict(x_test), color='green', label='linear')
ax.plot(x_test, poly_model.predict(x_test), color='red', label='poly')
ax.legend()
With the above code, I am receiving this image:
But as you can see, the polynomial regression is not right.
I tried different approaches ( not using make_pipeline etc) but with no success.
If I've understood you correctly, the problem is that plot() connects the points in the order they are given, so unsorted x values produce a zigzag. Just sort x_test before passing it to predict(), and increase the degree of the polynomial to 3:
poly_model = make_pipeline(PolynomialFeatures(degree=3), linear_model.LinearRegression())
and
x_test.sort(axis=0)
With these adjustments I get the following plot:
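A self-contained sketch of those two adjustments is shown here. It sorts a copy of x_test (named x_plot, an illustrative name) rather than sorting in place, so that x_test and y_test stay paired if you later want to score the models.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=100)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

linear = LinearRegression().fit(x_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(x_train, y_train)

# Sort a copy of x_test for plotting only, so x_test and y_test stay paired.
x_plot = np.sort(x_test, axis=0)

fig, ax = plt.subplots()
ax.set(xlabel="X", ylabel="Y", title="X vs Y")
ax.scatter(x, y, alpha=0.5)
ax.plot(x_plot, linear.predict(x_plot), color="green", label="linear")
ax.plot(x_plot, poly.predict(x_plot), color="red", label="poly")
ax.legend()
```

A degree of 3 is used because one full period of the sine curve has two bends, which a quadratic cannot follow.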
1) You can just call plot twice; each call adds a new line to the existing plot, e.g.:
ax.plot(x_test, model1.predict(x_test), color='red', linewidth=2)
ax.plot(x_test, model2.predict(x_test), color='green', linewidth=2)
In your case I'd do something like this (the model is named lin_model so that it does not shadow the linear_model module):
from sklearn.pipeline import Pipeline
lin_model = linear_model.LinearRegression(fit_intercept=False)
poly_model = Pipeline([('poly', PolynomialFeatures(degree=2)),
                       ('linear', linear_model.LinearRegression(fit_intercept=False))])
lin_model.fit(x_train, y_train)
poly_model.fit(x_train, y_train)
And then:
ax.plot(x_test, lin_model.predict(x_test), color='red', linewidth=2, label='linear')
ax.plot(x_test, poly_model.predict(x_test), color='green', linewidth=2, label='poly')
ax.legend()