I have the following pandas dataframe covering more than 10k answers for 150 questions.
I am struggling to find a way to see the correlation between fields.
In particular I would like to understand how I can graphically show the correlation between Q015 and Q008, knowing that each respondent might have selected multiple answers (1,2,3).
So I am trying to figure out how to graphically display whether there is any correlation between Q015 and Q008 for each selected option of the survey.
Any ideas?
You can fit a simple linear regression of Q008 on Q015 and plot the fitted line.
Necessary libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
Code
list_variables, list_COEF, list_MSE, list_RMSE, list_R2SCORE = ([] for i in range(5))
# Split the data, then fit an ordinary least-squares regression of Q008 on Q015
xtrain, xtest, ytrain, ytest = train_test_split(df[["Q015"]], df[["Q008"]], test_size=0.3)
lr = LinearRegression()
lr_baseline = lr.fit(xtrain, ytrain)
pred_baseline = lr_baseline.predict(xtest)
list_variables.append("Q015 & Q008")
list_COEF.append(round(lr_baseline.coef_[0,0], 4))
list_MSE.append(round(mean_squared_error(ytest, pred_baseline), 2))
list_RMSE.append(round(math.sqrt(mean_squared_error(ytest, pred_baseline)), 2))
list_R2SCORE.append(round(r2_score(ytest, pred_baseline), 2))
# Plotting the graph
plt.figure(figsize=(12,8))
ax = plt.gca()
plt.suptitle("Q015 & Q008", fontsize=24, y=0.96)
plt.plot(xtest, ytest, 'bo', markersize = 5)
plt.plot(xtest, pred_baseline, color="red", linewidth = 2)
plt.xlabel("Q015", size=14)
plt.ylabel("Q008", size=14)
plt.tight_layout()
plt.show()
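You will get something like the following, where the Coef. column is the fitted slope: how much Q008 changes on average per unit of Q015. It tells you the direction of the relationship, but it is not the Pearson correlation itself unless both columns are standardized.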
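To see how the slope relates to the Pearson correlation, here is a minimal sketch on made-up data (the arrays `x` and `y` are synthetic, not your survey). For a one-variable least-squares fit, the slope equals r scaled by the ratio of the standard deviations:

```python
import numpy as np

# Synthetic stand-ins for two numeric survey columns (assumption, not real data).
rng = np.random.default_rng(0)
x = rng.integers(0, 15, size=200).astype(float)
y = 0.5 * x + rng.normal(scale=2.0, size=200)

# Least-squares slope of y on x.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# The slope is r rescaled by std(y) / std(x).
rescaled = r * y.std() / x.std()

print(f"slope={slope:.4f}  r={r:.4f}  r*sd_y/sd_x={rescaled:.4f}")
```

So if you want a number between -1 and 1, report r (or df.corr()) rather than the coefficient.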
Another way is to look at the correlation matrix
df_corr = pd.DataFrame(df[["Q015", "Q008"]].corr()).round(2)
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(10,8))
plt.title("Pearson correlation between features", size=20)
ax = sns.heatmap(df_corr, mask=mask, vmin=-1, cmap="mako_r")
plt.xticks(rotation=25, size=14, horizontalalignment="right")
plt.yticks(rotation=0, size=14)
plt.tight_layout()
plt.show()
An example for numeric columns
df = pd.DataFrame(np.random.randint(0, 15, size=(100, 6)), columns=["Q01", "Q02", "Q03", "Q07", "Q015", "Q008"])
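Since each respondent can select several options, one possible approach is to turn every option into its own 0/1 column and correlate those. This is only a sketch: it assumes the answers are stored as comma-separated strings such as "1,2", which may not match your actual dataframe, and the toy data below is invented.

```python
import pandas as pd

# Hypothetical toy data: each cell lists the options a respondent ticked.
df = pd.DataFrame({
    "Q015": ["1,2", "1", "2,3", "3", "1,2,3", "2"],
    "Q008": ["1", "1,3", "2", "3", "1,2", "2,3"],
})

# One 0/1 indicator column per (question, option) pair.
q015 = df["Q015"].str.get_dummies(sep=",").add_prefix("Q015_")
q008 = df["Q008"].str.get_dummies(sep=",").add_prefix("Q008_")

# Pearson correlation between binary indicators (the phi coefficient),
# keeping only the Q015-option rows and Q008-option columns.
cross_corr = (
    pd.concat([q015, q008], axis=1)
    .corr()
    .loc[q015.columns, q008.columns]
)

print(cross_corr.round(2))
```

You can then pass cross_corr to sns.heatmap(cross_corr, annot=True, vmin=-1, vmax=1) to get one cell per pair of options.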
I'm writing a script that uses GPR to analyze and predict burn properties of different fuels. I've got good outputs for my test set, and now want to add a 95% confidence interval. When I try to implement the interval I get terrible results. Please send help.
#Gaussian Predictions for Ignition Delay
#September 14 2021
%matplotlib inline
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
#gpr = GaussianProcessRegressor()
kernel = C(1.0, (1e-3, 1e3))*RBF(10, (1e-2, 1e2))
gpr = GaussianProcessRegressor(kernel = kernel, n_restarts_optimizer = 9, alpha = 0.1, normalize_y = True)
gpr.fit(x_train, y_train)
y_prediction, std = gpr.predict(x_test, return_std = True)
confidence = std*1.96/np.sqrt(len(x_test))
confidence = confidence.reshape(-1,1)
# Plot the function, the prediction and the 95% confidence interval based on
# the MSE
plt.figure()
plt.plot(x_train, y_train, "b.", markersize=10, label="Observations")
plt.fill(x_test,
y_prediction-confidence,
y_prediction+confidence,
alpha=0.3,
fc="b",
ec="None",
label="95% confidence interval",
) #this plots confidence interval and fit it to my data
plt.plot(x_test, y_prediction, "r.", markersize=10, label="Prediction")
Looking at this example from the sklearn docs
https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_noisy_targets.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-noisy-targets-py
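it looks like you need to adapt your plot function: use plt.fill_between rather than plt.fill, and don't divide by np.sqrt(len(x_test)), since the std returned by predict(..., return_std=True) is already the predictive standard deviation at each test point. For me, the following worked: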
plt.fill_between(
x_test.ravel(),
y_prediction - 1.96 * std,
y_prediction + 1.96 * std,
alpha=0.5,
label=r"95% confidence interval",
)
Here, I generated data like in the sklearn example:
X = np.linspace(start=0, stop=10, num=1_000).reshape(-1, 1)
y = np.squeeze(X * np.sin(X))
rng = np.random.RandomState(1)
training_indices = rng.choice(np.arange(y.size), size=6, replace=False)
test_indices = [x for x in np.arange(y.size) if x not in training_indices]
x_train, y_train = X[training_indices], y[training_indices]
x_test, y_test = X[test_indices], y[test_indices]
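One thing that may also matter with your own data: if x_test comes out of train_test_split it is not sorted, and fill_between then draws a zig-zag band. A small sketch of a helper that sorts everything by x first (the name `interval_band` and the demo arrays are made up for illustration):

```python
import numpy as np

def interval_band(x_test, y_prediction, std, z=1.96):
    """Return x sorted ascending, with the matching lower and upper bounds.

    Hypothetical helper: flattens the inputs so that a (n, 1) prediction
    and a (n,) std do not broadcast into an (n, n) array.
    """
    x = np.asarray(x_test).ravel()
    mean = np.asarray(y_prediction).ravel()
    std = np.asarray(std).ravel()
    order = np.argsort(x)
    lower = (mean - z * std)[order]
    upper = (mean + z * std)[order]
    return x[order], lower, upper

# Made-up values standing in for gpr.predict(x_test, return_std=True).
x_demo = np.array([[3.0], [1.0], [2.0]])
mean_demo = np.array([30.0, 10.0, 20.0])
std_demo = np.array([3.0, 1.0, 2.0])

xs, lower, upper = interval_band(x_demo, mean_demo, std_demo)
print(xs, lower, upper)
```

Then plt.fill_between(xs, lower, upper, alpha=0.3) should give a clean band.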
https://www.kaggle.com/paree24/development-index
I am trying to plot my LinearRegression line, but it doesn't show up correctly. Any ideas?
Also, if I remove the [:,0] from:
plt.scatter(x_test[:,0], y_test)
I get ValueError: x and y must be the same size
I have another question regarding the graph: since the Population column is in the billions, all the other figures (as shown in the picture below) end up close or equal to 0. Can I fix this?
Also, right now I have 3 different pictures showing my graphs at the beginning (because of the same problem: the Population column is too big):
plot1 = dataset.plot(x= "GDP ($ per capita)", y='Infant mortality ', style='o')
plot2 = dataset.plot(x= "Literacy (%)", y='Infant mortality ', style='o')
plot3 = dataset.plot(x= "Population", y='Infant mortality ', style='o')
plt.tight_layout()
plt.show()
Is there any way I can show these graphs in 1 picture and not in 3 different pictures?
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import shapiro
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import normalize
from math import sqrt
from sklearn.model_selection import cross_val_score
import seaborn as sns
dataset = pd.read_csv(r"C:\Users\coolh\Desktop\machine learning\lab1\Development.csv")
x = np.array(dataset.drop(columns= ["Area (sq. mi.)", "Pop. Density ", "Development Index", "Infant mortality "]))
y = np.array(dataset["Infant mortality "])
plot1 = dataset.plot(x= "GDP ($ per capita)", y='Infant mortality ', style='o')
plot2 = dataset.plot(x= "Literacy (%)", y='Infant mortality ', style='o')
plot3 = dataset.plot(x= "Population", y='Infant mortality ', style='o')
plt.tight_layout()
plt.show()
stat, p = shapiro(y)
print(f"p-value {p}")
print(f"statistic {stat}")
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=10)
regressor = LinearRegression()
regressor.fit(x_train, y_train)
print(f"regressor.intercept_ {regressor.intercept_}")
print(f"regressor.coef_{regressor.coef_}")
scores = cross_val_score(regressor, x, y, cv=5)
print(scores)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
y_pred = regressor.predict(x_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
df1 = df.head(50)
df1.plot(kind = 'bar')
plt.grid(which ='major', color='green')
plt.grid(which='minor', color='red')
plt.show()
#RMSE
print(sqrt(mean_squared_error(y_test, y_pred)))
plt.scatter(x_test[:,0], y_test)
plt.plot(x_test, y_pred, color='green', linewidth=1)
plt.show()
The plot method accepts the ax keyword that lets you plot onto an existing figure. See the example below:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({**{"x": range(10)}, **{f"y{i}": np.random.normal(size=10) for i in range(5)}})
fig = plt.figure()
df.plot("x", "y0", ax=plt.gca())
df.plot("x", "y1", ax=plt.gca())
df.plot("x", "y2", ax=plt.gca())
plt.show()
Note: since you're new, a piece of advice that has served me well is to include a minimal working example (MWE). The code you posted is quite long and most of it is unrelated to the problem.
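If you would rather have the three scatter plots side by side, each with its own x axis, something along these lines might work. It is a sketch with randomly generated stand-in data (the values are not from the Kaggle file), and the log scale on the Population panel is just one way to deal with the huge range:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in for the real CSV (assumption: same column names).
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "GDP ($ per capita)": rng.uniform(500, 50_000, size=30),
    "Literacy (%)": rng.uniform(40, 100, size=30),
    "Population": rng.uniform(1e5, 1e9, size=30),
    "Infant mortality ": rng.uniform(2, 120, size=30),
})

features = ["GDP ($ per capita)", "Literacy (%)", "Population"]

# One figure, three panels sharing the y axis.
fig, axes = plt.subplots(1, 3, figsize=(15, 4), sharey=True)
for ax, col in zip(axes, features):
    dataset.plot(x=col, y="Infant mortality ", style="o", ax=ax, legend=False)
    ax.set_xlabel(col)

# Population spans several orders of magnitude, so use a log x axis there.
axes[2].set_xscale("log")
axes[0].set_ylabel("Infant mortality")
fig.tight_layout()
```

Call plt.show() at the end to display it.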
The following code results in an x axis that ranges from 8 to 18. The data for the x axis actually ranges from 1,000 to 50 million. I would expect a log scale to show (10,000), (100,000), (1,000,000) (10,000,000) etc.
How do I fix the x axis?
dataset = pandas.DataFrame(Transactions, Price)
dataset = dataset.drop_duplicates()
import numpy as np
import matplotlib.pyplot as plt
X=dataset[['Transactions']]
y=dataset[['Price']]
log_X =np.log(X)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(log_X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
def viz_polymonial():
    plt.scatter(log_X, y, color='red')
    plt.plot(log_X, pol_reg.predict(poly_reg.fit_transform(log_X)), color='blue')
    plt.title('Price Curve')
    plt.xlabel('Transactions')
    plt.ylabel('Price')
    plt.grid(linestyle='dotted')
    plt.show()
    return
viz_polymonial()
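You plot the values of log_X, so the axis shows the natural log of the data, which is why it runs from about 8 to 18 instead of showing the original values. Plot just X, or np.exp(log_X), with a log scale.
Note that you are not actually using a log scale at the moment. Plot X with a log scale via plt.xscale("log"), rather than log_X with a normal scale.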
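A rough sketch of what that could look like: fit on the logged values but plot against the original ones. The data here is invented, and np.polyfit stands in for the PolynomialFeatures + LinearRegression pair just to keep the example short:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data spanning roughly 1,000 to 50 million transactions.
transactions = np.sort(10 ** rng.uniform(3, 7.7, size=50))
price = 100 - 5 * np.log10(transactions) + rng.normal(scale=1.0, size=50)

# Degree-2 polynomial in log(transactions), fitted by least squares.
log_t = np.log(transactions)
coeffs = np.polyfit(log_t, price, deg=2)
fitted = np.polyval(coeffs, log_t)

fig, ax = plt.subplots()
ax.scatter(transactions, price, color="red")   # original values on x
ax.plot(transactions, fitted, color="blue")
ax.set_xscale("log")                           # ticks at powers of ten
ax.set_title("Price Curve")
ax.set_xlabel("Transactions")
ax.set_ylabel("Price")
ax.grid(linestyle="dotted")
```

The model still sees log values; only the axis shows the raw transaction counts.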
I am trying to reproduce the example in this post, which produces this figure.
The colored regions above are plotted by mlxtend.plotting (version '0.14.0').
With the default settings on colab, this code
from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X, y, clf=ppn)
produces this figure.
The data points have been plotted while the bottom region has not.
Is it possible to set the color for the bottom region with mlxtend.plotting?
It seems like a bug that appears when only two regions are classified; if you try to separate 3 clusters, as in the following example, it works.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
# Initializing Classifiers
clf1 = LogisticRegression(random_state=0)
clf2 = RandomForestClassifier(random_state=0)
clf3 = SVC(random_state=0, probability=True)
eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3],
weights=[2, 1, 1], voting='soft')
# Loading some example data
X, y = iris_data()
X = X[:,[0, 2]]
# Plotting Decision Regions
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10, 8))
labels = ['Logistic Regression',
'Random Forest',
'RBF kernel SVM',
'Ensemble']
for clf, lab, grd in zip([clf1, clf2, clf3, eclf],
                         labels,
                         itertools.product([0, 1],
                                           repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y,
                                clf=clf, legend=2)
    plt.title(lab)
plt.show()
Try asking directly on their GitHub repository: https://github.com/rasbt/mlxtend
I think it's possible. You can use the colors parameter, which I think is the easiest way. Try this; is it what you are looking for?
fig = plot_decision_regions(
X=X,
y=y.astype(int),
clf=clf,
legend=2,
colors='yellow,red'
)
I did a cubic regression on the data below. How can I plot the regression line with x value starting from 0 rather than the minimum x?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
df = pd.DataFrame({'x':list(range(3,18)),'y':[-4,-2,0,3,5,8,12,17,21,23,24,25,26,26,24]})
x = df['x'].values.reshape(-1,1)
y = df['y'].values.reshape(-1,1)
cubic = PolynomialFeatures(degree=3)
x_cubic = cubic.fit_transform(x)  # columns: 1, x, x^2, x^3
model = LinearRegression()
model.fit(x_cubic, y)
fig, ax = plt.subplots()
ax.scatter(x, y, color = 'blue')
pred = model.predict(cubic.fit_transform(x))
ax.plot(x, pred, color = 'red')
ax.set_xlim(0)
ax.set_ylim(-20)
This is what I have now.
How can I get a plot like this?
Try creating an extended x range like this and predicting with your existing model. Add this to the bottom of your code.
ex_x = np.arange(0,4).reshape(-1,1)
ex_pred = model.predict(cubic.fit_transform(ex_x))
ax.plot(ex_x, ex_pred, color='red', linestyle='--')
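If you want one smooth curve from 0 up to the largest x instead of a separate dashed piece, a variation might look like the sketch below. The names `x_full` and `pred_full` are made up here, and it uses cubic.transform so the already fitted transformer is reused:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Same data as in the question.
x = np.arange(3, 18).reshape(-1, 1)
y = np.array([-4, -2, 0, 3, 5, 8, 12, 17, 21, 23, 24, 25, 26, 26, 24])

# Fit the cubic regression once.
cubic = PolynomialFeatures(degree=3)
model = LinearRegression().fit(cubic.fit_transform(x), y)

# Dense grid from 0 to the largest observed x.
x_full = np.linspace(0, x.max(), 200).reshape(-1, 1)
pred_full = model.predict(cubic.transform(x_full))
```

Then ax.plot(x_full, pred_full, color='red') draws the whole curve. Keep in mind that everything left of x = 3 is extrapolation beyond the data.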