You can easily plot a regression line using plotly express / px.scatter and retrieve regression results such as beta through px.get_trendline_results(fig).iloc[0]["px_fit_results"].params[1]. But how can you retrieve other parameters, like R-squared or the p-values for the coefficients?
Plot:
Code:
# imports
import plotly.express as px
import pandas as pd
import numpy as np
# data
np.random.seed(123)
numdays=20
X = (np.random.randint(low=-20, high=20, size=numdays).cumsum()+100).tolist()
Y = (np.random.randint(low=-20, high=20, size=numdays).cumsum()+100).tolist()
df = pd.DataFrame({'X': X, 'Y':Y})
# figure using px.scatter
fig = px.scatter(df, x="X", y="Y", trendline="ols", template = 'plotly_dark')
fig.show()
The answer:
model = px.get_trendline_results(fig)
results = model.iloc[0]["px_fit_results"]
alpha = results.params[0]
beta = results.params[1]
p_beta = results.pvalues[1]
r_squared = results.rsquared
Details:
All regression results are available through:
px.get_trendline_results(fig)
Which, when run, will return a somewhat cryptic-looking pandas DataFrame:
px_fit_results
0 <statsmodels.regression.linear_model.Regressio...
The element under px_fit_results is an object of type statsmodels.regression.linear_model.RegressionResultsWrapper, a wrapper around the fitted statsmodels regression results.
So if we simplify matters a bit by setting:
model = px.get_trendline_results(fig)
And:
results = model.iloc[0]["px_fit_results"]
Then we can check what's available in that object using:
dir(results)
And find all the regression details one should need, like:
'predict',
'pvalues',
'remove_data',
'resid',
'resid_pearson',
'rsquared',
'rsquared_adj',
'save',
'scale',
'ssr',
'summary',
'summary2',
't_test',
't_test_pairwise',
But note that these available results are structured differently from one another. Running results.rsquared returns a single float, 0.611901357827784, while running results.pvalues returns an array, array([9.95834884e-01, 4.59734574e-05]). The latter is in turn subsettable, with results.pvalues[0] and results.pvalues[1] giving the p-values for the constant and the trendline coefficient, respectively.
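If you would rather see everything at once, the wrapper also exposes the full statsmodels report. A minimal sketch, assuming the fig built above:

import plotly.express as px

# Full regression report for the single fitted trendline
results = px.get_trendline_results(fig).iloc[0]["px_fit_results"]
print(results.summary())  # coefficient table, standard errors, R-squared, F-statistic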
With this information available, you can, for example, extract some of these results and include them as annotations to further improve your plotly figures:
Plot:
Complete code:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
import numpy as np
import datetime
# data
np.random.seed(123)
numdays=20
X = (np.random.randint(low=-20, high=20, size=numdays).cumsum()+100).tolist()
Y = (np.random.randint(low=-20, high=20, size=numdays).cumsum()+100).tolist()
df = pd.DataFrame({'X': X, 'Y':Y})
# Figure using plotly express
fig = px.scatter(df, x="X", y="Y", trendline="ols", template = 'plotly_dark')
# retrieve model estimates
model = px.get_trendline_results(fig)
results = model.iloc[0]["px_fit_results"]
alpha = results.params[0]
beta = results.params[1]
p_beta = results.pvalues[1]
r_squared = results.rsquared
line1 = 'y = ' + str(round(alpha, 4)) + ' + ' + str(round(beta, 4))+'x'
line2 = 'p-value = ' + '{:.5f}'.format(p_beta)
line3 = 'R^2 = ' + str(round(r_squared, 3))
summary = line1 + '<br>' + line2 + '<br>' + line3
fig.add_annotation(
    x=110,
    y=140,
    xref="x",
    yref="y",
    text=summary,
    showarrow=False,
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="#ffffff"
    ),
    align="left",
    arrowhead=2,
    arrowsize=1,
    arrowwidth=2,
    arrowcolor="#636363",
    ax=20,
    ay=-30,
    borderwidth=2,
    borderpad=4,
    bgcolor="rgba(100,100,100, 0.6)",
    opacity=0.8
)
fig.show()
So the question is:
Can I plot a histogram in Plotly, where all values that are bigger than some threshold will be grouped into one bin?
The desired output:
But using the standard plotly Histogram class I was only able to get this output:
import pandas as pd
from plotly import graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode()
test_df = pd.DataFrame({'values': [1]*10 + [2]*9 +
                        [3.1]*4 + [3.6]*4 +
                        [4]*7 + [5]*6 + [6]*5 + [7]*4 + [8]*3 +
                        [9]*2 + [10]*1 +
                        [111.2]*2 + [222.3]*2 + [333.4]*1})  # <- I want to group them into one bin "> 10"
data = [go.Histogram(x=test_df['values'],
                     xbins=dict(
                         start=0,
                         end=11,
                         size=1
                     ),
                     autobinx=False)]
layout = go.Layout(
    title='values'
)
fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='basic histogram')
So after spending some time I found a solution myself using numpy.histogram and a plotly Bar chart.
Leaving it here in case anyone else faces the same problem.
import numpy as np

def plot_bar_with_outliers(series, name, end):
    start = int(series.min())
    size = 1
    # Make the histogram, with one extra catch-all bin for values beyond `end`
    largest_value = series.max()
    if largest_value > end:
        hist = np.histogram(series, bins=list(range(start, end + size, size)) + [largest_value])
    else:
        hist = np.histogram(series, bins=list(range(start, end + size, size)) + [end + size])
    # Add labels to the chart
    labels = []
    for i, j in zip(hist[1][0::1], hist[1][1::1]):
        if j <= end:
            labels.append('{} - {}'.format(i, j))
        else:
            labels.append('> {}'.format(i))
    # Plot the graph
    data = [go.Bar(x=labels,
                   y=hist[0])]
    layout = go.Layout(
        title=name
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, filename='basic histogram')

plot_bar_with_outliers(test_df['values'], 'values', end=11)
An alternative to the above option is the following:
import numpy as np
from plotly import graph_objs as go

# Initialize the values that you want in the histogram.
values = [7, 8, 8, 8, 9, 10, 10, 11, 12, 13, 14]
# Initialize the maximum x-axis value that you want.
maximum_value = 11
# Plot the histogram, clipping every value at the maximum.
fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=[np.minimum(maximum_value, num) for num in values],
        xbins={"size": 1}
    )
)
fig.show()
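For completeness, the same grouping can also be built directly with pandas. This is just a sketch of an equivalent approach, assuming the test_df from above; the edges and labels here are illustrative choices, not part of the original answers:

import numpy as np
import pandas as pd
from plotly import graph_objs as go

# Regular 1-wide bins from 0 to 11, plus one catch-all bin for outliers
edges = list(range(0, 12)) + [np.inf]
labels = ['{} - {}'.format(i, i + 1) for i in range(11)] + ['> 11']
counts = (pd.cut(test_df['values'], bins=edges, labels=labels, right=False)
          .value_counts()
          .reindex(labels))  # value_counts sorts by count, so restore bin order
fig = go.Figure(go.Bar(x=list(counts.index), y=counts.values))
fig.show()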
Below are two sets of code that do the same thing, one in Python and the other in R. Both graph the k-means clusters the same way with respect to PCA, but once I build the bar chart at the end using the cluster centers, the graphs are totally different. I believe there is something wrong with the k-means and the cluster calculation in Python. The original code was provided in R, and I am trying to see why the bar chart in Python does not match; I believe it's the centers. Please review and provide some feedback.
Please use the link below to download the data set I used to generate these graphs.
https://www.dropbox.com/s/fhnxxrjl07y0h2c/TableStats2.csv?dl=0
R Code
## Retrieve libraries needed for this script
library("ggplot2")
library("reshape2")
pcp <- read.csv(file='E:\\ProgramData\\R\\Code\\TableStats2.csv')
#Label each row with table Name to Plot names on chart.
data <- pcp
rownames(data) <- data[, 1]
#Gather all the data and leave out Table Names
data <- data[, -1]
data <- data[, -1]
#Create The PCA (Principal Component Analysis)
data <- scale(data)
pca <- prcomp(data)
plot.data <- data.frame(pca$x[, 1:2])
set.seed(2121)
clusters <- kmeans(data, 6)
plot.data$clusters <- factor(clusters$cluster)
g <- ggplot(plot.data, aes(x = PC1, y = PC2, colour = clusters)) +
  geom_point(size = 3.5) +
  geom_text(label = rownames(data), colour = "darkgrey", hjust = .7) +
  theme_bw()
behaviours <- data.frame(clusters$centers)
behaviours$cluster <- 1:6
behaviours <- melt(behaviours, "cluster")
g2 <- ggplot(behaviours, aes(x = variable, y = value)) +
  geom_bar(stat = "identity", position = 'identity', fill = "steelblue") +
  facet_wrap(~cluster) +
  theme_grey() +
  theme(axis.text.x = element_text(angle = 90))
Python Code
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
from plotnine import ggplot, aes, geom_line, geom_bar, facet_wrap, theme_grey, theme, element_text
TableStats = pd.read_csv(r'E:\ProgramData\R\Code\TableStats2.csv')
sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables
features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values
x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
plt.figure(figsize=(20,11))
plot = plt.scatter(x1,y1, c=y.index.tolist())
for i, label in enumerate(y):
    #print(label)
    plt.annotate(label, (x1[i], y1[i]))
plt.show()
df = pd.DataFrame(dpca,columns = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)','Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)'])
clusters = KMeans(n_clusters=6,init='k-means++', random_state=2121).fit(df)
df['Cluster'] = clusters.labels_
df['Cluster Centroid D1'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0])
df['Cluster Centroid D2'] = df['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1])
df['tables'] = tables
#print Table Names
plt.figure(figsize=(20, 11))
ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000, palette='coolwarm', legend=False, alpha=0.1)
for line in range(0, df.shape[0]):
    ax.text(x1[line] + 0.05, y1[line], TableStats['Object Name'][line],
            horizontalalignment='left', size='medium', color='black', weight='semibold')
plt.legend(loc='upper right', title='Cluster')
ax.set_title("Clustered Points", fontsize='xx-large', y=1.05);
plt.show()
# here is where the R and Python graphs are different because the cluster centers dont match
behaviours = pd.DataFrame(clusters.cluster_centers_)
behaviours.columns = clusters.feature_names_in_
behaviours['cluster'] = [1,2,3,4,5,6]
b2 = pd.melt(behaviours, id_vars = "cluster",value_name="value")
(ggplot(b2, aes(x='variable', y='value')) +
    geom_bar(stat="identity", position='identity', fill="steelblue") +
    facet_wrap('~cluster') +
    theme_grey() +
    theme(axis_text_x=element_text(rotation=90, hjust=1), figure_size=(20, 8))
)
Update: I now have this working in both R and Python.
Looking at this specific problem, check the outputs of the PCA: they're different, so the k-means results won't be the same. The reason is in your R code. You repeat the line data <- data[, -1], dropping both the table names and the first column of the data. Remove the extra line and the clusters look the same.
General comments on the R and Python implementations of k-means
In general, it looks like R and Python use different algorithms by default. R uses "Hartigan-Wong", and Python's scikit-learn probably uses "elkan". Set algorithm = "Lloyd" in R and algorithm='full' in Python (which I believe currently runs Lloyd's algorithm as well) to ensure they're at least attempting the same thing.
You also have different initialisation methods: R is random, while in Python you are using 'k-means++'. Set init='random' in Python to make these match.
They also have different maximum iteration counts: R defaults to 10, Python to 300. Set these equal as well.
Finally, you won't see any random variation in your Python script if you set random_state in the KMeans call (and check you haven't called set.seed in R either).
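Putting those settings together, here's a hedged sketch of a scikit-learn call matched to R's defaults (parameter names as in the scikit-learn docs; note that algorithm='full' has been renamed 'lloyd' in newer releases):

from sklearn.cluster import KMeans

# Mirrors R's kmeans(data, 6, algorithm = "Lloyd", iter.max = 10) with one random start
km = KMeans(
    n_clusters=6,
    init='random',     # R initialises centres randomly by default
    n_init=1,          # R's nstart defaults to 1
    max_iter=10,       # R's iter.max defaults to 10
    algorithm='full',  # Lloyd's algorithm (spelled 'lloyd' in scikit-learn >= 1.1)
)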
Once you've done this, try running both multiple times, and compare the distributions of values. Hopefully you'll see overlap between the two implementations.
Check out the docs for the R implementation and the scikit-learn implementation.
And a final point here: k-means is unsupervised, so the class labels have no absolute meaning. Run the code multiple times and cluster 0 will not always be assigned to the same data points, even if the points are grouped identically.
Here's a reproducible example of this:
import pandas as pd
from sklearn import cluster, datasets
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
X, y = datasets.make_blobs(100,2,centers=6)
df = pd.DataFrame(X)
random_states = list(range(0,60,10))
fig, ax = plt.subplots(3,2, figsize=(20,16))
for i, r in enumerate(random_states):
    clusters = KMeans(n_clusters=6, init='k-means++', random_state=r).fit(X)
    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
    )
    row = i // 2
    col = i - row * 2
    sns.scatterplot(data=df, x=0, y=1, hue='Cluster', s=200, palette='coolwarm', legend=True, ax=ax[row, col])
    sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster', s=1000,
                    palette='coolwarm', legend=False, alpha=0.1, ax=ax[row, col])
Here's a version with your data:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
TableStats = pd.read_csv('TableStats2.csv')
sc = StandardScaler()
pca = PCA()
tables = TableStats.iloc[:,0]
y = tables
features = ['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)']
# Separating out the features
x = TableStats.loc[:, features].values
x = sc.fit_transform(x)
dpca = pca.fit_transform(x)
x1 = dpca[:,0]
y1 = dpca[:,1]
random_states = [1,2,3,4,5,6]
for r in random_states:
    df = pd.DataFrame(dpca, columns=['Range Scans', 'Singleton Lookups', 'Row Locks', 'Row Lock Waits (ms)',
                                     'Page Locks', 'Page Lock Waits (ms)', 'Page IO Latch Wait (ms)'])
    clusters = KMeans(n_clusters=6, init='k-means++', random_state=r).fit(df)
    df = (df
          .assign(**{
              'Cluster': clusters.labels_,
              'Cluster Centroid D1': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][0]),
              'Cluster Centroid D2': lambda x: x['Cluster'].apply(lambda label: clusters.cluster_centers_[label][1]),
          })
    )
    plt.figure(figsize=(20, 11))
    ax = sns.scatterplot(data=df, x=x1, y=y1, hue='Cluster', s=200, palette='coolwarm', legend=True)
    ax = sns.scatterplot(data=df, x="Cluster Centroid D1", y="Cluster Centroid D2", hue='Cluster',
                         s=1000, palette='coolwarm', legend=False, alpha=0.1)
    plt.legend(loc='upper right', title='Cluster')
    ax.set_title("Clustered Points", fontsize='xx-large', y=1.05)
    plt.show()
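If you do need labels that are comparable across runs, one common workaround (not part of the original answer, and assuming scipy is available) is to match the clusters of two runs with the Hungarian algorithm on their contingency table:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import datasets
from sklearn.cluster import KMeans

X, _ = datasets.make_blobs(100, 2, centers=6, random_state=0)
a = KMeans(n_clusters=6, init='k-means++', random_state=1).fit_predict(X)
b = KMeans(n_clusters=6, init='k-means++', random_state=2).fit_predict(X)
# cont[i, j] counts points labelled i in run A and j in run B
cont = np.zeros((6, 6), dtype=int)
for i, j in zip(a, b):
    cont[i, j] += 1
# Maximise agreement: map each of B's labels onto the best-matching A label
rows, cols = linear_sum_assignment(-cont)
mapping = dict(zip(cols, rows))
b_aligned = np.array([mapping[j] for j in b])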
I have code that creates a scatterplot matrix, and I would like to add a linear regression line to each facet. The code and the current graph are shown below. I currently have a scatterplot for each of the variable combinations of the first five variables in the dataset. I would like to add the regression lines so that when someone hovers over a line they can also see the correlation.
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
from sklearn import datasets
from typing import Tuple, List
import plotly.graph_objects as go
from plotly.subplots import make_subplots
def load_data() -> Tuple[np.ndarray, np.ndarray, List[str]]:
    """Load the wine dataset.

    Returns:
        features: the dataset features
        target: the labels of the dataset
        feature_names: names of each feature
    """
    wine = datasets.load_wine()
    features = wine['data']
    target = wine['target']
    feature_names = wine['feature_names']
    return features, target, feature_names
features, target, feature_names = load_data()
Data = {
    feature_names[0]: features[:, 0].tolist(),
    feature_names[1]: features[:, 1].tolist(),
    feature_names[2]: features[:, 2].tolist(),
    'Target': target.tolist()
}
Data = pd.DataFrame(data=Data)
index_vals = Data['Target'].astype('category').cat.codes
fig = go.Figure(data=go.Splom(
    dimensions=[
        dict(label=feature_names[0], values=Data[feature_names[0]]),
        dict(label=feature_names[1], values=Data[feature_names[1]]),
        dict(label=feature_names[2], values=Data[feature_names[2]])
    ],
    text=Data['Target'],
    marker=dict(color=index_vals, showscale=False, size=8)
))
fig.update_layout(
title='Wine Dataset',
dragmode='select',
width=900,
height=600,
hovermode='closest',
)
fig.show()
With newer versions of plotly, all you need to do is include trendline="ols" in px.scatter. Here's an example that builds on the dataset px.data.tips():
Complete code:
import plotly.express as px
df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip",
                 trendline='ols',
                 trendline_color_override='red',
                 facet_col="day",
                 facet_col_wrap=2)
fig.update_xaxes(matches=None)
fig.show()
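Since this faceted figure fits one OLS model per facet, the earlier trick for pulling out the fit statistics still applies per group. A small sketch, assuming the fig above (px.get_trendline_results should return one row per trendline, with the grouping column identifying each model):

# One row per fitted trendline; the facet column ("day") identifies each model
trendlines = px.get_trendline_results(fig)
for _, row in trendlines.iterrows():
    print(row["day"], round(row["px_fit_results"].rsquared, 3))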
I was able to create two linked plots using holoviews + the bokeh backend, basically following this code example.
Here's the example code from the reference:
import pandas as pd
import numpy as np
import holoviews as hv
import seaborn as sns
from holoviews import opts
hv.extension('bokeh', width=90)
# Declare dataset
df = sns.load_dataset('tips')
df = df[['total_bill', 'tip', 'size']]
# Declare HeatMap
corr = df.corr()
heatmap = hv.HeatMap((corr.columns, corr.index, corr))
# Declare Tap stream with heatmap as source and initial values
posxy = hv.streams.Tap(source=heatmap, x='total_bill', y='tip')
# Define function to compute histogram based on tap location
def tap_histogram(x, y):
    m, b = np.polyfit(df[x], df[y], deg=1)
    x_data = np.linspace(df.tip.min(), df.tip.max())
    y_data = m * x_data + b
    right = (hv.Curve((x_data, y_data), x, y)
             * hv.Scatter((df[x], df[y]), x, y))
    right.opts(opts.Scatter(
        height=400, width=400, color='red', ylim=(0, 100),
        framewise=True, tools=['hover']))
    return right
tap_dmap = hv.DynamicMap(tap_histogram, streams=[posxy])
(heatmap + tap_dmap).opts(
    opts.HeatMap(tools=['tap', 'hover'],
                 height=400, width=400, toolbar='above'),
    opts.Curve(framewise=True))
Now I want to create a hover tool specifying the different parameters on the dependent plot.
So far I have only been able to use the default hover (.opts(tools=['hover'])), as in the code above.
When I try to build a custom hover that dynamically changes its fields based on the streamed x and y values, the hover does not update after tapping on the heatmap; it only keeps the initial values of x and y.
Here's an example of my current code (try tapping on total_bill x size, for example):
import pandas as pd
import numpy as np
import holoviews as hv
import seaborn as sns
from holoviews import opts
from bokeh.models import HoverTool
hv.extension('bokeh', width=90)
# Declare dataset
df = sns.load_dataset('tips')
df = df[['total_bill', 'tip', 'size']]
# Declare HeatMap
corr = df.corr()
heatmap = hv.HeatMap((corr.columns, corr.index, corr))
# Declare Tap stream with heatmap as source and initial values
posxy = hv.streams.Tap(source=heatmap, x='total_bill', y='tip')
# Define function to compute histogram based on tap location
def tap_histogram(x, y):
    m, b = np.polyfit(df[x], df[y], deg=1)
    x_data = np.linspace(df.tip.min(), df.tip.max())
    y_data = m * x_data + b
    right = (hv.Curve((x_data, y_data), x, y)
             * hv.Scatter((df[x], df[y]), x, y))
    tooltips = [(x, '@' + x),
                (y, '@' + y)]
    hover = HoverTool(tooltips=tooltips)
    right.opts(opts.Scatter(
        height=400, width=400, color='red', ylim=(0, 100),
        framewise=True, tools=[hover]))
    return right
tap_dmap = hv.DynamicMap(tap_histogram, streams=[posxy])
(heatmap + tap_dmap).opts(
    opts.HeatMap(tools=['tap', 'hover'],
                 height=400, width=400, toolbar='above'),
    opts.Curve(framewise=True))