plotly -- overlay significance stars / annotation over boxplots - python

I'm trying to generate a faceted boxplot for linear model results, with treatments on the x axis. A conventional way to show significance is to append asterisks.
I'm finding this surprisingly difficult to do in plotly.
Example code:
import numpy as np
import pandas as pd
# Data
n = 10
conditiona = ['left', 'right']
conditionb = ['top', 'middle', 'bottom']
N = n * len(conditiona) * len(conditionb)
trt = np.repeat(['a', 'b','c'], N)
eff = np.repeat([3, 2, 1], N)
noise = np.random.normal(size = 3* N, loc = 0, scale = 1)
pval = np.repeat(['**', '', ''], N)
col = np.tile( np.repeat( conditiona, n * len(conditionb)), 3)
row = np.tile( np.repeat( conditionb, n) , len(conditiona) *3)
df = pd.DataFrame( { 'y' : noise + eff, 'trt' : trt, 'p' : pval, 'column' : col,
'row' : row})
## Plot
import plotly.graph_objects as go
from plotly.subplots import make_subplots
rows = df.row.unique().tolist()
cols = df.column.unique().tolist()
groups = df.trt.unique().tolist()
labs = [i + ' ' + j for j in rows for i in cols]
colors = ['red', 'green', 'blue']
fig = make_subplots(rows = len(rows), cols = len(cols),
shared_xaxes=True, subplot_titles = labs)
for group, dx in df.groupby(['row','column','trt']):
r = rows.index( group[0] ) + 1 # 1-based numbering
c = cols.index( group[1] ) + 1
name = str(group[2])
id = groups.index(group[2])
tr = go.Box( y = dx['y'], boxpoints = 'all', name = name ,marker_color = colors[id], text = dx['p'])
# tr2 = go.Scatter(x = 'x0', <- how do I get relative x coordinates of tr to put in here ?
# y = dx['y'].median(), text = dx['p'].unique())
fig.add_trace( tr, row = r, col = c )
fig.show()
[Desired] Output:
Is there an easy way to 'extract' the x coordinates of a box trace so I can overlay a marker?
Seems like this shouldn't be hard.

Figured it out eventually. You just have to know how plotly sets things up beforehand, apparently.
You can use annotations with xref and yref referencingthe subplots. The pattern of assignment is confusing (to me) and poorly documented.
y_refs increase sequentially from the bottom left, reading left to right. Thus in this figure bottom left panel is 'y1', bottom right is 'y2', middle left is 'y3' , middle right is 'y4' and so on.
ncol = len(cols)
fig = make_subplots(rows = len(rows), cols = len(cols),
shared_xaxes=True, subplot_titles = labs)
for group, dx in df.groupby(['row','column','trt']):
r = rows.index( group[0] ) + 1 # 1-based numbering
c = cols.index( group[1] ) + 1
name = str(group[2])
id = groups.index(group[2])
tr = go.Box( y = dx['y'], boxpoints = 'all', name = name ,marker_color = colors[id], text = dx['p'])
fig.add_trace( tr, row = r, col = c )
xref = 'x' + str(c)
yref = 'y' + str( (r-1)*ncol + c ) # yrefs added in strange pattern
fig.add_annotation(x = name,
y = dx.y.median(),
text = dx.p.unique()[0],
ax = 0, ay = 0,showarrow = False,
xref = xref, yref = yref,
font= dict(size = 24))
fig.show()

Related

How to create an altair condition with three or more choices [duplicate]

I want to combine hover and click in selections in Altair plot. The code below produces the result that I want: points are mostly transparent by default, hovering over the point increases the opacity, and then clicking on the point increases the opacity even more. I find this useful so that users can hover over a point to get a quick sense of the results, and then click on the point to "lock" the selection. While I am satisfied with the results, the method seems a bit cumbersome because I need to define different chart layers for the hover and click selections. If I could construct a multi-way condition expression, then it seems like I would be able to simplify the code quite a bit. I tried writing the opacity condition as alt.condition(click_selection, CLICK_OPACITY, alt.condition(hover_selection, HOVER_OPACITY, DEFAULT_OPACITY)), but I got an error. Is there a way to simplify my code below to combine hover and click selections?
import altair as alt
import numpy as np
import pandas as pd
a_values = np.arange(1, 4)
x_values = np.linspace(0, 2, 1000)
DEFAULT_OPACITY = 0.3
HOVER_OPACITY = 0.5
CLICK_OPACITY = 1.0
a_df = pd.DataFrame({'a': a_values})
df = pd.DataFrame({
'a': np.tile(A=a_values, reps=len(x_values)),
'x': np.repeat(a=x_values, repeats=len(a_values)),
})
df['y'] = df['a'] * np.sin(2 * np.pi * df['x'])
hover_selection = alt.selection_single(
clear='mouseout',
empty='none',
fields=['a'],
name='hover_selection',
on='mouseover',
)
click_selection = alt.selection_single(
empty='none',
fields=['a'],
name='click_selection',
on='click',
)
a_base = alt.Chart(a_df).mark_point(
filled=True, size=100,
).encode(
x=alt.X(shorthand='a:Q', scale=alt.Scale(domain=(min(a_values) - 1, max(a_values) + 1))),
y=alt.Y(shorthand='a:Q', scale=alt.Scale(domain=(min(a_values) - 1, max(a_values) + 1))),
)
a_hover = a_base.encode(
opacity=alt.condition(hover_selection, alt.value(HOVER_OPACITY), alt.value(DEFAULT_OPACITY))
).add_selection(hover_selection)
a_click = a_base.encode(
opacity=alt.condition(click_selection, alt.value(CLICK_OPACITY), alt.value(0.0)),
).add_selection(click_selection)
y_base = alt.Chart(df).mark_line().encode(
x=alt.X(shorthand='x:Q', scale=alt.Scale(domain=(0, 2))),
y=alt.Y(shorthand='y:Q', scale=alt.Scale(domain=(-3, 3))),
)
y_hover = y_base.encode(
opacity=alt.value(HOVER_OPACITY),
).transform_filter(hover_selection)
y_click = y_base.encode(
opacity=alt.value(CLICK_OPACITY),
).transform_filter(click_selection)
alt.hconcat(
alt.layer(a_hover, a_click),
alt.layer(y_hover, y_click),
)
VegaLite supports multiple selections in the same condition, but I don't think it is possible to write within alt.Condition. However, you can see the alt.Condition returns a dictionary so you could write this directly passing a list of selections.
This way you could clarify this section
a_base = alt.Chart(a_df).mark_point(
filled=True, size=100,
).encode(
x=alt.X(shorthand='a:Q', scale=alt.Scale(domain=(min(a_values) - 1, max(a_values) + 1))),
y=alt.Y(shorthand='a:Q', scale=alt.Scale(domain=(min(a_values) - 1, max(a_values) + 1))),
)
a_hover = a_base.encode(
opacity=alt.condition(hover_selection, alt.value(HOVER_OPACITY), alt.value(DEFAULT_OPACITY))
).add_selection(hover_selection)
a_click = a_base.encode(
opacity=alt.condition(click_selection, alt.value(CLICK_OPACITY), alt.value(0.0)),
).add_selection(click_selection)
to something like this:
hover_and_click_condition = {
'condition': [
{'selection': 'hover_selection', 'value': HOVER_OPACITY},
{'selection': 'click_selection', 'value': CLICK_OPACITY}],
'value': DEFAULT_OPACITY}
a = alt.Chart(a_df).mark_point(
filled=True, size=100,
).encode(
x=alt.X(shorthand='a:Q', scale=alt.Scale(domain=(min(a_values) - 1, max(a_values) + 1))),
y=alt.Y(shorthand='a:Q', scale=alt.Scale(domain=(min(a_values) - 1, max(a_values) + 1))),
opacity=hover_and_click_condition
).add_selection(hover_selection, click_selection)
For the transform filters, you could rewrite this section
y_base = alt.Chart(df).mark_line().encode(
x=alt.X(shorthand='x:Q', scale=alt.Scale(domain=(0, 2))),
y=alt.Y(shorthand='y:Q', scale=alt.Scale(domain=(-3, 3))),
)
y_hover = y_base.encode(
opacity=alt.value(HOVER_OPACITY),
).transform_filter(hover_selection)
y_click = y_base.encode(
opacity=alt.value(CLICK_OPACITY),
).transform_filter(click_selection)
alt.hconcat(
alt.layer(a_hover, a_click),
alt.layer(y_hover, y_click),
)
like this:
y = alt.Chart(df).mark_line().encode(
x=alt.X(shorthand='x:Q', scale=alt.Scale(domain=(0, 2))),
y=alt.Y(shorthand='y:Q', scale=alt.Scale(domain=(-3, 3))),
opacity=hover_and_click_condition
)
a | (y.transform_filter(click_selection) + y.transform_filter(hover_selection))

Reorder Sankey diagram vertically based on label value

I'm trying to plot patient flows between 3 clusters in a Sankey diagram. I have a pd.DataFrame counts with from-to values, see below. To reproduce this DF, here is the counts dict that should be loaded into a pd.DataFrame (which is the input for the visualize_cluster_flow_counts function).
from to value
0 C1_1 C1_2 867
1 C1_1 C2_2 405
2 C1_1 C0_2 2
3 C2_1 C1_2 46
4 C2_1 C2_2 458
... ... ... ...
175 C0_20 C0_21 130
176 C0_20 C2_21 1
177 C2_20 C1_21 12
178 C2_20 C0_21 0
179 C2_20 C2_21 96
The from and to values in the DataFrame represent the cluster number (either 0, 1, or 2) and the amount of days for the x-axis (between 1 and 21). If I plot the Sankey diagram with these values, this is the result:
Code:
import plotly.graph_objects as go
def visualize_cluster_flow_counts(counts):
all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))
froms, tos, vals, labs = [], [], [], []
for index, row in counts.iterrows():
froms.append(all_sources.index(row.values[0]))
tos.append(all_sources.index(row.values[1]))
vals.append(row[2])
labs.append(row[3])
fig = go.Figure(data=[go.Sankey(
arrangement='snap',
node = dict(
pad = 15,
thickness = 5,
line = dict(color = "black", width = 0.1),
label = all_sources,
color = "blue"
),
link = dict(
source = froms,
target = tos,
value = vals,
label = labs
))])
fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
fig.show()
visualize_cluster_flow_counts(counts)
However, I would like to vertically order the bars so that the C0's are always on top, the C1's are always in the middle, and the C2's are always at the bottom (or the other way around, doesn't matter). I know that we can set node.x and node.y to manually assign the coordinates. So, I set the x-values to the amount of days * (1/range of days), which is an increment of +- 0.045. And I set the y-values based on the cluster value: either 0, 0.5 or 1. I then obtain the image below. The vertical order is good, but the vertical margins between the bars are obviously way off; they should be similar to the first result.
The code to produce this is:
import plotly.graph_objects as go
def find_node_coordinates(sources):
x_nodes, y_nodes = [], []
for s in sources:
# Shift each x with +- 0.045
x = float(s.split("_")[-1]) * (1/21)
x_nodes.append(x)
# Choose either 0, 0.5 or 1 for the y-value
cluster_number = s[1]
if cluster_number == "0": y = 1
elif cluster_number == "1": y = 0.5
else: y = 1e-09
y_nodes.append(y)
return x_nodes, y_nodes
def visualize_cluster_flow_counts(counts):
all_sources = list(set(counts['from'].values.tolist() + counts['to'].values.tolist()))
node_x, node_y = find_node_coordinates(all_sources)
froms, tos, vals, labs = [], [], [], []
for index, row in counts.iterrows():
froms.append(all_sources.index(row.values[0]))
tos.append(all_sources.index(row.values[1]))
vals.append(row[2])
labs.append(row[3])
fig = go.Figure(data=[go.Sankey(
arrangement='snap',
node = dict(
pad = 15,
thickness = 5,
line = dict(color = "black", width = 0.1),
label = all_sources,
color = "blue",
x = node_x,
y = node_y,
),
link = dict(
source = froms,
target = tos,
value = vals,
label = labs
))])
fig.update_layout(title_text="Patient flow between clusters over time: 48h (2 days) - 504h (21 days)", font_size=10)
fig.show()
visualize_cluster_flow_counts(counts)
Question: how do I fix the margins of the bars, so that the result looks like the first result? So, for clarity: the bars should be pushed to the bottom. Or is there another way that the Sankey diagram can vertically re-order the bars automatically based on the label value?
Firstly I don't think there is a way with the current exposed API to achieve your goal smoothly you can check the source code here.
Try to change your find_node_coordinates function as follows (note that you should pass the counts DataFrame to):
counts = pd.DataFrame(counts_dict)
def find_node_coordinates(sources, counts):
x_nodes, y_nodes = [], []
flat_on_top = False
range = 1 # The y range
total_margin_width = 0.15
y_range = 1 - total_margin_width
margin = total_margin_width / 2 # From number of Cs
srcs = counts['from'].values.tolist()
dsts = counts['to'].values.tolist()
values = counts['value'].values.tolist()
max_acc = 0
def _calc_day_flux(d=1):
_max_acc = 0
for i in [0,1,2]:
# The first ones
from_source = 'C{}_{}'.format(i,d)
indices = [i for i, val in enumerate(srcs) if val == from_source]
for j in indices:
_max_acc += values[j]
return _max_acc
def _calc_node_io_flux(node_str):
c,d = int(node_str.split('_')[0][-1]), int(node_str.split('_')[1])
_flux_src = 0
_flux_dst = 0
indices_src = [i for i, val in enumerate(srcs) if val == node_str]
indices_dst = [j for j, val in enumerate(dsts) if val == node_str]
for j in indices_src:
_flux_src += values[j]
for j in indices_dst:
_flux_dst += values[j]
return max(_flux_dst, _flux_src)
max_acc = _calc_day_flux()
graph_unit_per_val = y_range / max_acc
print("Graph Unit per Acc Val", graph_unit_per_val)
for s in sources:
# Shift each x with +- 0.045
d = int(s.split("_")[-1])
x = float(d) * (1/21)
x_nodes.append(x)
print(s, _calc_node_io_flux(s))
# Choose either 0, 0.5 or 1 for the y-v alue
cluster_number = s[1]
# Flat on Top
if flat_on_top:
if cluster_number == "0":
y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(0, d))*graph_unit_per_val/2
elif cluster_number == "1": y = _calc_node_io_flux('C{}_{}'.format(2, d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1, d))*graph_unit_per_val/2
else: y = 1e-09
# Flat On Bottom
else:
if cluster_number == "0": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val / 2)
elif cluster_number == "1": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val /2 )
elif cluster_number == "2": y = 1 - (_calc_node_io_flux('C{}_{}'.format(0,d))*graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(1,d)) * graph_unit_per_val + margin + _calc_node_io_flux('C{}_{}'.format(2,d)) * graph_unit_per_val /2 )
y_nodes.append(y)
return x_nodes, y_nodes
Sankey graphs supposed to weigh their connection width by their corresponding normalized values right? Here I do the same, first, it calculates each node flux, later by calculating the normalized coordinate the center of each node calculated according to their flux.
Here is the sample output of your code with the modified function, note that I tried to adhere to your code as much as possible so it's a bit unoptimized(for example, one could store the values of nodes above each specified source node to avoid its flux recalculation).
With flag flat_on_top = True
With flag flat_on_top = False
There is a bit of inconsistency in the flat_on_bottom version which I think is caused by the padding or other internal sources of Plotly API.

Plotly: How to find and annotate the intersection point between two lines?

I am trying to find and annotate the intersection point of two-line using Plotly. I know we can use (plt.plot(*intersection.xy,'ko')) to get the intersection point in Mathplotlib, but how can do it in plotly or if it can be done.
import numpy as np
import pandas as pd
import plotly.express as px
test = pd.merge(ps,uts,on = 'Time')
print(test)
time = test['Time']
omfr = test['Orifice Mass Flow Rate']
qab = test['Qab']
fig = px.line(test, x=time, y=[omfr,qab])
fig.update_layout( title='Pipe stress and UTS (MPa)',xaxis_title="Time (s)",yaxis_title="Pipe stress and UTS (MPa)",
hovermode='x')
Output of the code:
Intersection point:
Annotating and showing the intersection is easy. Finding it is the hard part, and my suggestion in that regard builds directly on the contributions in the post How do I compute the intersection point of two lines. I'll include a few lines on the details in my suggestion when I find the time. For now, the complete code snippet at the end of my answer will produce the following figure using this dataset:
Data
x y1 y2
0 1 1 11.00
1 2 8 14.59
2 3 27 21.21
3 4 64 31.11
Plot
Edit - Annotation
If you'd like to change the text annotation, just change
text="intersect"
... to something like:
text = 'lines intersect at x = ' + str(round(x[0], 2)) + ' and y = ' + str(round(y[0], 2))
Result:
Complete code
import pandas as pd
import plotly.graph_objects as go
import numpy as np
# import dash
# sample dataframe
df = pd.DataFrame()
df['x'] = np.arange(4) +1
df['y1'] = df['x']**3
df['y2'] = [10+val**2.2 for val in df['x']]
# intersection stuff
def _rect_inter_inner(x1,x2):
n1=x1.shape[0]-1
n2=x2.shape[0]-1
X1=np.c_[x1[:-1],x1[1:]]
X2=np.c_[x2[:-1],x2[1:]]
S1=np.tile(X1.min(axis=1),(n2,1)).T
S2=np.tile(X2.max(axis=1),(n1,1))
S3=np.tile(X1.max(axis=1),(n2,1)).T
S4=np.tile(X2.min(axis=1),(n1,1))
return S1,S2,S3,S4
def _rectangle_intersection_(x1,y1,x2,y2):
S1,S2,S3,S4=_rect_inter_inner(x1,x2)
S5,S6,S7,S8=_rect_inter_inner(y1,y2)
C1=np.less_equal(S1,S2)
C2=np.greater_equal(S3,S4)
C3=np.less_equal(S5,S6)
C4=np.greater_equal(S7,S8)
ii,jj=np.nonzero(C1 & C2 & C3 & C4)
return ii,jj
def intersection(x1,y1,x2,y2):
ii,jj=_rectangle_intersection_(x1,y1,x2,y2)
n=len(ii)
dxy1=np.diff(np.c_[x1,y1],axis=0)
dxy2=np.diff(np.c_[x2,y2],axis=0)
T=np.zeros((4,n))
AA=np.zeros((4,4,n))
AA[0:2,2,:]=-1
AA[2:4,3,:]=-1
AA[0::2,0,:]=dxy1[ii,:].T
AA[1::2,1,:]=dxy2[jj,:].T
BB=np.zeros((4,n))
BB[0,:]=-x1[ii].ravel()
BB[1,:]=-x2[jj].ravel()
BB[2,:]=-y1[ii].ravel()
BB[3,:]=-y2[jj].ravel()
for i in range(n):
try:
T[:,i]=np.linalg.solve(AA[:,:,i],BB[:,i])
except:
T[:,i]=np.NaN
in_range= (T[0,:] >=0) & (T[1,:] >=0) & (T[0,:] <=1) & (T[1,:] <=1)
xy0=T[2:,in_range]
xy0=xy0.T
return xy0[:,0],xy0[:,1]
# plotly figure
x,y=intersection(np.array(df['x'].values.astype('float')),np.array(df['y1'].values.astype('float')),
np.array(df['x'].values.astype('float')),np.array(df['y2'].values.astype('float')))
fig = go.Figure(data=go.Scatter(x=df['x'], y=df['y1'], mode = 'lines'))
fig.add_traces(go.Scatter(x=df['x'], y=df['y2'], mode = 'lines'))
fig.add_traces(go.Scatter(x=x, y=y,
mode = 'markers',
marker=dict(line=dict(color='black', width = 2),
symbol = 'diamond',
size = 14,
color = 'rgba(255, 255, 0, 0.6)'),
name = 'intersect'),
)
fig.add_annotation(x=x[0], y=y[0],
# text="intersect",
text = 'lines intersect at x = ' + str(round(x[0], 2)) + ' and y = ' + str(round(y[0], 2)),
font=dict(family="sans serif",
size=18,
color="black"),
ax=0,
ay=-100,
showarrow=True,
arrowhead=1)
fig.show()

Cluster groups by direction and magnitude - Python

I'm hoping to cluster vectors based on the direction and magnitude using python. I've found limited examples using R but none for python. Not to confuse with standard k-means for scatter points, I'm actually trying to cluster the whole vector.
The following takes two sets of xy points to generate a vector. I'm then hoping to cluster these vectors based on the length and direction.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'
fig,ax = plt.subplots()
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)
A = df['A']
B = df['B']
C = df['C']
D = df['D']
ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.5)
X_1 = np.array(df[['A','B','C','D']])
model = KMeans(n_clusters = 20)
model.fit(X_1)
cluster_labels = model.predict(X_1)
df['n_cluster'] = cluster_labels
centroids_1 = pd.DataFrame(data = model.cluster_centers_, columns = ['start_x', 'start_y', 'end_x', 'end_y'])
cc = model.cluster_centers_
a = cc[:, 0]
b = cc[:, 1]
c = cc[:, 2]
d = cc[:, 3]
lc1 = ax.quiver(a, b, (c-a), (d-b), angles = 'xy', scale_units = 'xy', scale = 1, alpha = 0.8)
The following figure displays an example
What about this :
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import hdbscan
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'
A = df['A'] #X start
B = df['B'] #Y start
C = df['C'] #X arrive
D = df['D'] #Y arrive
clusterer = hdbscan.HDBSCAN()
df['LENGTH'] = np.sqrt(np.square(df.C-df.A) + np.square(df.D-df.B))
df['DIRECTION'] = np.degrees(np.arctan2(df.D-df.B, df.C-df.A))
coords = df[['LENGTH', 'DIRECTION']].values
clusterer.fit_predict(coords)
cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
[(coords[cluster_labels==n], len(coords[cluster_labels==n])) for n in range(num_clusters)],
columns=["points", "weight"]
)
colors = {0:"green", 1:"blue", 2:"red", 3:"yellow", 4:"pink"}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
df_this_cluster = pd.DataFrame(cluster, columns=['LENGTH', 'DIRECTION'])
df_this_cluster['TEMP'] = x
df = df.merge(df_this_cluster, on=['LENGTH', 'DIRECTION'], how='left')
ix = df[df.TEMP.notnull()].index
df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
df.drop("TEMP", axis=1, inplace=True)
df['COLOR'] = df['CLUSTER'].map(colors).fillna('black')
fig,ax = plt.subplots()
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)
ax.quiver(df.A, df.B, (df.C-df.A), (df.D-df.B), angles='xy', scale_units='xy', scale=1, alpha=0.5, color=df.COLOR)
This will use clustering based on length and direction (direction being transformed to degrees, radians' small range doesn't match very well with the model on my first try).
I don't think this will be a very "cartesian" solution as the two values beeing analysed in the model are not in the same metrics... But the visual results are not so bad...
I did try another match based on the 4 coordinates, which is more rigorous. But it is (quite expectably) clustering the vectors by subareas of the space (when there are any) :
coords = df[['A', 'B', 'C', 'D']].values
clusterer.fit_predict(coords)
cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
[(coords[cluster_labels==n], len(coords[cluster_labels==n])) for n in range(num_clusters)],
columns=["points", "weight"]
)
colors = {0:"green", 1:"blue", 2:"red", 3:"yellow", 4:"pink"}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
df_this_cluster = pd.DataFrame(cluster, columns=['A', 'B', 'C', 'D'])
df_this_cluster['TEMP'] = x
df = df.merge(df_this_cluster, on=['A', 'B', 'C', 'D'], how='left')
ix = df[df.TEMP.notnull()].index
df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
df.drop("TEMP", axis=1, inplace=True)
df['COLOR'] = df['CLUSTER'].map(colors).fillna('black')
EDIT
I gave it another try, based on the (very good) suggestion that angles are not a good variable given the fact that there are discontinuities around 0/2pi ; so I choose to use both sinuses and cosinuses instead. I also scaled the length (to have matching scales for the 3 variables) :
So the result would be :
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import robust_scale
import hdbscan
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
plt.rcParams['image.cmap'] = 'Paired'
A = df['A'] #X start
B = df['B'] #Y start
C = df['C'] #X arrive
D = df['D'] #Y arrive
clusterer = hdbscan.HDBSCAN()
df['LENGTH'] = robust_scale(np.sqrt(np.square(df.C-df.A) + np.square(df.D-df.B)))
df['DIRECTION'] = np.arctan2(df.D-df.B, df.C-df.A)
df['COS'] = np.cos(df['DIRECTION'])
df['SIN'] = np.sin(df['DIRECTION'])
columns = ['LENGTH', 'COS', 'SIN']
clusterer = hdbscan.HDBSCAN()
values = df[columns].values
clusterer.fit_predict(values)
cluster_labels = clusterer.labels_
num_clusters = len(set(cluster_labels))
clusters = pd.DataFrame(
[(values[cluster_labels==n], len(values[cluster_labels==n])) for n in range(num_clusters)],
columns=["points", "weight"]
)
def get_cmap(n, name='hsv'):
'''
Returns a function that maps each index in 0, 1, ..., n-1 to a distinct
RGB color; the keyword argument name must be a standard mpl colormap name.
Credits to #Ali
https://stackoverflow.com/questions/14720331/how-to-generate-random-colors-in-matplotlib#answer-25628397
'''
return plt.cm.get_cmap(name, n)
cmap = get_cmap(num_clusters+1)
colors = {x:cmap(x) for x in range(num_clusters)}
df['CLUSTER'] = np.nan
for x, (cluster, weight) in enumerate(clusters[clusters.weight>0].values.tolist()):
df_this_cluster = pd.DataFrame(cluster, columns=columns)
df_this_cluster['TEMP'] = x
df = df.merge(df_this_cluster, on=columns, how='left')
df.reset_index(drop=True, inplace=True)
ix = df[df.TEMP.notnull()].index
df.loc[ix, "CLUSTER"] = df.loc[ix, "TEMP"]
df.drop("TEMP", axis=1, inplace=True)
df['CLUSTER'] = df['CLUSTER'].fillna(num_clusters-1)
df['COLOR'] = df['CLUSTER'].map(colors)
print("Number of clusters : ", num_clusters-1)
nrows = num_clusters//2 if num_clusters%2==0 else num_clusters//2 + 1
fig,axes = plt.subplots(nrows=nrows, ncols=2)
axes = [y for row in axes for y in row]
for k,ax in enumerate(axes):
ax.set_xlim(-5, 25)
ax.set_ylim(-5, 25)
ax.set_aspect('equal', adjustable='box')
if k+1 <num_clusters:
ax.set_title(f"CLUSTER #{k+1}", fontsize=10)
this_df = df[df.CLUSTER==k]
ax.quiver(
this_df.A, #X
this_df.B, #Y
(this_df.C-this_df.A), #X component of vector
(this_df.D-this_df.B), #Y component of vector
angles = 'xy',
scale_units = 'xy',
scale = 1,
color=this_df.COLOR
)
The results are way better (though it depends much of the input dataset) ; the last subplots refers to the vectors not being found to be inside a cluster:
Edit #2
If by "direction" you mean angle in the [0..pi[ interval (ie undirected vectors), you will want to include the following code before computing the cosinuses/sinuses :
ix = df[df.DIRECTION<0].index
df.loc[ix, "DIRECTION"] += np.pi
Maybe you can also cluster the angles (besides the vector norms) by the projections of a normalized vector onto the two unit vectors (1,0) and (0,1) with this function. Handling the projections directly (which are essentially the angles), we won't went into trouble caused by the periodicity of cosine function
def get_norm_and_angle(e1):
e1_norm = np.linalg.norm(e1,axis=1)
e1 = e1 / e1_norm[:,None]
e2 = np.array([1,0])
e3 = np.array([0,1])
return np.stack((e1_norm,e1#e2,e1#e3),axis=1)
Based on this function, here is one possible solution where there is no constraint on how many clusters we want to find. In the script below, five features are used for clustering
Vector norm
Vector projections on x and y axis
Vector starting points
with these five features
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans
def get_norm_and_angle(e1):
e1_norm = np.linalg.norm(e1,axis=1)
e1 = e1 / e1_norm[:,None]
e2 = np.array([1,0])
e3 = np.array([0,1])
return np.stack((e1_norm,e1#e2,e1#e3),axis=1)
data = np.cumsum(np.random.randint(0,10,size=(50, 4)),axis=0)
df = pd.DataFrame(data, columns=list('ABCD'))
A = df['A'];B = df['B']
C = df['C'];D = df['D']
starting_points = np.stack((A,B),axis=1)
vectors = np.stack((C,D),axis=1) - np.stack((A,B),axis=1)
different_view = get_norm_and_angle(vectors)
different_view = np.hstack((different_view,starting_points))
num_clusters = 8
model = KMeans(n_clusters=num_clusters)
model.fit(different_view)
cluster_labels = model.predict(different_view)
df['n_cluster'] = cluster_labels
cluster_centers = model.cluster_centers_
cluster_offsets = cluster_centers[:,0][:,None] * cluster_centers[:,1:3]
cluster_starts = np.vstack([np.mean(starting_points[cluster_labels==ind],axis=0) for ind in range(num_clusters)])
main_streams = np.hstack((cluster_starts,cluster_starts+cluster_offsets))
a,b,c,d = main_streams.T
fig,ax = plt.subplots(figsize=(8,8))
ax.set_xlim(-np.max(data)*.1,np.max(data)*1.1)
ax.set_ylim(-np.max(data)*.1,np.max(data)*1.1)
colors = sns.color_palette(n_colors=num_clusters)
lc1 = ax.quiver(a, b, (c-a), (d-b), angles = 'xy', scale_units = 'xy', color = colors, scale = 1, alpha = 0.8, zorder=100)
lc2 = ax.quiver(A, B, (C-A), (D-B), angles = 'xy', scale_units = 'xy', scale = .6, alpha = 0.2)
start_colors = [colors[ind] for ind in cluster_labels]
ax.scatter(starting_points[:,0],starting_points[:,1],c=start_colors)
plt.show()
A sample output is
as you can see in the figure, vectors with close starting points are clustered into the same group.

How to make a sunburst plot in R or Python?

So far I have been unable to find an R library that can create a sunburst plot like those by John Stasko. Anyone knows how to accomplish that in R or Python?
Python version of sunburst diagram using matplotlib bars in polar projection:
import numpy as np
import matplotlib.pyplot as plt
def sunburst(nodes, total=np.pi * 2, offset=0, level=0, ax=None):
ax = ax or plt.subplot(111, projection='polar')
if level == 0 and len(nodes) == 1:
label, value, subnodes = nodes[0]
ax.bar([0], [0.5], [np.pi * 2])
ax.text(0, 0, label, ha='center', va='center')
sunburst(subnodes, total=value, level=level + 1, ax=ax)
elif nodes:
d = np.pi * 2 / total
labels = []
widths = []
local_offset = offset
for label, value, subnodes in nodes:
labels.append(label)
widths.append(value * d)
sunburst(subnodes, total=total, offset=local_offset,
level=level + 1, ax=ax)
local_offset += value
values = np.cumsum([offset * d] + widths[:-1])
heights = [1] * len(nodes)
bottoms = np.zeros(len(nodes)) + level - 0.5
rects = ax.bar(values, heights, widths, bottoms, linewidth=1,
edgecolor='white', align='edge')
for rect, label in zip(rects, labels):
x = rect.get_x() + rect.get_width() / 2
y = rect.get_y() + rect.get_height() / 2
rotation = (90 + (360 - np.degrees(x) % 180)) % 360
ax.text(x, y, label, rotation=rotation, ha='center', va='center')
if level == 0:
ax.set_theta_direction(-1)
ax.set_theta_zero_location('N')
ax.set_axis_off()
Example, how this function can be used:
data = [
('/', 100, [
('home', 70, [
('Images', 40, []),
('Videos', 20, []),
('Documents', 5, []),
]),
('usr', 15, [
('src', 6, [
('linux-headers', 4, []),
('virtualbox', 1, []),
]),
('lib', 4, []),
('share', 2, []),
('bin', 1, []),
('local', 1, []),
('include', 1, []),
]),
]),
]
sunburst(data)
You can even build an interactive version quite easily with R now:
# devtools::install_github("timelyportfolio/sunburstR")
library(sunburstR)
# read in sample visit-sequences.csv data provided in source
# https://gist.github.com/kerryrodden/7090426#file-visit-sequences-csv
sequences <- read.csv(
system.file("examples/visit-sequences.csv",package="sunburstR")
,header=F
,stringsAsFactors = FALSE
)
sunburst(sequences)
...and when you move your mouse above it, the magic happens:
Edit
The official site of this package can be found here (with many examples!): https://github.com/timelyportfolio/sunburstR
Hat Tip to #timelyportfolio who created this impressive piece of code!
You can create something along the lines of a sunburst plot using geom_tile from the ggplot2 package. Let's first create some random data:
require(ggplot2); theme_set(theme_bw())
require(plyr)
dat = data.frame(expand.grid(x = 1:10, y = 1:10),
z = sample(LETTERS[1:3], size = 100, replace = TRUE))
And then create the raster plot. Here, the x axis in the plot is coupled to the x variable in dat, the y axis to the y variable, and the fill of the pixels to the z variable. This yields the following plot:
p = ggplot(dat, aes(x = x, y = y, fill = z)) + geom_tile()
print(p)
The ggplot2 package supports all kinds of coordinate transformations, one of which takes one axis and projects it on a circle, i.e. polar coordinates:
p + coord_polar()
This roughly does what you need, now you can tweak dat to get the desired result.
Theres a package called ggsunburst. Sadly is not in CRAN but you can install following the instruction in the website: http://genome.crg.es/~didac/ggsunburst/ggsunburst.html.
Hope it helps to people who still looking for a good package like this.
Regards,
Here's a ggplot2 sunburst with two layers.
The basic idea is to just make a different bar for each layer, and make the bars wider for the outer layers. I also messed with the x-axis to make sure there's no hole in the middle of the inner pie chart. You can thus control the look of the sunburst by changing the width and x-axis values.
library(ggplot2)
# make some fake data
df <- data.frame(
'level1'=c('a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'),
'level2'=c('a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'c1', 'c2', 'c3'),
'value'=c(.025, .05, .027, .005, .012, .014, .1, .03, .18))
# sunburst plot
ggplot(df, aes(y=value)) +
geom_bar(aes(fill=level1, x=0), width=.5, stat='identity') +
geom_bar(aes(fill=level2, x=.25), width=.25, stat='identity') +
coord_polar(theta='y')
The only disadvantage this has compared to sunburst-specific software is that it assumes you want the outer layers to be collectively exhaustive (i.e. no gaps). "Partially exhaustive" outer layers (like in some of the other examples) are surely possible but more complicated.
For completeness, here it is cleaned up with nicer formatting and labels:
library(data.table)
# compute cumulative sum for outer labels
df <- data.table(df)
df[, cumulative:=cumsum(value)-(value/2)]
# store labels for inner circle
inner_df <- df[, c('level1', 'value'), with=FALSE]
inner_df[, level1_value:=sum(value), by='level1']
inner_df <- unique(text_df[, c('level1', 'level1_value'), with=FALSE])
inner_df[, cumulative:=cumsum(level1_value)]
inner_df[, prev:=shift(cumulative)]
inner_df[is.na(prev), position:=(level1_value/2)]
inner_df[!is.na(prev), position:=(level1_value/2)+prev]
colors <- c('#6a3d9a', '#1F78B4', '#33A02C', '#3F146D', '#56238D', '#855CB1', '#AD8CD0', '#08619A', '#3F8DC0', '#076302', '#1B8416', '#50B74B')
colorNames <- c(unique(as.character(df$level1)), unique(as.character(df$level2)))
names(colors) <- colorNames
ggplot(df, aes(y=value, x='')) +
geom_bar(aes(fill=level2, x=.25), width=.25, stat='identity') +
geom_bar(aes(fill=level1, x=0), width=.5, stat='identity') +
geom_text(data=inner_df, aes(label=level1, x=.05, y=position)) +
coord_polar(theta='y') +
scale_fill_manual('', values=colors) +
theme_minimal() +
guides(fill=guide_legend(ncol=1)) +
labs(title='') +
scale_x_continuous(breaks=NULL) +
scale_y_continuous(breaks=df$cumulative, labels=df$level2, 5) +
theme(axis.title.x=element_blank(), axis.title.y=element_blank(), panel.border=element_blank(), panel.grid=element_blank())
There are only a couple of libraries that I know of that do this natively:
The Javascript Infovis Toolkit (jit) (example).
D3.js
OCaml's Simple Plot Tool (SPT).
Neither of these are in Python or R, but getting a python/R script to write out a simple JSON file that can be loaded by either of the javascript libraries should be pretty achievable.
Since jbkunst mentioned ggsunburst, here I post an example for reproducing the sunburst by sirex.
It is not exactly the same because in ggsunburst the angle of a node is equal to the sum of the angles of its children nodes.
# install ggsunburst package
if (!require("ggplot2")) install.packages("ggplot2")
if (!require("rPython")) install.packages("rPython")
install.packages("http://genome.crg.es/~didac/ggsunburst/ggsunburst_0.0.9.tar.gz", repos=NULL, type="source")
library(ggsunburst)
# dataframe
# each row corresponds to a node in the hierarchy
# parent and node are required, the rest are optional attributes
# the attributes correspond to the node, not its parent
df <- read.table(header = T, sep = ",", text = "
parent,node,size,color,dist
,/,,B,1
/,home,,D,1
home,Images, 40,E,1
home,Videos, 20,E,1
home,Documents, 5,E,1
/,usr,,D,1
usr,src,,A,1
src,linux-headers, 4,C,1.5
src,virtualbox, 1,C,1.5
usr,lib, 4,A,1
usr,share, 2,A,1
usr,bin, 1,A,1
usr,local, 1,A,1
usr,include, 1,A,1
")
write.table(df, 'df.csv', sep = ",", row.names = F)
# compute coordinates from dataframe
# "node_attributes" is used to pass the attributes other than "size" and "dist",
# which are special attributes that alter the dimensions of the nodes
sb <- sunburst_data('df.csv', sep = ",", type = "node_parent", node_attributes = "color")
# plot
sunburst(sb, node_labels = T, node_labels.min = 10, rects.fill.aes = "color") +
scale_fill_brewer(palette = "Set1", guide = F)
Here is an example using R and plotly (based on my answer here):
library(datasets)
library(data.table)
library(plotly)
as.sunburstDF <- function(DF, valueCol = NULL){
require(data.table)
colNamesDF <- names(DF)
if(is.data.table(DF)){
DT <- copy(DF)
} else {
DT <- data.table(DF, stringsAsFactors = FALSE)
}
DT[, root := names(DF)[1]]
colNamesDT <- names(DT)
if(is.null(valueCol)){
setcolorder(DT, c("root", colNamesDF))
} else {
setnames(DT, valueCol, "values", skip_absent=TRUE)
setcolorder(DT, c("root", setdiff(colNamesDF, valueCol), "values"))
}
hierarchyCols <- setdiff(colNamesDT, "values")
hierarchyList <- list()
for(i in seq_along(hierarchyCols)){
currentCols <- colNamesDT[1:i]
if(is.null(valueCol)){
currentDT <- unique(DT[, ..currentCols][, values := .N, by = currentCols], by = currentCols)
} else {
currentDT <- DT[, lapply(.SD, sum, na.rm = TRUE), by=currentCols, .SDcols = "values"]
}
setnames(currentDT, length(currentCols), "labels")
hierarchyList[[i]] <- currentDT
}
hierarchyDT <- rbindlist(hierarchyList, use.names = TRUE, fill = TRUE)
parentCols <- setdiff(names(hierarchyDT), c("labels", "values", valueCol))
hierarchyDT[, parents := apply(.SD, 1, function(x){fifelse(all(is.na(x)), yes = NA_character_, no = paste(x[!is.na(x)], sep = ":", collapse = " - "))}), .SDcols = parentCols]
hierarchyDT[, ids := apply(.SD, 1, function(x){paste(x[!is.na(x)], collapse = " - ")}), .SDcols = c("parents", "labels")]
hierarchyDT[, c(parentCols) := NULL]
return(hierarchyDT)
}
DF <- as.data.table(Titanic)
setcolorder(DF, c("Survived", "Class", "Sex", "Age", "N"))
sunburstDF <- as.sunburstDF(DF, valueCol = "N")
# Sunburst
plot_ly(data = sunburstDF, ids = ~ids, labels= ~labels, parents = ~parents, values= ~values, type='sunburst', branchvalues = 'total')
# Treemap
# plot_ly(data = sunburstDF, ids = ~ids, labels= ~labels, parents = ~parents, values= ~values, type='treemap', branchvalues = 'total')
Some additional information can be found here.
You can also use plotly Sunburst on python as well as seen here
The same inputs can be used to create Icicle and Treemap graphs (supported too by plotly) which might also suit your needs.

Categories

Resources