Python: How to stack or overlay histograms using Plotly - python

I have two sets of data in separate lists. Each list element has a value from 0:100, and elements repeat.
For example:
first_data = [10,20,40,100,...,100,10,50]
second_data = [20,50,50,10,...,70,10,100]
I can plot one of these in a histogram using:
import plotly.graph_objects as go
.
.
.
fig = go.Figure()
fig.add_trace(go.Histogram(histfunc='count', x=first_data))
fig.show()
By setting histfunc to 'count', my histogram consists of an x-axis from 0 to 100 and bars for the number of repeated elements in first_data.
My question is: How can I overlay the second set of data over the same axis using the same "count" histogram?

One way to do this is simply to add another trace; you were nearly there! The dataset used to create these examples can be found in the last section of this post.
Note:
The following code uses the 'lower-level' plotly API, as (personally) I feel it's more transparent and enables the user to see what is being plotted, and why; rather than relying on the convenience modules of graph_objects and express.
Option 1 - Overlaid Bars:
from plotly.offline import plot
layout = {}
traces = []
traces.append({'x': data1, 'name': 'D1', 'opacity': 1.0})
traces.append({'x': data2, 'name': 'D2', 'opacity': 0.5})
# For each trace, add elements which are common to both.
for t in traces:
    t.update({'type': 'histogram',
              'histfunc': 'count',
              'nbinsx': 50})
layout['barmode'] = 'overlay'
plot({'data': traces, 'layout': layout})
Output 1:
Option 2 - Curve Plot:
Another option is to plot the curve (Gaussian KDE) of the distribution, as shown here. It's worth noting that this method plots the probability density, rather than the counts.
X1, Y1 = calc_curve(data1)
X2, Y2 = calc_curve(data2)
traces = []
traces.append({'x': X1, 'y': Y1, 'name': 'D1'})
traces.append({'x': X2, 'y': Y2, 'name': 'D2'})
plot({'data': traces})
Output 2:
Associated calc_curve() function:
from scipy.stats import gaussian_kde
def calc_curve(data):
    """Calculate probability density."""
    min_, max_ = data.min(), data.max()
    X = [min_ + i * ((max_ - min_) / 500) for i in range(501)]
    Y = gaussian_kde(data).evaluate(X)
    return X, Y
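Because the curve is a probability density, its area is roughly 1 and its height is not comparable to bar counts. If the curve is to be drawn over a count histogram, one possible approach is to rescale the density by the number of samples times the bin width. The sketch below illustrates that idea; the function name kde_as_counts, the bin width of 2.0 and the normally distributed stand-in sample are all assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_as_counts(data, bin_width, n_points=501):
    """Evaluate a Gaussian KDE and rescale it to the height of a count histogram."""
    data = np.asarray(data, dtype=float)
    xs = np.linspace(data.min(), data.max(), n_points)
    density = gaussian_kde(data).evaluate(xs)
    # expected count in a bin ~= density * number of samples * bin width
    counts = density * data.size * bin_width
    return xs, density, counts


# Stand-in sample (assumption): 10,000 normally distributed values.
rng = np.random.default_rng(4)
sample = rng.normal(50, 15, size=10_000)
xs, density, counts = kde_as_counts(sample, bin_width=2.0)

# Trapezoidal estimate of the area under the density curve (close to 1).
dx = xs[1] - xs[0]
area = float(np.sum((density[:-1] + density[1:]) / 2) * dx)
```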
Option 3 - Plot Bars and Curves:
Or, you can always combine the two methods together, using the probability density on the yaxis.
layout = {}
traces = []
traces.append({'x': data1, 'name': 'D1', 'opacity': 1.0})
traces.append({'x': data2, 'name': 'D2', 'opacity': 0.5})
for t in traces:
    t.update({'type': 'histogram',
              'histnorm': 'probability density',
              'nbinsx': 50})
traces.append({'x': X1, 'y': Y1, 'name': 'D1'})
traces.append({'x': X2, 'y': Y2, 'name': 'D2'})
layout['barmode'] = 'overlay'
plot({'data': traces, 'layout': layout})
Output 3:
Dataset:
Here is the bit of code used to simulate your dataset of [0,100] values, and to create these examples:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler((0, 100))
np.random.seed(4)
data1 = mms.fit_transform(np.random.randn(10000).reshape(-1, 1)).ravel()
data2 = mms.fit_transform(np.random.randn(10000).reshape(-1, 1)).ravel()

Related

Join paired points within each category in seaborn pointplot

I've got some data, grouped by category (i.e. "a","b","c" etc), and I'd like to draw lines between each pair of points within each category.
Basically, each category has a "before" and "after" value, so I've split it that way with hue. This is the plot now, but eventually I want each "before" and "after" value for a given category to be joined with a line (i.e. a_before joins to a_after, b_before joins to b_after, etc).
sns.pointplot(x='category', y='correlation',
              hue='time', linestyles='', dodge=.3, data=sample_data)
I set linestyles to '' because otherwise it joins all the points rather than only the paired points. Is there a way to do this with seaborn?
Thanks!
edit: I'd like it to look something like this:
Matplotlib stores the generated points into the lines field of the ax.
sns.pointplot() always generates (possibly empty) confidence intervals which also get stored into the lines. The same positions are also stored in ax.collections.
You can loop through collections[0] and collections[1] to access the exact position of the (dodged) points. Then, you can draw lines between them:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sample_data = pd.DataFrame({'category': ['a', 'a', 'b', 'b', 'c', 'c'],
                            'correlation': [0.33, 0.58, 0.51, 0.7, 0.49, 0.72],
                            'time': ['before', 'after', 'before', 'after', 'before', 'after']})
ax = sns.pointplot(x='category', y='correlation', hue='time', palette=['skyblue', 'dodgerblue'],
                   linestyles='', dodge=.3, data=sample_data)
for (x0, y0), (x1, y1) in zip(ax.collections[0].get_offsets(), ax.collections[1].get_offsets()):
    ax.plot([x0, x1], [y0, y1], color='black', ls=':', zorder=0)
ax.axhline(0, color='black', ls='--')
ax.set_ylim(-1, 1)
plt.show()
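If reading the dodged positions back from ax.collections feels fragile, a similar picture can be built with plain matplotlib by pivoting the data so that each category carries its own "before" and "after" value. This is only a sketch of an alternative, not the seaborn approach from the answer above; the offset of 0.15 is an arbitrary assumption standing in for dodge.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

sample_data = pd.DataFrame({'category': ['a', 'a', 'b', 'b', 'c', 'c'],
                            'correlation': [0.33, 0.58, 0.51, 0.7, 0.49, 0.72],
                            'time': ['before', 'after', 'before', 'after', 'before', 'after']})

# One row per category, one column per time point.
wide = sample_data.pivot(index='category', columns='time', values='correlation')

offset = 0.15  # horizontal shift playing the role of seaborn's dodge (assumption)
positions = list(range(len(wide)))

fig, ax = plt.subplots()
# Dotted connector between each category's paired points.
for pos, (_, row) in zip(positions, wide.iterrows()):
    ax.plot([pos - offset, pos + offset], [row['before'], row['after']],
            color='black', ls=':', zorder=0)
ax.scatter([p - offset for p in positions], wide['before'],
           color='skyblue', label='before')
ax.scatter([p + offset for p in positions], wide['after'],
           color='dodgerblue', label='after')
ax.set_xticks(positions)
ax.set_xticklabels(wide.index)
ax.set_ylim(-1, 1)
ax.legend(title='time')
```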

Plotly: Choose a different intersection of X and Y axes

In Plotly, in order to create scatter plots, I usually do the following:
fig = px.scatter(df, x=x, y=y)
fig.update_xaxes(range=[2, 10])
fig.update_yaxes(range=[2, 10])
I want the y-axis to intersect the x-axis at x=6. So, instead of the left y-axis representing negative numbers, I want it to be from [2,6]. After the intersection, the right side of the graph is from [6,10].
Likewise, yaxis from below axis goes from [2,6]. Above the xaxis, it goes from [6,10].
How can I do this in Plotly?
Following on from my comment, as far as I am aware, what you're after is not currently available.
However, here is an example of a work-around which uses a shapes dictionary to add horizontal and vertical lines - acting as intersecting axes - placed at your required x/y intersection of 6.
Sample dataset:
import numpy as np
x = (np.random.randn(100)*2)+6
y1 = (np.random.randn(100)*2)+6
y2 = (np.random.randn(100)*2)+6
Example plotting code:
import plotly.io as pio
layout = {'title': 'Intersection of X/Y Axes Demonstration'}
shapes = []
traces = []
traces.append({'x': x, 'y': y1, 'mode': 'markers'})
traces.append({'x': x, 'y': y2, 'mode': 'markers'})
shapes.append({'type': 'line',
               'x0': 2, 'x1': 10,
               'y0': 6, 'y1': 6})
shapes.append({'type': 'line',
               'x0': 6, 'x1': 6,
               'y0': 2, 'y1': 10})
layout['shapes'] = shapes
layout['xaxis'] = {'range': [2, 10]}
layout['yaxis'] = {'range': [2, 10]}
pio.show({'data': traces, 'layout': layout})
Output:
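The two line shapes used in the workaround above can also be generated for any crossing point by a small helper. The sketch below uses plain dicts only; the name crossing_axis_shapes and the black, 1-pixel line styling are assumptions made for illustration.

```python
def crossing_axis_shapes(x_range, y_range, cross_x, cross_y, color="black"):
    """Build two Plotly line shapes that act as axes crossing at (cross_x, cross_y)."""
    x0, x1 = x_range
    y0, y1 = y_range
    style = {'color': color, 'width': 1}
    # Horizontal line spanning the x range at y = cross_y.
    horizontal = {'type': 'line', 'x0': x0, 'x1': x1,
                  'y0': cross_y, 'y1': cross_y, 'line': style}
    # Vertical line spanning the y range at x = cross_x.
    vertical = {'type': 'line', 'x0': cross_x, 'x1': cross_x,
                'y0': y0, 'y1': y1, 'line': style}
    return [horizontal, vertical]


shapes = crossing_axis_shapes((2, 10), (2, 10), cross_x=6, cross_y=6)
layout = {'shapes': shapes,
          'xaxis': {'range': [2, 10]},
          'yaxis': {'range': [2, 10]}}
```

The resulting layout dict can be passed to pio.show() together with the traces, in the same way as in the example plotting code.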
Comments (TL;DR):
The example code shown here uses the low-level Plotly API (plotly.io), rather than a convenience wrapper such as graph_objects or express. The reason is that I (personally) feel it's helpful to users to show what is occurring 'under the hood', rather than masking the underlying code logic with a convenience wrapper.
This way, when the user needs to modify a finer detail of the graph, they will have a better understanding of the lists and dicts which Plotly is constructing for the underlying graphing engine (plotly.js).
I think fig.add_hline() and fig.add_vline() are the functions you need.
Example code
import plotly.express as px
import pandas as pd
df = pd.DataFrame({'x':[6,7,3], 'y':[4,5,6]})
fig = px.scatter(df, x='x', y='y')
fig.update_xaxes(range=[2, 10])
fig.update_yaxes(range=[2, 10])
fig.add_hline(y=4)
fig.add_vline(x=6)
fig.show()
Output

Plotly: Change y-axis scale

I have a dataset that looks like this:
     x         y         z
0  Jan  28446000  110489.0
1  Feb  43267700  227900.0
When I plot a line chart like this:
px.line(data,x = 'x', y = ['y','z'], line_shape = 'spline', title="My Chart")
The y-axis scale runs from 0 to 90 M. The first line on the chart, for y, is good enough. However, the second line always appears to sit at 0 M. What can I do to improve my chart so that we can clearly see how the values of both columns change over the x values?
Is there any way I can normalize the data? Or perhaps I could change the scaling of the chart.
We often work with data on different scales, where rescaling the data would mask a characteristic we wish to display. One way to handle this is to add a secondary y-axis. An example is shown below.
Key points:
Create a layout dictionary object
Add a yaxis2 key to the dict, with the following: 'side': 'right', 'overlaying': 'y1'
This tells Plotly to create a secondary y-axis on the right side of the graph, and to overlay the primary y-axis.
Assign the appropriate trace to the newly created secondary y-axis as: 'yaxis': 'y2'
The other trace does not need to be assigned, as 'y1' is the default y-axis.
Comments (TL;DR):
The example code shown here uses the lower-level Plotly API, rather than a convenience wrapper such as graph_objects or express. The reason is that I (personally) feel it's helpful to users to show what is occurring 'under the hood', rather than masking the underlying code logic with a convenience wrapper.
This way, when the user needs to modify a finer detail of the graph, they will have a better understanding of the lists and dicts which Plotly is constructing for the underlying graphing engine (plotly.js).
The Docs:
Here is a link to the Plotly docs referencing multiple axes.
Example Code:
import pandas as pd
from plotly.offline import iplot
df = pd.DataFrame({'x': ['Jan', 'Feb'],
                   'y': [28446000, 43267700],
                   'z': [110489.0, 227900.0]})
layout = {'title': 'Secondary Y-Axis Demonstration',
          'legend': {'orientation': 'h'}}
traces = []
traces.append({'x': df['x'], 'y': df['y'], 'name': 'Y Values'})
traces.append({'x': df['x'], 'y': df['z'], 'name': 'Z Values', 'yaxis': 'y2'})
# Add config common to all traces.
for t in traces:
    t.update({'line': {'shape': 'spline'}})
layout['yaxis1'] = {'title': 'Y Values', 'range': [0, 50000000]}
layout['yaxis2'] = {'title': 'Z Values', 'side': 'right', 'overlaying': 'y1', 'range': [0, 400000]}
iplot({'data': traces, 'layout': layout})
Graph:

Python Data Visualization (plot, chart) with multiple squares [duplicate]

Something like this:
There is a very good package to do it in R. In python, the best that I could figure out is this, using the squarify package (inspired by a post on how to do treemaps):
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns # just to have better line color and width
import squarify
# for those using jupyter notebooks
%matplotlib inline
df = pd.DataFrame({
    'v1': np.ones(100),
    'v2': np.random.randint(1, 4, 100)})
df.sort_values(by='v2', inplace=True)
# color scale
cmap = mpl.cm.Accent
mini, maxi = df['v2'].min(), df['v2'].max()
norm = mpl.colors.Normalize(vmin=mini, vmax=maxi)
colors = [cmap(norm(value)) for value in df['v2']]
# figure
fig = plt.figure()
ax = fig.add_subplot(111, aspect="equal")
ax = squarify.plot(df['v1'], color=colors, ax=ax)
ax.set_xticks([])
ax.set_yticks([]);
But when I create not 100 but 200 elements (or other non-square numbers), the squares become misaligned.
Another problem is that if I change v2 to some categorical variable (e.g., a hundred As, Bs, Cs and Ds), I get this error:
could not convert string to float: 'a'
So, could anyone help me with these two questions:
how can I solve the alignment problem with non-square numbers of observations?
how can I use categorical variables in v2?
Beyond this, I am really open if there are any other python packages that can create waffle plots more efficiently.
I spent a few days to build a more general solution, PyWaffle.
You can install it through
pip install pywaffle
The source code: https://github.com/gyli/PyWaffle
PyWaffle does not use the matshow() method, but builds the squares one by one, which makes customization easier. It also provides a custom Figure class, which returns a figure object. By updating the attributes of the figure, you can control basically everything in the chart.
Some examples:
Colored or transparent background:
import matplotlib.pyplot as plt
from pywaffle import Waffle
data = {'Democratic': 48, 'Republican': 46, 'Libertarian': 3}
fig = plt.figure(
    FigureClass=Waffle,
    rows=5,
    values=data,
    colors=("#983D3D", "#232066", "#DCB732"),
    title={'label': 'Vote Percentage in 2016 US Presidential Election', 'loc': 'left'},
    labels=["{0} ({1}%)".format(k, v) for k, v in data.items()],
    legend={'loc': 'lower left', 'bbox_to_anchor': (0, -0.4), 'ncol': len(data), 'framealpha': 0}
)
fig.gca().set_facecolor('#EEEEEE')
fig.set_facecolor('#EEEEEE')
plt.show()
Use icons replacing squares:
data = {'Democratic': 48, 'Republican': 46, 'Libertarian': 3}
fig = plt.figure(
    FigureClass=Waffle,
    rows=5,
    values=data,
    colors=("#232066", "#983D3D", "#DCB732"),
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons='child', icon_size=18,
    icon_legend=True
)
Multiple subplots in one chart:
import pandas as pd
data = pd.DataFrame(
    {
        'labels': ['Hillary Clinton', 'Donald Trump', 'Others'],
        'Virginia': [1981473, 1769443, 233715],
        'Maryland': [1677928, 943169, 160349],
        'West Virginia': [188794, 489371, 36258],
    },
).set_index('labels')
fig = plt.figure(
    FigureClass=Waffle,
    plots={
        '311': {
            'values': data['Virginia'] / 30000,
            'labels': ["{0} ({1})".format(n, v) for n, v in data['Virginia'].items()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 8},
            'title': {'label': '2016 Virginia Presidential Election Results', 'loc': 'left'}
        },
        '312': {
            'values': data['Maryland'] / 30000,
            'labels': ["{0} ({1})".format(n, v) for n, v in data['Maryland'].items()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.2, 1), 'fontsize': 8},
            'title': {'label': '2016 Maryland Presidential Election Results', 'loc': 'left'}
        },
        '313': {
            'values': data['West Virginia'] / 30000,
            'labels': ["{0} ({1})".format(n, v) for n, v in data['West Virginia'].items()],
            'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.3, 1), 'fontsize': 8},
            'title': {'label': '2016 West Virginia Presidential Election Results', 'loc': 'left'}
        },
    },
    rows=5,
    colors=("#2196f3", "#ff5252", "#999999"),  # Default argument values for subplots
    figsize=(9, 5)  # figsize is a parameter of plt.figure
)
I've put together a working example, below, which I think meets your needs. Some work is needed to fully generalize the approach, but I think you'll find that it's a good start. The trick was to use matshow() to solve your non-square problem, and to build a custom legend to easily account for categorical values.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
# Let's make a default data frame with categories and values.
df = pd.DataFrame({'catagories': ['cat1', 'cat2', 'cat3', 'cat4'],
                   'values': [84911, 14414, 10062, 8565]})
# Now, we define a desired height and width.
waffle_plot_width = 20
waffle_plot_height = 7
classes = df['catagories']
values = df['values']
def waffle_plot(classes, values, height, width, colormap):
    # Compute the portion of the total assigned to each class.
    class_portion = [float(v)/sum(values) for v in values]
    # Compute the number of tiles for each category.
    total_tiles = width * height
    tiles_per_class = [round(p*total_tiles) for p in class_portion]
    # Make a dummy matrix for use in plotting.
    plot_matrix = np.zeros((height, width))
    # Populate the dummy matrix with integer values.
    class_index = 0
    tile_index = 0
    # Iterate over each tile (use the width argument, not the global).
    for col in range(width):
        for row in range(height):
            tile_index += 1
            # If the number of tiles populated is sufficient for this class...
            if tile_index > sum(tiles_per_class[0:class_index]):
                # ...increment to the next class.
                class_index += 1
            # Set the class value to an integer, which increases with class.
            plot_matrix[row, col] = class_index
    # Create a new figure.
    fig = plt.figure()
    # Using matshow solves your "non-square" problem.
    plt.matshow(plot_matrix, cmap=colormap)
    plt.colorbar()
    # Get the axis.
    ax = plt.gca()
    # Minor ticks
    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)
    # Gridlines based on minor ticks
    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)
    # Manually constructing a legend solves your "categorical" problem.
    legend_handles = []
    for i, c in enumerate(classes):
        lable_str = c + " (" + str(values[i]) + ")"
        color_val = colormap(float(i+1)/len(classes))
        legend_handles.append(mpatches.Patch(color=color_val, label=lable_str))
    # Add the legend. Still a bit of work to do here, to perfect centering.
    plt.legend(handles=legend_handles, loc=1, ncol=len(classes),
               bbox_to_anchor=(0., -0.1, 0.95, .10))
    plt.xticks([])
    plt.yticks([])
# Call the plotting function.
waffle_plot(classes, values, waffle_plot_height, waffle_plot_width,
            plt.cm.coolwarm)
Below is an example of the output this script produced. As you can see, it works fairly well for me, and meets all of your stated needs. Just let me know if it gives you any trouble. Enjoy!
You can use this function for automatic creation of a waffle with simple parameters:
# imports needed by the function below
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
def create_waffle_chart(categories, values, height, width, colormap, value_sign=''):
    # compute the proportion of each category with respect to the total
    total_values = sum(values)
    category_proportions = [(float(value) / total_values) for value in values]
    # compute the total number of tiles
    total_num_tiles = width * height  # total number of tiles
    print('Total number of tiles is', total_num_tiles)
    # compute the number of tiles for each category
    tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]
    # print out number of tiles per category
    for i, tiles in enumerate(tiles_per_category):
        print(str(categories[i]) + ': ' + str(tiles))
    # initialize the waffle chart as an empty matrix
    waffle_chart = np.zeros((height, width))
    # define indices to loop through waffle chart
    category_index = 0
    tile_index = 0
    # populate the waffle chart
    for col in range(width):
        for row in range(height):
            tile_index += 1
            # if the number of tiles populated for the current category
            # is equal to its corresponding allocated tiles...
            if tile_index > sum(tiles_per_category[0:category_index]):
                # ...proceed to the next category
                category_index += 1
            # set the class value to an integer, which increases with class
            waffle_chart[row, col] = category_index
    # instantiate a new figure object
    fig = plt.figure()
    # use matshow to display the waffle chart with the colormap passed in
    plt.matshow(waffle_chart, cmap=colormap)
    plt.colorbar()
    # get the axis
    ax = plt.gca()
    # set minor ticks
    ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
    ax.set_yticks(np.arange(-.5, (height), 1), minor=True)
    # add gridlines based on minor ticks
    ax.grid(which='minor', color='w', linestyle='-', linewidth=2)
    plt.xticks([])
    plt.yticks([])
    # compute cumulative sum of individual categories to match color schemes between chart and legend
    values_cumsum = np.cumsum(values)
    total_values = values_cumsum[len(values_cumsum) - 1]
    # create legend
    legend_handles = []
    for i, category in enumerate(categories):
        if value_sign == '%':
            label_str = category + ' (' + str(values[i]) + value_sign + ')'
        else:
            label_str = category + ' (' + value_sign + str(values[i]) + ')'
        color_val = colormap(float(values_cumsum[i])/total_values)
        legend_handles.append(mpatches.Patch(color=color_val, label=label_str))
    # add legend to chart
    plt.legend(
        handles=legend_handles,
        loc='lower center',
        ncol=len(categories),
        bbox_to_anchor=(0., -0.2, 0.95, .1)
    )
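One detail worth keeping in mind with the matshow-based functions above: each category's tile count is rounded independently, so the counts are not guaranteed to add up to width * height. A possible remedy is the largest-remainder method, sketched below. The function name allocate_tiles is hypothetical and the technique is swapped in for illustration; it is not part of either answer.

```python
import math


def allocate_tiles(values, total_tiles):
    """Split total_tiles across values so that the counts sum exactly to total_tiles."""
    total = sum(values)
    exact = [v / total * total_tiles for v in values]
    # Start from the floor of each exact share.
    tiles = [math.floor(e) for e in exact]
    leftover = total_tiles - sum(tiles)
    # Hand out the remaining tiles to the largest fractional parts first
    # (ties go to the earlier category).
    order = sorted(range(len(values)), key=lambda i: (-(exact[i] - tiles[i]), i))
    for i in order[:leftover]:
        tiles[i] += 1
    return tiles


# Three equal categories on a 10-tile grid: plain rounding gives 3 + 3 + 3 = 9.
rounded = [round(v / 3 * 10) for v in [1, 1, 1]]
balanced = allocate_tiles([1, 1, 1], 10)

# The category values from the matshow example on a 20 x 7 grid.
example = allocate_tiles([84911, 14414, 10062, 8565], 20 * 7)
```

The returned list could replace tiles_per_class or tiles_per_category in the functions above.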

Pyplot scatterplot legend not working with smaller sample sizes

I'm using the code below to generate a scatter plot in pyplot where I'd like to have each of the 9 classes plotted in a different color. There are multiple points within each class.
I cannot figure out why the legend does not work with smaller sample sizes.
def plot_scatter_test(x, y, c, title):
    data = pd.DataFrame({'x': x, 'y': y, 'c': c})
    classes = len(np.unique(c))
    colors = cm.rainbow(np.linspace(0, 1, classes))
    ax = plt.subplot(111)
    for s in range(0,classes):
        ss = data[data['c']==s]
        plt.scatter(x=ss['x'], y=ss['y'],c=colors[s], label=s)
    ax.legend(loc='lower left',scatterpoints=1, ncol=3, fontsize=8, bbox_to_anchor=(0, -.4), title='Legend')
    plt.show()
My data looks like this
When I plot this by calling
plot_scatter_test(test['x'], test['y'],test['group'])
I get varying colors in the chart, but the legend is a single color
So to make sure my data was ok, I created a random dataframe using the same type of data. Now I get different colors, but something is still wrong as they aren't sequential.
test2 = pd.DataFrame({
    'y': np.random.uniform(0,1400,36),
    'x': np.random.uniform(-250,-220,36),
    'group': np.random.randint(0,9,36)
})
plot_scatter_test(test2['x'], test2['y'],test2['group'])
Finally, I create a larger plot of 360 data points, and everything looks the way I would expect it to. What am I doing wrong?
test3 = pd.DataFrame({
    'y': np.random.uniform(0,1400,360),
    'x': np.random.uniform(-250,-220,360),
    'group': np.random.randint(0,9,360)
})
plot_scatter_test(test3['x'], test3['y'],test3['group'])
You need to make sure not to confuse the class itself with the number you use for indexing.
To better observe what I mean, use the following dataset with your function:
np.random.seed(22)
X,Y= np.meshgrid(np.arange(3,7), np.arange(4,8))
test2 = pd.DataFrame({
    'y': Y.flatten(),
    'x': X.flatten(),
    'group': np.random.randint(0,9,len(X.flatten()))
})
plot_scatter_test(test2['x'], test2['y'],test2['group'])
which results in the following plot, where points are missing.
So, make a clear distinction between the index and the class, e.g. as follows
import numpy as np; np.random.seed(22)
import matplotlib.pyplot as plt
import pandas as pd
def plot_scatter_test(x, y, c, title="title"):
    data = pd.DataFrame({'x': x, 'y': y, 'c': c})
    classes = np.unique(c)
    print(classes)
    colors = plt.cm.rainbow(np.linspace(0, 1, len(classes)))
    print(colors)
    ax = plt.subplot(111)
    # iterate over the class values themselves, using i only to pick a color
    for i, clas in enumerate(classes):
        ss = data[data['c']==clas]
        plt.scatter(ss["x"],ss["y"],c=[colors[i]]*len(ss), label=clas)
    ax.legend(loc='lower left',scatterpoints=1, ncol=3, fontsize=8, title='Legend')
    plt.show()
X,Y= np.meshgrid(np.arange(3,7), np.arange(4,8))
test2 = pd.DataFrame({
    'y': Y.flatten(),
    'x': X.flatten(),
    'group': np.random.randint(0,9,len(X.flatten()))
})
plot_scatter_test(test2['x'], test2['y'],test2['group'])
Apart from that it is indeed necessary not to supply the color 4-tuple directly to c as this would be interpreted as four single colors.
I feel silly now after staring at this for a while. The error was in the color being passed. I was passing a single color to the .scatter function. However since there are multiple points, you need to pass an equal number of colors. Therefore
plt.scatter(x=ss['x'], y=ss['y'],c=colors[s], label=s)
Can be something like
plt.scatter(x=ss['x'], y=ss['y'],c=[colors[s]]*len(ss), label=s)
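Another way to sidestep the ambiguity is to pass the single RGBA value through the color keyword instead of c, since color is always read as one color for the whole group. The sketch below combines that with iterating over the class values themselves; the function name plot_by_class and the stand-in data are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_by_class(data):
    """Scatter x against y with one color and one legend entry per class."""
    fig, ax = plt.subplots()
    classes = np.unique(data['group'])
    colors = plt.cm.rainbow(np.linspace(0, 1, len(classes)))
    for color, cls in zip(colors, classes):
        subset = data[data['group'] == cls]
        # 'color' takes one RGBA value for all points of this class.
        ax.scatter(subset['x'], subset['y'], color=color, label=str(cls))
    ax.legend(loc='lower left', scatterpoints=1, ncol=3, fontsize=8, title='Legend')
    return fig, ax


# Stand-in data (assumption): 16 points spread over up to 9 classes.
rng = np.random.default_rng(22)
data = pd.DataFrame({'x': rng.uniform(-250, -220, 16),
                     'y': rng.uniform(0, 1400, 16),
                     'group': rng.integers(0, 9, 16)})
fig, ax = plot_by_class(data)
handles, labels = ax.get_legend_handles_labels()
```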
