I'm currently using Matplotlib to create a histogram:
import matplotlib
import matplotlib.pyplot as pyplot
fig = pyplot.figure()
ax = fig.add_subplot(1,1,1,)
n, bins, patches = ax.hist(measurements, bins=50, range=(graph_minimum, graph_maximum), histtype='bar')
#ax.set_xticklabels([n], rotation='vertical')
for patch in patches:
    patch.set_facecolor('r')  # any color; shown only to illustrate the per-patch loop
pyplot.title('Spam and Ham')
pyplot.xlabel('Time (in seconds)')
pyplot.ylabel('Bits of Ham')
I'd like to make the x-axis labels a bit more meaningful.
Firstly, the x-axis ticks here seem to be limited to five ticks. No matter what I do, I can't seem to change this - even if I add more xticklabels, it only uses the first five. I'm not sure how Matplotlib calculates this, but I assume it's auto-calculated from the range/data?
Is there some way I can increase the resolution of x-tick labels - even to the point of one for each bar/bin?
(Ideally, I'd also like the seconds to be reformatted in micro-seconds/milli-seconds, but that's a question for another day).
Secondly, I'd like each individual bar labeled - with the actual number in that bin, as well as the percentage of the total of all bins.
The final output might look something like this:
Is something like that possible with Matplotlib?
Sure! To set the ticks, just, well... Set the ticks (see matplotlib.pyplot.xticks or ax.set_xticks). (Also, you don't need to manually set the facecolor of the patches. You can just pass in a keyword argument.)
For the rest, you'll need to do some slightly more fancy things with the labeling, but matplotlib makes it fairly easy.
As an example:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FormatStrFormatter
data = np.random.randn(82)
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(data, facecolor='yellow', edgecolor='gray')
# Set the ticks to be at the edges of the bins.
ax.set_xticks(bins)
# Set the xaxis's tick labels to be formatted with 1 decimal place...
ax.xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))
# Change the colors of bars at the edges...
twentyfifth, seventyfifth = np.percentile(data, [25, 75])
for patch, rightside, leftside in zip(patches, bins[1:], bins[:-1]):
    if rightside < twentyfifth:
        patch.set_facecolor('green')
    elif leftside > seventyfifth:
        patch.set_facecolor('red')
# Label the raw counts and the percentages below the x-axis...
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
for count, x in zip(counts, bin_centers):
    # Label the raw counts
    ax.annotate(str(count), xy=(x, 0), xycoords=('data', 'axes fraction'),
                xytext=(0, -18), textcoords='offset points', va='top', ha='center')
    # Label the percentages
    percent = '%0.0f%%' % (100 * float(count) / counts.sum())
    ax.annotate(percent, xy=(x, 0), xycoords=('data', 'axes fraction'),
                xytext=(0, -32), textcoords='offset points', va='top', ha='center')
# Give ourselves some more room at the bottom of the plot
plt.subplots_adjust(bottom=0.15)
plt.show()
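On Matplotlib 3.4 or newer, the same counts and percentages can also be attached to the bars with Axes.bar_label, since hist returns a bar container for the default histtype='bar'. A minimal sketch under that version assumption (the helper name count_percent_labels and the seeded data are made up for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np


def count_percent_labels(counts):
    """Return one 'count<newline>percent' label per bin."""
    total = counts.sum()
    return [f"{int(c)}\n{100 * c / total:.0f}%" for c in counts]


rng = np.random.default_rng(0)   # seeded only so the sketch is repeatable
data = rng.standard_normal(82)

fig, ax = plt.subplots()
counts, bins, bars = ax.hist(data, facecolor='yellow', edgecolor='gray')
ax.set_xticks(bins)              # one tick per bin edge
# bar_label places each label just above its bar
texts = ax.bar_label(bars, labels=count_percent_labels(counts))
```

Calling plt.show() afterwards displays the figure. The labels sit above the bars here rather than below the axis, which avoids the manual offsets but may need extra headroom via ax.set_ylim.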
One thing I wanted to add to histograms plotted with "density = True" was the relative frequency value for each bin. I searched but couldn't find a function that would do that. The solution I made is shown in the image below:
The function:
import numpy as np

def label_densityHist(ax, n, bins, x=4, y=0.01, r=2, **kwargs):
    """
    Add labels (the relative frequency of each bin) to a density histogram.

    :param ax: matplotlib Axes object
        The axes to plot on.
    :param n: list, array of int, float
        The values of the histogram bins.
    :param bins: list, array of int, float
        The edges of the bins.
    :param x: int, float
        Controls the x position of the bin labels. The higher, the lower the value on the x-axis.
        Default: 4
    :param y: int, float
        Controls the y position of the bin labels. The higher, the greater the value on the y-axis.
        Default: 0.01
    :param r: int
        Number of decimal places.
        Default: 2
    :param **kwargs: Text properties in matplotlib
    :return: None

    Examples:
        import matplotlib.pyplot as plt
        import numpy as np
        dados = np.random.randn(100)
        axe = plt.gca()
        n, bins, _ = axe.hist(x=dados, density=True, edgecolor='black')
        label_densityHist(axe, n, bins)

        import matplotlib.pyplot as plt
        import numpy as np
        dados = np.random.randn(100)
        axe = plt.gca()
        n, bins, _ = axe.hist(x=dados, density=True, edgecolor='black')
        label_densityHist(axe, n, bins, x=6, fontsize='large')
    """
    k = []
    # calculate the relative frequency of each bin (density * bin width)
    for i in range(0, len(n)):
        k.append((bins[i + 1] - bins[i]) * n[i])
    # rounded
    k = np.around(k, r)
    # plot the label/text to each bin
    for i in range(0, len(n)):
        x_pos = (bins[i + 1] - bins[i]) / x + bins[i]
        y_pos = n[i] + (n[i] * y)
        label = str(k[i])  # relative frequency of each bin
        ax.text(x_pos, y_pos, label, **kwargs)
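The relative frequency of a bin in a density histogram is its density multiplied by its width, which is why such labels sum to 1. A small check of that relationship with np.histogram (the seeded sample is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.standard_normal(100)

# density=True normalises the histogram so that its total area is 1
density, edges = np.histogram(sample, bins=10, density=True)
rel_freq = density * np.diff(edges)      # density * bin width

# the same numbers obtained directly from the raw counts
counts, _ = np.histogram(sample, bins=10)
rel_freq_from_counts = counts / counts.sum()
```

Both arrays agree up to floating-point rounding, so either route can feed the labels.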
To add SI prefixes to your axis labels you want to use QuantiPhy. In fact, in its documentation it has an example that shows how to do this exact thing: MatPlotLib Example.
I think you would add something like this to your code:
from matplotlib.ticker import FuncFormatter
from quantiphy import Quantity
time_fmtr = FuncFormatter(lambda v, p: Quantity(v, 's').render(prec=2))
ax.xaxis.set_major_formatter(time_fmtr)
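If adding a dependency is not an option, a plain FuncFormatter can do a fixed rescaling instead. A rough sketch, assuming the data is in seconds and milliseconds are wanted (the function name seconds_to_ms and the plotted points are made up here):

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def seconds_to_ms(value, pos=None):
    """Format a tick value given in seconds as milliseconds."""
    return f"{value * 1e3:g} ms"


fig, ax = plt.subplots()
ax.plot([0.0, 0.001, 0.002], [1, 4, 2])   # placeholder data in seconds
ax.xaxis.set_major_formatter(FuncFormatter(seconds_to_ms))
```

Unlike QuantiPhy, this does not pick the prefix automatically; it always shows milliseconds.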
I am using seaborn's FacetGrid to do multiple histogram plots from a dataframe (plot_df) on the parameter - "xyz". But I want to do the following additional things too in those plots,
Create a vertical axes line at x-value = 0
Color all the bins that are equal to or lesser than 0 (on x-axis) with a different shade
Calculate the percentage area of the histogram for only those bins that are below 0 (on x-axis)
I can find lots of examples online, but none that use seaborn's FacetGrid.
g = sns.FacetGrid(plot_df, col='xyz', height=5)
g.map(plt.hist, "slack", bins=50)
You could loop through the generated axes (for xyz, ax in g.axes_dict.items(): ....) and call your plotting functions for each of those axes.
Or, you could call g.map_dataframe(...) with a custom function. That function will need to draw onto the "current ax".
Changing the x and y labels needs to be done after the call to g.map_dataframe(), because seaborn erases the x and y labels at the end of that function.
You can call plt.setp(g.axes, xlabel='data', ylabel='frequency') to set the labels for all the subplots. Or g.set_ylabels('...') to only set the y labels for the "outer" subplots.
Here is some example code to get you started:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
def individual_plot(**kwargs):
    ax = plt.gca() # get the current ax
    data = kwargs['data']['slack'].values
    xmin, xmax = data.min(), data.max()
    bin_width = xmax / 50
    # histogram part > 0
    ax.hist(data, bins=np.arange(0.000001, xmax + 0.001, bin_width), color='tomato')
    # histogram part < 0
    ax.hist(data, bins=-np.arange(0, abs(xmin) + bin_width + 0.001, bin_width)[::-1], color='lime')
    # line at x=0
    ax.axvline(0, color='navy', ls='--')
    # calculate and show part < 0
    percent_under_zero = sum(data <= 0) / len(data) * 100
    ax.text(0.5, 0.98, f'part < 0: {percent_under_zero:.1f} %',
            color='k', ha='center', va='top', transform=ax.transAxes)

# first generate some test data
plot_df = pd.DataFrame({'xyz': np.repeat([*'xyz'], 1000),
                        'slack': np.random.randn(3000) * 10 + np.random.choice([10, 500], 3000, p=[0.9, 0.1])})
g = sns.FacetGrid(plot_df, col='xyz', height=5)
# draw the custom histogram on every facet, then set the labels afterwards
g.map_dataframe(individual_plot)
plt.setp(g.axes, xlabel='data', ylabel='frequency')
plt.show()
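The percentage shown in each facet can also be computed for all facets at once, which is handy for checking the plot. A sketch assuming the same column names as above (xyz, slack); because it counts samples rather than bar areas, it matches the histogram area only when all bins have the same width:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)   # seeded test data, for illustration only
plot_df = pd.DataFrame({
    'xyz': np.repeat(list('xyz'), 1000),
    'slack': rng.standard_normal(3000) * 10 + rng.choice([10, 500], 3000, p=[0.9, 0.1]),
})

# share of samples at or below zero, per facet, in percent
percent_under_zero = plot_df.groupby('xyz')['slack'].apply(lambda s: (s <= 0).mean() * 100)
print(percent_under_zero.round(1))
```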
I have a seaborn heatmap that looks like this:
...generated from a pandas dataframe of randomly generated values a piece of which looks like this:
The values along the y axis are all in the range [0,1], and the ones on the x axis in the range [0,2*pi], and I just want some short floats at regular intervals for my tick labels, but I can only seem to get values that are in my dataframe. When I try specifying the values I want, it doesn't put them in the right place, as seen in the plot above. Here's my code right now. How can I get the axis labels that I tried specifying with xticks and yticks in this code in the correct places (which would be evenly spaced along the axes)?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.mlab import griddata
PHI, COSTH = np.meshgrid(phis, cos_thetas)
THICK = griddata(phis, cos_thetas, thicknesses, PHI, COSTH, interp='linear')
thick_df = pd.DataFrame(THICK, columns=phis, index=cos_thetas)
thick_df = thick_df.sort_index(axis=0, ascending=False)
thick_df = thick_df.sort_index(axis=1)
cmap = sns.cubehelix_palette(start=1.6, light=0.8, as_cmap=True, reverse=True)
yticks = np.array([0,0.2,0.4,0.6,0.8,1.0])
xticks = np.array([0,1,2,3,4,5,6])
g = sns.heatmap(thick_df, linewidth=0, xticklabels=xticks, yticklabels=yticks, square=True, cmap=cmap)
Here's something that should do what you want:
cmap = sns.cubehelix_palette(start=1.6, light=0.8, as_cmap=True, reverse=True)
yticks = np.linspace(0,1,6)
x_end = 6
xticks = np.arange(x_end+1)
ax = sns.heatmap(thick_df, linewidth=0, xticklabels=xticks, yticklabels=yticks[::-1], square=True, cmap=cmap)
You could pass ['{:,.2f}'.format(x) for x in xticks] instead of xticks to get a float with 2 decimals.
Note that I'm reversing the yticklabels because that's what seaborn does: see matrix.py#L138.
Seaborn calculates the tick positions around the same place (e.g.: #L148), for you that amounts to:
# thick_df.T.shape[0] = thick_df.shape[1]
xticks: np.arange(0, thick_df.T.shape[0], 1) + .5
yticks: np.arange(0, thick_df.T.shape[1], 1) + .5
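Since cell i is centred at i + 0.5, a data value can be converted to a tick position by interpolating against the column (or index) values. A sketch of that mapping, assuming the values along the axis are sorted in increasing order; value_to_position is a made-up helper, not a seaborn function, and phis here is a stand-in for thick_df.columns:

```python
import numpy as np


def value_to_position(values, axis_values):
    """Map data values to heatmap coordinates (cell i is centred at i + 0.5)."""
    axis_values = np.asarray(axis_values, dtype=float)
    centres = np.arange(len(axis_values)) + 0.5
    # np.interp needs axis_values to be increasing
    return np.interp(values, axis_values, centres)


phis = np.linspace(0, 2 * np.pi, 50)     # stand-in for thick_df.columns
xtick_values = np.arange(7)              # 0, 1, ..., 6
xtick_positions = value_to_position(xtick_values, phis)
```

The positions could then be applied with ax.set_xticks(xtick_positions) and ax.set_xticklabels(xtick_values) after the sns.heatmap call. For an axis sorted in descending order, as the index is above, the values would need to be reversed before interpolating.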
I want to plot multiple histograms on the same plot and I need to compare the spread of the data. I want to do this by dividing each histogram by its maximum value so all the distributions have the same scale. However, the way matplotlib's histogram function works, I have not found an easy way to do this.
This is because n in
n, bins, patches = ax1.hist(y, bins = 20, histtype = 'step', color = 'k')
Is the number of counts in each bin but I can not repass this to hist since it will recalculate.
I have attempted the norm and density functions but these normalise the area of the distributions, rather than the height of the distribution. I could duplicate n and then repeat the bin edges using the bins output but this is tedious. Surely the hist function must allow for the bins values to be divided by a constant?
Example code is below, demonstrating the problem.
y1 = np.random.randn(100)
y2 = 2*np.random.randn(50)
x1 = np.linspace(1,101,100)
x2 = np.linspace(1,51,50)
gs = plt.GridSpec(1,2, wspace = 0, width_ratios = [3,1])
ax = plt.subplot(gs[0])
ax1 = plt.subplot(gs[1])
ax1.yaxis.set_ticklabels([]) # remove the major ticks
ax.scatter(x1, y1, marker='+',color = 'k')#, c=SNR, cmap=plt.cm.Greys)
ax.scatter(x2, y2, marker='o',color = 'k')#, c=SNR, cmap=plt.cm.Greys)
n1, bins1, patches1 = ax1.hist(y1, bins = 20, histtype = 'step', color = 'k',linewidth = 2, orientation = 'horizontal')
n2, bins2, patched2 = ax1.hist(y2, bins = 20, histtype = 'step', linestyle = 'dashed', color = 'k', orientation = 'horizontal')
I do not know whether matplotlib allows this normalisation by default but I wrote a function to do it myself.
It takes the output of n and bins from plt.hist (as above) and then passes this through the function below.
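def hist_norm_height(n,bins,const):
    ''' Function to normalise bin height by a constant.
        Needs n and bins from np.histogram or ax.hist.'''
    # duplicate every height so each bin has a left and a right point
    n = np.repeat(n,2)
    n = np.float32(n) / const
    # duplicate the inner edges to match: [b0, b1, b1, b2, b2, ...]
    new_bins = [bins[0]]
    new_bins.extend(np.repeat(bins[1:],2))
    return n,new_bins[:-1]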
To plot now (I like step histograms), you pass it to plt.step.
Such as plt.step(new_bins,n). This will give you a histogram with height normalised by a constant.
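Another route that keeps everything inside hist is the weights argument: compute the counts once with np.histogram, then give every sample a weight of 1 / (largest count) so the tallest bin comes out at height 1. A sketch of that idea (the seeded data and variable names are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.standard_normal(100)
y2 = 2 * rng.standard_normal(50)

fig, ax = plt.subplots()
peaks = []
for y, style in [(y1, 'solid'), (y2, 'dashed')]:
    # first pass: plain counts, only to find the tallest bin
    counts, edges = np.histogram(y, bins=20)
    weights = np.full(len(y), 1.0 / counts.max())   # same weight for every sample
    # second pass: reuse the same edges so the binning is identical
    n, _, _ = ax.hist(y, bins=edges, weights=weights, histtype='step',
                      color='k', linestyle=style, orientation='horizontal')
    peaks.append(n.max())
```

Each distribution now peaks at 1, so the spreads can be compared directly regardless of sample size.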
You can assign the argument bins equal to a list of values. Use np.arange() or np.linspace() to generate the values. http://matplotlib.org/api/axes_api.html?highlight=hist#matplotlib.axes.Axes.hist
Slightly different approach set up for comparisons. Could be adapted to the step style:
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
y = []
y.append(np.random.normal(2, 2, size=40))
y.append(np.random.normal(3, 1.5, size=40))
ls = ['dashed','dotted','solid']
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3)
for l, data in zip(ls, y):
    # plain counts
    n, b, p = ax1.hist(data, density=False)
    # (histtype='step', color='k', linestyle=l would give the step style,
    #  but step's too much of a pain to get the bins)
    # area normalised to 1; the old normed=True is density=True in current matplotlib
    ax2.hist(data, density=True)
    # counts again, then rescale every bar so the tallest has height 1
    n, b, p = ax3.hist(data, density=False)
    high = float(max([r.get_height() for r in p]))
    for r in p:
        r.set_height(r.get_height() / high)
ax3.set_ylim(0, 1.05)  # the bars were rescaled after autoscaling, so reset the limits
ax3.set_title('fix height')
a couple outputs:
This can be accomplished using numpy to obtain a priori histogram values, and then plotting them with a bar plot.
import numpy as np
import matplotlib.pyplot as plt
# Define random data and number of bins to use
x = np.random.randn(1000)
bins = 10
# Obtain the bin values and edges using numpy
hist, bin_edges = np.histogram(x, bins=bins, density=True)
# Plot bars with the proper positioning, height, and width.
plt.bar(
    (bin_edges[1:] + bin_edges[:-1]) * .5, hist / hist.max(),
    width=(bin_edges[1] - bin_edges[0]), color="blue")
plt.show()
Hello Python/Matplotlib gurus,
I would like to label the y-axis at a random point where a particular horizontal line is drawn.
My Y-axis should not have any values, and only show major ticks.
To illustrate my request clearly, I will use some screenshots.
What I have currently:
What I want:
As you can see, E1 and E2 are not exactly at the major tick mark. Actually, I know the y-axis values (although they should be hidden, since it's a model graph). I also know the values of E1 and E2.
I would appreciate some help.
Let my code snippet be as follows:
ax3.axis([0,800,0,2500]) #You can see that the major YTick-marks will be at 500 intervals
ax3.plot(x,y) #plot my lines
E1 = 1447
E2 = 2456
all_ticks = ax3.yaxis.get_all_ticks() #method that does not exist. If it did, I would be able to bind labels E1 and E2 to the respective values.
Thank you for the help!
For another graph, I use this code to have various colors for the labels. This works nicely. energy_range, labels_energy, colors_energy are numpy arrays as large as my y-axis, in my case, 2500.
#Modify the labels and colors of the Power y-axis
for i, y in enumerate(energy_range):
    if (i == int(math.floor(E1))):
        labels_energy[i] = '$E_1$'
        colors_energy[i] = 'blue'
    elif (i == int(math.floor(E2))):
        labels_energy[i] = '$E_2$'
        colors_energy[i] = 'green'
#Modify the colour of the energy y-axis ticks
for color,tick in zip(colors_energy,ax3.yaxis.get_major_ticks()):
    print(color, tick)
    if color:
        print(color)
        tick.label1.set_color(color) #set the color property
Full sample with dummy values:
import matplotlib
# matplotlib.use('Agg') #Remote, block show()
import numpy as np
import pylab as pylab
from pylab import *
import math
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
import matplotlib.font_manager as fm
from matplotlib.font_manager import FontProperties
import matplotlib.dates as mdates
from datetime import datetime
import matplotlib.cm as cm
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
from scipy import interpolate
def plot_sketch():
    x = np.arange(0,800,1)
    energy_range = range (0,2500,1) #Power graph y-axis range
    labels_energy = [''] * len(energy_range)
    colors_energy = [''] * len(energy_range)
    #Set Axes ranges
    ax3.axis([0,800,0,2500])
    #Add Energy lines; E=integral(P) dt
    y=[i * P1 for i in x]
    ax3.plot(x,y, color='b')
    y = [i * P2 for i in x[:int(0.3*800)]]
    ax3.plot(x[:int(0.3*800)],y, color='g')
    last_val = y[-1]
    y = [(i * P3 -last_val) for i in x[int(0.3*800):int(0.6*800)]]
    ax3.plot(x[int(0.3*800):int(0.6*800)],y, color='g')
    E1 = x[-1] * P1
    E2 = (0.3 * x[-1]) * P2 + x[-1] * (0.6-0.3) * P3
    #Modify the labels and colors of the Power y-axis
    for i, y in enumerate(energy_range):
        if (i == int(math.floor(E1))):
            labels_energy[i] = '$E_1$'
            colors_energy[i] = 'blue'
        elif (i == int(math.floor(E2))):
            labels_energy[i] = '$E_2$'
            colors_energy[i] = 'green'
    #Modify the colour of the power y-axis ticks
    for color,tick in zip(colors_energy,ax3.yaxis.get_major_ticks()):
        if color:
            tick.label1.set_color(color) #set the color property
    ax3.axhline(energy_range[int(math.floor(E1))], xmin=0, xmax=1, linewidth=0.25, color='b', linestyle='--')
    ax3.axhline(energy_range[int(math.floor(E2))], xmin=0, xmax=0.6, linewidth=0.25, color='g', linestyle='--')
    #Show grid
    ax3.grid(True)
#fig = Sketch graph
fig = plt.figure(num=None, figsize=(14, 7), dpi=80, facecolor='w', edgecolor='k')
fig.canvas.set_window_title('Sketch graph')
ax3 = fig.add_subplot(111) #Energy plot
ax3.set_xlabel('Time (ms)', fontsize=12)
ax3.set_ylabel('Energy (J)', fontsize=12)
pylab.xlim(xmin=0) # start at 0
I think you're looking for the correct transform (check this out). In your case, what I think you want is to simply use the text method with the correct transform kwarg. Try adding this to your plot_sketch function after your axhline calls:
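ax3.text(0, energy_range[int(math.floor(E1))],
         'E1', color='b',
         ha='right', va='center',
         transform=ax3.get_yaxis_transform())
ax3.text(0, energy_range[int(math.floor(E2))],
         'E2', color='g',
         ha='right', va='center',
         transform=ax3.get_yaxis_transform())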
The get_yaxis_transform method returns a 'blended' transform which makes the x values input to the text call be plotted in axes units, and the y data in 'data' units. You can adjust the value of the x-data, (0) to be -0.003 or something if you want a little padding (or you could use a ScaledTranslation transform, but that's generally unnecessary if this is a one-off fix).
You'll probably also want to use the 'labelpad' option for set_ylabel, e.g.:
ax3.set_ylabel('Energy (J)', fontsize=12, labelpad=20)
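Put together as a standalone sketch (the plotted line is a placeholder, and the values of E1 and E2 are taken from the question's snippet, not real data):

```python
import matplotlib.pyplot as plt
import numpy as np

E1, E2 = 1447, 2456

fig, ax = plt.subplots()
x = np.arange(0, 800)
ax.plot(x, 3 * x)                                # placeholder curve
ax.axis([0, 800, 0, 2500])
ax.tick_params(axis='y', labelleft=False)        # keep the tick marks, hide the numbers

labels = []
for value, name, color in [(E1, '$E_1$', 'blue'), (E2, '$E_2$', 'green')]:
    ax.axhline(value, color=color, linestyle='--', linewidth=0.5)
    # x is in axes fraction (0 = left spine), y is in data units
    labels.append(ax.text(-0.01, value, name, color=color, ha='right', va='center',
                          transform=ax.get_yaxis_transform()))
ax.set_ylabel('Energy (J)', labelpad=20)
```

Because the y coordinate stays in data units, the labels follow the lines if the y limits change later.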
I think my answer to a different post might be of help to you:
Matplotlib: Add strings as custom x-ticks but also keep existing (numeric) tick labels? Alternatives to matplotlib.pyplot.annotate?
It also works for the y-axis. Here is the result:
When drawing a dot plot using matplotlib, I would like to offset overlapping datapoints to keep them all visible. For example, if I have:
CategoryA: 0,0,3,0,5
CategoryB: 5,10,5,5,10
I want each of the CategoryA "0" datapoints to be set side by side, rather than right on top of each other, while still remaining distinct from CategoryB.
In R (ggplot2) there is a "jitter" option that does this. Is there a similar option in matplotlib, or is there another approach that would lead to a similar result?
Edit: to clarify, the "beeswarm" plot in R is essentially what I have in mind, and pybeeswarm is an early but useful start at a matplotlib/Python version.
Edit: to add that Seaborn's Swarmplot, introduced in version 0.7, is an excellent implementation of what I wanted.
Extending the answer by @user2467675, here’s how I did it:
import numpy as np
import matplotlib.pyplot as plt

def rand_jitter(arr):
    arr = np.asarray(arr, dtype=float)
    stdev = .01 * (arr.max() - arr.min())
    return arr + np.random.randn(len(arr)) * stdev

def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, **kwargs):
    return plt.scatter(rand_jitter(x), rand_jitter(y), s=s, c=c, marker=marker, cmap=cmap, norm=norm, vmin=vmin, vmax=vmax, alpha=alpha, linewidths=linewidths, **kwargs)
The stdev variable makes sure that the jitter is enough to be seen on different scales, but it assumes that the limits of the axes are zero and the max value.
You can then call jitter instead of scatter.
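For categorical dot plots it is often enough to jitter only the category axis, so the measured values stay exact. A sketch using the data from the question (the helper jitter_positions and the spread of 0.05 are arbitrary choices for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np


def jitter_positions(centre, n, spread=0.05, rng=None):
    """Return n x-positions scattered around `centre`."""
    rng = np.random.default_rng() if rng is None else rng
    return centre + rng.normal(0.0, spread, size=n)


categories = {'CategoryA': [0, 0, 3, 0, 5], 'CategoryB': [5, 10, 5, 5, 10]}
rng = np.random.default_rng(4)   # seeded only so the sketch is repeatable

fig, ax = plt.subplots()
for i, (name, values) in enumerate(categories.items()):
    # category i sits at x = i; only the x-coordinate is perturbed
    ax.scatter(jitter_positions(i, len(values), rng=rng), values, label=name)
ax.set_xticks(range(len(categories)))
ax.set_xticklabels(list(categories))
```

Random jitter does not guarantee that points never overlap; it only makes exact overlaps unlikely.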
Seaborn provides histogram-like categorical dot-plots through sns.swarmplot() and jittered categorical dot-plots via sns.stripplot():
import seaborn as sns
sns.set(style='ticks', context='talk')
iris = sns.load_dataset('iris')
sns.swarmplot(x='species', y='sepal_length', data=iris)
sns.stripplot(x='species', y='sepal_length', data=iris, jitter=0.2)
I used numpy.random to "scatter/beeswarm" the data along X-axis but around a fixed point for each category, and then basically do pyplot.scatter() for each category:
import matplotlib.pyplot as plt
import numpy as np
#random data for category A, B, with B "taller"
yA, yB = np.random.randn(100), 5.0+np.random.randn(1000)
xA, xB = (np.random.normal(1, 0.1, len(yA)),
          np.random.normal(3, 0.1, len(yB)))
plt.scatter(xA, yA)
plt.scatter(xB, yB)
One way to approach the problem is to think of each 'row' in your scatter/dot/beeswarm plot as a bin in a histogram:
data = np.random.randn(100)
width = 0.8 # the maximum width of each 'row' in the scatter plot
xpos = 0 # the centre position of the scatter plot in x
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
yvals = centres.repeat(counts)
max_offset = width / counts.max()
offsets = np.hstack([np.arange(cc) - 0.5 * (cc - 1) for cc in counts])
xvals = xpos + (offsets * max_offset)
fig, ax = plt.subplots(1, 1)
ax.scatter(xvals, yvals, s=30, c='b')
This obviously involves binning the data, so you may lose some precision. If you have discrete data, you could replace:
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
with:
centres, counts = np.unique(data, return_counts=True)
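Applied to the CategoryA values from the question, the discrete variant gives three points side by side at y = 0 and single points at 3 and 5. A small worked sketch of just the coordinate computation:

```python
import numpy as np

data = np.array([0, 0, 3, 0, 5])   # CategoryA from the question
width = 0.8                        # the maximum width of each 'row'
xpos = 0                           # the centre position in x

# one 'row' per distinct value instead of per histogram bin
centres, counts = np.unique(data, return_counts=True)
yvals = centres.repeat(counts)
max_offset = width / counts.max()
# symmetric integer offsets around zero for each row
offsets = np.hstack([np.arange(cc) - 0.5 * (cc - 1) for cc in counts])
xvals = xpos + offsets * max_offset

print(yvals)     # [0 0 0 3 5]
print(offsets)   # [-1.  0.  1.  0.  0.]
```

Passing xvals and yvals to ax.scatter then draws the points without any loss of precision in y.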
An alternative approach that preserves the exact y-coordinates, even for continuous data, is to use a kernel density estimate to scale the amplitude of random jitter in the x-axis:
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
density = kde(data) # estimate the local density at each datapoint
# generate some random jitter between -0.5 and 0.5
jitter = np.random.rand(*data.shape) - 0.5
# scale the jitter by the KDE estimate and add it to the centre x-coordinate
xvals = 1 + (density * jitter * width * 2)
ax.scatter(xvals, data, s=30, c='g')
for sp in ['top', 'bottom', 'right']:
    ax.spines[sp].set_visible(False)
ax.tick_params(top=False, bottom=False, right=False)
ax.set_xticks([0, 1])
ax.set_xticklabels(['Histogram', 'KDE'], fontsize='x-large')
This second method is loosely based on how violin plots work. It still cannot guarantee that none of the points are overlapping, but I find that in practice it tends to give quite nice-looking results as long as there are a decent number of points (>20), and the distribution can be reasonably well approximated by a sum-of-Gaussians.
I don't know of a direct mpl alternative, so here is a very rudimentary proposal:
from matplotlib import pyplot as plt
from itertools import groupby
CA = [0,4,0,3,0,5]
CB = [0,0,4,4,2,2,2,2,3,0,5]
x = []
y = []
for indx, klass in enumerate([CA, CB]):
    klass = groupby(sorted(klass))
    for item, objt in klass:
        objt = list(objt)
        points = len(objt)
        # start left of the category centre so the group stays centred
        pos = 1 + indx + (1 - points) / 50.
        for item in objt:
            x.append(pos)
            y.append(item)
            pos += 0.04
plt.plot(x, y, 'o')
plt.show()
Seaborn's swarmplot seems like the most apt fit for what you have in mind, but you can also jitter with Seaborn's regplot:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.swarmplot(x='species', y='sepal_length', data=iris)
sns.regplot(x='sepal_length',
            y='sepal_width',
            data=iris,
            fit_reg=False,  # do not fit a regression line
            x_jitter=0.1,  # could also dynamically set this with range of data
            scatter_kws={'alpha': 0.5})  # set transparency to 50%
Extending the answer by @wordsforthewise (sorry, can't comment with my reputation), if you need both jitter and the use of hue to color the points by some categorical variable (like I did), Seaborn's lmplot is a great choice instead of regplot:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False, x_jitter=0.1, y_jitter=0.1)