Plotting a histogram with text using Python

I am trying to plot a histogram in Python and add text in the upper right corner.
Here I am creating and plotting the histogram:
sample = stats.poisson.rvs(loc=0, mu=lamda, size=10001)
plt.hist(sample)
pd.DataFrame(sample).hist(bins=58,
                          figsize=(9, 9),
                          edgecolor="k", linewidth=1)
Now, I am trying to add the mean and median in the upper right corner:
plt.text(0.8, 0.9, s = 'mean = {0}'.format(round(np.mean(sample), 2)))
plt.text(0.8, 0.8, s = 'median = {0}'.format(np.median(sample)))
and here is the screenshot of the output:
As you can see, the x and y values of the text are interpreted as data coordinates.
How can I pass relative x and y values (to place the text in the upper right corner)?

You need to specify which coordinate system you want to use; otherwise, the data coordinate system is used by default.
In your case you want ax.transAxes.
ax = plt.gca()  # get a reference to the current Axes
plt.text(0.8, 0.9, s='mean = {0}'.format(round(np.mean(sample), 2)), transform=ax.transAxes)
plt.text(0.8, 0.8, s='median = {0}'.format(np.median(sample)), transform=ax.transAxes)
I suggest you read the matplotlib transforms tutorial.
You can also find an example in the matplotlib text documentation
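Putting it together, here is a minimal self-contained sketch (my own assembly of the pieces above; lamda = 5 is an arbitrary example value):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

lamda = 5  # arbitrary example value; use your own rate parameter
sample = stats.poisson.rvs(loc=0, mu=lamda, size=10001)

fig, ax = plt.subplots(figsize=(9, 9))
ax.hist(sample, bins=58, edgecolor="k", linewidth=1)
# with transform=ax.transAxes, (0, 0) is the lower-left corner of the axes
# and (1, 1) the upper-right, independent of the data limits
ax.text(0.8, 0.9, 'mean = {0}'.format(round(np.mean(sample), 2)), transform=ax.transAxes)
ax.text(0.8, 0.8, 'median = {0}'.format(np.median(sample)), transform=ax.transAxes)
plt.show()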

Related

How do you set the coordinates of added annotations on a figure when the x axis is categorical?

I have made a plot with two different categories, each subdivided into three different groups. I have calculated the mean and median for each of these groups, but when I try to annotate the figure with these numbers, they end up printing on top of each other, when I want each violin within the plot to be annotated with its respective mean and median.
So my code to make this plot currently looks like this:
fig = px.violin(CVs,
                y="cv %",
                x="group",
                color="method",
                box=True,
                points=False,
                hover_data=CVs.columns)
for i in CVs['method'].unique():
    for j in CVs['group'].unique():
        mean, median = np.round(CVs.loc[CVs['method']==i].agg({'cv %':['mean', 'median']}), 2)['cv %'].values
        fig.add_annotation(x=j, y=0,
                           yshift=-65,
                           text="Mean: {}%".format(mean),
                           font=dict(size=10),
                           showarrow=False)
        fig.add_annotation(x=j, y=0,
                           yshift=-75,
                           text="Median: {}%".format(median),
                           font=dict(size=10),
                           showarrow=False)
fig.update_traces(meanline_visible=True)
fig.update_layout(template='plotly_white', yaxis_zeroline=False, height=fig_height, width=fig_width)
iplot(fig)
From what I have read in the documentation (https://plotly.com/python/text-and-annotations/), it seems like you need to indicate the coordinates of the added annotation using the parameters x and y.
I have tried to adhere to these parameters by setting y to 0 (since the y axis is numerical), and setting x to the pertinent group along the x axis (which is categorical). However, as one can tell from the plot above, this doesn't seem to work. I have also tried setting x to a value that increments with each iteration of the for loop, but none of the values I have tried (e.g. 1, 10, 0.1) have worked; the annotations keep printing on top of each other, just at different places along the x axis.
I want to have one set of annotations under each figure. Does anyone know how I can set this up?
Based on what you used (yshift) to adjust the annotations, I have done the same using xshift to move each of the labels below its respective plot. Note that fig_height and fig_width were not provided, so I let plotly choose the size. You may need to adjust the offset a bit if your figure size is different. Hope this works.
CVs = px.data.tips()  ## Used the tips dataset
CVs.rename(columns={'sex': 'group', 'day': 'method', 'total_bill': 'cv %'}, inplace=True)  ## Renamed to the names you have
CVs = CVs[CVs.method != 'Thur']  ## Removed one day, as there were 4 days in tips
fig = px.violin(CVs,
                y="cv %",
                x="group",
                color="method",
                box=True,
                points=False,
                hover_data=CVs.columns)
x_shift = -100  ## Start at -100 (pixels) to the left of the j location
for i in CVs['method'].unique():
    for j in CVs['group'].unique():
        mean, median = np.round(CVs.loc[CVs['method']==i].agg({'cv %':['mean', 'median']}), 2)['cv %'].values
        fig.add_annotation(x=j, y=0,
                           yshift=-65, xshift=x_shift,
                           text="Mean: {}%".format(mean),
                           font=dict(size=10),
                           showarrow=False)
        fig.add_annotation(x=j, y=0,
                           yshift=-75, xshift=x_shift,
                           text="Median: {}%".format(median),
                           font=dict(size=10),
                           showarrow=False)
    x_shift = x_shift + 100  ## After each entry (healthy/sick in your case), add 100
fig.update_traces(meanline_visible=True)
fig.update_layout(template='plotly_white', yaxis_zeroline=False)#, height=fig_height, width=fig_width)
#iplot(fig)
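Since xshift is a pixel offset, the labels are simply nudged left or right of each group's tick, which works fine on a categorical axis. A small variant of the loop above (my sketch, not part of the original answer) derives the offset from the method index instead of mutating x_shift, and also filters by group so each annotation shows that group's own statistics (the aggregation above filters by method only):
for k, i in enumerate(CVs['method'].unique()):
    x_shift = -100 + 100 * k  # -100, 0, 100, ... pixels around each group tick
    for j in CVs['group'].unique():
        sub = CVs[(CVs['method'] == i) & (CVs['group'] == j)]
        mean, median = np.round(sub['cv %'].agg(['mean', 'median']), 2).values
        fig.add_annotation(x=j, y=0, yshift=-65, xshift=x_shift,
                           text="Mean: {}%".format(mean),
                           font=dict(size=10), showarrow=False)
        fig.add_annotation(x=j, y=0, yshift=-75, xshift=x_shift,
                           text="Median: {}%".format(median),
                           font=dict(size=10), showarrow=False)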

Is there a way I can find the range of local maxima of histogram?

I'm wondering if there's a way I can find the range of local maxima of a histogram. For instance, suppose I have the following histogram (just ignore the orange curve):
The histogram is actually obtained from a dictionary. I'm hoping to find the range of local maxima of this histogram (on the horizontal axis), which are, say, 1.3-1.6, and 2.1-2.4 in this case. I have no idea which tools would be helpful or which techniques I may want to use. I know there's a tool to find local maxima of a 1-D array:
from scipy.signal import argrelextrema
x = np.random.random(12)
argrelextrema(x, np.greater)
but I don't think it would work here, since I'm looking for a range and there are some 'wiggles' in the histogram. Can anyone give me some suggestions/examples of how I can obtain the ranges I'm looking for? Thanks a lot for the help.
PS: I'm trying not to just search for the ranges of x whose y values are above a certain limit :)
I don't know if I correctly understand what you want to do, but you can treat the histogram as a Probability Density Function (PDF) of a bimodal distribution, then find the modes and the Highest Density Intervals (HDIs) around the two modes.
So, I create some sample data
import numpy as np
import pandas as pd
import scipy.stats as sps
from scipy.signal import find_peaks, argrelextrema
import matplotlib.pyplot as plt
d1 = sps.norm(loc=1.3, scale=.2)
d2 = sps.norm(loc=2.2, scale=.3)
r1 = d1.rvs(size=5000, random_state=1)
r2 = d2.rvs(size=5000, random_state=1)
r = np.concatenate((r1, r2))
h = plt.hist(r, bins=100, density=True);
We have only h, the result of the hist function, which contains the densities (100 values) and the bin edges (101 values).
print(h[0].size)
100
print(h[1].size)
101
So we first need to compute the center of each bin:
density = h[0]
values = h[1][:-1] + np.diff(h[1])[0] / 2
plt.hist(r, bins=100, density=True, alpha=.25)
plt.plot(values, density);
Now we can normalize the PDF (to sum to 1) and smooth the data with a moving average that we'll use only to locate the peaks (maxima) and minima:
norm_density = density / density.sum()
norm_density_ma = pd.Series(norm_density).rolling(7, center=True).mean().values
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density);
Now we can obtain the indexes of the maxima:
peaks = find_peaks(norm_density_ma)[0]
peaks
array([24, 57])
and minima
minima = argrelextrema(norm_density_ma, np.less)[0]
minima
array([40])
and check they're correct
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
for peak in peaks:
    plt.axvline(values[peak], color='r')
plt.axvline(values[minima], color='k', ls='--');
Finally, we have to find the HDIs around the two modes (peaks) from the normalized histogram data. We can use a simple function to get the HDI of a grid (see HDI_of_grid below for details, and Doing Bayesian Data Analysis by John K. Kruschke):
def HDI_of_grid(probMassVec, credMass=0.95):
    sortedProbMass = np.sort(probMassVec, axis=None)[::-1]
    HDIheightIdx = np.min(np.where(np.cumsum(sortedProbMass) >= credMass))
    HDIheight = sortedProbMass[HDIheightIdx]
    HDImass = np.sum(probMassVec[probMassVec >= HDIheight])
    idx = np.where(probMassVec >= HDIheight)[0]
    return {'indexes': idx, 'mass': HDImass, 'height': HDIheight}
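As a quick illustration of what the function returns, here is a toy run (my own example, not from the original answer):
# mass vector [.1, .2, .4, .2, .1] with credMass=0.5:
# the sorted masses [.4, .2, .2, .1, .1] reach 0.5 cumulative at the 2nd element,
# so HDIheight = 0.2 and every bin with mass >= 0.2 is in the interval
HDI_of_grid(np.array([.1, .2, .4, .2, .1]), credMass=.5)
# -> {'indexes': array([1, 2, 3]), 'mass': ~0.8, 'height': 0.2}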
Let's say we want the HDIs to contain a mass of 0.3
# HDI around the 1st mode
hdi1 = HDI_of_grid(norm_density, credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
plt.fill_between(
    values[hdi1['indexes']],
    0, norm_density[hdi1['indexes']],
    alpha=.25
)
for peak in peaks:
    plt.axvline(values[peak], color='r')
For the 2nd mode, we'll get the HDI starting from the minimum, to avoid the 1st mode:
# HDI around the 2nd mode
hdi2 = HDI_of_grid(norm_density[minima[0]:], credMass=.3)
plt.plot(values, norm_density_ma)
plt.plot(values, norm_density)
plt.fill_between(
    values[hdi1['indexes']],
    0, norm_density[hdi1['indexes']],
    alpha=.25
)
plt.fill_between(
    values[hdi2['indexes']+minima],
    0, norm_density[hdi2['indexes']+minima],
    alpha=.25
)
for peak in peaks:
    plt.axvline(values[peak], color='r')
And we have the values of the two HDIs
# 1st mode
values[peaks[0]]
1.320249129265321
# 0.3 HDI
values[hdi1['indexes']].take([0, -1])
array([1.12857599, 1.45715851])
# 2nd mode
values[peaks[1]]
2.2238510564735363
# 0.3 HDI
values[hdi2['indexes']+minima].take([0, -1])
array([1.95003229, 2.47028795])

Pyplot - show x-axis labels according to y-axis value

I have a 1 min 20 s long video recorded at 23.813 FPS. More precisely, I have 1923 frames in which I've been scanning for desired features. I've detected some specific behavior via a neural network, and using a chosen metric I calculated a value for each frame.
So, now, I have X-Y values to plot a graph:
X: time (each step of size 0.041993869 s)
Y: a value measured by the neural network
In the default state, the plot looks like this:
So, I've tried to limit the number of bins in the hope that the ticks would be spread over all my values. But they are not. As you can see, only the first fifteen x-values are rendered:
pyplot.locator_params(axis='x', nbins=15)
But neither is the desired state. The desired state should render only the labels of those x-values whose y-value is higher than e.g. 1.2. So, it should look like this:
Is it possible to achieve such a result?
Code:
# draw plot
from pandas import read_csv
from matplotlib import pyplot
test_video_fps = 23.813
df = read_csv('/path/to/csv/file/file.csv', header=None)
df.columns = ['anomaly']
df['time'] = [round((i + 1) / test_video_fps, 2) for i in range(df.shape[0])]
axes = df.plot.bar(x='time', y='anomaly', rot=0)
# pyplot.locator_params(axis='x', nbins=15)
# axes.get_xaxis().set_visible(False)
fig = pyplot.gcf()
fig.set_size_inches(16, 10)
fig.savefig('/path/to/output/plot.png', dpi=100)
# pyplot.show()
Example:
Simple example with a subset of original data.
0.379799
0.383786
0.345488
0.433286
0.469474
0.431993
0.474253
0.418843
0.491070
0.447778
0.384890
0.410994
0.898229
1.872756
2.907009
3.691382
4.685749
4.599612
3.738768
8.043357
7.660785
2.311198
1.956096
2.877326
3.467511
3.896339
4.250552
6.485533
7.452986
7.103761
2.684189
2.516134
1.512196
1.435303
0.852047
0.842551
0.957888
0.983085
0.990608
1.046679
1.082040
1.119655
0.962391
1.263255
1.371034
1.652812
2.160451
2.646674
1.460051
1.163745
0.938030
0.862976
0.734119
0.567076
0.417270
Desired plot:
Your question has become a two-part problem, but it is interesting enough that I will answer both.
I will answer this using Matplotlib's object-oriented notation with numpy data rather than pandas. This will make things easier to explain, and can be easily generalized to pandas.
I will assume that you have the following two data arrays:
dt = 0.041993869
x = np.arange(0.0, 15 * dt, dt)
y = np.array([1., 1.1, 1.3, 7.6, 2.4, 0.8, 0.7, 0.8, 1.0, 1.5, 10.0, 4.5, 3.2, 0.9, 0.7])
Part 1: Identifying the locations where you want labels
The data can be masked to get the locations of the peaks:
mask = y > 1.2
Consecutive peaks can be easily eliminated by computing the diff. A diff of a boolean mask will be True at the locations where the mask changes sense. You will then have to take every other element to get the locations where it goes from False to True. The following code will capture all the corner cases where you start with a peak or end in the middle of a peak:
d = np.flatnonzero(np.diff(mask))
if mask[d[0]]:  # First diff is end of peak: True to False
    d = np.concatenate(([0], d[1::2] + 1))
else:
    d = d[::2] + 1
d is now an array of indices into x and y that represent the first element of each run of peaks. You can get the last element of each run by swapping the indices [1::2] and [::2] in the if-else statement and removing the + 1 in both cases.
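For completeness, here is a mirrored sketch of that variant (my own, not from the original answer), which also closes a peak that runs to the end of the data:
e = np.flatnonzero(np.diff(mask))
if mask[e[0]]:   # series starts inside a peak: the first transition ends it
    ends = e[::2]
else:
    ends = e[1::2]
if mask[-1]:     # series ends inside a peak: close the final run
    ends = np.concatenate((ends, [len(mask) - 1]))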
The locations of the labels are now simply x[d].
Part 2: Locating and formatting the labels
For this part, you will need to access Matplotlib's object-oriented API via the Axes object you are plotting on. You already have this in the pandas form, making the transfer easy. Here is a sample in raw Matplotlib:
fig, axes = plt.subplots()
axes.plot(x, y)
Now use the ticker API to easily set the locations and labels. You actually set the locations directly (not with a Locator) since you have a very fixed list of ticks:
axes.set_xticks(x[d])
axes.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:0.01g}s'))
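Putting all the pieces together into one runnable sketch (the imports and the final plt.show() are my additions; everything else is as above):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import ticker

dt = 0.041993869
x = np.arange(0.0, 15 * dt, dt)
y = np.array([1., 1.1, 1.3, 7.6, 2.4, 0.8, 0.7, 0.8,
              1.0, 1.5, 10.0, 4.5, 3.2, 0.9, 0.7])

# Part 1: first index of each run of y-values above the threshold
mask = y > 1.2
d = np.flatnonzero(np.diff(mask))
if mask[d[0]]:  # first diff is the end of a peak: True to False
    d = np.concatenate(([0], d[1::2] + 1))
else:
    d = d[::2] + 1

# Part 2: ticks only at the start of each peak, formatted as seconds
fig, axes = plt.subplots()
axes.plot(x, y)
axes.set_xticks(x[d])
axes.xaxis.set_major_formatter(ticker.StrMethodFormatter('{x:0.01g}s'))
plt.show()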
For the sample data shown here, you get:

Path 'contains_points' yields incorrect results with Bezier curve

I am trying to select a region of data based on a matplotlib Path object, but when the path contains a Bezier curve (not just straight lines), the selected region doesn't completely fill in the curve. It looks like it's trying, but the far side of the curve gets chopped off.
For example, the following code defines a fairly simple closed path with one straight line and one cubic curve. When I look at the True/False result from the contains_points method, it does not seem to match either the curve itself or the raw vertices.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.path import Path
from matplotlib.patches import PathPatch
# Make the Path
verts = [(1.0, 1.5), (-2.0, 0.25), (-1.0, 0.0), (1.0, 0.5), (1.0, 1.5)]
codes = [Path.MOVETO, Path.CURVE4, Path.CURVE4, Path.CURVE4, Path.CLOSEPOLY]
path1 = Path(verts, codes)
# Make a field with points to select
nx, ny = 101, 51
x = np.linspace(-2, 2, nx)
y = np.linspace(0, 2, ny)
yy, xx = np.meshgrid(y, x)
pts = np.column_stack((xx.ravel(), yy.ravel()))
# Construct a True/False array of contained points
tf = path1.contains_points(pts).reshape(nx, ny)
# Make a PathPatch for display
patch1 = PathPatch(path1, facecolor='c', edgecolor='b', lw=2, alpha=0.5)
# Plot the true/false array, the patch, and the vertices
fig, ax = plt.subplots()
ax.imshow(tf.T, origin='lower', extent=(x[0], x[-1], y[0], y[-1]))
ax.add_patch(patch1)
ax.plot(*zip(*verts), 'ro-')
plt.show()
This gives me this plot:
It looks like there is some sort of approximation going on - is this just a fundamental limitation of the calculation in matplotlib, or am I doing something wrong?
I can calculate the points inside the curve myself, but I was hoping to not reinvent this wheel if I don't have to.
It's worth noting that a simpler construction using quadratic curves does appear to work properly:
I am using matplotlib 2.0.0.
This has to do with the space in which the paths are evaluated, as explained in GitHub issue #6076. From a comment by mdboom there:
Path intersection is done by converting the curves to line segments and then converting the intersection based on the line segments. This conversion happens by "sampling" the curve at increments of 1.0. This is generally the right thing to do when the paths are already scaled in display space, because sampling the curve at a resolution finer than a single pixel doesn't really help. However, when calculating the intersection in data space as you've done here, we obviously need to sample at a finer resolution.
This is discussing intersections, but contains_points is also affected. This enhancement is still open so we'll have to see if it is addressed in the next milestone. In the meantime, there are a couple options:
1) If you are going to be displaying a patch anyway, you can use the display transformation. In the example above, adding the following demonstrates the correct behavior (based on a comment by tacaswell on duplicate issue #8734, now closed):
# Work in transformed (pixel) coordinates
hit_patch = path1.transformed(ax.transData)
tf1 = hit_patch.contains_points(ax.transData.transform(pts)).reshape(nx, ny)
ax.imshow(tf1.T, origin='lower', extent=(x[0], x[-1], y[0], y[-1]))
2) If you aren't using a display and just want to calculate using a path, the best bet is to simply form the Bezier curve yourself and make a path out of line segments. Replacing the formation of path1 with the following calculation of path2 will produce the desired result.
from scipy.special import binom
def bernstein(n, i, x):
    coeff = binom(n, i)
    return coeff * (1-x)**(n-i) * x**i

def bezier(ctrlpts, nseg):
    x = np.linspace(0, 1, nseg)
    outpts = np.zeros((nseg, 2))
    n = len(ctrlpts) - 1
    for i, point in enumerate(ctrlpts):
        outpts[:, 0] += bernstein(n, i, x) * point[0]
        outpts[:, 1] += bernstein(n, i, x) * point[1]
    return outpts
verts1 = [(1.0, 1.5), (-2.0, 0.25), (-1.0, 0.0), (1.0, 0.5), (1.0, 1.5)]
nsegments = 31
verts2 = np.concatenate([bezier(verts1[:4], nsegments), np.array([verts1[4]])])
codes2 = [Path.MOVETO] + [Path.LINETO]*(nsegments-1) + [Path.CLOSEPOLY]
path2 = Path(verts2, codes2)
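With path2 in place, the data-space test from the question now fills the curve as expected (a quick check of my own, reusing pts, nx, and ny from the question):
tf2 = path2.contains_points(pts).reshape(nx, ny)
ax.imshow(tf2.T, origin='lower', extent=(x[0], x[-1], y[0], y[-1]))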
Either method yields something that looks like the following:

Adding hexbin plots together

I have four hexbin plots which have all been normalized. How do I add them together to make one big distribution?
I have tried concatenating the input vectors and then creating the hexbin plot, but this throws off the normalization of the individual distributions:
So how do I add the individual hexbin distributions while still maintaining the individual normalization?
The relevant part of my code is:
def hex_plot(x, y, max_v):
    bounds = [0, max_v*m.exp(-(3**2)/2), max_v*m.exp(-2), max_v*m.exp(-0.5), max_v]  # The sigma bounds
    norm = mpl.colors.BoundaryNorm(bounds, ncolors=4)
    hex_ = plt.hexbin(x, y, C=None, gridsize=gridsize, reduce_C_function=np.mean, cmap=cmap, mincnt=1, norm=norm)
    print("Hex plot max: ", hex_.norm.vmax)
    return hex_
gridsize=50
cmap = mpl.colors.ListedColormap(['grey','#6A92D4','#1049A9','#052C6E'])
hex_plot(x_tot,y_tot,34840)
Thank you.
I've written a bit of code that does what you're after. From the snippet in your question, it looks like you already know the height (max_v) of your distribution given your binning scheme, so I worked under that assumption. Depending on the data you're applying this to, this might not actually be the case, in which case the following will fail (it's only as good as your guess/knowledge of the height of the distributions). For the purposes of my example data, I've just taken a reasonable guess (based on a quick plot) for the values of max_v1 and max_v2. Switching the c1 and c2 I've defined for the commented versions should reproduce your original problem.
import numpy as np
import matplotlib.pyplot as pyplot
import matplotlib.colors
import math
#need to know the height of the distributions a priori
max_v1 = 850 #approximate height of distribution 1 (defined below) with the binning defined below
max_v2 = 400 #approximate height of distribution 2 (defined below) with the binning defined below
max_v = max(max_v1, max_v2)
#make 2 differently sized datasets (so they will require different normalizations)
#all normal distributions with assorted means/variances
x1 = np.random.randn(50000)/6.0 + 0.5
y1 = np.random.randn(50000)/3.0 + 0.5
x2 = np.random.randn(100000)/2.0 - 0.5
y2 = np.random.randn(100000)/2.0 - 0.5
#c1 = np.ones(len(x1)) #I don't assign meaningful weights here
#c2 = np.ones(len(x2)) #I don't assign meaningful weights here
c1 = np.ones(len(x1))*(max_v/max_v1) #highest distribution: no net change in normalization here
c2 = np.ones(len(x2))*(max_v/max_v2) #renormalized to the same height as the highest distribution
#define plot boundaries
xmin = -2.0
xmax = 2.0
ymin = -2.0
ymax = 2.0
#custom colormap
cmap = matplotlib.colors.ListedColormap(['grey','#6A92D4','#1049A9','#052C6E'])
#the bounds of the 1 sigma, 2 sigma, etc. regions
bounds = [0, max_v*math.exp(-(3**2)/2), max_v*math.exp(-2), max_v*math.exp(-0.5), max_v]
norm = matplotlib.colors.BoundaryNorm(bounds, ncolors=4)
#make the hexbin plot: combine distributions and weights
hexplot = pyplot.subplot(111)
pyplot.hexbin(np.concatenate((x1, x2)), np.concatenate((y1, y2)), C=np.concatenate((c1, c2)),
              cmap=cmap, mincnt=1, extent=(xmin, xmax, ymin, ymax), gridsize=50,
              reduce_C_function=np.sum, norm=norm)
hexplot.axis([xmin, xmax, ymin, ymax])
cax = pyplot.axes([0.86, 0.1, 0.03, 0.85])
cb = pyplot.colorbar(cax=cax)
cax.set_yticklabels([' ','3','2','1',' '])
pyplot.subplots_adjust(wspace=0, hspace=0, bottom=0.1, right=0.78, top=0.95, left=0.12)
pyplot.show()
Here's the result without the fix (the commented-out c1 and c2 used):
and the result with the fix (the code as-is):
Hope that helps.
