Stacked histogram will not stack - python

I am trying to run the following code:
variable_values = #numpy vector, one dimension, 5053 values between 1 and 0.
label_values = #numpy vector, one dimension, 5053 values, discrete value of either 1 OR 0.
x = variable_values[variable_values != '?'].astype(float)
y = label_values[variable_values != '?'].astype(float)
print np.max(x) #prints 0.90101
print np.max(y) #prints 1.0
N = 5053
ind = np.arange(N) # the x locations for the groups
width = 0.45 # the width of the bars: can also be len(x) sequence
n, bins, patches = plt.hist(x, 5, stacked=True, normed = True)
#Stack the data
plt.figure()
plt.hist(x, bins, stacked=True, normed = True)
plt.hist(y, bins, stacked=True, normed = True)
plt.show()
What I want to achieve is the following graph:
With the colour on each bar split according to whether its value for label is 1 or 0.
Unfortunately my output currently is:
There are two things wrong with this. First, it isn't stacked properly. Second, the values on the Y axis go up to 1.6, but I believe the Y axis should hold the number of data points that fall into each subgroup (so if all the data had values between 0 and 0.25, the only bar showing data would be the first).

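You're using the same bins for x and y. The bins are computed from x, whose values probably lie strictly inside the range 0 to 1, so the bins do not reach the edges 0 and 1. The y values, which are exactly 0 or 1, therefore fall outside the range of bins you're plotting.
The Y axis goes up to 1.6 because you have chosen to normalize the plot: with normed=True the bar heights are densities (the bar areas sum to 1), not counts. Set that parameter to False to get the real counts.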
This should fix most of these problems:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.random(5053)
y = np.random.random_integers(0,1, 5053)
# x = variable_values[variable_values != '?'].astype(float)
# y = label_values[variable_values != '?'].astype(float)
print np.max(x) #prints 0.90101
print np.max(y) #prints 1.0
N = 5053
ind = np.arange(N) # the x locations for the groups
width = 0.45 # the width of the bars: can also be len(x) sequence
n, bins, patches = plt.hist(x, 5, stacked=True, normed = True)
bins[0] = 0
bins[-1] = 1
#Stack the data
plt.figure()
plt.hist(y, bins, stacked=True, normed = False)
plt.hist(x, bins, stacked=True, normed = False)
plt.show()
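To see where a value like 1.6 can come from, here is a small sketch (not part of the original answer) that uses np.histogram on synthetic data. The seed, the 0.9 scaling and the variable names are assumptions for illustration only. In current matplotlib releases the normed argument has been replaced by density, which follows the same definition:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic stand-in for variable_values: 5053 values in [0, 0.9)
x = rng.random(5053) * 0.9

counts, edges = np.histogram(x, bins=5)
density, _ = np.histogram(x, bins=5, density=True)
widths = np.diff(edges)

# a density is the count divided by (total count * bin width),
# so the bar areas, not the bar heights, sum to 1
manual_density = counts / (counts.sum() * widths)

print(np.allclose(density, manual_density))
print(np.sum(density * widths))
print(density.max())  # larger than 1 because each bin is narrower than 1
```

With bins narrower than 1, heights above 1 are expected for a normalized histogram; the raw counts are what the question actually asks for.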

May I suggest a simpler solution:
variable_values=np.random.random(size=5053)
label_values=np.random.randint(0,2, size=5053)
plt.hist(variable_values, label='1')
plt.hist(variable_values[label_values==0], label='0')
plt.legend(loc='upper right')
plt.savefig('temp.png')
Actually, since label_values is either 1 or 0, you don't even need to stack the histogram. Just make a histogram of all the data (both 1's and 0's) and then superimpose a histogram for the 0's on top.
To use a stacked histogram instead (although I prefer to use one only when there are many different classes):
plt.hist([variable_values[label_values==1],variable_values[label_values==0]], stacked=True, label=['1', '0'])
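As a self-contained sketch of the stacked variant (the random data, the seed and the five explicit bin edges are assumptions, not the asker's data), passing both groups in one call with shared bins lets matplotlib stack them, and the top of the stack can be checked against the overall counts:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
variable_values = rng.random(5053)
label_values = rng.integers(0, 2, size=5053)

# shared bin edges so both groups are counted in the same intervals
bins = np.linspace(0, 1, 6)
groups = [variable_values[label_values == 1],
          variable_values[label_values == 0]]

fig, ax = plt.subplots()
# with stacked=True the returned n is cumulative: n[-1] is the top of the stack
n, edges, patches = ax.hist(groups, bins=bins, stacked=True, label=["1", "0"])
ax.legend(loc="upper right")
ax.set_ylabel("count")

total_counts, _ = np.histogram(variable_values, bins=bins)
print(np.allclose(n[-1], total_counts))
```

Add plt.show() or fig.savefig(...) at the end to view the figure.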

Related

SciPy Bernoulli random number generation, different behavior within a for loop for different sample sizes

The purpose of this code is to demonstrate CLT.
If I do the following:
num_samples = 10000
sample_means = np.empty(num_samples)
for i in range(num_samples):
    mean = np.mean(st.bernoulli.rvs(p=0.5, size=100))
    sample_means[i] = mean
sample_demeaned = np.subtract(sample_means, 0.5)
denominator = np.divide(0.5, np.sqrt(100))
z_ed = np.divide(sample_demeaned, denominator)
plt.hist(z_ed, bins=40, edgecolor='k', density=True)
x = np.linspace(st.norm.ppf(0.001), st.norm.ppf(0.999), 10000)
y = st.norm.pdf(x)
plt.plot(x, y, color='red')
I get:
However, if I try to do it with a for loop for different sample sizes:
num_samples = 10000
sample_sizes = np.array([5, 20, 75, 100])
sample_std_means = np.empty(shape=(num_samples, len(sample_sizes)))
for col, size in enumerate(sample_sizes):
    sample_means = np.empty(num_samples)
    for i in range(num_samples):
        mean = np.mean(st.bernoulli.rvs(p=0.5, size=size))
        sample_means[i] = mean
    sample_demeaned = np.subtract(sample_means, 0.5)
    denominator = np.divide(0.5, np.sqrt(size))
    z_ed = np.divide(sample_demeaned, denominator)
    sample_std_means[:, col] = sample_means
And then plot each of them in a 2x2 grid:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
x = np.linspace(st.norm.ppf(0.001), st.norm.ppf(0.999), 10000)
y = st.norm.pdf(x)
for i, ax in enumerate(axes.flatten()):
    ax.hist(sample_std_means[i], bins=40, edgecolor='k', color='midnightblue')
    ax.set_ylabel('Density')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    ax.plot(x, y, color='red')
    ax.set_xlim((-3, 3))
plt.show()
I get the following image:
I cannot debug the discrepancy here. Any help is highly appreciated.
Please note that scipy.stats and numpy have been imported as st and np respectively in both code blocks.
First, note that one of numpy's strong points is that it allows operations which mix arrays and single numbers. This is called broadcasting. So, for example, sample_demeaned = np.subtract(sample_means, 0.5) can be written more concisely as sample_demeaned = sample_means - 0.5.
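Broadcasting can also replace the explicit loops entirely. The following is only a sketch under one assumption not in the question: it draws the sums with numpy's Generator.binomial instead of st.bernoulli.rvs, relying on the fact that the sum of n Bernoulli(p) draws follows a Binomial(n, p) distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10000
sample_sizes = np.array([5, 20, 75, 100])

# one column per sample size; n is broadcast across the rows
sums = rng.binomial(sample_sizes, 0.5, size=(num_samples, len(sample_sizes)))
sample_means = sums / sample_sizes

# the scalar 0.5 and the per-column scale both broadcast over the 2-D array
z = (sample_means - 0.5) / (0.5 / np.sqrt(sample_sizes))

print(z.shape)
print(z.mean(axis=0))  # each close to 0
print(z.std(axis=0))   # each close to 1
```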
Several things are going wrong:
sample_std_means[:, col] = sample_means should use the just calculated z_ed instead of sample_means.
ax.hist(sample_std_means[i], ...) uses the i'th row of the array. That row only contains 4 elements. You'd want sample_std_means[:, i] to take the i'th column.
The pdf is drawn in its normalized form (with an area below the curve equal to one). However, the histogram's height is proportional to the number of samples. Its total area is num_samples * bin_width, where the histogram's default bin width is the length from the first to the last element divided by the number of bins. To get both the pdf and histogram with similar sizes, either the histogram should be normalized (using density=True) or the pdf should be multiplied by the expected area of the histogram.
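The row-versus-column point and the area claim can both be checked numerically. In this sketch the array is filled with standard normal draws as a stand-in for the standardized means (an assumption made only to keep the example short):

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10000
# stand-in for sample_std_means: one column per sample size
sample_std_means = rng.standard_normal((num_samples, 4))

print(sample_std_means[0].shape)     # (4,)     one row: a single value per sample size
print(sample_std_means[:, 0].shape)  # (10000,) one column: all samples for one size

# an unnormalized histogram has area num_samples * bin_width
counts, edges = np.histogram(sample_std_means[:, 0], bins=40)
bin_width = edges[1] - edges[0]
area = np.sum(counts * bin_width)
print(np.isclose(area, num_samples * bin_width))  # True
```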
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
num_samples = 10000
sample_sizes = np.array([5, 20, 75, 100])
sample_std_means = np.empty(shape=(num_samples, len(sample_sizes)))
for col, size in enumerate(sample_sizes):
    sample_means = np.empty(num_samples)
    for i in range(num_samples):
        sample_means[i] = np.mean(st.bernoulli.rvs(p=0.5, size=size))
    sample_demeaned = sample_means - 0.5
    z_ed = sample_demeaned / (0.5 / np.sqrt(size))
    sample_std_means[:, col] = z_ed
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
x = np.linspace(st.norm.ppf(0.001), st.norm.ppf(0.999), 1000)
y = st.norm.pdf(x)
for i, ax in enumerate(axes.flatten()):
    ax.hist(sample_std_means[:, i], bins=40, edgecolor='k', color='midnightblue', density=True)
    ax.set_ylabel('Density')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    # bin_width = (sample_std_means[:, i].max() - sample_std_means[:, i].min()) / 40
    # ax.plot(x, y * num_samples * bin_width, color='red')
    ax.plot(x, y, color='red')
    ax.set_xlim((-3, 3))
plt.show()
Now note the weird empty bars in the histograms. A histogram works best for continuous distributions. But the mean of n Bernoulli trials can have at most n+1 different outcomes. When all trials would be True, the mean would be n/n = 1. When all would be False, the mean would be 0. Combined, the possible means are 0, 1/n, 2/n, ..., 1. The histogram of such a discrete distribution should take these values into account for the boundaries between the bins.
The following code creates a scatter plot, using the position of the means and a random y-value to visualize how many there are per x. Also, the position of the bin boundaries is calculated and visualized by dotted vertical lines.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 7))
for i, ax in enumerate(axes.flatten()):
    ax.scatter(sample_std_means[:, i], np.random.uniform(0, 1, num_samples), color='r', alpha=0.5, lw=0, s=1)
    # there are n+1 possible mean values for n bernoulli trials
    # n+2 boundaries will be needed to separate the bins
    bins = np.arange(-1, sample_sizes[i]+1) / sample_sizes[i]
    bins += (bins[1] - bins[0]) / 2 # shift half a bin
    bins -= 0.5 # subtract the mean
    bins /= (0.5 / np.sqrt(sample_sizes[i])) # correction factor
    for b in bins:
        ax.axvline(b, color='g', ls=':')
    ax.set_xlabel(f'n = {sample_sizes[i]}')
    ax.set_xlim((-3, 3))
And here are the histograms using these bins:
ax.hist(sample_std_means[:, i], bins=bins, edgecolor='k', color='midnightblue', density=True)
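The bin construction can be verified without plotting. In this sketch the helper name standardized_bin_edges is hypothetical, and p = 0.5 is hard-coded as in the code above; it checks that every possible standardized mean sits at the centre of its own bin:

```python
import numpy as np

def standardized_bin_edges(n):
    """Edges placed halfway between the possible means k/n, then standardized (p = 0.5)."""
    edges = (np.arange(-1, n + 1) + 0.5) / n   # -0.5/n, 0.5/n, ..., (n + 0.5)/n
    return (edges - 0.5) / (0.5 / np.sqrt(n))

n = 5
possible_means = np.arange(n + 1) / n                    # 0, 1/n, ..., 1
z_values = (possible_means - 0.5) / (0.5 / np.sqrt(n))   # standardized means

edges = standardized_bin_edges(n)
counts, _ = np.histogram(z_values, bins=edges)
centres = (edges[:-1] + edges[1:]) / 2

print(len(edges))   # 7, i.e. n + 2 boundaries
print(counts)       # [1 1 1 1 1 1], one possible value per bin
```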

Scale negative xticks to different scale than positive x ticks

Is this possible?
I have a barchart representing the difference between two dataframe columns divided by the original dataframe column
difference = (df - full_df)/full_df
I then plot the difference
difference.plot(kind='barh',color = ['r' if x > 0 else 'b' for x in difference.values]).\
set_yticklabels([str(tick)[:45] for tick in difference.index])
plt.xticks(fontsize=20)
plt.gca().set_title('Selected minus full feature set averages divided by full', fontsize=30)
axs[1].yaxis.tick_right()
axs[1].yaxis.grid(color='gray', linestyle='dashed')
axs[1].xaxis.grid(color='gray', linestyle='dashed')
plt.yticks(fontsize=23)
plt.tight_layout()
Most of the positive x numbers are going to be in the range 0 < x < 10. All of the negative numbers should be in the range -1 < x < 0. Is there a way to set the xtick intervals below zero to .1 (or something like that) and the xtick intervals above 0 to 1 so the x axis would look like:
[-1,-.9,-.8,-.7,-.6,-.5,-.4,-.3,-.2,-.1,0,1,2,3,4,5,6,7,8,9, to inf] ?
I may not understand your question correctly, but I think you are looking for a plot with negative x and positive x on different scales, but taking up the same amount of space. I think the easiest solution would be to scale your negative x data so that it spans a range comparable to the positive x data, and then relabel the ticks on their original scale.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# original data
data_x = x = [-1,-.9,-.8,-.7,-.6,-.5,-.4,-.3,-.2,-.1,0,1,2,3,4,5,6,7,8,9,]
data_y = y = range(len(data_x))
# scale negative x values
x_mod =[i*10 if i < 0 else i for i in x]
# draw plots
with sns.axes_style('whitegrid'):
    fig, ax = plt.subplots(2,1,figsize=(5,8))
# unscaled
ax1 = ax[0]
ax2 = ax[1]
ax1.plot(data_x, data_y)
ax1.vlines(0, 0, 20, color='black') # mark x = 0
ax1.set_title('unscaled data')
# scaled
ax2.plot(x_mod, data_y)
# fix the xticks and their labeling
xticks = list(np.concatenate([np.arange(-1,0, 0.2),np.arange(0,11,2)]))
xtick_locs = list(np.concatenate([np.arange(-1,0, 0.2) *10, np.arange(0,11,2)]))
ax2.set(xticks=xtick_locs, xticklabels = xticks)
ax2.vlines(0, 0, 20, color='black')
ax2.set_title('scaled data')
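An alternative sketch, not from the answer above: recent matplotlib versions provide a 'function' axis scale that takes a forward and an inverse transform, so the data can stay on its original scale and no manual relabelling is needed. The factor of 10, the sample values and the tick positions below are assumptions chosen to match the ranges described in the question:

```python
import matplotlib.pyplot as plt
import numpy as np

def forward(x):
    # stretch the negative side by a factor of 10, leave the positive side as is
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, x * 10, x)

def inverse(x):
    # undo the stretch
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, x / 10, x)

values = np.array([-0.9, -0.4, -0.1, 0.5, 3.0, 8.0])  # illustrative data
ticks = np.concatenate([np.arange(-10, 0) / 10, np.arange(0, 10)])

fig, ax = plt.subplots()
ax.barh(np.arange(len(values)), values,
        color=['r' if v > 0 else 'b' for v in values])
ax.set_xscale('function', functions=(forward, inverse))
ax.set_xticks(ticks)
ax.set_xlim(-1, 9)
```

Call plt.show() afterwards to display the figure. The two functions must be exact inverses of each other, otherwise tick placement and interaction will be off.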

Setting discrete colormap corresponding to specific data range in Matplotlib

Some background
I have a 2-D array of shape (50, 50) whose values range from -40 to 40.
But I want to plot the data in three ranges: [<0], [0, 20], [>20].
Then I need to generate a colormap corresponding to the three sections.
Here is what I have so far:
## ratio is the original 2-d array
binlabel = np.zeros_like(ratio)
binlabel[ratio<0] = 1
binlabel[(ratio>0)&(ratio<20)] = 2
binlabel[ratio>20] = 3
def discrete_cmap(N, base_cmap=None):
    base = plt.cm.get_cmap(base_cmap)
    color_list = base(np.linspace(0, 1, N))
    cmap_name = base.name + str(N)
    return base.from_list(cmap_name, color_list, N)
fig = plt.figure()
ax = plt.gca()
ratio_plot = plt.pcolormesh(binlabel, cmap = discrete_cmap(3, 'jet'))
divider = make_axes_locatable(ax)
cax = divider.append_axes("bottom", size="4%", pad=0.45)
cbar = plt.colorbar(ratio_plot, cax=cax, orientation="horizontal")
labels = [1.35,2,2.65]
loc = labels
cbar.set_ticks(loc)
cbar.ax.set_xticklabels(['< 0', '0~20', '>20'])
Is there a better approach? Any advice would be appreciated.
There are various answers to other questions using ListedColormap and BoundaryNorm, but here's an alternative. I've ignored the placement of your colorbar, as that's not relevant to your question.
You can replace your binlabel calculation with a call to np.digitize() and replace your discrete_cmap() function by using the lut argument to get_cmap(). Also, I find it easier to place the color bounds at .5 midpoints between the indexes rather than scale to awkward fractions of odd numbers:
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import numpy as np
ratio = np.random.random((50,50)) * 50.0 - 20.0
fig2, ax2 = plt.subplots(figsize=(5,5))
# Turn the data into an array of N bin indexes (i.e., 0, 1 and 2).
bounds = [0,20]
iratio = np.digitize(ratio.flat,bounds).reshape(ratio.shape)
# Create a colormap containing N colors and a Normalizer that defines where
# the boundaries of the colors should be relative to the indexes (i.e., -0.5,
# 0.5, 1.5, 2.5).
cmap = cm.get_cmap("jet",lut=len(bounds)+1)
cmap_bounds = np.arange(len(bounds)+2) - 0.5
norm = mcol.BoundaryNorm(cmap_bounds,cmap.N)
# Plot using the colormap and the Normalizer.
ratio_plot = plt.pcolormesh(iratio,cmap=cmap,norm=norm)
cbar = plt.colorbar(ratio_plot,ticks=[0,1,2],orientation="horizontal")
cbar.set_ticklabels(["< 0","0~20",">20"])
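For comparison, the ListedColormap and BoundaryNorm route mentioned at the start of this answer can be applied to the raw data directly, without computing bin indexes first. This is a minimal sketch; the three colour names, the tick positions and the outer bounds of -40 and 40 are assumptions based on the data range stated in the question:

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import BoundaryNorm, ListedColormap

rng = np.random.default_rng(0)
ratio = rng.uniform(-40, 40, size=(50, 50))

# one colour per interval: [-40, 0), [0, 20), [20, 40)
boundaries = [-40, 0, 20, 40]
cmap = ListedColormap(["tab:blue", "tab:green", "tab:red"])
norm = BoundaryNorm(boundaries, cmap.N)

fig, ax = plt.subplots(figsize=(5, 5))
mesh = ax.pcolormesh(ratio, cmap=cmap, norm=norm)
# ticks sit in the middle of each interval, in data units
cbar = fig.colorbar(mesh, ax=ax, orientation="horizontal", ticks=[-20, 10, 30])
cbar.set_ticklabels(["< 0", "0~20", "> 20"])

# the norm maps each value to a colour index 0, 1 or 2
color_index = np.ma.getdata(norm(ratio))
print(np.unique(color_index))
```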

Dividing matplotlib histogram by maximum bin value

I want to plot multiple histograms on the same plot and I need to compare the spread of the data. I want to do this by dividing each histogram by its maximum value so all the distributions have the same scale. However, the way matplotlib's histogram function works, I have not found an easy way to do this.
This is because n in
n, bins, patches = ax1.hist(y, bins = 20, histtype = 'step', color = 'k')
is the number of counts in each bin, but I cannot pass this back to hist, since hist would recalculate the counts.
I have tried the normed and density arguments, but these normalise the area of the distributions rather than the height. I could duplicate n and then repeat the bin edges using the bins output, but this is tedious. Surely the hist function must allow for the bin values to be divided by a constant?
Example code is below, demonstrating the problem.
y1 = np.random.randn(100)
y2 = 2*np.random.randn(50)
x1 = np.linspace(1,101,100)
x2 = np.linspace(1,51,50)
gs = plt.GridSpec(1,2, wspace = 0, width_ratios = [3,1])
ax = plt.subplot(gs[0])
ax1 = plt.subplot(gs[1])
ax1.yaxis.set_ticklabels([]) # remove the major ticks
ax.scatter(x1, y1, marker='+',color = 'k')#, c=SNR, cmap=plt.cm.Greys)
ax.scatter(x2, y2, marker='o',color = 'k')#, c=SNR, cmap=plt.cm.Greys)
n1, bins1, patches1 = ax1.hist(y1, bins = 20, histtype = 'step', color = 'k',linewidth = 2, orientation = 'horizontal')
n2, bins2, patched2 = ax1.hist(y2, bins = 20, histtype = 'step', linestyle = 'dashed', color = 'k', orientation = 'horizontal')
I do not know whether matplotlib allows this normalisation by default but I wrote a function to do it myself.
It takes the output of n and bins from plt.hist (as above) and then passes this through the function below.
def hist_norm_height(n, bins, const):
    ''' Function to normalise bin height by a constant.
    Needs n and bins from np.histogram or ax.hist.'''
    # repeat each count so every bin gets a left and a right corner
    n = np.repeat(n, 2)
    n = np.float32(n) / const
    new_bins = [bins[0]]
    new_bins.extend(np.repeat(bins[1:], 2))
    return n, new_bins[:-1]
To plot the result (I like step histograms), pass it to plt.step, e.g. plt.step(new_bins, n). This gives a histogram with its height normalised by a constant.
You can assign the argument bins equal to a list of values. Use np.arange() or np.linspace() to generate the values. http://matplotlib.org/api/axes_api.html?highlight=hist#matplotlib.axes.Axes.hist
Slightly different approach set up for comparisons. Could be adapted to the step style:
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import numpy as np
y = []
y.append(np.random.normal(2, 2, size=40))
y.append(np.random.normal(3, 1.5, size=40))
y.append(np.random.normal(4,4,size=40))
ls = ['dashed','dotted','solid']
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3)
for l, data in zip(ls, y):
    n, b, p = ax1.hist(data, normed=False,
                       #histtype='step', #step's too much of a pain to get the bins
                       #color='k', linestyle=l,
                       alpha=0.2
                       )
    ax2.hist(data, normed=True,
             #histtype = 'step', color='k', linestyle=l,
             alpha=0.2
             )
    n, b, p = ax3.hist(data, normed=False,
                       #histtype='step', #step's too much of a pain to get the bins
                       #color='k', linestyle=l,
                       alpha=0.2
                       )
    high = float(max([r.get_height() for r in p]))
    for r in p:
        r.set_height(r.get_height()/high)
        ax3.add_patch(r)
    ax3.set_ylim(0,1)
ax1.set_title('hist')
ax2.set_title('area==1')
ax3.set_title('fix height')
plt.show()
a couple outputs:
This can be accomplished by using numpy to compute the histogram values first, and then plotting them with a bar plot.
import numpy as np
import matplotlib.pyplot as plt
# Define random data and number of bins to use
x = np.random.randn(1000)
bins = 10
plt.figure()
# Obtain the bin values and edges using numpy
hist, bin_edges = np.histogram(x, bins=bins, density=True)
# Plot bars with the proper positioning, height, and width.
plt.bar(
(bin_edges[1:] + bin_edges[:-1]) * .5, hist / hist.max(),
width=(bin_edges[1] - bin_edges[0]), color="blue")
plt.show()
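For the step style asked about in the question, the same idea (compute counts with numpy, divide by the tallest bin, then draw) can be sketched with Axes.stairs, which I believe is available from matplotlib 3.4 onwards. The helper name peak_normalised_hist and the seed are hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np

def peak_normalised_hist(data, bins=20):
    """Return (heights, edges) with the tallest bin scaled to 1."""
    counts, edges = np.histogram(data, bins=bins)
    return counts / counts.max(), edges

rng = np.random.default_rng(0)
y1 = rng.standard_normal(100)
y2 = 2 * rng.standard_normal(50)

h1, e1 = peak_normalised_hist(y1)
h2, e2 = peak_normalised_hist(y2)

fig, ax = plt.subplots()
# stairs takes the bin heights and the bin edges directly
ax.stairs(h1, e1, color="k", linewidth=2, orientation="horizontal")
ax.stairs(h2, e2, color="k", linestyle="dashed", orientation="horizontal")
ax.set_xlim(0, 1.05)
```

Both outlines now peak at 1, so their spreads can be compared directly; call plt.show() to display them.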

Color-coding a histogram

I have a set of N objects with two properties: x and y.
I would like to depict the distribution of x with a histogram in MATPLOTLIB using hist(). Easy enough. Now, I would like to color-code EACH bar of the histogram with a color that represents the average value of y in that set with a colormap. Is there an easy way to do this? Here, x and y are both N-d numpy arrays. Thanks!
fig = plt.figure()
n, bins, patches = plt.hist(x, 100, normed=1, histtype='stepfilled')
plt.setp(patches, 'facecolor', 'g', 'alpha', 0.1)
plt.xlabel('x')
plt.ylabel('Normalized frequency')
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
# set up the bins
Nbins = 10
bins = np.linspace(0, 1, Nbins + 1, endpoint=True)
# get some fake data
x = np.random.rand(300)
y = np.arange(300)
# figure out which bin each x goes into
bin_num = np.digitize(x, bins, right=True) - 1
# compute the counts per bin
hist_vals = np.bincount(bin_num, minlength=Nbins)
# sum the y values by bin; bincount with weights accumulates repeated
# indices, which a fancy-indexed += would not do
means = np.bincount(bin_num, weights=y, minlength=Nbins)
# take the average (an empty bin would give nan here)
means /= hist_vals
# make the figure/axes objects
fig, ax = plt.subplots(1, 1)
# get a color map
my_cmap = plt.get_cmap('jet')
# get normalize function (takes data in range [vmin, vmax] -> [0, 1])
my_norm = Normalize()
# use bar plot; align='edge' puts the left side of each bar on its bin edge
ax.bar(bins[:-1], hist_vals, color=my_cmap(my_norm(means)), width=np.diff(bins), align='edge')
# make sure the figure updates
plt.draw()
plt.show()
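One detail worth checking when computing the per-bin means: an in-place addition through an index array (arr[idx] += vals) does not accumulate when an index is repeated, whereas np.bincount with weights (or np.add.at) does. A tiny sketch with made-up numbers:

```python
import numpy as np

bin_num = np.array([0, 0, 1, 2, 2, 2])           # bin index of each point
y = np.array([1.0, 3.0, 10.0, 2.0, 4.0, 6.0])    # value attached to each point

# fancy-indexed += keeps only one contribution per repeated index
buffered = np.zeros(3)
buffered[bin_num] += y

# bincount with weights sums every contribution
sums = np.bincount(bin_num, weights=y, minlength=3)
counts = np.bincount(bin_num, minlength=3)
means = sums / counts

print(buffered)  # not the per-bin sums
print(sums)      # [ 4. 10. 12.]
print(means)     # [ 2. 10.  4.]
```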
related: vary the color of each bar in bargraph using particular value
