Box plot with min, max, average and standard deviation - python

I need to create a box plot with results for some runs - for each of these runs I have the minimum output, maximum output, average output and standard deviation. This means that I will need 16 boxplots with labels.
The examples I ran into so far plot a numerical distribution, but in my case, this is not feasible.
Is there any way to do this in Python (Matplotlib) / R?

The answer given by #Roland above is important: a box plot shows fundamentally different quantities, and if you make a similar plot using the quantities you have, it might confuse users. I might represent this information using stacked errorbar plots. For example:
import matplotlib.pyplot as plt
import numpy as np
# construct some data like what you have:
x = np.random.randn(100, 8)
mins = x.min(0)
maxes = x.max(0)
means = x.mean(0)
std = x.std(0)
# create stacked errorbars:
plt.errorbar(np.arange(8), means, std, fmt='ok', lw=3)
plt.errorbar(np.arange(8), means, [means - mins, maxes - means],
fmt='.k', ecolor='gray', lw=1)
plt.xlim(-1, 8)

Related

pyplot - How to set a specific range in x axis

I have a CDF plot with data of wifi usage in MB. For better understanding I would like to present the usage starting in KB and finishing in TB. I would like to know how to set a specific range for x axis to replace the produce by plt.plot() and show the axis x, per example, as [1KB 10KB 1MB 10MB 1TB 10TB], even the space between bins not representing the real values.
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks and i already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represents the sample range I specified before. But the plot is still wrong.
The result expected
In excel it is pretty easy makes what I want to. Look the image, with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated by using np.histogram(). It can take the desired bins as argument.
In order to get the cummulative histogram, np.cumsum does the job.
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. Is is very close to the Excel solution from the question. In this case the x values of the plot are simply values from 0 to number of bins and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#use the bin from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of it's actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
axes.set_ylim([0,1.05])
plt.show()

Matplotlib - Boxplot calculated on log10 values but shown in logarithmic scale

I think this is a simple question, but I just still can't seem to think of a simple solution. I have a set of data of molecular abundances, with values ranging many orders of magnitude. I want to represent these abundances with boxplots (box-and-whiskers plots), and I want the boxes to be calculated on log scale because of the wide range of values.
I know I can just calculate the log10 of the data and send it to matplotlib's boxplot, but this does not retain the logarithmic scale in plots later.
So my question is basically this:
When I have calculated a boxplot based on the log10 of my values, how do I convert the plot afterward to be shown on a logarithmic scale instead of linear with the log10 values?
I can change tick labels to partly fix this, but I have no clue how I get logarithmic scales back to the plot.
Or is there another more direct way to plotting this. A different package maybe that has this options already included?
Many thanks for the help.
I'd advice against doing the boxplot on the raw values and setting the y-axis to logarithmic, because the boxplot function is not designed to work across orders of magnitudes and you may get too many outliers (depends on your data, of course).
Instead, you can plot the logarithm of the data and manually adjust the y-labels.
Here is a very crude example:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
np.random.seed(42)
values = 10 ** np.random.uniform(-3, 3, size=100)
fig = plt.figure(figsize=(9, 3))
ax = plt.subplot(1, 3, 1)
ax.boxplot(np.log10(values))
ax.set_yticks(np.arange(-3, 4))
ax.set_yticklabels(10.0**np.arange(-3, 4))
ax.set_title('log')
ax = plt.subplot(1, 3, 2)
ax.boxplot(values)
ax.set_yscale('log')
ax.set_title('raw')
ax = plt.subplot(1, 3, 3)
ax.boxplot(values, whis=[5, 95])
ax.set_yscale('log')
ax.set_title('5%')
plt.show()
The right figure shows the box plot on the raw values. This leads to many outliers, because the maximum whisker length is computed as a multiple (default: 1.5) of the interquartile range (the box height), which does not scale across orders of magnitude.
Alternatively, you could specify to draw the whiskers for a given percentile range:
ax.boxplot(values, whis=[5, 95])
In this case you get a fixed amount of outlires (5%) above and below.
You can use plt.yscale:
plt.boxplot(data); plt.yscale('log')

Creating two x-axes for a line-plot in matplotlib with unknown transform function between scales

Using matplotlib, two x-axes for 1 line plot can easily be obtained using twiny().
If the transform between the two x-scales can be described by a function, the corresponding ticks can be set by applying this transform function.
(this is described here: How to add a second x-axis in matplotlib)
How can I achieve this, if the transform function between the scales is unknown?
Edit:
Imagine the following situation:
You have 2 thermometers, both measuring the temperature. Thermometer 1 is measuring in °C and thermometer 2 in an imaginary unit, lets call it °D. Basically, what you know is that with increasing °C °D is increasing as well. Additionally, both thermometers have some degree of inaccuracy.
Both thermometers measure the same physical quantity, hence I should be able to represent them with a single line and two scales. However, in contrast to plotting tempoeratures in °C vs. K or °F, the transformation between the scales is unknown.
This means for example I have:
import numpy as np
from matplotlib import pyplot as plt
temp1 = np.sort(np.random.uniform(size=21))
temp2 = np.sort(np.random.uniform(low=-20, high=20, size=21))
y = np.linspace(0,1,21, endpoint=True)
A transform function between temp1 and temp2 is existent, but unknow. Y, however, is the same.
Additionally, I know that temp1 and y are confined to the range (0,1)
Now we may plot like this:
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.set_aspect('equal')
ax2 = plt.twiny(ax1)
ax1.plot(x1,y, 'k-')
ax2.plot(x2,y, 'r:')
ax1.set_xlabel(r'1st x-axis')
ax2.set_xlabel(r'2nd x-axis')
ax1.set_xlim([0,1])
ax1.set_ylim([0,1])
fig.savefig('dual_x_faulty.png', format='png')
This leads to the following plot:
You can see that both curves are not the same, and the plot is not square (as it would be without twinning the y axis).
So, here is what I want (and can't achieve on my own):
Plotting a 3d-array (temp1, temp2, y) in a 2d line plot by having two x-axes
Matplotlib shoud 'automagically' set the ticks of temp2 such, that the curves (temp1, y) and (temp2, y) are congruent
Is there a workaround?
Thanks for your help!

Plotting histogram with matplotlib

i try to plot data in a histogram or bar in python. The data size (array size) is between 0-10000. The data itself (each entry of the array) depends on the input and has a range between 0 and e+20 (mostly the data is in th same range). So i want to do a hist plot with matplotlib. I want to plot how often a data is in some intervall (to illustrate the mean and deviation). Sometimes it works like this:
hist1.
But sometimes there is a problem with the intevall size like this:
hist2.
In this plot i need more bars at point 0-100 etc.
Can anyone help me with this?
The plots are just made with:
from numpy.linalg import *
import matplotlib.pyplot as plt
plt.hist(numbers,bins=100)
plt.show()
By default, hist produces a plot with an x range that covers the full range of your data.
If you have one outsider at very high x in comparison with the other values, then you will see this image with a 'compressed' figure.
I you want to have always the same view you can fix the limits with xlim.
Alternatively, if you want to see your distribution always centered and as nicer as possible, you can calculate the mean and the standard deviation of your data and fix the x range accordingly (p.e. for mean +/- 5 stdev)

Displaying 3 histograms on 1 axis in a legible way - matplotlib

I have produced 3 sets of data which are organised in numpy arrays. I'm interested in plotting the probability distribution of these three sets of data as normed histograms. All three distributions should look almost identical so it seems sensible to plot all three on the same axis for ease of comparison.
By default matplotlib histograms are plotted as bars which makes the image I want look very messy. Hence, my question is whether it is possible to force pyplot.hist to only draw a box/circle/triangle where the top of the bar would be in the default form so I can cleanly display all three distributions on the same graph or whether I have to calculate the histogram data and then plot it separately as a scatter graph.
Thanks in advance.
There are two ways to plot three histograms simultaniously, but both are not what you've asked for. To do what you ask, you must calculate the histogram, e.g. by using numpy.histogram, then plot using the plot method. Use scatter only if you want to associate other information with your points by setting a size for each point.
The first alternative approach to using hist involves passing all three data sets at once to the hist method. The hist method then adjusts the widths and placements of each bar so that all three sets are clearly presented.
The second alternative is to use the histtype='step' option, which makes clear plots for each set.
Here is a script demonstrating this:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(101)
a = np.random.normal(size=1000)
b = np.random.normal(size=1000)
c = np.random.normal(size=1000)
common_params = dict(bins=20,
range=(-5, 5),
normed=True)
plt.subplots_adjust(hspace=.4)
plt.subplot(311)
plt.title('Default')
plt.hist(a, **common_params)
plt.hist(b, **common_params)
plt.hist(c, **common_params)
plt.subplot(312)
plt.title('Skinny shift - 3 at a time')
plt.hist((a, b, c), **common_params)
plt.subplot(313)
common_params['histtype'] = 'step'
plt.title('With steps')
plt.hist(a, **common_params)
plt.hist(b, **common_params)
plt.hist(c, **common_params)
plt.savefig('3hist.png')
plt.show()
And here is the resulting plot:
Keep in mind you could do all this with the object oriented interface as well, e.g. make individual subplots, etc.

Categories

Resources