i try to plot data in a histogram or bar in python. The data size (array size) is between 0-10000. The data itself (each entry of the array) depends on the input and has a range between 0 and e+20 (mostly the data is in th same range). So i want to do a hist plot with matplotlib. I want to plot how often a data is in some intervall (to illustrate the mean and deviation). Sometimes it works like this:
hist1.
But sometimes there is a problem with the intevall size like this:
hist2.
In this plot i need more bars at point 0-100 etc.
Can anyone help me with this?
The plots are just made with:
from numpy.linalg import *
import matplotlib.pyplot as plt
plt.hist(numbers,bins=100)
plt.show()
By default, hist produces a plot with an x range that covers the full range of your data.
If you have one outsider at very high x in comparison with the other values, then you will see this image with a 'compressed' figure.
I you want to have always the same view you can fix the limits with xlim.
Alternatively, if you want to see your distribution always centered and as nicer as possible, you can calculate the mean and the standard deviation of your data and fix the x range accordingly (p.e. for mean +/- 5 stdev)
Related
I am attempting to show the results of a tensorflow model onto a timeplot. I have two arrays, one for the predictions of the model and one for the actual values. Both arrays are of the size (3500,) with values ranging from 0-16. I want to be able to show a timeplot that displays the predicted value and the actual value at a certain point in time. I do not know how to setup a timeplot that has the values on the y-axis and time on the x-axis.
I am currently using the Matplotlib library to make this work but I am not too familiar with it
plt.scatter(max_test,max_predictions)
plt.show()
This gives me a scatter plot with the values (0-16) on both the y and x-axis.
I want to replace the values on the x-axis with the time that each value occurred at (preferably every 5 seconds) and only show one value for both the arrays. Ideally it would look something like this:
If you have the start time and stop time, you can import numpy and use linspace for the x axis:
x_axis_val = np.linspace(start, stop, len(max_test))
To plot both sets of y-values on the same graph do:
plots = [max_predictions, max_test]
for plot in plots:
plt.scatter(x_axis_val,plot)
plt.show()
I have a CDF plot with data of wifi usage in MB. For better understanding I would like to present the usage starting in KB and finishing in TB. I would like to know how to set a specific range for x axis to replace the produce by plt.plot() and show the axis x, per example, as [1KB 10KB 1MB 10MB 1TB 10TB], even the space between bins not representing the real values.
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks and i already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represents the sample range I specified before. But the plot is still wrong.
The result expected
In excel it is pretty easy makes what I want to. Look the image, with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated by using np.histogram(). It can take the desired bins as argument.
In order to get the cummulative histogram, np.cumsum does the job.
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. Is is very close to the Excel solution from the question. In this case the x values of the plot are simply values from 0 to number of bins and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#use the bin from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of it's actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
axes.set_ylim([0,1.05])
plt.show()
I need to create a box plot with results for some runs - for each of these runs I have the minimum output, maximum output, average output and standard deviation. This means that I will need 16 boxplots with labels.
The examples I ran into so far plot a numerical distribution, but in my case, this is not feasible.
Is there any way to do this in Python (Matplotlib) / R?
The answer given by #Roland above is important: a box plot shows fundamentally different quantities, and if you make a similar plot using the quantities you have, it might confuse users. I might represent this information using stacked errorbar plots. For example:
import matplotlib.pyplot as plt
import numpy as np
# construct some data like what you have:
x = np.random.randn(100, 8)
mins = x.min(0)
maxes = x.max(0)
means = x.mean(0)
std = x.std(0)
# create stacked errorbars:
plt.errorbar(np.arange(8), means, std, fmt='ok', lw=3)
plt.errorbar(np.arange(8), means, [means - mins, maxes - means],
fmt='.k', ecolor='gray', lw=1)
plt.xlim(-1, 8)
I'm using pyplot to make a histogram. Here is approximately what I'm doing:
import numpy as np
import pylab as pl
A = {my dataset as a dictionary: different numbers and their frequencies}
numbers = A.keys()
frequencies = A.values()
plot = np.transpose(np.array([[numbers,frequencies]])
n = <my bins-value here>
pl.hist(plot,bins=n,log=True)
pl.show()
I have noticed that, regardless of the number of bins I specify, the second bin is always green, like below. Why is it green? What does this mean? How do I prevent this from happening?
You can't really use hist that way. hist computes value frequencies given the raw data. You have already computed the frequencies, and you're trying to pass them to hist, but that's not the input hist needs. When you pass in a two-dimensional array, as you're doing, hist displays multiple histograms, one for each column. This is documented:
Multiple data can be provided via x as a list of datasets of potentially different length ([x0, x1, ...]), or as a 2-D ndarray in which each column is a dataset.
So you're getting one bar graph (the blue ones) for your labels, and another (the green ones) for their counts. Presumably all the green ones are lumped together because their range is much smaller.
If you generated your frequencies from raw data, you can pass that raw data to hist to get your histogram. If you only have the histogram data, you should use matplotlib's bar function to make a bar graph yourself using the histogram data. However, you'd have to bin it yourself. The bottom line is that you can either let hist do everything, or nothing: you can have it compute the frequencies and the bins and do the plot, or you can compute the frequencies and the bins and do the plot, but you can't just compute the frequencies yourself and have hist just do the binning and the plot.
I'm trying to reproduce this plot in python with little luck:
It's a simple number density contour currently done in SuperMongo. I'd like to drop it in favor of Python but the closest I can get is:
which is by using hexbin(). How could I go about getting the python plot to resemble the SuperMongo one? I don't have enough rep to post images, sorry for the links. Thanks for your time!
Example simple contour plot from a fellow SuperMongo => python sufferer:
import numpy as np
from matplotlib.colors import LogNorm
from matplotlib import pyplot as plt
plt.interactive(True)
fig=plt.figure(1)
plt.clf()
# generate input data; you already have that
x1 = np.random.normal(0,10,100000)
y1 = np.random.normal(0,7,100000)/10.
x2 = np.random.normal(-15,7,100000)
y2 = np.random.normal(-10,10,100000)/10.
x=np.concatenate([x1,x2])
y=np.concatenate([y1,y2])
# calculate the 2D density of the data given
counts,xbins,ybins=np.histogram2d(x,y,bins=100,normed=LogNorm())
# make the contour plot
plt.contour(counts.transpose(),extent=[xbins.min(),xbins.max(),
ybins.min(),ybins.max()],linewidths=3,colors='black',
linestyles='solid')
plt.show()
produces a nice contour plot.
The contour function offers a lot of fancy adjustments, for example let's set the levels by hand:
plt.clf()
mylevels=[1.e-4, 1.e-3, 1.e-2]
plt.contour(counts.transpose(),mylevels,extent=[xbins.min(),xbins.max(),
ybins.min(),ybins.max()],linewidths=3,colors='black',
linestyles='solid')
plt.show()
producing this plot:
And finally, in SM one can do contour plots on linear and log scales, so I spent a little time trying to figure out how to do this in matplotlib. Here is an example when the y points need to be plotted on the log scale and the x points still on the linear scale:
plt.clf()
# this is our new data which ought to be plotted on the log scale
ynew=10**y
# but the binning needs to be done in linear space
counts,xbins,ybins=np.histogram2d(x,y,bins=100,normed=LogNorm())
mylevels=[1.e-4,1.e-3,1.e-2]
# and the plotting needs to be done in the data (i.e., exponential) space
plt.contour(xbins[:-1],10**ybins[:-1],counts.transpose(),mylevels,
extent=[xbins.min(),xbins.max(),ybins.min(),ybins.max()],
linewidths=3,colors='black',linestyles='solid')
plt.yscale('log')
plt.show()
This produces a plot which looks very similar to the linear one, but with a nice vertical log axis, which is what was intended:
Have you checked out matplotlib's contour plot?
Unfortunately I couldn't view yours images. Do you mean something like this? It was done by MathGL -- GPL plotting library, which have Python interface too. And you can use arbitrary data arrays as input (including numpy's one).
You can use numpy.histogram2d to get a number density distribution of your array.
Try this example:
http://micropore.wordpress.com/2011/10/01/2d-density-plot-or-2d-histogram/