Why is my matplotlib.pyplot.hist not binning my data - python

I am attempting to create a histogram from an array I made. When I plot it, it does not look like a regular histogram; it just gives me lines where my data points are.
I have tried setting bins = [0,10,20,30,40,50,60,70,80,90], including with 0 and 100 on the ends. I've also tried bins = range() and bins = 'auto'.
import numpy as np
import matplotlib.pyplot as plt
array2 = np.random.uniform(10.0,100.0,size=(1,100))
#create a random array uniformly distributed between 10 and 100
print(array2)
plt.hist(array2)
#plot a histogram
plt.title('Histogram of a Uniformly Distributed Sample between 10 and 100')
plt.xlim(0,100)
plt.show()
I'm really new and I'm not sure how to paste pictures. The plot is just a bunch of vertical lines at the data points instead of a binned histogram. Sometimes, with some of the choices I make for bins, I end up with a completely blank plot. I would like to apologize if this has been dealt with before; I have not been able to find any previous questions that helped.

You create a 2D array with one row and 100 columns. hist treats each column of a 2D array as a separate dataset, so you get 100 histograms, each containing a single value.
Use a 1D vector of data instead.
array2 = np.random.uniform(10.0,100.0,size=100)
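The difference shows up directly in the array shapes. The following is a minimal sketch rather than part of the original answer: the names row_vector and flat and the bin edges are chosen here purely for illustration, and ravel() is just one way to flatten an existing 2D array when regenerating the data is not an option.

```python
import numpy as np
import matplotlib.pyplot as plt

# size=(1, 100) gives a 2D array: one row, 100 columns.
row_vector = np.random.uniform(10.0, 100.0, size=(1, 100))
print(row_vector.shape)  # (1, 100)

# Flatten to 1D so that hist sees a single dataset of 100 values.
flat = row_vector.ravel()
print(flat.shape)  # (100,)

fig, ax = plt.subplots()
# Bin edges every 10 units from 0 to 100 (an assumed choice for this example).
counts, edges, _ = ax.hist(flat, bins=np.arange(0, 101, 10))
ax.set_xlim(0, 100)

# All 100 values land in one of the 10 bins.
print(counts.sum())
```

Calling plt.show() afterwards displays the figure.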

Related

Graphing a continuous bar plot with different colors/width

I am working on a relatively large dataset (5000 rows) in pandas and would like to draw a bar plot that is continuous and uses different colors (image 1).
Every depth reading has an SBT value.
Initially, I thought of generating a bar for each depth, but because of the amount of data the graph does not display well and takes a really long time to load.
In the meantime, I generated a plot of the data, but with lines.
I added the code and the picture of this plot below (image 2).
fig, SBTcla = plt.subplots()
SBTcla.plot(SBT, datos['Depth (m)'], color='black', label='SBT')
plt.xlim(0, 10)
plt.grid(color='grey', linestyle='--', linewidth=1)
plt.title('SBT')
plt.xlabel('SBT')
plt.ylabel('Profundidad (mts)')
plt.gca().invert_yaxis()
Your graph consists of a lot of points that carry no information. Consecutive rows which contain the same SBT could be eliminated. Grouping consecutive rows with equal content can be done with a shift and a cumulative sum. The boolean expression looks for steps from one region to the next: at a step it evaluates to True, and the cumulative sum increases by one.
x = datos.groupby((datos['SBT'].shift() != datos['SBT']).cumsum())
Each group can be plotted on its own, with a filled style
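As a rough illustration of the shift and cumulative sum idea, the sketch below uses a small made-up stand-in for datos (the values are invented; only the column names come from the question) and reduces each run of equal SBT to a single row. The names run_id and segments are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for `datos`; the values are made up for illustration.
datos = pd.DataFrame({
    "Depth (m)": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0],
    "SBT":       [3,   3,   3,   5,   5,   3,   3],
})

# A new run starts wherever SBT differs from the previous row, so the
# cumulative sum of those "step" flags labels each consecutive run.
run_id = (datos["SBT"].shift() != datos["SBT"]).cumsum().rename("run")

# One row per run: its SBT value and the depth interval it covers.
segments = datos.groupby(run_id).agg(
    sbt=("SBT", "first"),
    top=("Depth (m)", "min"),
    bottom=("Depth (m)", "max"),
).reset_index(drop=True)

print(segments)
```

Seven rows collapse into three segments here. Each segment could then be drawn as one filled band, for example with ax.axhspan(top, bottom) and a color picked from the SBT value. Note that top and bottom are taken from the sampled depths, so small gaps remain between neighbouring bands unless the intervals are extended to meet.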

pyplot - How to set a specific range in x axis

I have a CDF plot of wifi usage data in MB. To make it easier to read, I would like to present the usage starting in KB and finishing in TB. I would like to know how to set a specific range for the x axis to replace the one produced by plt.plot(), showing the x axis as, for example, [1KB 10KB 1MB 10MB 1TB 10TB], even if the spacing between bins does not represent the real values.
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks and I already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represent the sample range I specified above. But the plot is still wrong.
The expected result
In Excel it is pretty easy to do what I want. Look at the image: with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated using np.histogram(). It can take the desired bins as an argument.
To get the cumulative histogram, np.cumsum does the job.
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. It is very close to the Excel solution from the question. In this case the x values of the plot are simply values from 0 to the number of bins, and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#use the bin from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of its actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
    axes.set_ylim([0,1.05])
plt.show()
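The code above labels the ticks with the raw bin edges in MB. To get labels like the "1KB ... 10TB" ones the question asks for, the edges could be converted to text first. The helper format_mb below is hypothetical (it is not part of matplotlib or of the original answer) and assumes binary units, i.e. 1 MB = 1024 KB, which matches the edge values used in the question.

```python
def format_mb(value_mb):
    """Return a short label such as '10 KB' for a size given in MB.

    Hypothetical helper; assumes binary units (factor 1024 per step).
    """
    units = ["KB", "MB", "GB", "TB"]
    size = value_mb * 1024.0  # start from KB
    for unit in units[:-1]:
        if size < 1024.0:
            return "{:g} {}".format(round(size, 1), unit)
        size /= 1024.0
    # Anything that is still >= 1024 GB is reported in TB.
    return "{:g} {}".format(round(size, 1), units[-1])


# Bin edges from the question, in MB (the leading 0 is skipped, as above).
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
labels = [format_mb(b) for b in bins[1:]]
print(labels)
```

The resulting list can be passed in place of bins[1:] to set_xticklabels in either of the two plots, for example ax2.set_xticklabels(labels, rotation=45, horizontalalignment="right").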

Python histograms: Manually normalising counts and re-plotting as histogram

I tried searching for something similar, and the closest thing I could find was this which helped me to extract and manipulate the data, but now I can't figure out how to re-plot the histogram. I have some array of voltages, and I have first plotted a histogram of occurrences of those voltages. I want to instead make a histogram of events per hour ( so the y-axis of a normal histogram divided by the number of hours I took data ) and then re-plot the histogram with the manipulated y data.
I have an array which contains the number of events per hour ( composed of the original y axis from pyplot.hist divided by the number of hours data was taken ), and the bins from the histogram. I have composed that array using the following code ( taken from the answer linked above ):
import numpy
import matplotlib.pyplot as pyplot
mydata = numpy.random.normal(-15, 1, 500) # this seems to have to be 'uneven' on either side of 0, otherwise the code looks fine. FYI, my actual data is all positive
pyplot.figure(1)
hist1 = pyplot.hist(mydata, bins=50, alpha=0.5, label='set 1', color='red')
hist1_flux = [hist1[0]/5.0, 0.5*(hist1[1][1:]+hist1[1][:-1])]
pyplot.figure(2)
pyplot.bar(hist1_flux[1], hist1_flux[0])
This code doesn't exactly match what's going on in my code; my data is composed of 1000 arrays of 1000 data points each ( voltages ). I have made histograms of that, which gives me number of occurrences of a given voltage range ( or bin width ). All I want to do is re-plot a histogram of the number of events per hour (so yaxis of the histogram / 5 hours) with the same original bin width, but when I divide hist1[0]/5 and replot in the above way, the 'bin width' is all wrong.
I feel like there must be an easier way to do this, rather than manually replotting my own histograms.
Thanks in advance, and I'm really sorry if I've missed something obvious.
The problem, illustrated in the output of my sample code AND my original data is as follows:
Upper plots: code snippet output.
Lower plots: My actual data.
It's because the bar function takes an argument width, which is by default 0.8 (plt.bar(left, height, width=0.8, bottom=None, hold=None, **kwargs)), so you need to change it to the distance between two bars:
pyplot.bar(hist1_flux[1], hist1_flux[0],
           width=hist1_flux[1][1] - hist1_flux[1][0])
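The same idea can be written without plotting the first histogram at all. The sketch below is one possible variant, not the answerer's code: it bins with np.histogram, scales the counts to events per hour, and draws the bars with their true widths. It also shows an alternative that may be simpler, namely passing per-sample weights to hist so that each event counts as 1/hours. The names hours, rate and weights are chosen here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

hours = 5.0  # duration of data taking, as in the question
mydata = np.random.normal(-15, 1, 500)

# Bin without plotting, then scale the counts to events per hour.
counts, edges = np.histogram(mydata, bins=50)
rate = counts / hours

fig, (ax1, ax2) = plt.subplots(ncols=2)

# Bars start at each left bin edge and span the true bin width.
ax1.bar(edges[:-1], rate, width=np.diff(edges), align="edge")
ax1.set_ylabel("Events per hour")

# Alternative: let hist apply the scaling through per-sample weights.
weights = np.full(mydata.shape, 1.0 / hours)
rate_from_hist, _, _ = ax2.hist(mydata, bins=edges, weights=weights)
```

Both panels should show the same bar heights. The weights route keeps hist in charge of the drawing, so the bin widths never need to be set by hand.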

How do you bin a dataset and start the bins from 0?

I have a dataset where I have observations at a range of depths in the sea. I am trying to plot the frequency on the x-axis and the depth on the y-axis of a histogram.
[This section is sorted]
In order to do this, I wanted to bin the data into 5-metre categories. The only problem is that the depths in the dataset start at 2 m, but I want the bins to start at 0 m and increase in 5 m intervals. I don't know how to set the bins to start at 0 m.
In addition, I'm plotting a histogram to show this, and I would like the depth to be on the y-axis (so that the plot is a bit more intuitive to look at)
Currently I have code that sorts the data into bins (I don't know the size) and then plots a histogram, but with the depth on the x-axis. Here is my code:
import numpy as np
import matplotlib.pyplot as plt
#Data read in at this point
depthout = []
Dout = np.array(depthout)
bins = np.linspace(0,260,61)
plt.hist(Dout, bins)
plt.show()
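No answer is recorded here, so the following is only a sketch of one way this might be done. np.linspace(0, 260, 61) produces 60 bins that are each 260/60 ≈ 4.33 m wide; np.arange with an explicit step gives edges at exact 5 m intervals starting from 0. Passing orientation="horizontal" to hist puts depth on the y-axis. The depth values below are made up, and bin_width and upper are names introduced for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up depths in metres, starting at 2 m as described in the question.
Dout = np.random.uniform(2.0, 258.0, size=300)

bin_width = 5.0
# Round the upper limit up to the next multiple of bin_width so the
# deepest observation still falls inside the last bin.
upper = np.ceil(Dout.max() / bin_width) * bin_width
bins = np.arange(0.0, upper + bin_width, bin_width)

fig, ax = plt.subplots()
# Horizontal bars: frequency on the x-axis, depth on the y-axis.
counts, edges, _ = ax.hist(Dout, bins=bins, orientation="horizontal")
ax.invert_yaxis()  # depth increases downwards
ax.set_xlabel("Frequency")
ax.set_ylabel("Depth (m)")
```

The first bin covers 0 to 5 m even though no observation is shallower than 2 m, because the edges are set explicitly rather than derived from the data.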

Matplotlib Histogram: Green and Blue Bins

I'm using pyplot to make a histogram. Here is approximately what I'm doing:
import numpy as np
import pylab as pl
A = {my dataset as a dictionary: different numbers and their frequencies}
numbers = A.keys()
frequencies = A.values()
plot = np.transpose(np.array([[numbers,frequencies]]))
n = <my bins-value here>
pl.hist(plot,bins=n,log=True)
pl.show()
I have noticed that, regardless of the number of bins I specify, the second bin is always green, like below. Why is it green? What does this mean? How do I prevent this from happening?
You can't really use hist that way. hist computes value frequencies given the raw data. You have already computed the frequencies, and you're trying to pass them to hist, but that's not the input hist needs. When you pass in a two-dimensional array, as you're doing, hist displays multiple histograms, one for each column. This is documented:
Multiple data can be provided via x as a list of datasets of potentially different length ([x0, x1, ...]), or as a 2-D ndarray in which each column is a dataset.
So you're getting one bar graph (the blue ones) for your labels, and another (the green ones) for their counts. Presumably all the green ones are lumped together because their range is much smaller.
If you generated your frequencies from raw data, you can pass that raw data to hist to get your histogram. If you only have the histogram data, you should use matplotlib's bar function to make a bar graph yourself using the histogram data. However, you'd have to bin it yourself. The bottom line is that you can either let hist do everything, or nothing: you can have it compute the frequencies and the bins and do the plot, or you can compute the frequencies and the bins and do the plot, but you can't just compute the frequencies yourself and have hist just do the binning and the plot.
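For the case where only the value/frequency pairs are available, the "bin it yourself" route might look like the sketch below. The dictionary contents and the bin count are invented for illustration. np.histogram is used with its weights argument so that each distinct number contributes its stored frequency to the bin it falls in.

```python
import numpy as np

# Hypothetical stand-in for the dictionary A: value -> how often it occurred.
A = {1: 4, 2: 7, 3: 1, 10: 2, 11: 5, 40: 3}

numbers = np.array(list(A.keys()), dtype=float)
frequencies = np.array(list(A.values()), dtype=float)

n_bins = 4  # assumed bin count for this example

# Each number is binned once, weighted by its precomputed frequency.
counts, edges = np.histogram(numbers, bins=n_bins, weights=frequencies)
print(counts)
print(edges)
```

The binned counts can then be drawn with bar, for example pl.bar(edges[:-1], counts, width=np.diff(edges), align="edge"), followed by pl.yscale("log") if a logarithmic y-axis is wanted. Since a single dataset is plotted, all bars share one color.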
