How do you bin a dataset and start the bins from 0? - python

I have a dataset of observations at a range of depths in the sea. I am trying to plot a histogram with the frequency on the x-axis and the depth on the y-axis.
To do this, I want to bin the data into 5 m intervals. The only problem is that the depths in the dataset start at 2 m, but I want the bins to start at 0 m and increase in 5 m steps. I don't know how to set the bins to start at 0 m.
In addition, I would like the depth to be on the y-axis of the histogram (so that the plot is a bit more intuitive to look at).
Currently I have code that sorts the data into bins (I don't know their size) and then plots a histogram, but with the depth on the x-axis. Here is my code:
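import numpy as np
import matplotlib.pyplot as plt
# Data read in at this point
depthout = []
Dout = np.array(depthout)
# 61 edges from 0 to 260 give 60 bins, each 260/60 (about 4.33) m wide
bins = np.linspace(0, 260, 61)
plt.hist(Dout, bins)
plt.show()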
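One possible approach, as a minimal sketch rather than a definitive answer: build the bin edges with np.arange so they start at 0 and step by 5, and pass orientation="horizontal" to hist so the depth ends up on the y-axis. The helper names depth_bin_edges and plot_depth_histogram and the sample depths are invented for illustration; they are not from the question.

```python
import numpy as np

def depth_bin_edges(depths, width=5.0):
    """Return bin edges 0, width, 2*width, ... that cover max(depths)."""
    top = np.ceil(np.max(depths) / width) * width
    # arange excludes the stop value, so add one width to include `top`
    return np.arange(0.0, top + width, width)

def plot_depth_histogram(depths, width=5.0):
    """Horizontal histogram: frequency on x, depth on y, 0 m at the top."""
    import matplotlib.pyplot as plt  # imported here so the helper above needs only NumPy
    fig, ax = plt.subplots()
    ax.hist(depths, bins=depth_bin_edges(depths, width), orientation="horizontal")
    ax.invert_yaxis()  # put the surface (0 m) at the top of the plot
    ax.set_xlabel("Frequency")
    ax.set_ylabel("Depth (m)")
    return ax

# Invented sample depths (not the asker's data); the shallowest is 2 m
sample_depths = np.array([2.0, 3.5, 7.0, 12.0, 12.5, 18.0, 44.0])
edges = depth_bin_edges(sample_depths)
counts, _ = np.histogram(sample_depths, bins=edges)
```

Calling plot_depth_histogram(Dout) followed by plt.show() would then draw the plot. Inverting the y-axis is a design choice so that depth increases downwards; drop that line if 0 m should sit at the bottom.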

Related

Why is my matplotlib.pyplot.hist not binning my data

I am attempting to create a histogram out of an array I made. When I plot it, it does not look like a regular histogram; it just gives me lines where my data points are.
I have attempted to set bins = [0,10,20,30,40,50,60,70,80,90], including with 0 and 100 on the ends. I've also tried bins = range() and bins = 'auto'.
import numpy as np
import matplotlib.pyplot as plt
# create a random array uniformly distributed between 10 and 100
array2 = np.random.uniform(10.0,100.0,size=(1,100))
print(array2)
# plot a histogram
plt.hist(array2)
plt.title('Histogram of a Uniformly Distributed Sample between 10 and 100')
plt.xlim(0,100)
plt.show()
I'm really new and I'm not sure how to paste pictures. The plot is just a bunch of vertical lines at the data points instead of a binned histogram. With some of the choices I make for bins, I end up with a completely blank plot. I would like to apologize if this has been dealt with before; I have not been able to find any previous questions that helped.
You create a 2D array with one row and 100 columns. plt.hist treats each column of a 2D input as a separate dataset, so you get 100 histograms, each containing a single value.
Use a 1D vector of data instead.
array2 = np.random.uniform(10.0,100.0,size=100)
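If the array already exists with shape (1, 100), flattening it has the same effect. The sketch below is only an illustration with an invented seed and invented variable names; it checks the shapes and counts the values per bin with np.histogram, without plotting.

```python
import numpy as np

rng = np.random.default_rng(0)
two_d = rng.uniform(10.0, 100.0, size=(1, 100))  # shape (1, 100): one row, 100 columns
one_d = two_d.ravel()                            # shape (100,): a single dataset

edges = np.arange(0, 101, 10)                    # 0, 10, ..., 100
counts, _ = np.histogram(one_d, bins=edges)      # number of values in each bin
```

Passing one_d (rather than two_d) to plt.hist together with bins=edges should then give one histogram with ten bins, the first of which stays empty because no value is below 10.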

Density plot using seaborn

I'm trying to make a density plot of the hourly demand:
data
The 'hr' means different hours, 'cnt' means demand.
I know how to make a density plot such as:
sns.kdeplot(bike['hr'])
However, this only works when the demand for each hour is unknown, so that each occurrence of an hour counts as one unit of demand. Now that I know the demand count for each hour, how can I make a density plot of such data?
A density plot aims to show an estimate of a distribution. To make a graph showing the density of hourly demand, we would really expect to see many iid samples of demand, with time-stamps, i.e. one row per sample. Then a density plot would make sense.
In the type of data here, however, where the demand ('cnt') is sampled regularly and aggregated over the sample period (the hour), a density plot is not directly meaningful. A bar graph used as a histogram does make sense, with the hours as the bins.
Below I show how to use pandas functions to produce such a plot -- really simple. For reference I also show how we might produce a density plot, through a sort of reconstruction of "original" samples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("../data/hour.csv") # load dataset, inc cols hr, cnt, no NaNs
# using the bar plotter built in to pandas objects
fig, ax = plt.subplots(1,2)
df.groupby('hr').agg({'cnt':'sum'}).plot.bar(ax=ax[0])
# reconstructed samples - has df.cnt.sum() rows, each one containing an hour of a rental.
samples = np.repeat(df.hr.values, df.cnt.values)
# plot a density estimate
# (seaborn 0.11 and later deprecate bw in favour of bw_method / bw_adjust)
sns.kdeplot(samples, bw=0.5, lw=3, c="r", ax=ax[1])
# to make a useful comparison with a density estimate, we need to have our bar areas
# sum up to 1, so we divide the per-hour totals by the total of all counts.
tot = float(df.cnt.sum())
(df.groupby('hr')['cnt'].sum() / tot).plot.bar(ax=ax[1], color='C0')
plt.show()
Demand for bikes seems to be low during the night... But it is also apparent that they are probably used for commuting, with peaks at hours 8am and 5-6pm.
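For readers without the dataset, the two steps (per-hour totals normalised to 1, and reconstruction of per-rental samples) can be sketched with NumPy alone. The hr and cnt arrays below are invented toy values, not taken from hour.csv, and np.bincount stands in for the pandas groupby.

```python
import numpy as np

# Invented aggregated data: one entry per (day, hour) with a demand count
hr = np.array([0, 8, 17, 0, 8, 17])
cnt = np.array([2, 10, 12, 4, 14, 8])

# Total demand per hour: a weighted bincount plays the role of groupby-sum
per_hour = np.bincount(hr, weights=cnt, minlength=24)

# Share of total demand per hour, so that the bar heights sum to 1
share = per_hour / per_hour.sum()

# Reconstructed samples: each hour value repeated cnt times
samples = np.repeat(hr, cnt)

# A unit-width histogram of the samples reproduces the per-hour totals
hist, _ = np.histogram(samples, bins=np.arange(25))
```

The share array corresponds to the normalised bars, and samples is what a density estimator would be given.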

pyplot - How to set a specific range in x axis

I have a CDF plot of wifi usage data in MB. To make it easier to read, I would like to present the usage starting in KB and finishing in TB. I would like to know how to set a specific range for the x-axis to replace the one produced by plt.plot() and show the x-axis as, for example, [1KB 10KB 1MB 10MB 1TB 10TB], even if the spacing between bins does not represent the real values.
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks and I already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represent the sample range I specified before. But the plot is still wrong.
The result expected
In Excel it is pretty easy to do what I want. Look at the image: with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated by using np.histogram(). It can take the desired bins as argument.
In order to get the cumulative histogram, np.cumsum does the job.
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. It is very close to the Excel solution from the question. In this case the x values of the plot are simply the integers from 0 up to the number of bins, and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#use the bin from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of its actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
    axes.set_ylim([0,1.05])
plt.show()
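The tick labels in the code above are the raw MB values. To get labels in KB to TB as the question asks, the bin edges could be converted to strings first. The helper format_mb below is a hypothetical name introduced here for illustration, and it assumes 1 MB = 1024 KB, which matches the bin values in the question.

```python
def format_mb(value_mb):
    """Format a size given in MB as a whole number of KB, MB, GB or TB."""
    units = ["KB", "MB", "GB", "TB"]
    size = value_mb * 1024.0  # start in KB
    for unit in units[:-1]:
        if size < 1024.0:
            return f"{size:.0f} {unit}"
        size /= 1024.0        # move up to the next larger unit
    return f"{size:.0f} {units[-1]}"

# Bin edges in MB, copied from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
labels = [format_mb(edge) for edge in bins[1:]]
print(labels)
```

The resulting list can be passed in place of bins[1:] in the set_xticklabels calls, for example ax2.set_xticklabels(labels, rotation=45, horizontalalignment="right").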

Python histograms: Manually normalising counts and re-plotting as histogram

I tried searching for something similar, and the closest thing I could find was this which helped me to extract and manipulate the data, but now I can't figure out how to re-plot the histogram. I have some array of voltages, and I have first plotted a histogram of occurrences of those voltages. I want to instead make a histogram of events per hour ( so the y-axis of a normal histogram divided by the number of hours I took data ) and then re-plot the histogram with the manipulated y data.
I have an array which contains the number of events per hour ( composed of the original y axis from pyplot.hist divided by the number of hours data was taken ), and the bins from the histogram. I have composed that array using the following code ( taken from the answer linked above ):
import numpy
import matplotlib.pyplot as pyplot
mydata = numpy.random.normal(-15, 1, 500) # this seems to have to be 'uneven' on either side of 0, otherwise the code looks fine. FYI, my actual data is all positive
pyplot.figure(1)
hist1 = pyplot.hist(mydata, bins=50, alpha=0.5, label='set 1', color='red')
hist1_flux = [hist1[0]/5.0, 0.5*(hist1[1][1:]+hist1[1][:-1])]
pyplot.figure(2)
pyplot.bar(hist1_flux[1], hist1_flux[0])
This code doesn't exactly match what's going on in my code; my data is composed of 1000 arrays of 1000 data points each ( voltages ). I have made histograms of that, which gives me number of occurrences of a given voltage range ( or bin width ). All I want to do is re-plot a histogram of the number of events per hour (so yaxis of the histogram / 5 hours) with the same original bin width, but when I divide hist1[0]/5 and replot in the above way, the 'bin width' is all wrong.
I feel like there must be an easier way to do this, rather than manually replotting my own histograms.
Thanks in advance, and I'm really sorry if I've missed something obvious.
The problem, illustrated in the output of my sample code AND my original data is as follows:
Upper plots: code snippet output.
Lower plots: My actual data.
It's because the bar function takes a width argument, which defaults to 0.8 (pyplot.bar(x, height, width=0.8, bottom=None, ...)), so you need to set it to the bin width, i.e. the distance between two neighbouring bin centres:
pyplot.bar(hist1_flux[1], hist1_flux[0],
           width=hist1_flux[1][1] - hist1_flux[1][0])
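The same rescaling can be sketched without plotting first, by computing the counts with numpy.histogram and taking the bar widths from the bin edges. This is only an illustration: the seed and the variable names rate, centers and widths are invented, and the 5 hours come from the question.

```python
import numpy as np

rng = np.random.default_rng(1)
mydata = rng.normal(-15, 1, 500)   # invented data, same distribution as in the question
hours = 5.0                        # observation time quoted in the question

counts, edges = np.histogram(mydata, bins=50)
rate = counts / hours                      # events per hour in each bin
centers = 0.5 * (edges[1:] + edges[:-1])   # bar positions
widths = np.diff(edges)                    # one width per bin, also valid for unequal bins
```

pyplot.bar(centers, rate, width=widths) then draws bars that touch, as in the original histogram. An alternative that avoids re-plotting by hand is to give every sample a weight of 1/hours, e.g. pyplot.hist(mydata, bins=50, weights=numpy.full(mydata.shape, 1.0 / hours)).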

Plotting histogram with matplotlib

I am trying to plot data as a histogram or bar chart in Python. The data size (array size) is between 0 and 10000. The data itself (each entry of the array) depends on the input and ranges between 0 and e+20 (mostly the values are in the same range). I want to plot with matplotlib how often a value falls in each interval (to illustrate the mean and deviation). Sometimes it works, like this:
hist1.
But sometimes there is a problem with the interval size, like this:
hist2.
In this plot I need more bars around 0-100, etc.
Can anyone help me with this?
The plots are just made with:
from numpy.linalg import *
import matplotlib.pyplot as plt
plt.hist(numbers,bins=100)
plt.show()
By default, hist produces a plot whose x range covers the full range of your data.
If you have one outlier at a very high x compared with the other values, you will see this 'compressed' figure.
If you always want the same view, you can fix the limits with xlim.
Alternatively, if you want your distribution to be always centred and displayed as clearly as possible, you can calculate the mean and the standard deviation of your data and set the x range accordingly (e.g. mean +/- 5 stdev).
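Note that xlim only zooms the view; the 100 bins are still spread over the full data range. Restricting the bins themselves can be done with the range argument of hist (and of numpy.histogram). The sketch below uses percentile-based limits instead of mean +/- 5 stdev, which is a swapped-in technique: a single value near e+20 inflates both the mean and the standard deviation, whereas percentiles barely move. The data and the 1%/99% cut-offs are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
# Invented data: 1000 ordinary values plus one extreme outlier
numbers = np.append(rng.normal(50.0, 10.0, 1000), 1e20)

# Limits from the 1st and 99th percentiles, which a single outlier barely moves
lo, hi = np.percentile(numbers, [1, 99])

# range= restricts the bins to [lo, hi]; values outside are ignored
counts, edges = np.histogram(numbers, bins=100, range=(lo, hi))
```

With matplotlib the equivalent call would be plt.hist(numbers, bins=100, range=(lo, hi)), which puts all 100 bars in the region where most of the data lies.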
