I'm trying to draw part of a histogram using matplotlib.
Instead of drawing the whole histogram which has a lot of outliers and large values I want to focus on just a small part. The original histogram looks like this:
hist(data, bins=arange(data.min(), data.max(), 1000), normed=1, cumulative=False)
plt.ylabel("PDF")
And after focusing it looks like this:
hist(data, bins=arange(0, 121, 1), normed=1, cumulative=False)
plt.ylabel("PDF")
Notice that the last bin is stretched and, worst of all, the Y ticks are scaled so that the bars sum to exactly 1 (so points outside the current range are not taken into account at all).
I know that I can achieve what I want by drawing the histogram over the whole possible range and then restricting the axis to the part I'm interested in, but it wastes a lot of time calculating bins that I won't use/see anyway.
hist(btsd-40, bins=arange(btsd.min(), btsd.max(), 1), normed=1, cumulative=False)
axis([0,120,0,0.0025])
Is there a fast and easy way to draw just the focused region but still get the Y scale correct?
In order to plot a subset of the histogram, I don't think you can avoid calculating the whole histogram.
Have you tried computing the histogram with numpy.histogram and then plotting a region using pylab.plot or something? I.e.
import numpy as np
import pylab as plt
data = np.random.normal(size=10000)*10000
plt.figure(0)
plt.hist(data, bins=np.arange(data.min(), data.max(), 1000))
plt.figure(1)
hist1 = np.histogram(data, bins=np.arange(data.min(), data.max(), 1000))
plt.bar(hist1[1][:-1], hist1[0], width=1000, align='edge')  # bars start at the left bin edges
plt.figure(2)
hist2 = np.histogram(data, bins=np.arange(data.min(), data.max(), 200))
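left_edges = hist2[1][:-1]  # one left edge per bin (hist2[1] has one more entry than hist2[0])
mask = (left_edges > 0) & (left_edges < 20000)
plt.bar(left_edges[mask], hist2[0][mask], width=200, align='edge')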
Original histogram:
Histogram calculated manually:
Histogram calculated manually, cropped:
(N.B.: values are smaller because bins are narrower)
I think you can normalize your data by passing explicit weights (repeat is a numpy function).
hist(data, bins=arange(0, 121, 1), weights=repeat(1.0/len(data), len(data)))
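A rough sketch of why the weights trick keeps the Y scale tied to the whole dataset: the counts are taken only over the focused bins, but every point contributes 1/len(data), so out-of-range points still count toward the total. The synthetic data and the names pdf_counts and pdf_weights are assumptions made for illustration. Dividing by the bin width turns the fractions into a density (with 1-unit bins the division changes nothing).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic long-tailed data standing in for the question's data (an assumption)
data = rng.exponential(scale=500.0, size=100_000)

bins = np.arange(0, 121, 1)   # only the focused region is binned
width = np.diff(bins)

# Option 1: count inside the focused bins, normalize by the size of the FULL dataset
counts, edges = np.histogram(data, bins=bins)
pdf_counts = counts / (len(data) * width)

# Option 2: the same thing expressed through per-point weights
weights = np.full(len(data), 1.0 / len(data))
weighted, _ = np.histogram(data, bins=bins, weights=weights)
pdf_weights = weighted / width

# The focused bars integrate to the fraction of data inside the range, not to 1
print("area under focused bars:", (pdf_counts * width).sum())
```

Something like plt.bar(edges[:-1], pdf_counts, width=width, align='edge') would then draw only the focused bins, without ever binning the outliers.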
I am plotting a density map of ~40k points, but hist2d returns a uniform density map. This is my code:
hist2d(x, y, bins=(1000, 1000), cmap=plt.cm.jet)
Here is the scatter plot
Here is the histogram
I was expecting a red horizontal band in the center that gradually turns blue towards higher/lower y values.
EDIT:
@bb1 suggested decreasing the number of bins, but after setting bins=(100, 1000) I get this result:
I think you are specifying too many bins. By setting bins=(1000, 1000) you get 1,000,000 bins. With 40,000 points, at most 4% of the bins can contain any data, so the empty bins overwhelm the image.
You may also consider using seaborn's kdeplot() function instead of plt.hist2d(). It visualizes the density of the data without subdividing it into bins:
import seaborn as sns
sns.kdeplot(x=x, y=y, levels = 100, fill=True, cmap="mako", thresh=0)
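To make the point about the bin count concrete, here is a small sketch that measures how many cells of the 2D histogram actually receive a point. The normally distributed x and y are synthetic stand-ins for the question's data, and occupied_fraction is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 40_000
# Synthetic stand-in for the question's ~40k points (an assumption)
x = rng.normal(size=n_points)
y = rng.normal(size=n_points)

def occupied_fraction(n_bins):
    """Fraction of the n_bins x n_bins cells that hold at least one point."""
    counts, _, _ = np.histogram2d(x, y, bins=(n_bins, n_bins))
    return np.count_nonzero(counts) / counts.size

frac_fine = occupied_fraction(1000)   # 1,000,000 cells for 40,000 points
frac_coarse = occupied_fraction(50)   # 2,500 cells for 40,000 points
print(f"1000 x 1000 bins: {frac_fine:.1%} of cells occupied")
print(f"  50 x 50 bins:   {frac_coarse:.1%} of cells occupied")
```

With 1000 x 1000 bins the occupied fraction can never exceed 40,000 / 1,000,000, which is why the picture looks uniformly empty.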
I am using matplotlib's hist2d function to make a 2d histogram of data that I have, however I am having trouble interpreting the result.
Here is the plot I have:
This was created using the line:
hist = plt.hist2d(X, Y, (160,160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
This returns a 2d array of (160, 160), as well as the bin edges etc.
In the plot there are bins which have a high frequency of values (yellow bins). I would like to be able to get the results of this histogram and filter out the bins that have low values, preserving the high bins. I would expect there to be 160*160 values, but I can only find 160 X and 160 Y values.
What I would like to do is essentially filter out the more dense data from the less dense data. If this means representing the data as a single value (a bin), then that is ok.
Am I misinterpreting the function, or am I not accessing the results correctly? I have also tried with scipy, but the results seem to be in the same or a similar format.
Not sure if this is what you wanted.
The hist2d docs specify that the function returns a tuple of size 4, where the first item h is a heatmap.
This h will have the shape given by bins, here (160, 160), so it holds all 160*160 bin counts.
You can capture the output (it will still plot), and use argwhere to find coordinates where values exceed, say, the 90th percentile:
h, xedges, yedges, img = hist = plt.hist2d(X, Y, bins=(160,160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
print(list(np.argwhere(h > np.percentile(h, 90))))
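If the goal is to keep the points that fall into the dense bins, rather than only the bin indices, one possible sketch is to compute the same counts with np.histogram2d and map every point back to its bin. The synthetic X and Y, the 90th-percentile cut over the non-empty bins, and the names dense, ix and iy are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for the question's X and Y (an assumption)
X = rng.normal(size=5000)
Y = rng.normal(size=5000)

# h[i, j] counts the points whose X lies in x-bin i and whose Y lies in y-bin j
h, xedges, yedges = np.histogram2d(X, Y, bins=(160, 160))

# Call a bin "dense" if it reaches the 90th percentile of the non-empty bins
threshold = np.percentile(h[h > 0], 90)
dense = h >= threshold

# Map each point to its bin index; the last bin also includes its right edge
ix = np.clip(np.searchsorted(xedges, X, side="right") - 1, 0, h.shape[0] - 1)
iy = np.clip(np.searchsorted(yedges, Y, side="right") - 1, 0, h.shape[1] - 1)

# Keep only the points that sit in a dense bin
in_dense = dense[ix, iy]
X_dense, Y_dense = X[in_dense], Y[in_dense]
print(h.shape, "bins;", in_dense.sum(), "of", len(X), "points are in dense bins")
```

Taking the percentile over the non-empty bins only is a design choice: with 160 x 160 bins most cells are empty, so a percentile over all of h can easily come out as 0.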
You need the seaborn package.
You mentioned
I would like to be able to get the results of this histogram and filter out the bins that have low values, preserving the high bins.
You should definitely be using one of these:
seaborn.jointplot(..., kind='hex'): shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets.
seaborn.jointplot(..., kind='kde'): uses kernel density estimation to visualize a bivariate distribution. This is the one I recommend.
Example 'kde'
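Use the number of contour levels (levels, called n_levels in older seaborn releases) and a density threshold (thresh, which replaces shade_lowest=False) to ignore low values.
import seaborn as sns
import numpy as np
import matplotlib.pylab as plt
x, y = np.random.randn(2, 300)
plt.figure(figsize=(6,5))
# seaborn 0.11+ spelling: x/y as keywords, fill instead of shade,
# levels instead of n_levels, thresh instead of shade_lowest
sns.kdeplot(x=x, y=y, zorder=0, levels=6, fill=True, cbar=True,
            thresh=0.05, cmap='viridis')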
I have only the angle values for a set of data. Now I need to plot an angle distribution curve, i.e. the angle on the x axis versus the number of times (frequency) each angle occurs on the y axis.
These are the angles sorted out for a set of data:-
[98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381, 101.70374296,
103.15715294, 104.4653976,105.50441485, 106.82885361, 107.4605319, 108.93228646,
111.22463712, 112.23658018, 113.31223886, 113.4000603, 114.14565594, 114.79809084,
115.15788861, 115.42991416, 115.66216071, 115.69821092, 116.56319054, 117.09232139,
119.30835385, 119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
136.6751744, 138.34164387,]
How can I do this?
How can I write a Python program for this and plot the result as a distribution curve?
The hist function returns the bin counts and the bin edges. You can use these to prepare the data for the line plot:
y, x, _ = plt.hist(angles) # No need for the 3rd return value
xc = (x[:-1] + x[1:]) / 2 # Take centerpoints
# plt.clf()
plt.plot(xc, y)
plt.show() # Etc.
You will end up having both the histogram and the line plot. If this is not desirable, clean the canvas before plotting the line by uncommenting the call to clf().
EDIT:
If you want a line plot as well, it is better to generate the histogram with numpy and then use that information also for the line:
from matplotlib import pyplot as plt
import numpy as np
angles = [98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381,
101.70374296, 103.15715294, 104.4653976, 105.50441485, 106.82885361,
107.4605319, 108.93228646, 111.22463712, 112.23658018, 113.31223886,
113.4000603, 114.14565594, 114.79809084, 115.15788861, 115.42991416,
115.66216071, 115.69821092, 116.56319054, 117.09232139, 119.30835385,
119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
136.6751744, 138.34164387, ]
hist,edges = np.histogram(angles, bins=20)
bin_centers = 0.5*(edges[:-1] + edges[1:])
bin_widths = (edges[1:]-edges[:-1])
plt.bar(bin_centers,hist,width=bin_widths)
plt.plot(bin_centers, hist,'r')
plt.xlabel(r'angle [$^\circ$]')  # raw string so that \c is not treated as an escape sequence
plt.ylabel('frequency')
plt.show()
this looks like this:
If you are not interested in the histogram itself, leave out the line plt.bar(bin_centers,hist,width=bin_widths).
EDIT2:
I don't really see the scientific value in a smoothed histogram. If you increase the resolution of the histogram (the bins parameter in the np.histogram command), it can change quite considerably. For instance, new peaks may occur if you increase the bin count, or two peaks may merge into one if you decrease the bin count. Keeping this in mind, smoothing the histogram curve suggests that you have more data than you do. However, if you really must, you can smooth a curve as explained in this answer, i.e.
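# scipy.interpolate.spline has been removed from SciPy; make_interp_spline replaces it
from scipy.interpolate import make_interp_spline
x = np.linspace(bin_centers[0], bin_centers[-1], 500)  # stay within the bin centers to avoid extrapolating
y = make_interp_spline(bin_centers, hist, k=3)(x)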
and then plot y over x.
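To see how strongly the shape depends on the bin count, here is a small sketch that bins the same 32 angles with a few different resolutions. The choice of 5, 10 and 20 bins and the name counts_by_bins are arbitrary assumptions for illustration.

```python
import numpy as np

angles = [98.1706427, 99.09896751, 99.10879006, 100.47518838, 101.22770381,
          101.70374296, 103.15715294, 104.4653976, 105.50441485, 106.82885361,
          107.4605319, 108.93228646, 111.22463712, 112.23658018, 113.31223886,
          113.4000603, 114.14565594, 114.79809084, 115.15788861, 115.42991416,
          115.66216071, 115.69821092, 116.56319054, 117.09232139, 119.30835385,
          119.31377834, 125.88278338, 127.80937901, 132.16187185, 132.61262906,
          136.6751744, 138.34164387]

# Bin the same data at several resolutions and compare the counts
counts_by_bins = {}
for n_bins in (5, 10, 20):
    counts, edges = np.histogram(angles, bins=n_bins)
    counts_by_bins[n_bins] = counts
    print(f"{n_bins:2d} bins: {counts.tolist()}")
```

Every row sums to the same 32 observations, yet the number and position of the peaks differ from row to row, which is the caveat to keep in mind before smoothing any one of them.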
I have a CDF plot of wifi usage data in MB. To make it easier to read, I would like to present the usage starting at KB and ending at TB. How can I set specific x-axis ticks to replace the ones produced by plt.plot(), so that the x axis shows, for example, [1KB 10KB 1MB 10MB 1TB 10TB], even if the spacing between the ticks does not reflect the real values?
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks, and I already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represent the sample range I specified above. But the plot is still wrong.
The result expected
In Excel it is pretty easy to get what I want. Look at the image: with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated by using np.histogram(). It can take the desired bins as argument.
To get the cumulative histogram, np.cumsum does the job.
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. It is very close to the Excel solution from the question. In this case the x values of the plot are simply the integers from 0 to the number of bins, and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#use the bin from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of its actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
    axes.set_ylim([0, 1.05])
plt.show()
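The tick labels in the code above are the raw bin edges in MB. To get labels such as 1KB or 10TB as asked in the question, a small formatting helper could be used. The function format_mb below is hypothetical, written only for this illustration, and it assumes binary units (1 GB = 1024 MB).

```python
# Bin edges in MB, taken from the question (the leading 0 is skipped)
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]

def format_mb(value_mb):
    """Hypothetical helper: render a size given in MB as KB, MB, GB or TB."""
    units = [("TB", 1024.0 ** 2), ("GB", 1024.0), ("MB", 1.0), ("KB", 1.0 / 1024.0)]
    for name, size_in_mb in units:
        # The 0.995 factor tolerates rounded edges such as 0.00098 MB for 1 KB
        if value_mb >= size_in_mb * 0.995:
            return f"{value_mb / size_in_mb:.0f}{name}"
    # Anything smaller than about 1 KB is shown in bytes
    return f"{value_mb * 1024 * 1024:.0f}B"

labels = [format_mb(edge) for edge in bins[1:]]
print(labels)
```

The resulting list can be passed in place of bins[1:] to set_xticklabels, for example ax2.set_xticklabels(labels, rotation=45, horizontalalignment="right").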
I think this is a simple question, but I just still can't seem to think of a simple solution. I have a set of data of molecular abundances, with values ranging many orders of magnitude. I want to represent these abundances with boxplots (box-and-whiskers plots), and I want the boxes to be calculated on log scale because of the wide range of values.
I know I can just calculate the log10 of the data and send it to matplotlib's boxplot, but this does not retain the logarithmic scale in plots later.
So my question is basically this:
When I have calculated a boxplot based on the log10 of my values, how do I convert the plot afterward to be shown on a logarithmic scale instead of linear with the log10 values?
I can change tick labels to partly fix this, but I have no clue how I get logarithmic scales back to the plot.
Or is there another, more direct way of plotting this? A different package, maybe, that already has this option included?
Many thanks for the help.
I'd advise against doing the boxplot on the raw values and setting the y-axis to logarithmic, because the boxplot function is not designed to work across orders of magnitude and you may get too many outliers (depends on your data, of course).
Instead, you can plot the logarithm of the data and manually adjust the y-labels.
Here is a very crude example:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
np.random.seed(42)
values = 10 ** np.random.uniform(-3, 3, size=100)
fig = plt.figure(figsize=(9, 3))
ax = plt.subplot(1, 3, 1)
ax.boxplot(np.log10(values))
ax.set_yticks(np.arange(-3, 4))
ax.set_yticklabels(10.0**np.arange(-3, 4))
ax.set_title('log')
ax = plt.subplot(1, 3, 2)
ax.boxplot(values)
ax.set_yscale('log')
ax.set_title('raw')
ax = plt.subplot(1, 3, 3)
ax.boxplot(values, whis=[5, 95])
ax.set_yscale('log')
ax.set_title('5%')
plt.show()
The middle panel ('raw') shows the box plot on the raw values. This leads to many outliers, because the maximum whisker length is computed as a multiple (default: 1.5) of the interquartile range (the box height), which does not scale across orders of magnitude.
Alternatively, you could specify to draw the whiskers for a given percentile range:
ax.boxplot(values, whis=[5, 95])
In this case you get a fixed share of outliers (5%) above and below.
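The effect of the 1.5 x IQR rule can be checked numerically without drawing anything. The sketch below applies that rule to the raw values and to their log10; count_outliers is a hypothetical helper, and the data is synthetic (log-uniform over six orders of magnitude, as in the example above, though drawn with a different random generator).

```python
import numpy as np

rng = np.random.default_rng(42)
# Values spread log-uniformly over six orders of magnitude (synthetic data)
values = 10 ** rng.uniform(-3, 3, size=100)

def count_outliers(sample, whis=1.5):
    """Count points lying more than whis * IQR outside the quartiles."""
    q1, q3 = np.percentile(sample, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((sample < lower) | (sample > upper)))

n_raw = count_outliers(values)            # rule applied on the linear scale
n_log = count_outliers(np.log10(values))  # rule applied on the log scale
print("outliers on raw values:  ", n_raw)
print("outliers on log10 values:", n_log)
```

On the linear scale the upper quartile dominates the IQR, so everything well above it is flagged, while on the log scale the same data fits inside the whiskers.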
You can use plt.yscale:
plt.boxplot(data); plt.yscale('log')