While reading up on numpy, I encountered the function numpy.histogram().
What is it for and how does it work? In the docs they mention bins: What are they?
Some googling led me to the definition of Histograms in general. I get that. But unfortunately I can't link this knowledge to the examples given in the docs.
A bin is a range that represents the width of a single bar of the histogram along the X-axis. You could also call this the interval. (Wikipedia defines them more formally as "disjoint categories".)
The NumPy histogram function doesn't draw the histogram; it computes the number of input values that fall within each bin, which in turn determines the area (not necessarily the height, if the bins aren't of equal width) of each bar.
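To make the area-versus-height point concrete, here is a minimal sketch with two bins of unequal width. The bin edges are chosen only for illustration; density=True is the np.histogram option that returns heights normalized so the bar areas sum to 1.

```python
import numpy as np

data = [1, 2, 1]
edges = [0, 1, 3]  # two bins of unequal width: [0, 1) and [1, 3]

counts, _ = np.histogram(data, bins=edges)
density, _ = np.histogram(data, bins=edges, density=True)
widths = np.diff(edges)

print(counts)            # occurrences per bin
print(density)           # bar heights when density=True
print(density * widths)  # bar areas, which sum to 1
```

Here all three values land in the wide bin, so its count is 3 but its density height is only 0.5, because the height is divided by the bin width.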
In this example:
np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
There are 3 bins, for values ranging from 0 to 1 (excl. 1), 1 to 2 (excl. 2) and 2 to 3 (incl. 3), respectively. NumPy defines these bins by a list of delimiters ([0, 1, 2, 3] in this example). It also returns the bins in the results, since it can choose them automatically from the input if none are specified. If bins=5, for example, it will use 5 bins of equal width spread between the minimum input value and the maximum input value.
The input values are 1, 2 and 1. Therefore, bin "1 to 2" contains two occurrences (the two 1 values), and bin "2 to 3" contains one occurrence (the 2). These results are in the first item in the returned tuple: array([0, 2, 1]).
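As a quick check of the two ways of specifying bins, the following sketch runs np.histogram once with explicit edges and once with an integer bin count. The extra input value 3 in the first call is added here only to show that the last bin includes its right edge.

```python
import numpy as np

# Explicit edges: the value 3 lands in the last bin because its right edge is inclusive.
counts, edges = np.histogram([1, 2, 1, 3], bins=[0, 1, 2, 3])
print(counts)  # [0 2 2]
print(edges)   # [0 1 2 3]

# Integer bins: five equal-width bins between the minimum (1) and the maximum (2).
counts5, edges5 = np.histogram([1, 2, 1], bins=5)
print(counts5)  # [2 0 0 0 1]
print(edges5)   # six edges running from 1.0 to 2.0
```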
Since the bins here are of equal width, you can use the number of occurrences for the height of each bar. When drawn, you would have:
a bar of height 0 for range/bin [0,1) on the X-axis,
a bar of height 2 for range/bin [1,2),
a bar of height 1 for range/bin [2,3].
You can plot this directly with Matplotlib (its hist function also returns the bins and the values):
>>> import matplotlib.pyplot as plt
>>> plt.hist([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]), <a list of 3 Patch objects>)
>>> plt.show()
import numpy as np
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3], bins = range(5))
Below, hist indicates that there are 0 items in bin #0, 2 in bin #1, 4 in bin #2, 1 in bin #3.
print(hist)
# array([0, 2, 4, 1])
bin_edges indicates that bin #0 is the interval [0,1), bin #1 is [1,2), ...,
bin #3 is [3,4] (the last bin includes its right edge).
print(bin_edges)
# array([0, 1, 2, 3, 4])
Play with the above code, change the input to np.histogram and see how it works.
But a picture is worth a thousand words:
import matplotlib.pyplot as plt
plt.bar(bin_edges[:-1], hist, width=1, align='edge')  # align='edge' makes each bar start at its bin's left edge
plt.xlim(min(bin_edges), max(bin_edges))
plt.show()
Another useful thing to do with numpy.histogram is to plot the output as the x and y coordinates on a line graph. For example:
arr = np.random.randint(1, 51, 500)  # 500 random integers from 1 to 50
y, x = np.histogram(arr, bins=np.arange(1, 52))  # edges 1..51: one bin per integer value
fig, ax = plt.subplots()
ax.plot(x[:-1], y)
plt.show()
This can be a useful way to visualize histograms where you would like a higher level of granularity without bars everywhere. It is very useful in image histograms for identifying extreme pixel values.
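As a rough sketch of the image-histogram use case, the snippet below counts pixel intensities with one bin per level and then reads off the darkest and brightest levels that occur. The image is random data generated only for illustration; a real one would come from an image loader.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 8-bit-style grayscale image with values 0..255.
image = rng.integers(0, 256, size=(64, 64))

# One bin per intensity level: 256 bins covering the range 0 to 256.
counts, edges = np.histogram(image, bins=256, range=(0, 256))

# Left bin edges are the intensity levels themselves.
levels = edges[:-1].astype(int)
present = levels[counts > 0]
print(present.min(), present.max())  # darkest and brightest levels present
```

np.histogram flattens the 2D array, so counts can be plotted against levels with ax.plot in the same way as above.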
This code snippet:
import matplotlib.pyplot as plt
plt.plot(([1, 2, 3], [1, 2, 3]))
plt.show()
produces:
What function is being plotted here? Is this use case described in matplotlib documentation?
This snippet:
plt.plot(([1, 2, 3], [1, 2, 3], [2, 3, 4]))
produces:
From the new test case you provided, we can see it picks the i-th element of each list and builds a series from them.
So it ends up plotting the series y = {1, 1, 2}, y = {2, 2, 3} and y = {3, 3, 4}.
More generally, we can assume that passing a tuple of lists will plot multiple series.
Honestly, writing the input like that doesn't look very user-friendly, but there might be cases where it is more convenient.
The x-values are picked by default according to the docs:
The horizontal / vertical coordinates of the data points. x values are optional and default to range(len(y)).
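One way to check this behaviour is to inspect the Line2D objects that plot returns. This is a small sketch, not taken from the Matplotlib docs; the Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

y = ([1, 2, 3], [1, 2, 3], [2, 3, 4])
print(np.asarray(y).shape)  # (3, 3): three rows, three columns

fig, ax = plt.subplots()
lines = ax.plot(y)  # one Line2D per column of the 2D input

for line in lines:
    # x defaults to 0, 1, 2; y is one column of the input
    print(list(line.get_xdata()), list(line.get_ydata()))
```

Each printed y series is a column of the tuple of lists, which matches the series listed above.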
Calling plt.plot(y) is calling plot in the Axes class. Looking at the source code, the key description closest to your problem states the following for plotting multiple sets of data:
- If *x* and/or *y* are 2D arrays a separate data set will be drawn
for every column. If both *x* and *y* are 2D, they must have the
same shape. If only one of them is 2D with shape (N, m) the other
must have length N and will be used for every data set m.
Example:
>>> x = [1, 2, 3]
>>> y = np.array([[1, 2], [3, 4], [5, 6]])
>>> plot(x, y)
is equivalent to:
>>> for col in range(y.shape[1]):
... plot(x, y[:, col])
The main difference here compared to your example is that x is implicitly defined based on the length of your tuple (described elsewhere in the documentation) and that you are using a tuple rather than an np.array. I tried digging further into the source code to see where tuples would become arrays. In particular at line 1632: lines = [*self._get_lines(*args, data=data, **kwargs)] seems to be where the different lines are likely generated, but that is as far as I got.
Of note, this is one of three ways to plot multiple lines of data, this being the most compact.
I have a dataset with 13k Kickstarter projects and their tweets over the duration of a project. Each project contains a list with the number of tweets for each day,
e.g. [10, 2, 4, 7, 2, 4, 3, 0, 4, 0, 1, 3, 0, 3, 4, 0, 0, 2, 3, 2, 0, 4, 5, 1, 0, 2, 0, 2, 1, 2, 0].
I've taken a subset of the data by setting the duration of the projects to 31 days, so that each list has the same length, containing 31 values.
This piece of code prints each list of tweets:
for project in data:
    print(data[project]["tweets"])
What is the easiest way to plot a histogram with matplotlib? I need a frequency distribution of the total number of tweets for each day. How do I count the values from each index? Is there an easy way to do this using Pandas?
The lists are also accessible in a Pandas data frame:
df = pd.DataFrame.from_dict(data, orient='index')
df1 = df[['tweets']]
A histogram is probably not what you need. It's a good solution if you have a list of numbers (for example, people's IQs) and you want to assign each number to a category (e.g. 79-, 80-99, 100+). There will be 3 bins, and the height of each bin will represent how many numbers fall into the corresponding category.
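For illustration, the IQ case could be sketched with np.histogram as follows. The scores are made up, and 200 is an assumed upper bound so that the last bin has a finite right edge.

```python
import numpy as np

# Hypothetical IQ scores, made up for illustration.
iqs = [72, 85, 91, 99, 100, 104, 118, 131]

# Edges for the three categories: below 80, 80 to 99, 100 and above.
counts, edges = np.histogram(iqs, bins=[0, 80, 100, 200])
print(counts)  # [1 3 4]
```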
In your case, you already have the height of each bin, so (as I understand) what you want is a plot that looks like a histogram. This (as I understand) is not supported by matplotlib and would require using matplotlib in a way it was not intended to be used.
If you're OK with using plots instead of histograms, here's what you can do.
import matplotlib.pyplot as plt
lists = [data[project]["tweets"] for project in data] # Collect all lists into one
sum_list = [sum(x) for x in zip(*lists)] # Create a list with sums of tweets for each day
plt.plot(sum_list) # Create a plot for sum_list
plt.show() # Show the plot
If you want the plot to look like a histogram, use this instead of plt.plot:
plt.bar(range(len(sum_list)), sum_list)
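Since the question also asks about Pandas, here is one possible sketch of the same per-day sum. The small data dictionary is a hypothetical stand-in with the same shape as the question's data (project name mapped to a dict with a "tweets" list), shortened to three days.

```python
import pandas as pd

# Hypothetical stand-in for the question's data: project name -> {"tweets": [...]}.
data = {
    "project_a": {"tweets": [10, 2, 4]},
    "project_b": {"tweets": [0, 1, 3]},
    "project_c": {"tweets": [5, 0, 2]},
}

# One row per project, one column per day.
tweets = pd.DataFrame.from_dict(
    {name: info["tweets"] for name, info in data.items()}, orient="index"
)

# Column-wise sum gives the total number of tweets per day.
daily_totals = tweets.sum(axis=0)
print(daily_totals.tolist())  # [15, 3, 9]
```

With Matplotlib installed, daily_totals.plot.bar() should draw the same bars as the plt.bar call.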
I have many measurements of several quantities in an array, like this:
m = array([[2, 1, 3, 2, 1, 4, 2], # measurements for quantity A
[8, 7, 6, 7, 5, 6, 8], # measurements for quantity B
[0, 1, 2, 0, 3, 2, 1], # measurements for quantity C
[5, 6, 7, 5, 6, 5, 7]] # measurements for quantity D
)
The quantities are correlated and I need to plot various contour plots. Like "contours of B vs. D x A".
It is true that in the general case the functions might be not well defined -- for example in the above data, columns 0 and 3 show that for the same (D=5,A=2) point there are two distinct values for B (B=8 and B=7). But still, for some combinations I know there is a functional dependence, which I need plotted.
The contour() function from Matplotlib expects three arrays: X and Y can be 1D arrays, and Z has to be a 2D array with corresponding values. How should I prepare/extract these arrays from m?
You will probably want to use something like scipy.interpolate.griddata to prepare your Z arrays. This will interpolate your data to a regularly spaced 2D array, given your input X and Y, and a set of sorted, regularly spaced X and Y arrays which you will need for eventual plotting. For example, if X and Y contain data points between 1 and 10, then you need to construct a set of new X and Y with a step size that makes sense for your data, e.g.
Xout = numpy.linspace(1,10,10)
Yout = numpy.linspace(1,10,10)
To turn your Xout and Yout arrays into 2D arrays you can use numpy.meshgrid, e.g.
Xout_2d, Yout_2d = numpy.meshgrid(Xout,Yout)
Then you can use those new regularly spaced arrays to construct your interpolated Z array that you can use for plotting, e.g.
Zout = scipy.interpolate.griddata((X,Y),Z,(Xout_2d,Yout_2d))
This interpolated 2D Zout should be usable for a contour plot with Xout_2d and Yout_2d.
Extracting your arrays from m is simple. Since m has one row per quantity, you can unpack it directly:
A, B, C, D = m
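Putting the steps together, a self-contained sketch might look like the following. The scattered samples X, Y, Z are synthetic and only stand in for the real measurements; for "contours of B vs. D x A" they would presumably be D, A and B.

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical scattered samples of a smooth function z = f(x, y).
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, 200)
Y = rng.uniform(1, 10, 200)
Z = np.sin(X / 2) * np.cos(Y / 3)

# Regular grid spanning the same range.
Xout = np.linspace(1, 10, 10)
Yout = np.linspace(1, 10, 10)
Xout_2d, Yout_2d = np.meshgrid(Xout, Yout)

# Linear interpolation; grid points outside the convex hull of the samples become NaN.
Zout = griddata((X, Y), Z, (Xout_2d, Yout_2d), method="linear")
print(Zout.shape)  # (10, 10)

# With Matplotlib available, plt.contour(Xout_2d, Yout_2d, Zout) would draw the contours.
```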
I read data from binary files into numpy arrays with np.fromfile. These data represent Z values on a grid for which spacing and shape are known, so there is no problem reshaping the 1D array into the shape of the grid and plotting with plt.imshow. So if I have N grids, I can plot N subplots showing all data in one figure, but what I'd really like to do is plot them as one image.
I can't just stack the arrays because the data in each array is spaced differently and because they have different shapes.
My idea was to "supersample" all grids to the spacing of the finest grid, stack and plot but I am not sure that is such a good idea as these grid files can become quite large.
By the way: Let's say I wanted to do that, how do I go from:
0, 1, 2
3, 4, 5
to:
0, 0, 1, 1, 2, 2
0, 0, 1, 1, 2, 2
3, 3, 4, 4, 5, 5
3, 3, 4, 4, 5, 5
I'm open to any suggestions.
Thanks,
Shahar
If you only want to plot, the answer is: don't. plt.imshow has a keyword argument extent, which you can use to position and scale the image when plotting. Other than that, I would suggest scipy.ndimage.zoom; with order=0 it is equivalent to repeating values, but you can zoom to any size easily, or use a different order to get some smooth interpolation. np.repeat could be an option for very simple zooming too.
Here is an example:
a = np.arange(9).reshape(3,3)
b = np.arange(36).reshape(6,6)
plt.imshow(a, extent=[0,1,0,1], interpolation='none')
plt.imshow(b, extent=(1,2,0,1), interpolation='none')
# note scaling is "broke"
plt.xlim(0,2)
Of course, to get the same color range for both, you should add the vmin=... and vmax=... keywords.
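For the literal supersampling asked about in the question (turning the 2x3 grid into a 4x6 one), a minimal sketch using np.repeat is shown below, with np.kron as an alternative. The factor of 2 matches the question's example; other integer factors work the same way.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Repeat every row twice, then every column twice.
up = np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)
print(up)
# [[0 0 1 1 2 2]
#  [0 0 1 1 2 2]
#  [3 3 4 4 5 5]
#  [3 3 4 4 5 5]]

# np.kron with a block of ones gives the same result.
up_kron = np.kron(a, np.ones((2, 2), dtype=a.dtype))
```

Note that this multiplies memory use by the square of the factor, which is the concern raised in the question for large grids.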