I have a dataset with 13k Kickstarter projects and their tweets over the duration of a project. Each project contains a list with the number of tweets for each day,
e.g. [10, 2, 4, 7, 2, 4, 3, 0, 4, 0, 1, 3, 0, 3, 4, 0, 0, 2, 3, 2, 0, 4, 5, 1, 0, 2, 0, 2, 1, 2, 0].
I've taken a subset of the data by fixing the project duration at 31 days, so that every list has the same length of 31 values.
This piece of code prints each list of tweets:
for project in data:
    print(data[project]["tweets"])
What is the easiest way to plot a histogram with matplotlib? I need a frequency distribution of the total number of tweets for each day. How do I count the values from each index? Is there an easy way to do this using Pandas?
The lists are also accessible in a Pandas data frame:
df = pd.DataFrame.from_dict(data, orient='index')
df1 = df[['tweets']]
A histogram is probably not what you need. It's a good solution if you have a list of numbers (for example, people's IQs) and you want to assign each number to a category (e.g. below 80, 80-99, 100+). There would be 3 bins, and the height of each bin would represent how many numbers fall into the corresponding category.
In your case, you already have the height of each bin, so (as I understand it) what you want is a plot that looks like a histogram. As far as I understand, matplotlib's hist function does not support this directly, because it expects the raw values rather than precomputed heights.
If you're OK with using a plain plot instead of a histogram, here's what you can do:
import matplotlib.pyplot as plt
lists = [data[project]["tweets"] for project in data] # Collect all lists into one
sum_list = [sum(x) for x in zip(*lists)] # Create a list with sums of tweets for each day
plt.plot(sum_list) # Create a plot for sum_list
plt.show() # Show the plot
If you want the plot to look like a histogram, use plt.bar instead of plt.plot:
plt.bar(range(len(sum_list)), sum_list)
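Since the question also asks about Pandas, here is a minimal sketch of the same per-day sum done on the data frame. The `data` dict below is a made-up stand-in with 3 days instead of 31, and the names `per_day` and `daily_totals` are only illustrative.

```python
import pandas as pd

# Toy stand-in for the real `data` dict (3 days instead of 31)
data = {
    "project_a": {"tweets": [10, 2, 4]},
    "project_b": {"tweets": [1, 0, 3]},
    "project_c": {"tweets": [0, 5, 1]},
}

df = pd.DataFrame.from_dict(data, orient="index")

# Expand the column of lists into one column per day,
# then sum down each column to get the total tweets per day
per_day = pd.DataFrame(df["tweets"].tolist(), index=df.index)
daily_totals = per_day.sum(axis=0)

print(daily_totals.tolist())  # [11, 7, 8]
```

Calling daily_totals.plot.bar() should then draw the totals as bars, much like the plt.bar call above.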
I feel like I am going crazy. What I want to do is create a scatter plot with an x axis split into 10 segments, with multiple values in each segment in the Y axis. For example:
PP = [pp1m, pp2m, pp3m, pp4m, pp5m, pp6m, pp7m, pp8m, pp9m, pp10m]
timeline = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
plt.scatter(timeline, PP)
In the above, PP consists of 10 lists containing 33 values each. Am I using the wrong graph type or just organizing my data incorrectly?
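You need to iterate through the PP list, pairing each sublist with its x position:
for x, sublist in zip(timeline, PP):
    plt.scatter([x] * len(sublist), sublist)  # repeat x so both arguments have the same length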
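Because plt.scatter needs x and y of equal length, each x position has to be repeated once per value in its sublist. A minimal sketch that builds both flat arrays with NumPy so one scatter call is enough; the contents of PP here are randomly generated stand-ins, not the asker's data.

```python
import numpy as np

# Hypothetical stand-in for PP: 10 sublists of 33 values each
rng = np.random.default_rng(0)
PP = [rng.normal(loc=i, size=33) for i in range(10)]
timeline = np.arange(1, 11)

# Repeat each x position once per value in its sublist, then flatten the y values
x = np.repeat(timeline, [len(sub) for sub in PP])
y = np.concatenate(PP)

# x and y now have the same length, so a single call draws everything:
# import matplotlib.pyplot as plt; plt.scatter(x, y); plt.show()
```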
While reading up on numpy, I encountered the function numpy.histogram().
What is it for and how does it work? In the docs they mention bins: What are they?
Some googling led me to the definition of Histograms in general. I get that. But unfortunately I can't link this knowledge to the examples given in the docs.
A bin is a range that represents the width of a single bar of the histogram along the X-axis. You could also call this the interval. (Wikipedia defines them more formally as "disjoint categories".)
The NumPy histogram function doesn't draw the histogram, but it counts the occurrences of input data that fall within each bin, which in turn determines the area (not necessarily the height if the bins aren't of equal width) of each bar.
In this example:
np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
There are 3 bins, for values ranging from 0 to 1 (excl. 1), 1 to 2 (excl. 2) and 2 to 3 (incl. 3), respectively. Here NumPy defines the bins from a list of delimiters ([0, 1, 2, 3]). It also returns the bin edges in the results, because it can choose them automatically from the input if none are specified. If bins=5, for example, it will use 5 bins of equal width spread between the minimum input value and the maximum input value.
The input values are 1, 2 and 1. Therefore, bin "1 to 2" contains two occurrences (the two 1 values), and bin "2 to 3" contains one occurrence (the 2). These results are in the first item in the returned tuple: array([0, 2, 1]).
Since the bins here are of equal width, you can use the number of occurrences for the height of each bar. When drawn, you would have:
a bar of height 0 for range/bin [0,1] on the X-axis,
a bar of height 2 for range/bin [1,2],
a bar of height 1 for range/bin [2,3].
You can plot this directly with Matplotlib (its hist function also returns the bins and the values):
>>> import matplotlib.pyplot as plt
>>> plt.hist([1, 2, 1], bins=[0, 1, 2, 3])
(array([0, 2, 1]), array([0, 1, 2, 3]), <a list of 3 Patch objects>)
>>> plt.show()
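The binning rules described above can be checked without plotting. This is a small sketch using only np.histogram; the variable names are illustrative.

```python
import numpy as np

# Explicit edges: bins are [0, 1), [1, 2) and [2, 3]
counts, edges = np.histogram([1, 2, 1], bins=[0, 1, 2, 3])
print(counts)  # [0 2 1]

# An integer bin count: 5 equal-width bins between the min (1) and max (2) of the input,
# which means 6 edges are returned
auto_counts, auto_edges = np.histogram([1, 2, 1], bins=5)
print(auto_counts)  # [2 0 0 0 1]

# Only the last bin includes its right edge, so the value 3 lands in the third bin
last_counts, _ = np.histogram([3], bins=[0, 1, 2, 3])
print(last_counts)  # [0 0 1]
```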
import numpy as np
hist, bin_edges = np.histogram([1, 1, 2, 2, 2, 2, 3], bins = range(5))
Below, hist indicates that there are 0 items in bin #0, 2 in bin #1, 4 in bin #2, and 1 in bin #3.
print(hist)
# array([0, 2, 4, 1])
bin_edges indicates that bin #0 is the interval [0,1), bin #1 is [1,2), ..., and bin #3 is [3,4] (the last bin also includes its right edge).
print(bin_edges)
# array([0, 1, 2, 3, 4])
Play with the above code, change the input to np.histogram and see how it works.
But a picture is worth a thousand words:
import matplotlib.pyplot as plt
plt.bar(bin_edges[:-1], hist, width=1, align='edge')  # align='edge' puts each bar's left side on its bin edge
plt.xlim(min(bin_edges), max(bin_edges))
plt.show()
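The earlier remark that the counts determine the area of a bar, not necessarily its height, only matters when the bins differ in width. A minimal sketch of that case, assuming the density=True option of np.histogram is used to get heights that account for bin width:

```python
import numpy as np

values = [1, 2, 1]
unequal_edges = [0, 1, 3]  # first bin has width 1, second bin has width 2

counts, edges = np.histogram(values, bins=unequal_edges)
density, _ = np.histogram(values, bins=unequal_edges, density=True)
widths = np.diff(edges)

print(counts)   # [0 3]
print(density)  # [0.  0.5]

# Height times width gives each bar's share of the data; the areas sum to 1
areas = density * widths
```

Here all three values fall into the wide bin, so its bar has area 1 but height 0.5.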
Another useful thing to do with numpy.histogram is to plot the output as the x and y coordinates on a line graph. For example:
arr = np.random.randint(1, 51, 500)
y, x = np.histogram(arr, bins=np.arange(1, 52))  # edges 1..51: one bin per integer value 1..50
fig, ax = plt.subplots()
ax.plot(x[:-1], y)
fig.show()
This can be a useful way to visualize histograms where you would like a higher level of granularity without bars everywhere. Very useful in image histograms for identifying extreme pixel values.
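As a sketch of the image use case mentioned above: with one bin per 8-bit intensity, the first and last non-empty bins give the darkest and brightest pixel values. The image here is synthetic random data, and the variable names are illustrative.

```python
import numpy as np

# Synthetic 8-bit "image" with values between 20 and 235 (stand-in for real pixel data)
rng = np.random.default_rng(0)
image = rng.integers(20, 236, size=(64, 64), dtype=np.uint8)

# One bin per possible intensity: edges 0, 1, ..., 256
counts, edges = np.histogram(image, bins=256, range=(0, 256))

# Indices of non-empty bins; the first and last are the extreme pixel values
occupied = np.nonzero(counts)[0]
darkest, brightest = occupied[0], occupied[-1]

# counts can then be drawn as a line, e.g. ax.plot(edges[:-1], counts)
```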
I need to write a function in Python 3 which returns an array of positions (x,y) on a rectangular field (e.g. 100x100 points) that are scattered according to a homogeneous spatial Poisson process.
So far I have found this resource with Python code, but unfortunately, I'm unable to find/install scipy for Python 3:
http://connor-johnson.com/2014/02/25/spatial-point-processes/
It has helped me understand what a Poisson point process actually is and how it works, though.
I have been playing around with numpy.random.poisson for a while now, but I am having a tough time interpreting what it returns.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.poisson.html
>>> import numpy as np
>>> np.random.poisson(1, (1, 5, 5))
array([[[0, 2, 0, 1, 0],
[3, 2, 0, 2, 1],
[0, 1, 3, 3, 2],
[0, 1, 2, 0, 2],
[1, 2, 1, 0, 3]]])
What I think that command does is create one 5x5 field (size = (1, 5, 5)) and scatter objects with a rate of lambda = 1 over that field. The numbers displayed in the resulting matrix are the probability of an object lying on that specific position.
How can I scatter, say, ten objects over that 5x5 field according to a homogeneous spatial Poisson process? My first guess would be to iterate over the whole array and insert an object on every position with a "3", then one on every other position with a "2", and so on, but I'm unsure of the actual probability I should use to determine if an object should be inserted or not.
According to the following resource, I can simulate 10 objects being scattered over a field with a rate of 1 by simply multiplying the rate and the object count (10*1 = 10) and using that value as my lambda, i.e.
>>> np.random.poisson(10, (1, 5, 5))
array([[[12, 12, 10, 16, 16],
[ 8, 6, 8, 12, 9],
[12, 4, 10, 3, 8],
[15, 10, 10, 15, 7],
[ 8, 13, 12, 9, 7]]])
However, I don't see how that should make things easier. I only increase the rate at which objects appear by 10 that way.
Poisson point process in matlab
To sum it up, my primary question is: How can I use numpy.random.poisson(lam, size) to model a number n of objects being scattered over a 2-dimensional field dx*dy?
It seems I've looked at the problem in the wrong way. After more offline research I found out that it is actually sufficient to draw one random Poisson value, which represents the number of objects, for example
n = np.random.poisson(100)
and then create the same number of random values between 0 and 1:
x = np.random.rand(n)
y = np.random.rand(n)
Now I just need to join the two arrays of x- and y-values to an array of (x,y) tuples. Those are the random positions I was looking for. I can multiply every x and y value by the side length of my field, e.g. 100, to scale the values to the 100x100 field I want to display.
I thought that the "randomness" of those positions should be determined by a random Poisson process, but it seems that just the number of positions needs to be determined by it, not the actual positional values.
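Putting those steps into the function the question asks for could look like the sketch below. The function name poisson_points and its parameters are made up for illustration; intensity is taken to mean the expected number of points per unit area, so the expected total is intensity * width * height.

```python
import numpy as np

def poisson_points(width, height, intensity, rng=None):
    """Return an (n, 2) array of (x, y) positions on a width x height field.

    n is itself random: it is drawn from a Poisson distribution with
    mean intensity * width * height. The positions are then uniform.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = rng.poisson(intensity * width * height)
    x = rng.uniform(0, width, n)
    y = rng.uniform(0, height, n)
    return np.column_stack((x, y))  # join x and y into rows of (x, y)

# About 100 points on average on a 100x100 field
points = poisson_points(100, 100, intensity=0.01, rng=np.random.default_rng(42))
```

Passing a seeded generator, as in the usage line, makes the result reproducible; leaving rng out gives a fresh random pattern each call.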
That's all correct. You definitely don't need SciPy, though I also used it when I first simulated a Poisson point process in Python. I presented the original code, with details of the simulation process, in this post:
https://hpaulkeeler.com/poisson-point-process-simulation/
I just use NumPy in the more recent code:
import numpy as np  # NumPy package for arrays, random number generation, etc.
import matplotlib.pyplot as plt  # for plotting
# Simulation window parameters
xMin = 0
xMax = 1
yMin = 0
yMax = 1
xDelta = xMax - xMin  # rectangle dimensions
yDelta = yMax - yMin
areaTotal = xDelta * yDelta
# Point process parameters
lambda0 = 100  # intensity (i.e. mean density) of the Poisson process
# Simulate a Poisson point process
numbPoints = np.random.poisson(lambda0 * areaTotal)  # Poisson number of points
xx = xDelta * np.random.uniform(0, 1, numbPoints) + xMin  # x coordinates of Poisson points
yy = yDelta * np.random.uniform(0, 1, numbPoints) + yMin  # y coordinates of Poisson points
The code can also be found here:
https://github.com/hpaulkeeler/posts/tree/master/PoissonRectangle
I've also uploaded more Python (and MATLAB and Julia) code there for simulating several point processes, including Poisson point processes on various shapes and cluster point processes.
https://github.com/hpaulkeeler/posts
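To connect this back to the array in the question: the values returned by np.random.poisson(1, (5, 5)) can be read as counts, that is, how many points fall in each unit cell, rather than probabilities. The following is a sketch under that reading. It places each cell's count of points uniformly inside that cell, which should give the same kind of pattern as drawing one total count and scattering uniformly over the whole field. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Number of points in each unit cell of a 5x5 field, rate 1 per cell
counts = rng.poisson(1.0, size=(5, 5))

# Repeat each cell's (row, col) index once per point that falls in it
rows, cols = np.indices(counts.shape)
cell_x = np.repeat(cols.ravel(), counts.ravel())
cell_y = np.repeat(rows.ravel(), counts.ravel())

# Place every point uniformly inside its own cell
x = cell_x + rng.random(cell_x.size)
y = cell_y + rng.random(cell_y.size)
points = np.column_stack((x, y))
```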
I read data from binary files into numpy arrays with np.fromfile. These data represent Z values on a grid for which spacing and shape are known, so there is no problem reshaping the 1D array into the shape of the grid and plotting with plt.imshow. So if I have N grids, I can plot N subplots showing all data in one figure, but what I'd really like to do is plot them as one image.
I can't just stack the arrays because the data in each array is spaced differently and because they have different shapes.
My idea was to "supersample" all grids to the spacing of the finest grid, stack and plot but I am not sure that is such a good idea as these grid files can become quite large.
By the way: Let's say I wanted to do that, how do I go from:
0, 1, 2
3, 4, 5
to:
0, 0, 1, 1, 2, 2
0, 0, 1, 1, 2, 2
3, 3, 4, 4, 5, 5
3, 3, 4, 4, 5, 5
I'm open to any suggestions.
Thanks,
Shahar
The answer if you just plot is: don't. plt.imshow has a keyword argument extent which you can use to zoom the image when plotting. Other than that I would suggest scipy.ndimage.zoom: with order=0 it is equivalent to repeating values, but you can zoom to any size easily, or use a different order to get some smooth interpolation. np.repeat could be an option for very simple zooming too.
Here is an example:
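import numpy as np
import matplotlib.pyplot as plt
a = np.arange(9).reshape(3, 3)
b = np.arange(36).reshape(6, 6)
# extent=(left, right, bottom, top) places each grid in its own region of the same axes
plt.imshow(a, extent=(0, 1, 0, 1), interpolation='none')
plt.imshow(b, extent=(1, 2, 0, 1), interpolation='none')
# note: the scaling is "broken" until the limits are set
plt.xlim(0, 2)
plt.show()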
Of course, to get the same color range for both, you should add the vmin=... and vmax=... keywords.
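For the literal "repeat every value" case from the question (the 2x3 grid turned into 4x6), a minimal sketch with np.repeat applied along both axes. np.kron with a block of ones is shown as an equivalent alternative, and the factor of 2 is just the example's value.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
factor = 2  # how many times each value is repeated along each axis

# Repeat rows first, then columns
up = np.repeat(np.repeat(a, factor, axis=0), factor, axis=1)

# Same result via a Kronecker product with a block of ones
up_kron = np.kron(a, np.ones((factor, factor), dtype=a.dtype))

print(up)
# [[0 0 1 1 2 2]
#  [0 0 1 1 2 2]
#  [3 3 4 4 5 5]
#  [3 3 4 4 5 5]]
```

This multiplies memory use by factor squared, which is why plotting with extent is usually the cheaper route for large grids.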