save PyMC3 traceplot subplots to image file - python

I am trying, quite simply, to save the subplots generated by the PyMC3 traceplot function (see here) to an image file.
The function generates a numpy.ndarray (2d) of subplots.
I need to move or copy these subplots into a matplotlib.figure in order to save the image file. Everything I can find shows how to generate the figure's subplots first, then build them out.
As a minimal example, I lifted the sample PyMC3 code from here and added just a few lines in an attempt to handle the subplots.
from pymc3 import *
import theano.tensor as tt
from theano import as_op
from numpy import arange, array, empty
### Added these three lines relative to source #######################
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
__all__ = ['disasters_data', 'switchpoint', 'early_mean', 'late_mean', 'rate', 'disasters']
# Time series of recorded coal mining disasters in the UK from 1851 to 1962
disasters_data = array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,
                        3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,
                        2, 2, 3, 4, 2, 1, 3, 2, 2, 1, 1, 1, 1, 3, 0, 0,
                        1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,
                        0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,
                        3, 3, 1, 1, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,
                        0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
years = len(disasters_data)
@as_op(itypes=[tt.lscalar, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def rateFunc(switchpoint, early_mean, late_mean):
    out = empty(years)
    out[:switchpoint] = early_mean
    out[switchpoint:] = late_mean
    return out
with Model() as model:
    # Prior for distribution of switchpoint location
    switchpoint = DiscreteUniform('switchpoint', lower=0, upper=years)
    # Priors for pre- and post-switch mean number of disasters
    early_mean = Exponential('early_mean', lam=1.)
    late_mean = Exponential('late_mean', lam=1.)
    # Allocate appropriate Poisson rates to years before and after current switchpoint location
    rate = rateFunc(switchpoint, early_mean, late_mean)
    # Data likelihood
    disasters = Poisson('disasters', rate, observed=disasters_data)
    # Initial values for stochastic nodes
    start = {'early_mean': 2., 'late_mean': 3.}
    # Use slice sampler for means
    step1 = Slice([early_mean, late_mean])
    # Use Metropolis for switchpoint, since it accommodates discrete variables
    step2 = Metropolis([switchpoint])
    # njobs>1 works only with the most recent (mid August 2014) Theano version:
    # https://github.com/Theano/Theano/pull/2021
    tr = sample(1000, tune=500, start=start, step=[step1, step2], njobs=1)
### gnashing of teeth starts here ################################
fig, axarr = plt.subplots(3,2)
# This gives a KeyError
# axarr = traceplot(tr, axarr)
# This finishes without error
trarr = traceplot(tr)
# doesn't work
# axarr[0, 0] = trarr[0, 0]
fig.savefig("disaster.png")
I've tried a few variations along the subplot() and add_subplot() lines, to no avail; all errors point toward the fact that a figure's subplots must be created within that figure in the first place, not assigned from pre-existing subplots.
A different example (see here, about 80% of the way down, beginning with
### Mysterious code to be explained in Chapter 3.
) avoids the utility altogether and builds out the subplots manually, so maybe there's no good answer to this? Is the pymc3.traceplot output indeed an orphaned ndarray of subplots that can't be used?

I ran into the same problem. I am working with pymc3 3.5 and matplotlib 2.1.2.
I realized it's possible to export the traceplot by:
trarr = traceplot(tr)
fig = plt.gcf() # to get the current figure...
fig.savefig("disaster.png") # and save it directly

Can you print type(trarr[0,0]) and post the result?
First of all, matplotlib axes objects are part of a figure and can only live inside a figure. It is therefore not possible to simply take an axes and put it into a different figure. However, in your case it may be that fig.add_axes(trarr[0,0]) nonetheless works. I doubt it, but you can still try.
Apart from that, traceplot() has a keyword argument called ax.
ax : axes
Matplotlib axes. Defaults to None.
Although it is pretty unclear how you'd specify several subplots as one axes object, you can still play around with it. Try passing in a single axes, your own pre-created subplots axes array axarr, or only part of it.
Edit, just so no one overlooks the small note in the comments: according to the answer in the bug report, traceplot(tr, ax=axarr) is indeed reported to work just fine.
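For instance, a minimal sketch of that approach (the 3x2 grid is an assumption: one row per variable, with density and trace columns):
fig, axarr = plt.subplots(3, 2)  # one row per variable, two columns (density, trace)
traceplot(tr, ax=axarr)          # draw into our own axes instead of a fresh figure
fig.savefig("disaster.png")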

Related

Automatically detecting clusters in a 2d array/heatmap

I have run object detection on a video file and summed the seconds each pixel is activated, to find the amount of time an object is shown in each area; this gives me a 2D array of time values. Since these objects are in the same position in the video most of the time, some areas of the screen have much higher activation than others. Now I would like to find a way to automatically detect "clusters" without knowing the number of clusters beforehand. I have considered using something like k-means, and also read a little about finding local maxima, but I can't quite figure out how to put all this together or which method is best. Also, the objects vary in size, so I'm not sure the local-maximum method would work.
The final result would be a list of ids and maximum time value for each cluster.
[[3, 3, 3, 0, 0, 0, 0, 0, 0]
[3, 3, 3, 0, 0, 0, 2, 2, 2]
[3, 3, 3, 0, 0, 0, 2, 2, 2]
[0, 0, 0, 0, 0, 0, 2, 2, 2]]
From this example array I would end up with a list:
id | Seconds
1 | 3
2 | 2
I haven't tried much since I have no clue where to start. Any recommendations of methods, with code examples or links to where I can find them, would be greatly appreciated! :)
You could look at the different clustering methods in: https://scikit-learn.org/stable/modules/clustering.html
If you do not know the number of clusters beforehand, you might want to use an algorithm other than k-means (one that does not depend on the number of clusters). I would suggest reading about DBSCAN and HDBSCAN for this task; see the sketch below. Good luck :)
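A minimal sketch of the DBSCAN route, clustering the positions of active pixels and reporting the maximum time per cluster (the eps and min_samples values are illustrative assumptions, tuned to the toy array from the question):
import numpy as np
from sklearn.cluster import DBSCAN

times = np.array([[3, 3, 3, 0, 0, 0, 0, 0, 0],
                  [3, 3, 3, 0, 0, 0, 2, 2, 2],
                  [3, 3, 3, 0, 0, 0, 2, 2, 2],
                  [0, 0, 0, 0, 0, 0, 2, 2, 2]])

# cluster the (row, col) positions of all active pixels
coords = np.argwhere(times > 0)
labels = DBSCAN(eps=1.5, min_samples=4).fit_predict(coords)

# report the maximum time value per cluster (label -1 would be noise)
for cluster_id in sorted(set(labels) - {-1}):
    member = coords[labels == cluster_id]
    print(cluster_id + 1, "|", times[member[:, 0], member[:, 1]].max())
This prints "1 | 3" and "2 | 2" for the example above; for real data you would tune eps (neighborhood radius) and min_samples to your object sizes.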

How do I do this tensor transformation and preserve the gradients?

In Tensorflow, I have a float tensor T with shape [batch_size, 3]. For example, T[0] = [4, 4, 3].
I want to turn that into a size-5 one-hot encoding in order to yield entries from an embedding dictionary. In the above case, that would look like
T[0] = [[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 1, 0]].
If I can get it into this format, then I can multiply it by the embedding dictionary. However, this is in the middle of the graph and I need the gradients to flow through it. Is there a clever way to use stop_gradient a la How Can I Define Only the Gradient for a Tensorflow Subgraph? to make this work? I'm coming up short.
I was able to solve this in the following way:
import tensorflow as tf

expanded = tf.expand_dims(inputs, 2)  # raw float values, shape [batch_size, 3, 1]
embedding_input = tf.cast(tf.one_hot(tf.to_int32(inputs), 5), inputs.dtype)  # shape [batch_size, 3, 5]
# forward value is the one-hot tensor; gradients flow through `expanded`
embedding_input = tf.stop_gradient(embedding_input - expanded) + expanded
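In the forward pass the - expanded and + expanded cancel, so embedding_input equals the one-hot tensor; in the backward pass the stop_gradient term contributes nothing, so gradients flow through expanded as if the one-hot step were an identity. This is the usual straight-through estimator trick.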

argrelextrema and flat extrema

Function argrelextrema from scipy.signal does not detect flat extrema.
Example:
import numpy as np
from scipy.signal import argrelextrema
data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
argrelextrema(data, np.greater)
(array([2]),)
The first max (2) is detected; the second max (3, 3) is not.
Any workaround for this behaviour?
Thanks.
Short answer: Probably argrelextrema will not be flexible enough for your task. Consider writing your own function matching your needs.
Longer answer: Are you bound to use argrelextrema? If yes, then you can play around with the comparator and the order arguments of argrelextrema (see the reference).
For your easy example, it would be enough to choose np.greater_equal as the comparator.
>>> data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
>>> print(argrelextrema(data, np.greater_equal,order=1))
(array([2, 6, 7]),)
Note, however, that in this way
>>> data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 4, 1, 0 ])
>>> print(argrelextrema(data, np.greater_equal,order=1))
(array([2, 6, 8]),)
it behaves differently than you would probably like, finding both the first 3 and the 4 as maxima, since argrelextrema now sees everything as a maximum that is greater than or equal to its two nearest neighbors. You can now use the order argument to decide for how many neighbors this comparison must hold; choosing order=2 changes the example above so that the plateau of 3s is no longer found as a maximum:
>>> print(argrelextrema(data, np.greater_equal,order=2))
(array([2, 8]),)
There is, however, a downside to this - let's change the data once more:
>>> data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 4, 1, 5 ])
>>> print(argrelextrema(data, np.greater_equal,order=2))
(array([ 2, 10]),)
Adding another peak as the last value keeps you from finding your peak at 4, as argrelextrema now sees a second neighbor that is greater than 4 (which can be useful for noisy data, but is not necessarily the behavior expected in all cases).
Using argrelextrema, you will always be limited to binary operations between a fixed number of neighbors. Note, however, that all argrelextrema is doing in your example above is to return n if data[n] > data[n-1] and data[n] > data[n+1]. You could easily implement this yourself and then refine the rules, for example by checking the second neighbor in case the first neighbor has the same value (see the sketch below).
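For instance, a minimal sketch of such a hand-rolled variant (the name flat_argrelmax is my own) that counts a flat run once, reporting its first index:
import numpy as np

def flat_argrelmax(data):
    """Indices of local maxima, counting a flat run (plateau) once.

    Returns the first index of each plateau whose neighbors on both
    sides are strictly smaller. Endpoints are not considered maxima,
    mirroring argrelextrema's behavior.
    """
    maxima = []
    n = len(data)
    i = 1
    while i < n - 1:
        if data[i] > data[i - 1]:
            # walk to the end of a possible plateau
            j = i
            while j + 1 < n and data[j + 1] == data[i]:
                j += 1
            # a maximum only if the plateau drops off on the right
            if j + 1 < n and data[j + 1] < data[i]:
                maxima.append(i)
            i = j + 1
        else:
            i += 1
    return np.array(maxima)

data = np.array([0, 1, 2, 1, 0, 1, 3, 3, 1, 0])
print(flat_argrelmax(data))  # -> [2 6]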
For the sake of completeness, there seems to be a more elaborate function in scipy.signal, find_peaks_cwt. I have no experience using it, however, and therefore cannot give you more details about it.
I'm really surprised that no one figured out an answer to this. All you need to do is preprocess the array to remove duplicates that are located next to each other and you can run argrelextrema like so:
import numpy as np
from scipy.signal import argrelextrema
data = np.array([ 0, 1, 2, 1, 0, 1, 3, 3, 1, 0 ])
# mark every element that equals its left neighbor...
filter_table = [False] + list(np.equal(data[:-1], data[1:]))
# ...and drop those elements, collapsing each plateau to a single value
data = np.array([x for idx, x in enumerate(data) if not filter_table[idx]])
argrelextrema(data, np.greater)
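Note that the returned indices refer to the deduplicated array rather than the original one; if you need positions in the original data, keep track of the surviving indices (e.g. with np.flatnonzero(np.logical_not(filter_table))) and map the results back through them.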

How to plot histogram of multiple lists?

I have a dataset with 13k Kickstarter projects and their tweets over the duration of a project. Each project contains a list with the number of tweets for each day,
e.g. [10, 2, 4, 7, 2, 4, 3, 0, 4, 0, 1, 3, 0, 3, 4, 0, 0, 2, 3, 2, 0, 4, 5, 1, 0, 2, 0, 2, 1, 2, 0].
I've taken a subset of the data by setting the duration of the projects at 31 days, so that each list has the same length, containing 31 values.
This piece of code prints each list of tweets:
for project in data:
    print(data[project]["tweets"])
What is the easiest way to plot a histogram with matplotlib? I need a frequency distribution of the total number of tweets for each day. How do I sum the values at each index across all lists? Is there an easy way to do this using Pandas?
The lists are also accessible in a Pandas data frame:
df = pd.DataFrame.from_dict(data, orient='index')
df1 = df[['tweets']]
A histogram is probably not what you need. It's a good solution if you have a list of numbers (for example, IQs of people) and you want to assign each number to a category (e.g. 79 and below, 80-99, 100 and above). There would be 3 bins, and the height of each bin would represent the quantity of numbers that fall into the corresponding category.
In your case, you already have the height of each bin, so (as I understand) what you want is a plot that merely looks like a histogram. This (as I understand) is not supported by plt.hist and would mean using matplotlib in a way it was not intended to be used.
If you're OK with using plots instead of histograms, this is what you can do:
import matplotlib.pyplot as plt
lists = [data[project]["tweets"] for project in data] # Collect all lists into one
sum_list = [sum(x) for x in zip(*lists)] # Create a list with sums of tweets for each day
plt.plot(sum_list) # Create a plot for sum_list
plt.show() # Show the plot
If you want to make the plot look like a histogram, use
plt.bar(range(0, len(sum_list)), sum_list)
instead of plt.plot.
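Since the lists are already in a data frame, here is a minimal sketch of the Pandas route as well (assuming df1['tweets'] holds one 31-element list per project, as in the question):
import pandas as pd
import matplotlib.pyplot as plt

daily = pd.DataFrame(df1['tweets'].tolist())  # one row per project, one column per day
daily.sum(axis=0).plot(kind='bar')            # total tweets per day across all projects
plt.xlabel('day')
plt.ylabel('total tweets')
plt.show()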

Stitching grids of varying grid spacing

I read data from binary files into numpy arrays with np.fromfile. These data represent Z values on a grid whose spacing and shape are known, so there is no problem reshaping the 1D array into the shape of the grid and plotting it with plt.imshow. So if I have N grids I can plot N subplots showing all the data in one figure, but what I'd really like to do is plot them as one image.
I can't just stack the arrays because the data in each array is spaced differently and because they have different shapes.
My idea was to "supersample" all grids to the spacing of the finest grid, stack and plot but I am not sure that is such a good idea as these grid files can become quite large.
By the way: Let's say I wanted to do that, how do I go from:
0, 1, 2
3, 4, 5
to:
0, 0, 1, 1, 2, 2
0, 0, 1, 1, 2, 2
3, 3, 4, 4, 5, 5
3, 3, 4, 4, 5, 5
I'm open to any suggestions.
Thanks,
Shahar
The answer, if you just want to plot, is: don't. plt.imshow has a keyword argument extent which you can use to scale the image when plotting. Other than that, I would suggest scipy.ndimage.zoom; with order=0 it is equivalent to repeating values, but you can zoom to any size easily, or use a different order to get smooth interpolation. np.repeat could be an option for very simple zooming too.
Here is an example:
import numpy as np
import matplotlib.pyplot as plt

a = np.arange(9).reshape(3, 3)
b = np.arange(36).reshape(6, 6)
plt.imshow(a, extent=[0, 1, 0, 1], interpolation='none')
plt.imshow(b, extent=(1, 2, 0, 1), interpolation='none')
# note: the color scaling of the two images is independent ("broke")
plt.xlim(0, 2)
Of course, to get the same color range for both, you should add the vmin=... and vmax=... keywords.
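As for the supersampling transformation asked about in the question, here is a minimal sketch; np.repeat and zoom with order=0 should give the same result here:
import numpy as np
from scipy.ndimage import zoom

a = np.array([[0, 1, 2],
              [3, 4, 5]])

by_repeat = np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)  # repeat rows, then columns
by_zoom = zoom(a, 2, order=0)                              # nearest-neighbor zoom
print(np.array_equal(by_repeat, by_zoom))  # True
print(by_repeat)
# [[0 0 1 1 2 2]
#  [0 0 1 1 2 2]
#  [3 3 4 4 5 5]
#  [3 3 4 4 5 5]]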
