Matplotlib_venn: weird output

Matplotlib_venn: weird output - python

I have a script that makes a Venn diagram of 3 sets using the matplotlib_venn module, as follows:
union = set_1.union(set_2).union(set_3)
indicators = ['%d%d%d' % (a in set_1, a in set_2, a in set_3) for a in union]
subsets = Counter(indicators)
fig = plt.figure((n+1)*2 - 1)
ax = fig.add_subplot(1, 1, 1)
v = venn3(subsets, (compare[0], compare[1], compare[2]), ax=ax)
plt.show()
Here are two images I get out of this, from two different datasets with three sets each (one small and one large):
In this image, The numbers are off. 180 should be in the middle and the 2 should be somewhere on the right side of the image, at the barely visible yellow/green part. I first thought this was due to the small size of the data set, but looking at the larger set ...
... I can still see that the numbers are slightly off, although not as much as previously. The larger, common set is still not in the middle, and the other numbers seem to be a little off to where the "center" of their set is.
Any ideas as to why this is, and what can be done to remedy the problem?
Using the venn3_unweighted function rather than venn3 shows perfectly nice (an non-proportional) images, including any 0s in the smaller data set, but it just doesn't work with the proportional version.

Related

Plotting millions of data points in Python?

I have written a complicated code. The code produces a set of numbers which I want to plot them. The problem is that I cannot put those numbers in a list since there are 2 700 000 000 of them.
So I need to plot one point then produce second point (the first point is replaced by second point so the first one is erased because I cannot store them). These numbers are generated in different sections of the code so I need to hold (MATLAB code) the figure.
For making it more conceivable to you, I write a simple code here and I want you to show me how to plot it.
import matplotlib.pyplot as plt
i=0
j=10
while i<2700000000:
plt.stem(i, j, '-')
i = i + 1
j = j + 2
plt.show()
Suppose I have billions of i and j!

Hmm I'm not sure if I understood you correctly but this:
import matplotlib.pyplot as plt
i=0
j=10
fig=plt.figure()
ax=fig.gca()
while i<10000: # Fewer points for speed.
ax.stem([i], [j]) # Need to provide iterable arguments to ax.stem
i = i + 1
j = j + 2
fig.show()
generates the following figure:
Isn't this what you're trying to achieve? After all the input numbers aren't stored anywhere, just added to the figure as soon as they are generated. You don't really need Matlab's hold equivalent, the figure won't be shown until you call fig.show() or plt.show() to show the current figure.
Or are you trying to overcome the problem that you can' hold the matplotlib.figure in your RAM? In which case my answer doesn't answer your question. Then you either have to save partial figures (only parts of the data) as pictures and combine them, as suggested in the comments, or think about an alternative way to show the data, as suggested in the other answer.

Heatmap with varying y axis

I would like to create a visualization like the upper part of this image. Essentially, a heatmap where each point in time has a fixed number of components but these components are anchored to the y axis by means of labels (that I can supply) rather than by their first index in the heatmap's matrix.
I am aware of pcolormesh, but that does not seem to give me the y-axis functionality I seek.
Lastly, I am also open to solutions in R, although a Python option would be much preferable.

I am not completely sure if I understand your meaning correctly, but by looking at the picture you have linked, you might be best off with a roll-your-own solution.
First, you need to create an array with the heatmap values so that you have on row for each label and one column for each time slot. You fill the array with nans and then write whatever heatmap values you have to the correct positions.
Then you need to trick imshow a bit to scale and show the image in the correct way.
For example:
# create some masked data
a=cumsum(random.random((20,200)), axis=0)
X,Y=meshgrid(arange(a.shape[1]),arange(a.shape[0]))
a[Y<15*sin(X/50.)]=nan
a[Y>10+15*sin(X/50.)]=nan
# draw the image along with some curves
imshow(a,interpolation='nearest',origin='lower',extent=[-2,2,0,3])
xd = linspace(-2, 2, 200)
yd = 1 + .1 * cumsum(random.random(200)-.5)
plot(xd, yd,'w',linewidth=3)
plot(xd, yd,'k',linewidth=1)
axis('normal')
Gives:

Python Image Processing: Measuring Layer Widths from Electron Micrograph

I have an image from an electron micrograph depicting dense and rare layers in a biological system, as shown below.
The layers in question are in the middle of the image, starting just to near the label "re" and tapering up to the left. I would like to:
1) count the total number of dark/dense and light/rare layers
2) measure the width of each layer, given that the black scale bar in the bottom right is 1 micron long
I've been trying to do this in Python. If I crop the image beforehand so as to only contain parts of a few layers, such the 3 dark and 3 light layers shown here:
I am able to count the number of layers using the code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage
from PIL import Image
tap = Image.open("VDtap.png").convert('L')
tap_a = np.array(tap)
tap_g = ndimage.gaussian_filter(tap_a, 1)
tap_norm = (tap_g - tap_g.min())/(float(tap_g.max()) - tap_g.min())
tap_norm[tap_norm < 0.5] = 0
tap_norm[tap_norm >= 0.5] = 1
result = 255 - (tap_norm * 255).astype(np.uint8)
tap_labeled, count = ndimage.label(result)
plt.imshow(tap_labeled)
plt.show()
However, I'm not sure how to incorporate the scale bar and measure the widths of these layers that I have counted. Even worse, when analyzing the entire image so as to include the scale bar I am having trouble even distinguishing the layers from everything else that is going on in the image.
I would really appreciate any insight in tackling this problem. Thanks in advance.
EDIT 1:
I've made a bit of progress on this problem so far. If I crop the image beforehand so as to contain just a bit of the layers, I've been able to use the following code to get at the thicknesses of each layer.
import numpy as np
import matplotlib.pyplot as plt
from scipy import ndimage
from PIL import Image
from skimage.measure import regionprops
tap = Image.open("VDtap.png").convert('L')
tap_a = np.array(tap)
tap_g = ndimage.gaussian_filter(tap_a, 1)
tap_norm = (tap_g - tap_g.min())/(float(tap_g.max()) - tap_g.min())
tap_norm[tap_norm < 0.5] = 0
tap_norm[tap_norm >= 0.5] = 1
result = 255 - (tap_norm * 255).astype(np.uint8)
tap_labeled, count = ndimage.label(result)
props = regionprops(tap_labeled)
ds = np.array([])
for i in xrange(len(props)):
if i==0:
ds = np.append(ds, props[i].bbox[1] - 0)
else:
ds = np.append(ds, props[i].bbox[1] - props[i-1].bbox[3])
ds = np.append(ds, props[i].bbox[3] - props[i].bbox[1])
Essentially, I discovered the Python module skimage, which can take a labeled image array and return the four coordinates of a boundary box for each labeled object; the 1 and [3] positions give the x coordinates of the boundary box, so their difference yields the extent of each layer in the x-dimension. Also, the first part of the for loop (the if-else condition) is used to get the light/rare layers that precede each dark/dense layer, since only the dark layers get labeled by ndimage.label.
Unfortunately this is still not ideal. Firstly, I would like to not have to crop the image beforehand, as I intend to repeat this procedure for many such images. I've considered that perhaps the (rough) periodicity of the layers could be highlighted using some sort of filter, but I'm not sure if such a filter exists? Secondly, the code above really only gives me the relative width of each layer - I still haven't figured out a way to incorporate the scale bar so as to get the actual widths.

I don't want to be a party-pooper, but I think your problem is harder than you first thought. I can't post a working code snippet because there are so many parts of your post that require in depth attention. I have worked in several bio/med labs and this work is usual done with a human to tag specific image points and a computer to calculate distances. That being said, one should probably try to automate =D.
To you, the problem is a simple, yet tedious job, of getting out a ruler and making a few hundred measurements. Perfect for a computer right? Well yes and no. The computer has no idea how to identify any of the bands in the picture and has to be told exactly what its looking for, and that will be tricky.
Identifying the scale bar
What do you know about the scale bars in all your images. Are they always the same number of vertical and horizontal pictures, are they always solid black? Are there always just one bar (what about the solid line for the letter r)? My suggestion is to try a wavelet transform. Imagine the 2d analog to the function
(probably helps to draw this function)
f(x) =
0 if |x| > 1,
1 if |x| <1 && |x| > 0.5
-1 if |x| < 0.5
Then when our wavelet f(x, y) is convolved over the image, the output image will have high values only when it finds the black scale bar. Also the length that I set to 1 can also be tuned for wavelets and that will help you find the scale bar too.
Finding the ridges
I'd solve the above problem first because it seems easier and sets you up for this one. I'd construct another wavelet for this one but just as a preprocessing step. For this wavelet I'd try a 2d 0-sum box function again, but this try to match three (or more) boxes next to each other. Also in addition to the height and width parameters for the box, we need a spacing and tilt angle parameter. You probably don't have to get very close to the actual value, just close enough that the rest of the image blackens out.
Measuring the ridges
There are lots and lots of ways to do this, but let's use our previous step for simplicity. Take your 3 box wavelet answer and it should be centered at the middle ridge and report a box "width" that is the average width of those three ridges it has captured. Probably close enough considering how slowly the widths are changing!
Good hunting!

Opacity misleading when plotting two histograms at the same time with matplotlib

Let's say I have two histograms and I set the opacity using the parameter of hist: 'alpha=0.5'
I have plotted two histograms yet I get three colors! I understand this makes sense from an opacity point of view.
But! It makes is very confusing to show someone a graph of two things with three colors. Can I just somehow set the smallest bar for each bin to be in front with no opacity?
Example graph

The usual way this issue is handled is to have the plots with some small separation. This is done by default when plt.hist is given multiple sets of data:
import pylab as plt
x = 200 + 25*plt.randn(1000)
y = 150 + 25*plt.randn(1000)
n, bins, patches = plt.hist([x, y])
You instead which to stack them (this could be done above using the argument histtype='barstacked') but notice that the ordering is incorrect.
This can be fixed by individually checking each pair of points to see which is larger and then using zorder to set which one comes first. For simplicity I am using the output of the code above (e.g n is two stacked arrays of the number of points in each bin for x and y):
n_x = n[0]
n_y = n[1]
for i in range(len(n[0])):
if n_x[i] > n_y[i]:
zorder=1
else:
zorder=0
plt.bar(bins[:-1][i], n_x[i], width=10)
plt.bar(bins[:-1][i], n_y[i], width=10, color="g", zorder=zorder)
Here is the resulting image:
By changing the ordering like this the image looks very weird indeed, this is probably why it is not implemented and needs a hack to do it. I would stick with the small separation method, anyone used to these plots assumes they take the same x-value.

How to correlate two time series with gaps and different time bases?

I have two time series of 3D accelerometer data that have different time bases (clocks started at different times, with some very slight creep during the sampling time), as well as containing many gaps of different size (due to delays associated with writing to separate flash devices).
The accelerometers I'm using are the inexpensive GCDC X250-2. I'm running the accelerometers at their highest gain, so the data has a significant noise floor.
The time series each have about 2 million data points (over an hour at 512 samples/sec), and contain about 500 events of interest, where a typical event spans 100-150 samples (200-300 ms each). Many of these events are affected by data outages during flash writes.
So, the data isn't pristine, and isn't even very pretty. But my eyeball inspection shows it clearly contains the information I'm interested in. (I can post plots, if needed.)
The accelerometers are in similar environments but are only moderately coupled, meaning that I can tell by eye which events match from each accelerometer, but I have been unsuccessful so far doing so in software. Due to physical limitations, the devices are also mounted in different orientations, where the axes don't match, but they are as close to orthogonal as I could make them. So, for example, for 3-axis accelerometers A & B, +Ax maps to -By (up-down), +Az maps to -Bx (left-right), and +Ay maps to -Bz (front-back).
My initial goal is to correlate shock events on the vertical axis, though I would eventually like to a) automatically discover the axis mapping, b) correlate activity on the mapped aces, and c) extract behavior differences between the two accelerometers (such as twisting or flexing).
The nature of the times series data makes Python's numpy.correlate() unusable. I've also looked at R's Zoo package, but have made no headway with it. I've looked to different fields of signal analysis for help, but I've made no progress.
Anyone have any clues for what I can do, or approaches I should research?
Update 28 Feb 2011: Added some plots here showing examples of the data.

My interpretation of your question: Given two very long, noisy time series, find a shift of one that matches large 'bumps' in one signal to large bumps in the other signal.
My suggestion: interpolate the data so it's uniformly spaced, rectify and smooth the data (assuming the phase of the fast oscillations is uninteresting), and do a one-point-at-a-time cross correlation (assuming a small shift will line up the data).
import numpy
from scipy.ndimage import gaussian_filter
"""
sig1 and sig 2 are assumed to be large, 1D numpy arrays
sig1 is sampled at times t1, sig2 is sampled at times t2
t_start, t_end, is your desired sampling interval
t_len is your desired number of measurements
"""
t = numpy.linspace(t_start, t_end, t_len)
sig1 = numpy.interp(t, t1, sig1)
sig2 = numpy.interp(t, t2, sig2)
#Now sig1 and sig2 are sampled at the same points.
"""
Rectify and smooth, so 'peaks' will stand out.
This makes big assumptions about your data;
these assumptions seem true-ish based on your plots.
"""
sigma = 10 #Tune this parameter to get the right smoothing
sig1, sig2 = abs(sig1), abs(sig2)
sig1, sig2 = gaussian_filter(sig1, sigma), gaussian_filter(sig2, sigma)
"""
Now sig1 and sig2 should look smoothly varying, with humps at each 'event'.
Hopefully we can search a small range of shifts to find the maximum of the
cross-correlation. This assumes your data are *nearly* lined up already.
"""
max_xc = 0
best_shift = 0
for shift in range(-10, 10): #Tune this search range
xc = (numpy.roll(sig1, shift) * sig2).sum()
if xc > max_xc:
max_xc = xc
best_shift = shift
print 'Best shift:', best_shift
"""
If best_shift is at the edges of your search range,
you should expand the search range.
"""

If the data contains gaps of unknown sizes that are different in each time series, then I would give up on trying to correlate entire sequences, and instead try cross correlating pairs of short windows on each time series, say overlapping windows twice the length of a typical event (300 samples long). Find potential high cross correlation matches across all possibilities, and then impose a sequential ordering constraint on the potential matches to get sequences of matched windows.
From there you have smaller problems that are easier to analyze.

This isn't a technical answer, but it might help you come up with one:
Convert the plot to an image, and stick it into a decent image program like gimp or photoshop
break the plots into discrete images whenever there's a gap
put the first series of plots in a horizontal line
put the second series in a horizontal line right underneath it
visually identify the first correlated event
if the two events are not lined up vertically:
select whichever instance is further to the left and everything to the right of it on that row
drag those things to the right until they line up
This is pretty much how an audio editor works, so you if you converted it into a simple audio format like an uncompressed WAV file, you could manipulate it directly in something like Audacity. (It'll sound horrible, of course, but you'll be able to move the data plots around pretty easily.)
Actually, audacity has a scripting language called nyquist, too, so if you don't need the program to detect the correlations (or you're at least willing to defer that step for the time being) you could probably use some combination of audacity's markers and nyquist to automate the alignment and export the clean data in your format of choice once you tag the correlation points.

My guess is, you'll have to manually build an offset table that aligns the "matches" between the series. Below is an example of a way to get those matches. The idea is to shift the data left-right until it lines up and then adjust the scale until it "matches". Give it a try.
library(rpanel)
#Generate the x1 and x2 data
n1 <- rnorm(500)
n2 <- rnorm(200)
x1 <- c(n1, rep(0,100), n2, rep(0,150))
x2 <- c(rep(0,50), 2*n1, rep(0,150), 3*n2, rep(0,50))
#Build the panel function that will draw/update the graph
lvm.draw <- function(panel) {
plot(x=(1:length(panel$dat3))+panel$off, y=panel$dat3, ylim=panel$dat1, xlab="", ylab="y", main=paste("Alignment Graph Offset = ", panel$off, " Scale = ", panel$sca, sep=""), typ="l")
lines(x=1:length(panel$dat3), y=panel$sca*panel$dat4, col="red")
grid()
panel
}
#Build the panel
xlimdat <- c(1, length(x1))
ylimdat <- c(-5, 5)
panel <- rp.control(title = "Eye-Ball-It", dat1=ylimdat, dat2=xlimdat, dat3=x1, dat4=x2, off=100, sca=1.0, size=c(300, 160))
rp.slider(panel, var=off, from=-500, to=500, action=lvm.draw, title="Offset", pos=c(5, 5, 290, 70), showvalue=TRUE)
rp.slider(panel, var=sca, from=0, to=2, action=lvm.draw, title="Scale", pos=c(5, 70, 290, 90), showvalue=TRUE)

It sounds like you want to minimize the function (Ax'+By) + (Az'+Bx) + (Ay'+Bz) for a pair of values: Namely, the time-offset: t0 and a time scale factor: tr. where Ax' = tr*(Ax + t0), etc..
I would look into SciPy's bivariate optimize functions. And I would use a mask or temporarily zero the data (both Ax' and By for example) over the "gaps" (assuming the gaps can be programmatically determined).
To make the process more efficient, start with a coarse sampling of A and B, but set the precision in fmin (or whatever optimizer you've selected) that is commensurate with your sampling. Then proceed with progressively finer-sampled windows of the full dataset until your windows are narrow and are not down-sampled.
Edit - matching axes
Regarding the issue of trying to identify which axis is co-linear with a given axis, and not knowing at thing about the characteristics of your data, i can point towards a similar question. Look into pHash or any of the other methods outlined in this post to help identify similar waveforms.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.