Visualization of scatter plots with overlapping points in matplotlib - python

I have to represent about 30,000 points in a scatter plot in matplotlib. These points belong to two different classes, so I want to depict them with different colors.
I succeeded in doing so, but there is an issue. The points overlap in many regions, and the class I plot last is drawn on top of the other one, hiding it. Furthermore, a scatter plot cannot show how many points lie in each region.
I have also tried to make a 2d histogram with histogram2d and imshow, but it's difficult to show the points belonging to both classes in a clear way.
Can you suggest a way to make clear both the distribution of the classes and the concentration of the points?
EDIT: To be clearer, this is the link to my data file, in the format "x,y,class".

One approach is to plot the data as a scatter plot with a low alpha, so you can see the individual points as well as a rough measure of density. (The downside to this is that the approach has a limited range of overlap it can show -- i.e., a maximum density of about 1/alpha.)
Here's an example:
As you can imagine, because of the limited range of overlaps that can be expressed, there's a tradeoff between visibility of the individual points and the expression of amount of overlap (and the size of the marker, plot, etc).
import numpy as np
import matplotlib.pyplot as plt
N = 10000
mean = [0, 0]
cov = [[2, 1], [1, 2]]  # the covariance matrix must be symmetric
x,y = np.random.multivariate_normal(mean, cov, N).T
plt.scatter(x, y, s=70, alpha=0.03)
plt.ylim((-5, 5))
plt.xlim((-5, 5))
plt.show()
(I'm assuming here you meant 30e3 points, not 30e6. For 30e6, I think some type of averaged density plot would be necessary.)
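For the very large case, a hexbin plot is one simple "averaged density" option. Here's a minimal sketch (the grid size and colormap are arbitrary choices on my part, not something the answer above specifies):
import numpy as np
import matplotlib.pyplot as plt
N = 10000
x, y = np.random.multivariate_normal([0, 0], [[2, 1], [1, 2]], N).T
# Aggregate the points into hexagonal bins; the color shows the count per
# bin, so density stays readable no matter how many points overlap.
plt.hexbin(x, y, gridsize=50, cmap='Blues', mincnt=1)
plt.colorbar(label='points per bin')
plt.show()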

You could also colour the points by first computing a kernel density estimate of the distribution of the scatter, and then using the density values to specify a colour for each point. Modifying the code from the earlier example:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde as kde
from matplotlib.colors import Normalize
from matplotlib import cm
N = 10000
mean = [0,0]
cov = [[2, 1], [1, 2]]  # the covariance matrix must be symmetric
samples = np.random.multivariate_normal(mean,cov,N).T
densObj = kde( samples )
def makeColours(vals):
    norm = Normalize(vmin=vals.min(), vmax=vals.max())
    # Put any colormap you like here.
    colours = [cm.ScalarMappable(norm=norm, cmap='jet').to_rgba(val) for val in vals]
    return colours

colours = makeColours(densObj.evaluate(samples))
plt.scatter(samples[0], samples[1], color=colours)
plt.show()
I learnt this trick a while ago when I noticed the documentation of the scatter function --
c : color or sequence of color, optional, default : 'b'
c can be a single color format string, or a sequence of color specifications of length N, or a sequence of N numbers to be mapped to colors using the cmap and norm specified via kwargs (see below). Note that c should not be a single numeric RGB or RGBA sequence because that is indistinguishable from an array of values to be colormapped. c can be a 2-D array in which the rows are RGB or RGBA, however, including the case of a single row to specify the same color for all points.
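In other words, the per-point colour loop above can be replaced by passing the density values straight to c and letting scatter apply the norm and colormap itself. A minimal sketch, reusing samples and densObj from the block above:
# Pass the raw density values as `c`; scatter maps them through the
# colormap for us, so no explicit ScalarMappable is needed.
densities = densObj.evaluate(samples)
plt.scatter(samples[0], samples[1], c=densities, cmap='jet')
plt.colorbar()
plt.show()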

My answer may not perfectly address your question; however, I too tried to plot overlapping points, and mine were exactly overlapped. I therefore came up with this function to offset identical points.
import numpy as np

def dodge_points(points, component_index, offset):
    """Dodge every point by a multiplicative offset (multiplier is based on
    frequency of appearance).

    Args:
        points (array-like (2D)): Array containing the points
        component_index (int): Index / column on which the offset will be applied
        offset (float): Offset amount. Effective offset for each point is
            `index of appearance` * offset

    Returns:
        array-like (2D): Dodged points
    """
    # Extract the unique points so we can map an offset for each
    uniques, inv, counts = np.unique(
        points, return_inverse=True, return_counts=True, axis=0
    )
    for i, num_identical in enumerate(counts):
        # Prepare the dodge values
        dodge_values = np.array([offset * j for j in range(num_identical)])
        # Find where the dodge values must be applied, in order
        points_loc = np.where(inv == i)[0]
        # Apply the dodge values
        points[points_loc, component_index] += dodge_values
    return points
Here is an example of before and after.
Before:
After:
This method only works for EXACTLY overlapping points (or if you are willing to round points off in a way that np.unique finds matching points).
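For illustration, here is a quick hypothetical check of the function on exactly overlapping points (the coordinates and offset are made up):
import numpy as np
# Three points share the exact same coordinates, so dodge_points spreads
# them along the x axis (component_index=0). Note that it modifies
# `points` in place and also returns it.
pts = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
print(dodge_points(pts, component_index=0, offset=0.1))
# The duplicates end up at x = 1.0, 1.1, 1.2; [3.0, 4.0] is unchanged.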

How to reverse a color map image to scalar values?

How do I invert a color mapped image?
I have a 2D image which plots data on a colormap. I'd like to read the image in and 'reverse' the color map, that is, look up a specific RGB value, and turn it into a float.
For example:
using this image: http://matplotlib.sourceforge.net/_images/mri_demo.png
I should be able to get a 440x360 matrix of floats, knowing the colormap was cm.jet
from pylab import imread
import matplotlib.cm as cm
a=imread('mri_demo.png')
b=colormap2float(a,cm.jet) #<-tricky part
There may be better ways to do this; I'm not sure.
If you read help(cm.jet) you will see the algorithm used to map values in the interval [0,1] to RGB 3-tuples. You could, with a little paper and pencil, work out formulas to invert the piecewise-linear functions which define the mapping.
However, there are a number of issues which make the paper-and-pencil solution somewhat unappealing:
It's a lot of laborious algebra, and the solution is specific to cm.jet. You'd have to do all this work again if you change the color map. (How to automate the solving of these algebraic equations is interesting, but not a problem I know how to solve.)
In general, the color map may not be invertible: more than one value may be mapped to the same color. In the case of cm.jet, values between 0.11 and 0.125 are all mapped to the RGB 3-tuple (0, 0, 1), for example. So if your image contains a pure blue pixel, there is really no way to tell if it came from a value of 0.11 or a value of, say, 0.125.
The mapping from [0,1] to 3-tuples is a curve in 3-space. The colors in your image may not lie perfectly on this curve; there might be round-off error, for example. So any practical solution has to be able to interpolate or somehow project points in 3-space onto the curve.
Due to the non-uniqueness issue, and the projection/interpolation issue, there can be many possible solutions to the problem you pose. Below is just one possibility.
Here is one way to resolve the uniqueness and projection/interpolation issues:
Create a gradient which acts as a "code book". The gradient is an array of RGBA 4-tuples in the cm.jet color map. The colors of the gradient correspond to values from 0 to 1. Use scipy's vector quantization function scipy.cluster.vq.vq to map all the colors in your image, mri_demo.png, onto the nearest color in gradient.
Since a color map may use the same color for many values, the gradient may contain duplicate colors. I leave it up to scipy.cluster.vq.vq to decide which (possibly) non-unique code book index to associate with a particular color.
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import scipy.cluster.vq as scv
def colormap2arr(arr, cmap):
    # http://stackoverflow.com/questions/3720840/how-to-reverse-color-map-image-to-scalar-values/3722674#3722674
    gradient = cmap(np.linspace(0.0, 1.0, 100))
    # Reshape arr to something like (240*240, 4), all the 4-tuples in a long list...
    arr2 = arr.reshape((arr.shape[0]*arr.shape[1], arr.shape[2]))
    # Use vector quantization to shift the values in arr2 to the nearest point in
    # the code book (gradient).
    code, dist = scv.vq(arr2, gradient)
    # code is an array of length arr2 (240*240), holding the code book index for
    # each observation. (arr2 are the "observations".)
    # Scale the values so they are from 0 to 1.
    values = code.astype('float') / gradient.shape[0]
    # Reshape values back to (240, 240)
    values = values.reshape(arr.shape[0], arr.shape[1])
    values = values[::-1]
    return values

arr = plt.imread('mri_demo.png')
values = colormap2arr(arr, cm.jet)
# Proof that it works:
plt.imshow(values, interpolation='bilinear', cmap=cm.jet,
           origin='lower', extent=[-3, 3, -3, 3])
plt.show()
The image you see should be close to reproducing mri_demo.png:
(The original mri_demo.png had a white border. Since white is not a color in cm.jet, note that scipy.cluster.vq.vq maps white to the closest point in the gradient code book, which happens to be a pale green color.)
Here is a simpler approach that works for many colormaps, e.g. viridis, though not for LinearSegmentedColormaps such as 'jet'.
The colormaps are stored as lists of [r,g,b] values. For lots of colormaps, this map has exactly 256 entries. A value between 0 and 1 is looked up using its nearest neighbor in the color list. So, you can't get the exact value back, only an approximation.
Some code to illustrate the concepts:
from matplotlib import pyplot as plt
def find_value_in_colormap(tup, cmap):
    # For a cmap like viridis, the result of the colormap lookup is a tuple
    # (r, g, b, a), with a always being 1, but the colors array is stored as
    # a list [r, g, b]. For some colormaps the situation is reversed: the
    # lookup returns a list, while the colors array contains tuples.
    tup = list(tup)[:3]
    colors = cmap.colors
    if tup in colors:
        ind = colors.index(tup)
    elif tuple(tup) in colors:
        ind = colors.index(tuple(tup))
    else:  # tup was not generated by this colormap
        return None
    return (ind + 0.5) / len(colors)
val = 0.3
tup = plt.cm.viridis(val)
print(find_value_in_colormap(tup, plt.cm.viridis))
This prints the approximate value 0.298828125, which is the value corresponding to the color triple.
To illustrate what happens, here is a visualization of the function looking up a color for a value, followed by getting the value corresponding to that color.
from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(-0.1, 1.1, 10000)
y = [find_value_in_colormap(plt.cm.viridis(xi), plt.cm.viridis) for xi in x]
fig, axes = plt.subplots(ncols=3, figsize=(12,4))
for ax in axes.ravel():
ax.plot(x, x, label='identity: y = x')
ax.plot(x, y, label='lookup, then reverse')
ax.legend(loc='best')
axes[0].set_title('overall view')
axes[1].set_title('zoom near x=0')
axes[1].set_xlim(-0.02, 0.02)
axes[1].set_ylim(-0.02, 0.02)
axes[2].set_title('zoom near x=1')
axes[2].set_xlim(0.98, 1.02)
axes[2].set_ylim(0.98, 1.02)
plt.show()
For a colormap with only a few colors, a plot can show the exact position where one color changes to the next. The plot is colored corresponding to the x-values.
Hi unutbu,
Thanks for your reply. I understand the process you explain and have reproduced it. It works very well: I use it to convert IR camera shots into temperature grids, since a picture can easily be reworked/reshaped to fit my purpose using GIMP.
I'm able to create grids of scalars from camera shots, which is really useful for my tasks.
I use a palette file that I create with GIMP + Sample a Gradient Along a Path.
I pick the color bar of my original picture, convert it to a palette, then export it as a hex color sequence.
I read this palette file to create a colormap, normalized by a temperature sample, to be used as the code book.
I read the original image and use vector quantization to reverse the colors into values.
I slightly improved the pythonic style of the code by using the code book indices as an index filter into the temperature sample array, and I apply some filter passes to smooth my results.
from numpy import linspace, savetxt
from matplotlib.colors import Normalize, LinearSegmentedColormap
from scipy.cluster.vq import vq

# hex_color_list, flat_image and image_file_name are defined elsewhere.
# Sample the values to find from the colorbar extremums.
vmin = -20.
vmax = 120.
precision = 1.
resolution = int(1 + (vmax - vmin) / precision)
sample = linspace(vmin, vmax, resolution)
# Create the code_book from the sample.
cmap = LinearSegmentedColormap.from_list('Custom', hex_color_list)
norm = Normalize()
code_book = cmap(norm(sample))
# Quantize the colors.
indices = vq(flat_image, code_book)[0]
# Filter the sample with the quantization results **(improved)**.
values = sample[indices]
savetxt(image_file_name[:-3] + 'csv', values, delimiter=',', fmt='%-8.1f')
The results are finally exported to a .csv file.
The most important thing is to create a well-representative palette file to obtain good precision. I start to obtain a good gradient (code book) using 12 colors or more.
This process is useful since sometimes camera shots cannot be translated to gray-scale easily and linearly.
Thanks to all contributors unutbu, Rob A, scipy community ;)
The LinearSegmentedColormap doesn't give me the same interpolation as doing it manually in my tests, so I prefer to use my own. As an advantage, matplotlib is no longer required, since I integrate my code within existing software.
import numpy as np

def codeBook(color_list, N=256):
    """
    Return N colors interpolated from an rgb color list.
    !!! workaround to matplotlib colormaps to avoid the dependency !!!
    """
    # Separate the r, g, b channels.
    rgb = np.array(color_list).T
    # Normalized data point sets.
    new_x = np.linspace(0., 1., N)
    x = np.linspace(0., 1., len(color_list))
    # Interpolate each color channel.
    rgb = [np.interp(new_x, x, channel) for channel in rgb]
    # Round the elements of the array to the nearest integer.
    return np.rint(np.column_stack(rgb)).astype('int')
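A hypothetical usage example, assuming RGB values in the 0-255 range:
color_list = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]  # blue -> green -> red
code_book = codeBook(color_list, N=256)
print(code_book.shape)              # (256, 3)
print(code_book[0], code_book[-1])  # [  0   0 255] [255   0   0]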

How can you create a KDE from histogram values only?

I have a set of values that I'd like to plot the Gaussian kernel density estimate of; however, there are two problems:
I only have the values of the bars, not the underlying values themselves
I am plotting onto a categorical axis
Here's the plot I've generated so far:
The order of the y axis is actually relevant since it is representative of the phylogeny of each bacterial species.
I'd like to add a gaussian kde overlay for each color, but so far I haven't been able to leverage seaborn or scipy to do this.
Here's the code for the above grouped bar plot using python and matplotlib:
N = len(color1_plotting_values)
fig, ax = plt.subplots(figsize=(20, 30))
ind = np.arange(N)    # the x locations for the groups
width = .5            # the width of the bars
p1 = ax.barh(Species_Ordering.Species.values, color1_plotting_values, width, label='Color1', log=True)
p2 = ax.barh(Species_Ordering.Species.values, color2_plotting_values, width, label='Color2', log=True)
for b in p2:
    b.xy = (b.xy[0], b.xy[1] + width)
Thanks!
How to plot a "KDE" starting from a histogram
The protocol for kernel density estimation requires the underlying data. You could come up with a new method that uses the empirical pdf (i.e., the histogram) instead, but then it wouldn't be a KDE distribution.
Not all hope is lost, though. You can get a good approximation of a KDE distribution by first taking samples from the histogram, and then using KDE on those samples. Here's a complete working example:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sts
n = 100000
# generate some random multimodal histogram data
samples = np.concatenate([np.random.normal(np.random.randint(-8, 8), size=n)*np.random.uniform(.4, 2) for i in range(4)])
h,e = np.histogram(samples, bins=100, density=True)
x = np.linspace(e.min(), e.max())
# plot the histogram
plt.figure(figsize=(8,6))
plt.bar(e[:-1], h, width=np.diff(e), ec='k', align='edge', label='histogram')
# plot the real KDE
kde = sts.gaussian_kde(samples)
plt.plot(x, kde.pdf(x), c='C1', lw=8, label='KDE')
# resample the histogram and find the KDE.
resamples = np.random.choice((e[:-1] + e[1:])/2, size=n*5, p=h/h.sum())
rkde = sts.gaussian_kde(resamples)
# plot the KDE
plt.plot(x, rkde.pdf(x), '--', c='C3', lw=4, label='resampled KDE')
plt.title('n = %d' % n)
plt.legend()
plt.show()
Output:
The red dashed line and the orange line nearly completely overlap in the plot, showing that the real KDE and the KDE calculated by resampling the histogram are in excellent agreement.
If your histograms are really noisy (like what you get if you set n = 10 in the above code), you should be a bit cautious when using the resampled KDE for anything other than plotting purposes:
Overall the agreement between the real and resampled KDEs is still good, but the deviations are noticeable.
Munge your categorical data into an appropriate form
Since you haven't posted your actual data I can't give you detailed advice. I think your best bet will be to just number your categories in order, then use that number as the "x" value of each bar in the histogram.
I have stated my reservations to applying a KDE to OP's categorical data in my comments above. Basically, as the phylogenetic distance between species does not obey the triangle inequality, there cannot be a valid kernel that could be used for kernel density estimation. However, there are other density estimation methods that do not require the construction of a kernel. One such method is k-nearest neighbour inverse distance weighting, which only requires non-negative distances which need not satisfy the triangle inequality (nor even need to be symmetric, I think). The following outlines this approach:
import numpy as np

# --------------------------------------------------------------------------------
# Simulate data
total_classes = 10
sample_values = np.random.rand(total_classes)
distance_matrix = 100 * np.random.rand(total_classes, total_classes)

# Distances from each sample to itself are zero; hence remove the diagonal.
distance_matrix -= np.diag(np.diag(distance_matrix))

# --------------------------------------------------------------------------------
# For each sample, compute an average based on the values of the k-nearest neighbors.
# Weigh each sample value by the inverse of the corresponding distance.

# Apply a regularizer to the distance matrix.
# This limits the influence of values with very small distances.
# In particular, this affects how the value of the sample itself (which has distance 0)
# is weighted w.r.t. other values.
regularizer = 1.
distance_matrix += regularizer

# Set the number of neighbours to "interpolate" over.
k = 3

# Compute the average based on the sample value itself and the k neighbouring
# values, weighted by the inverse distance. The following assumes that
# distance_matrix[ii, jj] corresponds to the distance from ii to jj.
new_sample_values = np.zeros(total_classes)  # missing in the original snippet
for ii in range(total_classes):
    # Determine the neighbours
    indices = np.argsort(distance_matrix[ii, :])[:k+1]  # +1 to include the sample itself
    # Compute the weights
    distances = distance_matrix[ii, indices]
    weights = 1. / distances
    weights /= np.sum(weights)  # weights need to sum to 1
    # Compute the weighted average
    values = sample_values[indices]
    new_sample_values[ii] = np.sum(values * weights)

print(new_sample_values)
THE EASY WAY
For now, I am skipping any philosophical argument about the validity of using kernel density estimation in such settings; I'll come back to that later.
An easy way to do this is using scikit-learn KernelDensity:
import numpy as np
import pandas as pd
from sklearn.neighbors import KernelDensity
from sklearn import preprocessing
ds=pd.read_csv('data-by-State.csv')
Y=ds.loc[:,'State'].values # State is AL, AK, AZ, etc...
# With categorical data we need some label encoding here...
le = preprocessing.LabelEncoder()
le.fit(Y) # le.classes_ would be ['AL', 'AK', 'AZ',...
y=le.transform(Y) # y would be [0, 2, 3, ..., 6, 7, 9]
y=y[:, np.newaxis] # preparing for kde
kde = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(y)
# You can control the bandwidth so the KDE function performs better
# To find the optimum bandwidth for your data you can try Crossvalidation
x=np.linspace(0,5,100)[:, np.newaxis] # let's get some x values to plot on
log_dens=kde.score_samples(x)
dens=np.exp(log_dens) # these are the density function values
array([0.06625658, 0.06661817, 0.06676005, 0.06669403, 0.06643584,
0.06600488, 0.0654239 , 0.06471854, 0.06391682, 0.06304861,
0.06214499, 0.06123764, 0.06035818, 0.05953754, 0.05880534,
0.05818931, 0.05771472, 0.05740393, 0.057276 , 0.05734634,
0.05762648, 0.05812393, 0.05884214, 0.05978051, 0.06093455,
..............
0.11885574, 0.11883695, 0.11881434, 0.11878766, 0.11875657,
0.11872066, 0.11867943, 0.11863229, 0.11857859, 0.1185176 ,
0.11844852, 0.11837051, 0.11828267, 0.11818407, 0.11807377])
And these values are all you need to plot your kernel density over your histogram. Got it?
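As the comment in the code suggests, cross-validation can pick the bandwidth. A hypothetical sketch with scikit-learn's GridSearchCV (the grid range is an arbitrary choice of mine), reusing y and KernelDensity from above:
from sklearn.model_selection import GridSearchCV
import numpy as np
# KernelDensity.score returns a total log-likelihood, which GridSearchCV
# maximizes across folds to select the bandwidth.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 2.0, 20)},
                    cv=5)
grid.fit(y)
print(grid.best_params_)  # e.g. {'bandwidth': 0.75}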
Now, on the theoretical side: if X is a categorical(*), unordered variable with c possible values, then for 0 ≤ h < 1,

K(x1, x2) = 1 - h        if x1 = x2
K(x1, x2) = h / (c - 1)  otherwise

(the Aitchison-Aitken kernel) is a valid kernel. For an ordered X,

K(x1, x2) = h^|x1 - x2|

(the Li-Racine kernel) is valid, where |x1 - x2| should be understood as how many levels apart x1 and x2 are. As h tends to zero, both of these become indicators and return a relative frequency count. h is oftentimes referred to as the bandwidth.
(*) No distance needs to be defined on the variable space; it doesn't need to be a metric space.
Devroye, Luc and Gábor Lugosi (2001). Combinatorial Methods in Density Estimation. Berlin: Springer-Verlag.
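For concreteness, here is a minimal sketch of the unordered estimator (my own illustration, not part of the original answer):
import numpy as np

def aitchison_aitken_kde(data, c, h):
    """Estimate P(X = x) for x = 0..c-1 from integer-coded samples,
    using the Aitchison-Aitken kernel with bandwidth 0 <= h < 1."""
    data = np.asarray(data)
    density = np.empty(c)
    for x in range(c):
        # Kernel weight is 1-h for matching categories, h/(c-1) otherwise.
        weights = np.where(data == x, 1.0 - h, h / (c - 1.0))
        density[x] = weights.mean()
    return density

samples = np.array([0, 0, 1, 2, 2, 2])
print(aitchison_aitken_kde(samples, c=3, h=0.0))  # h=0 gives relative frequencies
print(aitchison_aitken_kde(samples, c=3, h=0.3))  # smoothed version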

Using pyplot to draw histogram

I have a list. The index of the list is the degree number, and the value is the probability of that degree. For example, x[1] = 0.01 means that degree 1 has probability 0.01.
I want to draw a distribution graph of this list, so I tried:
hist = plt.figure(1)
plt.hist(PrDeg, bins = 1)
plt.title("Degree Probability Histogram")
plt.xlabel("Degree")
plt.ylabel("Prob.")
hist.savefig("Prob_Hist")
PrDeg is the list which I mentioned above.
But the saved figure is not correct: the x-axis values become Prob. and the y-axis is Degree (the index of the list).
How can I swap the x and y axes using pyplot?
Histograms do not usually show probabilities; they show the count or frequency of observations within different intervals of values, called bins. pyplot defines the bins by splitting the range between the minimum and maximum value of your array into n equally sized bins, where n is the number you specified with the argument bins=1. So in this case your histogram has a single bin, which gives it its odd aspect. By increasing that number you will be able to see better what is actually happening there.
The only information we can get from such a histogram is that the values of your data range from 0.0 to ~0.122 and that len(PrDeg) is close to 1800. If I am right about that much, it means your graph looks like what one would expect from a histogram, and it is therefore not incorrect.
To answer your question about swapping the axes, the argument orientation=u'horizontal' is what you are looking for. I used it in the example below, renaming the axes accordingly:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.normal(0,1,10000)
print(PrDeg)
hist = plt.figure(1)
plt.hist(PrDeg, bins = 100, orientation=u'horizontal')
plt.title("Degree Probability Histogram")
plt.xlabel("count")
plt.ylabel("Values randomly generated by numpy")
hist.savefig("Prob_Hist")
plt.show()
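If, instead, the goal is really degree on x and probability on y (rather than a histogram of the probability values), a plain bar plot of index versus value may be closer to what is wanted. A hypothetical sketch with made-up probabilities:
import numpy as np
import matplotlib.pyplot as plt
PrDeg = np.random.dirichlet(np.ones(50))  # fake probabilities summing to 1
plt.bar(np.arange(len(PrDeg)), PrDeg)     # index on x, probability on y
plt.xlabel("Degree")
plt.ylabel("Prob.")
plt.show()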

Python: Choose the n points better distributed from a bunch of points

I have a numpy array of points in an XY plane like:
I want to select the n points (let's say 100) that are best distributed out of all these points; that is, I want the density of points to be roughly constant everywhere.
Something like this:
Is there any pythonic way or any numpy/scipy function to do this?
@EMS is very correct that you should give a lot of thought to exactly what you want.
There are more sophisticated ways to do this (@EMS's suggestions are very good!), but a brute-force-ish approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.
The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that number.
A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy, as well.
As an example of the simplest possible brute-force grid approach (there's a lot we could do better here):
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))
# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000
# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)
# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])
# Group by which cell the points fall into and choose a random point from each
groups = df.groupby(df.index)
new = groups.agg(lambda x: np.random.permutation(x)[0])
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
plt.setp(axes, aspect=1, adjustable='box')  # 'box-forced' was removed in newer matplotlib
fig.tight_layout()
plt.show()
Loosely based on @EMS's suggestion in a comment, here's another approach.
We'll calculate the density of points using a kernel density estimate, and then use the inverse of that as the probability that a given point will be chosen.
scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, exploiting the special structure of the pairwise distances, etc.), but that's beyond the scope of this particular question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run; one cheap alternative is sketched after the code below.
The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))
# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)
# Try playing around with this weight. Compare 1/dens, 1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()
# Draw a sample using np.random.choice with the specified probabilities.
# We'll need to view things as an object array because np.random.choice
# expects a 1D array.
dat = xy.T.ravel().view([('x', float), ('y', float)])
subset = np.random.choice(dat, subset_num, p=weight)
# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(subset)))
plt.setp(axes, aspect=1, adjustable='box')  # 'box-forced' was removed in newer matplotlib
fig.tight_layout()
plt.show()
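As noted above, gaussian_kde is the bottleneck. One cheap substitute (my own approximation, not part of the original answer) is to estimate each point's density from a 2D histogram lookup instead of a full KDE, which is roughly O(n) rather than quadratic:
import numpy as np

def approx_density(x, y, bins=100):
    # Count points per grid cell, then read back each point's cell count.
    counts, xedges, yedges = np.histogram2d(x, y, bins=bins)
    ix = np.clip(np.digitize(x, xedges) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(y, yedges) - 1, 0, bins - 1)
    return counts[ix, iy]

# Drop-in replacement for the `dens = gaussian_kde(xy)(xy)` line above:
# dens = approx_density(x, y)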
Unless you give a specific criterion for defining "better distributed" we can't give a definite answer.
The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be correctly represented.
A different approach might be as follows:
1. Calculate the distance matrix between all pairs of points.
2. Treating this distance matrix as a weighted network, calculate some measure of centrality for each point in the data, such as eigenvector centrality, betweenness centrality, or Bonacich centrality.
3. Order the points in descending order according to the centrality measure, and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and with different centrality measures.
Many of these functions are provided directly by SciPy, NetworkX, and scikit-learn, and will work directly on a NumPy array; a sketch follows below.
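A hypothetical sketch of steps 1-3 using NetworkX (the edge-weighting function is an arbitrary choice of mine):
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

points = np.random.rand(500, 2)
dist = squareform(pdist(points))
# Use a decreasing function of distance as the edge weight, so that high
# centrality means "close to many other points".
weights = np.exp(-dist)
np.fill_diagonal(weights, 0)  # no self-loops
G = nx.from_numpy_array(weights)
centrality = nx.eigenvector_centrality_numpy(G, weight='weight')
best_100 = sorted(centrality, key=centrality.get, reverse=True)[:100]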
If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might take a look at quasi-Monte Carlo methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to regularly sample from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.
Yet another interesting approach would be to simply run the K-means algorithm on the scattered data, with a fixed number of clusters K=100. After the algorithm converges, you'll have 100 points from your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means and then sample from that larger set of possible means. Since your data do not appear to actually cluster into 100 components naturally, the convergence of this approach won't be very good and may require running the algorithm for a large number of iterations. This also has the downside that the resulting 100 points are not necessarily points that come from the observed data; instead, they will be local averages of many points.
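A minimal sketch of the K-means idea with scikit-learn (remember that the centers are local averages, not observed data points):
import numpy as np
from sklearn.cluster import KMeans

xy = np.random.normal(0, 1, (20000, 2))
# The 100 cluster centers act as the 100 "representative" points.
centers = KMeans(n_clusters=100, n_init=10).fit(xy).cluster_centers_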
This method, which iteratively picks the remaining point with the largest minimum distance to the already picked points, has terrible time complexity, but produces pretty uniformly distributed results:
from numpy import array, argmax, ndarray, vstack
from numpy.random import normal, randint
from scipy.spatial.distance import cdist

def well_spaced_points(points: ndarray, num_points: int):
    """
    Pick `num_points` well-spaced points from the `points` array.

    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick.
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # Pick a random starting point
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[:current_point_index],
        points[current_point_index + 1:]
    ))
    # While there are more points to pick
    while picked_points.shape[0] < num_points:
        # Find the remaining point whose minimum distance to the picked
        # points is largest
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # Add it to the picked points and remove it from the remaining ones
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[:i_furthest],
            remaining_points[i_furthest + 1:]
        ))
    return picked_points
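A hypothetical usage example (normal is already imported above):
pts = normal(0, 1, (5000, 2))
picked = well_spaced_points(pts, 100)
print(picked.shape)  # (100, 2)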

