Perform a different neighbourhood operation for specified pixels - python

I have an HxW "feature map", F. Let us assume that it is an HxWx1 map. Through some other operation, I have a set of pixels that are of interest to me (say N pixels). Each of these pixels is associated with a different value, so my set is of the form Nx3, where each pixel is given as x, y and val. Note that this val is different from the feature map value at that location.
Here is my question. Is it possible to vectorize a neighbourhood operation for each of these points? For each pixel n from N, I wish to multiply the corresponding val with its 3x3 neighbourhood in the feature map F. This gives a new 3x3 set of elements, new val. I then want to replace (x, y) with the location of the maximum of new val (the multiplied feature map) within the 3x3 window.
This sounds similar to a convolution (slight abuse of terminology here) followed by a max pool operation, but not exactly since each pixel location has a different val to be multiplied.
Sample input and output, and walkthrough for required solution
Let us assume H=10 and W=10
Here is a sample F
0.635955 0.922379 0.993406 0.007837 0.818661 0.983730 0.199866 0.757519 0.073152 0.015831
0.397718 0.097353 0.231351 0.177886 0.343099 0.419940 0.017342 0.087294 0.402266 0.366337
0.978686 0.476594 0.067836 0.148977 0.058994 0.810586 0.542894 0.797419 0.386559 0.225982
0.479860 0.033354 0.353366 0.431562 0.336208 0.674272 0.398151 0.713732 0.598623 0.829230
0.940838 0.869564 0.287100 0.669844 0.631836 0.748982 0.762292 0.597999 0.540236 0.758802
0.925995 0.141296 0.466772 0.672663 0.929746 0.544029 0.991860 0.197474 0.762866 0.798973
0.543519 0.128332 0.624323 0.876569 0.050709 0.223705 0.708381 0.380842 0.818092 0.163447
0.283125 0.329618 0.283481 0.672950 0.136922 0.897785 0.385479 0.764824 0.132671 0.091148
0.661984 0.369459 0.501181 0.352681 0.554113 0.133283 0.593048 0.108534 0.397813 0.836065
0.654929 0.928576 0.539204 0.931213 0.344114 0.591214 0.126809 0.456681 0.036531 0.725228
My structure of pixels, let us say N=3
The three values in the order of row,col,val: (for simplicity I assume x is rows, and y is cols, though it isn't necessarily the case). This is completely independent of the feature map in the previous step.
3,2,0.38
4,4,0.602
7,5,0.9647
The neighborhood around (3,2) is:
[[0.4765941 , 0.06783561, 0.14897662],
[0.03335438, 0.35336647, 0.4315618 ],
[0.86956374, 0.28709952, 0.66984412]]
Thus val * neighborhood (here val is 0.38) yields:
[[0.18110576, 0.02577753, 0.05661112],
[0.01267466, 0.13427926, 0.16399349],
[0.33043422, 0.10909782, 0.25454077]]
The location of max value here is (2,0) i.e. (1,-1) with respect to center pixel. Thus my updated (x,y) should be (3,2) + (1,-1) = (4,1).
Similarly for the other two, the updated pixels are : (5,4) and (7,5)
How can I parallelize this entire thing?
(Hopefully to be loaded onto a GPU using Pytorch, but not necessarily, I have not come to that stage yet.)
Note: I had asked this question a few days ago, but it was poorly framed without proper info. Hopefully this solves the issue.
Edit: For this specific instance, F can be produced as a random array:
F = np.random.rand(10,10)

If I understand correctly, you want this:
import numpy as np
from skimage.util.shape import view_as_windows
# pixels is the Nx3 array of (row, col, val) described in the question
idx = pixels[:, 0:2].astype(int)
# 3x3 windows centred on each pixel, each scaled by its val
win = view_as_windows(F, (3, 3))[tuple(idx.T - 1)] * pixels[:, -1][:, None, None]
# per-window argmax converted back to absolute (row, col) coordinates
print((np.unravel_index(win.reshape(-1, 9).argmax(1), (3, 3)) + idx.T).T - 1)
# if you need to replace the values of F with the new values
F[tuple(idx.T)] = win.reshape(-1, 9).max(1)
I assumed your window shape is (3,3); of course, you can change it. And if you need to handle neighbourhoods at the edges, pad F with enough zeros (depending on your window size) using np.pad before calling view_as_windows.
output:
[[4 1]
[5 4]
[7 5]]
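If you do eventually move this onto a GPU with PyTorch, a rough sketch of the same idea (not part of the answer above, and assuming F and pixels are the arrays defined in the question) could build the 3x3 windows with Tensor.unfold:
import torch

Ft = torch.as_tensor(F, dtype=torch.float32)        # (H, W) feature map
px = torch.as_tensor(pixels, dtype=torch.float32)   # (N, 3) rows of (row, col, val)
idx = px[:, :2].long()
windows = Ft.unfold(0, 3, 1).unfold(1, 3, 1)        # (H-2, W-2, 3, 3): all 3x3 windows
win = windows[idx[:, 0] - 1, idx[:, 1] - 1] * px[:, 2, None, None]
flat_arg = win.reshape(-1, 9).argmax(dim=1)         # argmax inside each window
offsets = torch.stack((flat_arg // 3, flat_arg % 3), dim=1)
print(idx - 1 + offsets)                            # updated (row, col) for each pixel
Moving Ft and px to the GPU with .to('cuda') would then be a one-line change.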

Related

Sum of Gaussians by multiple regression

I have 16 Gaussian curves which I have to fit with one Gaussian curve. I was unable to implement the sum of Gaussians (multiple regression) in python.
Here is the code I am using:
import matplotlib.pyplot as plt
import numpy as np
a=np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1=np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
x=[]
s=[]
v5=9.9e2
for j in range(0,len(a)):
    for i in range(-1500,1500):
        v11=a[j]+i
        x.append(v11)
        z=np.exp((-4*np.log(2)*((v11-a[j])/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
        s.append(z*v1[j])
plt.plot(x,s,'--r',)
plt.stem(a,v1)
Which generates the following plot (with the problem circled):
Instead of the desired output:
The output of your code shows this overlapping because you are not summing the 16 Gaussians but instead creating an array containing [x1_g1, x2_g1, ..., x3000_g1, x1_g2, ..., x3000_g16], and the same for s. It is a 1d array containing the 3000 x values of the first Gaussian, then the 3000 x values of the second Gaussian, and so on. But they are not added. Thus, the plot shows the 16 independent Gaussians instead of their sum, which is the desired output.
In the actual code, the x values of each Gaussian are different (going from -1500 to +1500 around its center), which makes adding the 16 Gaussians more complicated.
If we consider only the first 2 Gaussians, for instance, centered at 3750 and -250, the values appended to x for the first Gaussian go from 2250 to 5250 in steps of 1, together with their images in s, which are s(2250)... Afterwards, the values of the second Gaussian (x between -1750 and 1250) are appended (not added), which results in an x list like this:
x = [2250,2251,<in steps of 1>,5249,5250,-1750,-1749,<in steps of 1>,1250]
And s is a list where each position contains the image of the same position in x. Starting from this format, getting the final output, which is the sum of the Gaussians, is difficult, because we would have to look for equal values of x and sum their contributions...
However, if instead we always evaluate the Gaussians at the same positions (in the example, between -1750 and 5250 in steps of 1), we will store many more values, most of them zero, but adding them becomes straightforward.
Half-way vectorization
One option similar to the code in the question is the following:
a = np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1 = np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
v5 = 9.9e2
xrange = np.arange(a.min()-1500,a.max()+1500)
# This generates an array between the minimum of a minus 1500 and the maximum of a
# plus 1500. This way, all the values in the old x list are contained in this array.
# Therefore, it becomes really easy to sum the contribution of each gaussian,
# because only an element-wise sum is needed.
s = np.zeros(len(xrange))
for j,aj in enumerate(a):
    z = np.exp((-4*np.log(2)*((xrange-aj)/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
    s += z*v1[j]
plt.plot(xrange,s,'--r')
plt.stem(a,v1)
The output plot is the same as for the completely vectorized solution.
Completely vectorized solution
One simple solution is to define a unique xrange for all 16 gaussians, then calculate s for each of them (on the same x values) and finally sum over the 16 gaussians:
a = np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1 = np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
v5 = 9.9e2
xrange = np.arange(a.min()-1500,a.max()+1500)
z = np.exp((-4*np.log(2)*((xrange-a.reshape((len(a),1)))/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
s = z*v1.reshape((len(a),1))
plt.plot(xrange,s.sum(axis=0),'--r')
plt.stem(a,v1)
Note that I have removed the 2 nested loops using numpy.
The loop over range(-1500,1500) can be avoided by defining i=np.arange(-1500,1500) instead of the for i in ... and leaving the rest of the code untouched (only the indentation has to be updated). That is because numpy operates element-wise over arrays.
The second loop is a bit trickier. The a and v1 arrays are reshaped to 2d arrays in order to generate a z with shape (16, len(xrange)). That is why combining an array xrange of length much larger than 16 with a does not raise a dimension-mismatch error: one spans the first dimension and the other the second.
The code above generates the following plot:
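As a quick check of the shapes involved in that broadcasting (a supplementary snippet using the arrays defined above):
print(a.reshape((len(a), 1)).shape)              # (16, 1)
print(xrange.shape)                              # (len(xrange),)
print((xrange - a.reshape((len(a), 1))).shape)   # (16, len(xrange))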
Groupby solution
There is also the option of working with the same code to generate x and s and afterwards, plot every unique value of x (the same value of x can be found in x[i1],x[i2],x[i3]) versus s[i1]+s[i2]+s[i3].
This can be done adding the following code after the loops:
x,s = np.array(x),np.array(s)
ind = np.argsort(x)
x,s = x[ind],s[ind]
unique_x = np.unique(x)
catsums=[]
for k in unique_x:
    catsums.append(np.sum(s[np.where(x==k)]))
plt.plot(unique_x,catsums,'--r')
plt.stem(a,v1)
This groupby can also be vectorized using numpy or pandas, as explained in this other SO answer.
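For example, a minimal vectorized version of that groupby (assuming x and s are the lists built by the loops in the question) could use np.unique and np.bincount:
x, s = np.asarray(x), np.asarray(s)
unique_x, inverse = np.unique(x, return_inverse=True)
catsums = np.bincount(inverse, weights=s)   # sum of s over each unique x
plt.plot(unique_x, catsums, '--r')
plt.stem(a, v1)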

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform the test for each coordinate, not just for each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then I need to repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times, leaving me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it out and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so it only keeps the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test
data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro with a 2.7 GHz Intel Core i7 and 16 GB of RAM, so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
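For example, a small wrapper along those lines (a sketch, assuming mk_test returns (trend, h, p, z) as in the linked repository) might look like:
import numpy as np
from mk_test import mk_test

TREND_CODE = {'no trend': 0, 'positive': 1, 'negative': -1}

def mk_test_numeric(series, alpha=0.05):
    # return a purely numerical record instead of mixed strings and floats
    trend, h, p, z = mk_test(series, alpha)
    return np.array([TREND_CODE[trend], float(h), p, z])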
Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of coordinates that will be on my map (i.e. it's just 145*192), and 46 is the number of years the data spans.
I then used the following loop, modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46-year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown: newdata = np.reshape(y,(145,192,2)), so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145,192)
year_matrices = load_years() # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)
print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = zip(*year_arrays) # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')
flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track of where it should be fetching values from internally.
There's a catch, though: to take advantage of itertools you have to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
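For reference, a quick numpy check of that transpose-is-a-view behaviour:
import numpy as np
a = np.arange(6).reshape(2, 3)
t = a.T
t[0, 0] = 99
print(a[0, 0])   # 99: the transpose shares memory with a rather than copying it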
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as those on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm
def countTies(x):
    '''Count number of ties in rows of a 2D matrix
    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the number of ties is inserted at (not really) arbitrary
            locations.
            The locations of the tie numbers are not important, since
            they will subsequently be put into the formula sum(t*(t-1)*(2t+5)).
    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")
    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')
    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))
    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()
    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns
    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            for the trend in each column of <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            for the trend in each row of <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")
    # always put records in rows and do the M-K test on each row
    if axis == 0:
        data = data.T
    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))
    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)
    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8  # avoid dividing by 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)
    if tails == 2:
        p = p*2
    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
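For a quick cross-check of that single-series result (a supplementary snippet, assuming the pymannkendall package and its original_test function are available):
import pymannkendall as mk
print(mk.original_test(x).p)   # ~0.6334, matching the p-value reported above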

How does a numpy function handle a logical if operator for the axis argument?

I stumbled onto this by accident, but can't make sense of what is going on. I am doing a K-means clustering assignment with images and trying to vectorize the code to make it run as fast as possible. I came up with the following code:
image_values = np.array([[[0.36302522, 0.51708686, 0.20952381],
                          [0.46330538, 0.69915968, 0.2140056],
                          [0.7904762, 0.93837535, 0.27002802],
                          [0.78375351, 0.89187676, 0.24201682],
                          [0.57871151, 0.79775912, 0.24593839],
                          [0.2896359, 0.39103645, 0.64481789],
                          [0.23809525, 0.30924368, 0.64257705]],
                         [[0.36302522, 0.51708686, 0.20952381],
                          [0.46330538, 0.69915968, 0.2140056],
                          [0.7904762, 0.93837535, 0.27002802],
                          [0.78375351, 0.89187676, 0.24201682],
                          [0.57871151, 0.79775912, 0.24593839],
                          [0.2896359, 0.39103645, 0.64481789],
                          [0.23809525, 0.30924368, 0.64257705]],
                         [[0.36302522, 0.51708686, 0.20952381],
                          [0.46330538, 0.69915968, 0.2140056],
                          [0.7904762, 0.93837535, 0.27002802],
                          [0.78375351, 0.89187676, 0.24201682],
                          [0.57871151, 0.79775912, 0.24593839],
                          [0.2896359, 0.39103645, 0.64481789],
                          [0.23809525, 0.30924368, 0.64257705]]])
means = np.array([[0.909,0.839,0.6509],[0.813,0.808,0.694],[0.331,0.407,0.597]]) #random centroids
err = 1
while err > .01:
    J = [np.sum((image_values-avg)**2, axis = 2) for avg in means]
    K = np.argmin(J, axis = 0)
    old_means = means
    means = np.array([np.mean(image_values[K==i], axis ==True) for i in range(len(means))])
    print means
    err = abs(sum(old_means)-sum(means))
    print err
In each new means calculation, I used my K array to select which pixel values should be included in each mean calculation, but I couldn't get the axis to agree. I actually made a typo: instead of axis=3, I typed axis==3, and it worked! I tried a bunch of different numbers and found out that it doesn't matter what the number is, the result is the same. I also tried a bunch of numbers and Booleans with the equals operator, and they didn't work. I've gone through the documentation, but I couldn't figure it out.
What does numpy do when it gets a logical if in the axis argument of one of its array functions?
Thanks!
I am not entirely sure I fully understood what you're trying to do. Here's what I assume: you have one single image with RGB values and you would like to cluster the pixels within this image. Each centroid will thus define one value for each color channel respectively. I assume that each row in your means matrix is one centroid with the columns being the RGB values.
In your approach, I think you might have a mistake in the way you are subtracting the centroids. You will need to create a distance matrix for each centroid (at the moment you're not subtracting each color channel correctly).
Here's one proposition. Please note that with given example data you will run into a NaN error since not all centroids have pixels that are closest to them. You also might need to adjust the stopping criterion to your needs.
err = 1
while err > 0.1:
    # There are three centroids. We would like to compute the
    # distance for each pixel to each centroid. Here, the image
    # is thus replicated three times.
    dist = np.tile(image_values, (3,1,1,1))
    # The 2D matrix needs to be reshaped to fit the dimensions of
    # the dist matrix. With the new shape, the matrix can directly
    # be subtracted.
    means2 = means.reshape(3,3,1,1)
    # Subtract each respective RGB value of the centroid for
    # each "replica" of the image
    J = np.power(dist - means2, 2)
    # Sum the r,g,b channels together to get the total distance for a pixel
    J = J.sum(axis=1)
    # Check for which cluster the pixel is closest
    K = np.argmin(J, axis=0)
    # I couldn't think of a better way than this loop
    newMeans = np.zeros((3,3))
    for i in range(means.shape[0]): # do each centroid
        # In axis 1 there are pixels which we would like to
        # average for each color channel (axis 0 are the RGB channels)
        newMeans[i,:] = image_values[:,K==i].mean(axis=1)
    err = np.power(means - newMeans, 2).sum()
    means = newMeans

Loading .txt data into 10x256 3d numpy array

I'm trying to load some text files into numpy arrays. The .txt files represent pixels of an image where each pixel is given an arbitrary relative coordinate between -10 and +10 (for x) and 0 and 10 (for y). In total, the image is 10x256 pixels. The catch is that each pixel isn't given RGB values; it is given a list of intensities that correspond to the wavelength values in the first \n-separated "header". Each coordinate is given as the first two tab-separated items, and the first line only has "0 0" because it is the wavelength header. The format of the text files is as follows:
Line 1: "0 0 625.15360 625.69449 626.23538 ..." (two coordinates followed by the wavelengths)
Line 2: "-10.00000 -10.00000 839 841 833 843 838 847 ..."
Line 3: "-10.00000 -9.92157 838 839 838 ..."
Where 839 and 838 represent the intensity of the wavelength 625.15360 for two different adjacent pixels one on top of another (with a small change in y). Furthermore, 841 and 839 would be the intensity of the 625.69449 wavelength, and so on and so forth.
My reasoning thus far has been to iterate through the file using np.genfromtxt() and add the data to a new 3D numpy array indexed by (x, y, lambda), with each entry holding a single intensity value. Also, I think it would make much more sense if x and y spanned 0-9 and 0-255 respectively to represent the image, instead of the arbitrary relative coordinates given in the data...
Problem: I don't know how to load the data into a 3x3 (stuck figuring out 2x2) and I can't seem to slice properly...
What I have so far:
intensity_array2 = np.zeros([len(unique_y),len(unique_x)], dtype= int)
for element in np.nditer(intensity_array2, op_flags=['readwrite']):
    for i in range(len(unique_y)):
        for j in range(len(unique_x)):
            with open(os.path.join(path_name,right_file)) as rf:
                intensity_array2[i,j] = np.genfromtxt(rf, skip_header = (i*j)+j, delimiter = " ")
Where len(unique_y) = 10 and len(unique_x) = 256 are found in a function above.
I'm not sure I entirely understand your file format, so forgive me if this does not make sense. However, if there is any way you can load all the data in at once I am sure it will run much faster. It appears to me that you can use this to get all the data into memory:
data = np.genfromtxt(rf, delimiter = " ")
Then create your 3D array:
intensity_array2 = np.zeros((10, 256, num_wavelengths))
Then fill in the values of the 3D array:
intensity_array2[ data[:,0], data[:,1], :] = data[:, 2:]
This will not work exactly because your x and y indices can go negative--you might need to add an offset in this case. Also, if your input file is in a predictable format, you may be able to simply call np.reshape() on the data matrix to get what you want.
Building on Lukeclh's answer, try:
data = np.genfromtxt(rf)
Then, cleave off the wavelength values
wavelengths = data[0]
intensities = data[1:]
We can now rearrange the data using reshape:
intensitiesshaped = np.reshape(intensities, (len(unique_x),len(unique_y),-1))
The "-1" value says 'the rest goes here'.
We still have the leading values (the "0 0" placeholder and the two coordinate columns) on each of these arrays. To trim them, we can do:
wavelengths = wavelengths[2:]
intensitiesshaped = intensitiesshaped[:,:,2:]
This just throws the information in the first two indices away. If you need to retain it you'll have to do something a bit more sophisticated.
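If you also want x and y to span 0-9 and 0-255 instead of the raw relative coordinates, one possible sketch (assuming data is the full array loaded with np.genfromtxt above) is:
import numpy as np

coords = data[1:, :2]                    # the x, y columns of the pixel rows
xs = np.unique(coords[:, 0])             # distinct x coordinates, sorted
ys = np.unique(coords[:, 1])             # distinct y coordinates, sorted
ix = np.searchsorted(xs, coords[:, 0])   # 0-based x index for each pixel
iy = np.searchsorted(ys, coords[:, 1])   # 0-based y index for each pixel
These integer indices can then be used instead of the raw coordinates when filling the 3D array.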

Bootstrapping function grinds to a halt, due to python pseudorandom generator?

I am working on a kind of bootstrapping procedure for visual fixation data, and would be helped by the insights of others on this issue I am having. I suspect that either I'm missing something related to the functioning of the random number generator (random.randrange), or it shows my currently novice understanding of numpy array iteration and slicing. Being a psychologist with only hobby-level programming experience, I would not be surprised if it turns out I'm doing this in a really backwards way.
When you want to perform statistical analysis on visual fixation data, you often need to take center-bias into account, which is the bias whereby observers tend to fixate more to the center of an image at first and more randomly in the image later. This bias causes a temporal correlation between fixations, and an ROC-analysis (Receiver Operator Characteristic) performed on such data needs a baseline based on a specific kind of bootstrap method.
In this case, the data resides in a numpy array named original. This array is of shape (22, 800, 15, 2), where the dimensions indicate [observer, image, fixation, (x, y)]. So, 15 fixations per observer per image.
In the bootstrap, we generally want to replace each fixation with another fixation that occurs somewhere in the set of all other images and all observers, but at the same time (in this case: the same fixation index, index 2 of original).
I think this means that we have to do the following:
create a new array of the same dimensions as original. This array will be called shuffled.
check if current x or y in original == NaN. If so, do not change this fixation. Otherwise continue;
choose a random fixation from the subset of original that satisfies the following index: [all observers, all images except the current image, current fixation]. Make sure it does not contain NaN, otherwise pick another random fixation until it does not contain NaN;
Set shuffled to the random fixation at the current location in original.
I have a function that takes array original and does what is described above with the slight modification that when only one of the original x, y pair is NaN, it only sets that x or y in the random fixation to np.nan. When I iterate through the loops I saw good results. After iterating through +- 10 loops I was satisfied as all data looked perfect, after which I proceeded to remove the raw_input() breakpoints I had set and let the function process all of the data without interruption. When I did so, I noticed that the function slows down each loop and grinds to a halt when it reaches observer=0 image=48.
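For reference, the loop below relies on a few names defined elsewhere in the script; a minimal setup consistent with the description above (sketched here, not the original code) would be:
import numpy as np
from random import randrange

observers, images = original.shape[0], original.shape[1]   # 22 observers, 800 images
shuffled = np.empty_like(original)                          # same shape as original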
My code is as follows:
for obs_index, obs in enumerate(original):
    for img_index, img in enumerate(obs):
        print obs_index, img_index
        for fix_index, fix in enumerate(img):
            # do the following because sometimes only x or y in the original is NaN
            rand_fix = (np.nan, np.nan)
            while np.isnan(rand_fix[0]) or np.isnan(rand_fix[1]):
                rand_obs = randrange(observers)
                rand_img = img_index
                while rand_img == img_index:
                    rand_img = randrange(images)
                rand_fix = original[rand_obs, rand_img, fix_index]
            # do the following because sometimes only x or y in the original is NaN
            if np.isnan(fix[0]):
                rand_fix[0] = np.nan
            if np.isnan(fix[1]):
                rand_fix[1] = np.nan
            shuffled[obs_index, img_index, fix_index] = rand_fix
When this function finishes, shuffled should contain correctly shuffled fixation data for use in ROC-analysis.
SOLVED
I came up with the following code, that no longer slows down:
for obs_index, obs in enumerate(original):
    for img_index, img in enumerate(obs):
        for fix_index, fix in enumerate(img):
            x = fix[0]
            y = fix[1]
            rand_x = np.nan
            rand_y = np.nan
            if not(np.isnan(x) or np.isnan(y)):
                while np.isnan(rand_x) or np.isnan(rand_y):
                    rand_obs = randrange(observers)
                    rand_img = img_index
                    while rand_img == img_index:
                        rand_img = randrange(images)
                    rand_x = original[rand_obs, rand_img, fix_index, 0]
                    rand_y = original[rand_obs, rand_img, fix_index, 1]
            shuffled[obs_index, img_index, fix_index, 0] = rand_x
            shuffled[obs_index, img_index, fix_index, 1] = rand_y
I also fixed the way the new fixation was assigned to the location in shuffled, to follow numpy indexing properly.
