How can I simplify the following code so it runs faster? - python

I have a three-dimensional array containing many 2D images (frames). I want to remove the background by applying a threshold to each pixel value and copy the new elements into a new 3D array. I wrote the following code, but it is too computationally expensive to run. How can I speed it up?
ss = stack #3D array (571, 1040, 1392)
T,ni,nj = ss.shape
Background_intensity = np.ones([T,ni,nj])
Intensity = np.zeros([T,ni,nj])
DeltaF_F_max = np.zeros([T,ni,nj])
for t in range(T):
    for i in range(ni):
        for j in range(nj):
            if ss[t,i,j] < 12:
                Background_intensity[t,i,j] = ss[t,i,j]
                if Background_intensity[t,i,j] == 0:
                    Background_intensity[t,i,j] = 1
            else:
                Intensity[t,i,j] = ss[t,i,j]
            DeltaF_F_max[t,i,j] = (Intensity[t,i,j] - Background_intensity[t,i,j]) / Background_intensity[t,i,j]

I had a go at this with Numpy. I am not sure what results you got, but it takes around 20 s on my Mac. It is quite a memory hog, even after I reduced all the array sizes by a factor of 8, because you don't need an int64 to store a 1, a number under 12, or a value up to 255.
I wonder if you need to do 571 images all in one go or whether you could do them "on-the-fly" as you acquire them rather than gathering them all in one enormous lump.
You could also consider doing this with Numba as it is very good at optimising for loops - try putting [numba] in the search box above, or looking at this example - using prange to parallelise the loops across your CPU cores.
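For reference, a rough Numba version of the same per-pixel logic could look like the sketch below (my addition, not benchmarked; it assumes ss is the 3D stack from the question):
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def delta_f(ss):
    # Same per-pixel logic as the loops in the question, compiled and parallelised over frames
    T, ni, nj = ss.shape
    out = np.zeros((T, ni, nj), dtype=np.float32)
    for t in nb.prange(T):
        for i in range(ni):
            for j in range(nj):
                v = float(ss[t, i, j])
                if v < 12.0:
                    bg = v if v != 0.0 else 1.0
                    intensity = 0.0
                else:
                    bg = 1.0
                    intensity = v
                out[t, i, j] = (intensity - bg) / bg
    return out

# DeltaF_F_max = delta_f(ss)
Note that the first call includes compilation time, so time a second call to judge the speedup.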
Anyway, here is my code:
#!/usr/bin/env python3
# https://stackoverflow.com/q/71460343/2836621
import numpy as np
T, ni, nj = 571, 1040, 1392
# Create representative input data, such that around 1/3 of it is < 12 for testing
ss = np.random.randint(0,36,(T,ni,nj), np.uint8)
# Ravel into 1-D representation for simpler indexing
ss_r = ss.ravel()
# Create extra arrays but using 800MB rather than 6.3GB each, also ravelled
Background_intensity = np.ones(T*ni*nj, np.uint8)
Intensity = np.zeros(T*ni*nj, np.uint8)
# Make Boolean (True/False) mask of elements below threshold
mask = ss_r < 12
# Quick check here - print(np.count_nonzero(mask)/np.size(ss)) and check it is 0.333
# Set Background_intensity to "ss" according to mask
Background_intensity[mask] = ss_r[mask]
# Make sure no zeroes present
Background_intensity[Background_intensity==0] = 1
# This corresponds to the "else" of your original "if" statement
Intensity[~mask] = ss_r[~mask]
# Final calculation and reshaping back to original shape
# Cast to float so the uint8 subtraction cannot wrap around
DeltaF_F_max = (Intensity.astype(np.float32) - Background_intensity) / Background_intensity
DeltaF_F_max = DeltaF_F_max.reshape((T, ni, nj))
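As an optional sanity check (my addition, not part of the answer), you can run the original triple loop on a small slice and confirm the vectorised result matches:
# Reference implementation of the question's loops on a tiny slice
T2, ni2, nj2 = 2, 32, 32
small = ss[:T2, :ni2, :nj2]
ref = np.zeros((T2, ni2, nj2))
bg = np.ones((T2, ni2, nj2))
inten = np.zeros((T2, ni2, nj2))
for t in range(T2):
    for i in range(ni2):
        for j in range(nj2):
            if small[t, i, j] < 12:
                bg[t, i, j] = small[t, i, j]
                if bg[t, i, j] == 0:
                    bg[t, i, j] = 1
            else:
                inten[t, i, j] = small[t, i, j]
            ref[t, i, j] = (inten[t, i, j] - bg[t, i, j]) / bg[t, i, j]
print(np.allclose(ref, DeltaF_F_max[:T2, :ni2, :nj2]))   # should print True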

Related

How to parallelize a for loop with a shared array return?

I have a numpy array with an image, and a binary segmentation mask containing separate binary "blobs".
I wish to extract image statistics from pixels in correspondence of each of the binary blobs, separately. These values are stored inside a new numpy array, named cnr_map.
My current implementation uses a for loop. However, when the number of binary blobs increases, it is really slow, and I'm wondering if it is possible to parallelize it.
from scipy.ndimage import label

labeled_array, num_features = label(mask)
cnr_map = np.copy(mask)
for k in range(num_features):
    foreground_mask = labeled_array == k
    background_mask = 1.0 - foreground_mask
    a = np.mean(image[foreground_mask == 1])
    b = np.mean(image[background_mask == 1])
    c = np.std(image[background_mask == 1])
    cnr = np.abs(a - b) / (c + 1e-12)
    cnr_map[foreground_mask] = cnr
How can I parallelize the work so that the for loop runs faster?
I have seen this question, but my case is a bit different as I want to return a numpy array with the cumulative modifications of the multiple processes (i.e. cnr_map), and I don't understand how to do it.
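No answer is reproduced here, but one possible direction (my sketch, not from the thread) is to hand each blob to a separate worker with joblib and write the results back into cnr_map afterwards; note that scipy.ndimage.label reserves label 0 for the background, so the blob labels run from 1 to num_features:
import numpy as np
from scipy.ndimage import label
from joblib import Parallel, delayed

labeled_array, num_features = label(mask)

def blob_cnr(k):
    # Contrast-to-noise ratio for blob k, using the same statistics as the loop above
    foreground = labeled_array == k
    background = ~foreground
    a = image[foreground].mean()
    b = image[background].mean()
    c = image[background].std()
    return k, np.abs(a - b) / (c + 1e-12)

# Compute the per-blob statistics on all CPU cores
results = Parallel(n_jobs=-1)(delayed(blob_cnr)(k) for k in range(1, num_features + 1))

# Collect the cumulative modifications into a single output array afterwards
cnr_map = np.zeros(mask.shape, dtype=float)
for k, cnr in results:
    cnr_map[labeled_array == k] = cnr
Whether this beats the serial loop depends on the image size, since each worker gets its own copy of image and labeled_array.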

Is there a faster method for iterating over a very big 2D numpy array than using np.where?

I have a huge 2D numpy array filled with integer values. I collect them from a .tif image via gdal.GetRasterBand().
The pixel values of the image represent unique cluster-identification numbers, so all pixels inside one cluster have the same value.
In my script I want to check whether each cluster has more pixels than a specific threshold. If the cluster size is bigger than the threshold, I want to keep the cluster and give its pixels the value 1. If a cluster has fewer pixels than the threshold, all pixels of this cluster should get the value 0.
My code so far works, but it is very slow. And because I want to vary the threshold, it takes forever.
I would really appreciate your help. Thank you.
# Import GeoTIFF via GDAL and convert to NumpyArray
data = gdal.Open(image)
raster = data.GetRasterBand(1)
raster = raster.ReadAsArray()
# Different thresholds for iteration
thresh = [0,10,25,50,100,1000,2000]
for threshold in thresh:
    clusteredRaster = np.array(raster.copy(), dtype=int)
    for clump in np.unique(clusteredRaster):  # unique IDs of the clusters in the image
        if clusteredRaster[np.where(clusteredRaster == clump)].size >= threshold:
            clusteredRaster[np.where(clusteredRaster == clump)] = int(1)
        else:
            clusteredRaster[np.where(clusteredRaster == clump)] = int(0)
In the cluster image (https://i.stack.imgur.com/miEKg.png) each color stands for a specific cluster number. I want to delete the small clusters (under a specific size) and just keep the big ones.
There are a number of modifications that can be done to improve performance.
clusteredRaster = np.array(raster.copy(), dtype = int)
can be replaced with
clusteredRaster = raster.astype(int)
since astype is essentially a copy and a cast in one operation, so it is faster.
Now for
clusteredRaster[np.where(clusteredRaster == clump)] = int(1)
you don't need to call np.where; this will work faster:
clusteredRaster[clusteredRaster == clump] = int(1)
The same applies to this part:
clusteredRaster[np.where(clusteredRaster == clump)].size
You can also avoid evaluating clusteredRaster == clump twice, as follows:
for clump in np.unique(clusteredRaster):  # unique IDs of the clusters in the image
    indices = clusteredRaster == clump
    if clusteredRaster[indices].size >= threshold:
        clusteredRaster[indices] = int(1)
    else:
        clusteredRaster[indices] = int(0)
I think your function will now run roughly twice as fast. However, if you want it to run faster still, you have to use smaller data types, like np.uint8 instead of plain int, provided your image is encoded in RGB and can be represented by 8-bit ints (or maybe np.uint16 if 8 bits is too low).
This is about as fast as it gets on the Python side.
There are faster methods, like using C modules with OpenMP to multithread your work across multiple cores. This can easily be done with something like Numba or Cython without having to worry about writing C code, but there is a lot of reading to do if you want to achieve the best possible performance, such as which threading backend to use (TBB vs OpenMP) and some OS- and device-dependent capabilities.
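For example, a rough Numba sketch of the clump-threshold step (my addition, assuming the cluster IDs in raster are non-negative integers) could look like this:
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def threshold_clusters(labels, threshold):
    flat = labels.ravel()
    # First pass: count pixels per cluster ID (kept serial to avoid write races)
    counts = np.zeros(labels.max() + 1, dtype=np.int64)
    for k in range(flat.size):
        counts[flat[k]] += 1
    # Second pass: 1 where the cluster is large enough, 0 otherwise (parallel)
    out = np.zeros(flat.size, dtype=np.uint8)
    for k in nb.prange(flat.size):
        out[k] = 1 if counts[flat[k]] >= threshold else 0
    return out.reshape(labels.shape)

# clusteredRaster = threshold_clusters(raster, threshold)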
In addition to the changes suggested by Ahmed Mohamed AEK you can also take the calculation of unique values, indices, and counts outside of the for loops. Plus you don't need to copy raster each time - you can make an array of np.uint8s.
This gives the same results as your original implementation:
data = gdal.Open(image)
raster = data.GetRasterBand(1).ReadAsArray()
# Different thresholds for iteration
thresh = [0, 10, 25, 50, 100, 1000, 2000]
# determine the unique clumps and their frequencies outside of the for loops
clumps, counts = np.unique(raster, return_counts=True)
# only determine the indices once, rather than for each threshold
indices = np.asarray([raster==clump for clump in clumps])
for threshold in thresh:
    clustered_raster = np.zeros_like(raster, dtype=np.uint8)
    for clump_indices, clump_counts in zip(indices, counts):
        clustered_raster[clump_indices] = clump_counts >= threshold
I got an easy solution based on your helpful answers!
The idea is to find the unique values and cluster sizes per threshold and fill in the correct values directly, which avoids the inner loop.
It reduces the time from initially 142 seconds per iteration to 0.52 seconds and reproduces the same results.
data = gdal.Open(image)
raster = data.GetRasterBand(1).ReadAsArray()
thresh = [0, 10, 25, 50, 100, 1000, 2000]
for threshold in thresh:
    # Create a new 0-raster with the same dimensions as the input raster
    clusteredRaster = np.zeros(raster.shape, dtype=np.uint8)
    # Get the unique cluster IDs and count the size of each occurrence
    clumps, counts = np.unique(raster, return_counts=True)
    # Get only the clumps which are bigger than the threshold
    biggerClumps = clumps[counts >= threshold]
    # Fill in ones for the relevant cluster IDs
    clusteredRaster[np.isin(raster, biggerClumps)] = 1
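For what it's worth, the same result can be written even more compactly with np.bincount (my addition, assuming the cluster IDs are non-negative integers), since indexing the keep-flags with the raster maps every pixel straight to its cluster's 0/1 value:
counts = np.bincount(raster.ravel())                 # pixels per cluster ID
for threshold in thresh:
    keep = counts >= threshold                       # which cluster IDs survive
    clusteredRaster = keep[raster].astype(np.uint8)  # 1 for big clusters, 0 otherwise
Note that np.bincount allocates one counter per possible ID value, so this assumes the IDs do not reach into the billions.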

Sum values from numpy array if condition on value in another array is met

I'm facing a problem with vectorizing a function so that it applies efficiently on a numpy array.
My program's inputs:
A pos_part 2D array of Nb_particles rows and 3 columns (basically x, y, z coordinates; only z is relevant for the part that bothers me). Nb_particles can be up to several hundred thousand.
A prop_part 1D array with Nb_particles values. This part I have covered; it is created with some nice numpy functions, and I just put a basic distribution here that resembles the real values.
A z_distances 1D array, a simple np.arange between z=0 and z=z_max.
Then comes the calculation that takes time, because I can't find a way to do it properly with only numpy array operations. What I want to do is:
For each distance z_i in z_distances, sum all values from prop_part whose corresponding particle coordinate satisfies z_particle < z_i. This would return a 1D array the same length as z_distances.
My ideas so far:
Version 0: a for loop, enumerate, and np.where to retrieve the indices of the values I need to sum. Obviously quite slow.
Version 1: using a mask on a new array (a combination of the z coordinates and the particle properties), and summing over the masked array. Seems better than v0.
Version 2: another mask and np.vectorize, but I understand it's not efficient since vectorize is basically a for loop. Still seems better than v0.
Version 3: I'm trying to use a mask in a function that I can apply directly to z_distances, but it's not working so far.
So, here I am. There may be something to do with a sort and a cumulative sum, but I don't know how to do it, so any help would be greatly appreciated. Please find the code below to make things clearer.
Thanks in advance.
import numpy as np
import time
import matplotlib.pyplot as plt
# Creation of particles' positions
Nb_part = 150_000
pos_part = 10*np.random.rand(Nb_part,3)
pos_part[:,0] = pos_part[:,1] = 0
#usefull property creation
beta = 1/1.5
prop_part = (1/beta)*np.exp(-pos_part[:,2]/beta)
z_distances = np.arange(0,10,0.1)
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = np.where(pos_part[:,2]<val_dist)[0]
    result[index_dist] = sum(prop_part[i] for i in positions)
print("v0 :",time.time()-t0)
#A graph to help understand
plt.figure()
plt.plot(z_distances,result, c="red")
plt.ylabel("Sum of particles' usefull property for particles with z-pos<d")
plt.xlabel("d")
#version 1 ??
t1=time.time()
combi = np.column_stack((pos_part[:,2],prop_part))
result2 = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    mask = (combi[:,0]<val_dist)
    result2[index_dist] = sum(combi[:,1][mask])
print("v1 :",time.time()-t1)
plt.plot(z_distances,result2, c="blue")
#version 2
t2=time.time()
def themask(a):
    mask = (combi[:,0]<a)
    return sum(combi[:,1][mask])
thefunc = np.vectorize(themask)
result3 = thefunc(z_distances)
print("v2 :",time.time()-t2)
plt.plot(z_distances,result3, c="green")
### This does not work so far
# version 3
# =============================
# t3=time.time()
# def thesum(a):
#     mask = combi[combi[:,0]<a]
#     return sum(mask[:,1])
# result4 = thesum(z_distances)
# print("v3 :",time.time()-t3)
# =============================
You can get a lot more performance by writing your first version completely in numpy. Replace Python's built-in sum with np.sum. Instead of the for i in positions generator expression, simply pass the positions mask you are creating anyway.
Indeed, the np.where is not necessary and my best version looks like:
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = pos_part[:, 2] < val_dist
    result[index_dist] = np.sum(prop_part[positions])
print("v0 :",time.time()-t0)
# out: v0 : 0.06322097778320312
You can get a bit more speed, if z_distances is very long, by using numba.
Running calc for the first time usually incurs some compilation overhead, which we can get rid of by first running the function on a small subset of z_distances.
The below code achieves roughly a factor of two speedup over pure numpy on my laptop.
import numba as nb
@nb.njit(parallel=True)
def calc(result, z_distances):
    n = z_distances.shape[0]
    for ii in nb.prange(n):
        pos = pos_part[:, 2] < z_distances[ii]
        result[ii] = np.sum(prop_part[pos])
    return result
result4 = np.zeros_like(result)
# _t = time.time()
# calc(result4, z_distances[:10])
# print(time.time()-_t)
t3 = time.time()
result4 = calc(result4, z_distances)
print("v3 :", time.time()-t3)
plt.plot(z_distances, result4)
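The sort-and-cumulative-sum idea mentioned in the question also works and removes the Python loop over z_distances entirely; here is a hedged sketch (my addition, reusing pos_part, prop_part and z_distances from above):
# Sort particles by z and take the running sum of their property
order = np.argsort(pos_part[:, 2])
z_sorted = pos_part[order, 2]
cum_prop = np.cumsum(prop_part[order])

# For each distance, count the particles strictly below it, then read off the
# cumulative sum at that count (0 when no particle lies below the distance)
idx = np.searchsorted(z_sorted, z_distances, side='left')
result5 = np.where(idx > 0, cum_prop[np.maximum(idx - 1, 0)], 0.0)
This should match the loop versions up to floating-point rounding.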

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test for each coordinate, not just for each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then I would repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it out and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so it only keeps the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test
data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of values I have at every coordinate that will be on my map (i.e. it's just 145*192), and 46 is the number of years the data spans.
I then used the following loop, modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46-year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown: newdata = np.reshape(y,(145,192,2)), so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
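For anyone reading later, a hedged alternative (my addition) to the flatten-and-append loops above: a single transpose-and-reshape produces the same (27840, 46) table, and kendalltau's two return values can be unpacked directly:
import numpy as np
from scipy import stats

# yrmax has shape (46, 145, 192); move time to the last axis and flatten the grid
out2 = yrmax.transpose(1, 2, 0).reshape(-1, 46)   # shape (27840, 46)

x = np.arange(46)
taus = np.empty(out2.shape[0])
pvals = np.empty(out2.shape[0])
for j in range(out2.shape[0]):
    taus[j], pvals[j] = stats.kendalltau(x, out2[j, :])

tau_map = taus.reshape(145, 192)
p_map = pvals.reshape(145, 192)
This avoids the np.append juggling and keeps tau and p in separate arrays from the start.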
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145,192)
year_matrices = load_years()  # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)
print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = zip(*year_arrays)  # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')
flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track of where it should be fetching values from internally.
There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is working out the adjustment for tied values. I modified the code from this answer to compute the number of tied values for each record in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm
def countTies(x):
    '''Count number of ties in rows of a 2D matrix

    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each row,
            the numbers of ties are inserted at (not really) arbitrary
            locations. The locations of the tie numbers are not important,
            since they will subsequently be put into a formula of
            sum(t*(t-1)*(2t+5)).

    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")

    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')

    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))

    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1

    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()
    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns

    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            for the trend in each column of <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            for the trend in each row of <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")

    # always put records in rows and do the M-K test on each row
    if axis == 0:
        data = data.T

    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None, ...], m, axis=0)
    s = np.sign(data[:, None, :] - data[:, :, None]).astype('int')
    s = (s * mask).sum(axis=(1, 2))

    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)

    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8  # avoid dividing by 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p > 0.5, 1-p, p)

    if tails == 2:
        p = p*2

    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.

Bootstrapping function grinds to a halt, due to python pseudorandom generator?

I am working on a kind of bootstrapping procedure for visual fixation data, and would be helped by the insights of others on this issue I am having. I suspect that either I'm missing something related to the functioning of the random number generator (random.randrange), or it shows my currently novice understanding of numpy array iteration and slicing. Being a psychologist with only hobby-level programming experience, I would not be surprised if it turns out I'm doing this in a really backwards way.
When you want to perform statistical analysis on visual fixation data, you often need to take center-bias into account, which is the bias whereby observers tend to fixate more to the center of an image at first and more randomly in the image later. This bias causes a temporal correlation between fixations, and an ROC-analysis (Receiver Operator Characteristic) performed on such data needs a baseline based on a specific kind of bootstrap method.
In this case, the data resides in a numpy array named original. This array is of shape (22, 800, 15, 2), where the dimensions indicate [observer, image, fixation, (x, y)]. So, 15 fixations per observer per image.
In the bootstrap, we generally want to replace each fixation with another fixation that occurs somewhere in the set of all other images and all observers, but at the same time (in this case: the same fixation index, index 2 of original).
I think this means that we have to do the following:
create a new array of the same dimensions as original. This array will be called shuffled.
check if current x or y in original == NaN. If so, do not change this fixation. Otherwise continue;
choose a random fixation from the subset of original that satisfies the following index: [all observers, all images except the current image, current fixation]. Make sure it does not contain NaN, otherwise pick another random fixation until it does not contain NaN;
Set shuffled to the random fixation at the current location in original.
I have a function that takes array original and does what is described above, with the slight modification that when only one of the original x, y pair is NaN, it only sets that x or y in the random fixation to np.nan. When I stepped through the loops I saw good results. After iterating through +- 10 loops I was satisfied that all the data looked perfect, after which I removed the raw_input() breakpoints I had set and let the function process all of the data without interruption. When I did so, I noticed that the function slows down each loop and grinds to a halt when it reaches observer=0 image=48.
My code is as follows:
for obs_index, obs in enumerate(original):
    for img_index, img in enumerate(obs):
        print obs_index, img_index
        for fix_index, fix in enumerate(img):
            # do the following because sometimes only x or y in the original is NaN
            rand_fix = (np.nan, np.nan)
            while np.isnan(rand_fix[0]) or np.isnan(rand_fix[1]):
                rand_obs = randrange(observers)
                rand_img = img_index
                while rand_img == img_index:
                    rand_img = randrange(images)
                rand_fix = original[rand_obs, rand_img, fix_index]
            # do the following because sometimes only x or y in the original is NaN
            if np.isnan(fix[0]):
                rand_fix[0] = np.nan
            if np.isnan(fix[1]):
                rand_fix[1] = np.nan
            shuffled[obs_index, img_index, fix_index] = rand_fix
When this function finishes, shuffled should contain correctly shuffled fixation data for use in ROC-analysis.
SOLVED
I came up with the following code, which no longer slows down. (In hindsight, the slowdown was likely because rand_fix in the original version was a view into original, so writing np.nan into it gradually filled original itself with NaNs and the while loop had to reject more and more candidates; the version below only reads scalar values and never writes back into original.)
for obs_index, obs in enumerate(original):
    for img_index, img in enumerate(obs):
        for fix_index, fix in enumerate(img):
            x = fix[0]
            y = fix[1]
            rand_x = np.nan
            rand_y = np.nan
            if not(np.isnan(x) or np.isnan(y)):
                while np.isnan(rand_x) or np.isnan(rand_y):
                    rand_obs = randrange(observers)
                    rand_img = img_index
                    while rand_img == img_index:
                        rand_img = randrange(images)
                    rand_x = original[rand_obs, rand_img, fix_index, 0]
                    rand_y = original[rand_obs, rand_img, fix_index, 1]
            shuffled[obs_index, img_index, fix_index, 0] = rand_x
            shuffled[obs_index, img_index, fix_index, 1] = rand_y
I also fixed the way the new fixation was assigned to the location in shuffled, to follow numpy indexing properly.
