Loop through large .tif stack (image raster) and extract positions - python

I want to run through a large tif stack (1500+ frames) and extract the coordinates of the local maxima for each frame. The code below does the job, but it is extremely slow for large files. When running on a small subset (e.g. 20 frames) each frame is done almost instantly; when running on the whole dataset, each frame takes seconds.
Is there a way to make this faster? I suspect the slowdown comes from loading the large tiff file, but shouldn't that only be necessary once, at the start?
I have the following code:
import numpy as np
import pandas as pd
from pims import ImageSequence
from skimage.feature import peak_local_max
def cmask(index,array):
    radius = 3
    a,b = index
    nx,ny = array.shape
    y,x = np.ogrid[-a:nx-a,-b:ny-b]
    mask = x*x + y*y <= radius*radius
    return(sum(array[mask])) # sum of the pixel values inside the circular mask
images = ImageSequence('tryhard_red_small.tif')
frame_list = []
x = []
y = []
int_liposome = []
BG_liposome = []
for i in range(len(images[0])):
    tmp_frame = images[0][i]
    xy = pd.DataFrame(peak_local_max(tmp_frame, min_distance=8,threshold_abs=3000))
    x.extend(xy[0].tolist())
    y.extend(xy[1].tolist())
    for j in range(len(xy)):
        index = x[j],y[j]
        int_liposome.append(cmask(index,tmp_frame))
    frame_list.extend([i]*len(xy))
    print "Frame: ", i, "of ",len(images[0])
features = pd.DataFrame(
    {'lip_int':int_liposome,
     'y' : y,
     'x' : x,
     'frame' : frame_list})

Have you tried profiling the code, say with %prun or %lprun in ipython? That'll tell you exactly where your slowdowns are occurring.
I can't make my own version of this without the tif stack, but I suspect the problem is that you're using lists to store everything. Whenever an append or an extend pushes a list past its current capacity, Python has to allocate more memory. You could try getting the total count of maxima first, then allocating your output arrays, then rerunning to fill the arrays. Something like the following:
# run through once to get the count of local maxima
npeaks = (len(peak_local_max(f, min_distance=8, threshold_abs=3000))
          for f in images[0])
total_peaks = sum(npeaks)
# allocate storage arrays and rerun
x = np.zeros(total_peaks, float)
y = np.zeros_like(x)
int_liposome = np.zeros_like(x)
BG_liposome = np.zeros_like(x)
frame_list = np.zeros(total_peaks, int)
index_0 = 0
for frame_ind, tmp_frame in enumerate(images[0]):
    peaks = pd.DataFrame(peak_local_max(tmp_frame, min_distance=8,threshold_abs=3000))
    index_1 = index_0 + len(peaks)
    # copy the data from the DataFrame's underlying numpy array
    x[index_0:index_1] = peaks[0].values
    y[index_0:index_1] = peaks[1].values
    # iterate over the rows (one peak per row), not over the column labels
    for i, peak in enumerate(peaks.values, index_0):
        int_liposome[i] = cmask(peak, tmp_frame)
    frame_list[index_0:index_1] = frame_ind
    # update the starting index
    index_0 = index_1
    print "Frame: ", frame_ind, "of ",len(images[0])
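The count-then-fill pattern above can be tried without the tif stack. The following is only a sketch on synthetic frames: `detect` is a hypothetical stand-in for `peak_local_max` (a plain threshold, not a local-maximum search), and `collect_peaks`, the frame size and the threshold value are assumptions made for illustration.

```python
import numpy as np

def detect(frame, threshold):
    # Hypothetical stand-in for peak_local_max: returns an (n, 2) array
    # of (row, col) positions whose value exceeds the threshold.
    return np.argwhere(frame > threshold)

def collect_peaks(frames, threshold):
    # First pass: count the detections so the outputs can be allocated once.
    total = sum(len(detect(f, threshold)) for f in frames)
    rows = np.zeros(total, dtype=np.int64)
    cols = np.zeros(total, dtype=np.int64)
    frame_ids = np.zeros(total, dtype=np.int64)
    # Second pass: fill consecutive slices of the preallocated arrays.
    start = 0
    for frame_id, frame in enumerate(frames):
        peaks = detect(frame, threshold)
        stop = start + len(peaks)
        rows[start:stop] = peaks[:, 0]
        cols[start:stop] = peaks[:, 1]
        frame_ids[start:stop] = frame_id
        start = stop
    return rows, cols, frame_ids

# Synthetic data: 5 frames of 32x32 random intensities.
rng = np.random.default_rng(0)
frames = rng.integers(0, 4096, size=(5, 32, 32))
rows, cols, frame_ids = collect_peaks(frames, threshold=4000)
```

Note that this runs the detector twice per frame, so it only pays off if growing the lists really is the bottleneck; profiling first, as suggested above, would show whether that is the case.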

Related

How do I find the saturation point of a curve in python?

I have a graph of the number of FRB detections against the Signal to Noise Ratio.
At a certain point, the Signal to Noise ratio flattens out.
The input variable (the number of FRB detections) is defined by
N_vals = np.logspace(0, np.log10((10)**(11)), num = 1000)
and I have a series of arrays that correspond to outputs of the Signal to Noise Ratio (they have the same length).
So far, I have used numpy.gradient() on all the Signal-to-Noise (SNR) ratios to obtain the corresponding slope at every point.
I want to obtain the index at which the gradient of the Signal-to-Noise Ratio dips below a certain threshold.
Using numpy functions designed to find the inflexion point won't work in my case as the gradient continues to increase - just very gradually.
Here is some code to illustrate my initial attempt:
import numpy as np
grad100 = np.gradient(NDM100)
grad300 = np.gradient(NDM300)
grad1000 = np.gradient(NDM1000)
#print(grad100)
grad2 = np.gradient(N2)
grad5 = np.gradient(N5)
grad10 = np.gradient(N10)
glist = [np.array(grad2), np.array(grad5), np.array(grad10), np.array(grad100), np.array(grad300), np.array(grad1000)]
indexlist = []
for g in glist:
    for i in g:
        satdex = np.where(i == 10**(-4))[0]
        indexlist.append(satdex)
Doing this just gives me a list of empty arrays - for instance:
[array([], dtype=int64),..., array([], dtype=int64)]
Does anyone know a better way of doing this? I just want the indices corresponding to the points at which the gradient is 10**(-4) for each array. This is my 'saturation point'.
Please let me know if I need to provide more information and if so, what exactly. I'm not expecting anyone to run my code as there is a lot of it; rather, I'm after some general tips or some commentary on the structure of my code. I've attached the graph that corresponds to my data (the arrows show what I mean by the point at which the SNR flattens out).
I feel that this is a fairly simple programming problem and therefore doesn't warrant the detail that would be found in questions on error messages for example.
[Figure: SNR curves with arrows indicating what I mean by 'saturation points']
Alright so I think I've got it. I'm attaching my code below. Obviously it's taken out of context here and won't run by itself so this is just so anyone that finds this question can see what kind of structure works. The general idea is that for a given set of curves, I find the x and y-values at which they begin to flatten out.
x = 499
N_vals2 = N_vals[500:]
grad100 = np.gradient(NDM100)
grad300 = np.gradient(NDM300)
grad1000 = np.gradient(NDM1000)
grad2 = np.gradient(N2)
grad5 = np.gradient(N5)
grad10 = np.gradient(N10)
preg_list = [grad100, grad300, grad1000, grad2, grad5, grad10]
g_list = []
for gl in preg_list:
    g_list.append(gl[500:])
sneg_list = [NDM100, NDM300, NDM1000, N2, N5, N10]
sn_list = []
for sl in sneg_list:
    sn_list.append(sl[500:])
t_list = []
gt_list = []
ic_list = []
for g in g_list:
    threshold = 0.1*np.max(g)
    thresh_array = np.full(len(g), fill_value = threshold)
    t_list.append(threshold)
    gt_list.append(thresh_array)
    ic = np.isclose(g, thresh_array, rtol = 0.5)
    ic_list.append(ic)
index_list = []
grad_list = []
for i in ic_list:
    index = np.where(i == True)
    index_list.append(index)
    for j in g_list:
        gval = j[index]
        grad_list.append(gval)
saturation_indices = []
for gl in index_list:
    first_index = gl[0][0]
    saturation_indices.append(first_index)
#print(saturation_indices)
saturation_points = []
sn_list_firsts = [snf[0] for snf in sn_list]
for s in saturation_indices:
    n = round(N_vals2[s], 0)
    sn_tuple = (n, s)
    saturation_points.append(sn_tuple)
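The same thresholding idea can be written more compactly with an inequality instead of `np.isclose`. This is a minimal sketch, not the code above: `saturation_index` is a hypothetical helper, the curve is synthetic, and the 10% relative threshold is simply carried over from the snippet above.

```python
import numpy as np

def saturation_index(curve, rel_threshold=0.1):
    # Gradient with respect to the sample index, as in np.gradient(curve).
    grad = np.gradient(curve)
    # Only look past the steepest point, so the flat start of the curve
    # is not mistaken for saturation.
    peak = int(np.argmax(grad))
    below = grad[peak:] < rel_threshold * grad[peak]
    hits = np.flatnonzero(below)
    # Return None if the gradient never falls below the threshold.
    return peak + int(hits[0]) if hits.size else None

# Synthetic saturating curve on the same kind of log-spaced axis.
N_vals = np.logspace(0, 11, num=1000)
snr = 1.0 - np.exp(-N_vals / 1e4)
idx = saturation_index(snr)
```

An exact comparison such as `grad == 10**(-4)` almost never matches a floating-point gradient, which is why the inequality is used here; `N_vals[idx]` then gives the x-value at which the curve starts to flatten.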

Python MNIST Digit Data Testing Failure

Hello I am using Python to try to read the digit data provided by MNIST into a data structure I can use to train a neural network. I am testing to ensure the data was read properly by creating an image using PIL. The image that is being created is horribly wrong, and I am not sure if it is because I am using PIL incorrectly or my data structures and methods are not right.
The format of the two data files is described here:
http://yann.lecun.com/exdb/mnist/
Here are the applicable functions:
read_image_data reads the pixel data, organizing it into a list of 2D numpy arrays
def read_image_data():
    fd = open("train-images.idx3-ubyte", "rb")
    images_bin_string = fd.read()
    num_images = struct.unpack(">i", images_bin_string[4:8])[0]
    image_data_bank = []
    uint32_num_bytes = 4
    current_index = 8
    num_rows = struct.unpack(">I", \
        images_bin_string[current_index: current_index + uint32_num_bytes])[0]
    num_cols = struct.unpack(">I", \
        images_bin_string[current_index + uint32_num_bytes: \
        current_index + uint32_num_bytes * 2])[0]
    current_index += 8
    i = 0
    while i < num_images:
        image_data = np.zeros([num_rows, num_cols])
        for j in range(num_rows - 1):
            for k in range(num_cols - 1):
                image_data[j][k] = images_bin_string[current_index + j * k]
        current_index += num_rows * num_cols
        i += 1
        image_data_bank.append(image_data)
    return image_data_bank
read_label_data reads the corresponding labels into a list
def read_label_data():
    fd = open("train-labels.idx1-ubyte", "rb")
    images_bin_string = fd.read()
    num_images = struct.unpack(">i", images_bin_string[4:8])[0]
    image_data_bank = []
    current_index = 8
    i = 0
    while i < num_images:
        image_data_bank.append(images_bin_string[current_index])
        current_index += 1
        i += 1
    return image_data_bank
collect_data zips the structures together
def collect_data():
    print("Reading image data...")
    image_data = read_image_data()
    print("Reading label data...")
    label_data = read_label_data()
    print("Zipping data sets...")
    all_data = np.array(list(zip(image_data, label_data)))
    return all_data
Lastly, run_test uses PIL to display the pixels from the first 28x28 numpy array created by read_image_data
def run_test(data):
    example = data[0]
    pixel_data = example[0]
    number = example[1]
    print(number)
    im = Image.fromarray(pixel_data)
    im.show()
When I run the script:
Collecting data... Reading image data... Reading label data... Zipping data sets... 5
I must be messing something up with the PIL library, but I do not know what.
That is a really weird looking 5. I am guessing that I went wrong somewhere in my organization of the data. The directions did say "Pixels are organized row-wise.", but I think I covered that by having my outer loop as the row loop then the inner as the column loop
UPDATE
I reversed the order of the row and column index in the np.arrays in read_image_data and it is making no difference.
image_data[k][j] = images_bin_string[current_index + j * k]
UPDATE
Ran quick test with matplotlib
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
imgplot = plt.imshow(pixel_data)
plt.show()
Here is what I got from matplotlib
That means it is definitely a problem with my code and not the library. The question is whether it is the way I am passing the pixels to the imaging libraries or how I structured the data. If anyone can find the mistake, I would greatly appreciate it.
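For comparison, here is a minimal sketch of decoding row-wise pixel data, assuming the header layout described on the linked page (four big-endian 32-bit integers: magic number, image count, rows, columns, followed by one unsigned byte per pixel). In a row-wise layout the offset of pixel (row, col) inside an image is `row * num_cols + col`, and a reshape does that arithmetic implicitly. `parse_idx_images` is a hypothetical helper, and the buffer is synthetic rather than the real MNIST file.

```python
import struct
import numpy as np

def parse_idx_images(raw):
    # Header: magic number, image count, rows, cols (big-endian uint32 each).
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    # Pixels follow as unsigned bytes, stored image by image, row by row.
    pixels = np.frombuffer(raw, dtype=np.uint8, offset=16,
                           count=count * rows * cols)
    return pixels.reshape(count, rows, cols)

# Synthetic buffer: 2 images of 3 rows x 4 columns with pixel values 0..23.
header = struct.pack(">IIII", 2051, 2, 3, 4)
body = bytes(range(24))
images = parse_idx_images(header + body)
```

With this layout `images[0][1][2]` is the byte at offset `1 * 4 + 2` of the first image. The resulting arrays are `uint8`, which may also be relevant when passing them to an imaging library.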

Python fast Fourier transform for very noisy data

I have a file with velocity magnitude data and vorticity magnitude data from a fluid simulation.
I want to find the dominant frequency of each of these two data sets.
My code:
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import re
import math
import matplotlib.pyplot as plt
import numpy as np
probeU1 = []
probeV1 = []
# this creates an array containing all the timesteps, cutting off the first 180, because the system has to stabilize.
number = [ round(x * 0.1, 1) for x in range(180, 301)]
# this loop goes over the different time directories, and reads the velocity file.
for i in range(len(number)):
    filenamepath = "/Refinement/Vorticity4/probes/" +str(number[i]) + "/U"
    data= open(filenamepath,"r")
    temparray = []
    #removes all the formatting around the data
    for line in data:
        if line.startswith('#'):
            continue
        else:
            line = re.sub('[()]', "", line)
            values = line.split()
            #print values[1], values[2]
            xco = values[1::3]
            yco = values[2::3]
            #here it extracts all the velocity data from all the different probes
            for i in range(len(xco)):
                floatx = float(xco[i])
                floaty = float(yco[i])
                temp1 = math.pow(floatx,2)
                temp2 = math.pow(floaty,2)
                #print temp2, temp1
                temp3 = temp1+temp2
                temp4 = math.sqrt(temp3)
                #takes the magnitude of the velocity
                #print temp4
                temparray.append(temp4)
    probeU1.append(temparray)
#
#print probeU1[0]
#print len(probeU1[0])
#
# this loop goes over the different time directories, and reads the vorticity file.
for i in range(len(number)):
    filenamepath = "/Refinement/Vorticity4/probes/" +str(number[i]) + "/vorticity"
    data= open(filenamepath,"r")
    # print data.read()
    temparray1 = []
    for line in data:
        if line.startswith('#'):
            continue
        else:
            line = re.sub('[()]', "", line)
            values = line.split()
            zco = values[3::3]
            #because of the 2-dimensionality the z-component of the vorticity is already the magnitude
            for i in range(len(zco)):
                abso = float(zco[i])
                add = np.abs(abso)
                temparray1.append(add)
    probeV1.append(temparray1)
#Old code block to display the data and check that it made a wave pattern(which it did)
##Printing all probe data from 180-300 in one graph(unintelligible)
#for i in range(len(probeU1[1])):
# B=[]
# for l in probeU1:
# B.append(l[i])
## print 'B=', B
## print i
# plt.plot(number,B)
#
#
#plt.ylabel('magnitude of velocity')
#plt.show()
#
##Printing all probe data from 180-300 in one graph(unintelligible)
#for i in range(len(probeV1[1])):
# R=[]
# for l in probeV1:
# R.append(l[i])
## print 'R=', R
## print i
# plt.plot(number,R)
#
#
#plt.ylabel('magnitude of vorticity')
#plt.show()
#Here is where the magic happens, (i hope)
ans=[]
for i in range(len(probeU1[1])):
    b=[]
    #probeU1 is a nested list, because there are 117 different probes, which all have the data from timestep 180-301
    for l in probeU1:
        b.append(l[i])
    #the frequency was not oscillating around 0, so moved it there by subtracting the mean
    B=b-np.mean(b)
    #here the fft happens
    u = np.fft.fft(B)
    #This should calculate the frequencies?
    freq = np.fft.fftfreq(len(B), d= (number[1] - number[0]))
    # If I'm not mistaken this finds the peak frequency for 1 probe and passes it to another list
    val = np.argmax(np.abs(u))
    ans.append(np.abs(freq[val]))
    plt.plot(freq, np.abs(u))
#print np.mean(ans)
plt.xlabel('frequency?')
plt.savefig('velocitiy frequency')
plt.show()
# just a duplicate of the one above it
ans1=[]
for i in range(len(probeV1[1])):
    c=[]
    for l in probeU1:
        c.append(l[i])
    C=c-np.mean(c)
    y = np.fft.fft(C)
    freq1 = np.fft.fftfreq(len(C), d= (number[1] - number[0]))
    val = np.argmax(np.abs(y))
    ans1.append(np.abs(freq1[val]))
    plt.plot(freq1, np.abs(y))
#print np.mean(ans1)
plt.ylabel('frequency?')
plt.savefig('vorticity frequency')
plt.show()
data.close()
My data contains 117 probes, each with its own 121 points of velocity magnitude data.
My aim is to find the dominant frequency for each probe, then collect all of those and plot them in a histogram.
My question is about the part marked "Here is where the magic happens". I believe the fft itself is already working correctly:
y = np.fft.fft(C)
freq1 = np.fft.fftfreq(len(C), d= (number[1] - number[0]))
If I'm not mistaken, the freq1 list should contain all the frequencies for a given probe. I've checked this list visually and the number of different frequencies is very high (20+), so the signal is probably very noisy.
# If im not mistakes this finds the peak frequency for 1 probe and passes it another list
val = np.argmax(np.abs(u))
ans.append(np.abs(freq1[val]))
This part should in theory take the biggest signal from one probe and then put it in the "ans" list. But I'm a bit confused as to how I can correctly identify the right frequency, as there should in theory be one main frequency. How can I correctly estimate the "main" frequency from all this noisy data?
For reference, I'm modeling a von Kármán vortex street and I'm looking for the frequency of vortex shedding. https://en.wikipedia.org/wiki/K%C3%A1rm%C3%A1n_vortex_street
Can anyone help me solve this?
The line
freq1 = np.fft.fftfreq(len(C), d= (number[1] - number[0]))
generates the array
freq1 = [0, 1, ..., len(C)/2-1, -len(C)/2, ..., -1] / (d*len(C))
Because you pass the sample spacing d, this array is already scaled to frequency units (cycles per unit of time). An extra scaling
freq[i] = freq1[i]*alpha
where alpha is computed as
alpha = 1/Ts
with Ts your sampling period, is only needed when fftfreq is called with the default d = 1, in which case it returns frequencies in cycles per sample. Also note that freq1 lists every frequency bin of the transform, not only the frequencies present in the signal, so a long freq1 array does not by itself mean the data is noisy.
Note that if your data is sampled with varying time steps you will need to interpolate it onto an evenly spaced grid first, using numpy.interp (for example).
To estimate the main frequency, find the index where the magnitude of the FFT is largest and look up the frequency array at that index.
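The peak-picking step can be sketched on a synthetic signal as follows. This is an illustration under assumptions, not the asker's pipeline: `dominant_frequency` is a hypothetical helper, the 0.5 Hz test signal and noise level are made up, and `np.fft.rfft` is used in place of `np.fft.fft` because the input is real-valued, so only non-negative frequencies are returned.

```python
import numpy as np

def dominant_frequency(samples, dt):
    samples = np.asarray(samples, dtype=float)
    # Remove the mean so the zero-frequency bin does not dominate.
    centered = samples - samples.mean()
    spectrum = np.abs(np.fft.rfft(centered))
    # rfftfreq already scales by the sample spacing dt.
    freqs = np.fft.rfftfreq(len(centered), d=dt)
    # Skip bin 0 and pick the largest remaining magnitude.
    peak = 1 + int(np.argmax(spectrum[1:]))
    return freqs[peak]

# Synthetic probe: 121 samples, dt = 0.1, a 0.5 Hz oscillation plus noise.
dt = 0.1
t = np.arange(121) * dt
rng = np.random.default_rng(1)
signal = 2.0 + np.sin(2 * np.pi * 0.5 * t) + 0.3 * rng.standard_normal(t.size)
f_est = dominant_frequency(signal, dt)
```

With 121 samples at dt = 0.1 the bins are spaced 1/(121*0.1) apart, so the estimate can only be accurate to within roughly one bin; a longer record gives finer resolution.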

index error in my python program

This is a face recognition program using PCA. Everything works except for an index error that comes up at the end of the program.
When I run the code I get an index error at the fourth-to-last line of my program:
distances.append((dist, y[i]))
IndexError: list index out of range
Can anyone help with this? I am new to Python, so I am not very experienced at debugging.
Here is my code:
from sklearn.decomposition import RandomizedPCA
import numpy as np
import glob
import cv2
import math
import os.path
import string
#function to get ID from filename
def ID_from_filename(filename):
    part = string.split(filename, '/')
    return part[1].replace("s", "")
#function to convert image to right format
def prepare_image(filename):
    img_color = cv2.imread(filename)
    img_gray = cv2.cvtColor(img_color, cv2.cv.CV_RGB2GRAY)
    img_gray = cv2.equalizeHist(img_gray)
    return img_gray.flat
IMG_RES = 92 * 112 # img resolution
NUM_EIGENFACES = 10 # images per train person
NUM_TRAINIMAGES = 110 # total images in training set
#loading training set from folder train_faces
folders = glob.glob('train_faces/*')
# Create an array with flattened images X
# and an array with ID of the people on each image y
X = np.zeros([NUM_TRAINIMAGES, IMG_RES], dtype='int8')
y = []
# Populate training array with flattened images from subfolders of train_faces and names
c = 0
for x, folder in enumerate(folders):
    train_faces = glob.glob(folder + '/*')
    for i, face in enumerate(train_faces):
        X[c,:] = prepare_image(face)
        y.append(ID_from_filename(face))
        c = c + 1
# perform principal component analysis on the images
pca = RandomizedPCA(n_components=NUM_EIGENFACES, whiten=True).fit(X)
X_pca = pca.transform(X)
# load test faces (usually one), located in folder test_faces
test_faces = glob.glob('test_faces/*')
# Create an array with flattened images X
X = np.zeros([len(test_faces), IMG_RES], dtype='int8')
# Populate test array with flattened images from subfolders of train_faces
for i, face in enumerate(test_faces):
    X[i,:] = prepare_image(face)
# run through test images (usually one)
for j, ref_pca in enumerate(pca.transform(X)):
    distances = []
    # Calculate euclidian distance from test image to each of the known images and save distances
    for i, test_pca in enumerate(X_pca):
        dist = math.sqrt(sum([diff**2 for diff in (ref_pca - test_pca)]))
        distances.append((dist, y[i]))
    found_ID = min(distances)[1]
    print "Identified (result: "+ str(found_ID) +" - dist - " + str(min(distances)[0]) + ")"
Your i in the loop below goes up to the length of X_pca - 1:
for i, test_pca in enumerate(X_pca):
    dist = math.sqrt(sum([diff**2 for diff in (ref_pca - test_pca)]))
    distances.append((dist, y[i]))
However, your y is not necessarily built to have that length:
for x, folder in enumerate(folders):
    train_faces = glob.glob(folder + '/*')
    for i, face in enumerate(train_faces):
        X[c,:] = prepare_image(face)
        y.append(ID_from_filename(face))
So you are using an index i which is greater than the bounds of your list y.
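One way to rule out this kind of mismatch is to size the array from the same list that produces the labels, instead of from a separate constant such as NUM_TRAINIMAGES. The sketch below uses made-up data; `build_training_set` and `samples` are hypothetical names, and the real program would fill `samples` from the image files.

```python
import numpy as np

def build_training_set(samples, n_features):
    """samples is a list of (label, flat_pixel_values) pairs."""
    # One row per sample, so X and y always have matching lengths.
    X = np.zeros((len(samples), n_features))
    y = []
    for row, (label, pixels) in enumerate(samples):
        X[row, :] = pixels
        y.append(label)
    return X, y

# Three fake "images" of 6 pixels each, belonging to two people.
samples = [("1", np.ones(6)), ("1", np.full(6, 2.0)), ("2", np.full(6, 3.0))]
X, y = build_training_set(samples, n_features=6)
```

With the original layout, a quick check such as comparing `len(y)` against `X.shape[0]` right after the loading loop would show whether fewer images were found on disk than the array was allocated for.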

Find the maximum x,y value from a series of images

I have a stack of bitmap images (between 2000 and 4000) on which I'm doing a z-projection (maximum intensity projection). So from the stack, I need to get a 2D array of maximum values for each x,y position.
I have devised a simple script that splits the files into chunks and uses multiprocessing.Pool to calculate the maximum array for each chunk. These arrays are then compared to find the maximum for the stack.
It works, but it is slow. My system monitor shows that my CPUs are hardly working.
Can anyone give me some pointers on how I might speed things up a bit?
import Image
import os
import numpy as np
import multiprocessing
import sys
#Get the stack of images
files = []
for fn in os.listdir(sys.argv[1]):
    if fn.endswith('.bmp'):
        files.append(os.path.join(sys.argv[1], fn))
def processChunk(filelist):
    first = True
    max_ = None
    for img in filelist:
        im = Image.open(img)
        array = np.array(im)
        if first:
            max_ = array
            first = False
        max_ = np.maximum(array, max_)
    return max_
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=8)
    #Chop list into chunks
    file_chunks = []
    chunk_size = 100
    ranges = range(0, len(files), chunk_size)
    for chunk_idx in ranges:
        file_chunks.append(files[chunk_idx:chunk_idx+chunk_size])
    #find the maximum x,y vals in chunks of 100
    first = True
    maxi = None
    max_arrays = pool.map(processChunk, file_chunks )
    #Find the maximums from the maximums returned from each process
    for array in max_arrays:
        if first:
            maxi = array
            first = False
        maxi = np.maximum(array, maxi)
    img = Image.fromarray(maxi)
    img.save("max_intensity.tif")
Edit:
Did some small benchmarking with sample data and you're right. Also, turns out (reading your code more closely), most of my original post is wrong. You are essentially doing the same number of iterations (slightly more, but not 3x more). I also found out that
x = np.maximum(x, y)
is slightly faster than both
x[y > x] = y[y > x]
#or
ind = y > x
x[ind] = y[ind]
I would then alter your code only slightly. Something like:
import numpy as np
from multiprocessing import Pool
def process(chunk):
    max_ = np.zeros((4000, 4000))
    for im in chunk:
        im_array = np.array(Image.open(im))
        max_ = np.maximum(max_, im_array)
    return max_
if __name__ == "__main__":
    p = Pool(8)
    chunksize = 500 #4000/8 = 500, might have less overhead
    chunks = [files[i:i+chunksize]
              for i in range(0, len(files), chunksize)]
    # this returns an array of (len(files)/chunksize, 4000, 4000)
    max_arrays = np.array(p.map(process, chunks))
    maxi = np.amax(max_arrays, axis=0) #finds maximum along first axis
    img = Image.fromarray(maxi) #should be of shape (4000, 4000)
I think this is one of the fastest ways you can do this, although I have an idea for a tree or tournament style algorithm, possibly a recursive one too. Good job.
How big are the images? Small enough to load two images into memory at once? If so, then can you do something like:
maxi = np.zeros(image_shape) # something like (1024, 1024)
for im in files:
    im_array = np.array(Image.open(im))
    inds = im_array > maxi # find where image intensity > max intensity
    maxi[inds] = im_array[inds] # update the maximum value at each pixel
max_im = Image.fromarray(maxi)
max_im.save("max_intensity.tif")
After all iterations, the maxi array will contain the maximum intensity for each (x, y) coordinate. No need to break it into chunks. Also, there's only one for loop, not 3, so it will be faster and may not need multiprocessing.
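The running-maximum loop can be checked on synthetic data without any image files. This is a minimal sketch under assumptions: `max_projection` is a hypothetical helper, the frames are random arrays rather than bitmaps, and `np.maximum` with `out=` is used so the accumulator is updated in place instead of being reallocated for every frame.

```python
import numpy as np

def max_projection(frames):
    frames = iter(frames)
    # Start from a copy of the first frame so its dtype is preserved.
    result = np.array(next(frames), copy=True)
    for frame in frames:
        # Element-wise maximum, written back into the accumulator.
        np.maximum(result, frame, out=result)
    return result

# Synthetic stack: 10 frames of 8x8 8-bit pixels.
rng = np.random.default_rng(2)
stack = rng.integers(0, 256, size=(10, 8, 8), dtype=np.uint8)
projection = max_projection(stack)
```

In the real script each frame would come from `np.array(Image.open(path))`; only one frame plus the accumulator needs to be in memory at a time.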
