Random access in a saved-on-disk numpy array - python

I have one big numpy array A of shape (2_000_000, 2000) and dtype float64, which takes 32 GB.
(Or alternatively the same data split into 10 arrays of shape (200_000, 2000); it may be easier for serialization?)
How can we serialize it to disk such that we can have fast random read access to any part of the data?
More precisely, I need to be able to read ten thousand windows of shape (16, 2000) from A at random starting indexes i:
L = []
for _ in range(10_000):
    i = random.randint(0, 2_000_000 - 16)
    window = A[i:i+16, :]  # window of A of shape (16, 2000) starting at a random index i
    L.append(window)
WINS = np.concatenate(L)  # desired shape (10_000, 16, 2000) of float64, i.e. ~2.4 GB
Let's say I only have 8 GB of RAM available for this task; it's totally impossible to load the whole 32 GB of A into RAM.
How can we read such windows from a serialized-on-disk numpy array? (.h5 format or any other)
Note: the fact that the reads happen at randomized starting indexes is important.

This example shows how you can use an HDF5 file for the process you describe.
First, create a HDF5 file with a dataset of shape (2_000_000, 2000) and float64 values. I used variables for the dimensions so you can tinker with it.
import numpy as np
import h5py
import random

h5_a0, h5_a1 = 2_000_000, 2_000

with h5py.File('SO_68206763.h5', 'w') as h5f:
    dset = h5f.create_dataset('test', shape=(h5_a0, h5_a1), dtype='float64')
    incr = 1_000
    a0 = h5_a0 // incr
    # write the dataset in 1_000 increments of 2_000 rows each
    for i in range(incr):
        arr = np.random.random(a0*h5_a1).reshape(a0, h5_a1)
        dset[i*a0:i*a0+a0, :] = arr
    print(dset[-1, 0:10])  # quick dataset check of values in the last row
Next, open the file in read mode, read 10_000 random array slices of shape (16,2_000) and append to the list L. At the end, convert the list to the array WINS. Note, by default the array will have 2 axes -- you need to use .reshape() if you want 3 axes per your comment (reshape also shown).
with h5py.File('SO_68206763.h5', 'r') as h5f:
    dset = h5f['test']
    L = []
    ds0, ds1 = dset.shape[0], dset.shape[1]
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        window = dset[ir:ir+16, :]  # window from dset of shape (16, 2000) starting at a random index ir
        L.append(window)

    WINS = np.concatenate(L)  # shape (160_000, 2_000) of float64
    print(WINS.shape, WINS.dtype)
    WINS = np.concatenate(L).reshape(10_000, 16, ds1)  # reshaped to (10_000, 16, 2_000) of float64
    print(WINS.shape, WINS.dtype)
The procedure above is not memory efficient. You wind up with 2 copies of the randomly sliced data: in both the list L and the array WINS. If memory is limited, this could be a problem. To avoid the intermediate copy, read the random slice of data directly into an array. Doing this simplifies the code and reduces the memory footprint. That method is shown below (WINS2 is a 2-axis array and WINS3 is a 3-axis array).
with h5py.File('SO_68206763.h5', 'r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape[0], dset.shape[1]
    WINS2 = np.empty((10_000*16, ds1))
    WINS3 = np.empty((10_000, 16, ds1))
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        WINS2[i*16:(i+1)*16, :] = dset[ir:ir+16, :]
        WINS3[i, :, :] = dset[ir:ir+16, :]
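If you want to avoid even the per-slice temporary array that slicing a dataset creates, h5py's Dataset.read_direct can write each window straight into the preallocated result. A minimal sketch, assuming the same file and dataset names as above:
import numpy as np
import h5py
import random

with h5py.File('SO_68206763.h5', 'r') as h5f:
    dset = h5f['test']
    ds0, ds1 = dset.shape
    WINS3 = np.empty((10_000, 16, ds1), dtype=dset.dtype)
    for i in range(10_000):
        ir = random.randint(0, ds0 - 16)
        # read the 16-row window from the file directly into row i of WINS3
        dset.read_direct(WINS3[i], source_sel=np.s_[ir:ir+16, :])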

An alternative solution to h5py datasets that I tried and that works is using np.memmap, as suggested in @RyanPepper's comment.
Write the data as binary:
import numpy as np

with open('a.bin', 'wb') as A:
    for f in range(1000):
        x = np.random.randn(10*2000).astype('float32').reshape(10, 2000)
        A.write(x.tobytes())
    A.flush()
Open later as memmap
A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
print(A.shape) # (10000, 2000)
print(A[1234:1234+16, :]) # window
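To mirror the original task, here is a minimal sketch (assuming the same 'a.bin' layout as above) that gathers random 16-row windows from the memmap into one preallocated array; only the rows actually touched are read from disk:
import numpy as np
import random

A = np.memmap('a.bin', dtype='float32', mode='r').reshape((-1, 2000))
n_rows, n_cols = A.shape
WINS = np.empty((10_000, 16, n_cols), dtype=A.dtype)
for k in range(10_000):
    i = random.randint(0, n_rows - 16)
    WINS[k] = A[i:i+16, :]  # copies just this window out of the memmap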

Related

Extract patch and reconstruct image

I am trying my hand at a segmentation task. The images are 3D volumes; since I cannot process them all at once because of GPU memory constraints, I am extracting patches of each image and performing operations on them.
To extract the patches I use:
import os

import numpy as np
import tqdm
from numpy.lib import stride_tricks

def cutup(data, blck, strd):
    sh = np.array(data.shape)
    blck = np.asanyarray(blck)
    strd = np.asanyarray(strd)
    nbl = (sh - blck) // strd + 1
    strides = np.r_[data.strides * strd, data.strides]
    dims = np.r_[nbl, blck]
    data6 = stride_tricks.as_strided(data, strides=strides, shape=dims)
    return data6.reshape(-1, *blck)

def make_patches(image_folder, mask_folder):
    '''
    Given nii.gz image and mask files, create numpy patch files.
    '''
    for image, mask in tqdm.tqdm(zip(os.listdir(image_folder), os.listdir(mask_folder))):
        mask_ = mask
        mask = mask.split('_')
        image = mask[0]
        image_name = mask[0]
        mask_name = mask[0]
        image, mask = read_image_and_seg(os.path.join(image_folder, image), os.path.join(mask_folder, mask_))
        if image.shape[1] > 600:
            image = image[:, :600, :]
        desired_size_w = 896
        desired_size_h = 600
        desired_size_z = 600
        delta_w = desired_size_w - image.shape[0]
        delta_h = desired_size_h - image.shape[1]
        delta_z = desired_size_z - image.shape[2]
        padded_image = np.pad(image, ((0, delta_w), (0, delta_h), (0, delta_z)), 'constant')
        padded_mask = np.pad(mask, ((0, delta_w), (0, delta_h), (0, delta_z)), 'constant')
        y = cutup(padded_image, (128, 128, 128), (128, 128, 128))  # extract more patches by changing the stride size
        y_ = cutup(padded_mask, (128, 128, 128), (128, 128, 128))
        print(image_name)
        for index, (im, label) in enumerate(zip(y, y_)):
            if len(np.unique(im)) == 1:
                continue
            else:
                if not os.path.exists(os.path.join('../data/patches/images/', image_name.split('.')[0] + str(index))):
                    np.save(os.path.join('../data/patches/images/', image_name.split('.')[0] + str(index)), im)
                    np.save(os.path.join('../data/patches/masks/', image_name.split('.')[0] + str(index)), label)
Now this extracts non-overlapping patches and gives me the patches as numpy arrays. As an aside, I am padding the image with zeros to shape (896, 640, 640) so I can extract all patches.
The problem is I don't know if the above code works! To test it, I wanted to extract the patches and then take those patches and reconstruct the image, but I am not exactly sure how to go about this.
For now this is what I have:
def reconstruct_image(folder_path_of_npy_files):
    slice_shape = len(os.listdir(folder_path_of_npy_files))
    recon_image = np.array([])
    for index, file in enumerate(os.listdir(folder_path_of_npy_files)):
        read_image = np.load(os.path.join(folder_path_of_npy_files, file))
        recon_image = np.append(recon_image, read_image)
    return recon_image
but this does not work, as it makes an array of shape (x, 128, 128, 128) and keeps filling up the 0th dimension.
So my question is: how do I reconstruct the image? Or is there just a plain better way to extract and reconstruct patches?
Thanks in advance.
If things are reasonably simple (not a sliding window) then you could possibly use skimage.util.shape.view_as_blocks. For example:
import numpy as np
import skimage.util

# Create example data
data = np.random.random((200, 200, 200))
blocks = skimage.util.shape.view_as_blocks(data, (10, 10, 10))  # shape (20, 20, 20, 10, 10, 10)

# Do the processing on the blocks here.
processed_blocks = blocks

# Reassemble: move each block axis back next to its grid axis before reshaping
new_data = processed_blocks.transpose(0, 3, 1, 4, 2, 5).reshape(200, 200, 200)
But, if you are having memory constraint issues this may not be the best way to go, as you are going to be duplicating the original data several times (data, blocks, new_data, etc.), and you might have to do it a little smarter than my example here.
If you are having memory issues, the other thing you can do, very carefully, is to change the underlying data type of your data. For example, when I was working with MRI data, most original data was integer-ish but Python would represent it as float64. If you can accept some rounding of the data then you could do something like:
import numpy as np
import skimage.util

# Create example data as 2-byte floats
data = 200*np.random.random((200, 200, 200)).astype(np.float16)
blocks = skimage.util.shape.view_as_blocks(data, (10, 10, 10))

# Do the processing on the blocks here.
new_data = blocks.transpose(0, 3, 1, 4, 2, 5).reshape(200, 200, 200)
This version uses:
In [2]: whos
Variable Type Data/Info
-------------------------------
blocks ndarray 20x20x20x10x10x10: 8000000 elems, type `float16`, 16000000 bytes (15.2587890625 Mb)
data ndarray 200x200x200: 8000000 elems, type `float16`, 16000000 bytes (15.2587890625 Mb)
new_data ndarray 200x200x200: 8000000 elems, type `float16`, 16000000 bytes (15.2587890625 Mb)
vs the first version:
In [2]: whos
Variable Type Data/Info
-------------------------------
blocks ndarray 20x20x20x10x10x10: 8000000 elems, type `float64`, 64000000 bytes (61.03515625 Mb)
data ndarray 200x200x200: 8000000 elems, type `float64`, 64000000 bytes (61.03515625 Mb)
new_data ndarray 200x200x200: 8000000 elems, type `float64`, 64000000 bytes (61.03515625 Mb)
So, using np.float16 saves you about a factor of 4 in RAM.
But making this type of change puts assumptions on the data and algorithm (possible rounding issues, etc.).
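For reference, a quick round-trip check of the transpose-then-reshape reconstruction used above (a minimal sketch on random data):
import numpy as np
from skimage.util import view_as_blocks

data = np.random.random((200, 200, 200))
blocks = view_as_blocks(data, (10, 10, 10))                  # shape (20, 20, 20, 10, 10, 10)
recon = blocks.transpose(0, 3, 1, 4, 2, 5).reshape(200, 200, 200)
assert np.array_equal(recon, data)                           # exact round trip, nothing moved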

Convert large array of floats to colors and write to binary efficiently

I am working with large NetCDF4 files (about 1 GB and up but less than my 8 GB memory for now). 99% of the time the data type will be a float32. I want to map these values to an array of RGB colors which I will then write to a binary file to be read by another application for viewing. Because I only need 1 byte for each R, G, and B, I want to have an array of np.uint8 to represent this. In the end the array will take up 25% less space than the floats. However, as the original data is big, I don't want to keep both the original data and the color data in memory at the same time. For now I provide a color for the low value and the color for the high value. The problem is that in my program for a short period time, the color data consists of floats instead of np.uint8, which leads to taking up 3 times as much memory as the original data. Is there a way to skip the float conversion or at least only have one float in memory so that I don't take up this much memory? I have provided relevant code below:
from netCDF4 import Dataset
import numpy as np
import dask.array as da
import gc
import time
import sys
# Read file path
file_path = sys.argv[1]
# Default colors is blue for low and red for high
lowColor = np.array([0, 0, 255], dtype=int)
highColor = np.array([255, 0, 0], dtype=int)
data = Dataset(file_path)
allVariables = data.variables
# Sometimes we have time_bnds, lat_bnds, etc.
# Keep anything that doesn't have 'bnds'
varNames = list(filter(lambda x: 'bnds' not in x, list(allVariables.keys())))
# Remove the dimensions
varNames = list(filter(lambda x: x not in data.dimensions, varNames))
var = varNames[0]
flattened = allVariables[var][:].flatten()
origShape = allVariables[var].shape
if isinstance(flattened, np.ma.core.MaskedArray):
    flattened = flattened.filled(np.nan)
# Find the minimum value and the range of values.
# Using these two we can make a percentage of how
# far 'up' each value and simply convert colors
# based on that. Because there's a chance of the data
# having NaNs, I can't use ptp().
lowVal = np.nanmin(flattened)
ptp = np.nanmax(flattened) - lowVal
# Subtract the min from each value and divide by ptp
# and add a dimension for dot product later.
percents = ((flattened - lowVal) / ptp)[np.newaxis, :]
# Remove flattened from memory as it is not needed anymore
flattened = None
gc.collect()
# Calculate the color difference
diff = (highColor - lowColor)[np.newaxis, :].T
# Do the dot product to create a list of colors
# Transpose so each color is each row. Also
# add the low color
colors = lowColor + np.dot(diff, percents).T # All floats here
# Round each value and cast to uint8 and finally reshape to
# the original data
colors = np.round(colors).astype(np.uint8)
colors = colors.reshape(origShape + (3,))
colors.tofile('colors_' + allVariables[var].name + '.bin')
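One way to keep the peak memory down is to process the flattened data in fixed-size chunks, so only a chunk-sized float temporary exists at any time while the uint8 output is filled in place. This is a minimal sketch of that idea, not the original code; the helper name and chunk size are arbitrary:
import numpy as np

def to_colors_chunked(values, low_color, high_color, chunk=1_000_000):
    """Map a 1-D float array to uint8 RGB rows, converting one chunk at a time."""
    low = np.nanmin(values)
    ptp = np.nanmax(values) - low
    diff = (high_color - low_color).astype(np.float64)
    out = np.empty((values.size, 3), dtype=np.uint8)
    for start in range(0, values.size, chunk):
        part = values[start:start + chunk]
        percents = (part - low) / ptp                    # small float temporary
        rgb = low_color + percents[:, None] * diff       # (chunk, 3) floats
        out[start:start + chunk] = np.round(rgb).astype(np.uint8)
    return out

# e.g. colors = to_colors_chunked(flattened, lowColor, highColor).reshape(origShape + (3,))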

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays, one for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test for each coordinate and not just for each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then I would repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46, 145, 192). I've tested it out and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so it only keeps the result from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test

data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)

start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro with a 2.7 GHz Intel Core i7 and 16 GB of RAM, so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
Thanks to the answers provided and some work, I was able to put together a solution that I'll provide here for anyone else who needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on the code Grr suggested.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out, (27840, 46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of coordinates that will be on my map (i.e. it's just 145*192) and 46 is the number of years the data spans.
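A shorter route to the same tabular layout, assuming yrmax has shape (46, 145, 192) as above, is a single transpose and reshape instead of the nested loops:
# equivalent to the loop + reshape above: put the year axis last, then flatten the lat/lon grid
out2 = yrmax.transpose(1, 2, 0).reshape(145 * 192, 46)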
I then used the following loop, modified from Grr's code, to find Kendall's tau and its corresponding p-value at each latitude and longitude over the 46-year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x, out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown: newdata = np.reshape(y, (145, 192, 2)), so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at a time:
SIZE = (145, 192)

year_matrices = load_years()  # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)

for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)

print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's zip might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4

coord_arrays = zip(*year_arrays)  # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')

flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap (this is Python 2 style; in Python 3, zip and map are already lazy) then the copies needn't occur: itertools wraps the original arrays and keeps track of where it should be fetching values from internally.
There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
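A small illustration of that last point (a sketch, not part of the original answer): transposing a numpy array returns a view onto the same memory, so reordering axes to get per-coordinate time series costs nothing until you actually copy.
import numpy as np

yearcube = np.zeros((46, 145, 192))      # (year, lat, lon)
per_coord = yearcube.transpose(1, 2, 0)  # (lat, lon, year); no data is copied
assert per_coord.base is yearcube        # it is a view onto the original buffer
series = per_coord[10, 20]               # the 46-year series at one grid cell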
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for tied values. I modified the code from this answer to compute the number of tied values for each record in a vectorized manner.
Below are the two functions:
import copy
import numpy as np
from scipy.stats import norm

def countTies(x):
    '''Count number of ties in rows of a 2D matrix

    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with the same shape as <x>. In each
            row, the tie counts are inserted at (not really) arbitrary
            locations. The locations of the tie counts are not important,
            since they will subsequently be put into the formula
            sum(t*(t-1)*(2t+5)).

    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")

    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')

    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)

    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))

    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1

    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()

    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns

    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to the data in each column of <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to the data in each row of <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")

    # always put records in rows and do the M-K test on each row
    if axis == 0:
        data = data.T

    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))

    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)

    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8  # avoid dividing by 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)

    if tails == 2:
        p = p*2

    return z, p
I assume your data comes in the layout (time, latitude, longitude), and you are examining the temporal trend of each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical Methods in the Atmospheric Sciences, and I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,
              2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,
              4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,
              1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])

# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)

# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)

import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
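For the original question's data layout, a minimal usage sketch (assuming yrmax has shape (46, 145, 192) as in the edits above, and MannKendallTrend2D as defined here):
# flatten the grid, run the vectorized test on every cell's 46-year series, then map back
table = yrmax.transpose(1, 2, 0).reshape(-1, 46)   # shape (27840, 46)
z, p = MannKendallTrend2D(table, tails=2, axis=1)
z_map = z.reshape(145, 192)
p_map = p.reshape(145, 192)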

fast downsampling of huge matrix using python (numpy memmap, pytables or other?)

As part of my data processing I produce huge non-sparse matrices on the order of 100000x100000 cells, which I want to downsample by a factor of 10 to reduce the amount of data. In this case I want to average over blocks of 10x10 pixels, to reduce the size of my matrix from 100000x100000 to 10000x10000.
What is the fastest way to do so using Python? It does not matter to me if I need to save my original data in a new data format, because I have to downsample the same dataset multiple times.
Currently I am using numpy.memmap:
import numpy as np

data_1 = 'data_1.dat'
data_2 = 'data_2.dat'
lines = 100000
pixels = 100000
window = 10

new_lines = lines // window
new_pixels = pixels // window

dat_1 = np.memmap(data_1, dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap(data_2, dtype='float32', mode='r', shape=(lines, pixels))

dat_in = dat_1 * dat_2
dat_out = dat_in.reshape([new_lines, window, new_pixels, window]).mean(3).mean(1)
But with large files this method becomes very slow. Likely this has something to do with the binary data in these files, which is ordered by line. Therefore, I think a data format which stores my data in blocks instead of lines would be faster, but I am not sure what the performance gain would be and whether there are Python packages that support this.
I have also thought about downsampling the data before creating such a huge matrix (not shown here), but my input data is fractured and irregular, so that would become very complex.
Based on this answer, I think this might be a relatively fast method, depending on how much overhead reshape gives you with memmap.
def downSample(a, window):
    i, j = a.shape
    ir = np.arange(0, i, window)
    jr = np.arange(0, j, window)
    n = 1./(window**2)
    return n * np.add.reduceat(np.add.reduceat(a, ir), jr, axis=1)
Hard to test speed without your dataset.
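As a quick sanity check on small random data (a sketch; it assumes the downSample function defined above and compares it against a plain reshape-based block mean):
import numpy as np

a = np.random.rand(100, 100).astype('float32')
ref = a.reshape(10, 10, 10, 10).mean(axis=(1, 3))  # plain block mean for comparison
out = downSample(a, 10)
print(np.allclose(out, ref))                       # True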
This avoids an intermediate copy, as the reshape keeps the dimensions contiguous:
dat_in.reshape((lines//window, window, pixels//window, window)).mean(axis=(1,3))
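If holding the full dat_in = dat_1 * dat_2 product in memory is itself the bottleneck, one option (a sketch, not from the original answers) is to stream over the memmaps in bands of whole block-rows, so only a small float array is live at any time:
import numpy as np

lines, pixels, window = 100000, 100000, 10
band = 100 * window  # rows per band; assumed to divide `lines` and be a multiple of `window`

dat_1 = np.memmap('data_1.dat', dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap('data_2.dat', dtype='float32', mode='r', shape=(lines, pixels))
dat_out = np.empty((lines // window, pixels // window), dtype='float32')

for r in range(0, lines, band):
    block = dat_1[r:r+band] * dat_2[r:r+band]  # only (band, pixels) floats in RAM
    block = block.reshape(band // window, window, pixels // window, window)
    dat_out[r // window:(r + band) // window] = block.mean(axis=(1, 3))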

Numerical operations on arrays while reading from CSV file

I am trying to do a few numerical operations on a few arrays while reading some values from CSV files.
I have the coordinates of a receiver, which is fixed, and I read the coordinates of the heliostats, which track the Sun, from a CSV file.
The coordinates of the receiver:
# co-ordinates of Receiver
XT = 0 # X co-ordinate of Receiver
YT = 0 # Y co-ordinate of Receiver
ZT = 207.724 # Z co-ordinate of Receiver, this is the height of tower
A = np.array(([XT],[YT],[ZT]))
print(A," are the co-ordinates of the target i.e. the receiver")
The coordinates of the ten heliostats:
I read this data from a CSV file with the following contents:
#X,Y,Z
#-1269.56,-1359.2,5.7
#1521.28,-68.0507,5.7
#-13.6163,1220.79,5.7
#-1388.76,547.708,5.7
#1551.75,-82.2342,5.7
#405.92,-1853.83,5.7
#1473.43,-881.703,5.7
#1291.73,478.988,5.7
#539.027,1095.43,5.7
#-1648.13,-73.7251,5.7
I read the coordinates from the CSV as follows:
import csv

# Reading data from csv file
with open('Heliostat Field Layout Large heliostat.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    X = []
    Y = []
    Z = []
    for row in readCSV:
        X_coordinates = row[0]
        Y_coordinates = row[1]
        Z_coordinates = row[2]
        X.append(X_coordinates)
        Y.append(Y_coordinates)
        Z.append(Z_coordinates)

Xcoordinate = [float(X[c]) for c in range(1, len(X))]
Ycoordinate = [float(Y[c]) for c in range(1, len(Y))]
Zcoordinate = [float(Z[c]) for c in range(1, len(Z))]
Now, when I try to print the coordinates of the ten heliostats, I get three big arrays with all the Xcoordinate, Ycoordinate and Zcoordinate values grouped together instead of ten different outputs:
[[[-1269.56 1521.28 -13.6163 -1388.76 1551.75 405.92 1473.43
1291.73 539.027 -1648.13 ]]
[[-1359.2 -68.0507 1220.79 547.708 -82.2342 -1853.83
-881.703 478.988 1095.43 -73.7251]]
[[ 5.7 5.7 5.7 5.7 5.7 5.7 5.7
5.7 5.7 5.7 ]]] are the co-ordinates of the heliostats
I used:
B = np.array(([Xcoordinate],[Ycoordinate],[Zcoordinate]))
print(B," are the co-ordinates of the heliostats")
What is the mistake?
Further, I would like to have an array with B - A,
for which I use:
#T1 = matrix(A)- matrix(B)
#print(T1," is the target vector for heliostat 1, T1")
How should I do a numerical operation on arrays A and B? I tried a matrix operation here. Is that wrong?
Your code is correct.
The following output is just the way numpy arrays are displayed:
[[-1359.2 -68.0507 1220.79 547.708 -82.2342 -1853.83
-881.703 478.988 1095.43 -73.7251]]
Despite the illusion that the values are stuck together, they are perfectly distinct in the array. You can access a single value with:
print(B[1, 0, 0])  # print Y[0]
The subtraction of arrays A and B that you want to perform works directly with broadcasting (note that np.matrix only accepts 2-D input, and your B has 3 axes):
T1 = A - B
print(T1, " is the target vector for heliostat 1, T1")
May I make two suggestions?
You can read a numpy array written as a matrix in a text file (which is the case here) with numpy's loadtxt function:
your_file = 'Heliostat Field Layout Large heliostat.csv'
B = np.loadtxt(your_file, delimiter=',', skiprows=1)
The result will be a (10, 3) numpy array.
You can perform broadcasting operations directly on numpy arrays (so you don't need to convert them to matrices). You just need to be careful with the dimensions.
In your original script you just need to write:
T1 = A - B
If you get array B with loadtxt as suggested, you will get a (10, 3) array, while A is a (3, 1) array. B must first be transposed into a (3, 10) array (a plain reshape would scramble the coordinates):
B = B.T
T1 = A - B
EDIT: compute the norm of each 3D vector of T1
norm_T1 = np.sqrt( np.sum( np.array(T1)**2, axis=0 ) )
Note that if T1 were an np.matrix, T1**2 would be a matrix product. Converting to a plain numpy array first guarantees element-wise squaring, so sqrt( v[0]**2 + v[1]**2 + v[2]**2 ) is computed for each vector v of T1.
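Putting the two suggestions together, a minimal end-to-end sketch (assuming the CSV has a one-line header followed by ten rows of X,Y,Z values):
import numpy as np

A = np.array([[0.0], [0.0], [207.724]])            # receiver, shape (3, 1)
B = np.loadtxt('Heliostat Field Layout Large heliostat.csv',
               delimiter=',', skiprows=1).T        # heliostats, shape (3, 10)

T1 = A - B                                         # target vectors, shape (3, 10)
norm_T1 = np.linalg.norm(T1, axis=0)               # length of each target vector, shape (10,)
print(norm_T1)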
