Xarray: Loading several CSV files into a dataset

Xarray: Loading several CSV files into a dataset - python

I have several comma-separated data files that I want to load into an xarray dataset. Each row in each file represents a different spatial value of a field in a fixed grid, and every file represents a different point in time. The grid spacing is fixed and unchanging in time. The spacing of the grid is not uniform. The ultimate goal is to compute max_{x, y} { std_t[ value(x, y, t) * sqrt(y **2 + x ** 2)] }, where sqrt is the square root, std_t is standard deviation with respect to time and max_{x, y} is the maximum across all space.
I am having trouble loading the data. It is not clear to me how one is supposed to load several CSV files into an xarray dataset. There is an open_mfdataset function, which is designed for loading several data files into a dataset, but seems to expect hdf5 or netcdf files.
It seems like there is no way to load regular CSV files into an xarray dataset, and that preprocessing the data is necessary. In my example, I decided to preprocess the csv files to hdf5 files beforehand, to make use of the h5netcdf engine. This has created what appears to be an hdf5-specific problem for me.
below is my best attempt at loading the data so far. Unfortunately, it results in an empty xarray dataset. I tried several options in the open_mfdataset function, and the following code is only one realization of several attempts at using the function.
How can I load these csv files into a single xarray dataset, to set myself up to find the maximum across space of the standard deviation in time of the value of interest?
import xarray as xr
import numpy as np
import pandas as pd
'''
Create example files
- Each file contains a spatial-dependent value, f(x, y)
- Each file represents a different point in time, f(x, y, t)
'''
for ii in range(7):
# create csv file
fl = open('exampleFile%i.dat' % ii, 'w')
fl.write('time x1 x2 value\n')
for xx in range(10):
for yy in range(10):
fl.write('%i %i %i %i\n' %
(ii, xx, yy, (xx - yy) * np.exp(ii)))
fl.close()
# convert csv to hdf5
dat = pd.read_csv('exampleFile%i.dat' % ii)
dat.to_hdf('exampleFile%i.hdf5' % ii, 'data', mode='w')
'''
Read all files into xarray dataframe
(the ultimate goal is to find the
maximum across time of
the standard deviation across space
of the "value" column)
'''
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')
... When I run the code, the result variable does not appear to contain the desired data:
In: result
Out:
<xarray.Dataset>
Dimensions: ()
Data variables:
*empty*
Attributes:
PYTABLES_FORMAT_VERSION: 2.1
TITLE: Empty(dtype=dtype('S1'))
VERSION: 1.0
Edit
An answer was posted that assumes a uniformly spaced spatial grid. Here is a slightly modified example that does not assume an evenly-spaced grid of spatial points.
The example also assumes three spatial dimensions. That is more true to my real problem, and I realized that might be an important detail in this simple example.
import xarray as xr
import numpy as np
import pandas as pd
'''
Create example files
- Each file contains a spatial-dependent value, f(x, y)
- Each file represents a different point in time, f(x, y, t)
'''
for ii in range(7):
# create csv file
fl = open('exampleFile%i.dat' % ii, 'w')
fl.write('time x y z value\n')
for xx in range(10):
for yy in range(int(10 + xx // 2)):
for zz in range(int(10 + xx //3 + yy // 3)):
fl.write('%i %f %f %f %f\n' %
(ii, xx * np.exp(- 1 * yy * zz) , yy * np.exp(xx - zz), zz * np.exp(xx * yy), (xx - yy) * np.exp(ii)))
fl.close()
# convert csv to hdf5
dat = pd.read_csv('exampleFile%i.dat' % ii)
dat.to_hdf('exampleFile%i.hdf5' % ii, 'data', mode='w')
'''
Read all files into xarray dataframe
(the ultimate goal is to find the
maximum across time of
the standard deviation across space
of the "value" column)
'''
result = xr.open_mfdataset('exampleFile*.hdf5', engine='h5netcdf', combine='nested')

My approach would be to create a parsing function that converts the CSVs into xarray.Datasets.
This way you can use xarray.concat to combine them to a final dataset, on which you can perform your computations.
The following works with your example data:
from glob import glob
def csv2xr(csv, sep=" "):
df = pd.read_csv(csv, sep)
x = df.x1.unique()
y = df.x2.unique()
pix = df.value.values.reshape(1, x.size, y.size)
ds = xr.Dataset({
"value": xr.DataArray(
pix,
dims=['time', 'x', 'y'],
coords={"time": df.time.unique(), "x": x, "y": y})
})
return ds
csvs = glob("*dat")
ds_full = xr.concat([csv2xr(x) for x in csvs], dim="time")
print(ds_full)
#<xarray.Dataset>
# Dimensions: (time: 7, x: 10, y: 10)
# Coordinates:
# * time (time) int64 4 3 2 0 1 6 5
# * x (x) int64 0 1 2 3 4 5 6 7 8 9
# * y (y) int64 0 1 2 3 4 5 6 7 8 9
# Data variables:
# value (time, x, y) int64 0 -54 -109 -163 -218 -272 ... 593 445 296 148 0
Then to get the max of the std over time:
ds_full.std("time").max()

I hope I understood your problem. See if this works for you.
When defining the key arguments for read_csv, note that is is better using delim_whitespace=True instead of sep=" ". This will avoind considering double columns if somewhere you have double spaces.
I am passing to read_csv that time,x,y and z are all coordinates and I am converting them to xarray. It will automatically structure your unstructured data and fill the holes with NaN. Then I am concatenating all xarray objects into a single object by time.
from glob import glob
fnames = glob('*.dat')
fnames.sort()
kw = dict(delim_whitespace=True,index_col=['time','x','y','z'])
ds = xr.concat([pd.read_csv(fname,**kw).to_xarray() for fname in fnames],'time')
The final result is an xarray object like this:
Now you can do everything with this object.
ds.max(['x','y','z']).std('time') will return the standard deviation in time of the spatial maximum value for all variables (in this case it is only value column). Beware that sometimes you may have to pass skipna=True to avoid having NaN outputs from your analysis.
Please, let me know it that solves your problem and I would be glad adapting it if it does not tackle some specific issue your are having with your data.

Related

how to restructure data from h5py written in python

I am reading functions from an existing file using h5py library.
readFile = h5py.File('File',r)
using readFile.keys() I obtained the list of the functions stored in 'File'. One of these functions is the function phi. To print the function phi, I did
phi = numpy.array(readFile['phi'])[:,0,:,:]
in [:,0,:,:] we find the way how the data is stored [blocks, z, y, x]. z= 0 because it is a 2D case. x is divided in 2 blocks, and y is divided to 2 blocks. each x block is divided to nxb (x1, x2, ....,x20), and each y block is divided to nyb. (nxb and nyb can also be obtained directly from the file using h5py as they are also stored in the file. The domain of the data is also stored in the file and it is called ['bounding box'])
Then , coding the grid will be:
nxb = numpy.array(readFile['integer scalars'])[0][1]
nyb = numpy.array(readFile['integer scalars'])[1][1]
X = numpy.zeros([block, nxb, nyb])
Y = numpy.zeros([block, nxb, nyb])
for block in range(block):
x_min, x_max = numpy.array(readFile['bounding box'])[block,0,:]
y_min, y_max = numpy.array(readFile['bounding box'])[block,1,:]
X[block,:,:], Y[block,:,:] = numpy.meshgrid(numpy.linspace(x_min,x_max,nxb),
numpy.linspace(y_min,y_max,nyb))
My question, is that I am trying to restructure the data (see the figure). I want to bring the data of the block 2 up to the data of the block 1 and not next to him. Which means that I need to create new coordinates I' and J' related to the old coordinates I , and J. I tried this but it is not working:
for i in range(X):
for j in range(Y):
i' = i -len(X[0:1,:,:]
j' = j + len(Y[0:1,:,:]
phi(i',j') = phi

When working with HDF5 data, it's important to understand your data schema before you start writing code. Here are my initial observations and suggestions.
Your question is a little hard to follow. (For example, you are using the term "functions" to describe HDF5 datasets.) HDF5 organizes data in datasets and groups. Your data of interest is in 2 datasets: 'phi' and 'integer scalars'.
You can simplify code to access the datasets as a Numpy arrays using the following:
with h5py.File('File','r') as readFile:
# to get the axis dimensions for 'phi':
print(f"Shape of Dataset phi: {readFile['phi'].shape}")
phi_ds = readFile['phi'] # to get a dataset object
phi_arr = readFile['phi'][()] # to read dataset as a numpy array
# to get the axis dimensions for 'integer scalars'
nxb, nyb = readFile['integer scalars'].shape
I don't understand what you mean by "blocks". Are you referering to the axis simensions? Also, why you are using meshgrid? If you simply want to change dimensions, use Numpy's .reshape() method to change the axis dimensions of the Numpy array.
Here is a simple example that creates a 2x2 dataset, then reads it into a new array and reshapes it to 1x4. I think this is what you want to do. Change the values of a0 and a1 if you want to increase the size. The reshape operation will read the shape from the first array and reshape the new array to (N,1), where N is your nxb*nyb value.
with h5py.File('SO_72340647.h5','w') as h5f:
a0, a1 = 2,2
arr = np.arange(a0*a1).reshape(a0,a1)
h5f.create_dataset('ds_2x2',data=arr)
with h5py.File('SO_72340647.h5','r') as h5f:
print(f"Shape of Dataset ds_2x2: {h5f['ds_2x2'].shape}")
ds_arr = h5f['ds_2x2'][()]
print(ds_arr)
ds0, ds1 = ds_arr.shape
new_arr = ds_arr.reshape(ds0*ds1,1)
print(f"Shape of new (reshaped) array: {new_arr.shape}")
print(new_arr)
Note: h5py dataset objects "behave like" Numpy arrays. So, you frequently don't have to read into an array to use the data.

Python: plotting across time with a single column file

I must write a function that allows me to find the local max and min from a series of values.
Data for function is x, y of each "peak".
Output are 4 vectors that contain x, y max and min "peaks".
To find max peaks, I must "stand" on each data point and check it is mare or less than neighbors on both sides in order to decide if it is a peak (save as max/min peak).
Points on both ends only have 1 neighbor, do not consider those for this analysis.
Then write a program to read a data file and invoke the function to calculate the peaks. The program must generate a graph showing the entered data with the calculated peaks.
1st file is an Array of float64 of (2001,) size. All data is in column 0. This file represents the amplitude of a signal in time, frequency of sampling is 200Hz. Asume initial time is 0.
Graph should look like this
Program must also generate an .xls file that shows 2 tables; 1 with min peaks, and another with max peaks. Each table must be titled and consist of 2 column, one with the time at which peaks occur, and the other with the amplitude of each peak.
No Pandas allowed.
first file is a .txt file, and is single column, 2001 rows total
0
0.0188425
0.0376428
0.0563589
0.0749497
0.0933749
0.111596
0.129575
0.147277
0.164669
0.18172
...
Current attempt:
import numpy as np
import matplotlib.pyplot as plt
filename = 'location/file_name.txt'
T = np.loadtxt(filename,comments='#',delimiter='\n')
x = T[::1] # all the files of column 0 are x vales
a = np.empty(x, dtype=array)
y = np.linspace[::1/200]
X, Y = np.meshgrid(x,y)

This does what you ask. I had to generate random data, since you didn't share yours. You can surely build your spreadsheet from the minima and maxima values.
import numpy as np
import matplotlib.pyplot as plt
#filename = 'location/file_name.txt'
#T = np.loadtxt(filename,comments='#',delimiter='\n')
#
#y = T[::1] # all the files of column 0 are x vales
y = np.random.random(200) * 2.0
minima = []
maxima = []
for i in range(0,y.shape[0]-1):
if y[i-1] < y[i] and y[i+1] < y[i]:
maxima.append( (i/200, y[i]) )
if y[i-1] > y[i] and y[i+1] > y[i]:
minima.append( (i/200, y[i]) )
minima = np.array(minima)
maxima = np.array(maxima)
print(minima)
print(maxima)
x = np.linspace(0, 1, 200 )
plt.plot( x, y )
plt.scatter( maxima[:,0], maxima[:,1] )
plt.show()

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test on for each coordinate and not just each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then to repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
for j in range(192):
out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape: (46L, 145L, 192L) I've tested it out and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so it only takes the results from the last coordinate in stead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.

I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test
data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
for j in range(192):
out[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.

Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
`x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(146):
for j in range(193):
out1 = yrmax[:,i,j]
out = np.append(out, out1, axis=0) `
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau 27840 is the total number of values I have at every coordinate that will be on my map (i.e. it's just 145*192) and the 46 is the number of years the data spans.
I then used the following loop I modified from Grr's code to find Kendall-tau and it's respective p-value at each latitude and longitude over the 46 year period.
`x = range(46)
y = np.zeros((0))
for j in range(27840):
b = sc.stats.kendalltau(x,out2[j,:])
y = np.append(y, b, axis=0)`
Finally, I reshaped the data one for time as shown:newdata = np.reshape(y,(145,192,2)) so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!

Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145,192)
year_matrices = load_years() # list of one 145x192 arrays per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
for y in range(SIZE[1]):
coord_trend = map(lambda d: d[x][y], year_matrices)
result_matrix[x][y] = analyze_trend(coord_trend)
print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might works with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord2_val'],
['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord2_val'],
['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord2_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = zip(*year_arrays) # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val', 'y3_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val', 'y3_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val', 'y3_coord2_val')
assert coord_arrays[3] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val', 'y3_coord2_val')
flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.map then the copies needn't occur — itertools wraps the original arrays and keeps track of where it should be fetching values from internally.
There's a catch, though: to take advantage itertools you to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just show flat coordinate arrays. It looks like numpy has some of it's own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.

I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formula are the same as in this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm
def countTies(x):
'''Count number of ties in rows of a 2D matrix
Args:
x (ndarray): 2d matrix.
Returns:
result (ndarray): 2d matrix with same shape as <x>. In each
row, the number of ties are inserted at (not really) arbitary
locations.
The locations of tie numbers in are not important, since
they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).
Inspired by: https://stackoverflow.com/a/24892274/2005415.
'''
if np.ndim(x) != 2:
raise Exception("<x> should be 2D.")
m, n = x.shape
pad0 = np.zeros([m, 1]).astype('int')
x = copy.deepcopy(x)
x.sort(axis=1)
diff = np.diff(x, axis=1)
cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
absdiff = np.abs(np.diff(cated, axis=1))
rows, cols = np.where(absdiff==1)
rows = rows.reshape(-1, 2)[:, 0]
cols = cols.reshape(-1, 2)
counts = np.diff(cols, axis=1)+1
result = np.zeros(x.shape).astype('int')
result[rows, cols[:,1]] = counts.flatten()
return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
'''Vectorized Mann-Kendall tests on 2D matrix rows/columns
Args:
data (ndarray): 2d array with shape (m, n).
Keyword Args:
tails (int): 1 for 1-tail, 2 for 2-tail test.
axis (int): 0: test trend in each column. 1: test trend in each
row.
Returns:
z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
corresponding to data in each row in <x>.
If <axis> = 1, 1d array with length <m>, standard scores
corresponding to data in each column in <x>.
p (ndarray): p-values corresponding to <z>.
'''
if np.ndim(data) != 2:
raise Exception("<data> should be 2D.")
# alway put records in rows and do M-K test on each row
if axis == 0:
data = data.T
m, n = data.shape
mask = np.triu(np.ones([n, n])).astype('int')
mask = np.repeat(mask[None,...], m, axis=0)
s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
s = (s * mask).sum(axis=(1,2))
#--------------------Count ties--------------------
counts = countTies(data)
tt = counts * (counts - 1) * (2*counts + 5)
tt = tt.sum(axis=1)
#-----------------Sample Gaussian-----------------
var = (n * (n-1) * (2*n+5) - tt) / 18.
eps = 1e-8 # avoid dividing 0
z = (s - np.sign(s)) / (np.sqrt(var) + eps)
p = norm.cdf(z)
p = np.where(p>0.5, 1-p, p)
if tails==2:
p=p*2
return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
for j in range(arr.shape[2]):
arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.

Numerical operations on arrays while reading from CSV file

I am trying to do a few numerical operations on a few arrays while reading some values from CSV files.
I have the coordinates of a receiver which is fixed and I read coordinates of the heliostats from a CSV file which track the Sun.
The coordinates of the receiver:
# co-ordinates of Receiver
XT = 0 # X co-ordinate of Receiver
YT = 0 # Y co-ordinate of Receiver
ZT = 207.724 # Z co-ordinate of Receiver, this is the height of tower
A = np.array(([XT],[YT],[ZT]))
print(A," are the co-ordinates of the target i.e. the receiver")
The coordinates of the ten heliostats:
This data I read from a CSV file with the follwoing data:
#X,Y,Z
#-1269.56,-1359.2,5.7
#1521.28,-68.0507,5.7
#-13.6163,1220.79,5.7
#-1388.76,547.708,5.7
#1551.75,-82.2342,5.7
#405.92,-1853.83,5.7
#1473.43,-881.703,5.7
#1291.73,478.988,5.7
#539.027,1095.43,5.7
#-1648.13,-73.7251,5.7
I read the coordinates of the CSV as follows:
import csv
# Reading data from csv file
with open('Heliostat Field Layout Large heliostat.csv') as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
X = []
Y = []
Z = []
for row in readCSV:
X_coordinates = row[0]
Y_coordinates = row[1]
Z_coordinates = row[2]
X.append(X_coordinates)
Y.append(Y_coordinates)
Z.append(Z_coordinates)
Xcoordinate = [float(X[c]) for c in range(1,len(X))]
Ycoordinate=[float(Y[c]) for c in range(1,len(Y))]
Zcoordinate=[float(Z[c]) for c in range(1,len(Z))]
Now, when I try to print the co-ordinates of the ten heliostats, I get three big arrays with all Xcoordinate, Ycoordinate and Zcoordinate grouped into one instead of ten different outputs.
[[[-1269.56 1521.28 -13.6163 -1388.76 1551.75 405.92 1473.43
1291.73 539.027 -1648.13 ]]
[[-1359.2 -68.0507 1220.79 547.708 -82.2342 -1853.83
-881.703 478.988 1095.43 -73.7251]]
[[ 5.7 5.7 5.7 5.7 5.7 5.7 5.7
5.7 5.7 5.7 ]]] are the co-ordinates of the heliostats
I used:
B = np.array(([Xcoordinate],[Ycoordinate],[Zcoordinate]))
print(B," are the co-ordinates of the heliostats")
What is the mistake?
Further, I would like to have an array where I wuold like B - A
for which I use:
#T1 = matrix(A)- matrix(B)
#print(T1," is the target vector for heliostat 1, T1")
How should i do a numerical operation on Arrays A and B? I tried a matrix operation here. Is that wrong?

Your code is correct
The following output is the way numpy arrays are displayed.
[[-1359.2 -68.0507 1220.79 547.708 -82.2342 -1853.83
-881.703 478.988 1095.43 -73.7251]]
Despite the illusion that the values are stuck together, they are perfectly distinct in the array. You can access to a single value with
print(B[1, 0, 0]) # print Y[0]
The substraction of arrays A and B you want to perform will work
T1 = np.matrix(A)- np.matrix(B)
print(T1," is the target vector for heliostat 1, T1")
May I make two suggestions ?
You can read a numpy array written as a matrix in a text file (it's the case here) with the function loadtxt of numpy :
your_file = 'Heliostat Field Layout Large heliostat.csv'
B = np.loadtxt(your_file, delimiter=',', skiprows=1)
The result will be a (3, 10) numpy array.
You can perform broadcasing operations directly on numpy arrays (so you don't need to convert it in matrix). You just need to be careful with the dimensions.
In your original script you just need to write :
T1 = A - B
If you get array B with loadtxt as suggested, you will get a (10, 3) array, while A is a (3, 1) array. The array B must first be reshaped in a (3, 10) array :
B = B.reshape((3, 10))
T1 = A - B
EDIT : compute the norm of each 3D vector of T1
norm_T1 = np.sqrt( np.sum( np.array(T1)**2, axis=0 ) )
Note that in your code T1 is a matrix, so T1**2 is a matrix product. In order to compute sqrt( v[0]**2 + v[1]**2 + v[2]**2 ) for each vector v of T1, I first convert it to a numpy array.

Accelerate UDF in xlwings

I tried to experiment some features from Xlwings. I would like to use a common function from numpy which allowed to interpolate quickly (numpy.interp).
#xlfunc
def interp(x, xp,yp):
"""Interpolate vector x on vector (xp,yp)"""
y=np.interp(x,xp,yp)
return y
#xlfunc
def random():
"""Returns a random number between 0 and 1"""
x=np.random.randn()
return x
For instance, I create two vectors (xp, yp) like this (in Excel)
800 rows
First Column Second Column
0 =random()
1 =random()
2 =random()
3 =random()
4 =random()
5 =random()
[...]
In the third columns I create another vector (60 row), with random number bewteen 0 and 800 (ranked in ascending order)
Which give me something like this :
Third Column
17.2
52.6
75.8
[...]
I would like to interpolate the third column into the first column. So
Fourth Column
=interp(C1,A1:A800,B1:B800)
=interp(C2,A1:A800,B1:B800)
=interp(C3,A1:A800,B1:B800)
[...]
It's easy to do this. But if I have 10 or more columns to interpolate it could take too much time. I am sure there is a better way to do this. An idea ?
Thanks for your help !
edit :
I tried this but doesn't work at "xw.Range[...].value=y"
#xw.xlfunc
def interpbis(x, xp,yp):
"""Interpolate scalar x on vector (xp,yp)"""
thisWB=xw.Workbook.active()
thisSlctn=thisWB.get_selection(asarray=True)
sheet=thisSlctn.xl_sheet.name
r = thisSlctn.row
c = thisSlctn.column
y=np.interp(x,xp,yp)
xw.Range(sheet,(r,c)).value=y
return None

The short answer is: Use Excel's array functions.
The long answer is:
First, update to xlwings v0.6.4 (otherwise what I am going to show for random() will not work). Then change your functions as follows:
from xlwings import Application, Workbook, xlfunc
import numpy as np
#xlfunc
def interp(x, xp, yp):
"""Interpolate vector x on vector (xp,yp)"""
y = np.interp(x, xp, yp)
return y[:, np.newaxis] # column orientation
#xlfunc
def random():
"""Returns a random number between 0 and 1"""
app = Application(Workbook.caller())
# We shall make this easier in a future release (getting the array dimensions)
r = app.xl_app.Caller.Rows.Count
c = app.xl_app.Caller.Columns.Count
x = np.random.randn(r, c)
return x
Now use array formulas in Excel as described here (Ctrl+Shift+Enter).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.