Loading .txt data into 10x256 3d numpy array - python

I'm trying to load some text files into numpy arrays. The .txt files represent the pixels of an image, where each pixel is given an arbitrary relative coordinate between -10 and +10 for x and between 0 and 10 for y. In total, the image is 10x256 pixels. The catch is that each pixel isn't given RGB values; it is given a list of intensities that correspond to the wavelength values in the first newline-separated "header" line. Each pixel's coordinates are the first two tab-separated items on its line, and the header line just starts with "0 0" because it holds the wavelengths rather than a pixel. The format of the text files is as follows:
Line 1: "0 0 625.15360 625.69449 626.23538 ..." (two coordinates followed by the wavelengths)
Line 2: "-10.00000 -10.00000 839 841 833 843 838 847 ..."
Line 3: "-10.00000 -9.92157 838 839 838 ..."
Here 839 and 838 are the intensities at wavelength 625.15360 for two adjacent pixels, one above the other (with a small change in y). Likewise, 841 and 839 are the intensities at wavelength 625.69449, and so on.
My reasoning thus far has been to iterate through the file using np.genfromtxt() and fill a new 3D numpy array indexed by (x, y, lambda), with each entry holding a single intensity value. Also, I think it would make much more sense if x and y spanned 0-9 and 0-255 respectively to represent the image, instead of the arbitrary relative coordinates given in the data...
Problem: I don't know how to load the data into a 3D array (I'm still stuck figuring out the 2D case) and I can't seem to slice properly...
What I have so far:
intensity_array2 = np.zeros([len(unique_y), len(unique_x)], dtype=int)
for element in np.nditer(intensity_array2, op_flags=['readwrite']):
    for i in range(len(unique_y)):
        for j in range(len(unique_x)):
            with open(os.path.join(path_name, right_file)) as rf:
                intensity_array2[i,j] = np.genfromtxt(rf, skip_header=(i*j)+j, delimiter=" ")
Where len(unique_y) = 10 and len(unique_x) = 256 are found in a function above.

I'm not sure I entirely understand your file format, so forgive me if this does not make sense. However, if there is any way you can load all the data in at once I am sure it will run much faster. It appears to me that you can use this to get all the data into memory:
data = np.genfromtxt(rf, delimiter = " ")
Then create your 3D array:
intensity_array2 = np.zeros( (10, 256, num_wavelengths) )
Then fill in the values of the 3D array:
intensity_array2[ data[:,0], data[:,1], :] = data[:, 2:]
This will not work exactly because your x and y indices can go negative--you might need to add an offset in this case. Also, if your input file is in a predictable format, you may be able to simply call np.reshape() on the data matrix to get what you want.
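As a rough sketch of that offset/remapping idea (not the asker's exact data; here the distinct coordinate values are simply mapped to 0-based indices with np.searchsorted):
import numpy as np

# data: the (n_rows, 2 + n_wavelengths) array from genfromtxt, after dropping
# the header row that holds the wavelengths
unique_x = np.unique(data[:, 0])   # sorted distinct x coordinates
unique_y = np.unique(data[:, 1])   # sorted distinct y coordinates

ix = np.searchsorted(unique_x, data[:, 0])   # 0-based integer index for each pixel
iy = np.searchsorted(unique_y, data[:, 1])

intensity_array = np.zeros((len(unique_x), len(unique_y), data.shape[1] - 2))
intensity_array[ix, iy, :] = data[:, 2:]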

Building on Lukeclh's answer, try:
data = np.genfromtxt(rf)
Then, cleave off the wavelength values
wavelengths = data[0]
intensities = data[1:]
We can now rearrange the data using reshape:
intensitiesshaped = np.reshape(intensities, (len(unique_x),len(unique_y),-1))
The "-1" value says 'the rest goes here'.
We still have the leading coordinate values on each of these arrays. To trim them, we can do:
wavelengths = wavelengths[2:]
intensitiesshaped = intensitiesshaped[:,:,2:]
This just throws the information in the first two indices away. If you need to retain it you'll have to do something a bit more sophisticated.
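Putting the two answers together, a minimal end-to-end sketch might look like this (the file name is hypothetical, and it assumes the row ordering described in the question, i.e. all y values for one x before moving to the next x):
import numpy as np

data = np.genfromtxt("image_data.txt")   # hypothetical file name

wavelengths = data[0, 2:]   # drop the leading "0 0" of the header row
pixels = data[1:]           # one row per pixel: x, y, then intensities
intensities = pixels[:, 2:]

# 256 x 10 x n_wavelengths cube; swap the first two sizes if the rows are ordered y-major
cube = intensities.reshape(256, 10, -1)
print(cube.shape, len(wavelengths))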

Related

how to restructure data from h5py written in python

I am reading functions from an existing file using the h5py library.
readFile = h5py.File('File', 'r')
Using readFile.keys() I obtained the list of the functions stored in 'File'. One of these functions is phi. To read the function phi, I did
phi = numpy.array(readFile['phi'])[:,0,:,:]
In [:,0,:,:], the indices reflect how the data is stored: [blocks, z, y, x]. z = 0 because it is a 2D case. x is divided into 2 blocks and y is divided into 2 blocks; each x block is subdivided into nxb cells (x1, x2, ..., x20), and each y block into nyb cells. (nxb and nyb can also be obtained directly from the file using h5py, as they are stored in the file. The domain of the data is also stored in the file, under ['bounding box'].)
Then, the code for the grid is:
nxb = numpy.array(readFile['integer scalars'])[0][1]
nyb = numpy.array(readFile['integer scalars'])[1][1]
X = numpy.zeros([block, nxb, nyb])
Y = numpy.zeros([block, nxb, nyb])
for block in range(block):
    x_min, x_max = numpy.array(readFile['bounding box'])[block,0,:]
    y_min, y_max = numpy.array(readFile['bounding box'])[block,1,:]
    X[block,:,:], Y[block,:,:] = numpy.meshgrid(numpy.linspace(x_min,x_max,nxb),
                                                numpy.linspace(y_min,y_max,nyb))
My question is that I am trying to restructure the data (see the figure). I want to place the data of block 2 above the data of block 1 rather than next to it, which means I need to create new coordinates i' and j' related to the old coordinates i and j. I tried this, but it is not working:
for i in range(X):
    for j in range(Y):
        i' = i - len(X[0:1,:,:]
        j' = j + len(Y[0:1,:,:]
        phi(i',j') = phi
When working with HDF5 data, it's important to understand your data schema before you start writing code. Here are my initial observations and suggestions.
Your question is a little hard to follow. (For example, you are using the term "functions" to describe HDF5 datasets.) HDF5 organizes data in datasets and groups. Your data of interest is in 2 datasets: 'phi' and 'integer scalars'.
You can simplify the code to access the datasets as Numpy arrays using the following:
with h5py.File('File','r') as readFile:
    # to get the axis dimensions for 'phi':
    print(f"Shape of Dataset phi: {readFile['phi'].shape}")
    phi_ds = readFile['phi']      # to get a dataset object
    phi_arr = readFile['phi'][()] # to read dataset as a numpy array
    # to get the axis dimensions for 'integer scalars'
    nxb, nyb = readFile['integer scalars'].shape
I don't understand what you mean by "blocks". Are you referring to the axis dimensions? Also, why are you using meshgrid? If you simply want to change dimensions, use Numpy's .reshape() method to change the axis dimensions of the Numpy array.
Here is a simple example that creates a 2x2 dataset, then reads it into a new array and reshapes it to 4x1. I think this is what you want to do. Change the values of a0 and a1 if you want to increase the size. The reshape operation will read the shape from the first array and reshape the new array to (N,1), where N is your nxb*nyb value.
import h5py
import numpy as np

with h5py.File('SO_72340647.h5','w') as h5f:
    a0, a1 = 2, 2
    arr = np.arange(a0*a1).reshape(a0,a1)
    h5f.create_dataset('ds_2x2', data=arr)

with h5py.File('SO_72340647.h5','r') as h5f:
    print(f"Shape of Dataset ds_2x2: {h5f['ds_2x2'].shape}")
    ds_arr = h5f['ds_2x2'][()]
    print(ds_arr)
    ds0, ds1 = ds_arr.shape
    new_arr = ds_arr.reshape(ds0*ds1, 1)
    print(f"Shape of new (reshaped) array: {new_arr.shape}")
    print(new_arr)
Note: h5py dataset objects "behave like" Numpy arrays. So, you frequently don't have to read into an array to use the data.
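If the "blocks" are separate tiles of the domain and the goal is simply to place one block's data above the other in a single array, a minimal sketch (assuming phi is stored as [blocks, z, y, x] as described in the question; the block order and axis may need adjusting to match your layout) could be:
import h5py
import numpy as np

with h5py.File('File', 'r') as readFile:
    phi = readFile['phi'][()][:, 0, :, :]   # shape: (blocks, y, x)

# stack block 1 on top of block 0 along the y axis
phi_stacked = np.concatenate([phi[1], phi[0]], axis=0)   # shape: (2*ny, nx)
print(phi_stacked.shape)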

Perform a different neighbourhood operation for specified pixels

I have an HxW "feature map", F. Let us assume it is an HxWx1 map. Through some other operation, I have a set of pixels that are of interest to me (say N pixels). Each of these pixels is associated with a different value, so my set is of the form Nx3, where each pixel is described by x, y and val. Note that this val is different from the feature map value at that location.
Here is my question. Is it possible to vectorize a neighbourhood operation for each of these points? For each of the N pixels, I wish to multiply its val into its 3x3 neighbourhood in the feature map F, giving a new 3x3 set of values. I then want to replace that pixel's (x, y) with the location of the maximum of these new values within the 3x3 window.
This sounds similar to a convolution (slight abuse of terminology here) followed by a max pool operation, but not exactly since each pixel location has a different val to be multiplied.
Sample input and output, and walkthrough for required solution
Let us assume H=10 and W=10
Here is a sample F
0.635955 0.922379 0.993406 0.007837 0.818661 0.983730 0.199866 0.757519 0.073152 0.015831
0.397718 0.097353 0.231351 0.177886 0.343099 0.419940 0.017342 0.087294 0.402266 0.366337
0.978686 0.476594 0.067836 0.148977 0.058994 0.810586 0.542894 0.797419 0.386559 0.225982
0.479860 0.033354 0.353366 0.431562 0.336208 0.674272 0.398151 0.713732 0.598623 0.829230
0.940838 0.869564 0.287100 0.669844 0.631836 0.748982 0.762292 0.597999 0.540236 0.758802
0.925995 0.141296 0.466772 0.672663 0.929746 0.544029 0.991860 0.197474 0.762866 0.798973
0.543519 0.128332 0.624323 0.876569 0.050709 0.223705 0.708381 0.380842 0.818092 0.163447
0.283125 0.329618 0.283481 0.672950 0.136922 0.897785 0.385479 0.764824 0.132671 0.091148
0.661984 0.369459 0.501181 0.352681 0.554113 0.133283 0.593048 0.108534 0.397813 0.836065
0.654929 0.928576 0.539204 0.931213 0.344114 0.591214 0.126809 0.456681 0.036531 0.725228
My structure of pixels, let us say N=3
The three values, in the order row, col, val (for simplicity I assume x is rows and y is cols, though that isn't necessarily the case). These values are completely independent of the feature map in the previous step.
3,2,0.38
4,4,0.602
7,5,0.9647
The neighborhood around (3,2) is:
[[0.4765941 , 0.06783561, 0.14897662],
[0.03335438, 0.35336647, 0.4315618 ],
[0.86956374, 0.28709952, 0.66984412]]
Thus val * neighborhood (here val is 0.38) yields:
[[0.18110576, 0.02577753, 0.05661112],
[0.01267466, 0.13427926, 0.16399349],
[0.33043422, 0.10909782, 0.25454077]]
The location of max value here is (2,0) i.e. (1,-1) with respect to center pixel. Thus my updated (x,y) should be (3,2) + (1,-1) = (4,1).
Similarly for the other two, the updated pixels are : (5,4) and (7,5)
How can I parallelize this entire thing?
(Hopefully to be loaded onto a GPU using Pytorch, but not necessarily, I have not come to that stage yet.)
Note: I had asked this question a few days ago, but it was poorly framed without proper info. Hopefully this solves the issue.
Edit: For this specific instance, F can be produced as a random array:
F = np.random.rand(10,10)
If I understand correctly, you want this:
import numpy as np
from skimage.util.shape import view_as_windows

idx = pixels[:, 0:2].astype(int)  # integer (row, col) of each pixel of interest
# 3x3 windows centred on each pixel, scaled by that pixel's val
windows = view_as_windows(F, (3, 3))[tuple(idx.T - 1)] * pixels[:, -1][:, None, None]
# argmax in each window -> offset within the window -> new absolute coordinates
print((np.unravel_index(windows.reshape(-1, 9).argmax(1), (3, 3)) + idx.T).T - 1)
# if you need to replace the values of F with the new values
F[tuple(idx.T)] = windows.reshape(-1, 9).max(1)
I assumed your window shape is (3,3). Of course, you can change it. And if you need to deal with edge neighborhoods, pad your F with enough 0s (depending on your window size) using np.pad before using the view_as_windows.
output:
[[4 1]
[5 4]
[7 5]]
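For readability, here is a loop-based equivalent of the one-liner above, with pixels taken from the question and a random F (just a sketch to show what the vectorized version computes; it ignores edge handling):
import numpy as np

F = np.random.rand(10, 10)                 # hypothetical feature map
pixels = np.array([[3, 2, 0.38],
                   [4, 4, 0.602],
                   [7, 5, 0.9647]])        # rows of (row, col, val)

new_coords = []
for r, c, val in pixels:
    r, c = int(r), int(c)
    neigh = F[r-1:r+2, c-1:c+2] * val              # scaled 3x3 neighbourhood
    dr, dc = np.unravel_index(neigh.argmax(), (3, 3))
    new_coords.append((r + dr - 1, c + dc - 1))    # offset relative to the centre
print(new_coords)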

Do anyone know how to extract Image coordinate from Marmot dataset?

Marmot is a document image dataset (http://www.icst.pku.edu.cn/cpdp/data/marmot_data.htm) in which several things are labelled, such as document body, image area, table area, table caption and so on. This dataset is used specifically for document image analysis research. All coordinates are given as 16-digit hexadecimal values in little-endian format. Has anyone worked with this dataset, and how do I convert those 16-digit XY coordinates to a human-understandable format?
I finally worked it out after some analysis, and I'm posting it here in case anyone else needs to investigate this dataset. The unit used to convert the given coordinates into pixel values was difficult to trace, because it is not mentioned in the manual/guideline; it only appears elsewhere as an annotation.
First you have to convert the 16-character hexadecimal value using the IEEE 754 double format. For example, the given coordinates for a label are:
BBox=['4074145c00000005', '4074dd95999999a9', '4080921e74bc6a80', '406fb9999999999a']
Convert using python,
import struct
conv_pound = [struct.unpack('!d', str(t).decode('hex'))[0] for t in BBox]
You will get values in the "pound" unit (a printer's point, 1/72 inch). We usually use coordinates in pixels, and 1 inch is 96 pixels (at 96 DPI). So,
conv_pound = [321.2724609375003, 333.8490234375009, 530.2648710937501, 253.8]
Then divide each value by 72 and multiply by 96 to finally get the corresponding pixel values, which are:
in_pixel = [428.36328, 445.13203, 707.01983, 338.40000]
Pixel positions are counted from the bottom-left corner of the document image. If you count from the top-left corner (as is usual), you have to subtract the 2nd and 4th values from the image height. If the image [height, width] is [1123, 793], then we can represent the above coordinates as integers:
label_boundary = [428, 678, 707, 785]
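As a compact sketch of the whole conversion in Python 3 (the example BBox and the image height of 1123 are taken from above; the 72-to-96 scaling is the assumption described in this answer):
import struct

BBox = ['4074145c00000005', '4074dd95999999a9', '4080921e74bc6a80', '406fb9999999999a']
img_height = 1123

points = [struct.unpack('!d', bytes.fromhex(t))[0] for t in BBox]   # 1/72 inch units
pixels = [v / 72 * 96 for v in points]                              # assume 96 DPI
x0, y0, x1, y1 = pixels
# flip the 2nd and 4th values so the origin is the top-left corner
label_boundary = [round(x0), round(img_height - y0), round(x1), round(img_height - y1)]
print(label_boundary)   # [428, 678, 707, 785]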
After staring at the xmls for an hour, I've found the last missing piece in the answer by @MMReza:
You don't need to rely on the units of measure (step 3 above). The root element "Page" has an attribute called "CropBox"; use that to scale the coordinates.
I have something along the following lines (also inverse y axis here):
px0, py1, px1, py0 = list(map(hex_to_double, page.get("CropBox").split()))
pw = abs(px1 - px0)
ph = abs(py1 - py0)
for table in page.findall(".//Composite[@Label='TableBody']"):
    x0p, y1m, x1p, y0m = list(map(hex_to_double, table.get("BBox").split()))
    x0 = round(imgw*(x0p - px0)/pw)
    x1 = round(imgw*(x1p - px0)/pw)
    y0 = round(imgh*(py1 - y0m)/ph)
    y1 = round(imgh*(py1 - y1m)/ph)
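The snippet assumes a hex_to_double helper; a minimal version, using the same IEEE 754 unpacking as the other answers, could be:
import struct

def hex_to_double(s):
    # convert a 16-character hex string to a float (IEEE 754 double)
    return struct.unpack('!d', bytes.fromhex(s))[0]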
In case anyone is trying to do this in Python 3 like I did, you only have to change step 2 of the other answer like this :
conv_pound = [struct.unpack('!d', bytes.fromhex(t))[0] for t in BBox]
I wanted to convert the coordinates and also verify that my conversion actually worked. So, I made this script to read a label file and its respective image file, extract the coordinates of the table body (for example), and visualize them on the image. It can be used to extract other fields in a similar manner. The comments explain it all.
import glob
import struct
import cv2
import binascii
import re

xml_files = glob.glob("path_to_labeled_files/*.xml")
for i in xml_files:
    # Open the current file and read everything
    cur_file = open(i, "r")
    content = cur_file.read()
    # Find index of all occurrences of only the needed portions (e.g. TableBody in this case)
    residxs = [l.start() for l in re.finditer('Label="TableBody"', content)]
    # Read the image
    img = cv2.imread("path_to_images_folder/"+i.split('/')[-1][:-3]+"jpg")
    # Traverse over all occurrences
    for r in residxs[:-1]:
        # List to store output points
        coords = []
        # Start index of an occurrence
        sidx = r
        # Substring from whole file content
        substr = content[sidx:sidx+400]
        # Now find start index and end index of coordinates in this substring
        sidx = substr.find('BBox="')
        eidx = substr.find('" CLIDs')
        # String containing only the points
        points = substr[sidx+6:eidx]
        # Make the conversion (also take care of little and big endian in unpack)
        for j in points.split(' '):
            if j == '':
                continue
            coords.append(struct.unpack('>d', binascii.unhexlify(j))[0])
        if len(coords) != 4:
            continue
        # As suggested by MMReza (points -> pixels)
        for k in range(4):
            coords[k] = (coords[k]/72)*96
        # Flip the y values so the origin is the top-left corner
        coords[1] = img.shape[0] - coords[1]
        coords[3] = img.shape[0] - coords[3]
        # Print the extracted coordinates
        print(coords)
        # Visualize it on the image
        cv2.rectangle(img, (int(coords[0]), int(coords[1])), (int(coords[2]), int(coords[3])), (255, 0, 0), 2)
        cv2.imshow("frame", img)
        cv2.waitKey(0)

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test for each coordinate and not just each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then this process is repeated for every single coordinate in each array. In total the M-K test would be done 27840 times, leaving me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j], alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it, and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop, so it only keeps the result from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test

data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)

start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
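A sketch of that idea as a thin wrapper, rather than editing mk_test.py itself (this assumes mk_test returns a (trend, h, p, z) tuple as in the linked repository; adjust if the signature differs):
# hypothetical numeric wrapper around mk_test
trend_codes = {'no trend': 0, 'increasing': 1, 'decreasing': -1}

def mk_test_numeric(series, alpha=0.05):
    trend, h, p, z = mk_test(series, alpha=alpha)
    return trend_codes.get(trend, 0), int(h), p, z
The results could then live in a plain float array of shape (145, 192, 4) instead of an object array.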
Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array into a 1D array. I know there is probably an easier way to do this, but I ultimately used the following code, based on the code Grr suggested.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of coordinates that will be on my map (i.e. it's just 145*192), and 46 is the number of years the data spans.
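(For reference, the flattening above can likely be done in one step with a transpose and a reshape, assuming yrmax has shape (46, 145, 192):
out2 = yrmax.transpose(1, 2, 0).reshape(-1, 46)   # shape (27840, 46), one row per coordinate
This gives the rows in the same coordinate order as the append loop.)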
I then used the following loop I modified from Grr's code to find Kendall-tau and it's respective p-value at each latitude and longitude over the 46 year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x, out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown: newdata = np.reshape(y,(145,192,2)), so the final array is in a suitable format for creating a global map of both tau and p-values.
Thanks everyone for the assistance!
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145, 192)

year_matrices = load_years()  # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)

for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)

print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4

coord_arrays = zip(*year_arrays)  # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')

flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track of where it should fetch values from internally.
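A tiny Python 2 sketch of that substitution, reusing year_arrays and analyze_trend from the example above:
from itertools import izip, imap

coord_iter = izip(*year_arrays)                  # lazily yields one coordinate's series at a time
flat_result = list(imap(analyze_trend, coord_iter))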
There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also, please note that in the example I've glossed over the numpy ndarray details and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools; e.g., another answer notes that taking the transpose of an array does not make a copy. Your question was somewhat general, so I've tried to give some general tips on ways one might deal with larger data in Python.
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm

def countTies(x):
    '''Count number of ties in rows of a 2D matrix

    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the numbers of ties are inserted at (not really) arbitrary
            locations.
            The locations of the tie numbers are not important, since
            they will subsequently be put into a formula of sum(t*(t-1)*(2t+5)).

    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")

    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')

    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))

    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()

    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns

    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to each column of <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to each row of <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")

    # always put records in rows and do the M-K test on each row
    if axis == 0:
        data = data.T

    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))

    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)

    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8  # avoid dividing by 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)

    if tails == 2:
        p = p*2

    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
    2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
    4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
    1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])

# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)

# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)

import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.

Trying to get Python to concatenate generated column vectors to form a two dimensional array. It's not working

I'm new to Python. I've done this particular task before in MATLAB, and I'm trying to get the hang of the syntax and particular behaviour of Python, as I'll be using this language much more in future.
The task: I am taking 43,200 single data points (integers, but written as decimals) and performing a fast-fourier transform on a "window" of 600 at a time, shifting this window by 60 data points each time. Hence, this transform will output 600 fourier coefficients, 720 times - I will end up with a 600 x 720 matrix (rows, columns).
These data points are initially contained in a list and turned into a column vector after being FFT'd. The issue comes when I try to build the matrix in a loop: take the first 600 points, FFT them, and dump them in an empty array; take the next 600, do the same thing, but now stack these two columns side by side, then three, then four... etc. I've been trying for several hours now, but whatever I try I cannot get it to work: it consistently outputs my "final" matrix (the one that was meant to be 600 x 720) with the exact same dimensions as each generated "block".
My code (relevant sections):
for i in range(npoints):
    newdata.append(float(newy.readline()))  #Read data from file

FFT_out = []      #Initialize empty FFT output array
window_size = 600 #Number of points in data "window"
window_skip = 60  #Number of points window moves across
j = 0             #FFT count variable

for i in range(0, npoints, window_skip):
    block = np.fft.fft(newdata[i:i+window_size]) #FFT Computation of "window"
    block = block[:, np.newaxis] #turn into column vector (n, 1)

if j == 0:
    FFT_out = block
    j = 1
else:
    np.hstack((FFT_out, block))
    j = j + 1

print("Shape of FFT matrix:")
print(np.shape(FFT_out))
print("Number of times FFT completed:")
print(j)
At this point, I'm willing to believe it's a fundamental flaw on my understanding of how Python does matrices or deals with arrays. I've tried reading about it, but I still cannot see where I'm going wrong. Any help would be greatly appreciated!
The first thing to note is that Python uses indentation to form blocks, so as posted you would only ever assign once to FFT_out and never actually call np.hstack.
Then assuming that this was in fact only a cut&paste issue when posting your question, you should note that hstack returns a concatenation of its arguments without actually modifying them. To accumulate the concatenation, you should then assign the result back to FFT_out:
FFT_out = np.hstack((FFT_out, block))
You should then be able to get a 600 x 720 matrix with:
for i in range(0, npoints, window_skip):
    block = np.fft.fft(newdata[i:i+window_size])
    block = block[:, np.newaxis] #turn into column vector (n, 1)
    if j == 0:
        FFT_out = block
        j = 1
    else:
        FFT_out = np.hstack((FFT_out, block))
        j = j + 1
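As an aside, calling np.hstack inside the loop copies the growing matrix on every iteration; a common alternative (a sketch reusing the variable names above) is to collect the columns in a list and stack once at the end:
import numpy as np

blocks = []
for i in range(0, npoints - window_size + 1, window_skip):   # stop before the window runs past the data
    blocks.append(np.fft.fft(newdata[i:i+window_size]))
FFT_out = np.column_stack(blocks)   # shape: (window_size, number_of_windows)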
