Applying function over entire parameter space/2d array, Numpy

I'm in one of those weird places where I know exactly what I want to do. I could easily code it up using for loops, but I'm trying to learn Numpy and I can't work out how to formulate this in Numpy.
I want to have a 2d array or parameter space. All values between 1200 and 1800, and all combinations therein. So [1200, 1200], [1200, 1201], [1200, 1202] .... [1201, 1200], [1201, 1201] etc.
I want to apply a function across this entire parameter space. The function uses two further arrays, check_array1 and check_array2, which hold random values in the same 1200-1800 range, e.g. [1356, 1689, 1436, ...] and [1768, 1495, 1358, ...].
The function needs to move through the parameter space checking a condition, which is basically: if x < check_array1 and y < check_array2 then 1 else 0, where x and y are the coordinates of each specific point in the 2d parameter space. It needs to check against every value combination in the check arrays, sum the total, compare it to another static value, and return the difference.
Each unique combination in the parameter space grid will then have a unique value associated with it based on how those specific x and y values from the parameter space compare to the 2 check arrays.
Hopefully the above makes sense; I just can't figure out how to work this into a Numpy-friendly problem. Sorry for the wall of text.
Edit: I've written it in more basic Python to better illustrate what I'm trying to do.
check1 = np.random.randint(1200, 1801, 300)
check2 = np.random.randint(1200, 1801, 300)
def check_this_double(i, j, check1, check2):
    total = 0
    for num in range(0, len(check1)):
        if ((i < check1[num]) or (j < check2[num])):
            total += 1
    return total
outputs = {}
for i in range(1200, 1801):
    for j in range(1200, 1801):
        outputs[i,j] = check_this_double(i, j, check1, check2)
Edit 2: I believe I have it.
Following from Mountains code: create the p_space and then use np.vectorize on a normal Python function.
check1 = np.random.randint(1200, 1801, 300)
check2 = np.random.randint(1200, 1801, 300)
def rate_calc(i, j):
    total = np.where(np.logical_or(check1 < i, check2 < j), 1, 0)
    return total.sum()
rate_calc_v = np.vectorize(rate_calc)
final = rate_calc_v(p_space[:, 0], p_space[:, 1])
Feels kind of like cheating :), there must be a way to do it without np.vectorize. But this works for me, I believe.

I don't fully understand the problem you are trying to solve, but I hope the following gives you a starting point for how numpy can be used. I recommend going through an introductory numpy tutorial.
Numpy boolean indexing and vectorized math can improve speed and reduce the need for loops.
Here is my understanding of the first part of your question.
import numpy as np
xv, yv = np.meshgrid(np.arange(1200, 1801), np.arange(1200, 1801))
p_space = np.stack((xv, yv), axis=-1) # the 2d array described
# print original values
print(p_space[0,:10,0])
print(p_space[0,-10:,0])
old_shape = p_space.shape
p_space = p_space.reshape(-1, 2) # flatten the array for the compare
check1 = np.random.randint(1200, 1801, len(p_space))
check2 = np.random.randint(1200, 1801, len(p_space))
# you can use this to access and modify values that meet the condition
index_array = np.logical_and(p_space[:, 0] < check1, p_space[:, 1] < check2)
# do some sort of complex math
p_space[index_array] = p_space[index_array] / 2 + 10
# get the sum across the two columns
print(np.sum(p_space, axis=0))
p_space = p_space.reshape(old_shape) # return to the grid shape
# print modified values
print(p_space[0,:10,0]) # likely to be changed based on checks
print(p_space[0,-10:,0]) # unlikely to be changed
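For the counting loop in the question's first edit, one way to avoid both the Python loops and np.vectorize is broadcasting plus a matrix product. The following is only a sketch under assumptions, not code from either poster: the names grid, ge1, ge2 and totals are made up for illustration, and it relies on the identity that (i < c1) or (j < c2) is the negation of (i >= c1) and (j >= c2).

```python
import numpy as np

rng = np.random.default_rng(0)
check1 = rng.integers(1200, 1801, 300)  # same range as the question, upper bound exclusive
check2 = rng.integers(1200, 1801, 300)

grid = np.arange(1200, 1801)  # the 601 values used on both axes

# ge1[a, k] is 1 when grid[a] >= check1[k]; ge2 is the same for check2.
ge1 = (grid[:, None] >= check1[None, :]).astype(np.int64)
ge2 = (grid[:, None] >= check2[None, :]).astype(np.int64)

# both_ge[a, b] counts the k where BOTH comparisons fail the "<" test.
both_ge = ge1 @ ge2.T

# Everything else satisfies (i < check1[k]) or (j < check2[k]).
totals = len(check1) - both_ge  # shape (601, 601), indexed as [i - 1200, j - 1200]
```

Here totals[i - 1200, j - 1200] plays the role of outputs[i, j] in the loop version. The intermediate arrays are only 601 x 300, so this stays small in memory compared with a full 601 x 601 x 300 boolean cube.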

Related

Sum values from numpy array if condition on value in another array is met

I'm facing a problem with vectorizing a function so that it applies efficiently on a numpy array.
My program's inputs:
A pos_part 2D array with Nb_particles rows and 3 columns (basically x, y, z coordinates; only z is relevant for the part that bothers me). Nb_particles can be up to several hundred thousand.
A prop_part 1D array with Nb_particles values. This part I've got covered; it is created with some nice numpy functions. Here I just use a basic distribution that resembles real values.
A z_distances 1D array, a simple np.arange between z=0 and z=z_max.
Then comes the calculation that takes time, because I can't find a way to do it properly with only numpy array operations. What I want to do is:
For all distances z_i in z_distances, sum all values from prop_part if corresponding particle coordinate z_particle < z_i. This would return a 1D array the same length as z_distances.
My ideas so far :
Version 0: a for loop with enumerate and np.where to retrieve the indices of the values that I need to sum. Obviously quite slow.
Version 1: using a mask on a new array (a combination of z coordinates and particle properties) and summing the masked array. Seems better than v0.
Version 2: another mask and np.vectorize, but I understand it's not efficient, as vectorize is basically a for loop. Still seems better than v0.
Version 3: I'm trying to use a mask in a function that I can apply directly to z_distances, but it's not working so far.
So, here I am. There may be something to do with a sort and a cumulative sum, but I don't know how to do this, so any help would be greatly appreciated. The code below should make things clearer.
Thanks in advance.
import numpy as np
import time
import matplotlib.pyplot as plt
# Creation of particles' positions
Nb_part = 150_000
pos_part = 10*np.random.rand(Nb_part,3)
pos_part[:,0] = pos_part[:,1] = 0
# useful property creation
beta = 1/1.5
prop_part = (1/beta)*np.exp(-pos_part[:,2]/beta)
z_distances = np.arange(0,10,0.1)
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = np.where(pos_part[:,2]<val_dist)[0]
    result[index_dist] = sum(prop_part[i] for i in positions)
print("v0 :",time.time()-t0)
#A graph to help understand
plt.figure()
plt.plot(z_distances,result, c="red")
plt.ylabel("Sum of particles' usefull property for particles with z-pos<d")
plt.xlabel("d")
#version 1 ??
t1=time.time()
combi = np.column_stack((pos_part[:,2],prop_part))
result2 = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    mask = (combi[:,0]<val_dist)
    result2[index_dist]=sum(combi[:,1][mask])
print("v1 :",time.time()-t1)
plt.plot(z_distances,result2, c="blue")
#version 2
t2=time.time()
def themask(a):
    mask = (combi[:,0]<a)
    return sum(combi[:,1][mask])
thefunc = np.vectorize(themask)
result3 = thefunc(z_distances)
print("v2 :",time.time()-t2)
plt.plot(z_distances,result3, c="green")
### This does not work so far
# version 3
# =============================
# t3=time.time()
# def thesum(a):
# mask = combi[combi[:,0]<a]
# return sum(mask[:,1])
# result4 = thesum(z_distances)
# print("v3 :",time.time()-t3)
# =============================

You can get a lot more performance by writing your first version completely in numpy. Replace Python's sum with np.sum. Instead of the for i in positions generator expression, simply pass the positions mask you are creating anyway.
Indeed, np.where is not necessary, and my best version looks like:
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = pos_part[:, 2] < val_dist
    result[index_dist] = np.sum(prop_part[positions])
print("v0 :",time.time()-t0)
# out: v0 : 0.06322097778320312
You can get a bit faster if z_distances is very long by using numba.
Running calc for the first time usually incurs some overhead (numba compiles the function on its first call), which we can get rid of by first running the function on a small subset of z_distances.
The code below achieves roughly a factor of two speedup over pure numpy on my laptop.
import numba as nb
@nb.njit(parallel=True)
def calc(result, z_distances):
    n = z_distances.shape[0]
    for ii in nb.prange(n):
        pos = pos_part[:, 2] < z_distances[ii]
        result[ii] = np.sum(prop_part[pos])
    return result
result4 = np.zeros_like(result)
# _t = time.time()
# calc(result4, z_distances[:10])
# print(time.time()-_t)
t3 = time.time()
result4 = calc(result4, z_distances)
print("v3 :", time.time()-t3)
plt.plot(z_distances, result4)
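The question also mentions that a sort and a cumulative sum might help. A minimal sketch of that idea follows, assuming the same synthetic data as in the question; the names order, csum and counts are illustrative and not from the original posts. After sorting by z, the sum of all properties with z below a threshold is a prefix sum, and np.searchsorted finds how long that prefix is for every threshold at once.

```python
import numpy as np

rng = np.random.default_rng(0)
z = 10 * rng.random(150_000)           # stand-in for pos_part[:, 2]
beta = 1 / 1.5
prop = (1 / beta) * np.exp(-z / beta)  # stand-in for prop_part
z_distances = np.arange(0, 10, 0.1)

order = np.argsort(z)
z_sorted = z[order]

# csum[k] is the sum of the k smallest-z properties (csum[0] == 0).
csum = np.concatenate(([0.0], np.cumsum(prop[order])))

# side="left" gives the number of particles with z strictly below each distance.
counts = np.searchsorted(z_sorted, z_distances, side="left")
result = csum[counts]
```

The sort is done once, so the cost no longer grows with the number of particles multiplied by the number of distances. Results can differ from the masked sums in the last few floating-point digits because the additions happen in a different order.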

Build a coupled map lattice using 2D array

So I'm trying to build a coupled map lattice on my computer.
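A coupled map lattice (CML) is given by this equation:
x_{n+1}(i) = (1 - \epsilon) f(x_n(i)) + \frac{\epsilon}{2} [ f(x_n(i-1)) + f(x_n(i+1)) ]
where the function f(x_n) is a logistic map:
f(x_n) = r x_n (1 - x_n)
with x values between 0 and 1, and r = 4 for this CML.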
Note: 'n' can be thought of as time, and 'i' as space
I have spent a lot of time understanding the iterations and I came up with the code below; however, I'm not sure if this is the correct code to iterate this equation.
Note: I have used 2d numpy arrays, where rows are 'n' and columns are 'i' as obvious from the code.
So basically, I want to develop a code to simulate this equation, and here is my take on that
Don't jump to the code directly, you won't understand what's happening without bothering to look at the equations first.
import numpy as np
import matplotlib.pyplot as plt
'''The 5 functions defined below are actually similar and only vary in their indexing. They
have been created only because of the if conditions I have put in the for loop '''
def logInit(r,x):
    y[n,0]=r*x[n,0]*(1-x[n,0])
    return y[n,0]
def logPresent(r,x):
    y[n,i]=r*x[n,i]*(1-x[n,i])
    return y[n,i]
def logLast(r,x):
    y[n,L-1]=r*x[n,L-1]*(1-x[n,L-1])
    return y[n,L-1]
def logNext(r,x):
    y[n,i+1]=r*x[n,i+1]*(1-x[n,i+1])
    return y[n,i+1]
def logPrev(r,x):
    y[n,i-1]=r*x[n,i-1]*(1-x[n,i-1])
    return y[n,i-1]
# 2d array with 4 rows, 3 cols. I created this because I want to store the evaluated values of the log
# function in this y[n,i] array
y=np.ones(12).reshape(4,3)
# creating an array of random numbers between 0-1 with 4 rows 3 columns
np.random.seed(0)
x=np.random.random((4,3))
L=3
r=4
eps=0.5
for n in range(3):
    for i in range(L):
        if i==0:
            x[n+1,i]=(1-eps)*logPresent(r,x) + 0.5*eps*(logLast(r,x)+logNext(r,x))
        elif i==L-1:
            x[n+1,i]=(1-eps)*logPresent(r,x) + 0.5*eps*(logPrev(r,x) + logInit(r,x))
        elif i > 0 and i < L - 1:
            x[n+1,i]=(1-eps)*logPresent(r,x) + 0.5*eps*(logPrev(r,x) +logNext(r,x))
    print(x)
This does give an output. Here it is:
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.64589411]
[0.43758721 0.891773 0.96366276]
[0.38344152 0.79172504 0.52889492]]
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.92306303]
[0.2449672 0.49731638 0.96366276]
[0.38344152 0.79172504 0.52889492]]
[[0.5488135 0.71518937 0.60276338]
[0.94538775 0.82547604 0.92306303]
[0.2449672 0.49731638 0.29789622]
[0.75613708 0.93368134 0.52889492]]
But I'm very sure this is not what I'm looking for.
Could you please work out a correct way to iterate and loop over the CML equation in code, and suggest the changes I have to make? Thank you very much!
You'll have to think about the iterations and loops needed to simulate this equation. It might be tedious, but that's the only way you can suggest changes to my code.

Your calculations seem fine to me. You could improve the speed by using vectorization along the space dimension and by reusing your intermediate results y. I restructured your program a little, but in essence it does the same thing as before. To me the results look plausible. The image shows the random initial vector in the first row, and as time goes on (top to bottom) the coupling comes into play and little islands and patterns form.
import numpy as np
import matplotlib.pyplot as plt
L = 128 # grid size
N = 128 # time steps
r = 4
eps = 0.5
# Create random values for the initial time step
np.random.seed(0)
x = np.zeros((N+1, L))
x[0, :] = np.random.random(L)
# Create a helper matrix to save and reuse part of the calculations
y = np.zeros((N, L))
# Indices for previous, present, next position for every point on the grid
idx_present = np.arange(L) # 0, 1, ..., L-2, L-1
idx_next = (idx_present + 1) % L # 1, 2, ..., L-1, 0
idx_prev = (idx_present - 1) % L # L-1, 0, ..., L-3, L-2
def log_vector(rr, xx):
    return rr * xx * (1 - xx)
# Loop over the time steps
for n in range(N):
    # Compute y once for the whole time step and reuse it
    # to build the next time step with coupling the neighbours
    y[n, :] = log_vector(rr=r, xx=x[n, :])
    x[n+1, :] = (1-eps)*y[n,idx_present] + 0.5*eps*(y[n,idx_prev]+y[n,idx_next])
# Plot the results
plt.imshow(x)
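The periodic neighbour lookup can also be written with np.roll instead of explicit index arrays. This is a minimal sketch of one time step under the same periodic-boundary assumption; the function name cml_step and its default arguments are illustrative, not part of the answer above.

```python
import numpy as np

def cml_step(x, r=4.0, eps=0.5):
    """Advance a 1D coupled map lattice by one time step (periodic boundaries)."""
    y = r * x * (1 - x)        # logistic map applied to every site at once
    left = np.roll(y, 1)       # left[i]  == y[i - 1], wrapping around at the edge
    right = np.roll(y, -1)     # right[i] == y[i + 1], wrapping around at the edge
    return (1 - eps) * y + 0.5 * eps * (left + right)

rng = np.random.default_rng(0)
L, N = 128, 128
x = np.zeros((N + 1, L))
x[0] = rng.random(L)
for n in range(N):
    x[n + 1] = cml_step(x[n])
```

With r = 4 the logistic map keeps values in [0, 1], and the update is a weighted average of three such values, so every row of x stays in that interval.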

How do I force two arrays to be equal for use in pyplot?

I'm trying to plot a simple moving averages function but the resulting array is a few numbers short of the full sample size. How do I plot such a line alongside a more standard line that extends for the full sample size? The code below results in this error message:
ValueError: x and y must have same first dimension, but have shapes (96,) and (100,)
This is using standard matplotlib.pyplot. I've tried just deleting X values using remove and del as well as switching all arrays to numpy arrays (since that's the output format of my moving averages function) then tried adding an if condition to the append in the while loop but neither has worked.
import random
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def movingaverage(values, window):
    weights = np.repeat(1.0, window) / window
    smas = np.convolve(values, weights, 'valid')
    return smas
sampleSize = 100
min = -10
max = 10
window = 5
vX = np.array([])
vY = np.array([])
x = 0
val = 0
while x < sampleSize:
    val += (random.randint(min, max))
    vY = np.append(vY, val)
    vX = np.append(vX, x)
    x += 1
plt.plot(vX, vY)
plt.plot(vX, movingaverage(vY, window))
plt.show()
Expected results would be two lines on the same graph - one a simple moving average of the other.

Just change this line to the following:
smas = np.convolve(values, weights,'same')
The 'valid' option only convolves where the window completely covers the values array. What you want is 'same', which returns an output of the same length as the input.
Edit: This, however, also comes with its own issues as it acts like there are extra bits of data with value 0 when your window does not fully sit on top of the data. This can be ignored if chosen, as is done in this solution, but another approach is to pad the array with specific values of your choosing instead (see Mike Sperry's answer).
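To make the length difference concrete, here is a small sketch that compares the output lengths of the np.convolve modes for a 100-sample signal and a 5-sample window. The variable names are illustrative only.

```python
import numpy as np

values = np.arange(100, dtype=float)
window = 5
weights = np.ones(window) / window

# Output length for each convolution mode.
lengths = {
    mode: len(np.convolve(values, weights, mode))
    for mode in ("valid", "same", "full")
}
print(lengths)
```

Only 'same' matches the 100 x-values directly, which is why it removes the ValueError, at the cost of the zero-padded edge values mentioned above.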

Here is how you would pad a numpy array out to the desired length with NaNs (replace np.nan with other values, or replace 'constant' with another mode, depending on desired results). Note that the array must have a float dtype to hold NaN.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.pad.html
import numpy as np
bob = np.asarray([1,2,3], dtype=float)
alice = np.pad(bob,(0,100-len(bob)),'constant',constant_values=np.nan)
So in your code it would look something like this:
import random
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def movingaverage(values,window):
    weights = np.repeat(1.0,window)/window
    smas = np.convolve(values,weights,'valid')
    shorted = int((100-len(smas))/2)
    print(shorted)
    smas = np.pad(smas,(shorted,shorted),'constant',constant_values=np.nan)
    return smas
sampleSize = 100
min = -10
max = 10
window = 5
vX = np.array([])
vY = np.array([])
x = 0
val = 0
while x < sampleSize:
    val += (random.randint(min,max))
    vY = np.append(vY,val)
    vX = np.append(vX,x)
    x += 1
plt.plot(vX,vY)
plt.plot(vX,(movingaverage(vY,window)))
plt.show()

To answer your basic question, the key is to take a slice of the x-axis appropriate to the data of the moving average. Since you have a convolution of 100 data elements with a window of size 5, the result is valid for the last 96 elements. You would plot it like this:
plt.plot(vX[window - 1:], movingaverage(vY, window))
That being said, your code could stand to have some optimization done on it. For example, numpy arrays are stored in fixed size static buffers. Any time you do append or delete on them, the entire thing gets reallocated, unlike Python lists, which have amortization built in. It is always better to preallocate if you know the array size ahead of time (which you do).
Secondly, running an explicit loop is rarely necessary. You are generally better off using the under-the-hood loops implemented at the lowest level in the numpy functions instead. This is called vectorization. Random number generation, cumulative sums and incremental arrays are all fully vectorized in numpy. In a more general sense, it's usually not very effective to mix Python and numpy computational functions, including random.
Finally, you may want to consider a different convolution method. I would suggest something based on numpy.lib.stride_tricks.as_strided. This is a somewhat arcane, but very effective way to implement a sliding window with numpy arrays. I will show it here as an alternative to the convolution method you used, but feel free to ignore this part.
All in all:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
def movingaverage(values, window):
    # this step creates a view into the same buffer
    values = np.lib.stride_tricks.as_strided(values, shape=(window, values.size - window + 1), strides=values.strides * 2)
    # sum as float so the in-place division below also works for integer input
    smas = values.sum(axis=0, dtype=float)
    smas /= window # in-place to avoid temp array
    return smas
sampleSize = 100
min = -10
max = 10
window = 5
v_x = np.arange(sampleSize)
# randint excludes the upper bound, so use max + 1 to include max
v_y = np.cumsum(np.random.randint(min, max + 1, sampleSize))
plt.plot(v_x, v_y)
plt.plot(v_x[window - 1:], movingaverage(v_y, window))
plt.show()
A note on names: in Python, variable and function names are conventionally name_with_underscore. CamelCase is reserved for class names. np.random.random_integers uses inclusive bounds just like random.randint and lets you specify the number of samples to generate, but it has been deprecated since NumPy 1.11. Confusingly, its replacement np.random.randint has an exclusive upper bound, more like random.randrange, so pass max + 1 to include max.
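If your NumPy is version 1.20 or newer, numpy.lib.stride_tricks.sliding_window_view gives the same kind of windowed view without computing shapes and strides by hand. The following is a sketch of that alternative, not the code from the answer above; the function name moving_average and the sample data are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def moving_average(values, window):
    """Mean over every full window of `values`; returns len(values) - window + 1 entries."""
    values = np.asarray(values, dtype=float)
    windows = sliding_window_view(values, window)  # read-only view, shape (n - window + 1, window)
    return windows.mean(axis=-1)

rng = np.random.default_rng(0)
sample_size, window = 100, 5
v_x = np.arange(sample_size)
v_y = np.cumsum(rng.integers(-10, 11, sample_size))  # upper bound is exclusive
smoothed = moving_average(v_y, window)
x_for_smoothed = v_x[window - 1:]                    # same slicing as in the answer
```

The result has 96 entries for 100 samples and a window of 5, so it lines up with v_x[window - 1:] when plotted.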

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test on for each coordinate and not just each file individually. For example, for the first M-K test, I would want my x=46 and I would want y=data1[0,0],data2[0,0],data3[0,0]...data46[0,0]. Then to repeat this process for every single coordinate in each array. In total the M-K test would be done 27840 times and leave me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j],alpha=0.05)
print(out)
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it out, and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so that it only keeps the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.

I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test
data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)
start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.

Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1D array. I know there is probably an easier way to go about doing this, but I ultimately used the following code based on code Grr suggested using.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:,i,j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out,(27840,46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of coordinates that will be on my map (i.e. it's just 145*192), and 46 is the number of years the data spans.
I then used the following loop, modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46-year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x,out2[j,:])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown:
newdata = np.reshape(y,(145,192,2))
so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
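The flattening step above can likely be done without the np.append loop. As a sketch, assuming yrmax has shape (46, 145, 192) as described (random data stands in for the real rainfall values here), a reshape followed by a transpose produces the same (27840, 46) layout:

```python
import numpy as np

rng = np.random.default_rng(0)
yrmax = rng.random((46, 145, 192))  # stand-in for the stacked yearly arrays

# Collapse the two spatial axes, then put one coordinate per row.
out2 = yrmax.reshape(46, -1).T      # shape (27840, 46)

# Row i * 192 + j holds the 46-year series for coordinate (i, j).
series_5_7 = out2[5 * 192 + 7]

# Any per-row result of length 27840 can be mapped back onto the grid.
per_coord = out2.mean(axis=1)       # placeholder statistic, not a trend test
grid = per_coord.reshape(145, 192)
```

The mean is only a placeholder for whatever per-coordinate statistic is computed, for example the tau from scipy.stats.kendalltau in the loop above.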

Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145,192)
year_matrices = load_years() # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)
for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = [d[x][y] for d in year_matrices]
        result_matrix[x][y] = analyze_trend(coord_trend)
print(result_matrix)
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4
coord_arrays = list(zip(*year_arrays)) # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')
flat_result = list(map(analyze_trend, coord_arrays))
The example above still copies the data (and all at once, rather than a coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track internally of where it should be fetching values from. (In Python 3, the built-in zip and map are already lazy in this way.)
There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips as to ways one might deal with larger data in Python.
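The quoted remark about transposes can be checked directly. This small sketch uses a made-up 3-year by 4-coordinate array; np.shares_memory reports whether two arrays use the same underlying buffer.

```python
import numpy as np

# Toy data: 3 years (rows) by 4 flattened coordinates (columns).
years = np.arange(12, dtype=float).reshape(3, 4)

by_coord = years.T                      # shape (4, 3): one row per coordinate
shares = np.shares_memory(years, by_coord)

# Writing through the original is visible in the transposed view.
years[0, 0] = -1.0
first_coord_series = by_coord[0]
```

Because by_coord is a view, iterating over its rows gives each coordinate's time series without first building a transposed copy of the whole dataset.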

I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formulas are the same as on this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm
def countTies(x):
    '''Count number of ties in rows of a 2D matrix
    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the number of ties are inserted at (not really) arbitrary
            locations.
            The locations of tie numbers are not important, since
            they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).
    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")
    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')
    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))
    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1)+1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()
    return result
def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns
    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to data in each column in <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to data in each row in <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")
    # always put records in rows and do M-K test on each row
    if axis == 0:
        data = data.T
    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:]-data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))
    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)
    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8 # avoid dividing 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p>0.5, 1-p, p)
    if tails==2:
        p=p*2
    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44,1.18,2.69,2.08,3.66,1.72,2.82,0.72,1.46,1.30,1.35,0.54,\
2.74,1.13,2.50,1.72,2.27,2.82,1.98,2.44,2.53,2.00,1.12,2.13,1.36,\
4.9,2.94,1.75,1.69,1.88,1.31,1.76,2.17,2.38,1.16,1.39,1.36,\
1.03,1.11,1.35,1.44,1.84,1.69,3.,1.36,6.37,4.55,0.52,0.87,1.51])
# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)
# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)
import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2-t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
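For readers who want to see the core of the test on a single series before reading the 2D version, here is a minimal sketch of the Mann-Kendall S statistic (the sum of the signs of all later-minus-earlier differences). The function name mk_s_statistic is illustrative, and the sketch leaves out the tie correction and the variance step that the functions above handle.

```python
import numpy as np

def mk_s_statistic(x):
    """Sum of sign(x[j] - x[i]) over all pairs with i < j."""
    x = np.asarray(x, dtype=float)
    diff = x[None, :] - x[:, None]                      # diff[i, j] == x[j] - x[i]
    later = np.triu(np.ones(diff.shape, dtype=bool), k=1)  # True where i < j
    return int(np.sign(diff)[later].sum())

rising = mk_s_statistic([1, 2, 3, 4])
falling = mk_s_statistic([4, 3, 2, 1])
flat = mk_s_statistic([1, 1, 1])
```

A strictly rising series of length n gives the maximum value n*(n-1)/2, a strictly falling one gives the negative of that, and a constant series gives 0. The 2D function applies the same pairwise-difference idea to every row at once.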

Trying to get Python to concatenate generated column vectors to form a two dimensional array. It's not working

I'm new to Python. I've done this particular task before in MATLAB, and I'm trying to get the hang of the syntax and particular behaviour of Python, as I'll be using this language much more in future.
The task: I am taking 43,200 single data points (integers, but written as decimals) and performing a fast-fourier transform on a "window" of 600 at a time, shifting this window by 60 data points each time. Hence, this transform will output 600 fourier coefficients, 720 times - I will end up with a 600 x 720 matrix (rows, columns).
These data points are initially contained within a list and turned into a column vector after being FFT'd. The issue comes when I try to build the matrix in a loop: take the first 600 points, FFT them, and dump them in an empty array. Take the next 600, do the same thing, but now join these two columns together, then three, then four, and so on. I've been trying for several hours now, but whatever I try I cannot get it to work: it consistently outputs my "final" matrix (the one that was meant to be the generated 600 x 720) as having the exact same dimensions as each generated "block".
My code (relevant sections):
for i in range(npoints):
    newdata.append(float(newy.readline())) #Read data from file
FFT_out = [] #Initialize empty FFT output array
window_size = 600 #Number of points in data "window"
window_skip = 60 #Number of points window moves across
j = 0 #FFT count variable
for i in range(0, npoints, window_skip):
    block = np.fft.fft(newdata[i:i+window_size]) #FFT Computation of "window"
    block = block[:, np.newaxis] #turn into column vector (n, 1)
if j == 0:
    FFT_out = block
    j = 1
else:
    np.hstack((FFT_out, block))
    j = j + 1
print("Shape of FFT matrix:")
print(np.shape(FFT_out))
print("Number of times FFT completed:")
print(j)
At this point, I'm willing to believe it's a fundamental flaw on my understanding of how Python does matrices or deals with arrays. I've tried reading about it, but I still cannot see where I'm going wrong. Any help would be greatly appreciated!

The first thing to note is that Python uses indentation to form blocks, so as posted you would only ever assign once to FFT_out and never actually call np.hstack.
Then assuming that this was in fact only a cut&paste issue when posting your question, you should note that hstack returns a concatenation of its arguments without actually modifying them. To accumulate the concatenation, you should then assign the result back to FFT_out:
FFT_out = np.hstack((FFT_out, block))
You should then be able to get a 600 x 720 matrix with:
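for i in range(0, npoints, window_skip):
    # n=window_size zero-pads the trailing windows that hold fewer than 600 points,
    # so every block has the same height and can be stacked
    block = np.fft.fft(newdata[i:i+window_size], n=window_size)
    block = block[:, np.newaxis] #turn into column vector (n, 1)
    if j == 0:
        FFT_out = block
        j = 1
    else:
        FFT_out = np.hstack((FFT_out, block))
        j = j + 1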
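Repeated np.hstack calls copy the growing matrix on every iteration. A common alternative is to collect the columns in a list and stack once at the end. The sketch below assumes 43,200 points as in the question and uses random stand-in data. Zero-padding the short trailing windows through the n argument of np.fft.fft is an assumption made here so that all 720 columns have 600 rows; skipping the incomplete windows would be another reasonable choice.

```python
import numpy as np

rng = np.random.default_rng(0)
npoints = 43_200
newdata = rng.integers(0, 100, npoints).astype(float)  # stand-in for the file contents

window_size = 600  # number of points in each data window
window_skip = 60   # number of points the window moves each step

# One FFT per window start; n=window_size zero-pads windows cut short at the end.
columns = [
    np.fft.fft(newdata[i:i + window_size], n=window_size)
    for i in range(0, npoints, window_skip)
]

# Stack the 1D results side by side as columns.
FFT_out = np.stack(columns, axis=1)
print(FFT_out.shape)
```

Each column of FFT_out holds the coefficients of one window, so column k corresponds to the window starting at sample k * window_skip.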
