How to reduce memory usage in xarray multidimensional rolling aggregation? - python

Given a multidimensional xarray DataArray, I would like to perform multidimensional rolling aggregation. For example, if I have a DataArray that is m x n x k, I would like to be able to roll the data along the m axis, and aggregate away either the n or k dimension.
I have an approach that gives me the correct answer but seems not to scale at all. If my window sizes are small, it is feasible, but in the case of a 5000 x 2000 x 10 DataArray, rolling along the 5000 length dimension with a long window explodes memory with my current approach.
import xarray as xr
import numpy as np
import pandas as pd
drange = pd.date_range(start='2000-01-01', freq='D', periods=5000)
x = ['x%i' % i for i in range(1, 3001)]
y = ['y%i' % i for i in range(1,11)]
raw_dat = np.random.randn(len(drange), len(x), len(y))
da = xr.DataArray(raw_dat, coords={'time': drange, 'x': x, 'y': y}, dims=['time', 'x', 'y'])
new_da = da.rolling(time=20).construct('window_dim')
final_da = new_da.stack(combo=['x', 'window_dim']).std('combo')
I have also tried the following; it gives the same result but also runs out of memory when the rolling window is large.
new_da = da.rolling(time=20).construct('window_dim')
final_da = new_da.std(['x', 'window_dim'])
The above code works and on my machine takes roughly 35 seconds to perform the stack and aggregation, but as window size increases, memory usage explodes. I am wondering if there is a smarter way to do this type of aggregation.
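One variant I am experimenting with (a sketch only; it assumes dask is installed, and the chunk size is arbitrary) is to back the DataArray with dask chunks so the constructed window array stays lazy and is reduced block by block instead of being materialized at once:
# Hedged sketch: chunk along 'x' (leaving 'time' whole so the rolling window
# stays inside one block), build the lazy windowed view, then reduce.
da_chunked = da.chunk({'x': 100})
rolled = da_chunked.rolling(time=20).construct('window_dim')
final_da = rolled.std(['x', 'window_dim']).compute()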

Related

Memory leak in pandas MultiIndex (with minimum reproducible example)

Earlier today I posted this question. I now have an MRE that can reproduce the issue.
In short, this piece of code seems to use much more memory than it should. The idea is to average some number of time traces into a certain number of bins; the traces are arranged in a matrix using a pd.MultiIndex.
import numpy as np
import pandas as pd

# Len of each trace
trace_len = 2500
# Number of Bins
bin_num = 300
# Traces matrix dimensions
L1 = 70
L2 = 100

index = pd.MultiIndex.from_product([range(L1), range(L2)])
traces = np.random.random((L1 * L2, trace_len))
traces_df = pd.DataFrame(traces, index=index)

# Lets make 300 random bins
bins = [index.to_frame().sample(frac=1, replace=True) for _ in range(bin_num)]
bins = [pd.MultiIndex.from_frame(bin) for bin in bins]

def bin_single(traces: pd.DataFrame, bin_idx: pd.Index) -> np.array:
    """ Cumulative sum of all shots that are both in traces and bin_idx"""
    bin_idx = bin_idx.intersection(traces.index)
    binned = traces.reindex(bin_idx, copy=False)
    return binned.sum(axis=0, skipna=False).to_numpy()

output = np.empty((bin_num, trace_len))
for n, bin in enumerate(bins):
    output[n] = bin_single(traces_df, bin)
print(output.nbytes)
This is the memory allocation over time (memory-profile plot omitted):
The issue cannot be due to lazy allocation of output, since that array is only 6 MB, as reported by output.nbytes, while the overall memory allocation grows by more than 200 MB over the for loop.
I think the problem might be hidden in the pd.MultiIndex usage, since this very similar program that does not use MultiIndex does not show the memory increase:
import numpy as np
import pandas as pd

# Len of each trace
trace_len = 2500
# Number of Bins
bin_num = 300
# Traces matrix dimensions
L1 = 70
L2 = 100

# index = pd.MultiIndex.from_product([range(L1), range(L2)])
traces = np.random.random((L1 * L2, trace_len))
traces_df = pd.DataFrame(traces)
index = traces_df.index

# Lets make 300 random bins
bins = [index.to_frame().sample(frac=1, replace=True) for _ in range(bin_num)]
bins = [pd.MultiIndex.from_frame(bin) for bin in bins]

def bin_single(traces: pd.DataFrame, bin_idx: pd.Index) -> np.array:
    """ Cumulative sum of all shots that are both in traces and bin_idx"""
    bin_idx = bin_idx.intersection(traces.index)
    binned = traces.reindex(bin_idx, copy=False)
    return binned.sum(axis=0, skipna=False).to_numpy()

output = np.empty((bin_num, trace_len))
for n, bin in enumerate(bins):
    output[n] = bin_single(traces_df, bin)
print(output.nbytes)
I tend to think that there might be a bug somewhere in pd.MultiIndex, but maybe I'm just overlooking something.
Thanks a lot!
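In case it is useful, here is a workaround sketch I could fall back on (my own variant, not profiled against the leak): it converts each bin to integer row positions once via Index.get_indexer and aggregates on the raw ndarray, so no reindexed MultiIndex frames are created per bin.
def bin_single_positional(traces: pd.DataFrame, bin_idx: pd.Index) -> np.ndarray:
    # Positions of bin_idx labels inside traces.index; -1 marks labels not present.
    pos = traces.index.get_indexer(bin_idx)
    # Drop missing labels and de-duplicate, mirroring Index.intersection above.
    pos = np.unique(pos[pos >= 0])
    return traces.to_numpy()[pos].sum(axis=0)

output = np.empty((bin_num, trace_len))
for n, bin in enumerate(bins):
    output[n] = bin_single_positional(traces_df, bin)
print(output.nbytes)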

Parallelize nested large `(15e4 * 15e4)` for loop to get a pairwise matrix

I am trying to parallelize the following code, which creates a pairwise result for each pair of rows, as shown below.
import pandas as pd

def get_custom_value(i, j):
    first = df[df['id'] == i]
    second = df[df['id'] == j]
    return int(first['val_1']) * int(second['val_1']) + \
        int(first['val_2']) * int(second['val_2'])

df = pd.DataFrame(
    {
        'id': range(4),
        'val_1': [3, 4, 5, 1],
        'val_2': [2, 3, 1, 1]
    }
)

n = df.shape[0]
result = []
for i in range(n):
    for j in range(i + 1, n):
        temp_value = get_custom_value(i, j)
        result.append([i, j, temp_value])
        if len(result) > 1e5:
            # Store it in a local file and reset the result object.
            # Assume some code here that writes to a local file.
            result = []
print(result)
What have I tried? Below is the code. It hangs, without any error.
import itertools
import multiprocessing
paramlist = list(itertools.combinations(df.id, 2))
pool = multiprocessing.Pool(processes = 2)
result = pool.map(get_custom_value, paramlist)
print(result)
Can I use dask for this?
The actual data has more than 150,000 records, i.e., the final result will have about (150,000 * 150,000) / 2 pairs/rows. Given the huge size of the result object, I have a condition which, when satisfied, writes the partial result to a file and resets the result object. Hence, the actual result object will not exceed my RAM.
The algorithm used is very inefficient. Indeed, both df['id'] == i and df['id'] == j iterate over the whole id column containing 150_000 items in your real-world use-case. Thus, your algorithm runs in O(n^3) time and performs roughly 3_375_000_000_000_000 comparisons, while the best algorithm runs in O(n^2) time.
Moreover, CPython loops are very slow and you should avoid using them as much as possible. Fetching Pandas dataframe cells by name is very slow too. Instead, you can use vectorized Pandas/Numpy functions.
Additionally, the output is not efficient either: CPython lists are a bit slow (because of dynamic reference-counted objects) and storing the (i, j) values consumes three times more memory. You can store the result in a matrix, possibly a sparse one, or alternatively in a list of compact Numpy arrays.
Furthermore, bigger data structures are generally slower. If you want a computation to be done very quickly, you generally need to make it fit in the CPU caches (a few MiB). Thus, to process your dataframe efficiently you certainly need to compute it in place.
Here is a relatively efficient solution using Numpy:
import numpy as np
val_1 = np.ascontiguousarray(df['val_1'].to_numpy())
val_2 = np.ascontiguousarray(df['val_2'].to_numpy())
result = val_1.reshape(-1, 1) * val_1 + val_2.reshape(-1, 1) * val_2
It produces an n² matrix where the (i, j) item can be found using result[i, j]. reshape(-1, 1) is used to turn the horizontal vector into a vertical one and thus benefit from Numpy broadcasting. Note that you can filter the upper-triangular part using np.triu(result, 1).
You can also generate the result line by line so as not to allocate a huge array:
val_1 = np.ascontiguousarray(df['val_1'].to_numpy())
val_2 = np.ascontiguousarray(df['val_2'].to_numpy())

for i in range(n - 1):
    first_val_1 = val_1[i]
    first_val_2 = val_2[i]
    line = first_val_1 * val_1[i+1:] + first_val_2 * val_2[i+1:]
    # Store the line if needed with the i value so to know where it is
If you really want to generate an inefficient list from the Numpy array lines, then you can do that with np.vstack((np.repeat(i, n-i-1), np.arange(i+1, n), line)).T.tolist(). But I strongly advise you not to do that (there is certainly no need to use lists). Note that you can load/store Numpy arrays efficiently using np.load and np.save.
Here are the performances of the different approaches on my machine (with an i5-9600KF processor, 2 DDR4 channels reaching 40 GiB/s and a fast NVMe SSD that can practically write big files at 800 MiB/s) on a random Pandas dataframe with 15_000 records:
- Initial code: 60500 seconds (estimated)
- Numpy matrix: 0.71 second
- Numpy line-by-line: 0.24 second
- Storing all the lines in a compact way on my SSD: 0.50 second (estimated)
Thus, the Numpy line-by-line solution is about 250_000 times faster than the initial code! All of this without using multiple cores. In fact, using multiple cores will not be much faster in this case because the RAM is a limited shared resource and file storage is not much faster in parallel on most machines (in fact, HDDs are slower when used in parallel because they are inherently sequential). If you really want to do that, then multiprocessing is definitely not the right tool. Please consider using Numba or Cython instead.
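To make that last suggestion concrete, here is a minimal Numba sketch (my own illustration, not part of the timings above): it reproduces the line-by-line pairwise computation and collapses each line with a placeholder reduction (a global minimum), assuming val_1 and val_2 are the int64 arrays built earlier; swap the reduction for whatever you actually need.
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def pairwise_min(val_1, val_2):
    # val_1, val_2: 1-D int64 arrays (e.g. df['val_1'].to_numpy().astype(np.int64))
    n = val_1.shape[0]
    partial = np.empty(n - 1, dtype=np.int64)
    for i in nb.prange(n - 1):
        # best value in row i of the upper triangle (j > i)
        best = val_1[i] * val_1[i + 1] + val_2[i] * val_2[i + 1]
        for j in range(i + 2, n):
            v = val_1[i] * val_1[j] + val_2[i] * val_2[j]
            if v < best:
                best = v
        partial[i] = best
    return partial.min()  # placeholder reduction: overall minimum pairwise value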
Yes, you can use dask for this. However, I would recommend adding a reduction to the computation that directly computes the quantity you are interested in (summary statistics [mean, median, ...], max/min over all values, etc.). Otherwise, you are looking to write something in the ballpark of 250 GB of binary data, which isn't the most efficient use of computing time and disk space (unless your use-case demands it).
import pandas as pd
import numpy as np
import dask.array as da

# generate fake data of desired size
num_records = int(150000)
df = pd.DataFrame(
    {
        # no need for an ID column, because you have a range index by default
        "val_1": np.random.randint(0, 500, size=num_records),
        "val_2": np.random.randint(0, 500, size=num_records),
    }
)

# index magic to walk over the upper triangle of a fortran-contiguous array without the diagonal
# Note: this is a different order than you use in your for loops, but
# order doesn't matter here.
N = len(df) - 1
flat_index = da.arange(N * (N + 1) // 2, chunks=(1e6,))
idx_j = (da.floor((da.sqrt(1 + 8 * flat_index) - 1)) // 2).astype(int)
idx_i = flat_index - (idx_j * (idx_j + 1) // 2)
idx_j = idx_j + 1

val_1 = df.loc[:, "val_1"].to_numpy().astype(int)
val_2 = df.loc[:, "val_2"].to_numpy().astype(int)
custom_value_vector = (
    da.take(val_1, idx_i) * da.take(val_1, idx_j)
    + da.take(val_2, idx_i) * da.take(val_2, idx_j)
)
result_vector = da.stack([idx_i, idx_j, custom_value_vector], axis=-1)
rechunked_result = da.rechunk(result_vector, (5e7, 3))

# now you can reduce to your desired summary statistic
# saving uses da.to_npy_stack
reduced = da.min(rechunked_result)
The above will run in a few seconds, but that is just setting up the computational graph. You will still have to run it:
from dask.diagnostics import ProgressBar

with ProgressBar():
    value = reduced.compute()
# [########################################] | 100% Completed | 3min 3.8s
If you can't vectorize, you can still use dask to do your computation. However, be aware that things will be significantly slower:
import pandas as pd
import numpy as np
import dask.array as da
from dask.diagnostics import ProgressBar

# generate fake data of desired size
num_records = int(150000)
df = pd.DataFrame(
    {
        "id": np.arange(num_records),
        "val_1": np.random.randint(0, 500, size=num_records),
        "val_2": np.random.randint(0, 500, size=num_records),
    }
)

# index magic to walk over the upper triangle of a fortran-contiguous array without the diagonal
# Note: this is a different order than you use in your for loops, but
# order doesn't matter here.
N = len(df) - 1
flat_index = da.arange(N * (N + 1) // 2, chunks=(1e6,))
idx_j = (da.floor((da.sqrt(1 + 8 * flat_index) - 1)) // 2).astype(int)
idx_i = flat_index - (idx_j * (idx_j + 1) // 2)
idx_j = idx_j + 1

def get_custom_value(multi_idx):
    i, j = multi_idx  # I added this
    first = df[df['id'] == i]
    second = df[df['id'] == j]
    return int(first['val_1']) * int(second['val_1']) + \
        int(first['val_2']) * int(second['val_2'])

multi_idx = da.stack((idx_i, idx_j), axis=-1)
custom_value = da.apply_along_axis(get_custom_value, -1, multi_idx, dtype=int, shape=tuple())
result = da.stack([idx_i, idx_j, custom_value])
reduced = da.min(result)

with ProgressBar():
    value = reduced.compute(scheduler="processes")
I didn't run this to completion, so I can't share any timings. (I stopped after around 10 minutes. The point here is: it works, but you will need to bring a healthy amount of patience.)
Let's reframe the problem as building and processing the Cartesian product of two datasets (or of a dataset with itself).
In a distributed setting it is usually solved as follows:
- Let's call our datasets A and B.
- We partition the bigger dataset over the cluster.
- We broadcast (replicate) the smaller dataset to each machine in the cluster.
- We calculate and process the Cartesian product on each machine.
If the smaller dataset is not "that small" and we cannot broadcast it, we can do the work in iterations, each time broadcasting another partition of it.
In our case, we have one dataset, and it is small. So what should be done is:
- We should partition our dataframe over the cluster by just creating a dask dataframe with enough partitions. Let's call it "A_Partitioned".
- We should broadcast the same dataset over the cluster using the scatter function (http://distributed.dask.org/en/stable/locality.html). Let's call it A_Broadcasted.
- Now we can do map_partitions over A_Partitioned, where it will do nested loops over each partition and A_Broadcasted, as in the sketch below.
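Here is a minimal sketch of that plan (my own illustration, not tested at the 150,000-row scale): it assumes a dask.distributed cluster is reachable, keeps the A_Partitioned/A_Broadcasted names from the list above, and uses a placeholder per-row reduction (the minimum pairwise value) inside map_partitions instead of storing the full Cartesian product.
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # or point this at an existing cluster

df = pd.DataFrame({
    "val_1": np.random.randint(0, 500, size=10_000),
    "val_2": np.random.randint(0, 500, size=10_000),
})

# Partition the dataframe over the cluster ...
A_Partitioned = dd.from_pandas(df, npartitions=20)
# ... and broadcast a full copy of it to every worker.
A_Broadcasted = client.scatter(df, broadcast=True)

def process_partition(part, whole):
    # Nested loop over this partition and the broadcasted whole dataset.
    # Only a per-row reduction (the minimum pairwise value) is kept here,
    # so the Cartesian product is never materialized.
    v1 = whole["val_1"].to_numpy()
    v2 = whole["val_2"].to_numpy()
    out = np.empty(len(part), dtype=np.int64)
    for k, (a1, a2) in enumerate(zip(part["val_1"].to_numpy(), part["val_2"].to_numpy())):
        out[k] = (a1 * v1 + a2 * v2).min()
    return pd.Series(out, index=part.index)

result = A_Partitioned.map_partitions(process_partition, A_Broadcasted, meta=(None, "int64"))
print(result.compute().min())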

How do I force two arrays to be equal for use in pyplot?

I'm trying to plot a simple moving averages function but the resulting array is a few numbers short of the full sample size. How do I plot such a line alongside a more standard line that extends for the full sample size? The code below results in this error message:
ValueError: x and y must have same first dimension, but have shapes (96,) and (100,)
This is using standard matplotlib.pyplot. I've tried deleting X values using remove and del, as well as switching all arrays to numpy arrays (since that's the output format of my moving-averages function), and then tried adding an if condition to the append in the while loop, but neither has worked.
import random
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def movingaverage(values, window):
    weights = np.repeat(1.0, window) / window
    smas = np.convolve(values, weights, 'valid')
    return smas

sampleSize = 100
min = -10
max = 10
window = 5
vX = np.array([])
vY = np.array([])
x = 0
val = 0

while x < sampleSize:
    val += random.randint(min, max)
    vY = np.append(vY, val)
    vX = np.append(vX, x)
    x += 1

plt.plot(vX, vY)
plt.plot(vX, movingaverage(vY, window))
plt.show()
Expected results would be two lines on the same graph - one a simple moving average of the other.
Just change this line to the following:
smas = np.convolve(values, weights,'same')
The 'valid' option only produces output where the window completely covers the values array. What you want is 'same', which does what you are looking for.
Edit: This, however, also comes with its own issues, as it acts as if there were extra data with value 0 wherever the window does not fully sit on top of the data. You can choose to ignore this, as is done in this solution, but another approach is to pad the array with specific values of your choosing instead (see Mike Sperry's answer).
Here is how you would pad a numpy array out to the desired length with NaNs (replace np.nan with other values, or replace 'constant' with another mode, depending on the desired result):
https://docs.scipy.org/doc/numpy/reference/generated/numpy.pad.html
import numpy as np

bob = np.asarray([1.0, 2.0, 3.0])
alice = np.pad(bob, (0, 100 - len(bob)), 'constant', constant_values=(np.nan, np.nan))
So in your code it would look something like this:
import random
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def movingaverage(values, window):
    weights = np.repeat(1.0, window) / window
    smas = np.convolve(values, weights, 'valid')
    shorted = int((100 - len(smas)) / 2)
    print(shorted)
    smas = np.pad(smas, (shorted, shorted), 'constant', constant_values=(np.nan, np.nan))
    return smas

sampleSize = 100
min = -10
max = 10
window = 5
vX = np.array([])
vY = np.array([])
x = 0
val = 0

while x < sampleSize:
    val += random.randint(min, max)
    vY = np.append(vY, val)
    vX = np.append(vX, x)
    x += 1

plt.plot(vX, vY)
plt.plot(vX, movingaverage(vY, window))
plt.show()
To answer your basic question, the key is to take a slice of the x-axis appropriate to the data of the moving average. Since you have a convolution of 100 data elements with a window of size 5, the result is valid for the last 96 elements. You would plot it like this:
plt.plot(vX[window - 1:], movingaverage(vY, window))
That being said, your code could stand to have some optimization done on it. For example, numpy arrays are stored in fixed size static buffers. Any time you do append or delete on them, the entire thing gets reallocated, unlike Python lists, which have amortization built in. It is always better to preallocate if you know the array size ahead of time (which you do).
Secondly, running an explicit loop is rarely necessary. You are generally better off using the under-the-hood loops implemented at the lowest level in the numpy functions instead. This is called vectorization. Random number generation, cumulative sums and incremental arrays are all fully vectorized in numpy. In a more general sense, it's usually not very effective to mix Python and numpy computational functions, including random.
Finally, you may want to consider a different convolution method. I would suggest something based on numpy.lib.stride_tricks.as_strided. This is a somewhat arcane, but very effective way to implement a sliding window with numpy arrays. I will show it here as an alternative to the convolution method you used, but feel free to ignore this part.
All in all:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

def movingaverage(values, window):
    # this step creates a view into the same buffer
    values = np.lib.stride_tricks.as_strided(values, shape=(window, values.size - window + 1), strides=values.strides * 2)
    smas = values.sum(axis=0).astype(float)  # cast so the in-place division below works on integer input
    smas /= window  # in-place to avoid a temporary array
    return smas

sampleSize = 100
min = -10
max = 10
window = 5

v_x = np.arange(sampleSize)
v_y = np.cumsum(np.random.random_integers(min, max, sampleSize))

plt.plot(v_x, v_y)
plt.plot(v_x[window - 1:], movingaverage(v_y, window))
plt.show()
A note on names: in Python, variable and function names are conventionally written as name_with_underscore; CamelCase is reserved for class names. np.random.random_integers uses inclusive bounds just like random.randint, but allows you to specify the number of samples to generate (it is deprecated in modern NumPy in favor of np.random.randint with an adjusted upper bound, e.g. np.random.randint(min, max + 1, sampleSize)). Confusingly, np.random.randint has an exclusive upper bound, more like random.randrange.

Mean and Standard deviation across multiple arrays using numpy

I'm importing 2-D matrix data for a multi-year climate time series, testing on a 5-year annual dataset. I've created a for loop to import the 2-D matrix data by year into a series of 5 separate arrays of size (1500, 3600). I append the matrix time-series data into a single combined (5, 1500, 3600) array, with each year being one dimension in the array. I then run np.mean and np.std to create (1500, 3600) matrices holding the 5-year mean and standard deviation of the data at each matrix point. Code is below. The numbers look to be coming out correctly when I test this, but I would like to know:
Is there a faster way to do this? I will eventually need to run this type of analysis for daily data over an 18-year time span, which would mean building and operating on a (6570, 1500, 3600) array. Any suggestions? I'm fairly new to Python and still finding my way.
StartYear = 2009
EndYear = 2014
for x in range(StartYear, EndYear):
    name = "/dir/climate_variable" + str(x) + ".gz"
    Q_WBM = rg.grid(name)
    Q_WBM.Load()
    q_wbm = Q_WBM.Data  # .flatten()
    q_wbm[np.isnan(q_wbm)] = 0
    if x == StartYear:
        QTS_array = q_wbm
    else:
        QTS_array = np.append(QTS_array, q_wbm, axis=0)
DischargeMEAN = np.mean(QTS_array, axis=0)
DischargeSTD = np.std(QTS_array, axis=0)
Unlike list.append, which is amortized O(1), numpy.append is pretty much O(n), meaning your loop is O(n^2) and will be no fun to use on your full problem.
On top of that, 6570 x 1500 x 3600 x itemsize is actually quite large and won't fit into memory unless you have a lot of it.
If all you want are mean and SD, then you can sidestep both these problems by summing on the fly. You would replace the end of your code by something like
if x == StartYear:
    mom1 = q_wbm
    mom2 = q_wbm**2
else:
    mom1 += q_wbm
    mom2 += q_wbm**2

n = EndYear - StartYear  # number of years accumulated
DischargeMEAN = mom1 / n
DischargeSTD = np.sqrt(mom2 / n - DischargeMEAN**2)
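For completeness, here is a quick self-contained sanity check of the two-moment trick (synthetic arrays stand in for the yearly grids; the names are mine, not from the question):
import numpy as np

years = [np.random.rand(1500, 3600) for _ in range(5)]  # stand-ins for q_wbm per year
mom1 = np.zeros_like(years[0])
mom2 = np.zeros_like(years[0])
for y in years:
    mom1 += y
    mom2 += y**2
n = len(years)
mean = mom1 / n
std = np.sqrt(mom2 / n - mean**2)

# Matches the all-at-once computation without ever stacking the years.
assert np.allclose(mean, np.mean(years, axis=0))
assert np.allclose(std, np.std(years, axis=0))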

Using Mann Kendall in python with a lot of data

I have a set of 46 years worth of rainfall data. It's in the form of 46 numpy arrays each with a shape of 145, 192, so each year is a different array of maximum rainfall data at each lat and lon coordinate in the given model.
I need to create a global map of tau values by doing an M-K test (Mann-Kendall) for each coordinate over the 46 years.
I'm still learning python, so I've been having trouble finding a way to go through all the data in a simple way that doesn't involve me making 27840 new arrays for each coordinate.
So far I've looked into how to use scipy.stats.kendalltau and using the definition from here: https://github.com/mps9506/Mann-Kendall-Trend
EDIT:
To clarify and add a little more detail, I need to perform a test for each coordinate, not just for each file individually. For example, for the first M-K test, I would want my x = 46 and my y = data1[0,0], data2[0,0], data3[0,0] ... data46[0,0]. Then this process is repeated for every single coordinate in each array. In total the M-K test would be done 27840 times, leaving me with 27840 tau values that I can then plot on a global map.
EDIT 2:
I'm now running into a different problem. Going off of the suggested code, I have the following:
for i in range(145):
    for j in range(192):
        out[i,j] = mk_test(yrmax[:,i,j], alpha=0.05)
print out
I used numpy.stack to stack all 46 arrays into a single array (yrmax) with shape (46L, 145L, 192L). I've tested it out, and it calculates p and tau correctly if I change the code from out[i,j] to just out. However, doing this messes up the for loop so it only keeps the results from the last coordinate instead of all of them. And if I leave the code as it is above, I get the error: TypeError: list indices must be integers, not tuple
My first guess was that it has to do with mk_test and how the information is supposed to be returned in the definition. So I've tried altering the code from the link above to change how the data is returned, but I keep getting errors relating back to tuples. So now I'm not sure where it's going wrong and how to fix it.
EDIT 3:
One more clarification I thought I should add. I've already modified the definition in the link so it returns only the two number values I want for creating maps, p and z.
I don't think this is as big an ask as you may imagine. From your description it sounds like you don't actually want the scipy kendalltau, but the function in the repository you posted. Here is a little example I set up:
from time import time
import numpy as np
from mk_test import mk_test

data = np.array([np.random.rand(145, 192) for _ in range(46)])
mk_res = np.empty((145, 192), dtype=object)

start = time()
for i in range(145):
    for j in range(192):
        mk_res[i, j] = mk_test(data[:, i, j], alpha=0.05)
print(f'Elapsed Time: {time() - start} s')
Elapsed Time: 35.21990394592285 s
My system is a MacBook Pro 2.7 GHz Intel Core I7 with 16 GB Ram so nothing special.
Each entry in the mk_res array (shape 145, 192) corresponds to one of your coordinate points and contains an entry like so:
array(['no trend', 'False', '0.894546014835', '0.132554125342'], dtype='<U14')
One thing that might be useful would be to modify the code in mk_test.py to return all numerical values. So instead of 'no trend'/'positive'/'negative' you could return 0/1/-1, and 1/0 for True/False and then you wouldn't have to worry about the whole object array type. I don't know what kind of analysis you might want to do downstream but I imagine that would preemptively circumvent any headaches.
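As a sketch of that suggestion (my own wrapper, assuming mk_test returns a (trend, h, p, z) tuple as the sample entry above suggests, with trend strings as shown), you could map the outputs to plain numeric arrays without touching mk_test itself:
import numpy as np

# The -1/0/1 encoding is an arbitrary choice made here, not something mk_test defines.
trend_codes = {'negative': -1, 'no trend': 0, 'positive': 1}

trend_map = np.empty((145, 192), dtype=np.int8)
p_map = np.empty((145, 192))
z_map = np.empty((145, 192))
for i in range(145):
    for j in range(192):
        t, h, p, z = mk_test(data[:, i, j], alpha=0.05)  # data and mk_test as in the example above
        trend_map[i, j] = trend_codes[t]
        p_map[i, j] = float(p)
        z_map[i, j] = float(z)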
Thanks to the answers provided and some work I was able to work out a solution that I'll provide here for anyone else that needs to use the Mann-Kendall test for data analysis.
The first thing I needed to do was flatten the original array I had into a 1-D array. I know there is probably an easier way to do this, but I ultimately used the following code, based on the code Grr suggested.
x = 46
out1 = np.empty(x)
out = np.empty((0))
for i in range(145):
    for j in range(192):
        out1 = yrmax[:, i, j]
        out = np.append(out, out1, axis=0)
Then I reshaped the resulting array (out) as follows:
out2 = np.reshape(out, (27840, 46))
I did this so my data would be in a format compatible with scipy.stats.kendalltau. 27840 is the total number of coordinates on my map (i.e. it is just 145*192) and 46 is the number of years the data spans.
I then used the following loop, modified from Grr's code, to find Kendall's tau and its respective p-value at each latitude and longitude over the 46-year period.
x = range(46)
y = np.zeros((0))
for j in range(27840):
    b = sc.stats.kendalltau(x, out2[j, :])
    y = np.append(y, b, axis=0)
Finally, I reshaped the data one more time, as shown: newdata = np.reshape(y, (145, 192, 2)), so the final array is in a suitable format to be used to create a global map of both tau and p-values.
Thanks everyone for the assistance!
Depending on your situation, it might just be easiest to make the arrays.
You won't really need them all in memory at once (not that it sounds like a terrible amount of data). Something like this only has to deal with one "copied out" coordinate trend at once:
SIZE = (145, 192)

year_matrices = load_years()  # list of one 145x192 array per year
result_matrix = numpy.zeros(SIZE)

for x in range(SIZE[0]):
    for y in range(SIZE[1]):
        coord_trend = map(lambda d: d[x][y], year_matrices)
        result_matrix[x][y] = analyze_trend(coord_trend)

print result_matrix
Now, there are things like itertools.izip that could help you if you really want to avoid actually copying the data.
Here's a concrete example of how Python's "zip" might work with data like yours (although as if you'd used ndarray.flatten on each year):
year_arrays = [
    ['y0_coord0_val', 'y0_coord1_val', 'y0_coord2_val', 'y0_coord3_val'],
    ['y1_coord0_val', 'y1_coord1_val', 'y1_coord2_val', 'y1_coord3_val'],
    ['y2_coord0_val', 'y2_coord1_val', 'y2_coord2_val', 'y2_coord3_val'],
]
assert len(year_arrays) == 3
assert len(year_arrays[0]) == 4

coord_arrays = zip(*year_arrays)  # i.e. `zip(year_arrays[0], year_arrays[1], year_arrays[2])`
# original data is essentially transposed
assert len(coord_arrays) == 4
assert len(coord_arrays[0]) == 3
assert coord_arrays[0] == ('y0_coord0_val', 'y1_coord0_val', 'y2_coord0_val')
assert coord_arrays[1] == ('y0_coord1_val', 'y1_coord1_val', 'y2_coord1_val')
assert coord_arrays[2] == ('y0_coord2_val', 'y1_coord2_val', 'y2_coord2_val')
assert coord_arrays[3] == ('y0_coord3_val', 'y1_coord3_val', 'y2_coord3_val')

flat_result = map(analyze_trend, coord_arrays)
The example above still copies the data (and all at once, rather than one coordinate at a time!) but hopefully shows what's going on.
Now, if you replace zip with itertools.izip and map with itertools.imap, then the copies needn't occur: itertools wraps the original arrays and keeps track of where it should be fetching values from internally.
There's a catch, though: to take advantage of itertools you need to access the data only sequentially (i.e. through iteration). In your case, it looks like the code at https://github.com/mps9506/Mann-Kendall-Trend/blob/master/mk_test.py might not be compatible with that. (I haven't reviewed the algorithm itself to see if it could be.)
Also please note that in the example I've glossed over the numpy ndarray stuff and just shown flat coordinate arrays. It looks like numpy has some of its own options for handling this instead of itertools, e.g. this answer says "Taking the transpose of an array does not make a copy". Your question was somewhat general, so I've tried to give some general tips on ways one might deal with larger data in Python.
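As a small illustration of that numpy route (my own sketch, with a placeholder in place of the real M-K test): stacking the years and moving the time axis gives a view, so iterating per coordinate does not copy the yearly grids.
import numpy as np

analyze_trend = lambda ts: ts.mean()  # placeholder for the real M-K wrapper

yrmax = np.stack([np.random.rand(145, 192) for _ in range(46)])  # stand-in for the real data
by_coord = np.moveaxis(yrmax, 0, -1)  # shape (145, 192, 46); a view, no copy
result_matrix = np.empty((145, 192))
for i in range(145):
    for j in range(192):
        result_matrix[i, j] = analyze_trend(by_coord[i, j])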
I ran into the same task and have managed to come up with a vectorized solution using numpy and scipy.
The formula are the same as in this page: https://vsp.pnnl.gov/help/Vsample/Design_Trend_Mann_Kendall.htm.
The trickiest part is to work out the adjustment for the tied values. I modified the code as in this answer to compute the number of tied values for each record, in a vectorized manner.
Below are the 2 functions:
import copy
import numpy as np
from scipy.stats import norm

def countTies(x):
    '''Count number of ties in rows of a 2D matrix

    Args:
        x (ndarray): 2d matrix.
    Returns:
        result (ndarray): 2d matrix with same shape as <x>. In each
            row, the numbers of ties are inserted at (not really) arbitrary
            locations.
            The locations of the tie numbers are not important, since
            they will be subsequently put into a formula of sum(t*(t-1)*(2t+5)).

    Inspired by: https://stackoverflow.com/a/24892274/2005415.
    '''
    if np.ndim(x) != 2:
        raise Exception("<x> should be 2D.")

    m, n = x.shape
    pad0 = np.zeros([m, 1]).astype('int')

    x = copy.deepcopy(x)
    x.sort(axis=1)
    diff = np.diff(x, axis=1)
    cated = np.concatenate([pad0, np.where(diff==0, 1, 0), pad0], axis=1)
    absdiff = np.abs(np.diff(cated, axis=1))

    rows, cols = np.where(absdiff==1)
    rows = rows.reshape(-1, 2)[:, 0]
    cols = cols.reshape(-1, 2)
    counts = np.diff(cols, axis=1) + 1
    result = np.zeros(x.shape).astype('int')
    result[rows, cols[:,1]] = counts.flatten()

    return result

def MannKendallTrend2D(data, tails=2, axis=0, verbose=True):
    '''Vectorized Mann-Kendall tests on 2D matrix rows/columns

    Args:
        data (ndarray): 2d array with shape (m, n).
    Keyword Args:
        tails (int): 1 for 1-tail, 2 for 2-tail test.
        axis (int): 0: test trend in each column. 1: test trend in each
            row.
    Returns:
        z (ndarray): If <axis> = 0, 1d array with length <n>, standard scores
            corresponding to each column in <data>.
            If <axis> = 1, 1d array with length <m>, standard scores
            corresponding to each row in <data>.
        p (ndarray): p-values corresponding to <z>.
    '''
    if np.ndim(data) != 2:
        raise Exception("<data> should be 2D.")

    # always put records in rows and do M-K test on each row
    if axis == 0:
        data = data.T

    m, n = data.shape
    mask = np.triu(np.ones([n, n])).astype('int')
    mask = np.repeat(mask[None,...], m, axis=0)
    s = np.sign(data[:,None,:] - data[:,:,None]).astype('int')
    s = (s * mask).sum(axis=(1,2))

    #--------------------Count ties--------------------
    counts = countTies(data)
    tt = counts * (counts - 1) * (2*counts + 5)
    tt = tt.sum(axis=1)

    #-----------------Sample Gaussian-----------------
    var = (n * (n-1) * (2*n+5) - tt) / 18.
    eps = 1e-8  # avoid dividing by 0
    z = (s - np.sign(s)) / (np.sqrt(var) + eps)
    p = norm.cdf(z)
    p = np.where(p > 0.5, 1-p, p)

    if tails == 2:
        p = p*2

    return z, p
I assume your data come in the layout of (time, latitude, longitude), and you are examining the temporal trend for each lat/lon cell.
To simulate this task, I synthesized a sample data array of shape (50, 145, 192). The 50 time points are taken from Example 5.9 of the book Wilks 2011, Statistical methods in the atmospheric sciences. And then I simply duplicated the same time series 27840 times to make it (50, 145, 192).
Below is the computation:
x = np.array([0.44, 1.18, 2.69, 2.08, 3.66, 1.72, 2.82, 0.72, 1.46, 1.30, 1.35, 0.54,
              2.74, 1.13, 2.50, 1.72, 2.27, 2.82, 1.98, 2.44, 2.53, 2.00, 1.12, 2.13, 1.36,
              4.9, 2.94, 1.75, 1.69, 1.88, 1.31, 1.76, 2.17, 2.38, 1.16, 1.39, 1.36,
              1.03, 1.11, 1.35, 1.44, 1.84, 1.69, 3., 1.36, 6.37, 4.55, 0.52, 0.87, 1.51])

# create a big cube with shape: (T, Y, X)
arr = np.zeros([len(x), 145, 192])
for i in range(arr.shape[1]):
    for j in range(arr.shape[2]):
        arr[:, i, j] = x
print(arr.shape)

# re-arrange into tabular layout: (Y*X, T)
arr = np.transpose(arr, [1, 2, 0])
arr = arr.reshape(-1, len(x))
print(arr.shape)

import time
t1 = time.time()
z, p = MannKendallTrend2D(arr, tails=2, axis=1)
p = p.reshape(145, 192)
t2 = time.time()
print('time =', t2 - t1)
The p-value for that sample time series is 0.63341565, which I have validated against the pymannkendall module result. Since arr contains merely duplicated copies of x, the resultant p is a 2d array of size (145, 192), with all 0.63341565.
And it took me only 1.28 seconds to compute that.
