Python time-lat-lon array manipulation and grouping

For a t-x-y array representing time-latitude-longitude, where the values on the t-x-y grid hold arbitrary measured variables, how can I 'group' the x-y slices of the array for a given time condition?
For example, if a companion t-array is a 1-d list of datetimes, how can I find the element-wise mean of the x-y grids whose month equals 1? If t has only 10 elements where month = 1, then I want a (10, len(x), len(y)) array. From there I know I can do np.mean(out, axis=0) to get my desired mean values across the x-y grid, where out is the result of the array manipulation.
The shape of t-x-y is approximately (2000, 50, 50), that is, a (50, 50) grid of values for 2000 different times. Assume that the number of unique conditions (whether I'm slicing by month or by year) is much smaller than the total number of elements in the t array.
What is the most Pythonic way to achieve this? This operation will be repeated with many datasets, so a computationally efficient solution is preferred. I'm relatively new to Python (I can't even figure out how to create an example array for you to test with), so feel free to recommend other modules that may help. (I have looked at pandas, but it seems to mainly handle 1-d time-series data...?)
Edit:
This is the best I can do as an example array:
>>> t = np.repeat([1,2,3,4,5,6,7,8,9,10,11,12],83)
>>> t.shape
(996,)
>>> a = np.random.randint(1,101,2490000).reshape(996, 50, 50)
>>> a.shape
(996, 50, 50)
>>> list(set(t))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
So a is the array of random data and t is (say) your array representing months of the year, in this case just plain integers. In this example there are 83 instances of each month. How can we separate out the 83 x-y slices of a that correspond to when t = 1 (to create a monthly mean dataset)?

One possible answer to my own question, using numpy.where
To find the slices of a where t = 1:
>>> import numpy as np
>>> out = a[np.where(t == 1),:,:]
although this gives the slightly confusing (to me at least) output of:
>>> out.shape
(1, 83, 50, 50)
but if we follow through to the mean I need,
>>> out2 = np.mean(np.mean(out, axis = 0), axis = 0)
reduces the result to the expected:
>>> out2.shape
(50, 50)
Can anyone improve on this or see any issues here?
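One possible improvement, as a sketch under the same setup: np.where returns a tuple of index arrays, which is why indexing with it adds a leading length-1 axis; a boolean mask used directly as the index avoids that, so a single mean suffices. Self-contained version regenerating a and t as in the example above:
import numpy as np

t = np.repeat(np.arange(1, 13), 83)            # months, as in the example
a = np.random.randint(1, 101, (996, 50, 50))   # random data

# Boolean indexing selects the 83 matching slices directly: shape (83, 50, 50)
out = a[t == 1]
out2 = out.mean(axis=0)                        # shape (50, 50)

# The same idea applied to every unique month at once
monthly_means = {m: a[t == m].mean(axis=0) for m in np.unique(t)}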

Related

What is an efficient way to calculate the mean of values in the bin with the maximum frequency for a large number of numpy arrays?

I am looking for an efficient way to do the following calculation on millions of arrays. For the values in each array, I want to calculate the mean of the values in the bin with the highest frequency, as demonstrated below. Some of the arrays may contain nan values, and the other values are floats. The loop for my actual data takes too long to finish.
import numpy as np
array = np.array([np.random.uniform(0, 10) for i in range(800)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values = np.linspace(0, 10, 21)
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) * (array[i] < bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' for the line where I calculate the values variable. I set a condition to skip this line in case a row is all nan values, but the warning did not go away; I was wondering what the reason is. The other two warnings are for the less and greater_equal comparisons, which makes sense to me since nan values may be involved.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option, however, for now I am looking to improve the algorithm itself.
The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not take an axis argument. I was able to use a mask and remove the loop from the code; the code is 2 times faster now, but I think it can still be improved.
I can explain what I want to do in more detail with an example, if that clarifies it. Imagine I have 36 numbers which are greater than 0 and smaller than 20. Also, I have bins of equal width 0.5 over the same interval (0.0_0.5, 0.5_1.0, 1.0_1.5, ..., 19.5_20.0). If I put the 36 numbers into their corresponding bins, I want the mean of the numbers within the bin that contains the most numbers.
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.array([np.random.uniform(0, 10) for i in range(800)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values = np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
# creating a mask to get the mean over the bin with maximum frequency
mask = (array >= bin_start) * (array < bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis=1)
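Since the question asks for faster alternatives, here is one possible fully vectorized sketch: np.digitize assigns a bin index to every element, and a single np.bincount over row-offset indices replaces both apply_along_axis calls. The variable names (idx, counts, best, in_best) are mine, and edge handling differs negligibly from the masked version above.
import numpy as np

rng = np.random.default_rng(0)
array = rng.uniform(0, 10, 800)
array[rng.random(800) < 0.7] = np.nan          # ~70% nan, as in the example
array = array.reshape(50, 16)
bin_values = np.linspace(0, 10, 21)
num_bins = len(bin_values) - 1

valid = ~np.isnan(array)
# Bin index per element; nan sorts past the last edge, so clip and rely on `valid`
idx = np.clip(np.digitize(array, bin_values) - 1, 0, num_bins - 1)

# One bincount for all rows: offset each row's indices into its own block
rows = np.arange(array.shape[0])[:, None]
flat = (rows * num_bins + idx)[valid]
counts = np.bincount(flat, minlength=array.shape[0] * num_bins)
counts = counts.reshape(array.shape[0], num_bins)

best = counts.argmax(axis=1)                   # modal bin per row
in_best = valid & (idx == best[:, None])       # elements falling in that bin

# Row-wise mean over the modal bin; an all-nan row divides by zero -> nan
v = np.where(in_best, array, 0).sum(axis=1) / in_best.sum(axis=1)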

Iterating over a 3D numpy array using one dimension as the iterator and the remaining dimensions in the loop

Despite there being a number of similar questions about iterating over a 3D array, and after trying out functions like numpy's nditer, I am still confused about how the following can be achieved:
I have a signal of dimensions (30, 11, 300) which is 30 trials of 11 signals containing 300 signal points.
Let this signal be denoted by the variable x_
I have another function which takes an (11, 300) matrix as input and plots it on one graph (11 signals containing 300 signal points plotted on a single graph). Let this function be sliding_window_plot.
Currently, I can get it to do this:
x_plot = x_[0,:,:]
for i in range(x_.shape[0]):
    sliding_window_plot(x_plot[:,:])
which plots THE SAME (first trial) 11 signals containing 300 points on one plot, 30 times. I want it to plot the i-th set of signals, not the first (0th) trial every time. Any hints on how to attempt this?
You should be able to iterate over the first dimension with a for loop:
for s in x_:
    sliding_window_plot(s)
With each iteration, s will be the next array of shape (11, 300).
In general for all nD-arrays where n>1, you can iterate over the very first dimension of the array as if you're iterating over any other iterable. For checking whether an array is an iterable, you can use np.iterable(arr). Here is an example:
In [9]: arr = np.arange(3 * 4 * 5).reshape(3, 4, 5)
In [10]: arr.shape
Out[10]: (3, 4, 5)
In [11]: np.iterable(arr)
Out[11]: True
In [12]: for a in arr:
    ...:     print(a.shape)
    ...:
(4, 5)
(4, 5)
(4, 5)
So, in each iteration we get a matrix of shape (4, 5) as output; in total, 3 such outputs make up the 3D array of shape (3, 4, 5).
If, for some reason, you want to iterate over other dimensions, you can use numpy.rollaxis to move the desired axis to the first position and then iterate over it, as described in iterating-over-arbitrary-dimension-of-numpy-array.
NOTE: That said, numpy.rollaxis is only maintained for backwards compatibility, so it is recommended to use numpy.moveaxis instead for moving the desired axis to the first position.
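For instance, a minimal sketch of iterating over the second axis of the example array via numpy.moveaxis:
import numpy as np

arr = np.arange(3 * 4 * 5).reshape(3, 4, 5)

# Move axis 1 to the front, then iterate over it like any other array
for a in np.moveaxis(arr, 1, 0):
    print(a.shape)        # (3, 5), printed 4 times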
You are hardcoding the 0th slice outside the for loop. You need to create x_plot inside the loop. In fact, you can simplify your code by not using x_plot at all:
for i in range(x_.shape[0]):
    sliding_window_plot(x_[i])

Splitting a numpy array into two subsets of different sizes

I have a numpy array of a shape (400, 3, 3, 3) and I want to split it into two parts, so I would get arrays like (100, 3, 3, 3) and (300, 3, 3, 3).
I was playing with numpy split methods, e.g.:
subsets = np.array_split(arr, 2)
which is close to what I want, but it divides the original array into two halves of equal size, and I don't know how to specify different sizes. It'd probably be easy with some indexing (I guess), but I'm not sure how to do it.
As mentioned in my comment, you can use the Ellipsis notation to specify all axes:
x, y = arr[:100, ...], arr[100:, ...]
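For completeness, a short sketch with a stand-in arr: np.split (and np.array_split) also accept a list of split indices instead of a section count, which yields the asked-for sizes directly.
import numpy as np

arr = np.zeros((400, 3, 3, 3))

# Basic slicing along the first axis; Ellipsis fills in the remaining axes
x, y = arr[:100, ...], arr[100:, ...]

# Equivalent: split at index 100 along axis 0
x2, y2 = np.split(arr, [100])
print(x2.shape, y2.shape)    # (100, 3, 3, 3) (300, 3, 3, 3)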

Keeping track of how indices change in numpy.reshape

While using numpy.reshape in Python, is there a way to keep track of the change in indices?
For example, if a numpy array with shape (m, n, l, k) is reshaped into an array with shape (m*n, l*k), is there a way to get the initial index [x, y, w, z] corresponding to the current [X, Y] index, and vice versa?
Yes there is; it's called raveling and unraveling the index. For example, say you have two arrays:
import numpy as np
arr1 = np.arange(10000).reshape(20, 10, 50)
arr2 = arr1.reshape(20, 500)
and say you want the element at index (10, 52) of arr2 (i.e. arr2[10, 52]) but in arr1:
>>> np.unravel_index(np.ravel_multi_index((10, 52), arr2.shape), arr1.shape)
(10, 1, 2)
or in the other direction:
>>> np.unravel_index(np.ravel_multi_index((10, 1, 2), arr1.shape), arr2.shape)
(10, 52)
You don't keep track of it, but you can calculate it. The original m x n pair is mapped onto the new m*n dimension, e.g. n*x + y == X. But we can verify that with a couple of multidimensional ravel/unravel calls (as in @MSeifert's answer).
In [671]: m,n,l,k=2,3,4,5
In [672]: np.ravel_multi_index((1,2,3,4), (m,n,l,k))
Out[672]: 119
In [673]: np.unravel_index(52, (m*n,l*k))
Out[673]: (2, 12)
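To make the index arithmetic concrete, a small self-contained check (variable names follow the question) that X = n*x + y and Y = k*w + z point at the same element after the reshape:
import numpy as np

m, n, l, k = 2, 3, 4, 5
a = np.arange(m * n * l * k).reshape(m, n, l, k)
b = a.reshape(m * n, l * k)

x, y, w, z = 1, 2, 3, 4
X, Y = n * x + y, k * w + z      # (5, 19)
assert a[x, y, w, z] == b[X, Y]  # both address flat index 119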

Is there a difference in the way we access elements of a list comprehension and the elements of a numpy array

I am working on genetic algorithm code, and I am fairly new to Python.
My code snippet is as follows:
import numpy as np
pop_size = 10  # Population size
noi = 2        # Number of iterations
M = 2          # Number of phases in the data
alpha = [np.random.randint(0, 64, size=pop_size)] * M
phi = [np.random.randint(0, 64, size=pop_size)] * M
reduced_tensor = [np.zeros((pop_size, 3, 3))] * M
for n_i in range(noi):
    alpha_en = [(2*np.pi*alpha/63.00) for alpha in alpha]
    phi_en = [(phi/63.00) for phi in phi]
    for i in range(M):
        for j in range(pop_size):
            reduced_tensor[i][j] = [[1, 0, 0],
                                    [0, phi_en[i][j], 0],
                                    [0, 0, 0]]
Here I have a list of numpy arrays. The variable alpha is a list containing two numpy arrays. How do I use a list comprehension in this case? I want to create a similar list alpha_en that operates on every element of alpha. How do I do that? I know my current code is wrong; it was just trial and error.
What does 'for alpha in alpha' mean (the alpha_en line)? This line doesn't give any error, but it also doesn't give the desired output; it changes the dimension and value of alpha.
The variable reduced_tensor is a list of arrays of 3x3 matrices, i.e., four dimensions in total. How do I differentiate between indexing a list and indexing a numpy array? I want to perform various operations on a list of matrices; in this case, assign the values of phi_en to one of the elements of each matrix in reduced_tensor (as shown in the code). How should I do this efficiently? I think my current code is wrong, or at the very least confusing.
There is some questionable programming in these two lines:
alpha = [np.random.randint(0, 64, size = pop_size)]* M
...
alpha_en = [(2*np.pi*alpha/63.00) for alpha in alpha]
The first makes an array, and then makes a list with M pointers to the same thing; note that these are M references to one random array, not M independent copies. If I were to change one element of alpha, I'd change them all. I don't see the point of this type of construction.
The [... for alpha in alpha] works because the two uses of alpha are different. At least in newer Pythons, the i in [i*3 for i in range(3)] does not 'leak out' of the comprehension. That said, I would not approve of that variable naming; at the very least it is confusing to readers.
The arrays in alpha_en are separate. Values are derived from the array in alpha, but they are new.
for a in alphas:
    a *= 2
would modify each array in alphas in place; however, due to how alphas is constructed, this ends up multiplying the same array many times.
reduced_tensor = [np.zeros((pop_size,3,3))]* M
has the same problem; it's a list of M references to the same 3d array.
reduced_tensor[i][j]
references the i reference in that list, and the j 'row' of that array. I like to use
reduced_tensor[i][j,:,:]
to make it clearer to me and my reader the expected dimensions of the result.
The iteration over M does nothing for you; it just repeats the same assignment M times.
At the root of your problems is that use of list replication.
In [30]: x=[np.arange(3)]*3
In [31]: x
Out[31]: [array([0, 1, 2]), array([0, 1, 2]), array([0, 1, 2])]
In [32]: [id(i) for i in x]
Out[32]: [3036895536, 3036895536, 3036895536]
In [33]: x[0] *= 10
In [34]: x
Out[34]: [array([ 0, 10, 20]), array([ 0, 10, 20]), array([ 0, 10, 20])]
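A minimal sketch of the fix implied above, reusing the question's M and pop_size: build each list with a comprehension so that every entry is an independent array.
import numpy as np

M, pop_size = 2, 10

# One fresh array per entry, instead of M references to the same object
alpha = [np.random.randint(0, 64, size=pop_size) for _ in range(M)]
reduced_tensor = [np.zeros((pop_size, 3, 3)) for _ in range(M)]

alpha[0] *= 2                  # mutates only the first entry
print(alpha[0] is alpha[1])    # False: separate objects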
