I have approximately 100 numpy arrays, each with shape (100, 40000, 4). I want to concatenate these arrays along the first axis, i.e. axis=0, into one big array efficiently.
Approach 1
I used np.concatenate as shown below-
def concatenate(all_data):
    for index, data in enumerate(all_data):
        if index == 0:
            arr = data.copy()
        else:
            arr = np.concatenate((arr, data), axis=0)
    return arr
Approach 2
I created a Panel in pandas and then used pd.concat as shown below-
def concatenate(all_data):
    for index, data in enumerate(all_data):
        if index == 0:
            pn = pd.Panel(data)
        else:
            pn = pd.concat([pn, pd.Panel(data)])
    return pn  # numpy array can be acquired from pn.values
The second approach seems faster than the first one. However, it raises a deprecation warning when the pd.Panel is created (pd.Panel has since been removed from pandas entirely).
I want to know if there is a better way to concatenate large 3-dimensional arrays in Python.
Calling np.concatenate() repeatedly is an anti-pattern: each call copies everything accumulated so far, so the loop's total copying grows quadratically with the number of arrays. np.concatenate() accepts a whole sequence of arrays, so instead, try this:
np.concatenate(all_data)
Simple, fast.
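To illustrate, a minimal sketch with toy shapes standing in for the real (100, 40000, 4) arrays (the names and sizes here are made up for the demo):

import numpy as np

all_data = [np.random.rand(10, 40, 4) for _ in range(100)]  # toy stand-ins
big = np.concatenate(all_data, axis=0)  # one allocation, each input copied once
print(big.shape)  # (1000, 40, 4)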
Related
I have two numpy arrays of the same shape. One contains information that I am interested in, and the other contains a bunch of integers that can be used as mask values.
In essence, I want to loop through each unique integer to get each mask for the array, then filter the main array using this mask and find the max value of the filtered array.
For simplicity, let's say the arrays are:
arr1 = np.random.rand(10000,10000)
arr2 = np.random.randint(low=0, high=1000, size=(10000,10000))
Right now I'm doing this:
maxes = {}
ids = np.unique(arr2)
for id in ids:
    max_val = arr1[np.equal(arr2, id)].max()
    maxes[id] = max_val
My arrays are a lot bigger and this is painfully slow; I am struggling to find a quicker way of doing this... maybe there's some kind of creative method I'm not aware of. Would really appreciate any help.
EDIT
let's say the majority of arr2 is actually 0 and I don't care about the 0 id. Is it possible to speed it up by dropping this entire chunk from the search?
i.e.
arr2[:, 0:4000] = 0
and just return the maxes for ids > 0 ??
much appreciated..
Generic bin-based reduction strategies
Listed below are a few approaches to tackle such scenarios where we need to perform bin-based reduction operations. Essentially, we are given two arrays, and we are required to use one as the bins and the other for the values, reducing the values within each bin.
Approach #1 : One strategy would be to sort the values based on the bins. Once we have them both sorted in that same order, we find the group start indices and then, with the appropriate ufunc.reduceat, we do our slice-based reduction operation. That's all there is!
Here's the implementation -
def binmax(bins, values, reduceat_func):
    ''' Get binned statistic from two 1D arrays '''
    sidx = bins.argsort()
    bins_sorted = bins[sidx]
    grpidx = np.flatnonzero(np.r_[True, bins_sorted[:-1] != bins_sorted[1:]])
    max_per_group = reduceat_func(values[sidx], grpidx)
    out = dict(zip(bins_sorted[grpidx], max_per_group))
    return out

out = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.maximum.reduceat)
It's applicable across ufuncs that have their corresponding ufunc.reduceat methods.
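For instance, np.minimum.reduceat or np.add.reduceat slot straight in. A tiny illustration of the reduceat mechanics that binmax relies on (toy numbers, purely for exposition):

import numpy as np

vals = np.array([3., 1., 4., 1., 5., 9., 2.])
starts = np.array([0, 2, 5])  # start index of each group
print(np.maximum.reduceat(vals, starts))  # [3. 5. 9.] - the max of each slice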
Approach #2 : We can also leverage scipy.stats.binned_statistic, which is basically a generic utility to do some of the common reduction operations based on binned array values -
from scipy.stats import binned_statistic

def binmax_v2(bins, values, statistic):
    ''' Get binned statistic from two 1D arrays '''
    num_labels = bins.max() + 1
    R = np.arange(num_labels + 1)  # bin edges, one bin per integer label
    Mx = binned_statistic(bins, values, statistic=statistic, bins=R)[0]
    idx = np.flatnonzero(~np.isnan(Mx))  # labels that actually occur
    out = dict(zip(idx, Mx[idx]))  # keep the float maxima as-is
    return out

out = binmax_v2(arr2.ravel(), arr1.ravel(), statistic='max')
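As a quick sanity check on a scaled-down case (a sketch; the shapes are much smaller than in the question), both approaches should agree with the naive loop:

import numpy as np

arr1 = np.random.rand(500, 500)
arr2 = np.random.randint(0, 50, size=(500, 500))

out1 = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.maximum.reduceat)
out2 = binmax_v2(arr2.ravel(), arr1.ravel(), statistic='max')
naive = {i: arr1[arr2 == i].max() for i in np.unique(arr2)}
assert out1 == naive and out2 == naive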
A question about masking 2-d np.array data.
For example:
one 2-d np.array value with shape 20 x 20.
An index t = [(1,2),(3,4),(5,7),(12,13)]
How do I mask the 2-d array value by the (y,x) pairs in the index?
Usually, replacing with np.nan is based on a specific value, like y[y==7] = np.nan.
In my example, I want to replace the values at specific locations with np.nan.
For now, I can do it by:
Creating a new array value_mask with shape 20 x 20
Looping over value and testing each location with (i,j) == t[k]
If True, value_mask[i,j] = value[i,j]; otherwise, value_mask[i,j] = np.nan
My method is too bulky, especially for huge data (3 levels of loops).
Is there an efficient method to achieve this? Any advice would be appreciated.
You are nearly there.
You can index arrays with arrays of indices. You probably know this from 1D arrays.
With a 2D array you need to pass the array a tuple of index sequences: one per axis, of equal length, with one element for each array element you want to choose. You have a list of (y,x) tuples, so you just have to "transpose" it (in Python 3, wrap zip in tuple(), since indexing needs a concrete sequence):
t1 = tuple(zip(*t))
gives you the right shape for your index, which you can now use in any assignment, for example: value[t1] = np.nan
(There are lots of nice explanations of this trick (with zip and *) in Python tutorials, if you don't know it yet.)
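Putting it together, a minimal runnable sketch (the array contents are made up for the demo):

import numpy as np

value = np.random.rand(20, 20)  # example 20 x 20 array
t = [(1, 2), (3, 4), (5, 7), (12, 13)]

rows, cols = zip(*t)  # "transpose" the list of (y, x) tuples
value[rows, cols] = np.nan  # fancy indexing assigns all four cells at once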
You can use np.logical_and
arr = np.zeros((20,20))
You can select by location; this is just an example location.
arr[4:8,4:8] = 1
You can create a mask the same shape as arr
mask = np.ones((20,20)).astype(bool)
Then you can use np.logical_and.
mask = np.logical_and(mask, arr == 1)
And finally, you can replace the 1s with np.nan.
arr[mask] = np.nan
I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (aka row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each element holds 1699 rows of 1699 values, i.e. shape (90, 1699, 1699).
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That gives Row1 1699 calculations. Then the same process is done for Row2 and so on, until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column3 have completed, the program moves on to do the same thing in Column4 for all of the rows. This is done for all 90 columns, which means that for the end result I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob

fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j] / NIR_band[i, :]
        C_values.append(loop_list)
What it produces is an array of 1699x1699 dimension, where each individual array is the result of the row calculations. Another complaint is that the code takes ages to run. So I have two questions: is it possible to create the type of array I'd like to work with? And is there a faster way of coding this calculation?
Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
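If you want the (90, 1699, 1699) layout described in the question, you can move the column axis to the front afterwards. A sketch with a toy row count so it fits in memory (the full 1699-row result needs the ~2 GB noted above):

import numpy as np

a = np.random.rand(100, 90)  # toy stand-in; the real data is (1699, 90)
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]  # (100, 100, 90)
per_column = np.moveaxis(result, 2, 0)  # (90, 100, 100)
print(per_column.shape)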
First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
The elements of alist will now be 2d arrays (of floats). If they have matching sizes (N,M), they can be joined in various ways. If there are n files, then:
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is shape[0], the size of the 1st dimension. len(NIR_band[3]) is shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of the text description.
The NIR_band[i][j]/NIR_band[i,:] I would rewrite as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3 or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the 2 versions are broadcast against each other. np.outer does the multiplication analog of this calculation.
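Applied to the question, the same broadcasting pattern with division would look something like this (a sketch with a toy row count; the real (1699, 93) input yields a roughly 2 GB result):

import numpy as np

X = np.random.rand(100, 93)  # toy stand-in for the loaded (1699, 93) data
d = X[:, 3:].T  # drop the first 3 columns and transpose -> (90, 100)
C = d[:, :, None] / d[:, None, :]  # broadcast -> (90, 100, 100)
print(C.shape)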
I have an array X in <class 'scipy.sparse.csr.csr_matrix'> format with shape (44, 4095).
I would now like to create a new numpy array, say X_train = np.empty([44, 4095]), and copy it row by row in a different order. Say I want the 5th row of X in the 1st row of X_train.
How do I do this (copying an entire row into a new numpy array), similar to Matlab?
Define the new row order as a list of indices, then define X_train using integer indexing:
row_order = [4, ...]
X_train = X[row_order]
Note that unlike Matlab, Python uses 0-based indexing, so the 5th row has index 4.
Also note that integer indexing (due to its ability to select values in arbitrary order) returns a copy of the original NumPy array.
This works equally well for sparse matrices and NumPy arrays.
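For example, a minimal sketch (the sparse stand-in and the permutation are made up for the demo):

import numpy as np
from scipy import sparse

X = sparse.random(44, 4095, density=0.01, format='csr')  # stand-in for the real X
row_order = np.random.permutation(X.shape[0])  # e.g. shuffle all 44 rows
X_train = X[row_order]  # rows come out copied in the new order
print(X_train.shape)  # (44, 4095)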
Python generally passes arrays around by reference (and NumPy slices are views into the same data), which is something you should keep in mind. What you need to do is make a copy and then swap. I have written a demo function which swaps rows.
import numpy as np  # import numpy

def swapRows(myArray, rowA, rowB):
    ''' Function which swaps rowA with rowB '''
    temp = myArray[rowA, :].copy()  # copy rowA into a temporary variable
    myArray[rowA, :] = myArray[rowB, :]
    myArray[rowB, :] = temp

a = np.arange(30)  # generate demo data
a = a.reshape(6, 5)  # reshape the data into a 6x5 matrix
print(a)  # print the matrix before the swap
swapRows(a, 0, 1)  # swap the rows
print(a)  # print the matrix after the swap
To answer your question, one solution would be to use
X_train = np.empty([44, 4095])
X_train[0, :] = X[4, :].toarray()  # densify the sparse 5th row and store it in the 1st row
unutbu's answer seems to be the most logical.
Kind Regards,
I am having issues figuring out how to do this operation.
So I have a variable index, a 1xM sparse binary array, and I have a 2-d array (NxM) of samples. I want to use index to select specific rows of samples and get a 2-d array.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing works.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea using a fraction of the data:
print(idx.shape) # (1, 22360)
print(samples.shape)  # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN while your sample ndarray is NxM. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
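A minimal runnable sketch of the idea (the shapes and the sparse stand-in are made up to mirror the question):

import numpy as np
from scipy import sparse

samples = np.random.rand(22360, 200)  # dense sample array
index = sparse.random(1, 22360, density=0.1, format='csr')  # stand-in sparse mask
index.data[:] = 1  # make it binary

selected_samples = samples[index.nonzero()[1]]  # nonzero column indices pick the rows
print(selected_samples.shape)  # (~2236, 200)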