Storing multiple arrays within multiple arrays within an array Python/Numpy - python

I have a text file with 93 columns and 1699 rows that I have imported into Python. The first three columns do not contain data that is necessary for what I'm currently trying to do. Within each column, I need to divide each element (i.e., row) in the column by all of the other elements (rows) in that same column. The result I want is an array of 90 elements, where each element contains 1699 sub-arrays of 1699 values each.
A more detailed description of what I'm attempting: I begin with Column3. At Column3, Row1 is to be divided by all the other rows (including the value in Row1) within Column3. That will give Row1 1699 calculations. Then the same process is done for Row2 and so on until Row1699. This gives Column3 1699x1699 calculations. When the calculations of all of the rows in Column 3 have completed, then the program moves on to do the same thing in Column 4 for all of the rows. This is done for all 90 columns which means that for the end result, I should have 90x1699x1699 calculations.
My code as it currently stands is:
import numpy as np
from glob import glob
fnames = glob("NIR_data.txt")
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
NIR_band = NIR_values.T
C_values = []
for i in range(3, len(NIR_band)):
    for j in range(0, len(NIR_band[3])):
        loop_list = NIR_band[i][j] / NIR_band[i, :]
        C_values.append(loop_list)
What it produces is an array of 1699x1699 dimension. Each individual array is the results from the Row calculations. Another complaint is that the code takes ages to run. So, I have two questions, is it possible to create the type of array I'd like to work with? And, is there a faster way of coding this calculation?

Dividing each of the numbers in a given column by each of the other values in the same column can be accomplished in one operation as follows.
result = a[:, numpy.newaxis, :] / a[numpy.newaxis, :, :]
Because looping over the elements happens in the optimized binary depths of numpy, this is as fast as Python is ever going to get for this operation.
If a.shape was [1699,90] to begin with, then the result will have shape [1699,1699,90]. Assuming dtype=float64, that means you will need nearly 2 GB of memory available to store the result.
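A minimal sketch of that broadcasting step, using a small hypothetical (5, 3) stand-in for the real (1699, 90) data:
import numpy as np

a = np.arange(1.0, 16.0).reshape(5, 3)  # hypothetical small stand-in

# result[i, j, k] == a[i, k] / a[j, k]: every element of each column
# divided by every element of that same column
result = a[:, np.newaxis, :] / a[np.newaxis, :, :]

print(result.shape)  # (5, 5, 3)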

First let's focus on the load:
arrays = np.array([np.loadtxt(f, skiprows=1) for f in fnames])
NIR_values = np.concatenate(arrays)
Your text talks about loading a file and manipulating it, but this clip loads multiple files and joins them.
My first change is to collect the arrays in a list, not another array:
alist = [np.loadtxt(f, skiprows=1) for f in fnames]
If you want to skip some columns, look at using the usecols parameter. That may save you work later.
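For example, a sketch assuming the first three of the 93 columns are the ones to drop:
NIR_values = np.loadtxt("NIR_data.txt", skiprows=1, usecols=range(3, 93))  # keep columns 3..92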
The elements of alist will now be 2d arrays (of floats). If they are matching sizes (N,M), they can be joined in various ways. If there are n files, then
arrays = np.array(alist) # (n,N,M) array
arrays = np.concatenate(alist, axis=0) # (n*N, M) array
# similarly for axis=1
Your code does the same, but potentially confuses steps:
In [566]: arrays = np.array([np.ones((3,4)) for i in range(5)])
In [567]: arrays.shape
Out[567]: (5, 3, 4) # (n,N,M) array
In [568]: NIR_values = np.concatenate(arrays)
In [569]: NIR_values.shape
Out[569]: (15, 4) # (n*N, M) array
NIR_band is now (4,15), and its len() is .shape[0], the size of the 1st dimension. len(NIR_band[3]) is .shape[1], the size of the 2nd dimension.
You could skip the columns of NIR_values with NIR_values[:,3:].
I get lost in the rest of the text description.
I would rewrite NIR_band[i][j]/NIR_band[i,:] as NIR_band[i,j]/NIR_band[i,:]. What's the purpose of that?
As for your subject line, Storing multiple arrays within multiple arrays within an array - that sounds like making a 3 or 4d array. arrays is 3d, NIR_values is 2d.
Creating a (90,1699,1699) from a (93,1699) will probably involve (without iteration) a calculation analogous to:
In [574]: X = np.arange(13*4).reshape(13,4)
In [575]: X.shape
Out[575]: (13, 4)
In [576]: (X[3:,:,None]+X[3:,None,:]).shape
Out[576]: (10, 4, 4)
The last dimension is expanded with None (np.newaxis), and the 2 versions are broadcast against each other. np.outer does the multiplication version of this calculation.
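Applied to your data, the analogous division (a sketch, not tested against your actual files) would be:
C = NIR_band[3:, :, None] / NIR_band[3:, None, :]  # shape (90, 1699, 1699)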

Related

Adding a smaller array to a larger array at a specified location (using variables)

Suppose I have a 6x6 matrix I want to add into a 9x9 matrix, but I also want to add it at a specified location and not necessarily in a 6x6 block.
The below code summarizes what I want to accomplish, the only difference is that I want to use variables instead of the rows 0:6 and 3:9.
import numpy as np
a = np.zeros((9,9))
b = np.ones((6,6))
a[0:6,3:9] += b #Inserts the 6x6 ones matrix into the top right corner of the 9x9 zeros
Now using variables:
rows = np.array([0,1,2,3,4,5])
cols = np.array([3,4,5,6,7,8])
a[rows,3:9] += b #This works fine
a[0:6,cols] += b #This also works fine
a[rows,cols] += b #But this gives me the following error: ValueError: shape mismatch: value array of shape (6,6) could not be broadcast to indexing result of shape (6,)
I have spent hours reading through forums and trying different solutions but nothing has ever worked. The reason I need to use variables is that these are input by the user and could be any combination of rows and columns. This notation worked perfectly in MATLAB, where I could add b into a with any combination of rows and columns.
Explanation:
rows = np.array([0,1,2,3,4,5])
cols = np.array([3,4,5,6,7,8])
a[rows,cols] += b
You could translate the last line to the following code:
for x, y, z in zip(rows, cols, b):
    a[x, y] = z
That means: rows contains the x-coordinate, cols the y-coordinate of the field you want to manipulate. Both arrays contain 6 values, so you effectively manipulate 6 values, and b must thus also contain exactly 6 values. But your b contains 6x6 values. Therefore this is a "shape mismatch". The NumPy documentation on indexing should contain all you need to know about indexing np.arrays.
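To get the MATLAB-style block assignment with arbitrary row and column arrays, one standard option (not spelled out in the answer above) is np.ix_, which builds an open mesh from the two index arrays:
a[np.ix_(rows, cols)] += b  # adds b into the 6x6 block selected by rows and cols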

how to extract overlapping sub-arrays with a window size and flatten them

I am trying to get better at using numpy functions and methods to make my Python programs run faster.
I want to do the following:
I create an array 'a' as:
a=np.random.randint(-10,11,10000).reshape(-1,10)
a.shape: (1000,10)
I create another array which takes only the first two columns in array a
b=a[:,0:2]
b.shape: (1000,2)
now I want to create an array c which has 990 rows containing flattened slices of 10 rows of array 'b'.
So the first row of array 'c' will have 20 columns which is a flattened slice of 0 to 10 rows
of array 'b'. The next row of array 'c' will have 20 columns of flattened rows 1 to 11 of array
'b' etc.
I can do this with a for loop. But I want to know if there is much faster way to do this using numpy functions and methods like strides or something else
Thanks for your time and your help.
This loops over shifts rather than rows (loop of size 10):
N = 10
c = np.hstack([b[i:i-N] for i in range(N)])
Explanation: b[i:i-N] is b's rows from i to m-(N-i) (excluding m-(N-i) itself), where m is the number of rows in b. Then np.hstack stacks those selected sub-arrays horizontally (it stacks b[0:m-10], b[1:m-9], b[2:m-8], ..., b[9:m-1]), as the question asks.
c.shape: (990, 20)
Also I think you may be looking for a shape of (991, 20) if you want to include all windows.
You can also use strides, but if you want to do operations on the result, I would advise against that, since the memory handling is tricky with them. Here is a strided solution if you insist:
from skimage.util.shape import view_as_windows
c = view_as_windows(b, (10,2)).reshape(-1, 20)
c.shape: (991, 20)
If you don't want the last row, simply remove it by calling c[:-1].
A similar solution applies with numpy's as_strided function (np.lib.stride_tricks.as_strided); the two operate in basically the same way.
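On newer NumPy (1.20+), a safer alternative to raw strides is sliding_window_view; a sketch under that assumption:
from numpy.lib.stride_tricks import sliding_window_view
c = sliding_window_view(b, (10, 2)).reshape(-1, 20)  # c.shape: (991, 20)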
UPDATE: if you want to find unique values and their frequencies in each row of c you can do:
unique_values = []
unique_counts = []
for row in c:
    unique, unique_c = np.unique(row, return_counts=True)
    unique_values.append(unique)
    unique_counts.append(unique_c)
Note that numpy arrays have to be rectangular, meaning the number of elements per row must be the same along each dimension. Since different rows in c can have different numbers of unique values, you cannot create a numpy array for the unique values of each row (an alternative would be a structured numpy array). Therefore, a solution is to make a list of arrays, each containing the unique values of a different row in c. unique_values is a list of arrays of unique values, and unique_counts holds their frequencies in the same order.

Copying row element in a numpy array

I have an array X of <class 'scipy.sparse.csr.csr_matrix'> format with shape (44, 4095)
I would now like to create a new numpy array, say X_train = np.empty([44, 4095]), and copy row by row in a different order. Say I want the 5th row of X in the 1st row of X_train.
How do I do this (copying an entire row into a new numpy array) similar to matlab?
Define the new row order as a list of indices, then define X_train using integer indexing:
row_order = [4, ...]
X_train = X[row_order]
Note that unlike Matlab, Python uses 0-based indexing, so the 5th row has index 4.
Also note that integer indexing (due to its ability to select values in arbitrary order) returns a copy of the original NumPy array.
This works equally well for sparse matrices and NumPy arrays.
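A tiny sketch of the idea, with a hypothetical 3-row array:
import numpy as np
X = np.arange(6).reshape(3, 2)
row_order = [2, 0, 1]   # put the 3rd row first, and so on
X_train = X[row_order]  # integer indexing returns a copy in the new order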
NumPy generally works by reference (slices are views), which is something you should keep in mind. What you need to do is make a copy and then swap. I have written a demo function which swaps rows.
import numpy as np  # import numpy

def swapRows(myArray, rowA, rowB):
    '''Function which swaps rowA with rowB, in place.'''
    temp = myArray[rowA, :].copy()  # create a temporary copy of rowA
    myArray[rowA, :] = myArray[rowB, :]
    myArray[rowB, :] = temp

a = np.arange(30)    # generate demo data
a = a.reshape(6, 5)  # reshape the data into a 6x5 matrix
print(a)             # print the matrix before the swap
swapRows(a, 0, 1)    # swap the rows
print(a)             # print the matrix after the swap
To answer your question, one solution would be to use
X_train = np.empty([44, 4095])
X_train[0, :] = X[4, :].toarray()  # store the 5th (sparse) row of X, densified, in the 1st row
unutbu's answer seems to be the most logical.

Select specific rows from a 2d Numpy array using a sparse binary 1-d array

I am having issues figuring out how to do this operation.
I have a variable index, a 1xN sparse binary array, and a 2-d (NxM) array samples. I want to use index to select specific rows of samples and get a 2-d array.
I have tried stuff like:
idx = index.todense() == 1
samples[idx.T,:]
but nothing worked.
So far I have made it work doing this:
idx = test_x.todense() == 1
selected_samples = samples[np.array(idx.flat)]
But there should be a cleaner way.
To give an idea using a fraction of the data:
print(idx.shape) # (1, 22360)
print(samples.shape) # (22360, 200)
The short answer:
selected_samples = samples[index.nonzero()[1]]
The long answer:
The first problem is that your index matrix is 1xN while your sample ndarray is NxM. (See the mismatch?) This is why you needed to call .flat.
That's not a big deal, though, because we just need the nonzero entries in the sparse vector. Get those with index.nonzero(), which returns a tuple of (row indices, column indices). We only care about the column indices, so we use index.nonzero()[1] to get those by themselves.
Then, simply index with the array of nonzero column indices and you're done.
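A small sketch of that, with hypothetical toy shapes in place of (1, 22360) and (22360, 200):
import numpy as np
from scipy import sparse

samples = np.arange(12).reshape(4, 3)      # hypothetical (N, M) = (4, 3) samples
index = sparse.csr_matrix([[0, 1, 0, 1]])  # hypothetical 1xN binary mask

rows = index.nonzero()[1]         # column indices of the nonzero entries: [1 3]
selected_samples = samples[rows]  # shape (2, 3)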

How to convert a 2D array to a structured array using view (numpy)?

I am having some problems assigning fields to an array using the view method. Apparently, there doesn't seem to be a way to control how the fields are assigned.
from numpy import array
a=array([[1,2],[1,2],[1,2]]) # 3x2 matrix
#array([[1, 2],
# [1, 2],
# [1, 2]])
aa=a.transpose() # 2x3 matrix
#array([[1, 1, 1],
# [2, 2, 2]])
a.view(dtype='i8,i8') # This works
a.view(dtype='i8,i8,i8') # This returns error ValueError: new type not compatible with array.
aa.view(dtype='i8,i8') # This works
aa.view(dtype='i8,i8,i8') # This returns error ValueError: new type not compatible with array.
In fact, if I create aa from scratch instead of using transpose of a,
b=array([[1,1,1],[2,2,2]])
b.view(dtype='i8,i8') # This returns ValueError again.
b.view(dtype='i8,i8,i8') # This works
Why does this happen? Is there any way I can set the fields to represent rows or columns?
When you create a standard array in NumPy, some contiguous blocks of memory are occupied by the data. The size of each block depends on the dtype, the number and organization of these blocks by the shape of your array. Structured arrays follow the same pattern, except that each block is now composed of several sub-blocks, each sub-block occupying some space as defined by the corresponding dtype of the field.
In your example, you define a (3,2) array of ints (a). That's 2 int blocks for the first row, followed by 2 other blocks for the second, and then 2 last blocks for the third. If you want to transform it into a structured array, you can either keep the original layout (each block becomes a unique field: a.view(dtype=[('f0', int)])), or transform your 2-block rows into rows of 1 larger block consisting of 2 sub-blocks, each sub-block having an int size.
That's what happens when you do a.view(dtype=[('f0',int),('f1',int)]).
You can't make larger blocks (i.e., dtype="i8,i8,i8"), as the corresponding information would be spread across different rows.
Now, you can display your array in a different way, for example column by column: that's what happens when you take a .transpose of your array. It's only a different display, though (a 'view' in the NumPy lingo); it doesn't change the original memory layout. So, in your aa example, the original layout is still "3 rows of 2 integers", which you can represent as "3 rows of one block of 2 integers".
In your second example, b=array([[1,1,1],[2,2,2]]), you have a different layout: 2 rows of 3 int blocks. You can group the 3 int blocks into one larger block (dtype="i8,i8,i8") because you're not crossing a row boundary. You can't group them two by two, because you would have an extra block on each row.
You can transform an (N,M) standard array into only (1) an N-element structured array of M fields or (2) an NxM structured array of 1 field, and that's it. The (N,M) shape is given to the array at its creation. You can display your array as an (M,N) array via a transposition, but that doesn't modify the original memory layout.
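If you really need the fields to run along the other axis, one workaround (my own suggestion, not from the answer above) is to make a contiguous copy first, since views require the original memory layout:
import numpy as np
a = np.array([[1, 2], [1, 2], [1, 2]], dtype=np.int64)
aa_copy = np.ascontiguousarray(a.transpose())  # copy: memory is now laid out as 2 rows of 3
aa_copy.view(dtype='i8,i8,i8')                 # works: each row becomes one 3-field record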
When you specify the view as b.view(dtype='i8,i8'), you are asking numpy to reinterpret the values as a set of tuples with two values each, but this simply isn't feasible since each row holds 3 values, which isn't a multiple of two. It's like reshaping the matrix: it would generate a new matrix of a different size, and numpy doesn't like such things.
