I am new to Python and programming in general and ran into a question:
I have two NumPy arrays of the same shape: they are 2D arrays with dimensions 1000 x 2000.
I wish to compare the values of each column in array A with the values in array B. The important part is that not every column of A should be compared to every column in B, but rather the same columns of A & B should be compared to one another, as in: A[:,0] should be compared to B[:,0], A[:,1] should be compared to B[:,1],… etc.
This was easier to do when I had one-dimensional arrays: I used zip(A, B), so I could run the following for loop:
A = np.array([2,5,6,3,7])
B = np.array([1,3,9,4,8])
res_list = []
for number1, number2 in zip(A, B):
    if number1 > number2:
        comment1 = "bigger"
        res_list.append(comment1)
    if number1 < number2:
        comment2 = "smaller"
        res_list.append(comment2)
In [702]: res_list
Out[702]: ['bigger', 'bigger', 'smaller', 'smaller', 'smaller']
However, I am not sure how to best do this with the 2D arrays. As output, I am aiming for a list with 2000 sublists (one per column), so I can later count the instances of "bigger" and "smaller" for each column.
I am very thankful for any input.
So far I have tried to use np.nditer in a double for loop, but it returned all possible column combinations. I specifically want to combine only the "matching" columns.
An approximation of the input (the real arrays have 1000 rows and 2000 cols):
In [709]: A
Out[709]:
array([[2, 5, 6, 3, 7],
[6, 2, 9, 2, 3],
[2, 1, 4, 5, 7]])
In [710]: B
Out[710]:
array([[1, 3, 9, 4, 8],
[4, 8, 2, 3, 1],
[3, 7, 1, 8, 9]])
As desired output, I want to compare the values of the arrays A & B column-wise (only the "matching" columns, not all columns with all columns, as I tried to explain above) and store the results in a nested list (the number of sublists should correspond to the number of columns):
res_list = [["bigger", "bigger", "smaller"], ["bigger", "smaller", "smaller"], ["smaller", "bigger", "bigger"], ["smaller", "smaller", "smaller"], ...]
From the example input and output, I see that you want to do an element-wise comparison and store the values per column. From your code you understand the 1D variant of this problem, so the question seems to be how to do it in 2D.
Solution 1
In order to achieve this, we have to turn the 2D problem into a 1D problem, so you can do what you already did. If, for example, the columns became rows, you could redo your zip strategy for every row.
In other words, if we can turn:
a = np.array(
    [[2, 5, 6, 3, 7],
     [6, 2, 9, 2, 3],
     [2, 1, 4, 5, 7]]
)
into:
array([[2, 6, 2],
       [5, 2, 1],
       [6, 9, 4],
       [3, 2, 5],
       [7, 3, 7]])
we can iterate over a and b at the same time and get our 1D version of the problem. Swapping the x and y axes of a matrix like this is called transposing; it is very common, and in NumPy the operation is a.T (docs: ndarray.T).
Now we use your code twice: an outer loop iterates over all the rows (after transposing, the rows actually hold the column values), and an inner loop applies your 1D comparison to those values, because every row is a 1D NumPy array.
result = []
# Outer loop: go over the columns of `a` and `b` at the same time.
for col_a, col_b in zip(a.T, b.T):
    result_col = []
    # Inner loop: compare a whole column element-wise.
    for val_a, val_b in zip(col_a, col_b):
        result_col.append('bigger' if val_a > val_b else 'smaller')
    result.append(result_col)
Note: I use a ternary operator to assign smaller and bigger.
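Since the stated goal is to count the "bigger" and "smaller" instances per column, here is a minimal follow-up sketch using collections.Counter on the result list built above:

from collections import Counter

# One Counter per column sublist.
counts_per_col = [Counter(col) for col in result]
print(counts_per_col[0])  # Counter({'bigger': 2, 'smaller': 1}) for the sample data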
Solution 2
As indicated before, you are only comparing two values that sit in the same position in both arrays; this is called an element-wise comparison. Since we are only interested in the values at the exact same position, and we know the output shape of our result array (input 1000x2000, output 2000x1000 after the transpose), we can also iterate over all the elements using their index.
Now some quick, handy shortcuts (a short demo follows this list):
a.shape holds the dimensions of the array, so here a.shape will be (1000, 2000).
Using [::-1] reverses the order, similar to reversed().
Combining these, a.shape[::-1] will be (2000, 1000), our expected output shape.
np.ndindex yields every index tuple for the dimensions provided.
A * performs tuple unpacking, so np.ndindex(*a.shape) is equivalent to np.ndindex(1000, 2000).
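A quick demo of those pieces on a tiny array (shapes shrunk for readability):

import numpy as np

a = np.zeros((3, 2))
print(a.shape)        # (3, 2)
print(a.shape[::-1])  # (2, 3)

# np.ndindex walks every (row, column) pair in row-major order.
print(list(np.ndindex(*a.shape)))
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]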
Therefore we can use their index (from np.ndindex) and turn the x and y around to write the result to the correct location in the output array:
a = np.random.randint(0, 255, (1000, 2000))
b = np.random.randint(0, 255, (1000, 2000))

result = np.zeros(a.shape[::-1], dtype=object)
for row, col in np.ndindex(*a.shape):
    # Swap the index pair so the result lands in the transposed position.
    result[col, row] = 'bigger' if a[row, col] > b[row, col] else 'smaller'
print(result)
This leads to the same result. Similarly, we could first transpose the a and b arrays, drop the [::-1] from the result shape, and swap the assignment result[col, row] back to result[row, col], as the sketch below shows.
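A minimal sketch of that variant, assuming the same a and b as above:

# Transpose first, so no index swapping is needed on assignment.
at, bt = a.T, b.T
result = np.zeros(at.shape, dtype=object)
for row, col in np.ndindex(*at.shape):
    result[row, col] = 'bigger' if at[row, col] > bt[row, col] else 'smaller'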
Edit
Thinking about it a bit longer: you only want a comparison between two arrays of the same shape. For this NumPy already has a good solution, np.where(cond, <true>, <false>).
So the entire problem can be reduced to:
answer = np.where(a > b, 'bigger', 'smaller').T
Note the .T to transpose the solution, such that the answer has the columns in the rows.
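If the end goal is only the per-column counts, here is a sketch that skips the strings entirely (note: like the np.where one-liner, this lumps ties in with "smaller", whereas the original loop ignored ties):

bigger_per_col = (a > b).sum(axis=0)    # one integer count per column
smaller_per_col = (a <= b).sum(axis=0)  # ties land here; use < to exclude them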
Related
In Python, I need to split two rows in half, take the first half from row 1 and the second half from row 2, and concatenate them into an array which is then saved as a row in another 2D array. For example,
values=np.array([[1,2,3,4],[5,6,7,8]])
will become
Y[2,:] = [1, 2, 7, 8]  # 2 is arbitrarily chosen
I tried doing this with concatenate but got an error
only integer scalar arrays can be converted to a scalar index
x=values.shape[1]
pop[y,:]=np.concatenate(values[temp0,0:int((x-1)/2)],values[temp1,int((x-1)/2):x+1])
temp0 and temp1 are integers, and values is a 2d integer array of dimensions (100,x)
np.concatenate takes a list of arrays, plus an optional scalar axis parameter.
In [411]: values=np.array([[1,2,3,4],[5,6,7,8]])
Nothing wrong with how you split values:
In [412]: x=values.shape[1]
In [413]: x
Out[413]: 4
In [415]: values[0,0:int((x-1)/2)],values[1,int((x-1)/2):x+1]
Out[415]: (array([1]), array([6, 7, 8]))
wrong:
In [416]: np.concatenate(values[0,0:int((x-1)/2)],values[1,int((x-1)/2):x+1])
----
TypeError: only integer scalar arrays can be converted to a scalar index
It's trying to interpret the 2nd argument as an axis parameter, hence the scalar error message.
right:
In [417]: np.concatenate([values[0,0:int((x-1)/2)],values[1,int((x-1)/2):x+1]])
Out[417]: array([1, 6, 7, 8])
There are other concatenate front ends, as the sketch below shows. Here hstack would work the same. np.append takes exactly two arrays, so it would also work - but too often people use it wrongly. np.r_ is another front end with different syntax.
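A quick sketch of those front ends side by side on two small 1-D arrays:

import numpy as np

p = np.array([1, 2])
q = np.array([3, 4])

np.concatenate([p, q])  # array([1, 2, 3, 4])
np.hstack([p, q])       # same result for 1-D inputs
np.append(p, q)         # works here, but easy to misuse with the axis argument
np.r_[p, q]             # index-trick front end, same output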
The indexing might be clearer with:
In [423]: idx = (x-1)//2
In [424]: np.concatenate([values[0,:idx],values[1,idx:]])
Out[424]: array([1, 6, 7, 8])
Try numpy.append
numpy.append Documentation
np.append(values[temp0,0:int((x-1)/2)],values[temp1,int((x-1)/2):x+1])
You don't need splitting and/or concatenation. Just use indexing:
In [47]: values=np.array([[1,2,3,4],[5,6,7,8]])
In [48]: values[[[0], [1]],[[0, 1], [-2, -1]]]
Out[48]:
array([[1, 2],
[7, 8]])
Or ravel to get the flattened version:
In [49]: values[[[0], [1]],[[0, 1], [-2, -1]]].ravel()
Out[49]: array([1, 2, 7, 8])
As a more general approach you can also utilize np.r_ as follows:
In [61]: x, y = values.shape
In [62]: values[np.arange(x)[:,None],[np.r_[0:y//2], np.r_[-y//2:0]]].ravel()
Out[62]: array([1, 2, 7, 8])
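For context, np.r_ translates slice notation into concatenated ranges, which is what builds the column indices used above:

import numpy as np

np.r_[0:2]        # array([0, 1])
np.r_[-2:0]       # array([-2, -1])
np.r_[0:2, -2:0]  # array([ 0,  1, -2, -1])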
Reshape to split the second dimension in two; stack the part you want.
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# Each row is split into two halves: shape (2, 2, 2).
b = a.reshape(a.shape[0], a.shape[1] // 2, 2)
new_row = np.hstack([b[0, 0, :], b[1, 1, :]])
# new_row = np.hstack([b[0, 0], b[1, 1]])  # equivalent shorthand
print(new_row)  # [1 2 7 8]
I understand how
x = np.array([[1, 2], [3, 4], [5, 6]])
y = x[[0,1,2], [0,1,0]]
works: the output gives y = [1 4 5]. This just takes the first list as rows and the second list as columns.
But how does the below work?
x = np.array([[ 0, 1, 2],[ 3, 4, 5],[ 6, 7, 8],[ 9, 10, 11]])
rows = np.array([[0,0],[3,3]])
cols = np.array([[0,2],[0,2]])
y = x[rows,cols]
This gives the output of :
[[ 0 2]
[ 9 11]]
Can you please explain the logic when using ndarrays as the indexing object? Why does it use a 2D array for both rows and columns? How are the rules different when the indexing object is an ndarray as opposed to a Python list?
We have the following array x
x = np.array([[1, 2], [3, 4], [5, 6]])
And the indices [0, 1, 2] and [0, 1, 0] which when indexed into x like
x[[0,1,2], [0,1,0]]
gives
[1, 4, 5]
The indices that we used basically translate to:
[0, 1, 2] & [0, 1, 0] --> (0,0), (1,1), (2,0)
Since we used 1D lists as indices, we get a 1D array as the result.
With that knowledge, let's see the next case. Now we have the array x as:
x = np.array([[ 0, 1, 2],[ 3, 4, 5],[ 6, 7, 8],[ 9, 10, 11]])
Now the indices are 2D arrays.
rows = np.array([[0,0],[3,3]])
cols = np.array([[0,2],[0,2]])
This when indexed into the array x like:
x[rows,cols]
simply translates to pairing each element of rows with the element of cols at the same position:
rows = [[0,0],[3,3]]
cols = [[0,2],[0,2]]  ====>  (0,0), (0,2), (3,0), (3,2)
Now it's easy to observe how these four index pairs, when used to index the array x, give the following result (here they simply return the corner elements of x):
[[ 0, 2]
[ 9, 11]]
Note that in this case we get the result as a 2D array (as opposed to a 1D array in the first case), since our indices rows & cols were themselves 2D arrays (equivalently, lists of lists), whereas in the first case our indices were 1D arrays (simple lists without any nesting).
So, if you need 2D arrays as result, you need to give 2D arrays as indices.
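A short check of that rule, using the arrays from the question:

import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
rows = np.array([[0, 0], [3, 3]])
cols = np.array([[0, 2], [0, 2]])

y = x[rows, cols]
print(y)        # [[ 0  2]
                #  [ 9 11]]
print(y.shape)  # (2, 2) -- the same shape as the index arrays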
The easiest way to wrap one's head around this is the following observation: The shape of the output is determined by the shape of the index array, or more precisely the shape resulting from broadcasting all the index arrays together.
Look at it like that: you have an array A of a given shape and another array V of some other shape and you want to fill A with values from V. What do you need to specify? Well, for each position in A you need to specify coordinates of some element in V. Therefore if V is ND you need N index arrays of the same shape as A or at least broadcastable to that. Then you index V by putting these index arrays at their coordinate positions in the [] expression.
To stay simple, we'll stay 2D and assume rows.shape = cols.shape. (You can break this rule with broadcasting, but for now we won't). We'll call this shape (I, J)
then y = x[rows, cols] is the same as:
y = np.empty((I, J), dtype=x.dtype)
for i in range(I):
    for j in range(J):
        y[i, j] = x[rows[i, j], cols[i, j]]
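A runnable self-check that the loop really matches fancy indexing, using the question's arrays:

import numpy as np

x = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
rows = np.array([[0, 0], [3, 3]])
cols = np.array([[0, 2], [0, 2]])

I, J = rows.shape
y = np.empty((I, J), dtype=x.dtype)
for i in range(I):
    for j in range(J):
        y[i, j] = x[rows[i, j], cols[i, j]]

assert np.array_equal(y, x[rows, cols])  # the loop and fancy indexing agree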
Given:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[i] gives the ith row (e.g. [1, 2]). How do I access the ith column? (e.g. [1, 3, 5]). Also, would this be an expensive operation?
To access column 0:
>>> test[:, 0]
array([1, 3, 5])
To access row 0:
>>> test[0, :]
array([1, 2])
This is covered in Section 1.4 (Indexing) of the NumPy reference. This is quick, at least in my experience. It's certainly much quicker than accessing each element in a loop.
>>> test[:,0]
array([1, 3, 5])
this command gives you a 1D array; if you just want to loop over it, that's fine, but if you want to hstack it with some other array of dimension 3xN, you will get
ValueError: all the input arrays must have same number of dimensions
while
>>> test[:,[0]]
array([[1],
[3],
[5]])
gives you a column vector, so that you can do concatenate or hstack operation.
e.g.
>>> np.hstack((test, test[:,[0]]))
array([[1, 2, 1],
[3, 4, 3],
[5, 6, 5]])
And if you want to access more than one column at a time you could do:
>>> test = np.arange(9).reshape((3,3))
>>> test
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
>>> test[:,[0,2]]
array([[0, 2],
[3, 5],
[6, 8]])
You could also transpose and return a row:
In [4]: test.T[0]
Out[4]: array([1, 3, 5])
Although the question has been answered, let me mention some nuances.
Let's say you are interested in column 1 (the second column) of the array
arr = numpy.array([[1, 2],
[3, 4],
[5, 6]])
As you already know from other answers, to get it in the form of "row vector" (array of shape (3,)), you use slicing:
arr_col1_view = arr[:, 1]         # creates a view of column 1 of arr
arr_col1_copy = arr[:, 1].copy()  # creates a copy of column 1 of arr
To check if an array is a view or a copy of another array you can do the following:
arr_col1_view.base is arr # True
arr_col1_copy.base is arr # False
see ndarray.base.
Besides the obvious difference between the two (modifying arr_col1_view will affect arr), the number of byte-steps for traversing each of them is different:
arr_col1_view.strides[0]  # 8 bytes (assuming 4-byte ints: skip a whole 2-column row)
arr_col1_copy.strides[0]  # 4 bytes (elements are contiguous)
see strides and this answer.
Why is this important? Imagine that you have a very big array A instead of the arr:
A = np.random.randint(2, size=(10000, 10000), dtype='int32')
A_col1_view = A[:, 1]
A_col1_copy = A[:, 1].copy()
and you want to compute the sum of all the elements of that column, i.e. A_col1_view.sum() or A_col1_copy.sum(). Using the copied version is much faster:
%timeit A_col1_view.sum() # ~248 µs
%timeit A_col1_copy.sum() # ~12.8 µs
This is due to the different number of strides mentioned before:
A_col1_view.strides[0] # 40000 bytes
A_col1_copy.strides[0] # 4 bytes
Although it might seem that using column copies is better, it is not always true, because making a copy takes time too and uses more memory (in this case creating A_col1_copy took me approx. 200 µs). However, if we need the copy in the first place, or we need to do many different operations on a specific column of the array and we are OK with sacrificing memory for speed, then making a copy is the way to go.
In the case where we are interested in working mostly with columns, it could be a good idea to create our array in column-major ('F') order instead of the row-major ('C') order (which is the default), and then do the slicing as before to get a column without copying it:
A = np.asfortranarray(A) # or np.array(A, order='F')
A_col1_view = A[:, 1]
A_col1_view.strides[0] # 4 bytes
%timeit A_col1_view.sum() # ~12.6 µs vs ~248 µs
Now, performing the sum operation (or any other) on a column-view is as fast as performing it on a column copy.
Finally, let me note that transposing an array and using row-slicing is the same as using column-slicing on the original array, because transposing is done by just swapping the shape and the strides of the original array:
A[:, 1].strides[0] # 40000 bytes
A.T[1, :].strides[0] # 40000 bytes
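One way to convince yourself of this is a small self-contained demo (fresh array; names reused for illustration):

import numpy as np

A = np.zeros((3, 2), dtype='int32')
print(A.strides)      # (8, 4): row-major layout
print(A.T.strides)    # (4, 8): same buffer, shape and strides swapped
print(A.T.base is A)  # True: the transpose is a view, no data is copied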
To get several independent columns, just:
>>> test[:, [0, 2]]
and you will get columns 0 and 2.
>>> test
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> ncol = test.shape[1]
>>> ncol
5
Then you can select the 2nd - 4th column this way:
>>> test[0:, 1:(ncol - 1)]
array([[1, 2, 3],
[6, 7, 8]])
This is a 2-dimensional array, so you can slice whichever columns you wish:
test = numpy.array([[1, 2], [3, 4], [5, 6]])
test[:, a:b]  # provide indices in place of a and b, e.g. test[:, 0:2]
I want help with the following problem, please.
Suppose X = [[1, 3, 0, 8],
             [1, 4, 6, 0],
             [2, 0, 7, 8]]
mask = (X != 0)
mask = [[T, T, F, T],
        [T, T, T, F],
        [T, F, T, T]]
X1 = X[(mask,np.newaxis)]
Its output X1 is of shape (9,1)
But i want X1 to be of (3,3), i.e., maintaining the same shape as of X except the masked entries.
X1 = [[1, 3, 8],
      [1, 4, 6],
      [2, 7, 8]]
Can someone help me, please? Thank you.
Every row of X will contain a zero, and I don't want to use reshape(). Here is my working so far:
X= np.array([[1,3,0,8],[1,4,6,0],[2,0,7,8]])
mask = (X!=0)
X1=X[(mask,np.newaxis)]
The output X1 is of shape (9, 1). Is there any way for X1 to be (3, 3), as mentioned?
I think you might want to start with something easier in Python, since your question doesn't even contain correct syntax. I'm hoping this was just a pseudocode attempt. However, here's some code to do the mask you desire.
import numpy as np

X = np.array([1, 3, 0, 8, 1, 4, 6, 0, 2, 0, 7, 8])
indices_we_want = np.where(X > 0)     # indices of the elements of X we want to keep
result = np.take(X, indices_we_want)  # filter by these indices
result = result.reshape(3, 3)         # reshape to the desired result
print(result)
This code could be condensed considerably, but I wanted to show each step as you have in your question for clarity.
As pointed out in the comments section, the reshape typically isn't a good idea unless you somehow know that after filtering out the 0s you'll be left with exactly 9 elements. In the case you described we certainly know this, but for an arbitrary array, not so much.
In [173]: x=[[1,3,0,8],[1,4,6,0],[2,0,7,8]]
In [174]: xa=np.array(x)
solution with reshape:
In [175]: xa[xa!=0].reshape(3,3)
Out[175]:
array([[1, 3, 8],
[1, 4, 6],
[2, 7, 8]])
a solution without reshape:
In [176]: np.array([i[i!=0] for i in xa])
Out[176]:
array([[1, 3, 8],
[1, 4, 6],
[2, 7, 8]])
Obviously both depend on there being exactly one deletion per row.
You aren't deleting a common column; nothing in your code tells numpy that the result will be reshapeable. So boolean indexing operates on the flattened array.
In [177]: xa[xa!=0]
Out[177]: array([1, 3, 8, 1, 4, 6, 2, 7, 8])
In [178]: xa.flat[xa.flat!=0]
Out[178]: array([1, 3, 8, 1, 4, 6, 2, 7, 8])
I could throw in an extra 0, and this indexing would still work the same; but any attempt to reshape the result to 3x3 would then fail.
Keep in mind that the underlying data buffer is flat, 1d, and that it only displays as 2d because of the shape and striding attributes. Selecting elements (or skipping some) will produce a copy, and a 1d copy is just as easy, even faster, than a 2d one. reshape doesn't change the data buffer, just the shape attribute.
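A small illustration of that failure mode, with an extra zero thrown in:

import numpy as np

xb = np.array([[1, 3, 0, 8],
               [0, 4, 6, 0],   # this row now has two zeros
               [2, 0, 7, 8]])

flat = xb[xb != 0]
print(flat.size)  # 8 -- not 9, so flat.reshape(3, 3) would raise ValueError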
I am attempting to create a simple neural network using Python (I know there are libraries, but I'm building a simple one from scratch to get more familiar with each step taken), and one part of it is to calculate the difference between the true label and the predicted label.
I have the true label of dimensions <2059 x 1>, and the predicted label also of <2059 x 1>.
Both are NumPy arrays.
I would expect a simple
l2_error=tag_train-l2
would do the job. (l2 is the predicted label, tag_train is the true label)
but what I got in return is a <2059 x 2059> matrix. It seems like this operation is doing a subtraction of every possible combination of elements. Why would this happen? I know I can probably run a for loop to get the job done, but I'm wondering why the program produces this result.
Both dtypes are float64, btw. I don't think it matters, but just in case this info is needed.
As you indicated in the comments, what is happening is that tag_train is a one-dimensional array of length 2059, whereas l2 is a two-dimensional array with 2059 rows and 1 column.
So when you try to do the subtraction, broadcasting stretches both arrays and produces a two-dimensional array with 2059 rows and 2059 columns.
If you are 100% sure that l2 will only ever have one column, then you can reshape that array to make it one-dimensional before doing the subtraction, like this:
l2.reshape((l2.shape[0],))
Example/Demo -
In [1]: import numpy as np
In [2]: l1 = np.array([1,2,3,4])
In [3]: l2 = np.array([[5],[6],[7],[8]])
In [7]: l2.shape
Out[7]: (4, 1)
In [8]: l2-l1
Out[8]:
array([[4, 3, 2, 1], #Just to show that you get the behaviour when arrays are in
[5, 4, 3, 2], #different dimensions.
[6, 5, 4, 3],
[7, 6, 5, 4]])
In [19]: l2 = l2.reshape((l2.shape[0],))
In [26]: l2-l1
Out[26]: array([4, 4, 4, 4])
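For what it's worth, a couple of alternatives to reshape that also turn an (n, 1) array into shape (n,):

import numpy as np

l2 = np.array([[5], [6], [7], [8]])

l2.ravel()    # array([5, 6, 7, 8]) -- flattens to 1-D
l2.squeeze()  # array([5, 6, 7, 8]) -- drops all length-1 axes
l2[:, 0]      # array([5, 6, 7, 8]) -- explicit column slice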