Pulling elements in order based on first element using key array - python

I'm looking for a vectorized approach for the following problem:
Suppose I have two arrays, one with a bunch of non-contiguous ids in the first column and some data in the remaining columns, and a second array suggesting which datalines I need to pull:
data_array = np.array([[101,4],[102,7],[201,2],[203,9],[403,12]])
key_array = np.array([101,403,201])
The output must stay in the order given by the key_array, leading to the following:
output_array = np.array([[101,4],[403,12],[201,2]])
I can easily do this through a list comprehension:
output_array = np.array([data_array[i==data_array[:,0]][0] for i in key_array])
but this is not a vectorized solution. Using the numpy isin() is very close to working, but does not preserve the given order:
data_array[np.isin(data_array[:,0],key_array)]
#[[101 4]
# [201 2] not the order given by the key_array!
# [403 12]]
I tried to make the above work with argsort(), but haven't been able to get anything working. Any help would be greatly appreciated.

We can use np.searchsorted -
s = data_array[:,0].argsort()
out = data_array[s[np.searchsorted(data_array[:,0],key_array,sorter=s)]]
If the first column of data_array is already sorted, this simplifies to a one-liner -
out = data_array[np.searchsorted(data_array[:,0],key_array)]
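As a rough sketch of the searchsorted approach above, here it is run on the arrays from the question. Note that np.searchsorted returns an insertion point even for a key that is not present, so I have added a membership check. The check and the names ids, pos and found are my own additions, not part of the answer.

```python
import numpy as np

data_array = np.array([[101, 4], [102, 7], [201, 2], [203, 9], [403, 12]])
key_array = np.array([101, 403, 201])

ids = data_array[:, 0]
s = ids.argsort()                                  # order that sorts the id column
pos = np.searchsorted(ids, key_array, sorter=s)    # positions within the sorted ids

# Assumption (not in the answer): guard against keys missing from the id column.
pos = np.clip(pos, 0, len(ids) - 1)                # keep indices in range for the lookup
found = ids[s[pos]] == key_array
if not found.all():
    raise KeyError(f"ids not found: {key_array[~found]}")

out = data_array[s[pos]]                           # rows in key_array order
print(out)
```

Without the check, a missing key would silently return a neighbouring row (or raise an IndexError if it is larger than every id).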

Understanding np.ix_

Code:
import numpy as np
ray = [1,22,33,42,51], [61,71,812,92,103], [113,121,132,143,151], [16,172,183,19,201]
ray = np.asarray(ray)
type(ray)
ray[np.ix_([-2:],[3:4])]
I'd like to use index slicing to get a subarray consisting of the last two rows and the 3rd/4th columns. My current code produces an error.
I'd also like to sum each column. What am I doing wrong? I cannot post a picture of the error because I need at least 10 reputation points.
So you want to make a slice of an array. The most straightforward way to do it is... slicing:
slice = ray[-2:,3:]
or if you want it explicitly
slice = ray[-2:,3:5]
See it explained in Understanding slicing
But if you do want to use np.ix_ for some reason, you need
slice = ray[np.ix_([-2,-1],[3,4])]
You can't use : here, because the [] here don't create a slice. They construct lists, so you must explicitly specify every row number and every column number you want in the result. If there are many consecutive indices, you can use range:
slice = ray[np.ix_(range(-2, 0),range(3, 5))]
And to sum each column:
slice.sum(0)
0 means you want to reduce the 0th dimension (rows) by summation and keep other dimensions (columns in this case).
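To put the two options side by side, here is a small sketch on the array from the question. The names sub_slice and sub_ix are mine (chosen so the built-in slice is not shadowed); everything else follows the answer above.

```python
import numpy as np

ray = np.asarray([[1, 22, 33, 42, 51],
                  [61, 71, 812, 92, 103],
                  [113, 121, 132, 143, 151],
                  [16, 172, 183, 19, 201]])

sub_slice = ray[-2:, 3:5]                   # basic slicing: returns a view of ray
sub_ix = ray[np.ix_([-2, -1], [3, 4])]      # explicit row/column lists: returns a copy

print(np.array_equal(sub_slice, sub_ix))    # both select the same 2x2 block
print(sub_slice.sum(axis=0))                # one sum per column
```

Both expressions pick rows 2 and 3 and columns 3 and 4. The practical difference is that the slice shares memory with ray while the np.ix_ result does not.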

I want to average the third column onwards

I hope you are well. I am new. I am trying to add certain columns, but not all of them, and I need your help.
W=[[77432664,6,2,4,3,4,3],
[6233234,7,3,2,5,3,1],
[3412455221,8,3,2,4,5,5]]
rows=len(W)
columns=len(W[0])
for i in range(rows):
    T=sum(W[i])
    W[i].append(T)
I assume by "add" you mean "sum" and not "insert". If so, then you can use what is called a slice:
for row in W:
    t = sum(row[1:])
    row.append(t)
row[1:] takes all but the first element of the list row. For more information on this syntax, you should google "python slice".
Also notice how I am iterating over rows directly, rather than using an index. This is the most common way to do a loop in Python.
You can take a sub-list in Python by specifying the column range and then sum it. The code below demonstrates summing the columns at indices 2, 3, 4, 5 and 6.
W=[[77432664,6,2,4,3,4,3],
[6233234,7,3,2,5,3,1],
[3412455221,8,3,2,4,5,5]]
rows=len(W)
columns=len(W[0])
for i in range(rows):
    T=sum(W[i][2:7]) # For i=0 this retrieves the sub-list [2,4,3,4,3] and sums it to get T=16
    W[i].append(T)
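I'd suggest using the pandas sum method over axis=1 (one sum per row):
import pandas as pd
# assumption: df is built from the list of lists W in the question
df = pd.DataFrame(W)
# positions of the columns to sum
my_cols_n = [2,3,4,5,6]
# Get column names from their positions
my_cols = [name for idx,name in enumerate(df.columns) if idx in my_cols_n]
# Get Sum
df["my_sum"] = df[my_cols].sum(axis=1)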
To add to @Code-Apprentice's answer, consider using numpy for similar tasks:
import numpy as np
W=[[77432664,6,2,4,3,4,3],
[6233234,7,3,2,5,3,1],
[3412455221,8,3,2,4,5,5]]
W=np.array(W)
>>> print(W[:, 3:].mean(axis=1))
[3.5 2.75 4. ]
As your matrix operations grow more complex, you will quickly see big advantages to using numpy.
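For completeness, here is a sketch of the same idea for the "sum and append" version of the task in numpy. The variable start and the choice to skip only the first (id-like) column are my assumptions; adjust the slice to whichever columns you need.

```python
import numpy as np

W = np.array([[77432664, 6, 2, 4, 3, 4, 3],
              [6233234, 7, 3, 2, 5, 3, 1],
              [3412455221, 8, 3, 2, 4, 5, 5]])

start = 1                                   # assumption: first column to include
row_sums = W[:, start:].sum(axis=1)         # one sum per row, skipping column 0
row_means = W[:, start:].mean(axis=1)       # one mean per row over the same columns

# Append the sums as a new last column (returns a new array; W is unchanged).
W_with_sum = np.column_stack([W, row_sums])
print(row_sums)
print(W_with_sum.shape)
```

Unlike list.append in the loop versions, numpy arrays have a fixed shape, so the extra column has to go into a new array.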

numpy.take range of array elements Python

I have an array of integers.
data = [10,20,30,40,50,60,70,80,90,100]
I want to extract a range of integers from the array and get a smaller array.
data_extracted = [20,30,40]
I tried numpy.take.
data = [10,20,30,40,50,60,70,80,90,100]
start = 1 # index of starting data entry (20)
end = 3 # index of ending data entry (40)
data_extracted = np.take(data,[start:end])
I get a syntax error pointing to the : in numpy.take.
Is there a better way to use numpy.take to store part of an array in a separate array?
You can directly slice the list.
import numpy as np
data = [10,20,30,40,50,60,70,80,90,100]
data_extracted = np.array(data[1:4])
Also, you do not need to use numpy.array, you could just store the data in another list:
data_extracted = data[1:4]
If you want to use numpy.take, you have to pass it a list of the desired indices as second argument:
import numpy as np
data = [10,20,30,40,50,60,70,80,90,100]
data_extracted = np.take(data, [1, 2, 3])
I do not think numpy.take is needed for this application though.
You should just use a slice to get a range of indices. There is no need for numpy.take, which is intended as a shortcut for fancy indexing.
data_extracted = data[1:4]
As others have mentioned, you can use slicing in this case. However, if you need to use np.take because e.g. the axis you're slicing over is variable, you might try:
axis=0
np.take(data, range(1,4), axis=axis)
Note: this might be slower than:
data_extracted = data[1:4]
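To illustrate the variable-axis case, here is a sketch on a small 2-D array. The array and the helper names take_range and slice_range are made up for the example. The second helper shows how to do the same thing with a plain slice, which returns a view instead of a copy.

```python
import numpy as np

data = np.arange(10, 130, 10).reshape(3, 4)   # hypothetical 3x4 example array

def take_range(a, start, stop, axis):
    """Copy of a[start:stop] along `axis`, using np.take."""
    return np.take(a, np.arange(start, stop), axis=axis)

def slice_range(a, start, stop, axis):
    """View of a[start:stop] along `axis`, using a tuple of slice objects."""
    index = [slice(None)] * a.ndim
    index[axis] = slice(start, stop)
    return a[tuple(index)]

rows = take_range(data, 1, 3, axis=0)         # same values as data[1:3, :]
cols = take_range(data, 1, 3, axis=1)         # same values as data[:, 1:3]
cols_view = slice_range(data, 1, 3, axis=1)   # same values, but no copy
```

If the axis is known in advance, writing data[1:3] or data[:, 1:3] directly is simpler than either helper.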

sort an array by row in python

I understood that sorting a numpy array arr by column (for only a particular column, for example, its 2nd column) can be done with:
arr[arr[:,1].argsort()]
How I understood this code sample works: argsort sorts the values of the 2nd column of arr, and gives the corresponding indices as an array. This array is given to arr as row numbers. Am I correct in my interpretation?
Now I wonder what if I want to sort the array arr with respect to the 2nd row instead of the 2nd column? Is the simplest way to transpose the array before sorting it and transpose it back after sorting, or is there a way to do it like previously (by giving an array with the number of the columns we wish to display)?
Instead of doing (n,n)array[(n,)array] (n is the size of the 2d array) I tried to do something like (n,n)array[(n,1)array] to indicate the numbers of the columns but it does not work.
EXAMPLE of what I want:
arr = [[11,25],[33,4]] => base array
arr_col2=[[33,4],[11,25]] => array I got with argsort()
arr_row2=[[25,11],[4,33]] => array I tried to get in a simple way with argsort() but did not succeed
I assume that arr is a numpy array? I haven't seen the syntax arr[:,1] in any other context in python. It would be worth mentioning this in your question!
Assuming this is the case, then you should be using
arr.sort(axis=0)
to sort within each column and
arr.sort(axis=1)
to sort within each row. (Both sort in place, i.e. change the value of arr, and each column or row is sorted independently of the others. If you don't want this, you can copy arr into another variable first and apply sort to that.)
If you want to sort just a single row (in this case, the second one) then
arr[1,:].sort()
works.
Edit: I now understand what problem you are trying to solve. You would like to reorder the columns in the matrix so that the nth row goes in increasing order. You can do this simply by
arr[:,arr[1,:].argsort()]
(where here we're sorting by the 2nd row).
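Here is a quick check of both reorderings on the example array from the question. The names arr_col2 and arr_row2 are taken from the question.

```python
import numpy as np

arr = np.array([[11, 25], [33, 4]])

# Reorder whole rows so that the 2nd column is increasing.
arr_col2 = arr[arr[:, 1].argsort()]

# Reorder whole columns so that the 2nd row is increasing.
arr_row2 = arr[:, arr[1, :].argsort()]

print(arr_col2)
print(arr_row2)
```

In both cases argsort produces a 1-D array of indices. Putting it in the first position picks rows, and putting it after the : picks columns, so no transpose is needed.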

How to apply the output of numpy.argpartition for 2-D Arrays?

I have a largish 2d numpy array, and I want to extract the lowest 10 elements of each row as well as their indexes. Since my array is largish, I would prefer not to sort the whole array.
I heard about the argpartition() function, with which I can get the indexes of the lowest 10 elements:
top10indexes = np.argpartition(myBigArray,10)[:,:10]
Note that argpartition() partitions axis -1 by default, which is what I want. Its result has the same shape as myBigArray and contains indexes into the respective rows such that the first 10 indexes point to the 10 lowest values, which is why I keep only the first 10 columns.
How can I now extract the elements of myBigArray corresponding to those indexes?
Obvious fancy indexing like myBigArray[top10indexes] or myBigArray[:,top10indexes] does something quite different. I could also use list comprehensions, something like:
array([row[idxs] for row,idxs in zip(myBigArray,top10indexes)])
but that would incur a performance hit iterating numpy rows and converting the result back to an array.
nb: I could just use np.partition() to get the values, and they may even correspond to the indexes (or may not..), but I don't want to do the partition twice if I can avoid it.
You can avoid using the flattened copies and the need to extract all the values by doing:
num = 10
top = np.argpartition(myBigArray, num, axis=1)[:, :num]
myBigArray[np.arange(myBigArray.shape[0])[:, None], top]
For NumPy >= 1.9.0 this will be very efficient and comparable to np.take().
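Here is a sketch that runs the indexing above on random stand-in data and compares it with np.take_along_axis, which I believe was added in NumPy 1.15 and does the same row-wise gather. The random array, the seed and the names vals_fancy, vals_taa and expected are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)                        # stand-in data, not from the question
myBigArray = rng.integers(0, 1000, size=(5, 50))

num = 10
top = np.argpartition(myBigArray, num, axis=1)[:, :num]

# Row-wise gather with an explicit broadcast row index, as in the answer above.
vals_fancy = myBigArray[np.arange(myBigArray.shape[0])[:, None], top]

# The same gather written with np.take_along_axis.
vals_taa = np.take_along_axis(myBigArray, top, axis=1)

# The selected values are the num smallest of each row, in no particular order,
# so sort them before comparing against a full sort of each row.
expected = np.sort(myBigArray, axis=1)[:, :num]
print(np.array_equal(np.sort(vals_taa, axis=1), expected))
```

Note that each row must have more than num elements, otherwise argpartition raises an error for the kth argument.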
