I have a 3D array of int32 values. I would like to transform each item in the array into the value of its bit at the n-th position. My current approach is to loop through the whole array, but I think it can be done much more efficiently.
for z in range(0, dim[2]):
    for y in range(0, dim[1]):
        for x in range(0, dim[0]):
            bits = '{0:032b}'.format(array[z][y][x])   # 32-bit string, MSB first
            array[z][y][x] = int(bits[31 - n])         # character at the n-th bit (0-indexed)
Looking forward to your answers.
If you are dealing with large arrays, you are better off using numpy. Applying bitwise operations to a numpy array is much faster than applying them to Python lists.
import numpy as np

a = np.random.randint(1, 65, (2, 2, 2))
print(a)
# array([[[37, 46],
#         [47, 34]],
#
#        [[ 3, 15],
#         [44, 57]]])

print((a >> 1) & 1)
# array([[[0, 1],
#         [1, 1]],
#
#        [[1, 1],
#         [0, 0]]])
Unless there is an intrinsic relation between the different points, you have no choice but to loop over them to discover their current values. So the best you can do will always be O(n^3).
What I don't get, however, is why you go through the hassle of converting a number to a 32-bit string and then back to an int.
If you want to check if the nth bit of a number is set, you would do the following:
power_n = 1 << (n - 1)

for z in range(0, dim[2]):
    for y in range(0, dim[1]):
        for x in range(0, dim[0]):
            array[z][y][x] = 0 if array[z][y][x] & power_n == 0 else 1
Note that in this example, I'm assuming that n is 1-indexed (the first bit is at n=1).
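For completeness, the same 1-indexed test can be vectorized with numpy, in the spirit of the first answer. A minimal sketch (the array arr and its shape here are placeholders of my own):

import numpy as np

arr = np.random.randint(0, 2**31, size=(4, 4, 4), dtype=np.int32)
n = 3                                               # 1-indexed bit position
power_n = 1 << (n - 1)
result = ((arr & power_n) != 0).astype(np.int32)    # 0/1 per element, no Python loop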
I often need to convert a multi-column (or 2D) numpy array into an indicator vector in a stable (i.e., order preserved) manner.
For example, I have the following numpy array:
import numpy as np

arr = np.array([
    [2, 20, 1],
    [1, 10, 3],
    [2, 20, 2],
    [2, 20, 1],
    [1, 20, 3],
    [2, 20, 2],
])
The output I would like to have is:
indicator = [0, 1, 2, 0, 3, 2]
How can I do this (preferably using numpy only)?
Notes:
I am looking for a high performance (vectorized) approach as the arr (see the example above) has millions of rows in a real application.
I am aware of the following auxiliary solutions, but none is ideal. It would be nice to hear experts' opinions.
My thoughts so far:
1. Numpy's unique: This would not work, as it is not stable:
arr_unq, indicator = np.unique(arr, axis=0, return_inverse=True)
print(arr_unq)
# output 1:
# [[ 1 10  3]
#  [ 1 20  3]
#  [ 2 20  1]
#  [ 2 20  2]]
print(indicator)
# output 2:
# [2 0 3 2 1 3]
Notice how the indicator starts from 2. This is because the unique function returns a "sorted" array (see output 1). However, I would like it to start from 0.
Of course I can use LabelEncoder from sklearn to convert the items in a manner that they start from 0 but I feel that there is a simple numpy trick that I can use and therefore avoid adding sklearn dependency to my program.
Or I can resolve this by a dictionary mapping like below, but I can imagine that there is a better or more elegant solution:
dct = {}
for idx, item in enumerate(indicator):
    if item not in dct:
        dct[item] = len(dct)
    indicator[idx] = dct[item]
print(indicator)
# outputs:
# [0 1 2 0 3 2]
2. Stabilizing numpy's unique output: This solution is already posted on stackoverflow and correctly returns a stable unique array. But I do not know how to convert the returned indicator vector (returned when return_inverse=True) to represent the values in a stable order starting from 0.
3. Pandas's get_dummies function: it returns a one-hot encoding (a matrix of indicator values). In contrast, I would like to have an indicator vector. It is indeed possible to convert the one-hot encoding to the indicator vector with a few lines of code and data manipulation, but that approach is not going to be highly efficient.
In addition to return_inverse, you can add the return_index option. This will tell you the first occurrence of each sorted item:
unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
Now you can use the fact that np.argsort is effectively its own inverse to fix the order. Note that idx.argsort() is the permutation that reorders unq into stable (first-occurrence) order. The corrected result is therefore
indicator = idx.argsort().argsort()[inv]
And of course the byproduct
unq = unq[idx.argsort()]
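Putting it all together on the arr from the question, a minimal sketch (the intermediate names perm and indicator are my own):

import numpy as np

arr = np.array([[2, 20, 1], [1, 10, 3], [2, 20, 2],
                [2, 20, 1], [1, 20, 3], [2, 20, 2]])

unq, idx, inv = np.unique(arr, axis=0, return_index=True, return_inverse=True)
perm = idx.argsort()               # reorders unq into first-occurrence order
indicator = perm.argsort()[inv]    # relabels the inverse accordingly
print(indicator)                   # [0 1 2 0 3 2]
print(unq[perm])                   # unique rows in order of first appearance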
Of course, nothing about these operations is specific to 2D arrays.
A Note on the Intuition
Let's say you have an array x:
x = np.array([7, 3, 0, 1, 4])
x.argsort() is the index array that tells you which element of x is placed at each location in the sorted array. So
i = x.argsort() # 2, 3, 1, 4, 0
But how would you get from np.sort(x) back to x (which is the problem you express in #2)?
Well, it happens that i tells you the original position of each element in the sorted array: the first (smallest) element was originally at index 2, the second at 3, ..., the last (largest) element was at index 0. This means that to place np.sort(x) back into its original order, you need the index that puts i into sorted order. That means that you can write x as
np.sort(x)[i.argsort()]
Which is equivalent to
x[i][i.argsort()]
OR
x[x.argsort()][x.argsort().argsort()]
So, as you can see, np.argsort is effectively its own inverse: argsorting something twice gives you the index to put it back in the original order.
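A quick check of these identities on the example x (a small verification sketch):

import numpy as np

x = np.array([7, 3, 0, 1, 4])
i = x.argsort()                                        # [2 3 1 4 0]
assert np.array_equal(np.sort(x)[i.argsort()], x)
assert np.array_equal(x[i][i.argsort()], x)
assert np.array_equal(x[x.argsort()][x.argsort().argsort()], x)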
I'm trying to create a function that will calculate the lattice distance (number of horizontal and vertical steps) between elements in a multi-dimensional numpy array. For this I need to retrieve the actual numbers from the indexes of each element as I iterate through the array. I want to store those values as numbers that I can run through a distance formula.
For the example array A
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
I'd like to create a loop that iterates through each element and for the first element 1 it would retrieve a=0, b=0 since 1 is at A[0,0], then a=0, b=1 for element 2 as it is located at A[0,1], and so on...
My envisioned output is two numbers (corresponding to the two index values for that element) for each element in the array. So in the example above, it would be the two values that I am assigning to be a and b. I only will need to retrieve these two numbers within the loop (rather than save separately as another data object).
Any thoughts on how to do this would be greatly appreciated!
As I've become more familiar with the numpy and pandas ecosystem, it's become clearer to me that iteration is usually outright wrong due to how slow it is by comparison, and that writing vectorized operations is best whenever possible. Though the style is not as obvious/Pythonic at first, I've (anecdotally) gained ridiculous speedups with vectorized operations; more than 1000x in one case, from swapping out row iteration with .apply(lambda).
@MSeifert's answer provides this much better and will be significantly more performant on a dataset of any real size.
A more general answer by @cs95 covers and compares alternatives to iteration in pandas.
Original Answer
You can iterate through the values in your array with numpy.ndenumerate to get the indices of the values in your array.
Using the documentation above:
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

for index, values in np.ndenumerate(A):
    print(index, values)  # operate here
You can do it using np.ndenumerate but generally you don't need to iterate over an array.
You can simply create a meshgrid (or open grid) to get all indices at once and you can then process them (vectorized) much faster.
For example
>>> x, y = np.mgrid[slice(A.shape[0]), slice(A.shape[1])]
>>> x
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2]])
>>> y
array([[0, 1, 2],
[0, 1, 2],
[0, 1, 2]])
and these can be processed like any other array. So if your function that needs the indices can be vectorized you shouldn't do the manual loop!
For example to calculate the lattice distance for each point to a point say (2, 3):
>>> abs(x - 2) + abs(y - 3)
array([[5, 4, 3],
[4, 3, 2],
[3, 2, 1]])
For distances an ogrid would be faster. Just replace np.mgrid with np.ogrid:
>>> x, y = np.ogrid[slice(A.shape[0]), slice(A.shape[1])]
>>> np.hypot(x - 2, y - 3) # cartesian distance this time! :-)
array([[ 3.60555128, 2.82842712, 2.23606798],
[ 3.16227766, 2.23606798, 1.41421356],
[ 3. , 2. , 1. ]])
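If you eventually need the lattice distance between every pair of elements, rather than to one fixed point, the same no-loop idea extends naturally. A sketch of my own along those lines:

import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
coords = np.indices(A.shape).reshape(2, -1).T   # (row, col) of every element
# pairwise Manhattan (lattice) distances, shape (A.size, A.size)
dists = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)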
Another possible solution:
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

for _, val in np.ndenumerate(A):
    ind = np.argwhere(A == val)
    print(val, ind)
In this case you will obtain the array of all indexes at which the value occurs, which also covers values that appear in the array more than once.
I am wondering what the most concise and pythonic way is to keep only the maximum element in each row of a 2D numpy array while setting all other elements to zero. Example:
given the following numpy array:
a = [[ 1, 8,  3,  6],
     [ 5, 5, 60,  1],
     [63, 9,  9, 23]]
I want the answer to be:
b = [[ 0, 8,  0, 0],
     [ 0, 0, 60, 0],
     [63, 0,  0, 0]]
I can think of several ways to solve that, but what interests me is whether there are python functions to do this quickly.
Thank you in advance
You can use np.max to take the maximum along one axis, then use np.where to zero out the non-maximal elements:
np.where(a == a.max(axis=1, keepdims=True), a, 0)
The keepdims=True argument keeps the singleton dimension after taking the max (i.e. so that a.max(1, keepdims=True).shape == (3, 1)), which simplifies broadcasting it against a.
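Applied to the a from the question:

import numpy as np

a = np.array([[ 1, 8,  3,  6],
              [ 5, 5, 60,  1],
              [63, 9,  9, 23]])

b = np.where(a == a.max(axis=1, keepdims=True), a, 0)
# [[ 0  8  0  0]
#  [ 0  0 60  0]
#  [63  0  0  0]]

Note that if a row's maximum occurs more than once, all occurrences are kept.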
I don't know what counts as pythonic, so I'll assume the way with the most Python-specific grammar is pythonic. This uses two nested list comprehensions, a feature of Python, though done this way it might not be that concise.
b = [[y if y == max(x) else 0 for y in x] for x in a]
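For example, with the a from the question:

a = [[1, 8, 3, 6], [5, 5, 60, 1], [63, 9, 9, 23]]
b = [[y if y == max(x) else 0 for y in x] for x in a]
# [[0, 8, 0, 0], [0, 0, 60, 0], [63, 0, 0, 0]]

As with the numpy version, ties for the row maximum are all kept.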
Suppose we start with an integer numpy array with integers between 0 and 99, i.e.
x = np.array([[1, 2, 3, 1], [10, 5, 0, 2]], dtype=int)
Now we want to represent rows in this array with a single unique value. One simple way to do this is representing it as a floating number. An intuitive way to do this is
rescale = np.power(10, np.arange(0, 2 * x.shape[1], 2)[::-1], dtype=float)
codes = np.dot(x, rescale)
where we exploit that the integers have at most 2 digits. (I'm casting rescale to float to avoid exceeding the maximum value of int in case the rows of x have more entries; this is not very elegant.)
This returns
array([ 1020301., 10050002.])
How can this process be reversed to obtain x again?
I'm thinking of converting codes to strings, then splitting each string every 2 characters. I'm not too familiar with these string operations, especially when they have to be executed on all entries of an array simultaneously. A problem is also that the first number has a varying number of digits, so leading zeros have to be added in some way.
Maybe something simpler is possible using some divisions or rounding, or perhaps representing the rows of the array in a different manner. What is important is that at least the initial conversion is fast and vectorized.
Suggestions are welcome.
First, you need to find the correct number of columns:
from math import ceil, floor, log

number_of_cols = max(ceil(log(v, 100)) for v in codes)
Note that if your first column is always 0, then there is no way with your code to know it even existed: [[0, 1], [0, 2]] -> [1., 2.] -> [[1], [2]] or [[0, 0, 0, 1], [0, 0, 0, 2]]. It might be something to consider.
Anyway, here is a mockup of the string approach:
def decode_with_string(codes):
    number_of_cols = max(ceil(log(v, 100)) for v in codes)
    str_format = '{:0%dd}' % (2 * number_of_cols)           # prepare to format numbers as strings
    return [[int(str_format.format(int(code))[2*i:2*i+2])   # extract the wanted digits
             for i in range(number_of_cols)]                # for all columns
            for code in codes]                               # for all rows
But you can also compute the numbers directly:
def decode_direct(codes):
    number_of_cols = max(ceil(log(v, 100)) for v in codes)
    return [[floor(code / (100 ** index)) % 100
             for index in range(number_of_cols - 1, -1, -1)]
            for code in codes]
Example:
>>> codes = [ 1020301., 10050002.]
>>> number_of_cols = max(ceil(math.log(v, 100)) for v in codes)
>>> print(number_of_cols)
4
>>> print(decode_with_string(codes))
[[1, 2, 3, 1], [10, 5, 0, 2]]
>>> print(decode_direct(codes))
[[1, 2, 3, 1], [10, 5, 0, 2]]
Here is a numpy solution:
>>> codes = np.asarray(codes)  # the numpy version expects an ndarray
>>> divisors = np.power(0.01, np.arange(number_of_cols - 1, -1, -1))
>>> x = np.mod(np.floor(divisors * codes.reshape((codes.shape[0], 1))), 100)
Finally, you say you use float in case of int overflow. First, the mantissa of floating-point numbers is also limited, so you don't eliminate the risk of overflow. Second, in Python 3, integers actually have unlimited precision.
You could exploit the fact that numpy stores its arrays as contiguous blocks in memory. So storing the memory block as a byte string and remembering the shape of the array should be sufficient:
import numpy as np

x = np.array([[1, 2, 3, 1], [10, 5, 0, 2]], dtype=np.uint8)  # 8 bits are enough for 2 digits
x_sh = x.shape

# flatten the array and convert it to a byte string
# (tobytes/frombuffer are the modern spellings of the deprecated tostring/fromstring)
xs = x.ravel().tobytes()

# convert back and reshape:
y = np.reshape(np.frombuffer(xs, np.uint8), x_sh)
The reason for flattening the array first is that you don't need to pay attention to the storage order of 2D arrays (C or Fortran order). Of course, you could also generate a string for each row separately:
import numpy as np

x = np.array([[1, 2, 3, 1], [10, 5, 0, 2]], dtype=np.uint8)  # 8 bits are enough for 2 digits

# conversion:
xss = [xr.tobytes() for xr in x]

# conversion back:
y = np.array([np.frombuffer(xs, np.uint8) for xs in xss])
Since your numbers are between 0 and 99, you should rather pad them to exactly 2 digits: 0 becomes "00", 5 becomes "05" and 50 becomes "50". That way, all you need to do is repeatedly divide your number by 100 and you'll get the values. Your encoding will also be smaller, since every number is encoded in 2 digits instead of the 2-3 you currently use.
If you want to be able to detect [0, 0, 0] (which is currently indistinguishable from [0] or [0, ..., 0]) as well, add a 1 in front of your number: 1000000 is [0, 0, 0] and 100 is [0]. When your division returns 1, you know you've finished.
You can easily construct a string with that information and cast it to a number afterwards.
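A minimal sketch of that sentinel-digit scheme (the helper names encode_row/decode_row are my own; plain Python ints sidestep the float-overflow concern):

def encode_row(row):
    code = 1                        # leading sentinel 1 so leading zeros survive
    for v in row:
        code = code * 100 + v       # each value occupies exactly 2 digits
    return code

def decode_row(code):
    values = []
    while code > 1:                 # the sentinel 1 marks the end
        code, v = divmod(code, 100)
        values.append(v)
    return values[::-1]

print(encode_row([0, 0, 0]))        # 1000000
print(decode_row(1000000))          # [0, 0, 0]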
I have a 2D numpy array that I need to take the max of along a specific axis. I then need to know which indexes were selected for this operation, as a mask for another operation that is only done at those same indexes, but on another array of the same shape.
Right now I'm doing it by using 2D array indexing, but it's slow and kind of convoluted, particularly the mgrid hack to generate the row indexes. It's just [0, 1] for this example, but I need the robustness to work with arbitrary shapes.
a = np.array([[0, 0, 5], [0, 0, 5]])
b = np.array([[1, 1, 1], [1, 1, 1]])

columnIndexes = np.argmax(a, axis=1)
rowIndexes = np.mgrid[0:a.shape[0], 0:columnIndexes.size - 1][0].flatten()
b[rowIndexes, columnIndexes] = b[rowIndexes, columnIndexes] + 1
b should now be array([[1, 1, 2], [1, 1, 2]]), since the operation was performed on b only at the indexes of the max along the columns of a.
Anyone know a better way? Preferably using just boolean masking arrays so that I can port this code to run on a GPU without too much hassle. Thanks!
I will suggest an answer but with slightly different data.
c = np.array([[0, 1, 1], [2, 1, 0]])       # note that this data has duplicate maxima in row 1
d = np.array([[0, 10, 10], [20, 10, 0]])   # data to be changed

c_argmax = np.argmax(c, axis=1)[:, np.newaxis]
b_map1 = c_argmax == np.arange(c.shape[1])

# now use the bool map as you described
d[b_map1] += 1
d
[out]
array([[ 0, 11, 10],
       [21, 10,  0]])
Note that I created an original with a duplicate of the largest number. The above works with argmax as you requested, but you might have wanted to increment all max values instead, as in:
c_max = np.max(c, axis=1)[:, np.newaxis]
b_map2 = c_max == c
d[b_map2] += 1
d
[out]
array([[ 0, 12, 11],
       [22, 10,  0]])
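For reference, applying the same boolean-mask idea back to the original a and b from the question (a small sketch) gives the expected result:

import numpy as np

a = np.array([[0, 0, 5], [0, 0, 5]])
b = np.array([[1, 1, 1], [1, 1, 1]])

mask = np.argmax(a, axis=1)[:, np.newaxis] == np.arange(a.shape[1])
b[mask] += 1
# b is now array([[1, 1, 2],
#                 [1, 1, 2]])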