Numpy equivalents of Ruby array functions

I'm working on an assignment to practice data preprocessing (equal-width binning in this case), but I'm not familiar with the relevant numpy functions, so my Python code is kinda ugly:
import numpy as np

def eq_width_bin(data, bins):
    bin_edge = np.linspace(np.min(data), np.max(data), bins + 1)
    bin_edge[-1] += 1     # so the max value falls in the last bin
    re = []
    for i in data:
        for j in bin_edge:
            if i < j:
                re.append(int(np.argwhere(bin_edge == j)) - 1)
                break
    return re
data = np.array([80, 95, 70, 30, 20, 10, 75, 65, 98, 103, 130, 70])
print("After equal width binning:\n{}".format(eq_width_bin(data, 3)))
However, in Ruby I can do it in fewer than 10 lines (even though this version is kinda slow):
def eq_width_bin(data, bins)
  bin_edge = bins.times.collect { |i| data.min + (data.max - data.min) / bins * i } << data.max + 1
  return data.collect { |i| bin_edge.index { |j| i < j } - 1 }
end
data = [80, 95, 70, 30, 20, 10, 75, 65, 98, 103, 130, 70]
puts "After equal width binning:\n#{eq_width_bin(data, 3)}"
I often use .select .collect .inject .sort_by to deal with arrays in Ruby, so are there any numpy functions I can use to "beautify" my Python code above? (Especially since numpy's built-in functions are much faster than doing it in pure Python.)

Initially this looked like a bincount or histogram problem, but the desired output is the bin that each value falls in, not the number of items per bin:
In [3]: eq_width_bin(data,3)
Out[3]: [1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 2, 1]
Your bins:
In [10]: np.linspace(np.min(data),np.max(data),4)
Out[10]: array([ 10., 50., 90., 130.])
We can identify the bin for each value with a simple integer division:
In [12]: (data-10)//40
Out[12]: array([1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 3, 1])
and correct the out-of-range 3 (from the maximum value, 130) with:
In [16]: np.minimum((data-10)//40,2)
Out[16]: array([1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 2, 1])
But that doesn't answer your question about .select .collect .inject .sort_by. Offhand I'm not familiar with those (though I was a fan of Squeak years ago, and dabbled in Ruby a bit). They sound more like iterators, such as those collected in itertools.
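For reference, the closest plain-Python equivalents of those Ruby methods (an aside, not NumPy; the sample values are just illustrative):

from functools import reduce

nums = [3, 1, 2]
[x for x in nums if x > 1]          # .select  -> filter / list comprehension
[x * 2 for x in nums]               # .collect -> map / list comprehension
reduce(lambda a, b: a + b, nums)    # .inject  -> functools.reduce
sorted(nums, key=lambda x: -x)      # .sort_by -> sorted with a key function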
With numpy we don't usually take an iterative approach. Rather we try to look at the arrays as a whole, doing things like division and min/max for the whole thing.
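Putting that whole-array idea together, a minimal vectorized sketch of eq_width_bin (np.clip plays the role of the np.minimum above; same data as in the question):

import numpy as np

def eq_width_bin(data, bins):
    lo, hi = data.min(), data.max()
    width = (hi - lo) / bins                 # width of each equal bin
    # integer division picks a bin; clip keeps the max value in the last bin
    return np.clip(((data - lo) // width).astype(int), 0, bins - 1)

data = np.array([80, 95, 70, 30, 20, 10, 75, 65, 98, 103, 130, 70])
print(eq_width_bin(data, 3))   # [1 2 1 0 0 0 1 1 2 2 2 1]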
===
searchsorted also works for this problem:
In [19]: np.searchsorted(Out[10],data)
Out[19]: array([2, 3, 2, 1, 1, 0, 2, 2, 3, 3, 3, 2])
In [21]: np.maximum(0,np.searchsorted(Out[10],data)-1)
Out[21]: array([1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 2, 1])
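np.digitize is the purpose-built function for this kind of binning; a minimal sketch of the same computation (the np.minimum again handles the maximum value landing past the last edge):

In [22]: np.minimum(np.digitize(data, np.linspace(np.min(data), np.max(data), 4)) - 1, 2)
Out[22]: array([1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 2, 1])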
A (possibly) cleaner expression of your Python loop:
def foo(i, edges):
    for j, n in enumerate(edges):
        if i < n:
            return j - 1
    return j - 1
In [34]: edges = np.linspace(np.min(data),np.max(data),4).tolist()
In [35]: [foo(i,edges) for i in data]
Out[35]: [1, 2, 1, 0, 0, 0, 1, 1, 2, 2, 2, 1]
I converted edges to a list, because it's faster to iterate that way.
With itertools:
In [55]: [len(list(itertools.takewhile(lambda x: x<i,edges)))-1 for i in data]
Out[55]: [1, 2, 1, 0, 0, -1, 1, 1, 2, 2, 2, 1]
(Note the -1 for the minimum value 10: the strict x < i stops at the very first edge, so that end needs cleaning too.)
===
Another approach
In [45]: edges = np.linspace(np.min(data),np.max(data),4)
In [46]: data[:,None] < edges
Out[46]:
array([[False, False,  True,  True],
       [False, False, False,  True],
       [False, False,  True,  True],
       [False,  True,  True,  True],
       [False,  True,  True,  True],
       [False,  True,  True,  True],
       [False, False,  True,  True],
       [False, False,  True,  True],
       [False, False, False,  True],
       [False, False, False,  True],
       [False, False, False, False],
       [False, False,  True,  True]])
In [47]: np.argmax(data[:,None]<edges, axis=1)-1
Out[47]: array([ 1, 2, 1, 0, 0, 0, 1, 1, 2, 2, -1, 1])
That -1 needs cleaning (the row where there's no True).
edit
Lists have an index method; with that we can get an expression that's a lot like your last Ruby line. Looks like list comprehension is a lot like the Ruby collect:
In [88]: [[e < v for e in edges].index(False) - 1 for v in data]
Out[88]: [1, 2, 1, 0, 0, -1, 1, 1, 2, 2, 2, 1]
(Same caveat as the itertools version: the minimum value maps to -1.)

Related

ValueError with union of two arrays using the OR operator

I am using Python and numpy where I have a couple of numpy arrays of the same shape, and I am trying to create a union of these arrays. These arrays contain only 0s and 1s, and basically I want to merge them into a new array using the OR operation. So, I do the following:
import numpy as np
segs = list()
a = np.ones((10, 10)).astype('uint8')
b = np.zeros((10, 10)).astype('uint8')
segs.append(a)
segs.append(b)
mask = np.asarray([any(tup) for tup in zip(*segs)]).astype('uint8')
With the last statement I get the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
If I use np.any, somehow my array shape is now just (10,). How can I create this merge without explicitly looping through the arrays?
EDIT
mask = np.asarray([any(tup) for tup in zip(segs)]).astype('uint8')
also results in the same error.
Your segs is a list of 2 arrays:
In [25]: segs = [np.ones((3,6),'uint8'), np.zeros((3,6),'uint8')]
In [26]: [tup for tup in zip(*segs)]
Out[26]:
[(array([1, 1, 1, 1, 1, 1], dtype=uint8),
array([0, 0, 0, 0, 0, 0], dtype=uint8)),
(array([1, 1, 1, 1, 1, 1], dtype=uint8),
array([0, 0, 0, 0, 0, 0], dtype=uint8)),
(array([1, 1, 1, 1, 1, 1], dtype=uint8),
array([0, 0, 0, 0, 0, 0], dtype=uint8))]
The zip produces tuples of 1d arrays (pairing rows of the two arrays). Python's any applied to a multi-element array gives the ambiguity error; the same is true of other plain-Python boolean contexts such as if and or, which expect a single scalar True/False.
You tried np.any - that turns the tuple of arrays into a 2d array, but without an axis parameter it works on the flattened version and returns a scalar True/False. With an axis parameter we can apply the any across rows:
In [27]: [np.any(tup, axis=0) for tup in zip(*segs)]
Out[27]:
[array([ True, True, True, True, True, True]),
array([ True, True, True, True, True, True]),
array([ True, True, True, True, True, True])]
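In fact the list comprehension isn't needed at all: np.any will stack the list itself and reduce across that first axis (a sketch, using the same segs):

In [28]: np.any(segs, axis=0).astype('uint8')
Out[28]:
array([[1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1]], dtype=uint8)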
Using the logical_or ufunc as suggested in a comment:
In [31]: np.logical_or(segs[0],segs[1])
Out[31]:
array([[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True]])
In [32]: np.logical_or.reduce(segs)
Out[32]:
array([[ True, True, True, True, True, True],
[ True, True, True, True, True, True],
[ True, True, True, True, True, True]])
Using the '|' operator isn't quite the same:
In [33]: segs[0] | segs[1]
Out[33]:
array([[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1]], dtype=uint8)
It uses the segs[0].__or__(segs[1]) method, which for ndarrays is np.bitwise_or. Applied to bool arrays that's a logical OR, but applied to uint8 (or other integer) values it's a bitwise OR, which for 0/1 data produces the same pattern while keeping the integer dtype (which is why it only looks like a max).
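To get back the uint8 mask the question asked for, any of these results can simply be cast (a sketch):

In [34]: np.logical_or.reduce(segs).astype('uint8')
Out[34]:
array([[1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1]], dtype=uint8)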

How to zero out row slice in 2 dimensional array in numpy?

I have a numpy array representing an image. I want to zero out all indexes that are below a certain row in each column (based on external data). I can't seem to figure out how to slice/broadcast/arrange the data to do this "the numpy way".
def first_nonzero(arr, axis, invalid_val=-1):
    mask = arr != 0
    return np.where(mask.any(axis=axis), mask.argmax(axis=axis), invalid_val)

# Find first non-zero pixels in a processed image
# Note, I might have my axes switched here... I'm not sure.
rows_to_zero = first_nonzero(processed_image, 0, processed_image.shape[1])

# zero out data in image below the rows found
# This is the part I'm stuck on.
image[:, :rows_to_zero, :] = 0  # How can I slice along an array of indexes?

# Or in plain python, I'm trying to do this:
for x in range(image.shape[0]):
    for y in range(rows_to_zero, image.shape[1]):
        image[x, y] = 0
Create a mask leveraging broadcasting and assign -
mask = rows_to_zero <= np.arange(image.shape[0])[:,None]
image[mask] = 0
Or multiply with the inverted mask: image *= ~mask.
Sample run to showcase mask setup -
In [56]: processed_image
Out[56]:
array([[1, 0, 1, 0],
[1, 0, 1, 1],
[0, 1, 1, 0],
[0, 1, 0, 1],
[1, 1, 1, 1],
[0, 1, 0, 1]])
In [57]: rows_to_zero
Out[57]: array([0, 2, 0, 1])
In [58]: rows_to_zero <= np.arange(processed_image.shape[0])[:,None]
Out[58]:
array([[ True, False, True, False],
[ True, False, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True],
[ True, True, True, True]], dtype=bool)
Also, for setting on a per-column basis, I think you meant:
rows_to_zero = first_nonzero(processed_image, 0, processed_image.shape[0]-1)
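A small end-to-end sketch of that per-column case (hypothetical data; image here is just a stand-in array of the same shape):

import numpy as np

def first_nonzero(arr, axis, invalid_val=-1):
    mask = arr != 0
    return np.where(mask.any(axis=axis), mask.argmax(axis=axis), invalid_val)

processed_image = np.array([[0, 0, 1, 0],
                            [1, 0, 1, 1],
                            [0, 1, 1, 0]])
image = np.arange(1, 13).reshape(3, 4)    # stand-in for the real image

rows_to_zero = first_nonzero(processed_image, 0, processed_image.shape[0] - 1)  # [1 2 0 1]
mask = rows_to_zero <= np.arange(image.shape[0])[:, None]
image[mask] = 0
print(image)
# [[1 2 0 4]
#  [0 6 0 0]
#  [0 0 0 0]]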
If you meant to zero out on a per-row basis, you would have the first non-zero indices per row; let's call them idx. So, then do -
mask = idx[:,None] <= np.arange(image.shape[1])
image[mask] = 0
Sample run -
In [77]: processed_image
Out[77]:
array([[1, 0, 1, 0],
[1, 0, 1, 1],
[0, 1, 1, 0],
[0, 1, 0, 1],
[1, 1, 1, 1],
[0, 1, 0, 1]])
In [78]: idx = first_nonzero(processed_image, 1, processed_image.shape[1]-1)
In [79]: idx
Out[79]: array([0, 0, 1, 1, 0, 1])
In [80]: idx[:,None] <= np.arange(image.shape[1])
Out[80]:
array([[ True, True, True, True],
[ True, True, True, True],
[False, True, True, True],
[False, True, True, True],
[ True, True, True, True],
[False, True, True, True]], dtype=bool)

Python - How to generate the Pairwise Hamming Distance Matrix

Beginner with Python here. I'm having trouble calculating the pairwise Hamming distance matrix between the rows of a binary input matrix, using only the numpy library. I'm supposed to avoid loops and use vectorization. If for instance I have something like:
[ 1, 0, 0, 1, 1, 0]
[ 1, 0, 0, 0, 0, 0]
[ 1, 1, 1, 1, 0, 0]
The matrix should be something like:
[ 0, 2, 3]
[ 2, 0, 3]
[ 3, 3, 0]
i.e. if the original matrix is A and the Hamming distance matrix is B, then B[0,1] = hamming_distance(A[0], A[1]). In this case the answer is 2, as those rows differ in only two elements.
So for my code is something like this
def compute_HammingDistance(X):
    hammingDistanceMatrix = np.zeros(shape=(len(X), len(X)))
    hammingDistanceMatrix = np.count_nonzero(X[:, :, None] != X[:, :, None].T)
    return hammingDistanceMatrix
However it seems to just be returning a scalar value instead of the intended matrix. I know I'm probably doing something wrong with the array/vector broadcasting but I can't figure out how to fix it. I've tried using np.sum instead of np.count_nonzero but they all pretty much gave me something similar.
Try this approach: create a new axis at position 1, then broadcast the comparison and count the Trues (the non-zeros) with sum:
(arr[:, None, :] != arr).sum(2)
# array([[0, 2, 3],
# [2, 0, 3],
# [3, 3, 0]])
def compute_HammingDistance(X):
    return (X[:, None, :] != X).sum(2)
Explanation:
1) Create a 3d array with shape (3, 1, 6):
arr[:, None, :]
#array([[[1, 0, 0, 1, 1, 0]],
# [[1, 0, 0, 0, 0, 0]],
# [[1, 1, 1, 1, 0, 0]]])
2) The original 2d array has shape (3, 6):
arr
#array([[1, 0, 0, 1, 1, 0],
# [1, 0, 0, 0, 0, 0],
# [1, 1, 1, 1, 0, 0]])
3) The comparison triggers broadcasting since the shapes don't match: the 2d array arr is first broadcast along axis 0 of the 3d array arr[:, None, :], and then each (1, 6) row is broadcast against (3, 6). The two broadcasting steps together make a Cartesian comparison of the original array's rows.
arr[:, None, :] != arr
#array([[[False, False, False, False, False, False],
# [False, False, False, True, True, False],
# [False, True, True, False, True, False]],
# [[False, False, False, True, True, False],
# [False, False, False, False, False, False],
# [False, True, True, True, False, False]],
# [[False, True, True, False, True, False],
# [False, True, True, True, False, False],
# [False, False, False, False, False, False]]], dtype=bool)
4) The sum along the third axis counts how many elements are not equal (i.e. the Trues), which gives the Hamming distance.
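As an aside, the np.count_nonzero the question attempted also works once it is given an axis (axis support for count_nonzero was added in NumPy 1.12); a sketch:

import numpy as np

def compute_HammingDistance(X):
    # axis=2 counts mismatches per row pair instead of over the whole array
    return np.count_nonzero(X[:, None, :] != X, axis=2)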
For reasons I do not understand this
(2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2)
appears to be much faster than @Psidom's for larger arrays:
a = np.random.randint(0,2,(100,1000))
timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 2.297890231013298
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.10616962902713567
@Psidom's is a bit faster for the very small example:
a
# array([[1, 0, 0, 1, 1, 0],
# [1, 0, 0, 0, 0, 0],
# [1, 1, 1, 1, 0, 0]])
timeit(lambda: (a[:, None, :] != a).sum(2), number=100)
# 0.0004370050155557692
timeit(lambda: (2 * np.inner(a-0.5, 0.5-a) + a.shape[1] / 2), number=100)
# 0.00068191799800843
Update
Part of the reason appears to be floats being faster than other dtypes:
timeit(lambda: (0.5 * np.inner(2*a-1, 1-2*a) + a.shape[1] / 2), number=100)
# 0.7315902590053156
timeit(lambda: (0.5 * np.inner(2.0*a-1, 1-2.0*a) + a.shape[1] / 2), number=100)
# 0.12021801102673635
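As a side note on why the formula counts mismatches: for 0/1 entries, (x - 0.5)(0.5 - y) is +0.25 exactly when x != y and -0.25 when x == y, so the inner product comes to d/2 - n/4 for each pair of rows, and 2 * inner + n/2 recovers the distance d. A quick sketch to verify:

import numpy as np

a = np.random.randint(0, 2, (5, 8))
direct = (a[:, None, :] != a).sum(2)
trick = 2 * np.inner(a - 0.5, 0.5 - a) + a.shape[1] / 2
print(np.allclose(direct, trick))   # True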

How to obtain the same result as numpy.where over a 2D array without getting 2 indices from the same row

I have a numpy array with booleans:
bool_array.shape
Out[84]: (78, 8)
bool_array.dtype
Out[85]: dtype('bool')
And I would like to find the indices where the second dimension is True:
bool_array[30:35]
Out[87]:
array([[False, False, False, False, True, False, False, False],
[ True, False, False, False, True, False, False, False],
[False, False, False, False, False, True, False, False],
[ True, False, False, False, False, False, False, False],
[ True, False, False, False, False, False, False, False]], dtype=bool)
I have been using numpy.where to do this, but sometimes there is more than one True index along the second dimension.
I would like to find a way to obtain the same result as numpy.where but avoiding to have 2 indices from the same row:
np.where(bool_array)[0][30:35]
Out[88]: array([30, 31, 31, 32, 33])
I currently solve this by looping over the result of numpy.where, finding which indices repeat the previous one, and using numpy.delete to remove the unwanted ones.
I would like to know if there is a more directly way to obtain the kind of results that I want.
Notes:
- The rows of the boolean arrays that I use always have at least 1 True value.
- I don't care which one of the multiple True values remains; I only care to have just 1.
IIUC, and given the fact that there is at least one True element per row, you can simply use np.argmax along the second axis to select the first True element of each row, like so -
col_idx = bool_array.argmax(1)
Sample run -
In [246]: bool_array
Out[246]:
array([[ True, True, True, True, False],
[False, False, True, True, False],
[ True, True, False, False, True],
[ True, True, False, False, True]], dtype=bool)
In [247]: np.where(bool_array)[0]
Out[247]: array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
In [248]: np.where(bool_array)[1]
Out[248]: array([0, 1, 2, 3, 2, 3, 0, 1, 4, 0, 1, 4])
In [249]: bool_array.argmax(1)
Out[249]: array([0, 2, 0, 0])
Explanation -
Corresponding to the duplicates from the output of np.where(bool_array)[0], i.e. :
array([0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
, we need to select any one from the output of np.where(bool_array)[1], i.e.:
array([0, 1, 2, 3, 2, 3, 0, 1, 4, 0, 1, 4])
       ^           ^        ^        ^
Thus, selecting the first True from each row with bool_array.argmax(1) gives us :
array([0, 2, 0, 0])
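So if you want the same (row, column) pairs that np.where gives, but with exactly one per row, pair the argmax with an arange (a sketch):

In [250]: np.arange(bool_array.shape[0]), bool_array.argmax(1)
Out[250]: (array([0, 1, 2, 3]), array([0, 2, 0, 0]))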
You could call np.unique on the resultant array like so:
>>> np.where(bool_array)[0][30:35]
array([30, 31, 31, 32, 33])
>>> np.unique(np.where(bool_array)[0][30:35])
array([30, 31, 32, 33])

indices of NumPy arrays intersection

I have two NumPy arrays. For example:
arr1 = np.array(['a','b','a','c','c','b','a','d'])
arr2 = np.array(['a','b','c','d'])
My task is to create list of indices of arr2 array where arr1 == arr2.
The length of the desired list should be equal to len(arr1). For instance, in my case the correct answer is [0,1,0,2,2,1,0,3].
What is the short way to do this? Is it possible to use a list comprehension here?
I noticed that arr2 is sorted; is that by design? If so you can do:
arr1 = np.array(['a','b','a','c','c','b','a','d'])
arr2 = np.array(['a','b','c','d'])
arr2.searchsorted(arr1)
# array([0, 1, 0, 2, 2, 1, 0, 3])
As @JAB has mentioned, you could use the sorter keyword of searchsorted when arr2 is not sorted:
arr2 = np.array(['d', 'c', 'b', 'a'])
sorter = arr2.argsort()
sorter[arr2.searchsorted(arr1, sorter=sorter)]
# array([3, 2, 3, 1, 1, 2, 3, 0])
This is an O(N*log(N)) method because of the argsort, but it should still be very fast for many use-cases.
Not sure if numpy has a method for this, but here is a builtin approach, which takes O(N) in time:
In [9]: lookup = {v:i for i, v in enumerate(arr2)}
In [10]: [lookup[v] for v in arr1]
Out[10]: [0, 1, 0, 2, 2, 1, 0, 3]
You can do it like this with NumPy using broadcasting; however, if your arrays are large you can end up allocating a lot of memory for the intermediate result:
>>> import numpy as np
>>> arr1, arr2 = np.array(['a','b','a','c','c','b','a','d']), np.array(['a','b','c','d'])
>>> arr1 == arr2[:, None]
array([[ True, False, True, False, False, False, True, False],
[False, True, False, False, False, True, False, False],
[False, False, False, True, True, False, False, False],
[False, False, False, False, False, False, False, True]], dtype=bool)
>>> (arr1 == arr2[:, None]).argmax(axis=0)
array([0, 1, 0, 2, 2, 1, 0, 3])
Otherwise keep an eye on arraysetops in case someone adds a return_index parameter to intersect1d. (NumPy 1.15 did later add a return_indices parameter to np.intersect1d.)
