I am new to Python and am writing an application to identify matching images. I am using the dHash algorithm to compare the hashes of images. I have seen in a tutorial the following lines of code:
import cv2 as cv
import numpy as np
import sys
hashSize = 8
img = cv.imread("Resources\\test_image.jpg",0)
image_resized = cv.resize(img, (hashSize + 1, hashSize))
difference = image_resized[0:, 1:] > image_resized[0:, :-1]
hash_Value = sum([5**i for i, value in enumerate(difference.flatten())if value == True])
print(hash_Value)
The two lines I am referring to are the difference line and the hash_Value line. As far as I understand, the first line checks whether the left pixel has a greater intensity than the right pixel. How does it do this for the whole array? There is no for loop over the array to check each index. As for the second line, I think it checks whether each value is true and, if it is, adds 5^i to the sum, which is then assigned to hash_Value.
I am new to Python and the syntax here is a little confusing. Can someone explain what the two lines above are doing? Is there a more readable way of writing this that will help me understand what the algorithm is doing and be more readable in future?
To break it down, the first line pixel_difference = image_resized[0:, 1:] > image_resized[0:, :-1] is basically doing the following:
import numpy as np # I assume you are using numpy.
# Suppose you have the following 2D array:
arr1 = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
# pixel_difference = image_resized[0:, 1:] > image_resized[0:, :-1]
# can be written as the following (with arr1 standing in for image_resized):
m, n = arr1.shape # This will assign m = 3, n = 3.
pixel_difference = np.ndarray(shape=(m, n-1), dtype=bool) # Allocates an m x (n-1) boolean matrix.
for row in range(m):
    for col in range(n-1):
        # Compare each pixel with its left neighbour in the same array.
        pixel_difference[row, col] = arr1[row, col+1] > arr1[row, col]
print(pixel_difference)
And the second line is doing this:
image_hash = 0
for i, value in enumerate(pixel_difference.flatten()):
    if value:
        image_hash += 5**i
print(image_hash)
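For readability, the two steps can be split into small named functions. The following is only a sketch: the function names and the tiny 2 x 3 array are made up for illustration, and the base of 5 is kept solely to match the question's code (many dHash descriptions use powers of 2 so that each comparison maps to one bit).

```python
import numpy as np

def dhash_bits(gray):
    """Return a boolean grid: True where a pixel is brighter than its left neighbour."""
    left = gray[:, :-1]   # every column except the last
    right = gray[:, 1:]   # every column except the first
    return right > left   # element-wise comparison, no explicit loop needed

def bits_to_int(bits, base=5):
    """Sum base**i over the flattened positions i whose bit is set."""
    total = 0
    for i, bit in enumerate(bits.flatten()):
        if bit:
            total += base ** i
    return total

# Hypothetical 2 x 3 "image"; the question's code would pass the resized 8 x 9 image.
demo = np.array([[10, 20, 15],
                 [30, 30, 40]])
bits = dhash_bits(demo)
hash_value = bits_to_int(bits)
print(bits)        # [[ True False]
                   #  [False  True]]
print(hash_value)  # 126, i.e. 5**0 + 5**3
```

NumPy applies the `>` operator to whole arrays at once, which is why the original one-liner needs no for loop.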
Suppose I have an example of numpy array:
import numpy as np
X = np.array([2,5,0,4,3,1])
And I also have a list of arrays, like:
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
I want to keep only those items of each array that are also in X, and I would like to do it in the most efficient/common way.
Solution I have tried so far:
Sort X using X.sort().
Find locations of items of each array in X using:
locations = [np.searchsorted(X, n) for n in A]
Leave only proper ones:
masks = [X[locations[i]] == A[i] for i in range(len(A))]
result = [A[i][masks[i]] for i in range(len(A))]
But it doesn't work because one of the locations for the third array is out of bounds:
locations = [array([0, 0, 2], dtype=int64), array([0, 1, 2, 3, 4, 5], dtype=int64), array([2, 5, 4, 6], dtype=int64)]
How to solve this issue?
Update
I ended up with the idx[idx==len(Xs)] = 0 solution. I've also noticed two different approaches among the answers: converting X into a set vs. np.sort. Both have pluses and minuses: set operations iterate in Python, which is quite slow compared with NumPy methods; however, the cost of np.searchsorted grows logarithmically, unlike access to set items, which takes constant time. That's why I decided to compare performance using large inputs, namely 1 million items each for X, A[0], A[1] and A[2].
One idea would be to minimize the compute and the work done while looping. So, here's one with those in mind -
a = np.concatenate(A)
m = np.isin(a,X)
l = np.array(list(map(len,A)))
a_m = a[m]
cut_idx = np.r_[0,l.cumsum()]
l_m = np.add.reduceat(m,cut_idx[:-1])
cl_m = np.r_[0,l_m.cumsum()]
out = [a_m[i:j] for (i,j) in zip(cl_m[:-1],cl_m[1:])]
Alternative #1 :
We can also use np.searchsorted to get the isin mask, like so -
Xs = np.sort(X)
idx = np.searchsorted(Xs,a)
idx[idx==len(Xs)] = 0
m = Xs[idx]==a
Another way with np.intersect1d
If you are looking for the most common/elegant one, I think it would be np.intersect1d -
In [43]: [np.intersect1d(X,A_i) for A_i in A]
Out[43]: [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 4, 5])]
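Note that np.intersect1d returns sorted, unique values, so the third result above is [2, 4, 5] rather than the original order [2, 5, 4]. If the original order and any duplicates should be preserved, a per-array np.isin mask is one option. A minimal sketch using the question's data (the variable names kept_in_order and kept_sorted are made up here):

```python
import numpy as np

X = np.array([2, 5, 0, 4, 3, 1])
A = [np.array([-2, 0, 2]), np.array([0, 1, 2, 3, 4, 5]), np.array([2, 5, 4, 6])]

# Boolean mask per array: True where the element also occurs in X.
kept_in_order = [arr[np.isin(arr, X)] for arr in A]

# intersect1d sorts and de-duplicates its result.
kept_sorted = [np.intersect1d(X, arr) for arr in A]

print(kept_in_order[2])  # [2 5 4]
print(kept_sorted[2])    # [2 4 5]
```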
Solving your issue
You can also solve your out-of-bounds issue, with a simple fix -
for l in locations:
    l[l==len(X)]=0
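Put together with the question's steps, the fix might look like the sketch below. It assumes X is sorted into a separate array Xs first; redirecting an out-of-range position to index 0 is safe because such a value is larger than every element of Xs and so can never compare equal to Xs[0].

```python
import numpy as np

X = np.array([2, 5, 0, 4, 3, 1])
A = [np.array([-2, 0, 2]), np.array([0, 1, 2, 3, 4, 5]), np.array([2, 5, 4, 6])]

Xs = np.sort(X)  # searchsorted requires a sorted array
result = []
for arr in A:
    idx = np.searchsorted(Xs, arr)   # insertion positions, may equal len(Xs)
    idx[idx == len(Xs)] = 0          # redirect out-of-range positions to a valid index
    mask = Xs[idx] == arr            # True only where arr's value really occurs in Xs
    result.append(arr[mask])

print(result)  # [array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]
```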
How about this, very simple and efficient:
import numpy as np
X = np.array([2,5,0,4,3,1])
A = [np.array([-2,0,2]), np.array([0,1,2,3,4,5]), np.array([2,5,4,6])]
X_set = set(X)
A = [np.array([a for a in arr if a in X_set]) for arr in A]
#[array([0, 2]), array([0, 1, 2, 3, 4, 5]), array([2, 5, 4])]
According to the docs, set membership tests are O(1) on average, so the overall complexity is O(N).
I am trying to convert code from MATLAB to Python.
I have this code in MATLAB:
[value, iA, iB] = intersect(netA{i},netB{j});
I am looking for Python code that finds the values common to both A and B, as well as the index vectors ia and ib (for each common element, its first index in A and its first index in B).
I tried different solutions, but I received vectors of different lengths. I also tried numpy.in1d/intersect1d, but they do not return the same values.
What I tried:
def FindoverlapIndx(self,a, b):
    bool_a = np.in1d(a, b)
    ind_a = np.arange(len(a))
    ind_a = ind_a[bool_a]
    ind_b = np.array([np.argwhere(b == a[x]) for x in ind_a]).flatten()
    return ind_a, ind_b
IS=np.arange(IDs[i].shape[0])[np.in1d(IDs[i], R_IDs[j])]
IR = np.arange(R_IDs[j].shape[0])[np.in1d(R_IDs[j],IDs[i])]
I received indexes with different lengths, but both must have the same length, as in MATLAB's intersect.
MATLAB's intersect(a, b) returns:
common values of a and b, sorted
the first position of each of them in a
the first position of each of them in b
NumPy's intersect1d does only the first part. So I read its source and modified it to return indices as well.
import numpy as np
def intersect_mtlb(a, b):
    a1, ia = np.unique(a, return_index=True)
    b1, ib = np.unique(b, return_index=True)
    aux = np.concatenate((a1, b1))
    aux.sort()
    c = aux[:-1][aux[1:] == aux[:-1]]
    return c, ia[np.isin(a1, c)], ib[np.isin(b1, c)]
a = np.array([7, 1, 7, 7, 4])
b = np.array([7, 0, 4, 4, 0])
c, ia, ib = intersect_mtlb(a, b)
print(c, ia, ib)
This prints [4 7] [4 0] [2 0] which is consistent with the output on MATLAB documentation page, as I used the same example as they did. Of course, indices are 0-based in Python unlike MATLAB.
Explanation: the function takes the unique elements of each array, concatenates them, and sorts the result: [0 1 4 4 7 7]. Each number appears at most twice here; when it's repeated, that means it was in both arrays. This is what aux[1:] == aux[:-1] selects for.
The array ia contains the first index of each element of a1 in the original array a. Filtering it by isin(a1, c) leaves only the indices of elements that are in c. The same is done for ib.
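The intermediate arrays can be inspected step by step. This sketch reuses the example arrays from the MATLAB documentation; the names is_repeated and common are introduced here only for illustration.

```python
import numpy as np

a = np.array([7, 1, 7, 7, 4])
b = np.array([7, 0, 4, 4, 0])

a1, ia = np.unique(a, return_index=True)  # a1 = [1 4 7], ia = [1 4 0]
b1, ib = np.unique(b, return_index=True)  # b1 = [0 4 7], ib = [1 2 0]

aux = np.sort(np.concatenate((a1, b1)))   # [0 1 4 4 7 7]
is_repeated = aux[1:] == aux[:-1]         # True where a value equals the one after it
common = aux[:-1][is_repeated]            # [4 7]

print(aux)
print(is_repeated)  # [False False  True False  True]
print(common)
```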
EDIT:
Since version 1.15.0, intersect1d does the second and third part if you pass return_indices=True:
x = np.array([1, 1, 2, 3, 4])
y = np.array([2, 1, 4, 6])
xy, x_ind, y_ind = np.intersect1d(x, y, return_indices=True)
Where you get xy = array([1, 2, 4]), x_ind = array([0, 2, 4]) and y_ind = array([1, 0, 2])
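Applied to the MATLAB documentation example used earlier, the built-in option gives the same answer as the hand-written function. A short sketch, assuming NumPy 1.15.0 or newer; adding 1 converts the 0-based indices to MATLAB's 1-based convention.

```python
import numpy as np

a = np.array([7, 1, 7, 7, 4])
b = np.array([7, 0, 4, 4, 0])

c, ia, ib = np.intersect1d(a, b, return_indices=True)
print(c, ia, ib)       # [4 7] [4 0] [2 0]
print(ia + 1, ib + 1)  # [5 1] [3 1], the 1-based indices MATLAB would report
```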
I would like to create a two dimensional numpy array of arrays that has a different number of elements on each row.
Trying
cells = numpy.array([[0,1,2,3], [2,3,4]])
gives an error
ValueError: setting an array element with a sequence.
We are now almost 7 years after the question was asked, and your code
cells = numpy.array([[0,1,2,3], [2,3,4]])
executed in numpy 1.12.0, python 3.5, doesn't produce any error and
cells contains:
array([[0, 1, 2, 3], [2, 3, 4]], dtype=object)
You access your cells elements as cells[0][2] # (=2) .
An alternative to tom10's solution if you want to build your list of numpy arrays on the fly as new elements (i.e. arrays) become available is to use append:
d = [] # initialize an empty list
a = np.arange(3) # array([0, 1, 2])
d.append(a) # [array([0, 1, 2])]
b = np.arange(3,-1,-1) #array([3, 2, 1, 0])
d.append(b) #[array([0, 1, 2]), array([3, 2, 1, 0])]
While Numpy knows about arrays of arbitrary objects, it's optimized for homogeneous arrays of numbers with fixed dimensions. If you really need arrays of arrays, better use a nested list. But depending on the intended use of your data, different data structures might be even better, e.g. a masked array if you have some invalid data points.
If you really want flexible Numpy arrays, use something like this:
numpy.array([[0,1,2,3], [2,3,4]], dtype=object)
However this will create a one-dimensional array that stores references to lists, which means that you will lose most of the benefits of Numpy (vector processing, locality, slicing, etc.).
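A quick way to see what is lost is to inspect such an array. The sketch below is only illustrative (the name doubled is made up); note also that, as far as I know, recent NumPy releases (1.24 and later) raise a ValueError for ragged input unless dtype=object is passed explicitly.

```python
import numpy as np

cells = np.array([[0, 1, 2, 3], [2, 3, 4]], dtype=object)

print(cells.shape)     # (2,) : one dimension, two elements
print(type(cells[0]))  # <class 'list'>
print(cells[0][2])     # 2

# Arithmetic is delegated to the stored Python lists, not done numerically:
doubled = cells * 2    # list * 2 repeats the list instead of doubling its values
print(doubled[1])      # [2, 3, 4, 2, 3, 4]
```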
This isn't well supported in Numpy (by definition, almost everywhere, a "two dimensional array" has all rows of equal length). A Python list of Numpy arrays may be a good solution for you, as this way you'll get the advantages of Numpy where you can use them:
cells = [numpy.array(a) for a in [[0,1,2,3], [2,3,4]]]
Another option would be to store your arrays as one contiguous array and also store their sizes or offsets. This takes a little more conceptual thought around how to operate on your arrays, but a surprisingly large number of operations can be made to work as if you had a two dimensional array with different sizes. In the cases where they can't, then np.split can be used to create the list that calocedrus recommends. The easiest operations are ufuncs, because they require almost no modification. Here are some examples:
cells_flat = numpy.array([0, 1, 2, 3, 2, 3, 4])
# One of these is required, it's pretty easy to convert between them,
# but having both makes the examples easy
cell_lengths = numpy.array([4, 3])
cell_starts = numpy.insert(cell_lengths[:-1].cumsum(), 0, 0)
cell_lengths2 = numpy.diff(numpy.append(cell_starts, cells_flat.size))
assert numpy.all(cell_lengths == cell_lengths2)
# Copy prevents shared memory
cells = numpy.split(cells_flat.copy(), cell_starts[1:])
# [array([0, 1, 2, 3]), array([2, 3, 4])]
numpy.array([x.sum() for x in cells])
# array([6, 9])
numpy.add.reduceat(cells_flat, cell_starts)
# array([6, 9])
[a + v for a, v in zip(cells, [1, 3])]
# [array([1, 2, 3, 4]), array([5, 6, 7])]
cells_flat + numpy.repeat([1, 3], cell_lengths)
# array([1, 2, 3, 4, 5, 6, 7])
[a.astype(float) / a.sum() for a in cells]
# [array([ 0. , 0.16666667, 0.33333333, 0.5 ]),
# array([ 0.22222222, 0.33333333, 0.44444444])]
cells_flat.astype(float) / numpy.add.reduceat(cells_flat, cell_starts).repeat(cell_lengths)
# array([ 0. , 0.16666667, 0.33333333, 0.5 , 0.22222222,
# 0.33333333, 0.44444444])
def complex_modify(array):
    """Some complicated function that modifies array
    pretend this is more complex than it is"""
    array *= 3
for arr in cells:
    complex_modify(arr)
cells
# [array([0, 3, 6, 9]), array([ 6, 9, 12])]
for arr in numpy.split(cells_flat, cell_starts[1:]):
    complex_modify(arr)
cells_flat
# array([ 0, 3, 6, 9, 6, 9, 12])
In numpy 1.14.3, using append:
d = [] # initialize an empty list
a = np.arange(3) # array([0, 1, 2])
d.append(a) # [array([0, 1, 2])]
b = np.arange(3,-1,-1) #array([3, 2, 1, 0])
d.append(b) #[array([0, 1, 2]), array([3, 2, 1, 0])]
you get a list of arrays (which can be of different lengths), and you can do operations like d[0].mean(). On the other hand,
cells = numpy.array([[0,1,2,3], [2,3,4]])
results in an array of lists.
You may want to do this:
a1 = np.array([1,2,3])
a2 = np.array([3,4])
a3 = np.array([a1,a2], dtype=object) # dtype=object is required for ragged input in recent NumPy
a3 # array([array([1, 2, 3]), array([3, 4])], dtype=object)
type(a3) # numpy.ndarray
type(a2) # numpy.ndarray
Slightly off-topic, but not as much as one would think because of eager mode which is now the default:
If you are using Tensorflow, you can do:
a = tf.ragged.constant([[0, 1, 2, 3]])
b = tf.ragged.constant([[2, 3, 4]])
c = tf.concat([a, b], axis=0)
And you can then still do all the mathematical operations, like tf.math.reduce_mean, etc.
np.array([[0,1,2,3], [2,3,4]], dtype=object) returns an "array" of lists.
a = np.array([np.array([0,1,2,3]), np.array([2,3,4])], dtype=object) returns an array of arrays. It already allows operations such as a+1.
Building on this, the functionality can be enhanced by subclassing.
import numpy as np
class Arrays(np.ndarray):
    def __new__(cls, input_array, dims=None):
        obj = np.array(list(map(np.array, input_array))).view(cls)
        return obj
    def __getitem__(self, ij):
        if isinstance(ij, tuple) and len(ij) > 1:
            # handle twodimensional slicing
            if isinstance(ij[0],slice) or hasattr(ij[0], '__iter__'):
                # [1:4,:] or [[1,2,3],[1,2]]
                return Arrays(arr[ij[1]] for arr in self[ij[0]])
            return self[ij[0]][ij[1]] # [1,:] np.array
        return super(Arrays, self).__getitem__(ij)
    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        axis = kwargs.pop('axis', None)
        dimk = [len(arg) if hasattr(arg, '__iter__') else 1 for arg in inputs]
        dim = max(dimk)
        pad_inputs = [([i]*dim if (d<dim) else i) for d,i in zip(dimk, inputs)]
        result = [np.ndarray.__array_ufunc__(self, ufunc, method, *x, **kwargs) for x in zip(*pad_inputs)]
        if method == 'reduce':
            # handle sum, min, max, etc.
            if axis == 1:
                return np.array(result)
            else:
                # repeat over remaining axis
                return np.ndarray.__array_ufunc__(self, ufunc, method, result, **kwargs)
        return Arrays(result)
Now this works:
a = Arrays([[0,1,2,3], [2,3,4]])
a[0:1,0:-1]
# Arrays([[0, 1, 2]])
np.sin(a)
# Arrays([array([0. , 0.84147098, 0.90929743, 0.14112001]),
# array([ 0.90929743, 0.14112001, -0.7568025 ])], dtype=object)
a + 2*a
# Arrays([array([0, 3, 6, 9]), array([ 6, 9, 12])], dtype=object)
To get nanfunctions working, this can be done
# patch for nanfunction that cannot handle the object-ndarrays along with second axis=-1
def nanpatch(func):
    def wrapper(a, axis=None, **kwargs):
        if isinstance(a, Arrays):
            rowresult = [func(x, **kwargs) for x in a]
            if axis == 1:
                return np.array(rowresult)
            else:
                # repeat over remaining axis
                return func(rowresult)
        # otherwise keep the original version
        return func(a, axis=axis, **kwargs)
    return wrapper
np.nanmean = nanpatch(np.nanmean)
np.nansum = nanpatch(np.nansum)
np.nanmin = nanpatch(np.nanmin)
np.nanmax = nanpatch(np.nanmax)
np.nansum(a)
# 15
np.nansum(a, axis=1)
# array([6, 9])