Comparing NumPy object references - python

I want to understand the NumPy behavior.
When I take a reference to an inner array of a NumPy array and then compare it to the object itself, the comparison returns False.
Here is the example:
In [198]: x = np.array([[1,2,3], [4,5,6]])
In [201]: x0 = x[0]
In [202]: x0 is x[0]
Out[202]: False
While on the other hand, with Python native objects, the returned value is True.
In [205]: c = [[1,2,3],[1]]
In [206]: c0 = c[0]
In [207]: c0 is c[0]
Out[207]: True
My question: is this the intended behavior of NumPy? If so, what should I do if I want to create a reference to an inner object of a NumPy array?

2d slicing
When I first wrote this I constructed and indexed a 1d array. But the OP is working with a 2d array, so x[0] is a 'row', a slice of the original.
In [81]: arr = np.array([[1,2,3], [4,5,6]])
In [82]: arr.__array_interface__['data']
Out[82]: (181595128, False)
In [83]: x0 = arr[0,:]
In [84]: x0.__array_interface__['data']
Out[84]: (181595128, False) # same databuffer pointer
In [85]: id(x0)
Out[85]: 2886887088
In [86]: x1 = arr[0,:] # another slice, different id
In [87]: x1.__array_interface__['data']
Out[87]: (181595128, False)
In [88]: id(x1)
Out[88]: 2886888888
What I wrote earlier about slices still applies. Indexing an individual element, as with arr[0,0], works the same as with a 1d array.
This 2d arr has the same databuffer as the 1d arr.ravel(); the shape and strides are different. And the distinction between view, copy and item still applies.
A common way of implementing 2d arrays in C is to have an array of pointers to other arrays. numpy takes a different, strided approach, with just one flat array of data, and uses shape and strides parameters to implement the traversal. So a subarray requires its own shape and strides as well as a pointer to the shared databuffer.
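To see that strided model concretely, here is a minimal sketch (the stride values assume the default 8-byte integer of a 64-bit build; the sessions above used 4-byte int32, so their numbers would be half these):
import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.shape, arr.strides)       # (2, 3) (24, 8)
row = arr[0, :]                     # a view: its own shape and strides
print(row.shape, row.strides)       # (3,) (8,)
print(np.shares_memory(arr, row))   # True - one shared flat databuffer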
1d array indexing
I'll try to illustrate what is going on when you index an array:
In [51]: arr = np.arange(4)
The array is an object with various attributes such as shape, and a data buffer. The buffer stores the data as bytes (in a C array), not as Python numeric objects. You can see information on the array with:
In [52]: np.info(arr)
class: ndarray
shape: (4,)
strides: (4,)
itemsize: 4
aligned: True
contiguous: True
fortran: True
data pointer: 0xa84f8d8
byteorder: little
byteswap: False
type: int32
or
In [53]: arr.__array_interface__
Out[53]:
{'data': (176486616, False),
'descr': [('', '<i4')],
'shape': (4,),
'strides': None,
'typestr': '<i4',
'version': 3}
One has the data pointer in hex, the other decimal. We usually don't reference it directly.
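A quick check with the session's own numbers confirms the two displays name the same address:
hex(176486616)    # the decimal pointer from __array_interface__
# '0xa84f8d8'     # the hex pointer np.info printed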
If I index an element, I get a new object:
In [54]: x1 = arr[1]
In [55]: type(x1)
Out[55]: numpy.int32
In [56]: x1.__array_interface__
Out[56]:
{'__ref': array(1),
'data': (181158400, False),
....}
In [57]: id(x1)
Out[57]: 2946170352
It has some properties of an array, but not all. For example you can't assign to it. Notice also that its data value is totally different.
Make another selection from the same place - different id and different data:
In [58]: x2 = arr[1]
In [59]: id(x2)
Out[59]: 2946170336
In [60]: x2.__array_interface__['data']
Out[60]: (181143288, False)
Also if I change the array at this point, it does not affect the earlier selections:
In [61]: arr[1] = 10
In [62]: arr
Out[62]: array([ 0, 10, 2, 3])
In [63]: x1
Out[63]: 1
x1 and x2 don't have the same id, and thus won't match with is, and they don't use the arr data buffer either. There's no record that either variable was derived from arr.
With slicing it is possible to get a view of the original array:
In [64]: y = arr[1:2]
In [65]: y.__array_interface__
Out[65]:
{'data': (176486620, False),
'descr': [('', '<i4')],
'shape': (1,),
....}
In [66]: y
Out[66]: array([10])
In [67]: y[0]=4
In [68]: arr
Out[68]: array([0, 4, 2, 3])
In [69]: x1
Out[69]: 1
Its data pointer is 4 bytes larger than arr's - that is, it points into the same buffer, just at a different spot. And changing y does change arr (but not the independent x1).
I could even make a 0d view of this item:
In [71]: z = y.reshape(())
In [72]: z
Out[72]: array(4)
In [73]: z[...]=0
In [74]: arr
Out[74]: array([0, 0, 2, 3])
In Python code we normally don't work with objects like this. When we use the c-api or cython it is possible to access the data buffer directly. nditer is an iteration mechanism that works with 0d objects like this (either in Python or the c-api). In cython typed memoryviews are particularly useful for low level access.
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
https://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
https://docs.scipy.org/doc/numpy/reference/c-api.iterator.html#c.NpyIter
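A small sketch of the nditer mechanism in Python (the with-block form assumes a numpy recent enough to support nditer as a context manager):
import numpy as np
a = np.arange(4)
with np.nditer(a, op_flags=['readwrite']) as it:
    for x in it:          # each x is a writable 0d view into a's buffer
        x[...] = 2 * x
print(a)                  # [0 2 4 6]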
elementwise ==
In response to a comment on Comparing NumPy object references:
np.array([1]) == np.array([2]) will return array([False], dtype=bool)
== is defined for arrays as an elementwise operation. It compares the values of the respective elements and returns a matching boolean array.
If such a comparison needs to be used in a scalar context (such as an if) it needs to be reduced to a single value, as with np.all or np.any.
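For example, a sketch:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
a == b                    # array([ True,  True,  True]) - elementwise
if (a == b).all():        # reduced to a single bool for the if
    print('equal')
np.array_equal(a, b)      # True - convenience test that also checks shape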
The is test compares object id's (not just for numpy objects). It has limited value in practical coding. I used it most often in expressions like is None, where None is an object with a unique id, and which does not play nicely with equality tests.

I think you have a misunderstanding about NumPy arrays. You think that subarrays in a multidimensional NumPy array are separate objects, as they are in Python lists; they are not.
A NumPy array, regardless of its dimension, is just one object. That's because NumPy creates the array at the C level, and when it is loaded as a Python object it can't be broken down into multiple objects. So Python has to create a new object to hold each piece whenever you use attributes like split(), __getitem__, take(), etc.; that is just how Python abstracts list-like behavior for NumPy arrays.
You can also check this in real time, like the following:
In [7]: x
Out[7]:
array([[1, 2, 3],
[4, 5, 6]])
In [8]: x[0] is x[0]
Out[8]: False
So as soon as you have an array, or any mutable object that can hold other objects in it, you have a Python mutable object, and you lose the performance and all the other cool features of NumPy arrays.
Also, as @Imanol mentioned in the comments, you may want to use NumPy view objects if you want a memory-optimized and flexible way to modify an array through a reference. View objects can be constructed in the following two ways:
a.view(some_dtype) or a.view(dtype=some_dtype) constructs a view of
the array’s memory with a different data-type. This can cause a
reinterpretation of the bytes of memory.
a.view(ndarray_subclass) or a.view(type=ndarray_subclass) just returns
an instance of ndarray_subclass that looks at the same array (same
shape, dtype, etc.) This does not cause a reinterpretation of the
memory.
For a.view(some_dtype), if some_dtype has a different number of bytes
per entry than the previous dtype (for example, converting a regular
array to a structured array), then the behavior of the view cannot be
predicted just from the superficial appearance of a (shown by
print(a)). It also depends on exactly how a is stored in memory.
Therefore if a is C-ordered versus fortran-ordered, versus defined as
a slice or transpose, etc., the view may give different results.
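A minimal sketch of the first form, reinterpreting the same buffer with a different dtype (no copy is made, so writes through the view land in the original):
import numpy as np
a = np.zeros(4, dtype=np.int16)
b = a.view(np.uint8)      # the same 8 bytes, seen as 8 uint8 values
b[0] = 255
print(a)                  # [255   0   0   0] on a little-endian machine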

Not sure if it's useful at this point, but numpy.ndarray.ctypes seems to have useful bits:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.ctypes.html
Used something like this (missing dtype, but meh):
def is_same_array(a, b):
    return (a.shape == b.shape) and (a == b).all() and a.ctypes.data == b.ctypes.data
here:
https://github.com/EricCousineau-TRI/repro/blob/a60daf899e9726daf2ca1259bb80ad2c7c9b3e3f/python/namedlist_alt.py#L111
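For the shared-buffer part of that test, np.shares_memory is a related built-in check; a sketch:
import numpy as np
a = np.arange(6)
b = a[2:]                             # a view into a
print(np.shares_memory(a, b))         # True
print(np.shares_memory(a, a.copy()))  # False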


How to check numpy array is empty? [duplicate]

How can I check whether a numpy array is empty or not?
I used the following code, but this fails if the array contains a zero.
if not self.Definition.all():
Is this the solution?
if self.Definition == array([]):
You can always take a look at the .size attribute. It is defined as an integer, and is zero (0) when there are no elements in the array:
import numpy as np
a = np.array([])
if a.size == 0:
    # Do something when `a` is empty
https://numpy.org/devdocs/user/quickstart.html (2020.04.08)
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called axes.
(...) NumPy’s array class is called ndarray. (...) The more important attributes of an ndarray object are:
ndarray.ndim
the number of axes (dimensions) of the array.
ndarray.shape
the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.
ndarray.size
the total number of elements of the array. This is equal to the product of the elements of shape.
One caveat, though.
Note that np.array(None).size returns 1!
This is because a.size is equivalent to np.prod(a.shape),
np.array(None).shape is (), and an empty product is 1.
>>> import numpy as np
>>> np.array(None).size
1
>>> np.array(None).shape
()
>>> np.prod(())
1.0
Therefore, I use the following to test if a numpy array has elements:
>>> def elements(array):
... return array.ndim and array.size
>>> elements(np.array(None))
0
>>> elements(np.array([]))
0
>>> elements(np.zeros((2,3,4)))
24
Why would we want to check if an array is empty? Arrays don't grow or shrink in the same way that lists do. Starting with an 'empty' array and growing it with np.append is a frequent novice error.
Using a list in if alist: hinges on its boolean value:
In [102]: bool([])
Out[102]: False
In [103]: bool([1])
Out[103]: True
But trying to do the same with an array produces (in version 1.18):
In [104]: bool(np.array([]))
/usr/local/bin/ipython3:1: DeprecationWarning: The truth value
of an empty array is ambiguous. Returning False, but in
future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
#!/usr/bin/python3
Out[104]: False
In [105]: bool(np.array([1]))
Out[105]: True
and bool(np.array([1,2])) produces the infamous ambiguity error.
edit
The accepted answer suggests size:
In [11]: x = np.array([])
In [12]: x.size
Out[12]: 0
But I (and most others) check the shape more than the size:
In [13]: x.shape
Out[13]: (0,)
Another thing in its favor is that it 'maps' on to an empty list:
In [14]: x.tolist()
Out[14]: []
But there are other arrays with 0 size, that aren't 'empty' in that last sense:
In [15]: x = np.array([[]])
In [16]: x.size
Out[16]: 0
In [17]: x.shape
Out[17]: (1, 0)
In [18]: x.tolist()
Out[18]: [[]]
In [19]: bool(x.tolist())
Out[19]: True
np.array([[],[]]) is also size 0, but shape (2,0) and len 2.
While the concept of an empty list is well defined, an empty array is not well defined. One empty list is equal to another. The same can't be said for a size 0 array.
The answer really depends on
what do you mean by 'empty'?
what are you really testing for?
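A compact illustration of why the definition matters, reusing the shapes from above (a sketch):
import numpy as np
x = np.array([[]])
print(x.size == 0)        # True  - zero elements
print(x.shape == (0,))    # False - the shape is (1, 0)
print(bool(x.tolist()))   # True  - tolist() gives [[]], a non-empty list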

What does "[()]" mean when called upon a numpy array?

I just came across this piece of code:
x = np.load(lc_path, allow_pickle=True)[()]
And I've never seen this pattern before: [()]. What does it do, and why is this syntactically correct?
a = np.load(lc_path, allow_pickle=True)
>>> array({'train_ppls': [1158.359413193576, 400.54333992093854, ...],
'val_ppls': [493.0056070137404, 326.53203520368623, ...],
'train_losses': [340.40905952453613, 675.6475067138672, ...],
'val_losses': [217.46258735656738, 438.86770486831665, ...],
'times': [19.488852977752686, 20.147733449935913, ...]}, dtype=object)
So I guess a is a dict wrapped in an array for some reason by the person who saved it
It's a way (the only way) of indexing a 0d array:
In [475]: x=np.array(21)
In [476]: x
Out[476]: array(21)
In [477]: x.shape
Out[477]: ()
In [478]: x[()]
Out[478]: 21
In effect it pulls the element out of the array. item() is another way:
In [479]: x.item()
Out[479]: 21
In [480]: x.ndim
Out[480]: 0
In
x = np.load(lc_path, allow_pickle=True)[()]
most likely np.save was given a non-array, which it wrapped in a 0d object-dtype array in order to save it. The [()] is a way of recovering that object.
In [481]: np.save('test.npy', {'a':1})
In [482]: x = np.load('test.npy', allow_pickle=True)
In [483]: x
Out[483]: array({'a': 1}, dtype=object)
In [484]: x.ndim
Out[484]: 0
In [485]: x[()]
Out[485]: {'a': 1}
In general when we index a nd array, e.g. x[1,2] we are really doing x[(1,2)], that is, using a tuple that corresponds to the number of dimensions. If x is 0d, the only tuple that works is an empty one, ().
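A short sketch of that equivalence:
import numpy as np
x = np.arange(6).reshape(2, 3)
print(x[1, 2], x[(1, 2)])   # 5 5 - the multi-index is really a tuple
z = np.array(21)            # 0d
print(z[()])                # 21 - the empty tuple is its only full index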
That's indexing the array with a tuple of 0 indices. For most arrays, this just produces a view of the whole array, but for a 0-dimensional array, it extracts the array's single element as a scalar.
In this case, it looks like someone made the weird choice to dump a non-NumPy object to an array with numpy.save, resulting in NumPy saving a 0-dimensional array of object dtype wrapping the original object. The use of allow_pickle=True and the empty tuple index extracts the object from the 0-dimensional array.
They probably should have picked something other than numpy.save to save this object.
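If the object isn't an array to begin with, pickling it directly avoids the 0d wrapper entirely. A sketch (the dict and filename here are hypothetical stand-ins):
import pickle
data = {'train_ppls': [1158.36, 400.54]}   # stand-in for the saved dict
with open('lc.pkl', 'wb') as f:
    pickle.dump(data, f)
with open('lc.pkl', 'rb') as f:
    restored = pickle.load(f)
print(restored == data)                    # True - no [()] needed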

Mapping an integer to array (Python): ValueError: setting an array element with a sequence

I have a defaultdict which maps certain integers to a numpy array of size 20.
In addition, I have an existing array of indices. I want to turn that array of indices into a 2D array, where each original index is converted into an array via my defaultdict.
Finally, in the case that an index isn't found in the defaultdict, I want to create an array of zeros for that index.
Here's what I have so far
converter = lambda x: np.zeros((d), dtype='float32') if x == -1 else cVf[x]
vfunc = np.vectorize(converter)
cvf = vfunc(indices)
np.zeros((d), dtype='float32') and cVf[x] are identical data types/shapes:
(Pdb) np.shape(cVf[0])
(20,)
Yet I get the error in the title (*** ValueError: setting an array element with a sequence.) when I try to run this code.
Any ideas?
You should give us some sample arrays or dictionaries (in the case of cVf), so we can make a test run.
Read what vectorize has to say about the return value. Since you don't define otypes, it makes a test calculation to determine the dtype of the returned array. My first thought was that the test calc and subsequent one might be returning different things. But you claim converter will always be returning the same dtype and shape array.
But let's try something simpler:
In [609]: fv = np.vectorize(lambda x: np.array([x,x]))
In [610]: fv([1,2,3])
...
ValueError: setting an array element with a sequence.
It's having trouble with returning any array.
But if I give it otypes, it works:
In [611]: fv = np.vectorize(lambda x: np.array([x,x]), otypes=[object])
In [612]: fv([1,2,3])
Out[612]: array([array([1, 1]), array([2, 2]), array([3, 3])], dtype=object)
In fact in this case I could use frompyfunc, which returns object dtype, and is the underlying function for vectorize (and a bit faster).
In [613]: fv = np.frompyfunc(lambda x: np.array([x,x]), 1,1)
In [614]: fv([1,2,3])
Out[614]: array([array([1, 1]), array([2, 2]), array([3, 3])], dtype=object)
vectorize and frompyfunc are designed for functions that are scalar in, scalar out. That scalar may be an object, even an array, but it is still treated as a scalar.
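Given that, here is a sketch of the original task that skips vectorize altogether: build the rows in ordinary Python and stack them (d, cVf and indices are stand-ins for the asker's names):
import numpy as np
d = 20
cVf = {0: np.ones(d, dtype='float32'),     # illustrative entries
       1: np.full(d, 2, dtype='float32')}
indices = np.array([0, -1, 1])
rows = [np.zeros(d, dtype='float32') if i == -1 else cVf[i]
        for i in indices]
cvf = np.stack(rows)
print(cvf.shape)                           # (3, 20)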

Get a hashable numpy memory view

I want to hash numpy arrays without copying the data into a bytearray first.
Specifically, I have a contiguous read-only two-dimensional int64 numpy array A with unique rows. To be concrete, let's say:
A = np.array([[1, 2], [3, 4], [5, 6]])
A.setflags(write=False)
I want to make a constant time function that maps an arbitrary array ap that's identical in value to a slice of A, e.g. A[i], to its index i. e.g.
foo(np.array([1, 2])) == 0
foo(np.array([3, 4])) == 1
foo(np.array([5, 6])) == 2
The natural choice is to make a dictionary like:
lookup = {a: i for i, a in enumerate(A)}
Unfortunately numpy arrays are not hashable. There are ways to hash numpy arrays, but ideally I'd like the equality to be preserved so I can use it in a dictionary without writing manual collision detection.
The referenced article does point out that I could do:
lookup = {a.data.tobytes(): i for i, a in enumerate(A)}
def foo(ap):
return lookup[ap.data.tobytes()]
However the tobytes method returns a copy of the data pointed to by a.data, hence doubling the memory usage.
What I'd love to do is something like:
lookup = {a.data: i for i, a in enumerate(A)}
def foo(ap):
return lookup[ap.data]
This would ideally use a pointer to the underlying memory instead of the array object or a copy of its bytes, but since a.dtype == int, this fails with:
ValueError: memoryview: hashing is restricted to formats 'B', 'b' or 'c'
This is fine; we can cast it with Aview = A.view(np.byte), and now we have:
>>> Aview.flags
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# OWNDATA : False
# WRITEABLE : False
# ALIGNED : True
# UPDATEIFCOPY : False
>>> Aview.data.format
# 'b'
However, when trying to hash this, it still errors with:
TypeError: unhashable type: 'numpy.ndarray'
A possible solution (inspired by this) would be to define:
class _wrapper(object):
    def __init__(self, array):
        self._array = array
        self._hash = hash(array.data.tobytes())
    def __hash__(self):
        return self._hash
    def __eq__(self, other):
        return self._hash == other._hash and np.all(self._array == other._array)

lookup = {_wrapper(a): i for i, a in enumerate(A)}
def foo(ap):
    return lookup[_wrapper(ap)]
But this seems inelegant. Is there a way to tell python to just interpret the memoryview as a bytearray and hash it normally without having to make a copy to a bytestring or having numpy intercede and abort the hash?
Other things I've tried:
The format of A does allow me to map each row into a distinct integer, but for very large A the space of possible arrays is larger than np.iinfo(int).max, and while I can use python's integer types, this is ~100x slower than just hashing the memory.
I also tried doing something like:
Aview = A.view(np.void(A.shape[1] * A.itemsize)).squeeze()
However, even though A.flags.writeable == False, A[0].flags.writeable == True. When trying hash(A[0]) python raises TypeError: unhashable type: 'writeable void-scalar'. I'm unsure if it's possible to mark scalars as read-only, or otherwise hash a void scalar, even though most other scalars seem hashable.
I can't make sense of what you are trying to do.
When I create an array:
In [111]: A=np.array([1,0,1,2,3,0,2])
In [112]: A.__array_interface__
Out[112]:
{'data': (173599672, False),
'descr': [('', '<i4')],
'shape': (7,),
'strides': None,
'typestr': '<i4',
'version': 3}
In [113]: A.nbytes
Out[113]: 28
In [114]: id(A)
Out[114]: 2984144632
I get a ndarray object with a unique id, and attributes like shape, and strides, and data buffer. This buffer is 28 bytes starting at 173599672.
There isn't an A[3] object in A. I have to create it:
In [115]: x=A[3]
In [116]: type(x)
Out[116]: numpy.int32
In [117]: id(x)
Out[117]: 2984723472
In [118]: x.__array_interface__
Out[118]:
{'__ref': array(2),
'data': (179546048, False),
'descr': [('', '<i4')],
'shape': (),
'strides': None,
'typestr': '<i4',
'version': 3}
This x is in many ways just a 0d 1 element array (it displays differently). Notice that its data pointer is unrelated to that of A. So it isn't even sharing memory.
A slice does share memory
In [119]: y=A[3:4]
In [120]: y.__array_interface__
Out[120]:
{'data': (173599684, False), # 173599672+3*4
'descr': [('', '<i4')],
'shape': (1,),
'strides': None,
'typestr': '<i4',
'version': 3}
====================
What exactly do you mean by mapping arbitrary A[i] to its B[i]? Are you using the value at A[i] as the key, or the location as the key? In my example, the elements of A are not unique. I can uniquely access A[0] or A[2] (by index), but in both cases I get a value of 1.
But consider this situation. There is a relatively fast way of finding a value in a 1d array - in1d.
In [121]: np.in1d(A,1)
Out[121]: array([ True, False, True, False, False, False, False], dtype=bool)
Make the B array:
In [122]: B=np.arange(A.shape[0])
All the elements in B corresponding to a 1 value in A:
In [123]: B[np.in1d(A,1)]
Out[123]: array([0, 2])
In [124]: B[np.in1d(A,0)] # to 0
Out[124]: array([1, 5])
In [125]: B[np.in1d(A,2)] # to 2
Out[125]: array([3, 6])
A dictionary created from A gives the same (last) values:
In [134]: dict(zip(A,B))
Out[134]: {0: 5, 1: 2, 2: 6, 3: 4}
=====================
The paragraph about hashable in the Python docs talks about needing to have a __hash__ method.
So I checked a few objects:
In [200]: {}.__hash__ # None
In [201]: [].__hash__ # None
In [202]: ().__hash__
Out[202]: <method-wrapper '__hash__' of tuple object at 0xb729302c>
In [204]: class MyClass(object):
...: pass
...:
In [205]: MyClass().__hash__
Out[205]: <method-wrapper '__hash__' of MyClass object at 0xb3008c4c>
A numpy integer - int with a np.int32 wrapper:
In [206]: x
Out[206]: 2
In [207]: x.__hash__
Out[207]: <method-wrapper '__hash__' of numpy.int32 object at 0xb1e748c0>
In [208]: x.__hash__()
Out[208]: 2
A numpy array
In [209]: A
Out[209]:
array(['one', 'two'],
dtype='<U3')
In [210]: A.__hash__ # None
In [212]: np.float(12.232).__hash__()
Out[212]: 1219767578
So at a minimum the key for a dictionary must have a method of generating a hash, a unique identifier. It may be the instance id (default case), or maybe something derived from the values of the object (a checksum of some sort?). The dictionary maintains a table of these hashes, presumably with a pointer to both the key and the value. When I do a dictionary get, it generates the hash for the object I give it, looks that up, and returns the corresponding value - if present.
Classes that aren't hashable don't have a __hash__ method (or the method is None). They can't generate this unique id. Apparently by design an object of class np.ndarray does not have a __hash__. And playing with the writability flags does not change that.
The big problem with trying to hash or make a dictionary of the rows of an array is that you aren't interested in hashing a particular instance of an array (the object created by a slicing view), but a hash based on the values of that row.
So to take your 2d array:
In [236]: A
Out[236]:
array([[1, 2],
[3, 4],
[5, 6]])
you want A[1,:], and np.array([3,4]) to both generate the same __hash__() value. And A[0,:]+2, and maybe A.mean(axis=0) (except that's a float array).
And since you are worried about memory, you must be dealing with very large arrays, say (1000,1000) - which implies a hash value based on 1000 different numbers, and somehow unique.
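If copying each small row once is acceptable, the standard fallback is to key on the row's bytes; a sketch:
import numpy as np
A = np.array([[1, 2], [3, 4], [5, 6]])
lookup = {a.tobytes(): i for i, a in enumerate(A)}   # one small copy per row
def foo(ap):
    # the query must be contiguous and match A's dtype
    return lookup[np.ascontiguousarray(ap, dtype=A.dtype).tobytes()]
print(foo(np.array([3, 4])))   # 1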

numpy.view gives valueerror

Ran into this in the context of libtiff saving a file, but now I'm just confused. Can anyone tell me why these two are not equivalent?
ar1 = zeros((1000,1000),dtype=uint16)
ar1 = ar1.view(dtype=uint8) # works
ar2 = zeros((1000,2000),dtype=uint16)
ar2 = ar2[:,1000:]
ar2 = ar2.view(dtype=uint8) # ValueError: new type not compatible with array.
Edit:
so this also works?
ar2 = zeros((1000,2000),dtype=uint16)
ar2 = array(ar2[:,1000:])
ar2 = ar2.view(dtype=uint8)
Summary
In a nutshell, just move the view before the slicing.
Instead of:
ar2 = zeros((1000,2000),dtype=uint16)
ar2 = ar2[:,1000:]
ar2 = ar2.view(dtype=uint8)
Do:
ar2 = zeros((1000,2000),dtype=uint16)
ar2 = ar2.view(dtype=uint8) # ar2 is now a 1000x4000 array...
ar2 = ar2[:,2000:] # Note the 2000 instead of 1000!
What's happening is that the sliced array isn't contiguous (as @Craig noted) and view errs on the conservative side and doesn't try to re-interpret non-contiguous memory buffers. (It happens to be possible in this exact case, but in some cases it would result in a non-evenly-strided array, which numpy doesn't allow.)
If you're not very familiar with numpy, it's possible that you're misunderstanding view, and you actually want astype instead.
What does view do?
First off, let's take a detailed look at what view does. In this case, it re-interprets the memory buffer of a numpy array as a new datatype, if possible. That means that the number of elements in the array will often change when you use view. (You can also use it to view the array as a different subclass of ndarray, but we'll skip that part for now.)
You may already be aware of the following (your problem is a bit more subtle), but if not, here's an explanation.
As an example:
In [1]: import numpy as np
In [2]: x = np.zeros(2, dtype=np.uint16)
In [3]: x
Out[3]: array([0, 0], dtype=uint16)
In [4]: x.view(np.uint8)
Out[4]: array([0, 0, 0, 0], dtype=uint8)
In [5]: x.view(np.uint32)
Out[5]: array([0], dtype=uint32)
If you want to make a copy of the array with the new datatype instead, use astype:
In [6]: x
Out[6]: array([0, 0], dtype=uint16)
In [7]: x.astype(np.uint8)
Out[7]: array([0, 0], dtype=uint8)
In [8]: x.astype(np.uint32)
Out[8]: array([0, 0], dtype=uint32)
Now let's take a look at what happens when viewing a 2D array.
In [9]: y = np.arange(4, dtype=np.uint16).reshape(2, 2)
In [10]: y
Out[10]:
array([[0, 1],
[2, 3]], dtype=uint16)
In [11]: y.view(np.uint8)
Out[11]:
array([[0, 0, 1, 0],
[2, 0, 3, 0]], dtype=uint8)
Notice that the shape of the array has changed, and that the changes have happened along the last axis (in this case, extra columns have been added).
At first glance it may appear that extra zeros have been added. It's not that extra zeros are being inserted; it's that the uint16 representation of 2 is equivalent to two uint8s, one with a value of 2 and one with a value of 0. Therefore, any uint16 value of 255 or less will show up as the value followed by a zero, while any larger value will show up as two smaller uint8s. As an example:
In [13]: y * 100
Out[13]:
array([[ 0, 100],
[200, 300]], dtype=uint16)
In [15]: (y * 100).view(np.uint8)
Out[15]:
array([[ 0, 0, 100, 0],
[200, 0, 44, 1]], dtype=uint8)
What's happening behind the scenes
Numpy arrays consist of a "raw" memory buffer that's interpreted through a shape, a dtype, and strides (and an offset, but let's ignore that for now). For more detail, there are several good overviews: the official documentation, the numpy book, or scipy-lectures.
This allows numpy to be very memory efficient and "slice and dice" the underlying memory buffer in many different ways without making a copy.
Strides tell numpy how many bytes to jump within the memory buffer to go one increment along a particular axis.
For example:
In [17]: y
Out[17]:
array([[0, 1],
[2, 3]], dtype=uint16)
In [18]: y.strides
Out[18]: (4, 2)
So, to go one row deeper in the array, numpy needs to step forward 4 bytes in the memory buffer, while to go one column farther in the array, numpy needs to step 2 bytes. Transposing the array just amounts to reversing the strides (and shape, but in this case, y is 2x2):
In [19]: y.T.strides
Out[19]: (2, 4)
When we view the array as uint8, the strides change. We still step forward 4 bytes per row, but only one byte per column:
In [20]: y.view(np.uint8).strides
Out[20]: (4, 1)
However, numpy arrays have to have one stride length per dimension. This is what "evenly-strided" means. In other words, to move forward one row/column/whatever, numpy needs to be able to step the same amount through the underlying memory buffer each time; there's no way to tell numpy to step different amounts for each row/column/whatever.
For that reason, view takes a very conservative route. If the array isn't contiguous, and the view would change the shape and strides of the array, it doesn't try to handle it. As @Craig noted, it's because the slice of y isn't contiguous that view isn't working.
There are plenty of cases (yours is one) where the resulting array would be valid, but the view method doesn't try to be too smart about it.
To really play around with what's possible, you can use numpy.lib.stride_tricks.as_strided or directly use the __array_interface__. It's a good learning tool to experiment with it, but you have to really understand what you're doing to use it effectively.
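For instance, a classic as_strided sketch that builds overlapping windows purely from strides (this assumes 8-byte int64; as_strided does no bounds checking, so wrong strides will happily read garbage):
import numpy as np
from numpy.lib.stride_tricks import as_strided
a = np.arange(6, dtype=np.int64)                 # strides: (8,)
w = as_strided(a, shape=(4, 3), strides=(8, 8))  # no copy is made
print(w)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]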
Hopefully that helps a bit, anyway! Sorry for the long-winded answer!
This isn't a complete answer but may point the way to the detail I am missing. When you make a slice of an array you no longer have contiguous data, and you do not own the data. To see this, look at the flags of the array:
ar2 = zeros((1000,2000),dtype=uint16)
ar2.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
ar2 = ar2[:,1000:]
ar2.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
I don't know which of these is causing the actual problem. As you note in your edit, if you make a new copy of the sliced array then things are fine. You can do this using array() as you note, or with something like ar2 = ar2[:,1000:].copy().
