Creating a numpy array from a set - python

I noticed the following behaviour exhibited by numpy arrays:
>>> import numpy as np
>>> s = {1,2,3}
>>> l = [1,2,3]
>>> np.array(l)
array([1, 2, 3])
>>> np.array(s)
array({1, 2, 3}, dtype=object)
>>> np.array(l, dtype='int')
array([1, 2, 3])
>>> np.array(l, dtype='int').dtype
dtype('int64')
>>> np.array(s, dtype='int')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'set'
There are two things to notice:
1. Creating an array from a set results in the array dtype being object.
2. Trying to specify a dtype results in an error which suggests that the set is being treated as a single element rather than an iterable.
What am I missing? I don't fully understand which bit of Python I'm overlooking. A set is a mutable object, much like a list.
EDIT: tuples work fine:
>>> t = (1,2,3)
>>> np.array(t)
array([1, 2, 3])
>>> np.array(t).dtype
dtype('int64')

The array factory works best with sequence objects, which a set is not. If you do not care about the order of the elements and know they are all ints or convertible to int, then you can use np.fromiter:
np.fromiter({1,2,3},int,3)
# array([1, 2, 3])
The second (dtype) argument is mandatory; the last (count) argument is optional, and supplying it can improve performance.
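For instance, a quick sketch (the variable names here are just illustrative):
import numpy as np

s = {1, 2, 3}
a = np.fromiter(s, int)          # dtype is required; numpy grows the buffer as it reads
b = np.fromiter(s, int, len(s))  # a known count lets numpy preallocate the exact size
# both give array([1, 2, 3]) here, in the set's iteration order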

As the curly-bracket syntax suggests, a set is more closely related to a dict than to a list. You can solve this very simply by turning the set into a list or tuple before converting to an array:
>>> import numpy as np
>>> s = {1,2,3}
>>> np.array(s)
array({1, 2, 3}, dtype=object)
>>> np.array(list(s))
array([1, 2, 3])
>>> np.array(tuple(s))
array([1, 2, 3])
However, this might be too inefficient for large sets, because list or tuple has to run through the whole set before the creation of the array can even start. A better method is to let numpy consume the set as an iterable:
>>> np.fromiter(s, int)
array([1, 2, 3])

The np.array documentation says that the object argument must be "an array, any object exposing the array interface, an object whose __array__ method returns an array, or any (nested) sequence" (emphasis added).
A set is not a sequence. Specifically, sets are unordered and do not support the __getitem__ method. Hence you cannot create an array from a set the way you are trying to with the list.
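A quick way to see that difference (a minimal illustration, not part of the quoted documentation; the exact error message depends on your Python version):
>>> l = [1, 2, 3]
>>> s = {1, 2, 3}
>>> l[0]
1
>>> s[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable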

Numpy expects the argument to be a sequence such as a list; it doesn't understand the set type, so it creates an object array (the same would happen if you passed any other non-sequence object). You can create a numpy array from a set by first converting the set to a list: numpy.array(list(my_set)). Hope this helps.

Related

How is it possible for Numpy to use comma-separated subscripting with `:`?

Consider the following example:
>>> a=np.array([1,2,3,4])
>>> a
array([1, 2, 3, 4])
>>> a[np.newaxis,:,np.newaxis]
array([[[1],
[2],
[3],
[4]]])
How is it possible for Numpy to use the : (normally used for slicing arrays) as an index when using comma-separated subscripting?
If I try to use comma-separated subscripting with either a Python list or a Python list-of-lists, I get a TypeError:
>>> [[1,2],[3,4]][0,:]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: list indices must be integers or slices, not tuple
Define a simple class with a __getitem__ (indexing) method:
In [128]: class Foo():
...: def __getitem__(self, arg):
...: print(type(arg), arg)
...:
In [129]: f = Foo()
And look at what different indexes produce:
In [130]: f[:]
<class 'slice'> slice(None, None, None)
In [131]: f[1:2:3]
<class 'slice'> slice(1, 2, 3)
In [132]: f[:, [1,2,3]]
<class 'tuple'> (slice(None, None, None), [1, 2, 3])
In [133]: f[:, :3]
<class 'tuple'> (slice(None, None, None), slice(None, 3, None))
In [134]: f[(slice(1,None),3)]
<class 'tuple'> (slice(1, None, None), 3)
For builtin classes like list, a tuple argument raises an error. But that's a class-dependent issue, not a syntax one. numpy.ndarray accepts a tuple, as long as it's compatible with the array's shape.
The syntax for a tuple index was added to Python to meet the needs of numpy. I don't think there are any builtin classes that use it.
The numpy.lib.index_tricks module has several classes that take advantage of this behavior. Look at its code for more ideas.
In [137]: np.s_[3:]
Out[137]: slice(3, None, None)
In [139]: np.r_['0,2,1',[1,2,3],[4,5,6]]
Out[139]:
array([[1, 2, 3],
[4, 5, 6]])
In [140]: np.c_[[1,2,3],[4,5,6]]
Out[140]:
array([[1, 4],
[2, 5],
[3, 6]])
other "indexing" examples:
In [141]: f[...]
<class 'ellipsis'> Ellipsis
In [142]: f[[1,2,3]]
<class 'list'> [1, 2, 3]
In [143]: f[10]
<class 'int'> 10
In [144]: f[{1:12}]
<class 'dict'> {1: 12}
I don't know of any class that makes use of a dict argument, but the syntax allows it.
Lists are 1D, and you can pass either a single index or a slice. The :, when used as an index, is shorthand for creating a slice. In [[1,2],[3,4]][0,:] you are passing 0,:, which is the tuple (0, slice(None, None, None)). That is, the parentheses are optional when creating a tuple here: just having values separated by a comma creates the tuple.
But numpy is different: since an ndarray is N-dimensional and not just 1D like a list, you can pass multiple indices to index the different dimensions. Therefore, indexing a numpy array with a tuple is allowed as long as the number of elements in the tuple is not greater than the number of dimensions of the indexed array. Consider the array below:
arr = np.random.randn(5,10, 3)
It has 3 dimensions and we can index it like arr[0,1,0], but this is the same as arr[(0,1,0)]. That is, we are passing a tuple to index the array. Each tuple element can itself be an integer or a slice, and numpy will do the appropriate indexing. Numpy accepts tuples for indexing, but lists don't.
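As a rough illustration of that equivalence (a small sketch, reusing the arr defined above; the values are random, so only the shapes and the asserts matter):
import numpy as np

arr = np.random.randn(5, 10, 3)
assert arr[0, 1, 0] == arr[(0, 1, 0)]    # comma-separated indices are just a tuple
col = arr[0, :, 2]                       # mixing integers and a slice
same = arr[(0, slice(None), 2)]          # the explicit tuple/slice form of the same thing
assert col.shape == (10,) and (col == same).all()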
However, when you write a[np.newaxis,:,np.newaxis] it is more than just plain indexing. First, note that np.newaxis is just None. When you use None to index a dimension, numpy creates that dimension. The a array in your example is 1D, but a[np.newaxis,:,np.newaxis] is a special type of indexing understood by numpy as shorthand for "give me an array with an extra axis wherever I index with np.newaxis, and whose elements come from my original array indexed as I'm specifying".
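A small sketch of what that does to the shape (not from the original answer, just a check you can run):
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4])
>>> a.shape
(4,)
>>> a[np.newaxis, :, np.newaxis].shape
(1, 4, 1)
>>> np.newaxis is None
True
>>> (a[None, :, None] == a[np.newaxis, :, np.newaxis]).all()
True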
So, the TLDR answer is numpy indexing is more general and powerful than list indexing.

Strange things about numpy indexing using empty ndarray

When I do numpy indexing, sometimes the index can be an empty list, in which case I want numpy to also return an empty array. For example:
a = np.array([1, 2, 3])
b = []
print a[b]
This works perfectly fine! The result gives me:
result:[]
But when I use an ndarray as the indexer, strange things happen:
a = np.array([1, 2, 3])
b = []
c = np.array(b)
print a[c]
This gives me an error:
IndexError: arrays used as indices must be of integer (or boolean) type
However, when I do this:
a = np.array([1, 2, 3])
b = []
d = np.arange(0, a.size)[b]
print a[d]
Then it works perfectly again:
result:[]
But when I check the types of c and d, they return the same thing! Even the shape and everything:
print type(c), c.shape
print type(d), d.shape
result:<type 'numpy.ndarray'> (0L,)
result:<type 'numpy.ndarray'> (0L,)
So I was wondering if there is anything wrong with it? How come a[c] doesn't work but a[d] works? Can you explain it for me? Thank you!
There is a simple resolution: numpy arrays have, in a sense, two types. There is their own type, which is numpy.ndarray, and the type of their elements, which is specified by a numpy.dtype. Arrays used for indexing must have elements of integer or boolean dtype. In your examples you use two array creation methods with different default dtypes: the array factory, which defaults to a float dtype if nothing else can be inferred from the template, and arange, which uses an integer dtype unless you pass float parameters.
Since the dtype is also a property of the array it is specified and checked even if there are no elements. Empty lists are not checked for dtype because they do not have a dtype attribute.
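You can see those defaults directly (a small check; the exact integer width shown depends on your platform):
>>> import numpy as np
>>> np.array([]).dtype      # array factory: nothing to infer, so it defaults to float
dtype('float64')
>>> np.arange(0).dtype      # arange: integer dtype by default
dtype('int64')
>>> hasattr([], 'dtype')    # a plain empty list carries no dtype at all
False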
numpy doesn't know what type the empty array is.
Try:
c = np.array(b, dtype=int)
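With the dtype specified, the empty index array behaves like the empty list did (a quick check; the dtype shown in the output depends on your platform):
>>> import numpy as np
>>> a = np.array([1, 2, 3])
>>> c = np.array([], dtype=int)
>>> a[c]
array([], dtype=int64)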

Python function to accept numpy ndarray or sequence as arguments

I have seen some Python functions that generally work by receiving an (n,2)-shaped numpy ndarray as argument, but can also "automagically" receive (2,n) arrays or even length-2 sequences (tuple or list).
How is this achieved pythonically? Is there a unified good practice for checking and handling these cases (for example, in functions from the numpy and scipy modules), or does each developer implement whatever he thinks best?
I'd just like to avoid (possibly nested) chains of ifs/elifs, in case there is a well-known better way.
Thanks for any help.
You can use the numpy.asarray function to convert any sequence-like input to an array:
>>> import numpy
>>> numpy.asarray([1,2,3])
array([1, 2, 3])
>>> numpy.asarray(numpy.array([2,3]))
array([2, 3])
>>> numpy.asarray(1)
array(1)
>>> numpy.asarray((2,3))
array([2, 3])
>>> numpy.asarray({1:3,2:4})
array({1: 3, 2: 4}, dtype=object)
It's important to note that, as the documentation says, "No copy is performed if the input is already an ndarray." This is really nice, since you can pass in an existing array and it just returns the same array.
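A quick way to verify the no-copy behaviour (a minimal check, not part of the original answer):
>>> a = numpy.array([1, 2, 3])
>>> numpy.asarray(a) is a           # the exact same object comes back, no copy
True
>>> numpy.asarray(a.tolist()) is a  # a fresh list, by contrast, produces a new array
False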
Once you convert it to a numpy array, just check the length if that's a requirement. Something like:
>>> def f(x):
...     x = numpy.asarray(x)
...     if len(x) != 2:
...         raise Exception("invalid argument")
...
>>> f([1,2])
>>> f([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 4, in f
Exception: invalid argument
Update:
Since you asked, here's a "magic" function that will accept *args as an array as well:
>>> def f(*args):
...     args = numpy.asarray(args[0]) if len(args) == 1 else numpy.asarray(args)
...     return args
...
>>> f(7,3,5)
array([7, 3, 5])
>>> f([1,2,3])
array([1, 2, 3])
>>> f((2,3,4))
array([2, 3, 4])
>>> f(numpy.array([1,2,3]))
array([1, 2, 3])

Understanding the behavior of Python's set

The documentation for the built-in type set says:
class set([iterable])
    Return a new set or frozenset object whose elements are taken from iterable. The elements of a set must be hashable.
That is all right, but why does this work:
>>> l = range(10)
>>> s = set(l)
>>> s
set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
And this doesn't:
>>> s.add([10])
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
s.add([10])
TypeError: unhashable type: 'list'
Both are lists. Is some magic happening during the initialization?
When you initialize a set, you provide a list of values that must each be hashable.
s = set()
s.add([10])
is the same as
s = set([[10]])
which throws the same error that you're seeing right now.
In [13]: (2).__hash__
Out[13]: <method-wrapper '__hash__' of int object at 0x9f61d84>
In [14]: ([2]).__hash__ # nothing.
The thing is that set needs its items to be hashable, i.e. to implement the __hash__ magic method (a set is backed by a hash table, and the hash is what places each item in it). list does not implement that magic method, hence it cannot be added to a set.
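For example (a small illustration in the same spirit; the set repr shown is the Python 2 style used elsewhere in this question):
>>> hash(2)        # ints are hashable
2
>>> hash([2])      # lists are not
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> s = set()
>>> s.add((10,))   # a tuple is immutable and hashable, so it can go into a set
>>> s
set([(10,)])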
In this line:
s.add([10])
You are trying to add a list to the set, rather than the elements of the list. If you want to add the elements of the list, use the update method.
Think of the constructor being something like:
class Set:
    def __init__(self, l):
        for elem in l:
            self.add(elem)
There is nothing too surprising about why the constructor takes a list while add(element) does not.
It behaves according to the documentation: set.add() adds a single element (and since you give it a list, it complains it is unhashable - since lists are no good as hash keys). If you want to add a list of elements, use set.update(). Example:
>>> s = set([1,2,3])
>>> s.add(5)
>>> s
set([1, 2, 3, 5])
>>> s.update([8])
>>> s
set([8, 1, 2, 3, 5])
s.add([10]) works as documented. An exception is raised because [10] is not hashable.
There is no magic happening during initialisation.
set([0,1,2,3,4,5,6,7,8,9]) has the same effect as set(range(10)) and set(xrange(10)) and set(foo()) where
def foo():
    for i in (9,8,7,6,5,4,3,2,1,0):
        yield i
In other words, the arg to set is an iterable, and each of the values obtained from the iterable must be hashable.

numpy.equal with string values

The numpy.equal function does not work if a list or array contains strings:
>>> import numpy
>>> index = numpy.equal([1,2,'a'],None)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: function not supported for these types, and can't coerce safely to supported types
What is the easiest way to workaround this without looping through each element? In the end, I need index to contain a boolean array indicating which elements are None.
If you really need to use numpy, be more careful about what you pass in and it can work:
>>> import numpy
>>> a = numpy.array([1, 2, 'a'], dtype=object) # makes type of array what you need
>>> numpy.equal(a, None)
array([False, False, False], dtype=bool)
Since you start with a list, there's a chance that what you really want is just a list comprehension like [item is None for item in [1, 2, 'a']] or a similar generator expression.
Having a heterogeneous list like this is odd. Lists (and numpy arrays) are typically used for homogeneous data.
What's wrong with a stock list comprehension?
index = [x is None for x in L]
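If you do need index as a numpy boolean array for fancy indexing, you can wrap the comprehension (a small sketch; the list L here is just an example, and the repr may vary with your numpy version):
>>> import numpy
>>> L = [1, 2, 'a', None]
>>> index = numpy.array([x is None for x in L])
>>> index
array([False, False, False,  True])
>>> numpy.array(L, dtype=object)[index]
array([None], dtype=object)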
