This question has info on using an input as an output to compute something in place with a numpy.ufunc:
Numpy passing input array as `out` argument to ufunc
Is it possible to avoid allocating space for an unwanted output of a numpy.ufunc? For example, say I only want one of the two outputs from modf. Can I ensure that the other, unwanted array is never allocated at all?
I thought passing _ to out might do it, but it throws an error:
import numpy as np
ar = np.arange(6)/3
np.modf(ar, out=(ar, _))
TypeError: return arrays must be of ArrayType
As it says in the docs, passing None means that the output array is allocated in the function and returned. I can ignore the returned values, but it still has to be allocated and populated inside the function.
You can minimize allocation by passing a "fake" array:
ar = np.arange(6) / 3
np.modf(ar, ar, np.broadcast_arrays(ar.dtype.type(0), ar)[0])
This dummy array is only as big as a single double, so modf will not allocate a full-size array internally.
EDIT: Following suggestions from @Eric and @hpaulj, a more general and long-term solution would be
np.lib.stride_tricks._broadcast_to(np.empty(1, ar.dtype), ar.shape, False, False)
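On newer NumPy versions the same kind of writable, zero-stride dummy can be built with the public np.lib.stride_tricks.as_strided. This is only a sketch of the idea above (as_strided comes with the usual "use with care" caveats), not the original answer's code:

import numpy as np

ar = np.arange(6) / 3

# Writable dummy: one real element, broadcast via zero strides to ar's shape,
# so modf has somewhere to write the integer parts without a full-size buffer.
dummy = np.lib.stride_tricks.as_strided(
    np.empty(1, ar.dtype),
    shape=ar.shape,
    strides=(0,) * ar.ndim,
)

np.modf(ar, ar, dummy)   # fractional parts overwrite ar in place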
In order to create an empty array for some results, I need to know the resulting dtype for a certain operation (e.g. multiply) when doing the operation based on two other arrays.
How to determine resulting dtype of a numpy array operation in advance?
If a and b are the argument arrays, I can, for example, determine the resulting dtype of the multiplication (*) by making two zero values and doing a trial operation, like:
dtype=(a.dtype.type(0) * b.dtype.type(0)).dtype
However, this seems a little awkward... or maybe I'm going about it the wrong way around...
So, using result_type as given in the accepted answer, the code can look like:
dtype=numpy.result_type(a, b)
Use numpy.result_type(), available in NumPy >= 1.6.0:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.result_type.html
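As a minimal sketch of putting result_type to work when pre-allocating an output buffer (the sample arrays here are just illustrative):

import numpy as np

a = np.arange(4, dtype=np.int32)
b = np.linspace(0.0, 1.0, 4, dtype=np.float32)

out_dtype = np.result_type(a, b)            # dtype that a * b would produce
out = np.empty(a.shape, dtype=out_dtype)    # allocate the result up front
np.multiply(a, b, out=out)
print(out.dtype)                            # matches (a * b).dtype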
import numpy as np
a = np.array([0])
b = np.array([None])
c = np.array([''])
d = np.array([' '])
Why should we have this inconsistency:
>>> bool(a)
False
>>> bool(b)
False
>>> bool(c)
True
>>> bool(d)
False
For arrays with one element, the array's truth value is determined by the truth value of that element.
The main point to make is that np.array(['']) is not an array containing one empty Python string. This array is created to hold strings of exactly one byte each and NumPy pads strings that are too short with the null character. This means that the array is equal to np.array(['\0']).
In this regard, NumPy is being consistent with Python which evaluates bool('\0') as True.
In fact, the only strings which are False in NumPy arrays are strings which do not contain any non-whitespace characters ('\0' is not a whitespace character).
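A quick illustration of that rule, on the NumPy versions this answer describes (see the update further down for newer releases):

import numpy as np

print(bool(np.array(['a'])))    # True: contains a non-whitespace character
print(bool(np.array([' '])))    # False: whitespace only
print(bool(np.array([''])))     # True here, because of the padded '\0'
print(bool('\0'))               # True in plain Python too: a non-empty string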
Details of this Boolean evaluation are presented below.
Navigating NumPy's labyrinthine source code is not always easy, but we can find the code governing how values in different datatypes are mapped to Boolean values in the arraytypes.c.src file. This will explain how bool(a), bool(b), bool(c) and bool(d) are determined.
Before we get to the code in that file, we can see that calling bool() on a NumPy array invokes the internal _array_nonzero() function. If the array is empty, we get False. If there are two or more elements we get an error. But if the array has exactly one element, we hit the line:
return PyArray_DESCR(mp)->f->nonzero(PyArray_DATA(mp), mp);
Now, PyArray_DESCR is a struct holding various properties for the array. f is a pointer to another struct PyArray_ArrFuncs that holds the array's nonzero function. In other words, NumPy is going to call upon the array's own special nonzero function to check the Boolean value of that one element.
Determining whether an element is nonzero or not is obviously going to depend on the datatype of the element. The code implementing the type-specific nonzero functions can be found in the "nonzero" section of the arraytypes.c.src file.
As we'd expect, floats, integers and complex numbers are False if they're equal to zero. This explains bool(a). In the case of object arrays, None is similarly going to be evaluated as False because NumPy just calls the PyObject_IsTrue function. This explains bool(b).
To understand the results of bool(c) and bool(d), we see that the nonzero function for string type arrays is mapped to the STRING_nonzero function:
static npy_bool
STRING_nonzero (char *ip, PyArrayObject *ap)
{
    int len = PyArray_DESCR(ap)->elsize; // size of dtype (not string length)
    int i;
    npy_bool nonz = NPY_FALSE;

    for (i = 0; i < len; i++) {
        if (!Py_STRING_ISSPACE(*ip)) { // if it isn't whitespace, it's True
            nonz = NPY_TRUE;
            break;
        }
        ip++;
    }
    return nonz;
}
(The unicode case is more or less the same idea.)
So in arrays with a string or unicode datatype, a string is only False if it contains only whitespace characters:
>>> bool(np.array([' ']))
False
In the case of array c in the question, there is really a null character \0 padding the seemingly-empty string:
>>> np.array(['']) == np.array(['\0'])
array([ True], dtype=bool)
The STRING_nonzero function sees this non-whitespace character and so bool(c) is True.
As noted at the start of this answer, this is consistent with Python's evaluation of strings containing a single null character: bool('\0') is also True.
Update: Wim has fixed the behaviour detailed above in NumPy's master branch by making strings which contain only null characters, or a mix of only whitespace and null characters, evaluate to False. This means that NumPy 1.10+ will see that bool(np.array([''])) is False, which is much more in line with Python's treatment of "empty" strings.
I'm pretty sure the answer is, as explained in Scalars, that:
Array scalars have the same attributes and methods as ndarrays. [1] This allows one to treat items of an array partly on the same footing as arrays, smoothing out rough edges that result when mixing scalar and array operations.
So, if it's acceptable to call bool on a scalar, it must be acceptable to call bool on an array of shape (1,), because they are, as far as possible, the same thing.
And, while it isn't directly said anywhere in the docs that I know of, it's pretty obvious from the design that NumPy's scalars are supposed to act like native Python objects.
So, that explains why np.array([0]) is falsey rather than truthy, which is what you were initially surprised about.
So, that explains the basics. But what about the specifics of case c?
First, note that your array np.array(['']) is not an array of one Python object, but an array of one NumPy <U1 null-terminated character string of length 1. Fixed-length-string values don't have the same truthiness rule as Python strings—and they really couldn't; for a fixed-length-string type, "false if empty" doesn't make any sense, because they're never empty. You could argue about whether NumPy should have been designed that way or not, but it clearly does follow that rule consistently, and I don't think the opposite rule would be any less confusing here, just different.
But there seems to be something else weird going on with strings. Consider this:
>>> np.array(['a', 'b']) != 0
True
That's not doing an elementwise comparison of the <U2 strings to 0 and returning array([True, True]) (as you'd get from np.array(['a', 'b'], dtype=object)), it's doing an array-wide comparison and deciding that no array of strings is equal to 0, which seems odd… I'm not sure whether this deserves a separate answer here or even a whole separate question, but I am pretty sure I'm not going to be the one who writes that answer, because I have no clue what's going on here. :)
Beyond arrays of shape (1,), arrays of shape () are treated the same way, but anything else is a ValueError, because otherwise it would be very easy to misuse arrays with and and other Python operators that NumPy can't automagically convert into elementwise operations.
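For example (a minimal sketch of the shape rules just described):

import numpy as np

print(bool(np.array(0)))       # shape (): truth value of the lone element -> False
print(bool(np.array([0])))     # shape (1,): same rule -> False
try:
    bool(np.array([0, 1]))     # two or more elements: ambiguous
except ValueError as err:
    print(err)                 # "The truth value of an array with more than one element is ambiguous..."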
I personally think being consistent with other arrays would be more useful than being consistent with scalars here—in other words, just raise a ValueError. I also think that, if being consistent with scalars were important here, it would be better to be consistent with the unboxed Python values. In other words, if bool(array([v])) and bool(array(v)) are going to be allowed at all, they should always return exactly the same thing as bool(v), even if that's not consistent with np.nonzero. But I can see the argument the other way.
It's fixed in master now.
I thought this was a bug, and the numpy devs agreed, so this patch was merged earlier today. We should see new behaviour in the upcoming 1.10 release.
NumPy seems to be following the same truth-casting rules as builtin Python; in this context it appears to come down to which values return true for calls to nonzero. Apparently len can also be used, but here none of these arrays are empty (length 0), so that's not directly relevant. Note that calling bool([False]) also returns True according to these rules.
a = np.array([0])
b = np.array([None])
c = np.array([''])
>>> np.nonzero(a)
(array([], dtype=int64),)
>>> np.nonzero(b)
(array([], dtype=int64),)
>>> np.nonzero(c)
(array([0]),)
This also seems consistent with the more enumerative description of bool casting --- where your examples are all explicitly discussed.
Interestingly, there does seem to be systematically different behavior with string arrays, e.g.
>>> a.astype(bool)
array([False], dtype=bool)
>>> b.astype(bool)
array([False], dtype=bool)
>>> c.astype(bool)
ERROR: ValueError: invalid literal for int() with base 10: ''
I think, when numpy converts something into a bool it uses the PyArray_BoolConverter function which, in turn, just calls the PyObject_IsTrue function --- i.e. the exact same function that builtin python uses, which is why numpy's results are so consistent.
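As a small check of that claim for object arrays (a sketch; each element's own Python truth value is what comes back):

import numpy as np

for obj in (None, '', 0, 'x', 42):
    arr = np.empty(1, dtype=object)   # one-element object array
    arr[0] = obj
    print(repr(obj), bool(arr), bool(obj))   # the last two columns agree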
I am testing some edge cases of my program and observed a strange fact. When I create a scalar numpy array, it has size==1 and ndim==0.
>>> A=np.array(1.0)
>>> A.ndim # returns 0
>>> A.size # returns 1
But when I create an empty array with no elements, it has size==0 but ndim==1.
>>> A=np.array([])
>>> A.ndim # returns 1
>>> A.size # returns 0
Why is that? I would expect the ndim to also be 0. Or is there another way of creating a 'really' empty array with size and ndim equal to 0?
UPDATE: even A=np.empty(shape=None) does not create a dimensionless array of size 0...
I believe the answer is that "No, you can't create an ndarray with both ndim and size of zero". As you've already found out yourself, the (ndim,size) pairs of (1,0) and (0,1) are as low as you can go.
This very nice answer explains a lot about numpy scalar types, and why they're a bit odd to have around. This explanation makes it clear that scalar numpy arrays like array(1) are a very special kind of beast. They only have a single value (causing size==1), but by definition they don't have a sense of dimensionality, hence ndim==0. Non-scalar numpy arrays, on the other hand, can be empty, but they contain at least a pair of square brackets, leading to a minimal ndim of 1, even if their size can be 0 if they are made up of empty lists. (This is how I think about the situation: ndarrays are in a way lists of lists of lists of ..., on as many levels as there are dimensions. 1d arrays are compatible with lists, so an empty list, being still a list, also has a defining dimension.)
The only way to come up with an "empty scalar" would be to call np.array() with no data at all, but the array constructor needs some actual object to initialize from. So I believe your program is safe from this edge case.
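A short demonstration of that (ndim, size) floor:

import numpy as np

s = np.array(1.0)          # 0-d "scalar" array
print(s.ndim, s.size)      # 0 1

e = np.array([])           # 1-d empty array
print(e.ndim, e.size)      # 1 0

z = np.empty((0, 3))       # size 0 is possible with more dimensions...
print(z.ndim, z.size)      # 2 0 ...but never together with ndim == 0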
The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4Gb RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how to do that efficiently? Should I maybe make 2 2D arrays and add rows to them? (Then later I'd need to convert them to 1D).
Or maybe there's a better approach? All I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of the float) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Perhaps build a single, structured array using np.fromiter:
import numpy as np
def gendata():
    # You, of course, have a different gendata...
    for i in xrange(N):
        yield (np.random.random(), str(i))

N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there.
Perhaps a more thorough discussion is in the online documentation. Which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are default column names. Since I defined the dtype as '<f8,|S20' I failed to provide column names, so NumPy named the first column 'f0' and the second 'f1'. If we had used

dtype=[('fval','<f8'), ('text','|S20')]

then the structured array arr would have column names 'fval' and 'text'.
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (and iterate through gendata a second time), but that's rather burdensome. It is of course better if you know in advance the maximum size of the strings. (|S20 defines the string field as having a fixed length of 20 bytes.)
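As a rough sketch of that two-pass idea (it assumes the data can be generated twice, for example because it comes from a file or a cheap generator; the gendata here is just a stand-in):

import numpy as np

def gendata():
    # stand-in generator; yours produces (float, str) tuples some other way
    for i in range(100):
        yield (np.random.random(), str(i))

# Pass 1: find the longest string so the dtype can be sized exactly.
maxlen = max(len(s) for _, s in gendata())

# Pass 2: build the structured array with a string field that now fits every row.
arr = np.fromiter(gendata(), dtype='<f8,|S%d' % maxlen)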
NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or would somehow have to be redesigned. NumPy is simply not built this way.
NumPy does have an object dtype which allows you to place a 4-byte pointer to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resultant array. If you do not specify the count, then NumPy will make a guess for the initial size of the array, and if too small, it will try to resize the array. If the original block of memory can be extended you are in luck. But if NumPy has to allocate an entirely new hunk of memory then all the old data will have to be copied to the new location, which will slow down the performance significantly.
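For example, a sketch with the count supplied up front (reusing the stand-in gendata pattern from above):

import numpy as np

def gendata(n):
    for i in range(n):
        yield (np.random.random(), str(i))

n = 100
# With count given, NumPy allocates the whole buffer once instead of guessing
# and resizing as the iterator is consumed.
arr = np.fromiter(gendata(n), dtype='<f8,|S20', count=n)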
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT
def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in xrange(N):
        yield (np.random.random(), str(i))

def fromiter(iterable, dtype, chunksize=7):
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            arr.resize(newsize, refcheck=0)
            arr[size:] = col
        size = newsize
    return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.
I came across the following oddity in numpy which may or may not be a bug:
import numpy as np
dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)
type(a['tuple'][0]) # ndarray
type(a[0]['tuple']) # ndarray
a['tuple'][0] = (1,2) # ok
a[0]['tuple'] = (1,2) # ValueError: shape-mismatch on array construction
I would have expected both of the options above to work.
Opinions?
I asked that on the numpy-discussion list. Travis Oliphant answered here.
Citing his answer:
The short answer is that this is not really a "normal" bug, but it could be considered a "design" bug (although the issues may not be straightforward to resolve). What that means is that it may not be changed in the short term --- and you should just use the first spelling.
Structured arrays can be a confusing area of NumPy for several reasons. You've constructed an example that touches on several of them. You have a data-type that is a "structure" array with one member ("tuple"). That member contains a 2-vector of integers.
First of all, it is important to remember that with Python, doing
a['tuple'][0] = (1,2)
is equivalent to
b = a['tuple']; b[0] = (1,2)
In like manner,
a[0]['tuple'] = (1,2)
is equivalent to
b = a[0]; b['tuple'] = (1,2)
To understand the behavior, we need to dissect both code paths and what happens. You built a (3,) array of those elements in 'a'. When you write b = a['tuple'] you should probably be getting a (3,) array of (2,)-integers, but as there is currently no formal dtype support for (n,)-integers as a general dtype in NumPy, you get back a (3,2) array of integers which is the closest thing that NumPy can give you. Setting the [0] row of this object via
a['tuple'][0] = (1,2)
works just fine and does what you would expect.
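Concretely, a small sketch reusing the dtype from the question:

import numpy as np

dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)

b = a['tuple']
print(b.shape)        # (3, 2): a plain integer view, not a (3,) array of pairs
print(b.base is a)    # True: writing through b writes into a

a['tuple'][0] = (1, 2)   # the assignment that works
print(a[0])              # the first row now holds 1, 2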
On the other hand, when you type:
b = a[0]
you are getting back an array-scalar which is a particularly interesting kind of array scalar that can hold records. This new object is formally of type numpy.void and it holds a "scalar representation" of anything that fits under the "VOID" basic dtype.
For some reason:
b['tuple'] = [1,2]
is not working. On my system I'm getting a different error: TypeError: object of type 'int' has no len()
I think this should be filed as a bug on the issue tracker which is for the time being here: http://projects.scipy.org/numpy
The problem is ultimately the void->copyswap function being called in voidtype_setfields if someone wants to investigate. I think this behavior should work.
An explanation for this is given in a numpy bug report.
I get a different error than you do (using numpy 1.7.0.dev):
ValueError: setting an array element with a sequence.
so the explanation below may not be correct for your system (or it could even be the wrong explanation for what I see).
First, notice that indexing a row of a structured array gives you a numpy.void object (see data type docs)
import numpy as np
dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)
print type(a[0]) # = numpy.void
From what I understand, void is sort of like a Python list since it can hold objects of different data types, which makes sense since the columns in a structured array can be different data types.
If, instead of indexing, you slice out the first row, you get an ndarray:
print type(a[:1]) # = numpy.ndarray
This is analogous to how Python lists work:
b = [1, 2, 3]
print b[0] # 1
print b[:1] # [1]
Slicing returns a shortened version of the original sequence, but indexing returns an element (here, an int; above, a void type).
So when you slice into the rows of the structured array, you should expect it to behave just like your original array (only with fewer rows). Continuing with your example, you can now assign to the 'tuple' columns of the first row:
a[:1]['tuple'] = (1, 2)
So,... why doesn't a[0]['tuple'] = (1, 2) work?
Well, recall that a[0] returns a void object. So, when you call
a[0]['tuple'] = (1, 2) # this line fails
you're assigning a tuple to the 'tuple' element of that void object. Note: despite the fact you've called this index 'tuple', it was stored as an ndarray:
print type(a[0]['tuple']) # = numpy.ndarray
So, this means the tuple needs to be cast into an ndarray. But, the void object can't cast assignments (this is just a guess) because it can contain arbitrary data types so it doesn't know what type to cast to. To get around this you can cast the input yourself:
a[0]['tuple'] = np.array((1, 2))
The fact that we get different errors suggests that the above line might not work for you since casting addresses the error I received---not the one you received.
Addendum:
So why does the following work?
a[0]['tuple'][:] = (1, 2)
Here, you're indexing into the array when you add [:], but without that, you're indexing into the void object. In other words, a[0]['tuple'][:] says "replace the elements of the stored array" (which is handled by the array), a[0]['tuple'] says "replace the stored array" (which is handled by void).
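Putting the spellings that do work side by side (a small sketch; on newer NumPy releases, per the fix noted below, the original spelling works as well):

import numpy as np

dt = np.dtype([('tuple', (int, 2))])
a = np.zeros(3, dt)

a['tuple'][0] = (1, 2)       # field first, then row: works
a[:1]['tuple'] = (3, 4)      # slice the row (a view), then the field: works
a[0]['tuple'][:] = (5, 6)    # index into the stored sub-array itself: works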
Epilogue:
Strangely enough, accessing the row (i.e. indexing with 0) seems to drop the base array, but it still allows you to assign to the base array.
print a['tuple'].base is a # = True
print a[0].base is a # = False
a[0] = ((1, 2),) # `a` is changed
Maybe void is not really an array so it doesn't have a base array,... but then why does it have a base attribute?
This was an upstream bug, fixed as of NumPy PR #5947, with a fix in 1.9.3.