I'm learning Python right now and I'm stuck on this line of code I found on the internet. I cannot understand what this line of code actually does.
Suppose I have this array:
import numpy as np
x = np.array([[1, 5], [8, 1], [10, 0.5]])
y = x[np.sqrt(x[:,0]**2+x[:,1]**2) < 1]
print (y)
The result is an empty array. What I want to know is: what does y actually do? I've never encountered this kind of code before. It seems like the expression inside the square brackets acts like an if-conditional statement. Instead of that code, if I write this line of code:
import numpy as np
x = np.array([[1, 5], [8, 1], [10, 0.5]])
y = x[0 < 1]
print (y)
It will return exactly what x is (because zero IS less than one).
Assuming that it is a way to write an if-conditional statement, I find it really absurd, because I'm comparing an array with an integer.
Thank you for your answer!
In NumPy,
np.array([1, 1, 2, 3, 4]) < 2
is (very roughly) equivalent to something like
[x < 2 for x in [1, 1, 2, 3, 4]]
for vanilla Python lists. And as such, in both cases, the result would be:
[True, True, False, False, False]
The same holds for other operations, like addition, multiplication and so on; this kind of element-wise behavior (broadcasting) is actually a major selling point of NumPy.
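A quick runnable illustration of this element-wise behavior:

import numpy as np

a = np.array([1, 1, 2, 3, 4])
print(a < 2)    # [ True  True False False False]
print(a + 10)   # [11 11 12 13 14]
print(a * 2)    # [2 2 4 6 8]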
Now, another thing you can do in Numpy is boolean indexing, which is providing an array of bools that are interpreted as 'Keep this value Y/N?'. So:
arr = np.array([1, 1, 2, 3, 4])
res = arr[arr < 2]
# res evaluates to:
array([1, 1])
NumPy works differently when you index an array using a boolean than when you use an int.
From the docs:
This advanced indexing occurs when obj is an array object of Boolean type, such as may be returned from comparison operators. A single boolean index array is practically identical to x[obj.nonzero()] where, as described above, obj.nonzero() returns a tuple (of length obj.ndim) of integer index arrays showing the True elements of obj. However, it is faster when obj.shape == x.shape.

If obj.ndim == x.ndim, x[obj] returns a 1-dimensional array filled with the elements of x corresponding to the True values of obj. The search order will be row-major, C-style. If obj has True values at entries that are outside of the bounds of x, then an index error will be raised. If obj is smaller than x it is identical to filling it with False.
When you index an array using booleans, you are telling NumPy to select the data corresponding to True; therefore array[True] is not the same as array[1]. In the first case, NumPy interprets it as a zero-dimensional boolean array which, based on how masks work, is the same as selecting all the data.
Therefore:
x[True]
will return the full array, just as
x[False]
will return an empty array.
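A quick sketch of this (note that in current NumPy a scalar boolean index also adds a new leading axis):

import numpy as np

x = np.array([[1, 5], [8, 1], [10, 0.5]])
print(x[True].shape)    # (1, 3, 2) -- everything kept, under one extra axis
print(x[False].shape)   # (0, 3, 2) -- nothing kept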
What is the rationale behind the seemingly inconsistent behaviour of the following lines of code?
import numpy as np
# standard list
print(bool([])) # False - expected
print(bool([0])) # True - expected
print(bool([1])) # True - expected
print(bool([0,0])) # True - expected
# numpy arrays
print(bool(np.array([]))) # False - expected, deprecation warning: The
# truth value of an empty array is ambiguous...
print(bool(np.array([0]))) # False - unexpected, no warning
print(bool(np.array([1]))) # True - unexpected, no warning
print(bool(np.array([0,0]))) # ValueError: The truth value of an array
# with more than one element is ambiguous...
There are at least two inconsistencies from my point of view:
Standard Python containers can be tested for emptiness with bool(container). Why do numpy arrays not follow this pattern? (bool(np.array([0])) yields False.)
Why is there an exception/deprecation warning when converting an empty numpy array or an array of length > 1, but it is okay to do so when the numpy array contains just one element?
Note that the deprecation warning for empty numpy arrays was added somewhere between numpy 1.11 and 1.14.
For the first problem, the reason is that it's not at all clear what you want to do with if np.array([1, 2]):.
This isn't a problem for if [1, 2]: because Python lists don't do element-wise anything. The only thing you can be asking is whether the list itself is truthy (non-empty).
But Numpy arrays do everything element-wise that possibly could be element-wise. Notice that this is hardly the only place, or even the most common place, where element-wise semantics mean that arrays work differently from normal Python sequences. For example:
>>> [1, 2] * 3
[1, 2, 1, 2, 1, 2]
>>> np.array([1, 2]) * 3
array([3, 6])
And, for this case in particular, boolean arrays are a very useful thing, especially since you can index with them:
>>> arr = np.array([1, 2, 3, 4])
>>> arr > 2 # all values > 2
array([False, False, True, True])
>>> arr[arr > 2] = 2 # clamp the values at <= 2
>>> arr
array([1, 2, 2, 2])
And once you have that feature, it becomes ambiguous what an array should mean in a boolean context. Normally, you want the bool array. But when you write if arr:, you could mean any of multiple things:
Do the body of the if for each element that's truthy. (Rewrite the body as an expression on arr indexed by the bool array.)
Do the body of the if if any element is truthy. (Use any.)
Do the body of the if if all elements are truthy. (Use all.)
A hybrid over some axis—e.g., do the body for each row where any element is truthy.
Do the body of the if if the array is nonempty—acting like a normal Python sequence but violating the usual element-wise semantics of an array. (Explicitly check for emptiness.)
So, rather than guess, and be wrong more often than not, numpy gives you an error and forces you to be explicit.
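For instance, the explicit spellings look like this:

import numpy as np

arr = np.array([0, 1, 2])
# if arr:          # ValueError: truth value of an array ... is ambiguous
if arr.any():      # at least one element is truthy
    print("some nonzero")
if arr.all():      # every element is truthy
    print("all nonzero")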
For the second problem, doesn't the warning text answer this for you? The truth value of a single element is obviously not ambiguous.
And single-element arrays—especially 0D ones—are often used as pseudo-scalars, so being able to do this isn't just harmless, it's also sometimes useful.
By contrast, asking "is this array empty" is rarely useful. A list is a variable-sized thing that you usually build up by adding one element at a time, zero or more times (possibly implicitly in a comprehension), so it's very often worth asking whether you added zero elements. But an array is a fixed-size thing, where you usually explicitly specified the size somewhere nearby in the code.
That's why it's allowed. And why it operates on the single value, not on the size of the array.
For empty arrays (which you didn't ask about, but did bring up): here, instead of there being multiple reasonable things you could mean, it's hard to think of anything reasonable you could mean. Which is probably why this is the only case that's changed recently (see issue 9583), rather than being the same since the days when Python added __nonzero__.
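If you really do want an emptiness test, the explicit spelling avoids the ambiguity entirely:

import numpy as np

arr = np.array([])
if arr.size == 0:   # explicit: "is this array empty?"
    print("empty")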
epsData is a two-dimensional array consisting of Dates and StockID.
I took out some of the code in order to make it simple.
The code calls the functions Generate and neweps, epsData is passed by the engine. I am not sure why it gives an error when I try to pass the array epsss to the SUE() function.
I tried to remove the extra bracket in array (if any) by using flatten function but that does not help.
SUE() is supposed to loop through the array and find the 4th last different value and then store these in an array.
I get this error:
TypeError: return arrays must be of ArrayType
with the three lines marked below:
def lastdifferentvalue(vals, datas, i):
    sizes = len(datas)
    j = sizes - 1
    values = 0
    while (i > 0) and (j >= 0):
        if logical_and((vals - datas[j] != 0), (datas[j] != 0), (datas[j-1] != 0)):  # !! HERE !!
            i = i - 1
            values = datas[j-1]
        j = j - 1
    return j, values

def SUE(datas):
    sizes = len(datas)
    j = sizes - 1
    values = 0
    sues = zeros(8)
    eps1 = datas[j]
    i = 7
    while (j > 0) and (i >= 0):
        counts, eps2 = lastdifferentvalue(eps1, array(datas[0:j]), 4)
        if eps2 != 0:
            sues[i] = eps1 - eps2
            i = i - 1
        j, eps1 = lastdifferentvalue(eps1, datas[0:j], 1)  # !! HERE !!
    stddev = std(SUE)
    sue7 = SUE[7]
    return stddev, sue7

def Generate(di, alpha):
    # the code below loops through the data; neweps is a
    # two-dimensional array of floats [dates, stockid]
    for ii in range(0, len(alpha)):
        if (epss[2, ii] - epss[1, ii] != 0) and (epss[2, ii] != 0) and (epss[1, ii] != 0):
            predata = 0
            epsss = neweps[di - delay - 250:di - delay + 1, ii]
            stddevs, suedata = SUE(array(epsss.flatten()))  # !! HERE !!
Presumably, you're using numpy.logical_and, in the form of
np.logical_and(a, b, c)
with the meaning that you'd like to take the logical and of the three. If you check the documentation, though, that's not what it does. It's interpreting c as the array where you intend to store the results.
You probably mean here something like
np.logical_and(a, np.logical_and(b, c))
or
from functools import reduce
reduce(np.logical_and, [a, b, c])
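As a runnable sketch, with placeholder arrays a, b, c:

import numpy as np
from functools import reduce

a = np.array([True, True, False])
b = np.array([True, False, False])
c = np.array([True, True, True])

print(np.logical_and(a, np.logical_and(b, c)))  # [ True False False]
print(reduce(np.logical_and, [a, b, c]))        # same result
# logical_and is a ufunc, so its reduce method works as well:
print(np.logical_and.reduce([a, b, c]))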
The line:
if logical_and((vals-datas[j]!=0),(datas[j]!=0),(datas[j-1]!=0))
has two errors:
Presumably you want to perform a logical_and over (vals - datas[j] != 0) and (datas[j] != 0) and (datas[j-1] != 0). However, numpy.logical_and takes only two input parameters; a third, if passed, is assumed to be an output array. Thus if you wish to have numpy.logical_and operate over three arrays, it should be expressed as:
logical_and(logical_and((vals-datas[j] != 0), (datas[j] != 0)), (datas[j-1] != 0))
In any case, using a logical_and in an if statement makes no sense. It returns an array, and a multi-element array does not have a truth value. That is, the result of a logical_and is an array of booleans, some of which may be true and some false. Are you wishing to check whether they are all true? Or whether at least some of them are true?
If the former, then you should test it as:
if numpy.all(logical_and(...)):
...
And if the latter then test it as:
if numpy.any(logical_and(...)):
...
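A small illustration of the difference between the two:

import numpy as np

mask = np.logical_and(np.array([1, 2, 0]) != 0, np.array([3, 0, 5]) != 0)
print(mask)          # [ True False False]
print(np.all(mask))  # False -- not every position passes
print(np.any(mask))  # True  -- at least one position passes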
This has given me a lot of trouble, and I am perplexed by the incompatibility of numpy arrays with pandas series. When I create a boolean array using a series, for instance
x = np.array([1,2,3,4,5,6,7])
y = pd.Series([1,2,3,4,5,6,7])
delta = np.percentile(x, 50)
deltamask = x - y > delta
deltamask is a boolean pandas Series.
However, if you do
x[deltamask]
y[deltamask]
You find that the array completely ignores the mask. No error is raised, but you end up with two objects of different lengths. This means that an operation like
x[deltamask]*y[deltamask]
results in an error. Inspecting the pieces shows the length mismatch:
print(type(x - y))
print(type(x[deltamask]), len(x[deltamask]))
print(type(y[deltamask]), len(y[deltamask]))
Even more perplexing, I noticed that the operator < is treated differently. For instance
print(type(2 * x < x * y))
print(type(2 < x * y))
will give you a pd.Series and an np.ndarray, respectively.
Also,
5 < x - y
results in a Series, so it seems that the Series takes precedence, whereas the boolean elements of a Series mask are promoted to integers when passed to a numpy array and result in a sliced array.
What is the reason for this?
Fancy Indexing
As numpy currently stands, fancy indexing in numpy works as follows:
If the thing between brackets is a tuple (whether with explicit parens or not), the elements of the tuple are indices for different dimensions of x. For example, both x[(True, True)] and x[True, True] will raise IndexError: too many indices for array in this case because x is 1D. However, before the exception happens, a telling warning will be raised too: VisibleDeprecationWarning: using a boolean instead of an integer will result in an error in the future.
If the thing between brackets is exactly an ndarray, not a subclass or other array-like, and has a boolean type, it will be applied as a mask. This is why x[deltamask.values] gives the expected result (an empty array, since deltamask is all False).
If the thing between brackets is any array-like, whether a subclass like Series or just a list, or something else, it is converted to an np.intp array (if possible) and used as an integer index. So x[deltamask] yields something equivalent to x[[False] * 7], or just x[[0] * 7]. In this case, len(deltamask) == 7 and x[0] == 1, so the result is [1, 1, 1, 1, 1, 1, 1].
This behavior is counterintuitive, and the FutureWarning: in the future, boolean array-likes will be handled as a boolean array index it generates indicates that a fix is in the works. I will update this answer as I find out about/make any changes to numpy.
This information can be found in Sebastian Berg's response to my initial query on Numpy discussion here.
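A sketch of the x[deltamask.values] workaround mentioned above, extracting the underlying ndarray from the Series:

import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5, 6, 7])
y = pd.Series([1, 2, 3, 4, 5, 6, 7])
deltamask = x - y > np.percentile(x, 50)  # all False in this example

print(x[deltamask.values])  # [] -- .values is a genuine bool ndarray, so it masks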
Relational Operators
Now let's address the second part of your question, about how the comparison works. Relational operators (<, >, <=, >=) work by calling the corresponding method on one of the objects being compared. For < this is __lt__. However, instead of just calling x.__lt__(y) for the expression x < y, Python actually checks the types of the objects being compared. If y is a subtype of x that implements the comparison, then Python prefers to call y.__gt__(x) instead, regardless of how you wrote the original comparison. The only way x.__lt__(y) will get called when y is a subclass of x is if y.__gt__(x) returns NotImplemented, indicating that the comparison is not supported in that direction.
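To see that preference in plain Python, here is a toy pair of classes (hypothetical, just for illustration):

class Base:
    def __lt__(self, other):
        print("Base.__lt__")
        return NotImplemented

class Sub(Base):
    def __gt__(self, other):
        print("Sub.__gt__")
        return True

# Written as base < sub, but Python tries the reflected Sub.__gt__
# first, because Sub is a subclass of Base that overrides __gt__:
print(Base() < Sub())  # prints "Sub.__gt__", then True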
A similar thing happens when you do 5 < x - y. While ndarray is not a subclass of int, the comparison int.__lt__(ndarray) returns NotImplemented, so Python actually ends up calling (x - y).__gt__(5), which is of course defined and works just fine.
A much more succinct explanation of all this can be found in the Python docs.
I'm confused by what the results of numpy.where mean, and how to use it to index into an array.
Have a look at the code sample below:
import numpy as np
a = np.random.randn(20,20,2)
indices = np.where(a[:,:,0] > 0.5)
I expect the indices array to be 2-dim and contain the indices where the condition is true. We can see that by
indices = np.array(indices)
indices.shape # (2,120)
So it looks like indices is acting on the flattened array of some sort, but I'm not able to figure out exactly how. More confusingly,
a.shape # (20,20,2)
a[indices].shape # (2,120,20,2)
Question:
How does indexing my array with the output of np.where actually grow the size of the array? What is going on here?
You are basing your indexing on a wrong assumption: np.where returns something that can be immediately used for advanced indexing (it's a tuple of np.ndarrays). But you convert it to a numpy array (so it's now an np.ndarray of np.ndarrays).
So
import numpy as np
a = np.random.randn(20,20,2)
indices = np.where(a[:,:,0] > 0.5)
a[:,:,0][indices]
# If you do a[indices] the result would be different, I'm not sure what
# you intended.
gives you the elements that are found by np.where. If you convert indices to a np.array it triggers another form of indexing (see this section of the numpy docs) and the warning message in the docs gets very important. That's the reason why it increases the total size of your array.
Some additional information about what np.where returns: you get a tuple containing n arrays, where n is the number of dimensions of the input array. So the coordinates of the first element that satisfies the condition are [0][0], [1][0], ..., [n-1][0], and not [0][0], [0][1], ..., [0][n-1]. So in your case, (2, 120) means you have 2 dimensions and 120 found points.
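A small concrete example of that layout:

import numpy as np

a = np.array([[0.9, 0.1],
              [0.2, 0.8]])
idx = np.where(a > 0.5)
print(idx)     # (array([0, 1]), array([0, 1])) -- one array per dimension
print(a[idx])  # [0.9 0.8]
# The first match is at (idx[0][0], idx[1][0]) == (0, 0).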
I have some code where I want to test if the product of a matrix and vector is the zero vector. An example of my attempt is:
import itertools
import numpy as np
from scipy.linalg import toeplitz

n = 2
zerovector = np.asarray([0] * n)
for column in itertools.product([0, 1], repeat=n):
    for row in itertools.product([0, 1], repeat=n-1):
        M = toeplitz(column, [column[0]] + list(row))
        for v in itertools.product([-1, 0, 1], repeat=n):
            vector = np.asarray(v)
            if (np.dot(M, v) == zerovector):
                print(M, "No good!")
                break
But the line if (np.dot(M,v) == zerovector): gives the error ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). What is the right way to do this?
The problem is that == between two arrays is an element-wise comparison—you get back an array of boolean values. An array of boolean values isn't a boolean value itself, so you can't use it in an if. This is what the error is trying to tell you.
You could solve this by using the all method, to check whether all of the elements in the boolean array are true. But you're making this way more complicated than you need to. Nonzero values are truthy, zero values are falsey, so you can just use any without a comparison:
if not np.dot(M, v).any():
If you want to make the comparison to zero explicit, just compare to a scalar, don't build a zero vector; it'll get broadcast the same way. And, if you ever do want to build a zero vector, just use the zeros function; don't build a list of zeros in a complicated way and pass it to asarray.
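For example (reusing M and v from the question):

# Comparing to a scalar broadcasts it across the whole result:
np.dot(M, v) == 0    # element-wise, same as comparing to a zero vector
# And the direct way to build a zero vector:
np.zeros(2)          # array([0., 0.])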
You could also use the count_nonzero function here as a different alternative. If it returns anything truthy (that is, any non-zero number), the array had at least one non-zero.
In general, you're making almost everything harder than necessary, and working through a brief NumPy tutorial and then scanning the main docs pages for useful functions would really help you.
Also, if your values aren't integers, you probably don't actually want to compare == 0 in the first place. Floating-point numbers accumulate rounding errors. To handle that, use the allclose function instead.
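A quick sketch of why exact comparison fails with floats:

import numpy as np

result = np.array([1e-12, -3e-13])  # "zero" up to rounding error
print((result == 0).all())          # False -- exact comparison misses it
print(np.allclose(result, 0))       # True  -- within default tolerances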
As the error says, you need to use all:
if all(np.dot(M, v) == zerovector):
or np.all. np.dot(M, v) == zerovector gives you a vector that is the element-wise comparison of the two vectors.