Difference between + and np.add() - python

I have this code:
import numpy as np
a = np.arange(10)
b = a + 1
c = np.add(a, 1)
print("a:", a)
print("b:", b)
print("c:", c)
Which prints:
a: [0 1 2 3 4 5 6 7 8 9]
b: [ 1 2 3 4 5 6 7 8 9 10]
c: [ 1 2 3 4 5 6 7 8 9 10]
I want to know: what is the difference between a + 1 and np.add(a, 1)?

numpy.add is the NumPy addition ufunc. Addition operations on arrays will usually delegate to numpy.add, but operations that don't involve arrays will usually not involve numpy.add.
For example, if we have two ordinary ints:
In [1]: import numpy
In [2]: numpy.add(1, 2)
Out[2]: 3
In [3]: type(_)
Out[3]: numpy.int64
In [4]: 1 + 2
Out[4]: 3
In [5]: type(_)
Out[5]: int
numpy.add will coerce the inputs to NumPy types, perform NumPy addition logic, and produce a NumPy output, while + performs ordinary Python int addition and produces an ordinary int.
Aside from that, numpy.add has a lot of additional arguments to control and customize how the operation is performed. You can specify an out array to write the output into, or a dtype argument to control the output dtype, or many other arguments for advanced use cases.
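For example, a small sketch of the `out` and `dtype` keyword arguments:

```python
import numpy as np

a = np.arange(10)

# dtype controls the output dtype of the ufunc
b = np.add(a, 1, dtype=np.float64)
print(b.dtype)  # float64

# out writes the result into an existing array instead of allocating a new one
out = np.empty_like(a)
np.add(a, 1, out=out)
print(out)  # [ 1  2  3  4  5  6  7  8  9 10]
```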

Python language construct when filtering an array

I can see many questions in SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement. But it confuses me from the syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be more clear about the confusing part: in other languages I have seen the following syntax:
df.filter(row => row.val > 3)
I understand what is going on under the hood there: for every row, the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me, because df.val looks like a column.
Moreover, I can write df[df > 3] and it executes successfully, which drives me crazy because I don't understand how a DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
   val
0    1
1    2
2    3
3    4
4    5
5    6
6    7
numpy arrays have methods and operators that operate on the whole array. For example you can multiply the array by a scalar, or add a scalar to all elements. Or in this case compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
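To see that this is class behaviour rather than special syntax, the operator dispatch can be made explicit: `arr > 3` calls the array's `__gt__` method, which corresponds to the `numpy.greater` ufunc.

```python
import numpy as np

arr = np.arange(1, 8)

# All three spellings produce the same boolean mask:
m1 = arr > 3              # operator syntax
m2 = arr.__gt__(3)        # the special method Python calls for >
m3 = np.greater(arr, 3)   # the underlying ufunc

print((m1 == m2).all() and (m2 == m3).all())  # True
```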
Similarly, pandas implements these methods (or uses numpy methods on the underlying arrays). Select a column of the frame with `df['val']`, or:
In [108]: df.val
Out[108]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0    False
1    False
2    False
3     True
4     True
5     True
6     True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
   val
3    4
4    5
5    6
6    7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
   val
3    4
4    5
5    6
6    7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
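A short sketch of the setting case, for both the array and the frame:

```python
import numpy as np
import pandas as pd

arr = np.arange(1, 8)
arr[arr > 3] = 0                # assign into the masked positions
print(arr)                      # [1 2 3 0 0 0 0]

df = pd.DataFrame({'val': [1, 2, 3, 4, 5, 6, 7]})
df.loc[df.val > 3, 'val'] = 0   # same idea with a boolean Series
print(df.val.tolist())          # [1, 2, 3, 0, 0, 0, 0]
```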

Why is .loc slicing in pandas inclusive of stop, contrary to typical python slicing?

I am slicing a pandas dataframe and I seem to be getting unexpected slices using .loc, at least as compared to numpy and ordinary python slicing. See the example below.
>>> import pandas as pd
>>> a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
>>> a
    0   1   2
0   0   1   2
1   3   4   5
2   4   5   6
3   9  10  11
4  34   2   1
>>> a.loc[1:3, :]
   0   1   2
1  3   4   5
2  4   5   6
3  9  10  11
>>> a.values[1:3, :]
array([[3, 4, 5],
       [4, 5, 6]])
Interestingly, this only happens with .loc, not .iloc.
>>> a.iloc[1:3, :]
   0  1  2
1  3  4  5
2  4  5  6
Thus, .loc appears to be inclusive of the terminating index, but numpy and .iloc are not.
By the comments, it seems this is not a bug and we are well warned. But why is it the case?
Remember .loc is primarily label based indexing. The decision to include the stop endpoint becomes far more obvious when working with a non-RangeIndex:
df = pd.DataFrame([1,2,3,4], index=list('achz'))
#    0
# a  1
# c  2
# h  3
# z  4
If I want to select all rows between 'a' and 'h' (inclusive) I only know about 'a' and 'h'. In order to be consistent with other python slicing, you'd need to also know what index follows 'h', which in this case is 'z' but could have been anything.
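To make that concrete with the frame above:

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4], index=list('achz'))

# .loc label slicing includes the stop label 'h'
print(df.loc['a':'h'])
#    0
# a  1
# c  2
# h  3
```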
There's also a section of the documentation, somewhat hidden away, that explains this design choice: "Endpoints are inclusive".
Additionally to the point in the docs, pandas slice indexing using .loc is not cell index based. It is in fact value based indexing (in the pandas docs it is called "label based", but for numerical data I prefer the term "value based"), whereas with .iloc it is traditional numpy-style cell indexing.
Furthermore, value based indexing is right-inclusive, whereas cell indexing is not. Just try the following:
a = pd.DataFrame([[0,1,2],[3,4,5],[4,5,6],[9,10,11],[34,2,1]])
a.index = [0, 1, 2, 3.1, 4]  # add a float index

# value based slicing: returns all values up to and including the slice value
a.loc[1:3.1]
# Out:
#      0   1   2
# 1.0  3   4   5
# 2.0  4   5   6
# 3.1  9  10  11

# cell index based slicing: raises an error, since only integers are allowed
a.iloc[1:3.1]
# Out: TypeError: cannot do slice indexing on <class 'pandas.core.indexes.numeric.Float64Index'> with these indexers [3.1] of <class 'float'>
To give an explicit answer to the question of why it is right-inclusive: when using values/labels as indices, it is (at least in my opinion) intuitive that the last index is included. As far as I know, this is a design decision about how the function is meant to work.

np.transpose() and np.reshape() combination gives different results in pure numpy and in numba

The following code produces different outputs:
import numpy as np
from numba import njit

@njit
def resh_numba(a):
    res = a.transpose(1, 0, 2)
    res = res.copy().reshape(2, 6)
    return res
x = np.arange(12).reshape(2, 2, 3)
print("numpy")
x_numpy = x.transpose(1, 0, 2).reshape(2, 6)
print(x_numpy)
print("numba:")
x_numba = resh_numba(x)
print(x_numba)
Output:
numpy
[[ 0 1 2 6 7 8]
[ 3 4 5 9 10 11]]
numba:
[[ 0 4 8 2 6 10]
[ 1 5 9 3 7 11]]
What is the reason for this? I'm suspecting some order='C' vs order='F' happening somewhere, but I expected both numpy and numba to use order='C' by default everywhere.
It is due to a bug (at least) in numba's np.ndarray.copy implementation; I opened an issue here: https://github.com/numba/numba/issues/3557

numpy random not working with seed

import random
seed = random.random()
random_seed = random.Random(seed)
random_vec = [ random_seed.random() for i in range(10)]
The above is essentially:
np.random.randn(10)
But I am not able to figure out how to set the seed?
I'm not sure why you want to set the seed, especially to a random number, and even more especially to a random float (note that random.seed wants a large integer).
But if you do, it's simple: call the numpy.random.seed function.
Note that NumPy's seeds are arrays of 32-bit integers, while Python's seeds are single arbitrary-sized integers (although see the docs for what happens when you pass other types).
So, for example:
In [1]: np.random.seed(0)
In [2]: s = np.random.randn(10)
In [3]: s
Out[3]:
array([ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799,
       -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ])
In [4]: np.random.seed(0)
In [5]: s = np.random.randn(10)
In [6]: s
Out[6]:
array([ 1.76405235,  0.40015721,  0.97873798,  2.2408932 ,  1.86755799,
       -0.97727788,  0.95008842, -0.15135721, -0.10321885,  0.4105985 ])
Same seed used twice (I took the shortcut of passing a single int, which NumPy will internally convert into an array of 1 int32), same random numbers generated.
To put it simply, random.seed(value) does not seed NumPy's random generator, so it has no effect on functions like np.random.randint.
For example,
import random
import numpy as np
random.seed(10)
print(np.random.randint(1,10,10))  # generates 10 random integers from 1 to 9 (the upper bound is exclusive)
[4 1 5 7 9 2 9 5 2 4]
random.seed(10)
print( np.random.randint(1,10,10))
[7 6 4 7 2 5 3 7 8 9]
However, if you want to seed the numpy generated values, you have to use np.random.seed(value).
If I revisit the above example,
import numpy as np
np.random.seed(10)
print( np.random.randint(1,10,10))
[5 1 2 1 2 9 1 9 7 5]
np.random.seed(10)
print( np.random.randint(1,10,10))
[5 1 2 1 2 9 1 9 7 5]
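As an aside (not part of the original answers): newer NumPy releases also provide the Generator API via np.random.default_rng(seed), where the seed belongs to a generator object rather than to global state, which avoids this confusion entirely. A sketch:

```python
import numpy as np

# Each Generator carries its own seeded state; nothing global is touched
rng1 = np.random.default_rng(10)
rng2 = np.random.default_rng(10)

a = rng1.integers(1, 10, 10)
b = rng2.integers(1, 10, 10)
print((a == b).all())  # True: same seed, same stream of values
```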

What's the difference between numpy.ndarray.T and numpy.ndarray.transpose() when self.ndim < 2

The document numpy.ndarray.T says
ndarray.T — Same as self.transpose(), except that self is returned if self.ndim < 2.
Also, ndarray.transpose(*axes) says
For a 1-D array, this has no effect.
Doesn't this mean the same thing?
Here's a little demo snippet:
>>> import numpy as np
>>> print(np.__version__)
1.5.1rc1
>>> a = np.arange(7)
>>> print(a, a.T, a.transpose())
[0 1 2 3 4 5 6] [0 1 2 3 4 5 6] [0 1 2 3 4 5 6]
Regardless of rank, the .T attribute and the no-argument .transpose() call are the same: they both return the transpose of the array.
In the case of a rank-1 array, neither does anything; they both return the array itself.
It looks like .T is just convenient notation, while .transpose(*axes) is the more general method, intended to give more flexibility since the axes can be specified. They are apparently not implemented in Python, so one would have to look into the C code to check this.
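That extra flexibility only shows up for arrays with ndim >= 2; a small sketch:

```python
import numpy as np

m = np.arange(24).reshape(2, 3, 4)

# .T (and transpose() with no arguments) reverses all axes
print(m.T.shape)                   # (4, 3, 2)

# transpose(*axes) allows any permutation of the axes
print(m.transpose(1, 2, 0).shape)  # (3, 4, 2)
```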
