python numpy arange dtype: why does converting to integer give zeros? - python

x = np.arange(0.3, 12.5, 0.6)
print(x)
[ 0.3 0.9 1.5 2.1 2.7 3.3 3.9 4.5 5.1 5.7 6.3 6.9 7.5 8.1 8.7 9.3 9.9 10.5 11.1 11.7 12.3]
x = np.arange(0.3, 12.5, 0.6,int)
print(x)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

When dtype=int is specified, arange converts the start, stop and step to that type, so it effectively works with int(start), int(stop) and int(step).
In your case int(0.3) == 0 and int(0.6) == 0, so the start and the step both become 0 and you get an array full of 0s.
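A quick way to see the truncation at work (just an illustration; the actual conversion happens inside the compiled arange code):

import numpy as np

print(int(0.3), int(12.5), int(0.6))    # 0 12 0 -> start and step truncate to 0
print(len(np.arange(0.3, 12.5, 0.6)))   # 21 elements with float arguments
print(np.arange(0.3, 12.5, 0.6, int))   # 21 zeros: same length, but int start/step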
This problem has been discussed with explanation here:
https://github.com/numpy/numpy/issues/2457

First let's skip the complexity of a float step, and use a simple integer start and stop:
In [141]: np.arange(0,5)
Out[141]: array([0, 1, 2, 3, 4])
In [142]: np.arange(0,5, dtype=int)
Out[142]: array([0, 1, 2, 3, 4])
In [143]: np.arange(0,5, dtype=float)
Out[143]: array([0., 1., 2., 3., 4.])
In [144]: np.arange(0,5, dtype=complex)
Out[144]: array([0.+0.j, 1.+0.j, 2.+0.j, 3.+0.j, 4.+0.j])
In [145]: np.arange(0,5, dtype='datetime64[D]')
Out[145]:
array(['1970-01-01', '1970-01-02', '1970-01-03', '1970-01-04',
'1970-01-05'], dtype='datetime64[D]')
Even bool works, within a certain range:
In [149]: np.arange(0,1, dtype=bool)
Out[149]: array([False])
In [150]: np.arange(0,2, dtype=bool)
Out[150]: array([False, True])
In [151]: np.arange(0,3, dtype=bool)
ValueError: no fill-function for data-type.
In [156]: np.arange(0,3).astype(bool)
Out[156]: array([False, True, True])
There are 2 possible boolean values, so asking for more should produce some sort of error.
arange is compiled code, so we can't readily examine its logic (though you are welcome to search for the C code on github).
The examples show that it does, in some sense, convert the parameters to the corresponding dtype and then iterate in that dtype. It doesn't simply generate the range and convert to the dtype at the end.
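If the goal is to end up with integer values from a float range, a safer pattern (my suggestion, not part of the original answers) is to build the range in float and cast afterwards:

import numpy as np

x = np.arange(0.3, 12.5, 0.6)   # compute the range in float
x_int = x.astype(int)           # then truncate each element: [0 0 1 2 2 3 ...]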

Related

Python language construct when filtering an array

I can see many questions on SO about how to filter an array (usually using pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement, but it confuses me from a syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be clearer about the confusing part: in other languages I have seen syntax like:
df.filter(row => row.val > 3)
I understand what is going on under the hood there: for every iteration, the lambda function is called with the row as an argument and returns a boolean value. But here df.val > 3 doesn't make sense to me, because df.val looks like a column.
Moreover, I can write df[df > 3] and it compiles and executes successfully. That drives me crazy, because I don't understand how a DataFrame object can be compared to a number.
Create an array and dataframe from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that operate on the whole array. For example, you can multiply the array by a scalar, add a scalar to all elements, or, as in this case, compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
Similarly pandas implements these methods (or uses the numpy methods on the underlying arrays). We can select a column of the frame with df['val'] or:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar - as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
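To make the mechanism concrete, here is a minimal sketch (my own illustration, not pandas code) of the two pieces of operator overloading involved: the comparison calls __gt__ and returns a boolean mask, and the indexing calls __getitem__ with that mask:

class TinyColumn:
    def __init__(self, values):
        self.values = values
    def __gt__(self, scalar):
        # elementwise comparison: returns a boolean mask, not a single bool
        return [v > scalar for v in self.values]

class TinyFrame:
    def __init__(self, values):
        self.val = TinyColumn(values)
    def __getitem__(self, mask):
        # boolean-mask indexing: keep the rows where the mask is True
        return [v for v, keep in zip(self.val.values, mask) if keep]

df = TinyFrame([1, 2, 3, 4, 5, 6, 7])
print(df[df.val > 3])   # [4, 5, 6, 7]

So df[df.val > 3] is ordinary Python: df.val > 3 is evaluated first and produces a mask object, and that object is then passed to df.__getitem__.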

how to implement this difference operation efficiently?

I would like to build a data frame from an existing one, where each row's value depends on the previous one. I have an initial value v0 as a starting point. Let me give an example:
In [126]: import pandas as pd
In [127]: df = pd.DataFrame([1.0, 1.1, 1.2, 1.3])
In [128]: df_result = df.copy()
In [129]: v0 = 10
In [130]: for i in range(1, len(df.index)):
     ...:     df_result.iloc[i, 0] = df.iloc[i, 0]*df_result.iloc[i-1, 0]
     ...:
In [131]: df_result
Out[131]:
       0
0  1.000
1  1.100
2  1.320
3  1.716
My question is about the for loop. How can I write this more efficiently?
I believe you first need to numpy.insert the value v0 at the first position and then call numpy.cumprod:
import numpy as np
import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
df['n'] = np.cumprod(np.insert(df['r'].values[1:], 0, v0))
print (df)
     r      n
0  1.0  10.00
1  1.1  11.00
2  1.2  13.20
3  1.3  17.16
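An equivalent pandas-only variant (my own sketch, assuming the same df and v0 as above; not part of the original answer) replaces the first ratio with v0 and takes a running product:

import pandas as pd

df = pd.DataFrame([1.0, 1.1, 1.2, 1.3], columns=['r'])
v0 = 10
# prepend v0, drop the first ratio, then take the cumulative product
df['n'] = pd.concat([pd.Series([v0]), df['r'].iloc[1:]]).cumprod().values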

pandas - cumulative median

I was wondering if there is any pandas equivalent to cumsum() or cummax() etc. for median: e.g. cummedian().
So that if I have, for example this dataframe:
a
1 5
2 7
3 6
4 4
what I want is something like:
df['a'].cummedian()
which should output:
5
6
6
5.5
You can use expanding.median -
df.a.expanding().median()
1 5.0
2 6.0
3 6.0
4 5.5
Name: a, dtype: float64
Timings
df = pd.DataFrame({'a' : np.arange(1000000)})
%timeit df['a'].apply(cummedian())
1 loop, best of 3: 1.69 s per loop
%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop
The winner is expanding.median by a huge margin. Divakar's method is memory intensive and suffers memory blowout at this size of input.
We could create NaN-filled subarrays as rows with a strides-based function, like so -

import numpy as np

def nan_concat_sliding_windows(x):
    n = len(x)
    add_arr = np.full(n - 1, np.nan)            # pad the front with n-1 NaNs
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext) - n + 1
    s = x_ext.strides[0]
    return strided(x_ext, shape=(nrows, n), strides=(s, s))
Sample run -
In [56]: x
Out[56]: array([5, 6, 7, 4])
In [57]: nan_concat_sliding_windows(x)
Out[57]:
array([[ nan,  nan,  nan,   5.],
       [ nan,  nan,   5.,   6.],
       [ nan,   5.,   6.,   7.],
       [  5.,   6.,   7.,   4.]])
Thus, to get the sliding median values for an array x, we have a vectorized solution, like so -
np.nanmedian(nan_concat_sliding_windows(x), axis=1)
Hence, the final solution would be -
In [54]: df
Out[54]:
a
1 5
2 7
3 6
4 4
In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]:
0 5.0
1 6.0
2 6.0
3 5.5
dtype: float64
A faster solution for the specific cumulative median
In [1]: import timeit
In [2]: setup = """import bisect
   ...: import pandas as pd
   ...: def cummedian():
   ...:     l = []
   ...:     info = [0, True]
   ...:     def inner(n):
   ...:         bisect.insort(l, n)
   ...:         info[0] += 1
   ...:         info[1] = not info[1]
   ...:         median = info[0] // 2
   ...:         if info[1]:
   ...:             return (l[median] + l[median - 1]) / 2
   ...:         else:
   ...:             return l[median]
   ...:     return inner
   ...: df = pd.DataFrame({'a': range(20)})"""
In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956
In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335
In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
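For reference, a standalone run of that closure on the question's data (my own check, using the same cummedian definition as in the setup string above):

import bisect
import pandas as pd

def cummedian():
    l = []
    info = [0, True]
    def inner(n):
        bisect.insort(l, n)      # keep the values seen so far sorted
        info[0] += 1             # running count
        info[1] = not info[1]    # True when the count is even
        median = info[0] // 2
        if info[1]:
            return (l[median] + l[median - 1]) / 2
        else:
            return l[median]
    return inner

df = pd.DataFrame({'a': [5, 7, 6, 4]})
df['cummedian'] = df['a'].apply(cummedian())
print(df['cummedian'].tolist())   # [5.0, 6.0, 6.0, 5.5]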

Understand any() and nan in Pandas

I have the same problem as in: Pandas series.all() returns nan
In [88]: pd.Series([False, np.nan]).any()
Out[88]: nan
whereas:
In [84]: np.any([False, np.nan])
Out[84]: True
and also:
In [99]: pd.DataFrame([False, np.nan]).any()
Out[99]:
0 False
dtype: bool
I was curious what the explanation is for the three different behaviors.
The difference here has nothing to do with the two different types implementing any differently. In fact, the docs for pandas.Series.any and numpy.ndarray.any both explicitly say "Refer to numpy.any for full documentation", because they both effectively just call numpy.any.
The difference is that you have different dtypes in the two cases. Creating a NumPy ndarray, implicitly or explicitly, from different numeric types coerces the types to be the same if possible, so you end up with float64, while a Pandas series keeps the types separate, which means you end up with object.
If you specify the dtype explicitly, you can see that they do the same thing:
>>> a = np.array([False, np.nan])
>>> a
array([ 0., nan])
>>> a.dtype
float64
>>> a.any()
True
>>> a = np.array([False, np.nan], dtype=object)
>>> a
array([False, nan], dtype=object)
>>> a.any()
nan
>>> p = pd.Series([False, np.nan])
>>> p
0 False
1 NaN
>>> p.dtype
dtype('O')
>>> p.any()
nan
>>> p = pd.Series([False, np.nan], dtype=np.float64)
>>> p
0 0
1 NaN
>>> p.any()
True
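A rough illustration of why the object dtype changes the result (my reading of the behaviour, not from the original answer): with object dtype the reduction works on the Python objects themselves, and the Python expression False or nan hands back the nan, so it propagates instead of being treated as a truthy float:

>>> import numpy as np
>>> False or np.nan        # Python-level 'or' returns the second operand here
nan
>>> np.array([False, np.nan], dtype=object).any()
nan
>>> np.array([False, np.nan]).any()   # float64: nan is simply truthy
True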

Numpy cumsum considering NaNs

I am looking for a succinct way to go from:
a = numpy.array([1,4,1,numpy.nan,2,numpy.nan])
to:
b = numpy.array([1,5,6,numpy.nan,8,numpy.nan])
The best I can do currently is:
b = numpy.insert(numpy.cumsum(a[numpy.isfinite(a)]), (numpy.argwhere(numpy.isnan(a)) - numpy.arange(len(numpy.argwhere(numpy.isnan(a))))), numpy.nan)
Is there a shorter way to accomplish the same? What about doing a cumsum along an axis of a 2D array?
Pandas is a library built on top of numpy. Its Series class has a cumsum method, which preserves the nan's and is considerably faster than the solution proposed by DSM:
In [15]: a = np.arange(10000.0)
In [16]: a[1] = np.nan
In [17]: %timeit a*0 + np.nan_to_num(a).cumsum()
1000 loops, best of 3: 465 us per loop
In [18]: s = pd.Series(a)
In [19]: s.cumsum()
Out[19]:
0 0
1 NaN
2 2
3 5
...
9996 49965005
9997 49975002
9998 49985000
9999 49994999
Length: 10000
In [20]: %timeit s.cumsum()
10000 loops, best of 3: 175 us per loop
How about (for not-too-big arrays):
In [34]: import numpy as np
In [35]: a = np.array([1,4,1,np.nan,2,np.nan])
In [36]: a*0 + np.nan_to_num(a).cumsum()
Out[36]: array([ 1., 5., 6., nan, 8., nan])
Masked arrays are for just this type of situation.
>>> import numpy as np
>>> from numpy import ma
>>> a = np.array([1,4,1,np.nan,2,np.nan])
>>> b = ma.masked_array(a,mask = (np.isnan(a) | np.isinf(a)))
>>> b
masked_array(data = [1.0 4.0 1.0 -- 2.0 --],
             mask = [False False False True False True],
             fill_value = 1e+20)
>>> c = b.cumsum()
>>> c
masked_array(data = [1.0 5.0 6.0 -- 8.0 --],
             mask = [False False False True False True],
             fill_value = 1e+20)
>>> c.filled(np.nan)
array([ 1., 5., 6., nan, 8., nan])
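Newer numpy versions also have np.nancumsum, which treats NaNs as zero; combining it with np.where to put the NaNs back gives a short alternative (my own sketch, not one of the original answers), and it accepts an axis argument for the 2D case:

>>> import numpy as np
>>> a = np.array([1, 4, 1, np.nan, 2, np.nan])
>>> np.where(np.isnan(a), np.nan, np.nancumsum(a))
array([ 1., 5., 6., nan, 8., nan])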
