Checking NaN values in pandas dataframe with int8 columns - python

As was proposed in a question I posed last week, a memory-efficient way to store a column whose values are True, False, or NaN would be to use the int8 datatype, denoting True as 1, False as 0, and NaN as -1.
If I do this, what would be good practice for "redefining" pandas' isnull() method so that, when a column in a dataframe has dtype int8, -1 is also considered a null value? I could define a new function def isnull(v) that returns whether a value is NaN (or -1 in the case of dtype int8), but I imagine this would not be very fast or efficient, given that the dataframe I am working with is multiple gigabytes in size and I want to be able to count the number of "null" values in a column or dataframe.

It should be pretty fast.
Timings for a Series with 100,000,000 rows:
In [84]: s = pd.Series(np.random.choice([1,0,-1], 10**8), dtype=np.int8)
In [85]: s.shape
Out[85]: (100000000,)
simulating series.isnull():
In [86]: %timeit s==-1
87 ms ± 3.51 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [87]: %timeit s.values==-1
84.1 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [88]: %timeit np.where(s==-1)
546 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [89]: %timeit np.where(s.values==-1)
531 ms ± 2.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
simulating: series.isnull().sum():
In [90]: %timeit (s==-1).sum()
1.39 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [91]: %timeit (s.values==-1).sum()
181 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
PS: note that for counting (summing), the difference between (s==-1).sum() and (s.values==-1).sum() is considerable: 1.39 s versus 181 ms in the timings above.
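A minimal sketch of how the comparison above could be wrapped into a per-column null count. The function name count_nulls and the SENTINEL constant are hypothetical names introduced here for illustration, and the sketch assumes that -1 is used as the missing-value marker only in int8 columns:

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # assumption: -1 encodes "missing" in int8 columns


def count_nulls(df, sentinel=SENTINEL):
    """Count nulls per column, treating `sentinel` as null in int8 columns."""
    counts = {}
    for col in df.columns:
        if df[col].dtype == np.int8:
            # Compare on the underlying NumPy array, as in the timings above.
            counts[col] = int((df[col].values == sentinel).sum())
        else:
            counts[col] = int(df[col].isnull().sum())
    return pd.Series(counts)


df = pd.DataFrame({
    "flag": np.array([1, 0, -1, -1], dtype=np.int8),
    "x": [1.0, np.nan, 3.0, 4.0],
})
counts = count_nulls(df)
print(counts)
```

Other dtypes fall back to the regular isnull(), so the sentinel is never applied to columns where -1 could be a legitimate value.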

Related

Computing a slightly different matrix multiplication

I'm trying to find the best way to compute the minimum of the element-wise products between two sets of vectors. The usual matrix multiplication C=A@B computes Cij as the sum of the pairwise products of the elements of the vectors Ai and B^Tj. I would like to take the minimum of the pairwise products instead. I can't find an efficient way to do this between two matrices with numpy.
One way to achieve this would be to generate the 3D array of the pairwise products between A and B (before the sum) and then take the minimum over the third dimension. But this would lead to a huge memory footprint (and I actually don't know how to do this).
Do you have any idea how I could achieve this operation?
Example:
A = [[1,1],[1,1]]
B = [[0,2],[2,1]]
matrix matmul:
C = [[1*0+1*2,1*2+1*1],[1*0+1*2,1*2+1*1]] = [[2,3],[2,3]]
minimum matmul:
C = [[min(1*0,1*2),min(1*2,1*1)],[min(1*0,1*2),min(1*2,1*1)]] = [[0,1],[0,1]]
Use broadcasting after extending A to 3D -
A = np.asarray(A)
B = np.asarray(B)
C_out = np.min(A[:,None]*B,axis=2)
If you care about memory footprint, use numexpr module to be efficient about it -
import numexpr as ne
C_out = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
Timings on large arrays -
In [12]: A = np.random.rand(200,200)
In [13]: B = np.random.rand(200,200)
In [14]: %timeit np.min(A[:,None]*B,axis=2)
34.4 ms ± 614 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
29.3 ms ± 316 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: A = np.random.rand(300,300)
In [17]: B = np.random.rand(300,300)
In [18]: %timeit np.min(A[:,None]*B,axis=2)
113 ms ± 2.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [19]: %timeit ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
102 ms ± 691 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, there's some improvement with numexpr, but maybe not as much as I was expecting.
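To make the broadcasting step concrete, here is a small sketch that runs it on the example from the question and prints the intermediate shapes. Note that A[:,None]*B pairs row i of A with row j of B, so the summed version corresponds to A @ B.T (for the symmetric B in the example this coincides with A @ B). The names A3D, prod, C_min and C_sum are only illustrative:

```python
import numpy as np

A = np.array([[1, 1], [1, 1]])
B = np.array([[0, 2], [2, 1]])

A3D = A[:, None]            # shape (2, 1, 2)
prod = A3D * B              # shape (2, 2, 2): prod[i, j, k] = A[i, k] * B[j, k]

C_sum = prod.sum(axis=2)    # ordinary product of rows of A with rows of B
C_min = prod.min(axis=2)    # "minimum matmul" from the question

print(A3D.shape, prod.shape)
print(C_sum)
print(C_min)
```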
Numba can also be an option.
I was a bit surprised by the not particularly good numexpr timings, so I tried a Numba version. For large arrays this can be optimized further (much the same principles as for a dgemm can be applied).
import numpy as np
import numba as nb
import numexpr as ne
@nb.njit(fastmath=True,parallel=True)
def min_pairwise_prod(A,B):
    assert A.shape[1]==B.shape[1]
    res=np.empty((A.shape[0],B.shape[0]))
    # the outer loop is parallelized by Numba when parallel=True
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            # running minimum of A[i,k]*B[j,k] over k
            min_prod=A[i,0]*B[j,0]
            for k in range(B.shape[1]):
                prod=A[i,k]*B[j,k]
                if prod<min_prod:
                    min_prod=prod
            res[i,j]=min_prod
    return res
Timings
A=np.random.rand(300,300)
B=np.random.rand(300,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
5.56 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
26 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
87.7 ms ± 265 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
110 ms ± 214 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A=np.random.rand(1000,300)
B=np.random.rand(1000,300)
%timeit res_1=min_pairwise_prod(A,B) #parallel=True
50.6 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_1=min_pairwise_prod(A,B) #parallel=False
296 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2 = ne.evaluate('min(A3D*B,2)',{'A3D':A[:,None]})
992 ms ± 7.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=np.min(A[:,None]*B,axis=2)
1.27 s ± 15.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
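If Numba is not available and the full 3D intermediate is too large, one possible middle ground is to apply the broadcasting approach to blocks of rows of A, so that only a (chunk, n, k) slab exists at any time. This is only a sketch and has not been timed against the versions above; min_pairwise_prod_chunked and its chunk parameter are hypothetical names chosen here:

```python
import numpy as np


def min_pairwise_prod_chunked(A, B, chunk=64):
    """min over k of A[i, k] * B[j, k], computed block-wise over rows of A."""
    A = np.asarray(A)
    B = np.asarray(B)
    assert A.shape[1] == B.shape[1]
    res = np.empty((A.shape[0], B.shape[0]), dtype=np.result_type(A, B))
    for start in range(0, A.shape[0], chunk):
        stop = start + chunk
        # Temporary array has shape (rows in this block, B.shape[0], A.shape[1]).
        res[start:stop] = (A[start:stop, None] * B).min(axis=2)
    return res


rng = np.random.RandomState(0)
A_demo = rng.rand(150, 40)
B_demo = rng.rand(90, 40)
res_chunked = min_pairwise_prod_chunked(A_demo, B_demo, chunk=32)
res_full = np.min(A_demo[:, None] * B_demo, axis=2)
print(np.allclose(res_chunked, res_full))
```

A smaller chunk lowers peak memory at the cost of more Python-level loop iterations.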

Efficient way to check dtype of each row in a series

Say I have mixed ts/other data:
ser = pd.Series(pd.date_range('2017/01/05', '2018/01/05'))
ser.loc[3] = 4
type(ser.loc[0])
> pandas._libs.tslibs.timestamps.Timestamp
I would like to filter for all timestamps. For instance, this gives me what I want:
ser.apply(lambda x: isinstance(x, pd.Timestamp))
0 True
1 True
2 True
3 False
4 True
...
But I assume it would be faster to use a vectorized solution and avoid apply. I thought I should be able to use where:
ser.where(isinstance(ser, pd.Timestamp))
But I get
ValueError: Array conditional must be same shape as self
Is there a way to do this? Also, am I correct in my assumption that it would be faster/more 'Pandasic'?
It depends on the length of the data. Here, for small data (366 rows), a list comprehension is fastest:
In [108]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
434 µs ± 57.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [109]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
140 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [110]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
1.01 ms ± 25.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
But for a larger Series, to_datetime followed by a test for non-missing values with Series.notna is faster:
ser = pd.Series(pd.date_range('1980/01/05', '2020/01/05'))
ser.loc[3] = 4
print (len(ser))
14611
In [116]: %timeit (ser.apply(lambda x: isinstance(x, pd.Timestamp)))
6.42 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [117]: %timeit ([isinstance(x, pd.Timestamp) for x in ser])
4.9 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [118]: %timeit (pd.to_datetime(ser, errors='coerce').notna())
4.22 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To address your question of filtering, you can convert to datetime and drop NaNs.
ser[pd.to_datetime(ser, errors='coerce').notna()]
Or, if you don't mind the result being datetime,
pd.to_datetime(ser, errors='coerce').dropna()
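For completeness, the list comprehension from the timings can also be turned into a boolean mask and used for filtering directly. This is a sketch on a shortened series; it builds the mixed Series as object dtype up front, which is an assumption made here so that the example does not depend on how a given pandas version upcasts on assignment:

```python
import pandas as pd

# Mixed Series: Timestamps plus one integer, stored as Python objects.
ser = pd.Series(list(pd.date_range("2017-01-05", periods=6)), dtype=object)
ser.loc[3] = 4

# Boolean mask aligned with the original index.
mask = pd.Series([isinstance(x, pd.Timestamp) for x in ser], index=ser.index)

timestamps_only = ser[mask]
print(timestamps_only)
```

Unlike the to_datetime variant, this checks the actual Python type of each element and leaves the values unconverted.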

"Pandorable" way to return index in dataframe slicing

Is there a pandorable way to get only the index in dataframe slicing?
In other words, is there a better way to write the following code:
df.loc[df['A'] >5].index
Thanks!
Yes, it is better to filter only the index values rather than filtering the whole DataFrame and then selecting the index:
#filter index
df.index[df['A'] >5]
#filter DataFrame
df[df['A'] >5].index
There is a difference in performance too:
np.random.seed(1245)
df = pd.DataFrame({'A':np.random.randint(10, size=1000)})
print (df)
In [40]: %timeit df.index[df['A'] >5]
208 µs ± 11.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [41]: %timeit df[df['A'] >5].index
428 µs ± 6.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [42]: %timeit df.loc[df['A'] >5].index
466 µs ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If performance is important, use numpy: convert both the index and the column to numpy arrays with .values:
In [43]: %timeit df.index.values[df['A'] >5]
157 µs ± 8.71 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [44]: %timeit df.index.values[df['A'].values >5]
8.91 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
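A quick sketch to check that the variants select the same labels. The only difference to keep in mind is that the numpy version returns a plain ndarray rather than an Index. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1245)
df = pd.DataFrame({"A": rng.randint(10, size=1000)})

via_loc = df.loc[df["A"] > 5].index              # filter the DataFrame, then take the index
via_index = df.index[df["A"] > 5]                # filter the index only
via_numpy = df.index.values[df["A"].values > 5]  # pure numpy; result is an ndarray

print(type(via_index).__name__, type(via_numpy).__name__)
print(via_loc.equals(via_index), np.array_equal(via_index.values, via_numpy))
```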

Why use reset_index(drop=True) when setting the index is much faster?

Why would I use reset_index(drop=True), when the alternative is much faster? I am sure there is something I am missing. (Or my timings are bad somehow...)
import pandas as pd
l = pd.Series(range(int(1e7)))
%timeit l.reset_index(drop=True)
# 35.9 ms +- 1.29 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
%timeit l.index = range(int(1e7))
# 13 us +- 455 ns per loop (mean +- std. dev. of 7 runs, 100000 loops each)
The costly operation in resetting the index is not creating the new index (as you showed, that is super fast) but returning a copy of the series. If you compare:
%timeit l.reset_index(drop=True)
22.6 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit l.index = range(int(1e7))
14.7 µs ± 348 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit l.reset_index(inplace=True, drop=True)
13.7 µs ± 121 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You can see that the inplace operation (where no copy is returned) is more or less as fast as your method. However, performing inplace operations is generally discouraged.
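The behavioural difference behind this can be seen on a tiny example: reset_index(drop=True) hands back a new Series and leaves the original untouched, while assigning to .index changes the existing object. A sketch with illustrative names:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[5, 6, 7])

t = s.reset_index(drop=True)          # new Series with a fresh 0..n-1 index
index_after_reset = list(s.index)     # the original still has its old labels

s.index = pd.RangeIndex(len(s))       # modifies s itself
index_after_assign = list(s.index)

print(list(t.index), index_after_reset, index_after_assign)
```

This is why the non-inplace form is often preferred in method chains: other references to the original Series keep seeing the old index.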

Attribute access time very long with DataFrame

I would like to understand why DataFrame attribute access seems so slow (often 100x slower than for other objects). An example:
In [37]: df=pd.DataFrame([])
In [38]: a=np.array([])
In [39]: %timeit df.size
28 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [40]: %timeit a.size
136 ns ± 9.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
