Vectorizing a very simple pandas lambda function in apply

pandas apply/map is my nemesis, and even on small datasets it can be agonizingly slow. Below is a very simple example where there is nearly a three-order-of-magnitude difference in speed. I create a Series with 1 million values and simply want to map values greater than .5 to 'Yes' and the rest to 'No'. How do I vectorize this or speed it up significantly?
ser = pd.Series(np.random.rand(1000000))
# vectorized and fast
%%timeit
ser > .5
1000 loops, best of 3: 477 µs per loop
%%timeit
ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 255 ms per loop

np.where(cond, A, B) is the vectorized equivalent of A if cond else B:
import numpy as np
import pandas as pd
ser = pd.Series(np.random.rand(1000000))
mask = ser > 0.5
result = pd.Series(np.where(mask, 'Yes', 'No'))
expected = ser.map(lambda x: 'Yes' if x > .5 else 'No')
assert result.equals(expected)
In [77]: %timeit mask = ser > 0.5
1000 loops, best of 3: 1.44 ms per loop
In [76]: %timeit np.where(mask, 'Yes', 'No')
100 loops, best of 3: 14.8 ms per loop
In [73]: %timeit pd.Series(np.where(mask, 'Yes', 'No'))
10 loops, best of 3: 86.5 ms per loop
In [74]: %timeit ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 223 ms per loop
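As an aside (a sketch of mine, not part of the original answer; the thresholds are only illustrative), np.select generalizes the same idea when you need more than two labels:
import numpy as np
import pandas as pd

ser = pd.Series(np.random.rand(1000000))
# conditions are checked in order; the first True wins for each element
conds = [ser < .25, ser < .5]
labels = ['Low', 'Medium']
result = pd.Series(np.select(conds, labels, default='High'))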
Since this Series only has two values, you might consider using a Categorical instead:
In [94]: cat = pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes']); cat
Out[94]:
[Yes, No, Yes, No, No, ..., No, Yes, No, No, Yes]
Length: 1000000
Categories (2, object): [No, Yes]
In [95]: %timeit pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes'])
100 loops, best of 3: 6.26 ms per loop
Not only is this faster, it is also more memory efficient, since it avoids creating the array of strings. The category codes are an array of small ints that map to the categories:
In [96]: cat.codes
Out[96]: array([1, 0, 1, ..., 0, 0, 1], dtype=int8)
In [97]: cat.categories
Out[97]: Index(['No', 'Yes'], dtype='object')
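To quantify the memory savings (a quick sketch using the objects defined above; exact numbers depend on the pandas version):
str_ser = pd.Series(np.where(mask, 'Yes', 'No'))  # plain object-dtype strings
cat_ser = pd.Series(cat)                          # category dtype
print(str_ser.memory_usage(deep=True))  # counts every Python string object
print(cat_ser.memory_usage(deep=True))  # int8 codes plus the two labels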

Related

Is there a more efficient and elegant way to filter pandas index by date?

I often use DatetimeIndex.date, especially in groupby methods. However, DatetimeIndex.date is slow compared to DatetimeIndex.year/month/day. From what I understand, this is because the .date attribute works element by element over the index and returns an array of date objects, while index.year/month/day just return integer arrays. I have made a small example function that performs a bit better and would speed up some of my code (at least for finding the values in a groupby), but I feel that there must be a better way:
In [217]: index = pd.date_range('2011-01-01', periods=100000, freq='h')
In [218]: data = np.random.rand(len(index))
In [219]: df = pd.DataFrame({'data':data},index)
In [220]: def func(df):
     ...:     groupby = df.groupby([df.index.year, df.index.month, df.index.day]).mean()
     ...:     index = pd.date_range(df.index[0], periods=len(groupby), freq='D')
     ...:     groupby.index = index
     ...:     return groupby
     ...:
In [221]: df.groupby(df.index.date).mean().equals(func(df))
Out[221]: True
In [222]: df.groupby(df.index.date).mean().index.equals(func(df).index)
Out[222]: True
In [223]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.32 s per loop
In [224]: %timeit func(df)
10 loops, best of 3: 89.2 ms per loop
Does the pandas/index have a similar functionality that I am not finding?
You can even improve it a little bit:
In [69]: %timeit func(df)
10 loops, best of 3: 84.3 ms per loop
In [70]: %timeit df.groupby(pd.TimeGrouper('1D')).mean()
100 loops, best of 3: 6 ms per loop
In [84]: %timeit df.groupby(pd.Grouper(level=0, freq='1D')).mean()
100 loops, best of 3: 6.48 ms per loop
In [71]: (func(df) == df.groupby(pd.TimeGrouper('1D')).mean()).all()
Out[71]:
data True
dtype: bool
another solution - using DataFrame.resample() method:
In [73]: (df.resample('1D').mean() == func(df)).all()
Out[73]:
data True
dtype: bool
In [74]: %timeit df.resample('1D').mean()
100 loops, best of 3: 6.63 ms per loop
UPDATE: grouping by the string:
In [75]: %timeit df.groupby(df.index.strftime('%Y%m%d')).mean()
1 loop, best of 3: 2.6 s per loop
In [76]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.07 s per loop
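Two notes beyond the original answers: pd.TimeGrouper has since been removed in newer pandas releases (the pd.Grouper(level=0, freq='1D') form above is the current spelling), and if you specifically want to group by calendar day while keeping a DatetimeIndex, flooring the index avoids building Python date objects. A rough sketch (mine, not from the answers):
import numpy as np
import pandas as pd

index = pd.date_range('2011-01-01', periods=100000, freq='h')
df = pd.DataFrame({'data': np.random.rand(len(index))}, index=index)

# floor('D') truncates every timestamp to midnight, so the group keys stay
# datetime64 and no per-element Python date objects are created
daily = df.groupby(df.index.floor('D')).mean()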

Row Sum of a dot product for huge matrix in python

I have two matrices, 100k×200 and 200×100k.
If they were small matrices I would just use the numpy dot product:
np.sum(a.dot(b), axis=0)
However, the matrices are too big, and I also can't use loops. Is there a smart way of doing this?
A possible optimization is
>>> numpy.sum(a @ b, axis=0)
array([ 1.83633615, 18.71643672, 15.26981078, -46.33670382, 13.30276476])
>>> numpy.sum(a, axis=0) @ b
array([ 1.83633615, 18.71643672, 15.26981078, -46.33670382, 13.30276476])
Computing a @ b requires 100k×200×100k operations, while summing the rows first reduces the multiplication to 1×200×100k operations, giving a 100k× improvement.
This is mainly due to recognizing
numpy.sum(x, axis=0) == [1, 1, ..., 1] @ x
=> numpy.sum(a @ b, axis=0) == [1, 1, ..., 1] @ (a @ b)
                            == ([1, 1, ..., 1] @ a) @ b
                            == numpy.sum(a, axis=0) @ b
Similarly for the other axis.
>>> numpy.sum(a @ b, axis=1)
array([ 2.8794171 , 9.12128399, 14.52009991, -8.70177811, -15.0303783 ])
>>> a @ numpy.sum(b, axis=1)
array([ 2.8794171 , 9.12128399, 14.52009991, -8.70177811, -15.0303783 ])
(Note: x @ y is equivalent to x.dot(y) for 2D matrices and 1D vectors on Python 3.5+ with numpy 1.10.0+)
$ INITIALIZATION='import numpy;numpy.random.seed(0);a=numpy.random.randn(1000,200);b=numpy.random.rand(200,1000)'
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.einsum("ij,jk->k", a, b)'
10 loops, best of 3: 87.2 msec per loop
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.sum(a @ b, axis=0)'
100 loops, best of 3: 12.8 msec per loop
$ python3 -m timeit -s "$INITIALIZATION" 'numpy.sum(a, axis=0) @ b'
1000 loops, best of 3: 300 usec per loop
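As a quick sanity check of the identities (my own snippet; the shapes are arbitrary):
import numpy
a = numpy.random.rand(50, 20)
b = numpy.random.rand(20, 50)
# summing the product's rows equals multiplying the summed rows
assert numpy.allclose(numpy.sum(a @ b, axis=0), numpy.sum(a, axis=0) @ b)
# and the analogous identity along the other axis
assert numpy.allclose(numpy.sum(a @ b, axis=1), a @ numpy.sum(b, axis=1))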
Illustration:
In [235]: a = np.random.rand(3,3); a
Out[235]:
array([[ 0.465,  0.758,  0.641],
       [ 0.897,  0.673,  0.742],
       [ 0.763,  0.274,  0.485]])
In [237]: b = np.random.rand(3,2); b
Out[237]:
array([[ 0.303,  0.378],
       [ 0.039,  0.095],
       [ 0.192,  0.668]])
Now, if we simply do a @ b, we would need 18 multiply and 6 addition ops. On the other hand, if we do np.sum(a, axis=0) @ b we would only need 6 multiply and 2 addition ops. An improvement of 3x because we had 3 rows in a. For the OP's case, this should give a 100k× improvement over the plain a @ b computation, since there are 100k rows in a.
There are two sum-reductions happening - one from the matrix multiplication with np.dot, and then the explicit sum.
We could use np.einsum to do both of those in one go, like so -
np.einsum('ij,jk->k',a,b)
Sample run -
In [27]: a = np.random.rand(3,4)
In [28]: b = np.random.rand(4,3)
In [29]: np.sum(a.dot(b), axis = 0)
Out[29]: array([ 2.70084316, 3.07448582, 3.28690401])
In [30]: np.einsum('ij,jk->k',a,b)
Out[30]: array([ 2.70084316, 3.07448582, 3.28690401])
Runtime test -
In [45]: a = np.random.rand(1000,200)
In [46]: b = np.random.rand(200,1000)
In [47]: %timeit np.sum(a.dot(b), axis = 0)
100 loops, best of 3: 5.5 ms per loop
In [48]: %timeit np.einsum('ij,jk->k',a,b)
10 loops, best of 3: 71.8 ms per loop
Sadly, doesn't look like we are doing any better with np.einsum.
For changing to np.sum(a.dot(b), axis = 1), just swap the output string notation there - np.einsum('ij,jk->i',a,b), like so -
In [42]: np.sum(a.dot(b), axis = 1)
Out[42]: array([ 3.97805141, 3.2249661 , 1.85921549])
In [43]: np.einsum('ij,jk->i',a,b)
Out[43]: array([ 3.97805141, 3.2249661 , 1.85921549])
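A side note of mine (not in the original answer): newer NumPy versions accept an optimize flag, which lets einsum reorder the contraction and dispatch to BLAS where possible, so it may close some of the gap here; timings vary by version.
np.einsum('ij,jk->k', a, b, optimize=True)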
Some quick time tests using the idea I added to Divakar's answer:
In [162]: a = np.random.rand(1000,200)
In [163]: b = np.random.rand(200,1000)
In [174]: timeit c1=np.sum(a.dot(b), axis=0)
10 loops, best of 3: 27.7 ms per loop
In [175]: timeit c2=np.sum(a,axis=0).dot(b)
1000 loops, best of 3: 432 µs per loop
In [176]: timeit c3=np.einsum('ij,jk->k',a,b)
10 loops, best of 3: 170 ms per loop
In [177]: timeit c4=np.einsum('j,jk->k', np.einsum('ij->j', a), b)
1000 loops, best of 3: 353 µs per loop
In [178]: timeit np.einsum('ij->j', a) @ b
1000 loops, best of 3: 304 µs per loop
einsum is actually faster than np.sum!
In [180]: timeit np.einsum('ij->j', a)
1000 loops, best of 3: 173 µs per loop
In [181]: timeit np.sum(a,0)
1000 loops, best of 3: 312 µs per loop
For larger arrays the einsum advantage decreases
In [183]: a = np.random.rand(100000,200)
In [184]: b = np.random.rand(200,100000)
In [185]: timeit np.einsum('ij->j', a) @ b
10 loops, best of 3: 51.5 ms per loop
In [186]: timeit c2=np.sum(a,axis=0).dot(b)
10 loops, best of 3: 59.5 ms per loop

Fast way to check if a numpy array is binary (contains only 0 and 1)

Given a numpy array, how can I quickly figure out whether it contains only 0 and 1?
Is there any built-in method for this?
A few approaches -
((a==0) | (a==1)).all()
~((a!=0) & (a!=1)).any()
np.count_nonzero((a!=0) & (a!=1))==0
a.size == np.count_nonzero((a==0) | (a==1))
Runtime test -
In [313]: a = np.random.randint(0,2,(3000,3000)) # Only 0s and 1s
In [314]: %timeit ((a==0) | (a==1)).all()
...: %timeit ~((a!=0) & (a!=1)).any()
...: %timeit np.count_nonzero((a!=0) & (a!=1))==0
...: %timeit a.size == np.count_nonzero((a==0) | (a==1))
...:
10 loops, best of 3: 28.8 ms per loop
10 loops, best of 3: 29.3 ms per loop
10 loops, best of 3: 28.9 ms per loop
10 loops, best of 3: 28.8 ms per loop
In [315]: a = np.random.randint(0,3,(3000,3000)) # Contains 2 as well
In [316]: %timeit ((a==0) | (a==1)).all()
...: %timeit ~((a!=0) & (a!=1)).any()
...: %timeit np.count_nonzero((a!=0) & (a!=1))==0
...: %timeit a.size == np.count_nonzero((a==0) | (a==1))
...:
10 loops, best of 3: 28 ms per loop
10 loops, best of 3: 27.5 ms per loop
10 loops, best of 3: 29.1 ms per loop
10 loops, best of 3: 28.9 ms per loop
Their runtimes seem to be comparable.
It looks like you can achieve it with something like:
np.array_equal(a, a.astype(bool))
If your array is large, this should avoid copying too many temporary arrays (as some of the other answers do), so it should probably be slightly faster than the other answers (not tested, however).
With only a single loop over the data:
0 <= np.bitwise_or.reduce(ar) <= 1
Note that this doesn't work for floating point dtype.
If the values are guaranteed non-negative you can get short-circuiting behavior:
try:
    np.empty((2,), bool)[ar]
    is_binary = True
except IndexError:
    is_binary = False
This method (always) allocates a temp array of the same shape as the argument and seems to loop over the data slower than the first method.
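Wrapped up as a self-contained helper (my own function name; it assumes an integer array and, per the caveat above, non-negative values):
import numpy as np

def is_binary_nonneg(ar):
    # indexing a length-2 array raises IndexError as soon as a value
    # outside {0, 1} is encountered, so bad inputs fail fast
    try:
        np.empty((2,), bool)[ar]
        return True
    except IndexError:
        return False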
If you have access to Numba (or alternatively Cython), you can write something like the following, which will be significantly faster at catching non-binary arrays since it short-circuits the calculation, stopping immediately instead of continuing through all of the elements:
import numpy as np
import numba as nb
@nb.njit
def check_binary(x):
    is_binary = True
    for v in np.nditer(x):
        if v.item() != 0 and v.item() != 1:
            is_binary = False
            break
    return is_binary
Running this in pure python without the aid of an accelerator like Numba or Cython makes this approach prohibitively slow.
Timings:
a = np.random.randint(0,2,(3000,3000)) # Only 0s and 1s
%timeit ((a==0) | (a==1)).all()
# 100 loops, best of 3: 15.1 ms per loop
%timeit check_binary(a)
# 100 loops, best of 3: 11.6 ms per loop
a = np.random.randint(0,3,(3000,3000)) # Contains 2 as well
%timeit ((a==0) | (a==1)).all()
# 100 loops, best of 3: 14.9 ms per loop
%timeit check_binary(a)
# 1000000 loops, best of 3: 543 ns per loop
We could use np.isin().
input_array = input_array.squeeze(-1)
is_binary = np.isin(input_array, [0,1]).all()
1st line:
squeeze removes the trailing singleton dimension of the input array, so we don't have to deal with np.isin() on a multi-dimensional array.
2nd line:
np.isin() checks whether each element of the input is 0 or 1.
It returns a boolean array like [True, False, True, True, ...].
Then .all() ensures that the array contains only True.
The following should work:
ans = set(arr).issubset([0,1])
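One caveat to add (mine, not from the answer): set(arr) only works for 1-D arrays; for higher dimensions, flatten first, e.g.:
ans = set(arr.ravel()).issubset({0, 1})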
How about numpy unique?
np.unique(arr)
Should return [0,1] if binary.

Turn pandas series to series of lists or numpy array to array of lists

I have a series s
s = pd.Series([1, 2])
What is an efficient way to make s look like
0 [1]
1 [2]
dtype: object
Here's one approach that extracts into array and extends to 2D by introducing a new axis with None/np.newaxis -
pd.Series(s.values[:,None].tolist())
Here's a similar one, but extends to 2D by reshaping -
pd.Series(s.values.reshape(-1,1).tolist())
Runtime test using @P-robot's setup -
In [43]: s = pd.Series(np.random.randint(1,10,1000))
In [44]: %timeit pd.Series(np.vstack(s.values).tolist()) # @Nickil Maveli's soln
100 loops, best of 3: 5.77 ms per loop
In [45]: %timeit pd.Series([[a] for a in s]) # @P-robot's soln
1000 loops, best of 3: 412 µs per loop
In [46]: %timeit s.apply(lambda x: [x]) # @mgc's soln
1000 loops, best of 3: 551 µs per loop
In [47]: %timeit pd.Series(s.values[:,None].tolist()) # Approach1
1000 loops, best of 3: 307 µs per loop
In [48]: %timeit pd.Series(s.values.reshape(-1,1).tolist()) # Approach2
1000 loops, best of 3: 306 µs per loop
If you want the result to still be a pandas Series, you can use the apply method:
In [1]: import pandas as pd
In [2]: s = pd.Series([1, 2])
In [3]: s.apply(lambda x: [x])
Out[3]:
0 [1]
1 [2]
dtype: object
This does it:
import numpy as np
np.array([[a] for a in s],dtype=object)
array([[1],
       [2]], dtype=object)
Adjusting atomh33ls' answer, here's a series of lists:
output = pd.Series([[a] for a in s])
type(output)
>> pandas.core.series.Series
type(output[0])
>> list
Timings for a selection of the suggestions:
import numpy as np, pandas as pd
s = pd.Series(np.random.randint(1,10,1000))
>> %timeit pd.Series(np.vstack(s.values).tolist())
100 loops, best of 3: 3.2 ms per loop
>> %timeit pd.Series([[a] for a in s])
1000 loops, best of 3: 393 µs per loop
>> %timeit s.apply(lambda x: [x])
1000 loops, best of 3: 473 µs per loop

Pandas DataFrame to Dict Format with new Keys

What would be the best way to convert this:
deviceid devicetype
0 b569dcb7-4498-4cb4-81be-333a7f89e65f Google
1 04d3b752-f7a1-42ae-8e8a-9322cda4fd7f Android
2 cf7391c5-a82f-4889-8d9e-0a423f132026 Android
into this:
0 {"deviceid":"b569dcb7-4498-4cb4-81be-333a7f89e65f","devicetype":["Google"]}
1 {"deviceid":"04d3b752-f7a1-42ae-8e8a-9322cda4fd7f","devicetype":["Android"]}
2 {"deviceid":"cf7391c5-a82f-4889-8d9e-0a423f132026","devicetype":["Android"]}
I've tried df.to_dict() but that just gives:
{'deviceid': {0: 'b569dcb7-4498-4cb4-81be-333a7f89e65f',
1: '04d3b752-f7a1-42ae-8e8a-9322cda4fd7f',
2: 'cf7391c5-a82f-4889-8d9e-0a423f132026'},
'devicetype': {0: 'Google', 1: 'Android', 2: 'Android'}}
You can use apply with to_json:
In [11]: s = df.apply((lambda x: x.to_json()), axis=1)
In [12]: s[0]
Out[12]: '{"deviceid":"b569dcb7-4498-4cb4-81be-333a7f89e65f","devicetype":"Google"}'
To get the list for the device type you could do this manually:
In [13]: s1 = df.apply((lambda x: {"deviceid": x["deviceid"], "devicetype": [x["devicetype"]]}), axis=1)
In [14]: s1[0]
Out[14]: {'deviceid': 'b569dcb7-4498-4cb4-81be-333a7f89e65f', 'devicetype': ['Google']}
To expand on the previous answer, to_dict() should be a little faster than to_json().
This appears to be true for a larger test data frame, but the to_dict() method is actually a little slower for the example you provided.
Large test set
In [1]: %timeit s = df.apply((lambda x: x.to_json()), axis=1)
Out[1]: 100 loops, best of 3: 5.88 ms per loop
In [2]: %timeit s = df.apply((lambda x: x.to_dict()), axis=1)
Out[2]: 100 loops, best of 3: 3.91 ms per loop
Provided example
In [3]: %timeit s = df.apply((lambda x: x.to_json()), axis=1)
Out[3]: 1000 loops, best of 3: 375 µs per loop
In [4]: %timeit s = df.apply((lambda x: x.to_dict()), axis=1)
Out[4]: 1000 loops, best of 3: 450 µs per loop
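A non-apply alternative (my own sketch, not from the original answers): let to_dict('records') build plain row dictionaries once and wrap devicetype in a list afterwards:
records = df.to_dict('records')
s2 = pd.Series(
    [{'deviceid': r['deviceid'], 'devicetype': [r['devicetype']]} for r in records],
    index=df.index,
)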
