Fast advanced indexing in numpy - python

I'm trying to take a slice from a large numpy array as quickly as possible using fancy indexing. I would be happy returning a view, but advanced indexing returns a copy.
I've tried solutions from here and here with no joy so far.
Toy data:
data = np.random.randn(int(1e6), 50)
keep = np.random.rand(len(data))>0.5
Using the default method:
%timeit data[keep]
10 loops, best of 3: 86.5 ms per loop
Numpy take:
%timeit data.take(np.where(keep)[0], axis=0)
%timeit np.take(data, np.where(keep)[0], axis=0)
10 loops, best of 3: 83.1 ms per loop
10 loops, best of 3: 80.4 ms per loop
Method from here:
rows = np.where(keep)[0]
cols = np.arange(data.shape[1])
%timeit (data.ravel()[(cols + (rows * data.shape[1]).reshape((-1,1))).ravel()]).reshape(rows.size, cols.size)
10 loops, best of 3: 159 ms per loop
Whereas if you're taking a view of the same size:
%timeit data[1:-1:2, :]
1000000 loops, best of 3: 243 ns per loop

There's no way to do this with a view. A view needs consistent strides, while your data is randomly scattered throughout the original array.
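A small check makes the difference concrete. This is only a sketch: the array is much smaller than the one in the question and the variable names are chosen for illustration. It uses np.shares_memory to show that the regular strided slice shares its buffer with the original array, while the boolean-mask selection does not.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50))   # smaller than the question's array
keep = rng.random(len(data)) > 0.5

strided = data[1:-1:2, :]   # basic slicing: fixed start/stop/step
masked = data[keep]         # advanced (boolean) indexing

# A view is fully described by an offset plus one stride per axis,
# so the regular slice can reuse the original buffer.
is_view = np.shares_memory(data, strided)

# The masked rows sit at irregular offsets, so NumPy has to copy them.
is_copy = not np.shares_memory(data, masked)

# The strided slice simply steps over two original rows at a time.
row_stride_bytes = strided.strides[0]
print(is_view, is_copy, row_stride_bytes, data.strides[0])
```

The slice costs nanoseconds because nothing is moved; the masked selection has to touch every selected row, which is where the tens of milliseconds go.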

Related

Looping through pandas dataframe for speed

I'm trying to understand the fastest way to loop through a pandas DataFrame. I read in many places that itertuples is much better than a regular loop over the data, and that apply is best of all. If that is the case, why does the regular loop come out fastest? Maybe I'm not understanding the results: what does "10 loops, best of 3" mean?
%%timeit
xlist = []
for row in toMood.itertuples():
    xlist.append(row[1] + 1)
1 loop, best of 3: 266 ms per loop
In [54]:
%%timeit
zlist = []
for row in toMood['user_id']:
    zlist.append(row + 1)
10 loops, best of 3: 83 ms per loop
In [56]:
%%timeit
tlist = toMood['user_id'].apply(lambda x: x+1)
10 loops, best of 3: 138 ms per loop
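On the "10 loops, best of 3" part: that output means the statement was executed 10 times per timing run, the run was repeated 3 times, and the figure shown is the fastest run divided by the loop count. A rough sketch of the same procedure with the standard timeit module follows; the function add_one_loop is a hypothetical placeholder workload, not the DataFrame from the question.

```python
import timeit

def add_one_loop():
    # Placeholder workload standing in for the loops in the question.
    out = []
    for value in range(1000):
        out.append(value + 1)
    return out

loops = 10     # "10 loops": calls per timing run
repeats = 3    # "best of 3": number of timing runs

# Each entry is the total time in seconds for one run of `loops` calls.
run_totals = timeit.repeat(add_one_loop, number=loops, repeat=repeats)

# The reported figure is the fastest run divided by the loop count.
best_per_loop = min(run_totals) / loops
print(f"{loops} loops, best of {repeats}: {best_per_loop * 1e6:.1f} us per loop")
```

So the three timings above are directly comparable per-call costs: 266 ms, 83 ms and 138 ms for one full pass over the data.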

Pandas selecting columns - best habit and performance

There are many different ways to select a column in a pandas.DataFrame (same for rows). I am wondering if it makes any difference and if there are any performance and style recommendations.
E.g., if I have a DataFrame as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.random((10,4)), columns=['a','b','c','d'])
df.head()
There are many different ways to select e.g., column d
1) df['d']
2) df.loc[:,'d'] (where df.loc[row_indexer,column_indexer])
3) df.loc[:]['d']
4) df.ix[:]['d']
5) df.ix[:,'d']
Intuitively, I would prefer 2), maybe because I am used to the [row_indexer,column_indexer] style from numpy.
I used IPython's magic function %timeit to find the most performant method.
The results are:
%timeit df['d']
100000 loops, best of 3: 5.35 µs per loop
%timeit df.loc[:,'d']
10000 loops, best of 3: 44.3 µs per loop
%timeit df.loc[:]['d']
100000 loops, best of 3: 12.4 µs per loop
%timeit df.ix[:]['d']
100000 loops, best of 3: 10.4 µs per loop
%timeit df.ix[:,'d']
10000 loops, best of 3: 53 µs per loop
It turns out that the 1st method is considerably faster than others.
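Before comparing speed, it is worth confirming that the variants return the same thing. The sketch below checks options 1 to 3 only, because .ix was removed in pandas 1.0 and options 4 and 5 no longer run on current versions; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 4)), columns=["a", "b", "c", "d"])

by_getitem = df["d"]          # option 1: plain column lookup
by_loc = df.loc[:, "d"]       # option 2: label-based row and column indexers
by_chained = df.loc[:]["d"]   # option 3: slice all rows first, then look up

# All three should hold the same values under the same index.
same = by_getitem.equals(by_loc) and by_getitem.equals(by_chained)
print(same)
```

Since the results match, the timing difference comes from the extra indexing machinery that .loc goes through, not from doing different work on the data.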

python pandas: why map is faster?

In the pandas manual, there is this example about indexing:
In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]
then Wes wrote:
# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]
Can anyone here explain why the map approach is faster? Is this a Python feature or a pandas feature?
Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:
import numpy as np, pandas as pd
import random, string
def make_test(num, width):
    s = [''.join(random.sample(string.ascii_lowercase, width)) for i in range(num)]
    df = pd.DataFrame({"a": s})
    return df
Let's compare the time they take to make the indexing object -- whether a Series or a list -- and the resulting time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast but before using it as an index it needs to be internally converted to a Series or an ndarray or something and so there's extra time added there.
First, for a small frame:
>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 µs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 µs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 µs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 µs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 µs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 µs per loop
and in this case the listcomp is fastest. That doesn't actually surprise me too much, to be honest, because going via a lambda is likely to be slower than using str.startswith directly, but it's really hard to guess. 10 is small enough we're probably still measuring things like setup costs for Series; what happens in a larger frame?
>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop
And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?
>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop
and now the listcomp wins again!
Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.
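In the spirit of asking whether you're testing what you think you are, here is a minimal sketch that first checks the three approaches select the same rows before any timing is trusted. It reuses make_test from above; the frame size of 1000 is an arbitrary choice for illustration.

```python
import random
import string

import numpy as np
import pandas as pd

def make_test(num, width):
    # Same helper as above: `num` random lowercase strings of length `width`.
    s = ["".join(random.sample(string.ascii_lowercase, width)) for _ in range(num)]
    return pd.DataFrame({"a": s})

df = make_test(1000, 10)

via_map = df["a"].map(lambda x: x.startswith("t"))     # Series of bools
via_listcomp = [x.startswith("t") for x in df["a"]]    # plain Python list
via_str = df["a"].str.startswith("t")                  # vectorised string method

# The three masks should agree element by element.
masks_agree = (
    np.array_equal(via_map.to_numpy(), np.array(via_listcomp))
    and np.array_equal(via_map.to_numpy(), via_str.to_numpy())
)

# Indexing with the Series or with an array built from the list
# should select the same rows.
rows_agree = df[via_map].equals(df[np.array(via_listcomp)])
print(masks_agree, rows_agree)
```

Only once the outputs are known to match does it make sense to compare %timeit numbers for the different ways of building the mask.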

numpy np.array versus np.matrix (performance)

Often when working with numpy I find the array/matrix distinction annoying: when I pull out a vector or a row from a matrix and then perform operations with np.arrays, there are usually problems.
To reduce headaches, I've taken to sometimes just using np.matrix (converting all np.arrays to np.matrix) for simplicity. However, I suspect there are some performance implications. Could anyone comment on what those might be and the reasons why?
It seems that if both are just arrays under the hood, element access is simply an offset calculation to get the value, so I'm not sure what the difference might be without reading through the entire source.
More specifically, what performance implications does this have:
v = np.matrix([1, 2, 3, 4])
# versus the below
w = np.array([1, 2, 3, 4])
thanks
I added some more tests, and it appears that an array is considerably faster than matrix when array/matrices are small, but the difference gets smaller for larger data structures:
Small (2x4):
In [11]: a = [[1,2,3,4],[5,6,7,8]]
In [12]: aa = np.array(a)
In [13]: ma = np.matrix(a)
In [14]: %timeit aa.sum()
1000000 loops, best of 3: 1.77 us per loop
In [15]: %timeit ma.sum()
100000 loops, best of 3: 15.1 us per loop
In [16]: %timeit np.dot(aa, aa.T)
1000000 loops, best of 3: 1.72 us per loop
In [17]: %timeit ma * ma.T
100000 loops, best of 3: 7.46 us per loop
Larger (100x100):
In [19]: aa = np.arange(10000).reshape(100,100)
In [20]: ma = np.matrix(aa)
In [21]: %timeit aa.sum()
100000 loops, best of 3: 9.18 us per loop
In [22]: %timeit ma.sum()
10000 loops, best of 3: 22.9 us per loop
In [23]: %timeit np.dot(aa, aa.T)
1000 loops, best of 3: 1.26 ms per loop
In [24]: %timeit ma * ma.T
1000 loops, best of 3: 1.24 ms per loop
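Notice that for the larger case the multiplication times are nearly identical (1.24 ms for the matrix versus 1.26 ms for the array), a gap small enough to be measurement noise.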
I believe that what I am getting here is consistent with what @Jaime is explaining in the comments.
There is a general discussion on SciPy.org and on this question.
To compare performance, I did the following in IPython. It turns out that arrays are significantly faster.
In [1]: import numpy as np
In [2]: %%timeit
...: v = np.matrix([1, 2, 3, 4])
100000 loops, best of 3: 16.9 us per loop
In [3]: %%timeit
...: w = np.array([1, 2, 3, 4])
100000 loops, best of 3: 7.54 us per loop
Therefore numpy arrays seem to have faster performance than numpy matrices.
Versions used:
Numpy: 1.7.1
IPython: 0.13.2
Python: 2.7
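Apart from speed, the two types behave differently, which is usually where the "problems" mentioned in the question come from. The sketch below uses small illustrative values to show the two differences that tend to matter: a row pulled from a matrix stays 2-D, and * means matrix product for np.matrix but element-wise product for arrays. Current NumPy documentation also recommends against np.matrix for new code.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
m = np.matrix(a)

row_a = a[0]        # array: indexing a row drops a dimension, shape (3,)
row_m = m[0]        # matrix: the row stays 2-D, shape (1, 3)
row_a_2d = a[0:1]   # array: slicing instead of indexing keeps shape (1, 3)

elementwise = a * a   # arrays: element-by-element product, shape (2, 3)
matmul_m = m * m.T    # matrix: * is the matrix product, shape (2, 2)
matmul_a = a @ a.T    # arrays: @ is the explicit matrix product

print(row_a.shape, row_m.shape, row_a_2d.shape)
print(np.array_equal(np.asarray(matmul_m), matmul_a))
```

Staying with plain arrays and using slicing (a[0:1]) plus @ gives the 2-D behaviour without converting everything to np.matrix.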

Why is numpy.array() sometimes very slow?

I'm using the numpy.array() function to create numpy.float64 ndarrays from lists.
I noticed that this is very slow when either the list contains None or a list of lists is provided.
Below are some examples with times. There are obvious workarounds but why is this so slow?
Examples for list of None:
### Very slow to call array() with list of None
In [3]: %timeit numpy.array([None]*100000, dtype=numpy.float64)
1 loops, best of 3: 240 ms per loop
### Problem doesn't exist with array of zeroes
In [4]: %timeit numpy.array([0.0]*100000, dtype=numpy.float64)
100 loops, best of 3: 9.94 ms per loop
### Also fast if we use dtype=object and convert to float64
In [5]: %timeit numpy.array([None]*100000, dtype=object).astype(numpy.float64)
100 loops, best of 3: 4.92 ms per loop
### Also fast if we use fromiter() instead of array()
In [6]: %timeit numpy.fromiter([None]*100000, dtype=numpy.float64)
100 loops, best of 3: 3.29 ms per loop
Examples for list of lists:
### Very slow to create column matrix
In [7]: %timeit numpy.array([[0.0]]*100000, dtype=numpy.float64)
1 loops, best of 3: 353 ms per loop
### No problem to create column vector and reshape
In [8]: %timeit numpy.array([0.0]*100000, dtype=numpy.float64).reshape((-1,1))
100 loops, best of 3: 10 ms per loop
### Can use itertools to flatten input lists
In [9]: %timeit numpy.fromiter(itertools.chain.from_iterable([[0.0]]*100000),dtype=numpy.float64).reshape((-1,1))
100 loops, best of 3: 9.65 ms per loop
I've reported this as a numpy issue. The report and patch files are here:
https://github.com/numpy/numpy/issues/3392
After patching:
# was 240 ms, best alternate version was 3.29
In [5]: %timeit numpy.array([None]*100000)
100 loops, best of 3: 7.49 ms per loop
# was 353 ms, best alternate version was 9.65
In [6]: %timeit numpy.array([[0.0]]*100000)
10 loops, best of 3: 23.7 ms per loop
My guess would be that the code for converting lists just calls float on everything. If the argument defines __float__, that gets called; otherwise it is treated like a string (which raises an exception for None, and the exception is caught and np.nan is put in its place). The exception handling is what should be relatively slow.
Timing seems to verify this hypothesis:
import numpy as np
%timeit [None] * 100000
> 1000 loops, best of 3: 1.04 ms per loop
%timeit np.array([0.0] * 100000)
> 10 loops, best of 3: 21.3 ms per loop
%timeit [i.__float__() for i in [0.0] * 100000]
> 10 loops, best of 3: 32 ms per loop
def flt(d):
    # Mimic the guessed behaviour: try float(), fall back to NaN on failure.
    try:
        return float(d)
    except (TypeError, ValueError):
        return np.nan
%timeit np.array([None] * 100000, dtype=np.float64)
> 1 loops, best of 3: 477 ms per loop
%timeit [flt(d) for d in [None] * 100000]
> 1 loops, best of 3: 328 ms per loop
Adding another case just to be obvious about where I'm going with this. If there was an explicit check for None, it would not be this slow above:
def flt2(d):
    # Check for None explicitly so the common case never raises.
    if d is None:
        return np.nan
    try:
        return float(d)
    except (TypeError, ValueError):
        return np.nan
%timeit [flt2(d) for d in [None] * 100000]
> 10 loops, best of 3: 45 ms per loop
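The same idea can be applied as a workaround on the caller's side: replace None with NaN before handing the list to np.array, so the conversion never has to deal with a failing float() call. This is a sketch only; to_float_array is a hypothetical helper name, and whether it is actually faster depends on the NumPy version, so it should be timed on your own setup.

```python
import numpy as np

def to_float_array(values):
    """Convert a flat list to float64, mapping None to NaN up front."""
    # Explicit identity check, as in flt2 above: no exception is raised.
    cleaned = [np.nan if v is None else v for v in values]
    return np.array(cleaned, dtype=np.float64)

values = [None, 1.5, 2, None]
result = to_float_array(values)
print(result.dtype, np.isnan(result).sum())
```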
