Why is numpy ma.average 24 times slower than arr.mean?

I've found something interesting in Python's NumPy: ma.average is a lot slower than arr.mean (where arr is an array).
>>> arr = np.full((3, 3), -9999, dtype=float)
>>> arr
array([[-9999., -9999., -9999.],
       [-9999., -9999., -9999.],
       [-9999., -9999., -9999.]])
%timeit np.ma.average(arr, axis=0)
The slowest run took 49.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop
%timeit arr.mean(axis=0)
The slowest run took 6.63 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.41 µs per loop
with random numbers
arr = np.random.random((3,3))
%timeit arr.mean(axis=0)
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.78 µs per loop
%timeit np.ma.average(arr, axis=0)
1000 loops, best of 3: 186 µs per loop
--> That's nearly 24 times slower.
Documentation
numpy.ma.average(a, axis=None, weights=None, returned=False)
Return the weighted average of array over the given axis.
numpy.mean(a, axis=None, dtype=None, out=None, keepdims)
Compute the arithmetic mean along the specified axis.
Why is ma.average so much slower than arr.mean? Mathematically they are the same (correct me if I'm wrong). My guess is that it has something to do with the weighted option on ma.average, but shouldn't there be a fallback if no weights are passed?

A good way to find out why something is slower is to profile it. I'll use the third-party library line_profiler and the IPython command %lprun (see for example this blog) here:
%load_ext line_profiler
import numpy as np
arr = np.full((3, 3), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 1810 1810.0 30.5 a = asarray(a)
571 1 15 15.0 0.3 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 5 5.0 0.1 if weights is None:
576 1 3500 3500.0 59.0 avg = a.mean(axis)
577 1 591 591.0 10.0 scl = avg.dtype.type(a.count(axis))
578 else:
...
608
609 1 7 7.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 5 5.0 0.1 return avg
I removed some irrelevant lines.
So actually 30% of the time is spent in np.ma.asarray (something that arr.mean doesn't have to do!).
However the relative times change drastically if you use a bigger array:
arr = np.full((1000, 1000), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 609 609.0 7.6 a = asarray(a)
571 1 14 14.0 0.2 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 7 7.0 0.1 if weights is None:
576 1 6924 6924.0 86.9 avg = a.mean(axis)
577 1 404 404.0 5.1 scl = avg.dtype.type(a.count(axis))
578 else:
...
609 1 6 6.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 6 6.0 0.1 return avg
This time the np.ma.MaskedArray.mean function almost takes up 90% of the time.
Note: You could also dig deeper and look into np.ma.asarray or np.ma.MaskedArray.count or np.ma.MaskedArray.mean and check their line profilings. But I just wanted to show that there are lots of called functions that add to the overhead.
So the next question is: did the relative times between np.ndarray.mean and np.ma.average also change? And at least on my computer the difference is much lower now:
%timeit np.ma.average(arr, axis=0)
# 2.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr.mean(axis=0)
# 1.84 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time it's not even 2 times slower. I assume for even bigger arrays the difference will get even smaller.
This is also something that is actually quite common with NumPy:
The constant factors are quite high even for plain numpy functions (see for example my answer to the question "Performance in different vectorization method in numpy"). For np.ma these constant factors are even bigger, especially if you don't use a np.ma.MaskedArray as input. But even though the constant factors might be high, these functions excel with big arrays.
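To see those constant factors directly, here is a minimal sketch of my own (using the standard-library timeit module rather than %timeit) that compares both functions over growing array sizes; the exact numbers will of course differ per machine:
# Sketch: per-call cost of arr.mean vs np.ma.average for growing array sizes.
import timeit
import numpy as np

for n in (3, 100, 1000):
    arr = np.full((n, n), -9999, dtype=float)
    t_mean = timeit.timeit(lambda: arr.mean(axis=0), number=100) / 100
    t_ma = timeit.timeit(lambda: np.ma.average(arr, axis=0), number=100) / 100
    print("n=%4d  mean: %8.1f µs  ma.average: %8.1f µs  ratio: %.1f"
          % (n, t_mean * 1e6, t_ma * 1e6, t_ma / t_mean))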

Thanks to @WillemVanOnsem and @sascha in the comments above.
Edit: this applies to small arrays; see the accepted answer for more information.
Masked operations are slow; try to avoid them:
mask = self.local_pos_history[:, 0] > -9
local_pos_hist_masked = self.local_pos_history[mask]
avg = local_pos_hist_masked.mean(axis=0)
Old version with masking:
mask = np.ma.masked_where(self.local_pos_history > -9, self.local_pos_history)
local_pos_hist_mask = self.local_pos_history[mask].reshape(len(self.local_pos_history) // 3, 3)
avg_pos = self.local_pos_history
np.average is nearly as fast as arr.mean:
%timeit np.average(arr, axis=0)
The slowest run took 5.81 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.89 µs per loop
%timeit np.mean(arr, axis=0)
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.74 µs per loop
Just for clarification: these are still tests on a small array.
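If (as in the question) the mask only exists to exclude a sentinel value such as -9999, another option is to avoid np.ma entirely and use the NaN-aware reductions. This is just a sketch of mine, not part of the original answers:
import numpy as np

arr = np.full((3, 3), -9999, dtype=float)
arr[0, 0] = 1.0

# Replace the sentinel with NaN and take a NaN-aware mean per column.
clean = np.where(arr == -9999, np.nan, arr)
col_means = np.nanmean(clean, axis=0)  # columns that are all NaN produce nan (with a warning)
print(col_means)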

Related

Find least frequent value in whole dataframe

My dataframe is something like this:
93 40 73 41 115 74 59 98 76 109 43 44
105 119 56 62 69 51 50 104 91 78 77 75
119 61 106 105 102 75 43 51 60 114 91 83
It has 8000 rows and 12 columns.
I want to find the least frequent value in this whole dataframe (not only per column).
I tried converting the dataframe into a numpy array and using a for loop to count the values and then return the least frequent one, but that is not very efficient. I searched for other methods but could not find any.
I only found scipy.stats.mode, which returns the most frequent value.
Is there any other way to do it?
You could stack and take the value_counts:
df.stack().value_counts().index[-1]
# 69
value_counts orders by frequency, so you can just take the last, though in this example many appear just once. 69 happens to be the last.
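When several values are tied for the minimal count, index[-1] just gives you one of them. A small sketch of my own (using the same df as above) that returns all tied values instead:
# Sketch: return every value that attains the minimal count, not just the last one.
counts = df.stack().value_counts()
least_frequent = counts[counts == counts.min()].index.tolist()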
Another way using pandas.DataFrame.apply with pandas.Series.value_counts:
df.apply(pd.Series.value_counts).sum(1).idxmin()
# 40
# There are many values with same frequencies.
To my surprise, the apply method seems to be the fastest among the methods I've tried (which is why I'm posting):
df2 = pd.DataFrame(np.random.randint(1, 1000, (500000, 100)))
%timeit df2.apply(pd.Series.value_counts).sum(1).idxmin()
# 2.36 s ± 193 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2.stack().value_counts().index[-1]
# 3.02 s ± 86.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
uniq, cnt = np.unique(df2, return_counts=True)
uniq[np.argmin(cnt)]
# 2.77 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my understanding that apply is very slow, it even outperformed numpy.unique (though perhaps my code is suboptimal).
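Since the values here are non-negative integers, a numpy-only alternative worth trying is np.bincount. This is my own sketch, not benchmarked against the timings above, and it only works for non-negative integer data:
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(1, 1000, (500000, 100)))

# Count every non-negative integer in one flat pass over the data.
counts = np.bincount(df2.values.ravel())
present = np.nonzero(counts)[0]                       # values that actually occur
least_frequent = present[np.argmin(counts[present])]
print(least_frequent, counts[least_frequent])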

Resample Pandas Dataframe Without Filling in Missing Times

Resampling a dataframe can take the dataframe to either a higher or lower temporal resolution. Most of the time this is used to go to lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:
import numpy as np
import pandas as pd
freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")
index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")
df = pd.DataFrame(index=index, data=data, columns=cols)
# now resample to daily average
df = df.resample(rule="1D").mean()
Most of the data in this dataframe is useless and can be removed via:
df.dropna(how="all", axis=0, inplace=True)
However, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resulting dataframe would have only two rows)?
Updating my original answer with (what I think) is an improvement, plus updated times.
Use groupby
There are a couple ways you can use groupby instead of resample. In the case of a day ("1D") resampling, you can just use the date property of the DateTimeIndex:
df = df.groupby(df.index.date).mean()
This is in fact faster than the resample for your data:
%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The more general approach would be to use the floor of the timestamps to do the groupby operation:
rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2020-01-01 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
This will work with more irregular frequencies as well. The main snag here is that by default, it seems like the floor is calculated in reference to some initial date, which can cause weird results (see my post):
rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 1999-12-30 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-26 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
The major issue is that the resampling doesn't start on the earliest timestamp within your data. However, it is fixable using this solution to the above post:
# custom function for flooring relative to a start date
def floor(x, freq):
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset
rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-28 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
# the cycle of 7 days is now starting from 01-01-2000
Just note here that the function floor() is relatively slow compared to pandas.Series.dt.floor(). So it is best to use the latter if you can, but both are better than the original resample (in your example):
%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
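As a sanity check (my own addition, not part of the original answer), the groupby-on-floored-index result should match resample followed by dropna for the '1D' rule, using the df from the question:
# Sketch: both approaches should produce the same two daily rows for this data.
via_resample = df.resample(rule="1D").mean().dropna(how="all")
via_groupby = df.groupby(df.index.floor("1D")).mean()
print(via_resample.equals(via_groupby))                      # expected: True
print(np.allclose(via_resample.values, via_groupby.values))  # value-level check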

Converting a series of ints to strings - Why is apply much faster than astype?

I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. So suppose I had a Series object:
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 1000000))
On StackOverflow and other websites, I've seen most people argue that the best way to do this is:
%%timeit
x = x.astype(str)
This takes about 2 seconds.
When I use x = x.apply(str), it only takes 0.2 seconds.
Why is x.astype(str) so slow? Should the recommended way be x.apply(str)?
I'm mainly interested in Python 3's behavior here.
Let's begin with a bit of general advice: if you're interested in finding the bottlenecks of Python code, you can use a profiler to find the functions/parts that eat up most of the time. In this case I use a line profiler because you can actually see the implementation and the time spent on each line.
However, these tools don't work with C or Cython by default. Given that CPython (the Python interpreter I'm using), NumPy and pandas make heavy use of C and Cython, there will be a limit to how far I'll get with profiling.
One could probably extend the profiling to the Cython code, and probably also to the C code, by recompiling with debug symbols and tracing. However, compiling these libraries is not an easy task, so I won't do that (but if someone wants to try, the Cython documentation includes a page about profiling Cython code).
But let's see how far I can get:
Line-Profiling Python code
I'm going to use line-profiler and a Jupyter Notebook here:
%load_ext line_profiler
import numpy as np
import pandas as pd
x = pd.Series(np.random.randint(0, 100, 100000))
Profiling x.astype
%lprun -f x.astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
87 @wraps(func)
88 def wrapper(*args, **kwargs):
89 1 12 12.0 0.0 old_arg_value = kwargs.pop(old_arg_name, None)
90 1 5 5.0 0.0 if old_arg_value is not None:
91 if mapping is not None:
...
118 1 663354 663354.0 100.0 return func(*args, **kwargs)
So that's simply a decorator and 100% of the time is spent in the decorated function. So let's profile the decorated function:
%lprun -f x.astype.__wrapped__ x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3896 @deprecate_kwarg(old_arg_name='raise_on_error', new_arg_name='errors',
3897 mapping={True: 'raise', False: 'ignore'})
3898 def astype(self, dtype, copy=True, errors='raise', **kwargs):
3899 """
...
3975 """
3976 1 28 28.0 0.0 if is_dict_like(dtype):
3977 if self.ndim == 1: # i.e. Series
...
4001
4002 # else, only a single dtype is given
4003 1 14 14.0 0.0 new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
4004 1 685863 685863.0 99.9 **kwargs)
4005 1 340 340.0 0.0 return self._constructor(new_data).__finalize__(self)
Source
Again one line is the bottleneck so let's check the _data.astype method:
%lprun -f x._data.astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3461 def astype(self, dtype, **kwargs):
3462 1 695866 695866.0 100.0 return self.apply('astype', dtype=dtype, **kwargs)
Okay, another delegate, let's see what _data.apply does:
%lprun -f x._data.apply x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3251 def apply(self, f, axes=None, filter=None, do_integrity_check=False,
3252 consolidate=True, **kwargs):
3253 """
...
3271 """
3272
3273 1 12 12.0 0.0 result_blocks = []
...
3309
3310 1 10 10.0 0.0 aligned_args = dict((k, kwargs[k])
3311 1 29 29.0 0.0 for k in align_keys
3312 if hasattr(kwargs[k], 'reindex_axis'))
3313
3314 2 28 14.0 0.0 for b in self.blocks:
...
3329 1 674974 674974.0 100.0 applied = getattr(b, f)(**kwargs)
3330 1 30 30.0 0.0 result_blocks = _extend_blocks(applied, result_blocks)
3331
3332 1 10 10.0 0.0 if len(result_blocks) == 0:
3333 return self.make_empty(axes or self.axes)
3334 1 10 10.0 0.0 bm = self.__class__(result_blocks, axes or self.axes,
3335 1 76 76.0 0.0 do_integrity_check=do_integrity_check)
3336 1 13 13.0 0.0 bm._consolidate_inplace()
3337 1 7 7.0 0.0 return bm
Source
And again ... one function call is taking all the time, this time it's x._data.blocks[0].astype:
%lprun -f x._data.blocks[0].astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
542 def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
543 1 18 18.0 0.0 return self._astype(dtype, copy=copy, errors=errors, values=values,
544 1 671092 671092.0 100.0 **kwargs)
.. which is another delegate...
%lprun -f x._data.blocks[0]._astype x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
546 def _astype(self, dtype, copy=False, errors='raise', values=None,
547 klass=None, mgr=None, **kwargs):
548 """
...
557 """
558 1 11 11.0 0.0 errors_legal_values = ('raise', 'ignore')
559
560 1 8 8.0 0.0 if errors not in errors_legal_values:
561 invalid_arg = ("Expected value of kwarg 'errors' to be one of {}. "
562 "Supplied value is '{}'".format(
563 list(errors_legal_values), errors))
564 raise ValueError(invalid_arg)
565
566 1 23 23.0 0.0 if inspect.isclass(dtype) and issubclass(dtype, ExtensionDtype):
567 msg = ("Expected an instance of {}, but got the class instead. "
568 "Try instantiating 'dtype'.".format(dtype.__name__))
569 raise TypeError(msg)
570
571 # may need to convert to categorical
572 # this is only called for non-categoricals
573 1 72 72.0 0.0 if self.is_categorical_astype(dtype):
...
595
596 # astype processing
597 1 16 16.0 0.0 dtype = np.dtype(dtype)
598 1 19 19.0 0.0 if self.dtype == dtype:
...
603 1 8 8.0 0.0 if klass is None:
604 1 13 13.0 0.0 if dtype == np.object_:
605 klass = ObjectBlock
606 1 6 6.0 0.0 try:
607 # force the copy here
608 1 7 7.0 0.0 if values is None:
609
610 1 8 8.0 0.0 if issubclass(dtype.type,
611 1 14 14.0 0.0 (compat.text_type, compat.string_types)):
612
613 # use native type formatting for datetime/tz/timedelta
614 1 15 15.0 0.0 if self.is_datelike:
615 values = self.to_native_types()
616
617 # astype formatting
618 else:
619 1 8 8.0 0.0 values = self.values
620
621 else:
622 values = self.get_values(dtype=dtype)
623
624 # _astype_nansafe works fine with 1-d only
625 1 665777 665777.0 99.9 values = astype_nansafe(values.ravel(), dtype, copy=True)
626 1 32 32.0 0.0 values = values.reshape(self.shape)
627
628 1 17 17.0 0.0 newb = make_block(values, placement=self.mgr_locs, dtype=dtype,
629 1 269 269.0 0.0 klass=klass)
630 except:
631 if errors == 'raise':
632 raise
633 newb = self.copy() if copy else self
634
635 1 8 8.0 0.0 if newb.is_numeric and self.is_numeric:
...
642 1 6 6.0 0.0 return newb
Source
... okay, still not there. Let's check out astype_nansafe:
%lprun -f pd.core.internals.astype_nansafe x.astype(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
640 def astype_nansafe(arr, dtype, copy=True):
641 """ return a view if copy is False, but
642 need to be very careful as the result shape could change! """
643 1 13 13.0 0.0 if not isinstance(dtype, np.dtype):
644 dtype = pandas_dtype(dtype)
645
646 1 8 8.0 0.0 if issubclass(dtype.type, text_type):
647 # in Py3 that's str, in Py2 that's unicode
648 1 663317 663317.0 100.0 return lib.astype_unicode(arr.ravel()).reshape(arr.shape)
...
Source
Again it's one line that takes 100% of the time, so I'll go one function further:
%lprun -f pd.core.dtypes.cast.lib.astype_unicode x.astype(str)
UserWarning: Could not extract a code object for the object <built-in function astype_unicode>
Okay, we found a built-in function; that means it's a C function, in this case a Cython function. It also means we cannot dig deeper with line_profiler, so I'll stop here for now.
Profiling x.apply
%lprun -f x.apply x.apply(str)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
2426 def apply(self, func, convert_dtype=True, args=(), **kwds):
2427 """
...
2523 """
2524 1 84 84.0 0.0 if len(self) == 0:
2525 return self._constructor(dtype=self.dtype,
2526 index=self.index).__finalize__(self)
2527
2528 # dispatch to agg
2529 1 11 11.0 0.0 if isinstance(func, (list, dict)):
2530 return self.aggregate(func, *args, **kwds)
2531
2532 # if we are a string, try to dispatch
2533 1 12 12.0 0.0 if isinstance(func, compat.string_types):
2534 return self._try_aggregate_string_function(func, *args, **kwds)
2535
2536 # handle ufuncs and lambdas
2537 1 7 7.0 0.0 if kwds or args and not isinstance(func, np.ufunc):
2538 f = lambda x: func(x, *args, **kwds)
2539 else:
2540 1 6 6.0 0.0 f = func
2541
2542 1 154 154.0 0.1 with np.errstate(all='ignore'):
2543 1 11 11.0 0.0 if isinstance(f, np.ufunc):
2544 return f(self)
2545
2546 # row-wise access
2547 1 188 188.0 0.1 if is_extension_type(self.dtype):
2548 mapped = self._values.map(f)
2549 else:
2550 1 6238 6238.0 3.3 values = self.asobject
2551 1 181910 181910.0 95.5 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 1 28 28.0 0.0 if len(mapped) and isinstance(mapped[0], Series):
2554 from pandas.core.frame import DataFrame
2555 return DataFrame(mapped.tolist(), index=self.index)
2556 else:
2557 1 19 19.0 0.0 return self._constructor(mapped,
2558 1 1870 1870.0 1.0 index=self.index).__finalize__(self)
Source
Again it's one function that takes most of the time: lib.map_infer ...
%lprun -f pd.core.series.lib.map_infer x.apply(str)
Could not extract a code object for the object <built-in function map_infer>
Okay, that's another Cython function.
This time there's another (although less significant) contributor with ~3%: values = self.asobject. But I'll ignore this for now, because we're interested in the major contributors.
Going into C/Cython
The functions called by astype
This is the astype_unicode function:
cpdef ndarray[object] astype_unicode(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, unicode(arr[i]))

    return result
Source
This function uses this helper:
cdef inline set_value_at_unsafe(ndarray arr, object loc, object value):
    cdef:
        Py_ssize_t i, sz
    if is_float_object(loc):
        casted = int(loc)
        if casted == loc:
            loc = casted
    i = <Py_ssize_t> loc
    sz = cnp.PyArray_SIZE(arr)

    if i < 0:
        i += sz
    elif i >= sz:
        raise IndexError('index out of bounds')

    assign_value_1d(arr, i, value)
Source
Which itself uses this C function:
PANDAS_INLINE int assign_value_1d(PyArrayObject* ap, Py_ssize_t _i,
                                  PyObject* v) {
    npy_intp i = (npy_intp)_i;
    char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
    return PyArray_DESCR(ap)->f->setitem(v, item, ap);
}
Source
Functions called by apply
This is the implementation of the map_infer function:
def map_infer(ndarray arr, object f, bint convert=1):
    cdef:
        Py_ssize_t i, n
        ndarray[object] result
        object val

    n = len(arr)
    result = np.empty(n, dtype=object)
    for i in range(n):
        val = f(util.get_value_at(arr, i))

        # unbox 0-dim arrays, GH #690
        if is_array(val) and PyArray_NDIM(val) == 0:
            # is there a faster way to unbox?
            val = val.item()

        result[i] = val

    if convert:
        return maybe_convert_objects(result,
                                     try_float=0,
                                     convert_datetime=0,
                                     convert_timedelta=0)

    return result
Source
With this helper:
cdef inline object get_value_at(ndarray arr, object loc):
    cdef:
        Py_ssize_t i, sz
        int casted

    if is_float_object(loc):
        casted = int(loc)
        if casted == loc:
            loc = casted
    i = <Py_ssize_t> loc
    sz = cnp.PyArray_SIZE(arr)

    if i < 0 and sz > 0:
        i += sz
    elif i >= sz or sz == 0:
        raise IndexError('index out of bounds')

    return get_value_1d(arr, i)
Source
Which uses this C function:
PANDAS_INLINE PyObject* get_value_1d(PyArrayObject* ap, Py_ssize_t i) {
    char* item = (char*)PyArray_DATA(ap) + i * PyArray_STRIDE(ap, 0);
    return PyArray_Scalar(item, PyArray_DESCR(ap), (PyObject*)ap);
}
Source
Some thoughts on the Cython code
There are some differences between the Cython functions that are eventually called.
The path taken by astype uses unicode while the apply path uses the function that was passed in. Let's see if that makes a difference (again, IPython/Jupyter makes it very easy to compile Cython code yourself):
%load_ext cython
%%cython
import numpy as np
cimport numpy as np
cpdef object func_called_by_astype(np.ndarray arr):
    cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
    for i in range(arr.size):
        ret[i] = unicode(arr[i])
    return ret

cpdef object func_called_by_apply(np.ndarray arr, object f):
    cdef np.ndarray[object] ret = np.empty(arr.size, dtype=object)
    for i in range(arr.size):
        ret[i] = f(arr[i])
    return ret
Timing:
import numpy as np
arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
514 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr, str)
632 ms ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Okay, there is a difference, but it's in the wrong direction: it would actually indicate that apply should be slightly slower.
But remember the asobject call that I mentioned earlier in the apply function? Could that be the reason? Let's see:
import numpy as np
arr = np.random.randint(0, 10000, 1000000)
%timeit func_called_by_astype(arr)
557 ms ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
317 ms ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now it looks better. The conversion to an object array made the function called by apply much faster. There is a simple reason for this: str is a Python function, and Python functions are generally much faster if you already have Python objects, so NumPy (or pandas) doesn't need to create a Python wrapper for the value stored in the array (which is generally not a Python object, unless the array has dtype object).
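A quick way to see this effect in isolation (a sketch of my own, not from the profiling above) is to loop over an int64 array versus the same data as an object array and call str on each element:
import timeit
import numpy as np

arr_int = np.random.randint(0, 10000, 100000)   # each access boxes a NumPy scalar
arr_obj = arr_int.astype(object)                # elements are already Python ints

print(timeit.timeit(lambda: [str(v) for v in arr_int], number=10))
print(timeit.timeit(lambda: [str(v) for v in arr_obj], number=10))
# The object-array loop is usually noticeably faster, because str() receives plain Python ints.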
However that doesn't explain the huge difference that you've seen. My suspicion is that there is actually an additional difference in the ways the arrays are iterated over and the elements are set in the result. Very likely the:
val = f(util.get_value_at(arr, i))
if is_array(val) and PyArray_NDIM(val) == 0:
    val = val.item()
result[i] = val
part of the map_infer function is faster than:
for i in range(n):
    # we can use the unsafe version because we know `result` is mutable
    # since it was created from `np.empty`
    util.set_value_at_unsafe(result, i, unicode(arr[i]))
which is called by the astype(str) path. The comments in the first function seem to indicate that the writer of map_infer actually tried to make the code as fast as possible (see the comment "is there a faster way to unbox?"), while the other one was maybe written without special care about performance. But that's just a guess.
Also, on my computer these custom functions are already quite close to the performance of x.astype(str) and x.apply(str):
import numpy as np
arr = np.random.randint(0, 100, 1000000)
s = pd.Series(arr)
%timeit s.astype(str)
535 ms ± 23.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_astype(arr)
547 ms ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit s.apply(str)
216 ms ± 8.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit func_called_by_apply(arr.astype(object), str)
272 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that I also checked some other variants that return a different result:
%timeit s.values.astype(str) # array of strings
407 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(map(str, s.values.tolist())) # list of strings
184 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly the Python loop with list and map seems to be the fastest on my computer.
I actually made a small benchmark including plot:
import numpy as np
import pandas as pd
import simple_benchmark

def Series_astype(series):
    return series.astype(str)

def Series_apply(series):
    return series.apply(str)

def Series_tolist_map(series):
    return list(map(str, series.values.tolist()))

def Series_values_astype(series):
    return series.values.astype(str)

arguments = {2**i: pd.Series(np.random.randint(0, 100, 2**i)) for i in range(2, 20)}
b = simple_benchmark.benchmark(
    [Series_astype, Series_apply, Series_tolist_map, Series_values_astype],
    arguments,
    argument_name='Series size'
)
%matplotlib notebook
b.plot()
Note that it's a log-log plot because of the huge range of sizes I covered in the benchmark. However lower means faster here.
The results may be different for different versions of Python/NumPy/Pandas. So if you want to compare it, these are my versions:
Versions
--------
Python 3.6.5
NumPy 1.14.2
Pandas 0.22.0
Performance
It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str) # 42ms (1)
%timeit x.map(str) # 42ms (2)
%timeit x.astype(str) # 559ms (3)
%timeit [str(i) for i in x] # 566ms (4)
%timeit list(map(str, x)) # 536ms (5)
%timeit x.values.astype(str) # 25ms (6)
Points worth noting:
(5) is marginally quicker than (3) / (4), which we expect as more work is moved into C [assuming no lambda function is used].
(6) is by far the fastest.
(1) / (2) are similar.
(3) / (4) are similar.
Why is x.map / x.apply fast?
This appears to be because it uses fast compiled Cython code:
cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))

    return result
Why is x.astype(str) slow?
Pandas applies str to each item in the series, not using the above Cython.
Hence performance is comparable to [str(i) for i in x] / list(map(str, x)).
Why is x.values.astype(str) so fast?
NumPy does not apply a function to each element of the array. One description of this that I found:
If you did s.values.astype(str) what you get back is an object holding
int. This is numpy doing the conversion, whereas pandas iterates over
each item and calls str(item) on it. So if you do s.astype(str) you have
an object holding str.
There is a technical reason why the numpy version hasn't been implemented in the case of no-nulls.
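To see the difference in what the two approaches actually return, here is a small illustrative sketch of my own (exact dtypes may vary by platform and version):
import numpy as np
import pandas as pd

x = pd.Series(np.random.randint(0, 100, 5))

print(x.astype(str).values.dtype)   # object: each element is a Python str
print(x.values.astype(str).dtype)   # a fixed-width NumPy unicode dtype, e.g. '<U21'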

Does Indexing makes Slice of pandas dataframe faster?

I have a pandas dataframe holding more than a million records. One of its columns is a datetime. A sample of my data looks like the following:
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...
I need to efficiently get the records within a specific period. The following naive way is very time consuming.
new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]
I know that in a DBMS like MySQL, indexing on the time field is effective for retrieving records within a specified time period.
My questions are:
Does indexing in pandas, such as df.index = df.time, make the slicing process faster?
If the answer to Q1 is 'No', what is the common, effective way to get records within a specific time period in pandas?
Let's create a dataframe with 1 million rows and time the performance. The index is a pandas DatetimeIndex.
df = pd.DataFrame(np.random.randn(1000000, 3),
columns=list('ABC'),
index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))
Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):
%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)
%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)
%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)
%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)
Here are the timings with the Datetime index as a column:
df.reset_index(inplace=True)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)
All of the above indexing techniques produce the same dataframe:
>>> df2.shape
(250560, 3)
It appears that either of the first two methods is best in this situation, and the fourth method works just as well with the latest version of pandas.
I've never dealt with a data set that large, but maybe you can try recasting the time column as a datetime index and then slicing directly. Something like this.
timedata.txt (extended from your example):
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338
df = pd.read_csv('timedata.txt')
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])
                       x    y    z
time
2015-05-01 10:00:03  112  223  334
2015-05-01 10:00:05  112  223  335
2015-05-01 10:00:08  112  223  336
2015-05-01 10:00:13  112  223  337
Note that in the example the times used for slicing are not in the column, so this will work for the case where you only know the time interval.
If your data has a fixed time period you can create a datetime index which may provide more options. I didn't want to assume your time period was fixed so constructed this for a more general case.
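One practical point to add (my own note, so treat it as a hedged suggestion): label slicing on a DatetimeIndex is fastest when the index is sorted, since pandas can then use a binary search instead of scanning a boolean mask. A minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.randn(1_000_000)},
                  index=pd.date_range("2015-01-01", periods=1_000_000, freq="10s"))

df = df.sort_index()                          # make sure the index is monotonic
subset = df.loc["2015-02-01":"2015-03-01"]    # partial-string slicing on the sorted index
print(subset.shape)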

Faster alternatives to numpy.argmax/argmin which is slow

I am using a lot of argmin and argmax in Python.
Unfortunately, these functions are very slow.
I have done some searching around, and the best I can find is here:
http://lemire.me/blog/archives/2008/12/17/fast-argmax-in-python/
def fastest_argmax(array):
    array = list(array)
    return array.index(max(array))
Unfortunately, this solution is still only half as fast as np.max, and I think I should be able to find something as fast as np.max.
x = np.random.randn(10)
%timeit np.argmax( x )
10000 loops, best of 3: 21.8 us per loop
%timeit fastest_argmax( x )
10000 loops, best of 3: 20.8 us per loop
As a note, I am applying this to a Pandas DataFrame Groupby
E.G.
%timeit grp2[ 'ODDS' ].agg( [ fastest_argmax ] )
100 loops, best of 3: 8.8 ms per loop
%timeit grp2[ 'ODDS' ].agg( [ np.argmax ] )
100 loops, best of 3: 11.6 ms per loop
Where grp2[ 'ODDS' ].head() looks like this:
EVENT_ID SELECTION_ID
104601100 4367029 682508 3.05
682509 3.15
682510 3.25
682511 3.35
5319660 682512 2.04
682513 2.08
682514 2.10
682515 2.12
682516 2.14
5510310 682520 4.10
682521 4.40
682522 4.50
682523 4.80
682524 5.30
5559264 682526 5.00
682527 5.30
682528 5.40
682529 5.50
682530 5.60
5585869 682533 1.96
682534 1.97
682535 1.98
682536 2.02
682537 2.04
6064546 682540 3.00
682541 2.74
682542 2.76
682543 2.96
682544 3.05
104601200 4916112 682548 2.64
682549 2.68
682550 2.70
682551 2.72
682552 2.74
5315859 682557 2.90
682558 2.92
682559 3.05
682560 3.10
682561 3.15
5356995 682564 2.42
682565 2.44
682566 2.48
682567 2.50
682568 2.52
5465225 682573 1.85
682574 1.89
682575 1.91
682576 1.93
682577 1.94
5773661 682588 5.00
682589 4.40
682590 4.90
682591 5.10
6013187 682592 5.00
682593 4.20
682594 4.30
682595 4.40
682596 4.60
104606300 2489827 683438 4.00
683439 3.90
683440 3.95
683441 4.30
683442 4.40
3602724 683446 2.16
683447 2.32
Name: ODDS, Length: 65, dtype: float64
It turns out that np.argmax is blazingly fast, but only with the native numpy arrays. With foreign data, almost all the time is spent on conversion:
In [194]: print platform.architecture()
('64bit', 'WindowsPE')
In [5]: x = np.random.rand(10000)
In [57]: l=list(x)
In [123]: timeit numpy.argmax(x)
100000 loops, best of 3: 6.55 us per loop
In [122]: timeit numpy.argmax(l)
1000 loops, best of 3: 729 us per loop
In [134]: timeit numpy.array(l)
1000 loops, best of 3: 716 us per loop
I called your function "inefficient" because it first converts everything to a list, then iterates through it twice (effectively, three iterations plus the list construction).
I was going to suggest something like this that only iterates once:
def imax(seq):
    it = iter(seq)
    im = 0
    try:
        m = next(it)
    except StopIteration:
        raise ValueError("the sequence is empty")
    for i, e in enumerate(it, start=1):
        if e > m:
            m = e
            im = i
    return im
But your version turns out to be faster because it iterates multiple times but does so in C rather than Python code. C is just that much faster, even considering that a great deal of time is spent on the conversion, too:
In [158]: timeit imax(x)
1000 loops, best of 3: 883 us per loop
In [159]: timeit fastest_argmax(x)
1000 loops, best of 3: 575 us per loop
In [174]: timeit list(x)
1000 loops, best of 3: 316 us per loop
In [175]: timeit max(l)
1000 loops, best of 3: 256 us per loop
In [181]: timeit l.index(0.99991619010758348) #the greatest number in my case, at index 92
100000 loops, best of 3: 2.69 us per loop
So, the key to speeding this up further is knowing what format the data in your sequence natively has (e.g. whether you can omit the conversion step, or use/write other functionality native to that format).
Btw, you're likely to get some speedup by using aggregate(max_fn) instead of agg([max_fn]).
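Building on the point about conversion cost, a sketch of my own: if the data arrives as a list (or another non-NumPy container) and you need several argmax/argmin calls, convert it to an ndarray once and reuse that array:
import numpy as np

l = list(np.random.rand(10000))

x = np.asarray(l)       # pay the list -> ndarray conversion cost once
print(np.argmax(x))     # subsequent argmax/argmin calls now run at native NumPy speed
print(np.argmin(x))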
For those who came for a short numpy-free snippet that returns the index of the first minimum value:
def argmin(a):
    return min(range(len(a)), key=lambda x: a[x])

a = [6, 5, 4, 1, 1, 3, 2]
argmin(a)  # returns 3
Can you post some code? Here is the result on my pc:
x = np.random.rand(10000)
%timeit np.max(x)
%timeit np.argmax(x)
output:
100000 loops, best of 3: 7.43 µs per loop
100000 loops, best of 3: 11.5 µs per loop
