Faster alternatives to numpy.argmax/argmin, which is slow (Python)

I am using a lot of argmin and argmax in Python.
Unfortunately, the function is very slow.
I have done some searching around, and the best I can find is here:
http://lemire.me/blog/archives/2008/12/17/fast-argmax-in-python/
def fastest_argmax(array):
    array = list(array)
    return array.index(max(array))
Unfortunately, this solution is still only half as fast as np.max, and I think I should be able to find something as fast as np.max.
x = np.random.randn(10)
%timeit np.argmax( x )
10000 loops, best of 3: 21.8 us per loop
%timeit fastest_argmax( x )
10000 loops, best of 3: 20.8 us per loop
As a note, I am applying this to a pandas DataFrame groupby, e.g.:
%timeit grp2[ 'ODDS' ].agg( [ fastest_argmax ] )
100 loops, best of 3: 8.8 ms per loop
%timeit grp2[ 'ODDS' ].agg( [ np.argmax ] )
100 loops, best of 3: 11.6 ms per loop
Where grp2[ 'ODDS' ].head() looks like this:
EVENT_ID SELECTION_ID
104601100 4367029 682508 3.05
682509 3.15
682510 3.25
682511 3.35
5319660 682512 2.04
682513 2.08
682514 2.10
682515 2.12
682516 2.14
5510310 682520 4.10
682521 4.40
682522 4.50
682523 4.80
682524 5.30
5559264 682526 5.00
682527 5.30
682528 5.40
682529 5.50
682530 5.60
5585869 682533 1.96
682534 1.97
682535 1.98
682536 2.02
682537 2.04
6064546 682540 3.00
682541 2.74
682542 2.76
682543 2.96
682544 3.05
104601200 4916112 682548 2.64
682549 2.68
682550 2.70
682551 2.72
682552 2.74
5315859 682557 2.90
682558 2.92
682559 3.05
682560 3.10
682561 3.15
5356995 682564 2.42
682565 2.44
682566 2.48
682567 2.50
682568 2.52
5465225 682573 1.85
682574 1.89
682575 1.91
682576 1.93
682577 1.94
5773661 682588 5.00
682589 4.40
682590 4.90
682591 5.10
6013187 682592 5.00
682593 4.20
682594 4.30
682595 4.40
682596 4.60
104606300 2489827 683438 4.00
683439 3.90
683440 3.95
683441 4.30
683442 4.40
3602724 683446 2.16
683447 2.32
Name: ODDS, Length: 65, dtype: float64

It turns out that np.argmax is blazingly fast, but only with the native numpy arrays. With foreign data, almost all the time is spent on conversion:
In [194]: print platform.architecture()
('64bit', 'WindowsPE')
In [5]: x = np.random.rand(10000)
In [57]: l=list(x)
In [123]: timeit numpy.argmax(x)
100000 loops, best of 3: 6.55 us per loop
In [122]: timeit numpy.argmax(l)
1000 loops, best of 3: 729 us per loop
In [134]: timeit numpy.array(l)
1000 loops, best of 3: 716 us per loop
I called your function "inefficient" because it first converts everything to list, then iterates through it 2 times (effectively, 3 iterations + list construction).
I was going to suggest something like this that only iterates once:
def imax(seq):
    it = iter(seq)
    im = 0
    try:
        m = next(it)
    except StopIteration:
        raise ValueError("the sequence is empty")
    for i, e in enumerate(it, start=1):
        if e > m:
            m = e
            im = i
    return im
But your version turns out to be faster because it iterates many times but does so in C rather than Python code. C is just that much faster, even considering that a great deal of time is spent on conversion, too:
In [158]: timeit imax(x)
1000 loops, best of 3: 883 us per loop
In [159]: timeit fastest_argmax(x)
1000 loops, best of 3: 575 us per loop
In [174]: timeit list(x)
1000 loops, best of 3: 316 us per loop
In [175]: timeit max(l)
1000 loops, best of 3: 256 us per loop
In [181]: timeit l.index(0.99991619010758348) #the greatest number in my case, at index 92
100000 loops, best of 3: 2.69 us per loop
So, the key knowledge to speeding this up further is to know which format the data in your sequence natively is (e.g. whether you can omit the conversion step or use/write another functionality native to that format).
Btw, you're likely to get some speedup by using aggregate(max_fn) instead of agg([max_fn]).
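To make that last point concrete, here is a minimal sketch on synthetic data (the grp2 groupby from the question isn't reproducible here, so the group structure below is an assumption): passing the function directly to aggregate avoids the extra column-wrapping that agg([...]) does.
import numpy as np
import pandas as pd

def fastest_argmax(array):
    array = list(array)
    return array.index(max(array))

# hypothetical stand-in for grp2['ODDS']: 1000 groups of 5 float values each
s = pd.Series(np.random.randn(5000))
grp = s.groupby(np.repeat(np.arange(1000), 5))

res_list = grp.agg([fastest_argmax])       # DataFrame with a 'fastest_argmax' column
res_plain = grp.aggregate(fastest_argmax)  # plain Series, less wrapping overhead

assert (res_list['fastest_argmax'] == res_plain).all()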

For those that came for a short numpy-free snippet that returns the index of the first minimum value:
def argmin(a):
    return min(range(len(a)), key=lambda x: a[x])

a = [6, 5, 4, 1, 1, 3, 2]
argmin(a)  # returns 3
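The same one-liner pattern gives an argmax counterpart (my addition for symmetry, not part of the original answer):
def argmax(a):
    return max(range(len(a)), key=lambda x: a[x])

argmax([6, 5, 4, 1, 1, 3, 2])  # returns 0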

Can you post some code? Here is the result on my pc:
x = np.random.rand(10000)
%timeit np.max(x)
%timeit np.argmax(x)
output:
100000 loops, best of 3: 7.43 µs per loop
100000 loops, best of 3: 11.5 µs per loop

Related

Performance Comparisons between Series and Numpy

I am building a performance comparison table between NumPy and pandas Series, and two cases caught my eye. Any help would be really appreciated.
We are told to avoid loops with NumPy and Series, but I came across one scenario where a for loop performs better.
In the code below I calculate the density of planets with and without a for loop:
mass= pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146], index = ['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370], index = ['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
%%timeit -n 1000
density = mass / (np.pi * np.power(diameter, 3) /6)
1000 loops, best of 3: 617 µs per loop
%%timeit -n 1000
density = pd.Series()
for planet in mass.index:
    density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
1000 loops, best of 3: 183 µs per loop
Second, I am trying to replace NaN values in a Series using two approaches.
Why does the first approach work faster? My guess is that the second approach converts the Series to an ndarray.
sample2 = pd.Series([1, 2, 3, 4325, 23, 3, 4213, 102, 89, 4, np.nan, 6, 803, 43, np.nan, np.nan, np.nan])
x = np.mean(sample2)
x
%%timeit -n 10000
sample3 = pd.Series(np.where(np.isnan(sample2), x, sample2))
10000 loops, best of 3: 166 µs per loop
%%timeit -n 10000
sample2[np.isnan(sample2)] =x
10000 loops, best of 3: 1.08 ms per loop
In an ipython console session:
In [1]: import pandas as pd
In [2]: mass = pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146],
   ...:                  index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER',
   ...:                         'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
   ...: diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370],
   ...:                      index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER',
   ...:                             'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
   ...:
In [3]: mass
Out[3]:
MERCURY 0.3300
VENUS 4.8700
EARTH 5.9700
MOON 0.0730
MARS 0.6420
JUPITER 1898.0000
SATURN 568.0000
URANUS 86.8000
NEPTUNE 102.0000
PLUTO 0.0146
dtype: float64
In [4]: diameter
Out[4]:
MERCURY 4879
VENUS 12104
EARTH 12756
MOON 3475
MARS 6792
JUPITER 142984
SATURN 120536
URANUS 51118
NEPTUNE 49528
PLUTO 2370
dtype: int64
Your density calculation creates a Series with the same index. Here the indexes of the two Series match, but I think in general pandas is able to align the indices.
In [5]: density = mass / (np.pi * np.power(diameter, 3) /6)
In [6]: density
Out[6]:
MERCURY 5.426538e-12
VENUS 5.244977e-12
EARTH 5.493286e-12
MOON 3.322460e-12
MARS 3.913302e-12
JUPITER 1.240039e-12
SATURN 6.194402e-13
URANUS 1.241079e-12
NEPTUNE 1.603427e-12
PLUTO 2.094639e-12
dtype: float64
In [7]: timeit density = mass / (np.pi * np.power(diameter, 3) /6)
532 µs ± 437 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since indices match, we can get the same numbers by using the numpy values arrays:
In [8]: mass.values/(np.pi * np.power(diameter.values, 3)/6)
Out[8]:
array([5.42653818e-12, 5.24497707e-12, 5.49328558e-12, 3.32246038e-12,
3.91330208e-12, 1.24003876e-12, 6.19440202e-13, 1.24107933e-12,
1.60342694e-12, 2.09463905e-12])
In [9]: timeit mass.values/(np.pi * np.power(diameter.values, 3)/6)
11.5 µs ± 67.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much faster. numpy isn't taking time to match indices. Also it isn't making a new Series.
Your iteration approach:
In [11]: %%timeit
    ...: density = pd.Series(dtype=float)
    ...: for planet in mass.index:
    ...:     density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
    ...:
7.36 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is much slower.
Out of curiosity, let's initialize a Series and fill it with the numpy calculation:
In [18]: %%timeit
...: density = pd.Series(index=mass.index, dtype=float)
...: density[:] = mass.values/(np.pi * np.power(diameter.values, 3)/6)
241 µs ± 8.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's about 2x better than [7], but still quite a bit slower than pure numpy. Pandas does use numpy arrays underneath (here I think both the index and the values are arrays), but it adds significant overhead relative to pure numpy code.
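If you want the numpy speed but still need a Series at the end, a variant worth trying (my sketch, not timed in the original answer) is to compute with the raw arrays and wrap the result exactly once:
values = mass.values / (np.pi * np.power(diameter.values, 3) / 6)
density = pd.Series(values, index=mass.index)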

Resample Pandas Dataframe Without Filling in Missing Times

Resampling a dataframe can take the dataframe to either a higher or lower temporal resolution. Most of the time this is used to go to lower resolution (e.g. resample 1-minute data to monthly values). When the dataset is sparse (for example, no data were collected in Feb-2020), then the Feb-2020 row in the resampled dataframe will be filled with NaNs. The problem is that when the data record is long AND sparse there are a lot of NaN rows, which makes the dataframe unnecessarily large and takes a lot of CPU time. For example, consider this dataframe and resample operation:
import numpy as np
import pandas as pd
freq1 = pd.date_range("20000101", periods=10, freq="S")
freq2 = pd.date_range("20200101", periods=10, freq="S")
index = np.hstack([freq1.values, freq2.values])
data = np.random.randint(0, 100, (20, 10))
cols = list("ABCDEFGHIJ")
df = pd.DataFrame(index=index, data=data, columns=cols)
# now resample to daily average
df = df.resample(rule="1D").mean()
Most of the data in this dataframe is useless and can be removed via:
df.dropna(how="all", axis=0, inplace=True)
however, this is sloppy. Is there another method to resample the dataframe that does not fill all of the data gaps with NaN (i.e. in the example above, the resultant dataframe would have only two rows)?
Updating my original answer with (what I think) is an improvement, plus updated times.
Use groupby
There are a couple of ways you can use groupby instead of resample. In the case of daily ("1D") resampling, you can just use the date property of the DatetimeIndex:
df = df.groupby(df.index.date).mean()
This is in fact faster than the resample for your data:
%%timeit
df.resample(rule='1D').mean().dropna()
# 2.08 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.groupby(df.index.date).mean()
# 666 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
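As a quick sanity check (my addition), you can confirm the two give the same numbers; the only expected difference is the index type, since grouping on .date yields datetime.date labels:
a = df.resample(rule='1D').mean().dropna()
b = df.groupby(df.index.date).mean()
b.index = pd.to_datetime(b.index)  # align index types before comparing
print(a.equals(b))                 # expected: True for this data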
The more general approach would be to use the floor of the timestamps to do the groupby operation:
rule = '1D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2020-01-01 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
This will work with more irregular frequencies as well. The main snag here is that by default, it seems like the floor is calculated in reference to some initial date, which can cause weird results (see my post):
rule = '7D'
f = df.index.floor(rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 1999-12-30 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-26 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
The major issue is that the resampling doesn't start on the earliest timestamp within your data. However, it is fixable using this solution to the above post:
# custom function for flooring relative to a start date
def floor(x, freq):
    offset = x[0].ceil(freq) - x[0]
    return (x + offset).floor(freq) - offset
rule = '7D'
f = floor(df.index, rule)
df.groupby(f).mean()
# A B C D E F G H I J
# 2000-01-01 50.5 33.5 62.7 42.4 46.7 49.2 64.0 53.3 71.0 38.0
# 2019-12-28 50.4 56.3 57.4 46.2 55.0 60.2 60.3 57.8 63.5 47.3
# the cycle of 7 days is now starting from 01-01-2000
Just note here that the custom floor() function is relatively slow compared to pandas.Series.dt.floor(), so it is best to use the latter if you can; but both are better than the original resample (in your example):
%%timeit
df.groupby(df.index.floor('1D')).mean()
# 1.06 ms ± 6.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.groupby(floor(df.index, '1D')).mean()
# 1.42 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
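Putting the pieces together, here is a small helper (my own wrapper with a hypothetical name, built from the anchored-floor idea above) that resamples a sparse frame to a mean without ever creating the all-NaN rows:
def sparse_resample_mean(df, rule):
    # anchor the bins at the first timestamp, as in the custom floor() above
    offset = df.index[0].ceil(rule) - df.index[0]
    labels = (df.index + offset).floor(rule) - offset
    # grouping only creates bins that actually contain data
    return df.groupby(labels).mean()

daily = sparse_resample_mean(df, '1D')   # two rows, no NaN padding
weekly = sparse_resample_mean(df, '7D')  # 7-day bins anchored at 2000-01-01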

Python: (Pandas) How to ignore the lowest and highest 25% of values grouped by id for mean calculation

I'm trying to get the mean of each column while grouped by id, BUT only the middle 50% of values, between the 25% quantile and the 75% quantile, should be used for the calculation (so ignore the lowest 25% of values and the highest 25%).
The data:
ID Property3 Property2 Property1
1 10.2 ... ...
1 20.1
1 51.9
1 15.8
1 12.5
...
1203 104.4
1203 11.5
1203 19.4
1203 23.1
What I tried:
data.groupby('id').quantile(0.75).mean();
# data.groupby('id').agg(lambda grp: grp.quantile(0.25, 0.75)).mean()  # something like that?
CW 67.089733
fd 0.265917
fd_maxna -1929.522001
fd_maxv -1542.468399
fd_sumna -1928.239954
fd_sumv -1488.165382
planc -13.165445
slope 13.654163
Something like that, but GroupBy.quantile doesn't support an in-between range to my knowledge, and I don't know how to remove the lower 25% either.
And this also doesn't return a dataframe.
What I want
Ideally, I would like to have a dataframe as follows:
ID Property3 Property2 Property1
1 37.8 5.6 2.3
2 33.0 1.5 10.4
3 34.9 91.5 10.3
4 33.0 10.3 14.3
Where only the data between the 25% quantile and the 75% quantile are used for the mean calculation, so only the middle 50%.
Using GroupBy.apply here can be slow, so let's avoid it. I suppose this is your data frame:
print(df)
ID Property3 Property2 Property1
0 1 10.2 58.337589 45.083237
1 1 20.1 70.844807 29.423138
2 1 51.9 67.126043 90.558225
3 1 15.8 17.478715 41.492485
4 1 12.5 18.247211 26.449900
5 1203 104.4 113.728439 130.698964
6 1203 11.5 29.659894 45.991533
7 1203 19.4 78.910591 40.049054
8 1203 23.1 78.395974 67.345487
So I would use GroupBy.cumcount + DataFrame.pivot_table
to calculate quantiles without using apply:
df['aux']=df.groupby('ID').cumcount()
new_df=df.pivot_table(columns='ID',index='aux',values=['Property1','Property2','Property3'])
print(new_df)
Property1 Property2 Property3
ID 1 1203 1 1203 1 1203
aux
0 45.083237 130.698964 58.337589 113.728439 10.2 104.4
1 29.423138 45.991533 70.844807 29.659894 20.1 11.5
2 90.558225 40.049054 67.126043 78.910591 51.9 19.4
3 41.492485 67.345487 17.478715 78.395974 15.8 23.1
4 26.449900 NaN 18.247211 NaN 12.5 NaN
#remove aux column
df=df.drop('aux',axis=1)
Now we calculate the mean using boolean indexing:
new_df[(new_df.quantile(0.75)>new_df)&( new_df>new_df.quantile(0.25) )].mean()
ID
Property1 1 59.963006
1203 70.661294
Property2 1 49.863814
1203 45.703292
Property3 1 15.800000
1203 21.250000
dtype: float64
or create DataFrame with the mean:
mean_df = (new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))].mean()
           .rename_axis(index=['Property', 'ID'])
           .unstack('Property'))
print(mean_df)
Property Property1 Property2 Property3
ID
1 41.492485 58.337589 15.80
1203 56.668510 78.653283 21.25
Measure times:
%%timeit
df['aux']=df.groupby('ID').cumcount()
new_df=df.pivot_table(columns='ID',index='aux',values=['Property1','Property2','Property3'])
df=df.drop('aux',axis=1)
(new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))].mean()
 .rename_axis(index=['Property', 'ID'])
 .unstack('Property'))
25.2 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()

df.groupby("ID").apply(lambda x: x.apply(mean_of_25_to_75_pct))
33 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()

means = df.groupby("ID").apply(filter_mean)
23 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is almost as fast even with this small data frame; on larger data frames, such as your original one, it should be much faster than the other proposed methods (see: when to use apply).
You can use the quantile function to return multiple quantiles. Then, you can filter out values based on this, and compute the mean:
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()
means = data.groupby("id").apply(filter_mean)
Please try this.
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()
data.groupby("id").apply(lambda x: x.apply(mean_of_25_to_75_pct))
You could use scipy's ready-made function for the trimmed mean, trim_mean():
from scipy import stats
means = data.groupby("id").apply(stats.trim_mean, 0.25)
If you insist on getting a dataframe, you could:
data.groupby("id").agg(lambda x: stats.trim_mean(x, 0.25)).reset_index()

Does Indexing makes Slice of pandas dataframe faster?

I have a pandas dataframe holding more than million records. One of its columns is datetime. The sample of my data is like the following:
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...
I need to efficiently get the records within a specific period. The following naive way is very time consuming.
new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]
I know that on DBMS like MySQL the indexing by the time field is effective for getting records by specifying the time period.
My question is
Does indexing in pandas, such as df.index = df.time, make the slicing process faster?
If the answer to Q1 is 'No', what is the common, effective way to get records for a specific time period in pandas?
Let's create a dataframe with 1 million rows and time the performance. The index is a pandas DatetimeIndex.
df = pd.DataFrame(np.random.randn(1000000, 3),
                  columns=list('ABC'),
                  index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))
Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):
%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)
%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)
%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)
%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)
Here are the timings with the Datetime index as a column:
df.reset_index(inplace=True)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)
All of the above indexing techniques produce the same dataframe:
>>> df2.shape
(250560, 3)
It appears that either of the first two methods is best in this situation, and the fourth method works just as well with the latest version of Pandas.
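One caveat worth adding (my note, not from the original answer): the fast label-based slices above rely on the DatetimeIndex being sorted. If your rows can arrive out of order, sort the index once up front:
df = df.sort_index()             # one-time cost
df2 = df['2015-2-1':'2015-3-1']  # label slicing stays fast afterwards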
I've never dealt with a data set that large, but maybe you can try recasting the time column as a datetime index and then slicing directly. Something like this.
timedata.txt (extended from your example):
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338
df = pd.read_csv('timedata.txt')
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])
x y z
time
2015-05-01 10:00:03 112 223 334
2015-05-01 10:00:05 112 223 335
2015-05-01 10:00:08 112 223 336
2015-05-01 10:00:13 112 223 337
Note that in the example the times used for slicing are not in the column, so this will work for the case where you only know the time interval.
If your data has a fixed time period you can create a datetime index which may provide more options. I didn't want to assume your time period was fixed so constructed this for a more general case.
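Equivalently (a small variation on the snippet above, with the same result expected), read_csv can parse the timestamps and set the index in one step:
df = pd.read_csv('timedata.txt', parse_dates=['time'], index_col='time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])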

Why is numpy ma.average 24 times slower than arr.mean?

I've found something interesting in Python's numpy. ma.average is a lot slower than arr.mean (arr is an array)
>>> arr = np.full((3, 3), -9999, dtype=float)
>>> arr
array([[-9999., -9999., -9999.],
       [-9999., -9999., -9999.],
       [-9999., -9999., -9999.]])
%timeit np.ma.average(arr, axis=0)
The slowest run took 49.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop
%timeit arr.mean(axis=0)
The slowest run took 6.63 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.41 µs per loop
with random numbers
arr = np.random.random((3,3))
%timeit arr.mean(axis=0)
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.78 µs per loop
%timeit np.ma.average(arr, axis=0)
1000 loops, best of 3: 186 µs per loop
--> That's nearly 24 times slower.
Documentation
numpy.ma.average(a, axis=None, weights=None, returned=False)
Return the weighted average of array over the given axis.
numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
Compute the arithmetic mean along the specified axis.
Why is ma.average so much slower than arr.mean? Mathematically they are the same (correct me if I'm wrong). My guess is that it has something to do with the weights option on ma.average, but shouldn't there be a fallback if no weights are passed?
A good way to find out why something is slower is to profile it. I'll use the 3rd party library line_profiler and the IPython command %lprun (see for example this blog) here:
%load_ext line_profiler
import numpy as np
arr = np.full((3, 3), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 1810 1810.0 30.5 a = asarray(a)
571 1 15 15.0 0.3 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 5 5.0 0.1 if weights is None:
576 1 3500 3500.0 59.0 avg = a.mean(axis)
577 1 591 591.0 10.0 scl = avg.dtype.type(a.count(axis))
578 else:
...
608
609 1 7 7.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 5 5.0 0.1 return avg
I removed some irrelevant lines.
So actually 30% of the time is spent in np.ma.asarray (something that arr.mean doesn't have to do!).
However the relative times change drastically if you use a bigger array:
arr = np.full((1000, 1000), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 609 609.0 7.6 a = asarray(a)
571 1 14 14.0 0.2 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 7 7.0 0.1 if weights is None:
576 1 6924 6924.0 86.9 avg = a.mean(axis)
577 1 404 404.0 5.1 scl = avg.dtype.type(a.count(axis))
578 else:
...
609 1 6 6.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 6 6.0 0.1 return avg
This time the np.ma.MaskedArray.mean function almost takes up 90% of the time.
Note: You could also dig deeper and look into np.ma.asarray or np.ma.MaskedArray.count or np.ma.MaskedArray.mean and check their line profilings. But I just wanted to show that there are lots of called functions that add to the overhead.
So the next question is: did the relative times between np.ndarray.mean and np.ma.average also change? And at least on my computer the difference is much lower now:
%timeit np.ma.average(arr, axis=0)
# 2.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr.mean(axis=0)
# 1.84 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time it's not even 2 times slower. I assume for even bigger arrays the difference will get even smaller.
This is also something that is actually quite common with NumPy:
The constant factors are quite high even for plain numpy functions (see for example my answer to the question "Performance in different vectorization method in numpy"). For np.ma these constant factors are even bigger, especially if you don't use a np.ma.MaskedArray as input. But even though the constant factors might be high, these functions excel with big arrays.
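A quick way to see that constant-factor effect yourself (a sketch I would run to reproduce the claim; exact numbers will differ per machine):
import timeit
import numpy as np

for n in (3, 100, 1000):
    arr = np.random.random((n, n))
    t_mean = timeit.timeit(lambda: arr.mean(axis=0), number=200)
    t_ma = timeit.timeit(lambda: np.ma.average(arr, axis=0), number=200)
    # the ratio should shrink as n grows, because the fixed overhead is amortized
    print(n, round(t_ma / t_mean, 1))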
Thanks to @WillemVanOnsem and @sascha in the comments above.
Edit: this applies to small arrays; see the accepted answer for more information.
Masked operations are slow; try to avoid them:
mask = self.local_pos_history[:, 0] > -9
local_pos_hist_masked = self.local_pos_history[mask]
avg = local_pos_hist_masked.mean(axis=0)
The old version, with masked arrays:
mask = np.ma.masked_where(self.local_pos_history > -9, self.local_pos_history)
local_pos_hist_mask = self.local_pos_history[mask].reshape(len(self.local_pos_history) // 3, 3)
avg_pos = self.local_pos_history
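A self-contained version of that comparison on synthetic data (my reconstruction, since local_pos_history isn't shown; -9 is treated as the sentinel threshold): boolean indexing plus a plain ndarray mean avoids the masked-array machinery entirely.
import numpy as np

pos_history = np.random.uniform(-10, 10, size=(1000, 3))  # hypothetical (x, y, z) log

# fast: drop sentinel rows with a boolean mask, then take a plain ndarray mean
valid = pos_history[:, 0] > -9
avg_fast = pos_history[valid].mean(axis=0)

# slower: the same computation through a masked array
row_is_bad = np.repeat((pos_history[:, 0] <= -9)[:, None], 3, axis=1)
avg_masked = np.ma.masked_where(row_is_bad, pos_history).mean(axis=0)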
np.average (without a mask) is nearly as fast as arr.mean:
%timeit np.average(arr, axis=0)
The slowest run took 5.81 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.89 µs per loop
%timeit np.mean(arr, axis=0)
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.74 µs per loop
Just for clarification: this is still a test on a small array.
