Does Indexing makes Slice of pandas dataframe faster? - python

I have a pandas dataframe holding more than million records. One of its columns is datetime. The sample of my data is like the following:
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
...
I need to effectively get the record during the specific period. The following naive way is very time consuming.
new_df = df[(df["time"] > start_time) & (df["time"] < end_time)]
I know that on DBMS like MySQL the indexing by the time field is effective for getting records by specifying the time period.
My question is
Does the indexing of pandas such as df.index = df.time makes the slicing process faster?
If the answer of Q1 is 'No', what is the common effective way to get a record during the specific time period in pandas?

Let's create a dataframe with 1 million rows and time performance. The index is a Pandas Timestamp.
df = pd.DataFrame(np.random.randn(1000000, 3),
columns=list('ABC'),
index=pd.DatetimeIndex(start='2015-1-1', freq='10s', periods=1000000))
Here are the results sorted from fastest to slowest (tested on the same machine with both v. 0.14.1 (don't ask...) and the most recent version 0.17.1):
%timeit df2 = df['2015-2-1':'2015-3-1']
1000 loops, best of 3: 459 µs per loop (v. 0.14.1)
1000 loops, best of 3: 664 µs per loop (v. 0.17.1)
%timeit df2 = df.ix['2015-2-1':'2015-3-1']
1000 loops, best of 3: 469 µs per loop (v. 0.14.1)
1000 loops, best of 3: 662 µs per loop (v. 0.17.1)
%timeit df2 = df.loc[(df.index >= '2015-2-1') & (df.index <= '2015-3-1'), :]
100 loops, best of 3: 8.86 ms per loop (v. 0.14.1)
100 loops, best of 3: 9.28 ms per loop (v. 0.17.1)
%timeit df2 = df.loc['2015-2-1':'2015-3-1', :]
1 loops, best of 3: 341 ms per loop (v. 0.14.1)
1000 loops, best of 3: 677 µs per loop (v. 0.17.1)
Here are the timings with the Datetime index as a column:
df.reset_index(inplace=True)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1')]
100 loops, best of 3: 12.6 ms per loop (v. 0.14.1)
100 loops, best of 3: 13 ms per loop (v. 0.17.1)
%timeit df2 = df.loc[(df['index'] >= '2015-2-1') & (df['index'] <= '2015-3-1'), :]
100 loops, best of 3: 12.8 ms per loop (v. 0.14.1)
100 loops, best of 3: 12.7 ms per loop (v. 0.17.1)
All of the above indexing techniques produce the same dataframe:
>>> df2.shape
(250560, 3)
It appears that either of the first two methods are the best in this situation, and the fourth method also works just as fine using the latest version of Pandas.

I've never dealt with a data set that large, but maybe you can try recasting the time column as a datetime index and then slicing directly. Something like this.
timedata.txt (extended from your example):
time,x,y,z
2015-05-01 10:00:00,111,222,333
2015-05-01 10:00:03,112,223,334
2015-05-01 10:00:05,112,223,335
2015-05-01 10:00:08,112,223,336
2015-05-01 10:00:13,112,223,337
2015-05-01 10:00:21,112,223,338
df = pd.read_csv('timedata.txt')
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
print(df['2015-05-01 10:00:02':'2015-05-01 10:00:14'])
x y z
time
2015-05-01 10:00:03 112 223 334
2015-05-01 10:00:05 112 223 335
2015-05-01 10:00:08 112 223 336
2015-05-01 10:00:13 112 223 337
Note that in the example the times used for slicing are not in the column, so this will work for the case where you only know the time interval.
If your data has a fixed time period you can create a datetime index which may provide more options. I didn't want to assume your time period was fixed so constructed this for a more general case.

Related

Pandas read_hdf very slow for non-numeric data

When reading a large hdf file with pandas.read_hdf() I get extremely slow read time. My hdf has 50 million rows, 3 columns with integers and 2 with strings. Writing this using to_hdf() with table format and indexing took almost 10 minutes. While this is also slow, I am not too concerned as read speed is more important.
I have tried saving as fixed/table format, with/without compression, however the read time ranges between 2-5 minutes. By comparison, read_csv() on the same data takes 4 minutes.
I have also tried to read the hdf using pytables directly. This is much faster at 6 seconds and this would be the speed I would like to see.
h5file = tables.open_file("data.h5", "r")
table = h5file.root.data.table.read()
I noticed all the speed comparisons in the documentation use only numeric data and running these myself achieved similar performance.
I would like to ask whether there is something I can do to optimise read performance?
Edit
Here is a sample of the data
col_A col_B col_C col_D col_E
30649671 1159660800 10217383 0 10596000 LACKEY
26198715 1249084800 0921720 0 0 KEY CLIFTON
19251910 752112000 0827092 104 243000 WEMPLE
47636877 1464739200 06247715 0 0 FLOYD
14121495 1233446400 05133815 0 988000 OGU ALLYN CH 9
41171050 1314835200 7C140009 0 39000 DEBERRY A
45865543 1459468800 0314892 76 254000 SABRINA
13387355 970358400 04140585 19 6956000 LA PERLA
4186815 849398400 02039719 0 19208000 NPU UNIONSPIELHAGAN1
32666568 733622400 10072006 0 1074000 BROWN
And info on the dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52046850 entries, 0 to 52046849
Data columns (total 5 columns):
col_A int64
col_B object
col_C int64
col_D int64
col_E object
dtypes: int64(3), object(2)
memory usage: 1.9+ GB
Here is a small demo:
Generating sample DF (1M rows):
N = 10**6
df = pd.DataFrame({
'n1': np.random.randint(10**6, size=N),
'n2': np.random.randint(10**6, size=N),
'n3': np.random.randint(10**6, size=N),
's1': pd.util.testing.rands_array(10, size=N),
's2': pd.util.testing.rands_array(40, size=N),
})
let's write it to disk in CSV, HDF5 (fixed, table and table + data_columns=True) and in Feather formats
df.to_csv(r'c:/tmp/test.csv', index=False)
df.to_hdf(r'c:/tmp/test_fix.h5', 'a')
df.to_hdf(r'c:/tmp/test_tab.h5', 'a', format='t')
df.to_hdf(r'c:/tmp/test_tab_idx.h5', 'a', format='t', data_columns=True)
import feather
feather.write_dataframe(df, 'c:/tmp/test.feather')
Reading:
In [2]: %timeit pd.read_csv(r'c:/tmp/test.csv')
1 loop, best of 3: 4.48 s per loop
In [3]: %timeit pd.read_hdf(r'c:/tmp/test_fix.h5','a')
1 loop, best of 3: 1.24 s per loop
In [4]: %timeit pd.read_hdf(r'c:/tmp/test_tab.h5','a')
1 loop, best of 3: 5.65 s per loop
In [5]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a')
1 loop, best of 3: 5.6 s per loop
In [6]: %timeit feather.read_dataframe(r'c:/tmp/test.feather')
1 loop, best of 3: 589 ms per loop
conditional reading - let's select only those rows where n2 <= 100000
In [7]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a', where="n2 <= 100000")
1 loop, best of 3: 1.18 s per loop
the less data we need to select (after filtering) - the faster it is:
In [8]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a', where="n2 <= 100000 and n1 > 500000")
1 loop, best of 3: 763 ms per loop
In [10]: %timeit pd.read_hdf(r'c:/tmp/test_tab_idx.h5','a', where="n2 <= 100000 and n1 > 500000 and n3 < 50000")
1 loop, best of 3: 379 ms per loop
UPDATE: for Pandas versions 0.20.0+ there we can write and read directly to/from feather format (thanks #jezrael for the hint):
In [3]: df.to_feather(r'c:/tmp/test2.feather')
In [4]: %timeit pd.read_feather(r'c:/tmp/test2.feather')
1 loop, best of 3: 583 ms per loop
Example of generated DF:
In [13]: df
Out[13]:
n1 n2 n3 s1 s2
0 719458 808047 792611 Fjv4CoRv2b 2aWQTkutPlKkO38fRQh2tdh1BrnEFavmIsDZK17V
1 526092 950709 804869 dfG12EpzVI YVZzhMi9sfazZEW9e2TV7QIvldYj2RPHw0TXxS2z
2 109107 801344 266732 aoyBuHTL9I ui0PKJO8cQJwcvmMThb08agWL1UyRumYgB7jjmcw
3 873626 814409 895382 qQQms5pTGq zvf4HTaKCISrdPK98ROtqPqpsG4WhSdEgbKNHy05
4 212776 596713 924623 3YXa4PViAn 7Y94ykHIHIEnjKvGphYfAWSINRZtJ99fCPiMrfzl
5 375323 401029 973262 j6QQwYzfsK PNYOM2GpHdhrz9NCCifRsn8gIZkLHecjlk82o44Y
6 232655 937230 40883 NsI5Y78aLT qiKvXcAdPVbhWbXnyD3uqIwzS7ZsCgssm9kHAETb
7 69010 438280 564194 N73tQaZjey ttj1IHtjPyssyADMYiNScflBjN4SFv5bk3tbz93o
8 988081 8992 968871 eb9lc7D22T sb3dt1Ndc8CUHyvsFJgWRrQg4ula7KJ76KrSSqGH
9 127155 66042 881861 tHSBB3RsNH ZpZt5sxAU3zfiPniSzuJYrwtrytDvqJ1WflJ4vh3
... ... ... ... ... ...
999990 805220 21746 355944 IMCMWuf97L bj7tSrgudA5wLvWkWVQyNVamSGmFGOeQlIUoKXK3
999991 232596 293850 741881 JD0SVS5uob kWeP8DEw19rwxVN3XBBcskibMRGxfoToNO9RDeCT
999992 532752 733958 222003 9X4PopnltN dKhsdKFK1EfAATBFsB5hjKZzQWERxzxGEQZWAvSe
999993 308623 717897 703895 Fg0nuq63hA kHzRecZoaG5tAnLbtlq1hqtfd2l5oEMFbJp4NjhC
999994 841670 528518 70745 vKQDiAzZNf M5wdoUNfkdKX2VKQEArvBLYl5lnTNShjDLwnb8VE
999995 986988 599807 901853 r8iHjo39NH 72CfzCycAGoYMocbw3EbUbrV4LRowFjSDoDeYfT5
999996 384064 429184 203230 EJy0mTAmdQ 1jfUQCj2SLIktVqIRHfYQW2QYfpvhcWCbRLO5wqL
999997 967270 565677 146418 KWp2nH1MbM hzhn880cuEpjFhd5bd7vpgsjjRNgaViANW9FHwrf
999998 130864 863893 5614 L28QGa22f1 zfg8mBidk8NTa3LKO4rg31Z6K4ljK50q5tHHq8Fh
999999 528532 276698 553870 0XRJwqBAWX 0EzNcDkGUFklcbKELtcr36zPCMu9lSaIDcmm0kUX
[1000000 rows x 5 columns]

Remove rows where a specific column has a blank entry

I have a dataframe Df which looks like:
date XNGS BBG FX
16/11/2007 19.41464766 0.6819 19.41464766
19/11/2007 19.34059332 0.6819 19.34059332
20/11/2007 19.49080536 0.6739 19.49080536
21/11/2007 19.2399259 0.673 19.2399259
22/11/2007 0.6734
23/11/2007 19.2009794 0.674 19.2009794
I would like to remove any rows where XNGS is empty. In this example I would like to remove the row with the date index 22/11/2007. So the resulting Df would look like:
date XNGS BBG FX
16/11/2007 19.41464766 0.6819 19.41464766
19/11/2007 19.34059332 0.6819 19.34059332
20/11/2007 19.49080536 0.6739 19.49080536
21/11/2007 19.2399259 0.673 19.2399259
23/11/2007 19.2009794 0.674 19.2009794
The dataframe changes a lot so the fix needs to be dynamic. I have tried:
Df = Df[Df.XNGS != ""]
and
Df.dropna(subset=["XNGS"])
but they don't work. What can I try next?
Safe Option
canonical dropna after replace
df.replace({'XNGS': {'': np.nan}}).dropna(subset=['XNGS'])
date XNGS BBG FX
0 16/11/2007 19.414648 0.6819 19.414648
1 19/11/2007 19.340593 0.6819 19.340593
2 20/11/2007 19.490805 0.6739 19.490805
3 21/11/2007 19.239926 0.6730 19.239926
5 23/11/2007 19.200979 0.6740 19.200979
Less Safe, but Cool
Empty strings evaluate to False
df[df.XNGS.values.astype(bool)]
date XNGS BBG FX
0 16/11/2007 19.414648 0.6819 19.414648
1 19/11/2007 19.340593 0.6819 19.340593
2 20/11/2007 19.490805 0.6739 19.490805
3 21/11/2007 19.239926 0.6730 19.239926
5 23/11/2007 19.200979 0.6740 19.200979
Timing
small data
%timeit (df.replace({'XNGS': {'': np.nan}}).dropna(subset=['XNGS']))
1000 loops, best of 3: 1.39 ms per loop
%timeit df[df.XNGS.values.astype(bool)]
1000 loops, best of 3: 192 µs per loop
large data
df = pd.concat([df] * 10000, ignore_index=True)
%timeit (df.replace({'XNGS': {'': np.nan}}).dropna(subset=['XNGS']))
100 loops, best of 3: 10.5 ms per loop
%timeit df[df.XNGS.values.astype(bool)]
100 loops, best of 3: 2.11 ms per loop
What about query?
Df.query('XNGS != ""', inplace=True)
or
Df = Df.query('XNGS != ""')
A long way of doing it is:
df["column name"].fillna(9999, inplace=True)
df = df[df["column name"]!= 9999]

Pandas DataFrame check colums value in other column

I have test_df with columns 'MonthAbbr' and 'PromoInterval'
Example output
1017174 Jun Mar,Jun,Sept,Dec
1017175 Mar Mar,Jun,Sept,Dec
1017176 Feb Mar,Jun,Sept,Dec
1017177 Feb Feb,May,Aug,Nov
1017178 Jan Feb,May,Aug,Nov
1017179 Jan Mar,Jun,Sept,Dec
1017180 Jan Mar,Jun,Sept,Dec
I want add column-indicator is month in promo interval, which will =1 if MonthAbbr in PromoInterval for current row, =0 otherwise
Is there more efficient way?
for ind in test_df.index:
test_df.set_value(ind ,'IsPromoInThisMonth',
test_df.MonthAbbr.astype(str)[ind] in (test_df.PromoInterval.astype(str)[ind])
This is a bit faster:
%%timeit
test_df['IsPromoInThisMonth'] = [x in y for x, y in zip(test_df['MonthAbbr'],
test_df['PromoInterval'])]
1000 loops, best of 3: 317 µs per loop
Than your approach:
%%timeit
for ind in test_df.index:
test_df.set_value(ind ,'IsPromoInThisMonth',
test_df.MonthAbbr.astype(str)[ind] in (test_df.PromoInterval.astype(str)[ind]))
1000 loops, best of 3: 1.44 ms per loop
UPDATE
Using a function with apply is slower than the list comprehension:
%%timeit
test_df['IsPromoInThisMonth'] = test_df.apply(lambda x: x[0] in x[1], axis=1)
1000 loops, best of 3: 804 µs per loop

Faster alternatives to numpy.argmax/argmin which is slow

I am using a lot of argmin and argmax in Python.
Unfortunately, the function is very slow.
I have done some searching around, and the best I can find is here:
http://lemire.me/blog/archives/2008/12/17/fast-argmax-in-python/
def fastest_argmax(array):
array = list( array )
return array.index(max(array))
Unfortunately, this solution is still only half as fast as np.max, and I think I should be able to find something as fast as np.max.
x = np.random.randn(10)
%timeit np.argmax( x )
10000 loops, best of 3: 21.8 us per loop
%timeit fastest_argmax( x )
10000 loops, best of 3: 20.8 us per loop
As a note, I am applying this to a Pandas DataFrame Groupby
E.G.
%timeit grp2[ 'ODDS' ].agg( [ fastest_argmax ] )
100 loops, best of 3: 8.8 ms per loop
%timeit grp2[ 'ODDS' ].agg( [ np.argmax ] )
100 loops, best of 3: 11.6 ms per loop
Where grp2[ 'ODDS' ].head() looks like this:
EVENT_ID SELECTION_ID
104601100 4367029 682508 3.05
682509 3.15
682510 3.25
682511 3.35
5319660 682512 2.04
682513 2.08
682514 2.10
682515 2.12
682516 2.14
5510310 682520 4.10
682521 4.40
682522 4.50
682523 4.80
682524 5.30
5559264 682526 5.00
682527 5.30
682528 5.40
682529 5.50
682530 5.60
5585869 682533 1.96
682534 1.97
682535 1.98
682536 2.02
682537 2.04
6064546 682540 3.00
682541 2.74
682542 2.76
682543 2.96
682544 3.05
104601200 4916112 682548 2.64
682549 2.68
682550 2.70
682551 2.72
682552 2.74
5315859 682557 2.90
682558 2.92
682559 3.05
682560 3.10
682561 3.15
5356995 682564 2.42
682565 2.44
682566 2.48
682567 2.50
682568 2.52
5465225 682573 1.85
682574 1.89
682575 1.91
682576 1.93
682577 1.94
5773661 682588 5.00
682589 4.40
682590 4.90
682591 5.10
6013187 682592 5.00
682593 4.20
682594 4.30
682595 4.40
682596 4.60
104606300 2489827 683438 4.00
683439 3.90
683440 3.95
683441 4.30
683442 4.40
3602724 683446 2.16
683447 2.32
Name: ODDS, Length: 65, dtype: float64
It turns out that np.argmax is blazingly fast, but only with the native numpy arrays. With foreign data, almost all the time is spent on conversion:
In [194]: print platform.architecture()
('64bit', 'WindowsPE')
In [5]: x = np.random.rand(10000)
In [57]: l=list(x)
In [123]: timeit numpy.argmax(x)
100000 loops, best of 3: 6.55 us per loop
In [122]: timeit numpy.argmax(l)
1000 loops, best of 3: 729 us per loop
In [134]: timeit numpy.array(l)
1000 loops, best of 3: 716 us per loop
I called your function "inefficient" because it first converts everything to list, then iterates through it 2 times (effectively, 3 iterations + list construction).
I was going to suggest something like this that only iterates once:
def imax(seq):
it=iter(seq)
im=0
try: m=it.next()
except StopIteration: raise ValueError("the sequence is empty")
for i,e in enumerate(it,start=1):
if e>m:
m=e
im=i
return im
But, your version turns out to be faster because it iterates many times but does it in C, rather that Python, code. C is just that much faster - even considering the fact a great deal of time is spent on conversion, too:
In [158]: timeit imax(x)
1000 loops, best of 3: 883 us per loop
In [159]: timeit fastest_argmax(x)
1000 loops, best of 3: 575 us per loop
In [174]: timeit list(x)
1000 loops, best of 3: 316 us per loop
In [175]: timeit max(l)
1000 loops, best of 3: 256 us per loop
In [181]: timeit l.index(0.99991619010758348) #the greatest number in my case, at index 92
100000 loops, best of 3: 2.69 us per loop
So, the key knowledge to speeding this up further is to know which format the data in your sequence natively is (e.g. whether you can omit the conversion step or use/write another functionality native to that format).
Btw, you're likely to get some speedup by using aggregate(max_fn) instead of agg([max_fn]).
For those that came for a short numpy-free snippet that returns the index of the first minimum value:
def argmin(a):
return min(range(len(a)), key=lambda x: a[x])
a = [6, 5, 4, 1, 1, 3, 2]
argmin(a) # returns 3
Can you post some code? Here is the result on my pc:
x = np.random.rand(10000)
%timeit np.max(x)
%timeit np.argmax(x)
output:
100000 loops, best of 3: 7.43 µs per loop
100000 loops, best of 3: 11.5 µs per loop

Why is numpy ma.average 24 times slower than arr.mean?

I've found something interesting in Python's numpy. ma.average is a lot slower than arr.mean (arr is an array)
>>> arr = np.full((3, 3), -9999, dtype=float)
array([[-9999., -9999., -9999.],
[-9999., -9999., -9999.],
[-9999., -9999., -9999.]])
%timeit np.ma.average(arr, axis=0)
The slowest run took 49.32 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 191 µs per loop
%timeit arr.mean(axis=0)
The slowest run took 6.63 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.41 µs per loop
with random numbers
arr = np.random.random((3,3))
%timeit arr.mean(axis=0)
The slowest run took 6.17 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.78 µs per loop
%timeit np.ma.average(arr, axis=0)
1000 loops, best of 3: 186 µs per loop
--> That's nearly 24 times slower.
Documentation
numpy.ma.average(a, axis=None, weights=None, returned=False)
Return the weighted average of array over the given axis.
numpy.mean(a, axis=None, dtype=None, out=None, keepdims)
Compute the arithmetic mean along the specified axis.
Why is ma.average so much slower than arr.mean? Mathematically they are the same (correct me if I'm wrong). My guess is that it has something to do with the weighted options on ma.average but shouldn't be there a fallback if no weights are passed?
A good way to find out why something is slower is to profile it. I'll use the 3rd party library line_profiler and the IPython command %lprun (see for example this blog) here:
%load_ext line_profiler
import numpy as np
arr = np.full((3, 3), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 1810 1810.0 30.5 a = asarray(a)
571 1 15 15.0 0.3 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 5 5.0 0.1 if weights is None:
576 1 3500 3500.0 59.0 avg = a.mean(axis)
577 1 591 591.0 10.0 scl = avg.dtype.type(a.count(axis))
578 else:
...
608
609 1 7 7.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 5 5.0 0.1 return avg
I removed some irrelevant lines.
So actually 30% of the time is spent in np.ma.asarray (something that arr.mean doesn't have to do!).
However the relative times change drastically if you use a bigger array:
arr = np.full((1000, 1000), -9999, dtype=float)
%lprun -f np.ma.average np.ma.average(arr, axis=0)
Line # Hits Time Per Hit % Time Line Contents
==============================================================
519 def average(a, axis=None, weights=None, returned=False):
...
570 1 609 609.0 7.6 a = asarray(a)
571 1 14 14.0 0.2 m = getmask(a)
572
573 # inspired by 'average' in numpy/lib/function_base.py
574
575 1 7 7.0 0.1 if weights is None:
576 1 6924 6924.0 86.9 avg = a.mean(axis)
577 1 404 404.0 5.1 scl = avg.dtype.type(a.count(axis))
578 else:
...
609 1 6 6.0 0.1 if returned:
610 if scl.shape != avg.shape:
611 scl = np.broadcast_to(scl, avg.shape).copy()
612 return avg, scl
613 else:
614 1 6 6.0 0.1 return avg
This time the np.ma.MaskedArray.mean function almost takes up 90% of the time.
Note: You could also dig deeper and look into np.ma.asarray or np.ma.MaskedArray.count or np.ma.MaskedArray.mean and check their line profilings. But I just wanted to show that there are lots of called function that add to the overhead.
So the next question is: did the relative times between np.ndarray.mean and np.ma.average also change? And at least on my computer the difference is much lower now:
%timeit np.ma.average(arr, axis=0)
# 2.96 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr.mean(axis=0)
# 1.84 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This time it's not even 2 times slower. I assume for even bigger arrays the difference will get even smaller.
This is also something that is actually quite common with NumPy:
The constant factors are quite high even for plain numpy functions (see for example my answer to the question "Performance in different vectorization method in numpy"). For np.ma these constant factors are even bigger, especially if you don't use a np.ma.MaskedArray as input. But even though the constant factors might be high, these functions excel with big arrays.
Thanks to #WillemVanOnsem and #sascha in the comments above
edit: applies to small arrays, see accepted answer for more information
Masked operations are slow try, to avoid it:
mask = self.local_pos_history[:, 0] > -9
local_pos_hist_masked = self.local_pos_history[mask]
avg = local_pos_hist_masked.mean(axis=0)
old with masked
mask = np.ma.masked_where(self.local_pos_history > -9, self.local_pos_history)
local_pos_hist_mask = self.local_pos_history[mask].reshape(len(self.local_pos_history) // 3, 3)
avg_pos = self.local_pos_history
np.average is nearly equal to arr.mean:
%timeit np.average(arr, axis=0)
The slowest run took 5.81 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 9.89 µs per loop
%timeit np.mean(arr, axis=0)
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.74 µs per loop
just for clarification still a tests on small batch

Categories

Resources