I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in Python pandas? (The first occurrence would suffice.)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None
print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Admittedly, there should be a better way to do this, but it at least avoids iterating over the object in Python and moves the work to the C level.
After converting to an Index, you can use get_loc:
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It returns a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it is fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time overhead to creating an index (it's incurred when you actually DO something with the index, e.g. calling is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700, 9500,
   ...:         6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425, 300, 212, 150,
   ...:         106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: Sorry, I missed one. @Alex Spangher's solution using the list index method is by far the fastest.
Update: Added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
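The guard and the lookup above can be combined into one helper. This is only a minimal sketch, not a pandas API: the name first_label is hypothetical, and returning None for a missing value is an assumption about what the caller wants.

```python
import pandas as pd

def first_label(series, value):
    """Return the index label of the first element equal to value, or None."""
    mask = (series == value).to_numpy()
    if not mask.any():
        # argmax on an all-False mask would return 0, which is misleading
        return None
    # argmax gives the position of the first True; map it back to a label
    return series.index[mask.argmax()]

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_label(myseries, 7))   # 3
print(first_label(myseries, 10))  # None
```

Because the result is looked up in series.index, the helper should also work for a non-default index such as string labels.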
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
On time tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use numpy, you can get an array of the indices where your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
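Note that np.where reports integer positions, which only coincide with index labels when the Series has the default 0..n-1 index. A small sketch of mapping the positions back to labels follows; the string index is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a non-default (string) index
labelled = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

positions = np.where(labelled.to_numpy() == 7)[0]  # integer positions
labels = labelled.index[positions]                 # corresponding index labels

print(positions.tolist())  # [1, 3]
print(labels.tolist())     # ['b', 'd']
```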
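You can use Series.idxmax(), but note that it returns the index of the maximum value, so it only works here because 7 happens to be the largest element: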
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
should return the correct index, assuming the value exists in the Series.
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
pandas has a built-in Index class with a method called get_loc. This method returns one of:
an integer (the position of the element)
a slice (if the value occurs at consecutive positions)
a boolean array (if the value occurs at multiple non-consecutive positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the data is not in sequence then it would return an array of bool's.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False, True, False, False, True, True, True, False, True])
There are many other options too but I found it very simple for me.
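Since get_loc can return three different types, calling code usually has to normalize the result. The following is a rough sketch of one way to do that; the helper name positions_of is hypothetical, and returning an empty list for a missing value is a design assumption rather than pandas behaviour (get_loc itself raises KeyError).

```python
import numpy as np
import pandas as pd

def positions_of(series, value):
    """Return all integer positions of value in series as a list."""
    try:
        loc = pd.Index(series).get_loc(value)
    except KeyError:
        return []                       # value not present
    if isinstance(loc, slice):          # consecutive duplicates
        return list(range(*loc.indices(len(series))))
    if isinstance(loc, np.ndarray):     # boolean mask for scattered duplicates
        return np.flatnonzero(loc).tolist()
    return [int(loc)]                   # single match

print(positions_of(pd.Series([1, 3, 8, 10, 13]), 10))                   # [3]
print(positions_of(pd.Series([1, 3, 8, 10, 10, 10, 13]), 10))           # [3, 4, 5]
print(positions_of(pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10]), 10))   # [1, 4, 5, 6, 8]
```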
The df.index attribute will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
Name: ConvertedCompYearly, dtype: float64
Int64Index([66910], dtype='int64')
Related
I have an array like this:
array = np.random.randint(1, 100, 10000).astype(object)
array[[1, 2, 6, 83, 102, 545]] = np.nan
array[[3, 8, 70]] = None
Now, I want to find the indices of the NaN items and ignore the None ones. In this example, I want to get the [1, 2, 6, 83, 102, 545] indices. I can get the NaN indices with np.equal and np.isnan:
np.isnan(array.astype(float)) & (~np.equal(array, None))
I checked the performance of this solution with %timeit and got the following result:
243 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there a faster solution?
array != array
The classic NaN test. Writing NaN tests like this is one of the reasons that motivated the NaN != NaN design decision, since the IEEE 754 designers couldn't assume programmers would have access to an isnan routine.
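To get the indices rather than a mask, the comparison can be passed to np.flatnonzero. The sketch below uses a small made-up object array in place of the 10000-element one from the question.

```python
import numpy as np

# Small illustrative object array mixing ints, NaN and None
arr = np.array([5, np.nan, None, 7, np.nan], dtype=object)

nan_mask = arr != arr                     # True only where an element is unequal to itself
nan_positions = np.flatnonzero(nan_mask)  # positions of the NaN entries

print(nan_positions.tolist())  # [1, 4]
```

None compares equal to itself, so it is excluded without any extra np.equal(array, None) step.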
This significantly outperforms the code in the question when I try it:
In [1]: import numpy as np
In [2]: array = np.random.randint(1, 100, 10000).astype(object)
...: array[[1, 2, 6, 83, 102, 545]] = np.nan
...: array[[3, 8, 70]] = None
In [3]: %timeit array != array
139 µs ± 46.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit np.isnan(array.astype(float)) & (~np.equal(array, None))
755 µs ± 123 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
And of course, it does give the same output:
In [5]: result1 = array != array
In [6]: result2 = np.isnan(array.astype(float)) & (~np.equal(array, None))
In [7]: np.array_equal(result1, result2)
Out[7]: True
I am trying to determine whether there is an entry in a Pandas column that has a particular value. I tried to do this with if x in df['id']. I thought this was working, except that when I fed it a value that I knew was not in the column (43 in df['id']), it still returned True. When I subset to a data frame containing only entries matching the missing id (df[df['id'] == 43]), there are, obviously, no entries in it. How do I determine whether a column in a Pandas data frame contains a particular value, and why doesn't my current method work? (FYI, I have the same problem when I use the implementation in this answer to a similar question.)
The in operator on a Series checks whether the value is in the index:
In [11]: s = pd.Series(list('abc'))
In [12]: s
Out[12]:
0 a
1 b
2 c
dtype: object
In [13]: 1 in s
Out[13]: True
In [14]: 'a' in s
Out[14]: False
One option is to see if it's in unique values:
In [21]: s.unique()
Out[21]: array(['a', 'b', 'c'], dtype=object)
In [22]: 'a' in s.unique()
Out[22]: True
or a python set:
In [23]: set(s)
Out[23]: {'a', 'b', 'c'}
In [24]: 'a' in set(s)
Out[24]: True
As pointed out by @DSM, it may be more efficient (especially if you're just doing this for one value) to just use in directly on the values:
In [31]: s.values
Out[31]: array(['a', 'b', 'c'], dtype=object)
In [32]: 'a' in s.values
Out[32]: True
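This also explains the 43 case from the question: 43 was presumably an index label. A minimal sketch with made-up ids (the data is an assumption, not the asker's frame):

```python
import pandas as pd

# Hypothetical column: 43 is an index label but not a value
ids = pd.Series([101, 102, 103], index=[41, 42, 43])

print(43 in ids)          # True: 43 is found in the index
print(43 in ids.values)   # False: 43 is not among the values
print(101 in ids)         # False: 101 is a value, not a label
print(101 in ids.values)  # True
```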
You can also use pandas.Series.isin although it's a little bit longer than 'a' in s.values:
In [2]: s = pd.Series(list('abc'))
In [3]: s
Out[3]:
0 a
1 b
2 c
dtype: object
In [3]: s.isin(['a'])
Out[3]:
0 True
1 False
2 False
dtype: bool
In [4]: s[s.isin(['a'])].empty
Out[4]: False
In [5]: s[s.isin(['z'])].empty
Out[5]: True
But this approach can be more flexible if you need to match multiple values at once for a DataFrame (see DataFrame.isin)
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 4, 7]})
>>> df.isin({'A': [1, 3], 'B': [4, 7, 12]})
A B
0 True False # Note that B didn't match 1 here.
1 False True
2 True True
found = df[df['Column'].str.contains('Text_to_search')]
print(found.count())
found.count() will contain the number of matches.
If it is 0, the string was not found in the column.
You can try this to check for a particular value x in a particular column named 'id':
if x in df['id'].values:
I did a few simple tests:
In [10]: x = pd.Series(range(1000000))
In [13]: timeit 999999 in x.values
567 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: timeit (x == 999999).any()
6.86 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: timeit x.eq(999999).any()
7.03 ms ± 33.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: timeit x.eq(9).any()
7.04 ms ± 60 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [15]: timeit x.isin([999999]).any()
9.54 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: timeit 999999 in set(x)
79.8 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly, it doesn't matter whether you look up 9 or 999999; it takes about the same amount of time using the in syntax (it must be using some vectorized computation):
In [24]: timeit 9 in x.values
666 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [25]: timeit 9999 in x.values
647 µs ± 5.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [26]: timeit 999999 in x.values
642 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: timeit 99199 in x.values
644 µs ± 5.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: timeit 1 in x.values
667 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Seems like using x.values is the fastest, but maybe there is a more elegant way in pandas?
Or use Series.tolist or Series.any:
>>> s = pd.Series(list('abc'))
>>> s
0 a
1 b
2 c
dtype: object
>>> 'a' in s.tolist()
True
>>> (s=='a').any()
True
Series.tolist makes a list out of the Series; in the other approach I build a boolean Series from the regular Series and then check whether it contains any True values.
Simple condition:
if any(str(elem) in ['a','b'] for elem in df['column'].tolist()):
Use
df[df['id']==x].index.tolist()
If x is present in id then it'll return the list of indices where it is present, else it gives an empty list.
I had a CSV file to read:
df = pd.read_csv('50_states.csv')
And after trying:
if value in df.column:
    print(True)
which never printed true, even though the value was in the column;
I tried:
for values in df.column:
    if value == values:
        print(True)
        #Or do something
    else:
        print(False)
Which worked. I hope this can help!
Use query() to find the rows where the condition holds and get the number of rows with shape[0]. If there exists at least one entry, this statement is True:
df.query('id == 123').shape[0] > 0
Suppose your dataframe has a "filename" column.
Now you want to check whether the filename "80900026941984" is present in the dataframe.
You can simply write:
if sum(df["filename"].astype("str").str.contains("80900026941984")) > 0:
    print("found")
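Keep in mind that str.contains performs a substring match, so it also reports longer names that merely contain the target. The sketch below uses invented data (not the poster's frame) to contrast it with an exact comparison; regex=False is passed so the pattern is treated as literal text.

```python
import pandas as pd

# Made-up filenames: the third one only contains the target as a substring
df = pd.DataFrame({"filename": [80900026941984, 12345, 98080900026941984]})
names = df["filename"].astype(str)

substring_hits = names.str.contains("80900026941984", regex=False)
exact_hits = names.eq("80900026941984")

print(int(substring_hits.sum()))  # 2
print(int(exact_hits.sum()))      # 1
print(bool(exact_hits.any()))     # True
```

Using .any() on the boolean result reads a little more directly than comparing a sum with 0.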
I have a Pandas dataframe with an arbitrary number of columns. I'd like to apply a function to every column. From the discussion on this SO post, it's better to use np.vectorize than a pandas apply function.
However, how would I use np.vectorize to perform operations over every column?
The best idea I can come up with is np.vectorize with a for loop, but that takes 2x the time on my machine on a dummy dataframe. Is apply with raw=True optimal, and is numba then the only way to go faster?
import time
import numpy as np
import pandas as pd
test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
def myfunc(a, b):
    return a+b
start_time = time.time()
test_df.apply(lambda x: x + 3, raw = True)
print("--- %s seconds ---" % (time.time() - start_time))
start_time = time.time()
for i in range(test_df.shape[1]):
    np.vectorize(myfunc)(test_df.iloc[:,i], 3)
print("--- %s seconds ---" % (time.time() - start_time))
The correct way to use np.vectorize is to not use it, unless you are dealing with a function that only accepts scalar values and you don't care about speed. Whenever I've tested it, explicit Python iteration has been faster.
At least that's the case when working with numpy arrays. With DataFrames, things become more complicated, since extracting Series and recreating frames can skew the timings substantially.
But let's look at your example in some detail.
Your sample frame:
In [177]: test_df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5), 'c': np.arange(5)})
...:
In [178]: test_df
Out[178]:
a b c
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
In [179]: def myfunc(a, b):
   ...:     return a+b
...:
Your apply:
In [180]: test_df.apply(lambda x: x+3, raw=True)
Out[180]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [181]: timeit test_df.apply(lambda x: x+3, raw=True)
186 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# 1.23 ms ± 13.9 µs per loop without raw=True
I get the same result by simply using the frame's own addition operator, and it is faster. Granted, that won't work for a more general function. Your use of apply with the default axis and raw=True implies you have a function that only works with one column at a time.
In [182]: test_df+3
Out[182]:
a b c
0 3 3 3
1 4 4 4
2 5 5 5
3 6 6 6
4 7 7 7
In [183]: timeit test_df+3
114 µs ± 3.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
With raw=True you are passing numpy arrays to the lambda. The array for the whole frame is:
In [184]: test_df.to_numpy()
Out[184]:
array([[0, 0, 0],
[1, 1, 1],
[2, 2, 2],
[3, 3, 3],
[4, 4, 4]])
In [185]: test_df.to_numpy()+3
Out[185]:
array([[3, 3, 3],
[4, 4, 4],
[5, 5, 5],
[6, 6, 6],
[7, 7, 7]])
In [186]: timeit test_df.to_numpy()+3
13.1 µs ± 119 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
That's much faster. But to return a frame takes time.
In [188]: timeit pd.DataFrame(test_df.to_numpy()+3, columns=test_df.columns)
91.1 µs ± 769 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [189]:
Testing vectorize.
In [218]: f=np.vectorize(myfunc)
f applies myfunc to each element of the input array iteratively. It has a clear performance disclaimer.
Even for this small array it is slow compared to direct application of the function to the array:
In [219]: timeit f(test_df.to_numpy(),3)
42.3 µs ± 123 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Passing the frame itself
In [221]: timeit f(test_df,3)
69.8 µs ± 1.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [223]: timeit pd.DataFrame(f(test_df,3), columns=test_df.columns)
154 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
And iteratively applying to columns - much slower:
In [226]: [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
Out[226]: [array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7]), array([3, 4, 5, 6, 7])]
In [227]: timeit [f(test_df.iloc[:,i], 3) for i in range(test_df.shape[1])]
477 µs ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
but a lot of that extra time comes from "extracting" columns:
In [228]: timeit [f(test_df.to_numpy()[:,i], 3) for i in range(test_df.shape[1])]
127 µs ± 357 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Quoting the np.vectorize docs:
Notes
-----
The `vectorize` function is provided primarily for convenience, not for
performance. The implementation is essentially a for loop.
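To illustrate the one case where np.vectorize is a reasonable convenience, here is a small sketch with a function that genuinely accepts only scalars. The function add_if_large is a made-up example, and np.where is just one possible array-level rewrite of it.

```python
import numpy as np

def add_if_large(a, b):
    # The conditional expression needs a single truth value, so this
    # hypothetical function works on scalars only, not on whole arrays.
    return a + b if a > 1 else 0

arr = np.arange(5)

# np.vectorize applies the scalar function element by element
looped = np.vectorize(add_if_large)(arr, 3)

# The same logic written as a single array-level expression
whole_array = np.where(arr > 1, arr + 3, 0)

print(looped.tolist())       # [0, 0, 5, 6, 7]
print(whole_array.tolist())  # [0, 0, 5, 6, 7]
```

When such an array-level rewrite exists, it avoids the per-element Python calls that the docs note describes.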
For example,
I have a numpy array containing:
[1, 2, 3, 4, 5, 6]
I want to create an array as follows:
[3, 7, 11]
That is, I want to add the two neighboring elements into a new one.
I have tried the obvious:
for i in range(0, predictions.shape[0]+1, 2):
    new_pred = np.append(new_pred, (predictions[i] + predictions[i+1]) / 2)
print(predictions.shape)
(16000, 0)
print(new_pred.shape)
(87998, 0)
But the dimension of new_pred is not half of 16000.
So I am wondering: is there anything wrong with my code? And is there a convenient way to implement this?
There are many different possibilities; here is one of them, neither the slowest nor the fastest:
>>> import numpy as np
>>> a = np.arange(30)
>>> a.reshape(-1, 2).sum(axis=1)
array([ 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 57])
>>>
For the record (please note that we have a new fastest answer that, imho, can't be bettered at all)
In [17]: a = np.arange(10**5)
In [18]: %timeit a.reshape(-1,2).sum(axis=1)
1.08 ms ± 1.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [19]: %timeit [(a[i]+ a[i+1]) for i in range(0, len(a-1), 2)]
23.4 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit [sum(item) for ind, item in enumerate(zip(a, a[1:])) if ind%2 == 0]
49.9 ms ± 313 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [21]: %timeit [sum(item) for item in zip(a[::2], a[1::2])]
30.2 ms ± 91 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
...
In [23]: %timeit a[::2]+a[1::2]
78.9 µs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use slices of ndarray:
predictions[::2] + predictions[1::2]
It is about 10 times faster than the "reshape" solution:
>>> from timeit import timeit
>>> a = np.arange(10**5)
>>> timeit(lambda: a.reshape(-1,2).sum(axis=-1), number=1000)
0.785971520585008
>>> timeit(lambda: a[::2]+a[1::2], number=1000)
0.07569492445327342
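The slicing approach can be wrapped in a small function. This is only a sketch: the name pair_sums is hypothetical, and raising an error on an odd-length input is an assumption about the desired behaviour.

```python
import numpy as np

def pair_sums(values):
    """Sum neighbouring pairs, e.g. [1, 2, 3, 4, 5, 6] -> [3, 7, 11]."""
    values = np.asarray(values)
    if values.shape[0] % 2:
        # With an odd length the two slices would have different sizes
        raise ValueError("expected an even number of elements")
    return values[::2] + values[1::2]

result = pair_sums([1, 2, 3, 4, 5, 6])
print(result.tolist())  # [3, 7, 11]
```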
Another Pythonic possibility would be to use a list comprehension,
something like this for the example you posted:
import numpy as np
a = np.arange(1, 7)
res = [(a[i] + a[i+1]) for i in range(0, len(a) - 1, 2)]
print(res)
Hope it helps.
Using zip
zip_ls = zip(ls[::2], ls[1::2])
new_ls = [sum(item) for item in zip_ls]
I have a pandas series with boolean entries. I would like to get a list of indices where the values are True.
For example the input pd.Series([True, False, True, True, False, False, False, True])
should yield the output [0,2,3,7].
I can do it with a list comprehension, but is there something cleaner or faster?
Using Boolean Indexing
>>> s = pd.Series([True, False, True, True, False, False, False, True])
>>> s[s].index
Int64Index([0, 2, 3, 7], dtype='int64')
If you need a np.array object, get the .values
>>> s[s].index.values
array([0, 2, 3, 7])
Using np.nonzero
>>> np.nonzero(s)
(array([0, 2, 3, 7]),)
Using np.flatnonzero
>>> np.flatnonzero(s)
array([0, 2, 3, 7])
Using np.where
>>> np.where(s)[0]
array([0, 2, 3, 7])
Using np.argwhere
>>> np.argwhere(s).ravel()
array([0, 2, 3, 7])
Using pd.Series.index
>>> s.index[s]
Int64Index([0, 2, 3, 7], dtype='int64')
Using python's built-in filter
>>> [*filter(s.get, s.index)]
[0, 2, 3, 7]
Using list comprehension
>>> [i for i in s.index if s[i]]
[0, 2, 3, 7]
As an addition to rafaelc's answer, here are the corresponding timings (from quickest to slowest) for the following setup:
import numpy as np
import pandas as pd
s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])
Using np.where
>>> timeit np.where(s)[0]
12.7 µs ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using np.flatnonzero
>>> timeit np.flatnonzero(s)
18 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Using pd.Series.index
The time difference compared with boolean indexing really surprised me, since boolean indexing is more commonly used.
>>> timeit s.index[s]
82.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using Boolean Indexing
>>> timeit s[s].index
1.75 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you need a np.array object, get the .values
>>> timeit s[s].index.values
1.76 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you need a slightly easier to read version <-- not in original answer
>>> timeit s[s==True].index
1.89 ms ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using pd.Series.where <-- not in original answer
>>> timeit s.where(s).dropna().index
2.22 ms ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.where(s == True).dropna().index
2.37 ms ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using pd.Series.mask <-- not in original answer
>>> timeit s.mask(s).dropna().index
2.29 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> timeit s.mask(s == True).dropna().index
2.44 ms ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using list comprehension
>>> timeit [i for i in s.index if s[i]]
13.7 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using python's built-in filter
>>> timeit [*filter(s.get, s.index)]
14.2 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using np.nonzero <-- did not work out of the box for me
>>> timeit np.nonzero(s)
ValueError: Length of passed values is 1, index implies 1000.
Using np.argwhere <-- did not work out of the box for me
>>> timeit np.argwhere(s).ravel()
ValueError: Length of passed values is 1, index implies 1000.
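Both errors above come from passing the Series itself to NumPy. A likely workaround is to convert to a plain ndarray first; the sketch below assumes this is acceptable for your use case, and it is not part of the timings listed above.

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])
mask = s.to_numpy()  # plain boolean ndarray, no Series wrapping

nonzero_positions = np.nonzero(mask)[0]
argwhere_positions = np.argwhere(mask).ravel()

print(nonzero_positions.tolist())   # [0, 2, 3, 7]
print(argwhere_positions.tolist())  # [0, 2, 3, 7]
```

These are integer positions; for a Series with a non-default index, s.index[nonzero_positions] would map them back to labels.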
This also works:
s.where(lambda x: x).dropna().index
It has the advantage of being easy to chain: if your series is computed on the fly, you don't need to assign it to a variable.
Note that if s is computed from r as s = cond(r),
then you can also use r.where(lambda x: cond(x)).dropna().index.
You can use pipe or loc to chain the operation; this is helpful when s is an intermediate result and you don't want to name it.
s = pd.Series([True, False, True, True, False, False, False, True], index=list('ABCDEFGH'))
out = s.pipe(lambda s_: s_[s_].index)
# or
out = s.pipe(lambda s_: s_[s_]).index
# or
out = s.loc[lambda s_: s_].index
print(out)
Index(['A', 'C', 'D', 'H'], dtype='object')