I tried to do this with the pandas.Series.apply function, but it seems to be slow on large amounts of data. Is there any quicker way to replace values?
Here is what I've tried, but it's slow on a big Series (with a million items, for example):
s = pd.Series([1, 2, 3, 'str1', 'str2', 3])
s.apply(lambda x: x if type(x) == str else np.nan)
Use to_numeric with errors='coerce':
pd.to_numeric(s, errors='coerce')
If you also need integers, cast to the nullable Int64 dtype:
pd.to_numeric(s, errors='coerce').astype('Int64')
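For the example series above, the coercion keeps the numeric entries and turns the strings into NaN; a quick check:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 'str1', 'str2', 3])
print(pd.to_numeric(s, errors='coerce'))
# 0    1.0
# 1    2.0
# 2    3.0
# 3    NaN
# 4    NaN
# 5    3.0
# dtype: float64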
EDIT: You can use isinstance with map, and also Series.where:
# test on 600k rows
N = 100000
s = pd.Series([1, 2, 3, 'str1', 'str2', 3] * N)
In [152]: %timeit s.apply(lambda x: x if type(x) == str else np.nan)
196 ms ± 2.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [153]: %timeit s.map(lambda x: x if isinstance(x, str) else np.nan)
174 ms ± 3.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [154]: %timeit s.where(s.map(lambda x: isinstance(x, str)))
168 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [155]: %timeit s.where(pd.to_numeric(s, errors='coerce').isna())
366 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have a pandas dataframe with two columns, where I need to check whether the value at each row of column A is a string that starts with the value of the corresponding row of column B, or vice versa.
It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as either of the two columns.
I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])
while my working solution is the following:
pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
I suspect there may be a better way to tackle this issue, as it would also benefit all string methods on Series.
Is there any more beautiful or efficient method to do this?
One idea is to use np.vectorize, but because it still operates on Python strings, performance is only a bit better than your solution:
def fun(a, b):
    return a.startswith(b) or b.startswith(a)
f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0 True
1 False
dtype: bool
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)
In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# @sammywemmy's solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
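For readability rather than speed, the same pairwise check can also be written with Series.combine, which is still a Python-level loop under the hood; a minimal sketch:

# Pairwise check via Series.combine; applies the lambda element by element
res = df['a'].combine(df['b'], lambda a, b: a.startswith(b) or b.startswith(a))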
df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
'num_wings': [2, 0, 0, 0],
'num_specimen_seen': [10, 2, 1, 8]},
index=['falcon', 'dog', 'spider', 'fish'])
where num_legs, num_wings and num_specimen_seen are columns.
Now, I have a tuple like ('num_wings', 'num_legs') and want to check whether all of its values are columns of df; if yes, return True, else False.
('num_wings', 'num_legs') -> this should return True
('abc', 'num_legs') -> False
You can use get_indexer here.
idxr = df.columns.get_indexer(tup)
all(idxr>-1)
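With the example frame above, get_indexer maps each label to its column position and returns -1 for anything missing, which is exactly what the all(idxr > -1) test exploits:

df.columns.get_indexer(('num_wings', 'num_legs'))  # array([1, 0])  -> all > -1 -> True
df.columns.get_indexer(('abc', 'num_legs'))        # array([-1, 0]) -> has a -1 -> False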
Performance
cols = pd.Index(np.arange(10_000))
tup = tuple(np.arange(10_001))
%timeit all(cols.get_indexer(tup)>-1)
3.86 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit all(e in cols for e in tup)
5.96 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You simply have to check if all elements of the tuple are contained in df.columns:
df = ...
def check(tup):
    return all((e in df.columns) for e in tup)
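With the example frame from the question, this behaves as requested:

check(('num_wings', 'num_legs'))  # True
check(('abc', 'num_legs'))        # False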
Performance comparison
@user3483203 proposed an alternative, quite succinct solution using get_indexer, so I performed a timeit comparison of both our solutions.
import random
import string
import pandas as pd
def rnd_str(l):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(l))
unique_strings = set(rnd_str(3) for _ in range(20000))
cols = pd.Index(unique_strings)
tup = tuple(rnd_str(3) for _ in range(5000))
%timeit all(cols.get_indexer(tup)>-1)
# 714 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all(e in cols for e in tup)
# 639 ns ± 0.988 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
###
tup = tuple(rnd_str(3) for _ in range(10000))
%timeit all(cols.get_indexer(tup)>-1)
# 1.29 ms ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all(e in cols for e in tup)
# 1.23 µs ± 20.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
It turns out the solution proposed in this post is significantly faster. The key advantage of this approach is that the all() function exits early, as soon as it encounters an element of the tuple that is not in df.columns.
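As a side note, the same containment test can be written as a set operation; unlike the generator it cannot exit early, but it reads nicely (a sketch, assuming hashable labels):

# True only if every tuple element is a column label
set(tup).issubset(df.columns)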
Why can't you iterate over each value in the tuple and check individually whether it is present in the dataframe's columns?
>>> def check_presence(tup):
...     for x in tup:
...         if x not in df.columns:
...             return False
...     return True
check_presence(('num_wings', 'num_legs')) # returns True
check_presence(('abc', 'num_legs')) # returns False
I need to add a new column to a Pandas dataframe.
If the column "Inducing" contains text (not null and not ""), I need to add a 1, otherwise a 0.
I tried with
df['newColumn'] = np.where(df['INDUCING']!="", 1, 0)
This command works only for values that are strings initialized as "", but does not work if the value is null.
Any idea on how to add this column correctly?
By De Morgan's laws, NOT(cond1 OR cond2) is equivalent to NOT(cond1) AND NOT(cond2).
You can combine conditions via the bitwise "and" (&) / "or" (|) operators as appropriate. This gives a Boolean series, which you can then cast to int:
df['newColumn'] = (df['INDUCING'].ne('') & df['INDUCING'].notnull()).astype(int)
The easiest way would be to .fillna('') first. The corrected version of your command:
df['newColumn'] = np.where(df['INDUCING'].fillna('') != "", 1, 0)
or chain .astype(int) onto the mask directly. This converts True to 1 and False to 0:
df['newcol'] = (df['INDUCING'].fillna('') != '').astype(int)
As the built-in bool returns True for a string exactly when it is non-empty, you can achieve this simply with:
df['newColumn'] = df['INDUCING'].astype(bool).astype(int)
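A quick check on sample data; one caveat (my observation, not from the answer): this relies on the nulls being None, because bool(np.nan) is True, so an np.nan null would map to 1. In that case, .fillna('') first:

s = pd.Series(['test', None, '', 'more test'])
print(s.astype(bool).astype(int))
# 0    1
# 1    0
# 2    0
# 3    1
# dtype: int64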
Some performance comparisons:
In [61]: df = pd.DataFrame({'INDUCING': ['test', None, '', 'more test']*10000})
In [63]: %timeit np.where(df['INDUCING'].fillna('') != "", 1, 0)
5.68 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [62]: %timeit (df['INDUCING'].ne('') & df['INDUCING'].notnull()).astype(int)
5.1 ms ± 223 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [64]: %timeit np.where(df['INDUCING'], 1, 0)
667 µs ± 25.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [65]: %timeit df['INDUCING'].astype(bool).astype(int)
655 µs ± 5.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [99]: %timeit df['INDUCING'].values.astype(bool).astype(int)
553 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I know this is a very basic question but for some reason I can't find an answer. How can I get the index of certain element of a Series in python pandas? (first occurrence would suffice)
I.e., I'd like something like:
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    for i in s.index:
        if s[i] == el:
            return i
    return None

print(find(myseries, 7))
but I assume there should be a better way. Is there?
>>> myseries[myseries == 7]
3 7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Though I admit there should be a better way to do that, this at least avoids iterating through the object in Python and moves the work to the C level.
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: Index(myseries).get_loc(7)
Out[3]: 3
In [4]: Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It will return a boolean array if the matches are non-contiguous:
In [6]: Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hashtable internally, so it's fast:
In [7]: s = Series(randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, there is a one-time overhead to creating an index (it's incurred when you actually do something with the index, e.g. check is_unique):
In [2]: s = Series(randint(0,10,10000))
In [3]: %timeit Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index can contain any values, and where the value you search for sits towards the end of the series.
Here are the speed tests on a 2012 Mac Mini in Python 3.9.10 with Pandas version 1.4.0.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
   ...:         9500, 6700, 4750, 3350, 2360, 1700, 1180, 850, 600, 425,
   ...:         300, 212, 150, 106, 75, 53, 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: assert(myseries[21] == 150)
In [6]: %timeit myseries[myseries == 150].index[0]
179 µs ± 891 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: %timeit myseries[myseries == 150].first_valid_index()
205 µs ± 3.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries.where(myseries == 150).first_valid_index()
597 µs ± 4.03 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
110 µs ± 872 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [10]: %timeit pd.Series(myseries.index, index=myseries)[150]
125 µs ± 2.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
49.5 µs ± 814 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
7.75 µs ± 36.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [13]: %timeit myseries.index[myseries.tolist().index(150)]
2.55 µs ± 27.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [14]: %timeit dict(zip(myseries.values, myseries.index))[150]
9.89 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [15]: %timeit {v: k for k, v in myseries.items()}[150]
9.99 µs ± 67 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest, although it doesn't handle duplicates.
Correction: sorry, I missed one; @Alex Spangher's solution using the list index method is by far the fastest.
Update: added @EliadL's answer.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
2022-02-18 Update
Updated all the timings with the latest Pandas version and Python 3.9. Even on an older computer, all the timings have significantly reduced (10 to 70%) compared to the previous tests (version 0.25.3).
Plus: Added two more methods utilizing dictionaries.
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with
(myseries==7).any()
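A guarded one-liner combining the two (a minimal sketch; the None fallback for a missing value is my choice):

mask = myseries == 7
pos = mask.argmax() if mask.any() else None  # positional index of the first match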
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
Another way to do this, although equally unsatisfying, is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns:
3
Timing tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
If you use numpy, you can get an array of the indices where your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
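To turn that into the first matching index label, index the series' own index with the first position:

myseries.index[np.where(myseries == 7)[0][0]]  # 3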
You can use Series.idxmax(), though note that this returns the index of the maximum value, so it only finds 7 here because 7 happens to be the largest element:
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
This is the most native and scalable approach I could find:
>>> myindex = pd.Series(myseries.index, index=myseries)
>>> myindex[7]
3
>>> myindex[[7, 5, 7]]
7 3
5 4
7 3
dtype: int64
Another way to do it that hasn't been mentioned yet is the tolist method:
myseries.tolist().index(7)
This should return the correct position of the first occurrence, assuming the value exists in the Series.
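Since list.index raises ValueError when the value is missing, a guarded version might look like this (a sketch; the None fallback is my choice):

try:
    pos = myseries.tolist().index(7)  # position of the first 7
except ValueError:
    pos = None  # value not present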
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
Pandas has a built-in class Index with a function called get_loc. This function will return either:
an index (the element's position)
a slice (if the specified value is in a contiguous run)
an array (a boolean array, if the value is at multiple non-contiguous positions)
Example:
import pandas as pd
>>> mySer = pd.Series([1, 3, 8, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns index
3 # Index of 10 in series
>>> mySer = pd.Series([1, 3, 8, 10, 10, 10, 13])
>>> pd.Index(mySer).get_loc(10) # Returns slice
slice(3, 6, None) # 10 occurs at index 3 (included) to 6 (not included)
# If the occurrences are not contiguous, it returns a boolean array.
>>> mySer = pd.Series([1, 10, 3, 8, 10, 10, 10, 13, 10])
>>> pd.Index(mySer).get_loc(10)
array([False, True, False, False, True, True, True, False, True])
There are many other options too, but I found this one the simplest.
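If you always want a single first position regardless of which of the three forms comes back, a small wrapper (a hypothetical helper, not part of pandas) can normalize them:

import numpy as np
import pandas as pd

def first_position(ser, value):
    # get_loc returns an int, a slice, or a boolean array
    loc = pd.Index(ser).get_loc(value)
    if isinstance(loc, slice):
        return loc.start
    if isinstance(loc, np.ndarray):
        return int(np.flatnonzero(loc)[0])
    return loc

first_position(pd.Series([1, 10, 3, 8, 10]), 10)  # 1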
The df.index attribute will help you find the exact row number:
my_fl2 = (df['ConvertedCompYearly'] == 45241312)
print(df[my_fl2].index)
Int64Index([66910], dtype='int64')